Discovery production sends your client identifiers to opposing counsel, BPO reviewers, and case law databases. Black boxes break AI tools and expose text layers. Synthetic replacement swaps every identifier with consistent synthetic data. Same layout. Same Bates stamps. Fully searchable.
These are not hypothetical failure modes. They are documented incidents and structural limitations that affect every discovery production, court filing, and DSAR response.
A Federal Judicial Center study of public PACER filings found 22,391 Social Security numbers sitting unredacted in federal court documents. In each case, a black box covered the on-screen rendering. But the underlying text data stream survived in the PDF. Any researcher, journalist, or bad actor with a copy button could extract the SSNs in seconds.
“The court was not aware the data was accessible until outside researchers reported it.”
Federal courts now issue explicit PDF redaction guidance because this failure is systemic, not exceptional. Courts have issued sanctions, including attorney fee awards and case dismissal, for inadvertent disclosure caused by exactly this pattern.
In January 2019, lawyers for Paul Manafort filed a response to Special Counsel Mueller with black overlays over sensitive passages. Within hours, reporters, including those from BuzzFeed News, discovered they could copy and paste the text directly from the PDF. A related failure appeared in the Sony disclosures during the FTC v. Microsoft proceedings, where confidential game development budgets were redacted with a physical pen but remained legible when the document was scanned.
State bar ethics rules treat inadvertent disclosure of protected information as a professional conduct violation, not a procedural error. The consequences can extend to disbarment proceedings.
PDFs have two distinct layers: the visual image layer (what you see on screen) and the text data stream (what a program reads). Drawing tools in Adobe Acrobat and many redaction workflows paint a black rectangle over the image layer without touching the text. The text data stream, the actual PII, often remains fully intact. Copy-paste, screen readers, and AI systems all extract from the text layer, not the image layer.
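The two-layer problem can be seen in a few lines of Python. This is a minimal sketch, not a PDF parser: `CONTENT_STREAM` is a hand-written, uncompressed page content stream invented for illustration, and the helper names are hypothetical. Real PDFs usually compress their streams and encode text in more varied ways.

```python
import re

# A simplified, uncompressed PDF page content stream (illustrative only).
# BT ... ET shows text with the Tj operator; "re" defines a rectangle and
# "f" fills it with the current color (0 0 0 rg = black).
CONTENT_STREAM = b"""
BT /F1 12 Tf 72 700 Td (SSN: 423-88-1924) Tj ET
0 0 0 rg
70 695 160 18 re
f
"""

def extract_text(stream: bytes) -> list[str]:
    """Return every literal string shown with the Tj operator."""
    return [m.decode("latin-1") for m in re.findall(rb"\((.*?)\)\s*Tj", stream)]

def has_filled_rectangle(stream: bytes) -> bool:
    """True if the stream paints at least one filled rectangle."""
    return re.search(rb"\bre\s+f\b", stream) is not None

# The black box is present, yet the text operator underneath is untouched.
print(has_filled_rectangle(CONTENT_STREAM))
print(extract_text(CONTENT_STREAM))
```

The rectangle changes what is painted on screen. It does not change what a text extractor finds in the stream.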
Case law databases built on black-box redactions cannot be full-text searched. RAG systems fed [REDACTED] tokens return degraded answers. LLM perplexity increases 144% when processing redacted text versus a clean baseline. BPO teams processing blacked-out files cannot extract the data they need. DSARs with redacted sections cannot be used in court. Technically compliant. Operationally broken.
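For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to each token, so lower is better. The sketch below uses made-up log-probabilities purely to show what a 144% increase means arithmetically (the masked text scores about 2.44 times the baseline). It does not reproduce the cited measurement.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def percent_increase(baseline, observed):
    """Relative increase of observed over baseline, in percent."""
    return (observed - baseline) / baseline * 100.0

# Hypothetical numbers: a model that gives every token probability 0.10
# has a perplexity of about 10 on that text.
clean_ppl = perplexity([math.log(0.10)] * 5)

# A 144% increase means the masked-text perplexity is 2.44x the baseline.
masked_ppl = clean_ppl * 2.44
print(round(percent_increase(clean_ppl, masked_ppl)))  # 144
```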
When a 40,000-document production contains the same plaintiff SSN across hundreds of separate documents, each one requires an individual human decision. In complex matters, fully-loaded manual review costs, including attorney time, platform fees, project management, and quality control, can reach $15-$25 per document. At that scale, error rates climb. The same SSN gets caught in most documents and missed in a handful. Courts treat inadvertent disclosure as a breach regardless of volume. One missed identifier in a 40,000-document production carries the same sanctions exposure as wholesale non-compliance. Re-Doc processes the entire batch with a single entity map. Every instance of every identifier, consistently handled across every duplicate, in one pass.
One API call. De-identified, layout-perfect output every time. Plugs into your existing eDiscovery workflow without changing how your team operates.
Deposition transcripts, scanned exhibits, native PDFs. Any format from any eDiscovery platform.
Visual processing reads every pixel in scanned exhibits. A context-aware model detects all PII. Synthetic replacement or true redaction applied to both image and text layers.
Same pagination, same Bates stamps, same exhibit numbering. Formatting preserved. The document looks exactly as it did, minus the real identities.
Opposing counsel, regulators, case law databases, RAG systems, BPO teams. Production ZIP includes .DAT and .OPT load files for direct Relativity import. Designed for FRCP 34 production workflows.
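As a rough illustration of what .DAT and .OPT load files contain, the sketch below builds both for a tiny production. It assumes the commonly used Concordance defaults (ASCII 20 as field separator, þ as text qualifier) and the seven-column Opticon layout. The field names, Bates prefix, volume label, and image paths are placeholders, not the exact output Re-Doc emits.

```python
FIELD_SEP = "\x14"  # ASCII 20, the usual Concordance field separator
QUOTE = "\xfe"      # þ (ASCII 254), the usual Concordance text qualifier

def dat_row(values):
    """One .DAT line: every value wrapped in þ and joined by ASCII 20."""
    return FIELD_SEP.join(f"{QUOTE}{v}{QUOTE}" for v in values)

def bates(prefix, number, width=7):
    """Zero-padded Bates number, e.g. ABC0000001."""
    return f"{prefix}{number:0{width}d}"

def build_load_files(page_counts, prefix="ABC", volume="VOL001"):
    """page_counts: pages per document, in production order (hypothetical input)."""
    dat_lines = [dat_row(["BEGDOC", "ENDDOC", "PAGECOUNT"])]
    opt_lines = []
    next_page = 1
    for page_count in page_counts:
        beg = bates(prefix, next_page)
        end = bates(prefix, next_page + page_count - 1)
        dat_lines.append(dat_row([beg, end, page_count]))
        for offset in range(page_count):
            image_id = bates(prefix, next_page + offset)
            first = offset == 0
            # Opticon columns: image ID, volume, path, document break,
            # folder break, box break, page count (first page only).
            opt_lines.append(",".join([
                image_id, volume, f"IMAGES\\{image_id}.tif",
                "Y" if first else "", "", "",
                str(page_count) if first else "",
            ]))
        next_page += page_count
    return "\n".join(dat_lines), "\n".join(opt_lines)

dat_text, opt_text = build_load_files([2, 1])
print(opt_text)
```

Because Bates numbers run continuously across the production, the image IDs in the .OPT file double as the page-level Bates stamps.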
Legal document repositories mix scanned paper records from the 1990s with native digital contracts from last week. Re-Doc has a purpose-built pipeline for each.
For scanned court exhibits and image-based PDFs. Visual processing reads the image layer and extracts all text, including handwriting and fax artifacts. A context-aware model identifies PII in context. Black boxes are drawn precisely at the pixel level on the image, and the matching text is removed from the text data stream. The copy-paste attack does not work on Re-Doc redactions.
For native digital documents. All PII is replaced with consistent synthetic data. John Michael Smith becomes Robert James Wilson on every page of every document in the production set. The same mapping applied across the entire batch. Case facts, dates, contract terms, and legal arguments are fully preserved. Opposing counsel receives a complete, coherent record.
The legal content is preserved. The parties are gone. The document is usable, searchable, and safe to produce.
Original
Plaintiff: John Michael Smith
SSN: 423-88-1924
DOB: March 4, 1981
Address: 1247 Oak Ave, Dallas TX 75201
Counsel: Davis & Whitmore LLP
Plaintiff John Michael Smith alleges that on March 4, 2023, defendant caused injury at the above address. Counsel Davis & Whitmore LLP filed on behalf of the plaintiff in Dallas County District Court.
After synthetic replacement
Plaintiff: Robert James Wilson
SSN: 571-34-6289
DOB: July 19, 1979
Address: 834 Pine St, Austin TX 78701
Counsel: Harmon & Burke LLP
Plaintiff Robert James Wilson alleges that on March 4, 2023, defendant caused injury at the above address. Counsel Harmon & Burke LLP filed on behalf of the plaintiff in Dallas County District Court.
All PII fields replaced with consistent synthetic data. Legal arguments, dates, venue, and cause of action preserved. Bates stamps and exhibit numbers unchanged.
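The consistency requirement, one real identifier mapping to one synthetic value across the whole production set, can be sketched with a shared entity map. This is an illustrative toy, not Re-Doc's implementation: the `EntityMap` class, the fixed pools of synthetic values, and the list of known names are all assumptions, and real PII detection is far more involved than a name list plus an SSN regex.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

class EntityMap:
    """Assigns each real identifier one synthetic stand-in and reuses it."""

    def __init__(self, pools):
        # pools: {"NAME": [...], "SSN": [...]} of pre-generated synthetic values
        self._pools = {kind: iter(values) for kind, values in pools.items()}
        self._assigned = {}

    def synthetic_for(self, kind, real):
        key = (kind, real)
        if key not in self._assigned:
            self._assigned[key] = next(self._pools[kind])
        return self._assigned[key]

def deidentify_batch(documents, known_names, entity_map):
    """Apply the same map to every document so replacements stay consistent."""
    output = []
    for text in documents:
        for name in known_names:
            if name in text:
                text = text.replace(name, entity_map.synthetic_for("NAME", name))
        text = SSN_PATTERN.sub(
            lambda m: entity_map.synthetic_for("SSN", m.group()), text
        )
        output.append(text)
    return output

emap = EntityMap({"NAME": ["Robert James Wilson"], "SSN": ["571-34-6289"]})
batch = [
    "Plaintiff: John Michael Smith\nSSN: 423-88-1924",
    "Deposition of John Michael Smith (SSN 423-88-1924).",
]
for doc in deidentify_batch(batch, ["John Michael Smith"], emap):
    print(doc)
```

Keeping the map outside any single document is the design choice that matters: the second document reuses the values assigned while processing the first.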
These are the actual high-volume workflows where legal operations teams spend the most on manual de-identification, with fully-loaded costs in complex productions reaching $15-$25 per document when attorney time, platform fees, and quality control are included.
A typical commercial litigation production involves 30,000-100,000 documents. Each document must be reviewed for PII, redacted or de-identified, and produced with Bates stamps intact. In complex high-volume matters, fully-loaded manual review costs, including attorney time, hosting, QC, and project management, can reach $15-$25 per document. At 40,000 documents, that fully-loaded cost can run $600,000-$1,000,000 for a single production. Re-Doc processes the same production via API batch at a fraction of the cost, preserves all Bates stamps and exhibit markers, and produces output that satisfies FRCP 34 obligations. What takes a team of contract attorneys six weeks takes Re-Doc overnight.
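The cost range quoted above is simple multiplication of document count by the $15-$25 per-document figure. The snippet below reproduces it; the function name is just for illustration.

```python
def review_cost_range(documents, low_per_doc=15, high_per_doc=25):
    """Fully-loaded manual review cost range in dollars."""
    return documents * low_per_doc, documents * high_per_doc

print(review_cost_range(40_000))   # (600000, 1000000)
print(review_cost_range(100_000))  # (1500000, 2500000)
```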
Legal research databases need to be searchable. A case law repository built on black-box redacted court filings cannot surface “all cases involving a defendant from Dallas County.” The data is literally hidden behind paint. RAG-based legal research tools fed redacted documents return degraded outputs, with LLM perplexity 144% higher on masked text than on a clean baseline, because the model is pattern-matching against [REDACTED] tokens rather than coherent legal narrative. Text anonymization preserves the full factual and legal narrative. The parties are synthetic. The precedent is intact. Every document remains fully indexed and searchable.
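The searchability point can be illustrated with a toy inverted index. The two sample sentences are invented, and the sketch assumes the county is left in place while the party name is replaced. Which fields are kept is a policy decision for each production, not something this snippet settles.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))

black_box = {"case-1": "Defendant [REDACTED], a resident of [REDACTED], breached the lease."}
synthetic = {"case-1": "Defendant Robert James Wilson, a resident of Dallas County, breached the lease."}

print(search(build_index(black_box), "Dallas County"))
print(search(build_index(synthetic), "Dallas County"))
```

The redacted version indexes only the placeholder token, so the query has nothing to match. The synthetic version still returns the case.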
Data Subject Access Requests under GDPR and CCPA require third-party PII to be redacted before the responsive records are provided to the requestor. M&A second requests from the DOJ or FTC can demand 50,000-100,000 pages of company documents within 30 days, a non-negotiable federal deadline. There are no extensions for being manually overwhelmed. In both cases, every document must have third-party personal information removed before production while the substantive business content is preserved. Manual review at those volumes breaks down at exactly the moment it cannot afford to. Re-Doc processes the batch via API overnight, de-identifies third-party PII, and returns documents with the business information fully intact and readable.
Most tools in eDiscovery apply cosmetic black boxes that leave text data intact. Re-Doc was built specifically for discovery production workflows where documents must actually be safe to share.
Typical tools in the market vs. Re-Doc (purpose-built for legal)
Typical tools: Cosmetic black boxes. Underlying text copy-pasteable via PDF extract.
Re-Doc: True text removal. PII permanently gone from the file, not just visually covered.
Typical tools: Breaks document readability. LLM perplexity jumps 144% on redacted text.
Re-Doc: Text anonymization preserves narrative structure. Documents stay searchable and usable.
Typical tools: Manual, per-document process. Unusable for 40,000-document productions.
Re-Doc: Batch API processes entire discovery productions in a single overnight run.
Typical tools: Cannot handle scanned exhibits and fax-originated deposition transcripts.
Re-Doc: Visual processing pipeline handles scanned exhibits, deposition transcripts, and legacy formats.
Typical tools: No API access. Every document must be processed manually one at a time.
Re-Doc: REST API and batch upload plug into any existing eDiscovery workflow.
Typical tools: No load file output. Processed documents cannot be imported into Relativity without manual re-ingestion work.
Re-Doc: Production ZIP includes .DAT and .OPT load files. Drop the output directly into Relativity, Reveal, or any Concordance-compatible review platform.
144% higher LLM perplexity on masked text vs. clean baseline
Building an AI legal research tool or RAG system over case law? Black-box redaction degrades training and inference quality. Redaction tokens are noise. Synthetic replacement is signal. Source: arXiv 2411.05978 + Firstsource.
References & Sources