Legal / eDiscovery

Produce 40,000 Documents Without Exposing a Single SSN.

Discovery production sends your clients' identifiers to opposing counsel, BPO reviewers, and case law databases. Black boxes break AI tools and can leave the text layer exposed. Synthetic replacement swaps every identifier with consistent synthetic data. Same layout. Same Bates stamps. Fully searchable.

Legal Filing
Case No. 2024-CV-5821
PLAINTIFF: John Michael Smith → Robert James Wilson
SSN: 423-88-1924 → 571-34-6289
DATE OF BIRTH: March 4, 1981 → July 19, 1979
ADDRESS: 1247 Oak Ave, Dallas TX → 834 Pine St, Austin TX
COUNSEL: Davis & Whitmore LLP → Harmon & Burke LLP
Bates: RD000001-RD000003
.DAT + .OPT ready
$0B
Global eDiscovery market in 2024
22,391
SSNs found unredacted in PACER court filings
0%
Law firms not yet using AI document tools
144%
Higher LLM perplexity on masked vs. clean text
The Problem

Five ways black-box redaction fails in legal practice

These are not hypothetical failure modes. They are documented incidents and structural limitations that affect every discovery production, court filing, and DSAR response.

Documented Incident

PACER: 22,391 SSNs in Public Filings

A Federal Judicial Center study of public PACER filings found 22,391 Social Security numbers sitting unredacted in federal court documents. In each case, a black box covered the on-screen rendering. But the underlying text data stream survived in the PDF. Any researcher, journalist, or bad actor with a copy button could extract the SSNs in seconds.

“The court was not aware the data was accessible until outside researchers reported it.”

Federal courts now issue explicit PDF redaction guidance because this failure is systemic, not exceptional. Courts have issued sanctions, including attorney fee awards and case dismissal, for inadvertent disclosure caused by exactly this pattern.

Career Risk

Manafort Filing: Journalists Copy-Pasted Through the Redactions

In January 2019, lawyers for Paul Manafort filed a response to Special Counsel Mueller with black overlays over sensitive passages. Within hours, reporters including those from BuzzFeed News discovered they could copy-paste the text directly from the PDF. A related failure surfaced in Sony's disclosures during the FTC v. Microsoft proceedings, where confidential game development budgets were redacted with a physical pen but remained legible once the document was scanned.

Courts have issued sanctions including attorney fee awards and case dismissal for these failures. State bar ethics rules treat inadvertent disclosure of protected information as a professional conduct violation, not a procedural error. The consequences can extend to disbarment proceedings.

Bar ethics violation
Text layer survived
Front-page breach
Technical Root Cause

The Image Layer vs. The Text Layer

PDFs have two distinct layers: the visual image layer (what you see on screen) and the text data stream (what a program reads). Drawing a black rectangle in Adobe Acrobat or a similar editor, rather than applying a true redaction, covers only the image layer. The text data stream, the actual PII, often remains fully intact. Copy-paste, screen readers, and AI systems all extract from the text layer, not the image layer.

Visual Layer
Black box painted over
Text Data Stream
PII still accessible
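The difference between the two layers can be shown with a toy example. The sketch below is not a PDF parser: it assumes a single uncompressed page content stream with one literal string, whereas real PDFs usually compress streams and may use custom font encodings. The names extract_text, cosmetic_redact, and true_redact are hypothetical helpers written for this illustration.

```python
import re

# Simplified, uncompressed page content stream (illustrative only).
PAGE = b"BT /F1 12 Tf 72 700 Td (SSN: 423-88-1924) Tj ET\n"
# PDF operators for a filled black rectangle: set fill colour, define rect, fill.
BLACK_BOX = b"0 0 0 rg 70 695 140 16 re f\n"
SSN = re.compile(rb"\d{3}-\d{2}-\d{4}")


def extract_text(stream: bytes) -> list[str]:
    """Return every literal string shown with the Tj operator."""
    return [s.decode("latin-1") for s in re.findall(rb"\((.*?)\)\s*Tj", stream)]


def cosmetic_redact(stream: bytes) -> bytes:
    """Paint a box on top; the text operator underneath is untouched."""
    return stream + BLACK_BOX


def true_redact(stream: bytes) -> bytes:
    """Remove the identifier from the text stream, then paint the box."""
    return SSN.sub(b"", stream) + BLACK_BOX


cosmetic_text = extract_text(cosmetic_redact(PAGE))
true_text = extract_text(true_redact(PAGE))
print(cosmetic_text)  # the SSN is still extractable
print(true_text)      # only the label remains
```

Both outputs would render with a black box in a viewer. Only the second one survives a copy-paste or a text extraction pass.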
Operational Failure

Redacted Documents Are Useless Downstream

Case law databases built on black-box redactions cannot be full-text searched. RAG systems that retrieve [REDACTED] tokens return degraded answers. LLM perplexity increases 144% on redacted text versus a clean baseline. BPO teams processing blacked-out files cannot extract the data they need. DSARs with redacted sections cannot be used in court. Technically compliant. Operationally broken.

144% higher LLM perplexity: black boxes vs. clean baseline (arXiv + Firstsource)
Scale Problem

Redaction Fatigue: Inconsistent Coverage Across Duplicate Documents

When a 40,000-document production contains the same plaintiff SSN across hundreds of separate documents, each one requires an individual human decision. In complex matters, fully-loaded manual review costs, including attorney time, platform fees, project management, and quality control, can reach $15-$25 per document. At that scale, error rates climb. The same SSN gets caught in most documents and missed in a handful. Courts treat inadvertent disclosure as a breach regardless of volume. One missed identifier in a 40,000-document production carries the same sanctions exposure as wholesale non-compliance. Re-Doc processes the entire batch with a single entity map. Every instance of every identifier, consistently handled across every duplicate, in one pass.

$15-$25
Fully-loaded cost per doc in complex productions
1 missed
Same sanctions exposure as full non-compliance
Batch
Re-Doc applies one entity map across all duplicates
How It Works

Re-Doc sits between your source documents and every downstream recipient

One API call. De-identified, layout-perfect output every time. Plugs into your existing eDiscovery workflow without changing how your team operates.

01

Source Documents

Deposition transcripts, scanned exhibits, native PDFs. Any format from any eDiscovery platform.

02

Re-Doc Processes

Visual processing reads every pixel in scanned exhibits. A context-aware model detects all PII. Synthetic replacement or true redaction applied to both image and text layers.

03

Layout-Perfect Output

Same pagination, same Bates stamps, same exhibit numbering. Formatting preserved. The document looks exactly as it did, minus the real identities.

04

Safe to Produce and Share

Opposing counsel, regulators, case law databases, RAG systems, BPO teams. Production ZIP includes .DAT and .OPT load files for direct Relativity import. Designed for FRCP 34 production workflows.
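For readers unfamiliar with load files, here is a rough sketch of what .OPT and .DAT rows look like for the three-page Bates range RD000001-RD000003 shown earlier. It assumes the widely used Opticon layout (image key, volume, path, document break, folder break, box break, page count) and the default Concordance delimiters (ASCII 20 and ASCII 254). The field names, volume label, and TIFF paths are illustrative assumptions, not a description of what Re-Doc actually emits.

```python
FIELD_SEP = "\x14"  # ASCII 20, common Concordance column delimiter
QUOTE = "\xfe"      # ASCII 254 (þ), common Concordance text qualifier


def bates_range(prefix: str, start: int, pages: int) -> list[str]:
    """Zero-padded Bates numbers, one per page."""
    return [f"{prefix}{n:06d}" for n in range(start, start + pages)]


def opt_lines(bates: list[str], volume: str = "VOL001", image_dir: str = "IMAGES") -> list[str]:
    """One Opticon row per page; only the first page marks the document break."""
    lines = []
    for i, number in enumerate(bates):
        first = i == 0
        fields = [
            number,
            volume,
            f"{image_dir}\\{number}.tif",
            "Y" if first else "",            # document break
            "",                              # folder break
            "",                              # box break
            str(len(bates)) if first else "",  # page count
        ]
        lines.append(",".join(fields))
    return lines


def dat_line(values: list[str]) -> str:
    """One Concordance-style row with every value wrapped in the text qualifier."""
    return FIELD_SEP.join(f"{QUOTE}{v}{QUOTE}" for v in values)


pages = bates_range("RD", 1, 3)
opt = opt_lines(pages)
header = dat_line(["BEGBATES", "ENDBATES", "PAGECOUNT"])  # hypothetical field names
record = dat_line([pages[0], pages[-1], str(len(pages))])
print("\n".join(opt))
```

The review platform reads the .DAT file for document-level metadata and the .OPT file to locate each page image, which is why both need to agree on the Bates numbers.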

Two Pipelines

Every document format in litigation. Covered.

Legal document repositories mix scanned paper records from the 1990s with native digital contracts from last week. Re-Doc has a purpose-built pipeline for each.

Redaction Pipeline

True redaction: removes PII from both image and text layers

For scanned court exhibits and image-based PDFs. Visual processing reads the image layer and extracts all text, including handwriting and fax artifacts. A context-aware model identifies PII in context. Black boxes are burned into the image at the pixel level, and the matching content is removed from the text data stream. The copy-paste attack does not work on Re-Doc redactions.

FOIA responses: scanned government records
Court filings with scanned exhibits and affidavits
Scanned medical records in personal injury litigation
Old paper documents digitized for discovery production
Processing: Enterprise-grade visual processing pipeline with automatic fallback. Handles scanned exhibits of any quality or DPI.
Recommended for discovery production

Text Anonymization Pipeline

Synthetic data swap. Document stays searchable and usable.

For native digital documents. All PII is replaced with consistent synthetic data. John Michael Smith becomes Robert James Wilson on every page of every document in the production set. The same mapping is applied across the entire batch. Case facts, dates, contract terms, and legal arguments are fully preserved. Opposing counsel receives a complete, coherent record.

Discovery production: FRCP 34 compliant, 40,000 documents
Case law databases: full-text search preserved
RAG / AI legal research systems: no degraded LLM output
BPO and claims processing teams: no blacked-out data
DSARs and M&A second requests: 100,000 pages in 30 days
Consistency guarantee: John Smith maps to Robert Wilson on page 1, 47, and 301, and across all documents in a batch production.
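One way to picture a batch-wide consistency guarantee is a single mapping object shared by every document. The sketch below is an assumption-laden illustration, not Re-Doc's implementation: it handles only SSNs, detects them with a regex instead of a context-aware model, and derives the replacement from a keyed hash. EntityMap and the secret value are hypothetical names. Two different real SSNs could in principle hash to the same synthetic value, so a production system would need a collision check.

```python
import hashlib
import hmac
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class EntityMap:
    """One real-to-synthetic mapping shared by every document in a batch."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._cache: dict[str, str] = {}

    def synthetic_ssn(self, real: str) -> str:
        """Return the same synthetic SSN every time the same real SSN is seen."""
        if real not in self._cache:
            digest = hmac.new(self._secret, real.encode(), hashlib.sha256).digest()
            n = int.from_bytes(digest[:8], "big")
            area = 900 + n % 100          # 900-999 is not issued as an SSN area number
            group = 1 + (n // 100) % 99
            serial = 1 + (n // 9900) % 9999
            self._cache[real] = f"{area:03d}-{group:02d}-{serial:04d}"
        return self._cache[real]

    def apply(self, text: str) -> str:
        """Replace every SSN in the text using the shared mapping."""
        return SSN_RE.sub(lambda m: self.synthetic_ssn(m.group()), text)


entity_map = EntityMap(secret=b"per-matter-secret")  # hypothetical key
batch = {
    "RD000001": "Plaintiff SSN: 423-88-1924.",
    "RD000047": "Re: claimant 423-88-1924, see Exhibit B.",
}
produced = {doc_id: entity_map.apply(text) for doc_id, text in batch.items()}
for doc_id, text in produced.items():
    print(doc_id, text)
```

Because the mapping lives outside any single document, the same identifier resolves to the same replacement on page 1, page 47, and in every other file of the batch.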
Before / After

What a de-identified legal document actually looks like

The legal content is preserved. The parties are gone. The document is usable, searchable, and safe to produce.

Before: Contains real PII. Not safe to produce.

Plaintiff: John Michael Smith

SSN: 423-88-1924

DOB: March 4, 1981

Address: 1247 Oak Ave, Dallas TX 75201

Counsel: Davis & Whitmore LLP


Plaintiff John Michael Smith alleges that on March 4, 2023, defendant caused injury at the above address. Counsel Davis & Whitmore LLP filed on behalf of the plaintiff in Dallas County District Court.

After: Synthetic data. Safe to produce. Fully searchable.

Plaintiff: Robert James Wilson

SSN: 571-34-6289

DOB: July 19, 1979

Address: 834 Pine St, Austin TX 78701

Counsel: Harmon & Burke LLP


Plaintiff Robert James Wilson alleges that on March 4, 2023, defendant caused injury at the above address. Counsel Harmon & Burke LLP filed on behalf of the plaintiff in Dallas County District Court.

All PII fields replaced with consistent synthetic data. Legal arguments, dates, venue, and cause of action preserved. Bates stamps and exhibit numbers unchanged.

Use Cases

Three workflows where Re-Doc replaces manual redaction

These are the actual high-volume workflows where legal operations teams spend the most on manual de-identification, with fully-loaded costs in complex productions reaching $15-$25 per document when attorney time, platform fees, and quality control are included.

eDiscovery Production

FRCP 34
Bates Stamps Preserved
API Batch
Relativity-Ready

A typical commercial litigation production involves 30,000-100,000 documents. Each document must be reviewed for PII, redacted or de-identified, and produced with Bates stamps intact. In complex high-volume matters, fully-loaded manual review costs, including attorney time, hosting, QC, and project management, can reach $15-$25 per document. At 40,000 documents, that fully-loaded cost can run $600,000-$1,000,000 for a single production. Re-Doc processes the same production via API batch at a fraction of the cost, preserves all Bates stamps and exhibit markers, and produces output that satisfies FRCP 34 obligations. What takes a team of contract attorneys six weeks takes Re-Doc overnight.

$600K-$1M
Fully-loaded cost in complex high-volume matters, 40,000 docs
6 weeks → overnight
Re-Doc API batch. Same output. No manual review bottleneck.
FRCP 34
Bates stamps, exhibit markers, and layout fully preserved
Available via REST API for batch processing workflows
Bates stamps and exhibit numbering untouched
Consistent synthetic identities across all 40,000 documents
Production ZIP includes .DAT + .OPT load files for direct Relativity import
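The $600,000-$1,000,000 range quoted for this use case is simply the per-document cost range multiplied by the document count. A quick check, using only the figures stated on this page:

```python
DOCUMENTS = 40_000
COST_PER_DOC_USD = (15, 25)  # fully-loaded range quoted for complex productions

low, high = (DOCUMENTS * rate for rate in COST_PER_DOC_USD)
print(f"${low:,} to ${high:,}")
```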

Case Law Database and Precedent Analysis

Full-text Search
RAG / AI Ready

Legal research databases need to be searchable. A case law repository built on black-box redacted court filings cannot surface “all cases involving a defendant from Dallas County”. The data is literally hidden behind paint. RAG-based legal research tools fed redacted documents return degraded outputs: LLM perplexity is 144% higher on masked text than on a clean baseline, because the model is pattern-matching against [REDACTED] tokens rather than coherent legal narrative. Text anonymization preserves the full factual and legal narrative. The parties are synthetic. The precedent is intact. Every document remains fully indexed and searchable.

Legal narrative and holdings fully preserved
Full-text search and vector embeddings work correctly
AI research tools get coherent input, not redaction holes

DSARs and M&A Second Requests

100,000 Pages
30-Day Deadline
GDPR / CCPA

Data Subject Access Requests under GDPR and CCPA require third-party PII to be redacted before the responsive records are provided to the requestor. M&A second requests from the DOJ or FTC can demand 50,000-100,000 pages of company documents within 30 days, a non-negotiable federal deadline. There are no extensions for being manually overwhelmed. In both cases, every document must have third-party personal information removed before production while the substantive business content is preserved. Manual review at those volumes breaks down at exactly the moment it cannot afford to. Re-Doc processes the batch via API overnight, de-identifies third-party PII, and returns documents with the business information fully intact and readable.

100K
Pages in a typical M&A second request
30 days
DOJ / FTC response deadline
API
Batch processing. No manual review at volume.
How Re-Doc Compares

Built for legal documents. Not an afterthought.

Most tools in eDiscovery apply cosmetic black boxes that leave text data intact. Re-Doc was built specifically for discovery production workflows where documents must actually be safe to share.

Typical tools in the market vs. Re-Doc (purpose-built for legal)

Typical tools: Cosmetic black boxes. Underlying text copy-pasteable via PDF extract.
Re-Doc: True text removal. PII permanently gone from the file, not just visually covered.

Typical tools: Breaks document readability. LLM perplexity jumps 144% on redacted text.
Re-Doc: Text anonymization preserves narrative structure. Documents stay searchable and usable.

Typical tools: Manual, per-document process. Unusable for 40,000-document productions.
Re-Doc: Batch API processes entire discovery productions in a single overnight run.

Typical tools: Cannot handle scanned exhibits and fax-originated deposition transcripts.
Re-Doc: Visual processing pipeline handles scanned exhibits, deposition transcripts, and legacy formats.

Typical tools: No API access. Every document must be processed manually one at a time.
Re-Doc: REST API and batch upload plug into any existing eDiscovery workflow.

Typical tools: No load file output. Processed documents cannot be imported into Relativity without manual re-ingestion work.
Re-Doc: Production ZIP includes .DAT and .OPT load files. Drop the output directly into Relativity, Reveal, or any Concordance-compatible review platform.

144%

higher LLM perplexity on masked text vs. clean baseline

Building an AI legal research tool or RAG system over case law? Black-box redaction degrades training and inference quality. Redaction tokens are noise. Synthetic replacement is signal. Source: arXiv 2411.05978 + Firstsource.

Baseline (clean text): 1.16
Masked / redacted text: 2.83
Differential privacy noise: 4.87
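The 144% headline can be reproduced from the figures above, assuming all three values are perplexity scores measured on the same scale. The differential privacy percentage below is derived the same way and is not quoted in the source.

```python
PERPLEXITY = {
    "baseline": 1.16,
    "masked": 2.83,
    "dp_noise": 4.87,
}


def pct_increase(value: float, baseline: float) -> float:
    """Relative increase over the baseline, in percent."""
    return (value - baseline) / baseline * 100


masked_pct = round(pct_increase(PERPLEXITY["masked"], PERPLEXITY["baseline"]))
dp_pct = round(pct_increase(PERPLEXITY["dp_noise"], PERPLEXITY["baseline"]))
print(f"masked: +{masked_pct}%  differential privacy: +{dp_pct}%")
```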