Improve extraction accuracy
When the Data Extractor returns wrong, missing, or inconsistent values, the cause is almost always one of six things. This article is the troubleshooting checklist; work through it top to bottom.
Quick diagnosis: what does "wrong" look like?
| Symptom | Most likely cause | Fix |
|---|---|---|
| Empty values for some columns | Column name doesn't match document language | Rename columns to match what's actually on the page |
| Wrong values mixed with right values | Multi-row turned on for a summary document | Turn Multi-row OFF |
| Returns "I don't see this" or refuses | Document is encrypted or AI safety filter triggered | OCR first, split the document, or use a different extractor config |
| Bad values from scanned PDFs only | No text layer | Add OCR PDF before extraction |
| Inconsistent across vendors | Single config trying to handle too many formats | Split into one extractor per vendor type, or add a Document Classifier |
| Numbers off by orders of magnitude | Currency or thousands-separator confusion | Add an instruction: "Numbers may use European format with comma as decimal separator" |
| Dates in the wrong format | Date locale ambiguous | Add an instruction: "Use ISO format YYYY-MM-DD for all dates" |
| Random failures with no clear pattern | Source documents vary too widely | Pre-classify with a Document Classifier and route to specialized extractors |
Most issues fix at the configuration level (column names + instructions). Reach for OCR or classifier nodes only after those don't work.
Fix 1: tighten column names
The single highest-leverage thing you can do.
Before: Number, Date, Amount, Name
After: Invoice Number, Invoice Date, Total Amount (incl. tax), Vendor Name
Specific names disambiguate: the model has to guess less, so it's right more often.
If your destination system has a strict schema, name columns to match it exactly. The Data Extractor will produce values that drop right into your sheet without renaming.
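If an extractor already emits vague names and you can't rename in place, a thin mapping layer can align its output to the destination schema. A minimal sketch, assuming rows come out as dicts; the column names are the ones from the example above, and the rename step itself is hypothetical post-processing, not a built-in setting:

```python
# Map vague extractor column names onto the destination schema.
# These names come from the Before/After example; adjust to your sheet.
RENAME = {
    "Number": "Invoice Number",
    "Date": "Invoice Date",
    "Amount": "Total Amount (incl. tax)",
    "Name": "Vendor Name",
}

def align_to_schema(row: dict) -> dict:
    """Rename known columns; pass unknown columns through unchanged."""
    return {RENAME.get(key, key): value for key, value in row.items()}
```

Naming columns correctly in the extractor config is still the better fix; this only bridges the gap when you can't.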
Fix 2: add instructions
Every Data Extractor configuration accepts an Instructions field. This is the most underused setting. Use it for:
- Number formats: "Total Amount is the grand total including tax — not the subtotal." "Numbers may use European format with comma as decimal separator and period as thousands."
- Date formats: "Use ISO format YYYY-MM-DD for all dates." "If the date is ambiguous (e.g., 03/04/2026), assume DD/MM/YYYY."
- Field disambiguation: "Vendor Name is the company sending the invoice — not the recipient."
- What to skip: "Skip subtotals, taxes, and discount lines — only extract individual line items."
- What to do when missing: "If the Invoice Number isn't visible, return an empty value (don't guess)."
Keep instructions specific to the document type. Generic instructions ("be accurate") don't help.
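Pulling the bullets above together, a filled-in Instructions field for a summary invoice might read like this (wording adapted from the examples above; adjust to your own documents):

```
Total Amount is the grand total including tax, not the subtotal.
Use ISO format YYYY-MM-DD for all dates. If a date is ambiguous
(e.g., 03/04/2026), assume DD/MM/YYYY.
Vendor Name is the company sending the invoice, not the recipient.
If the Invoice Number isn't visible, return an empty value; don't guess.
```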
Fix 3: turn Multi-row on or off appropriately
| Document type | Multi-row |
|---|---|
| Single-summary invoice (just totals) | OFF |
| Multi-line invoice (one row per line item) | ON |
| Receipt with itemized purchases | ON |
| Bank statement with many transactions | ON |
| Application form with a single set of fields | OFF |
| Contract with single party info | OFF |
Multi-row ON tells the model: "the document contains a table; produce one extracted row per row in that table." If the document doesn't actually have a table, the model will invent rows.
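To make the difference concrete, here is a sketch of the two output shapes. The field names and the list-of-dicts representation are illustrative assumptions, not the product's actual output format:

```python
# Hypothetical output shapes; the real extractor's format may differ.

# Multi-row OFF on a summary invoice: one extracted row for the whole document.
summary_rows = [
    {"Invoice Number": "INV-1042", "Total Amount": "118.00"},
]

# Multi-row ON on an itemized invoice: one extracted row per table row.
line_item_rows = [
    {"Item": "Widget", "Qty": 2, "Line Total": "40.00"},
    {"Item": "Gadget", "Qty": 1, "Line Total": "78.00"},
]
```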
Fix 4: OCR scanned PDFs first
A scanned PDF looks like text but is actually an image of text. The Data Extractor can work on scans (the AI vision models read images), but quality is much higher when there's a real text layer.
In a workflow: add an OCR PDF node between the trigger and the Data Extractor.
In the spreadsheet: use OCRMYPDF(file_url, "eng", output_ref) to add a text layer first.
Signs you have a scan: file size disproportionately large for page count, you can't select text in the PDF viewer, copying text gives garbage.
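The "can't select text" check can be automated when you're triaging many files. A minimal heuristic, assuming you've already counted extractable characters per page with any PDF text library; the 100-character threshold is an assumption to tune on your own files:

```python
def looks_like_scan(extracted_chars: int, page_count: int,
                    threshold: int = 100) -> bool:
    """Heuristic: a PDF whose text layer yields almost no characters
    per page is probably a scan that needs OCR first."""
    if page_count == 0:
        return False
    return extracted_chars / page_count < threshold
```

A ten-page PDF that yields only 40 characters of text is almost certainly an image of text; route it through OCR before extraction.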
Fix 5: limit page range
If the data you need is always on the first page (or pages 2–5), set the Page Range parameter to 1 (or 2-5). This:
- Speeds up extraction.
- Reduces page consumption against your plan.
- Improves accuracy by keeping the model focused on the relevant content.
Page Range syntax: a range like 1-3, a list like 2,5,7, or a single page like 1.
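For reference, the syntax above expands to page numbers like this. A sketch of the expansion logic, assuming the three forms described; the product's own parser may differ in edge cases:

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a Page Range string like '1-3', '2,5,7', or '1'
    into an explicit list of page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages
```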
Fix 6: split the job by document type
If you're trying to handle invoices, receipts, and contracts with one extractor, you'll get inconsistent results. Each document type has its own structure, terminology, and edge cases.
Pattern:
[Trigger]
│
▼
[Document Classifier] // returns "Invoice", "Receipt", or "Contract"
│
▼
[Switch on classification]
│
├─ Invoice ─→ [Data Extractor: invoice config]
├─ Receipt ─→ [Data Extractor: receipt config]
└─ Contract ─→ [Data Extractor: contract config]
Each extractor is now tightly tuned for one document type. Accuracy goes up across the board.
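The Switch step in the diagram is just a lookup from classifier label to extractor config. A sketch, where the config values are hypothetical stand-ins for three separately tuned extractors:

```python
# Hypothetical stand-ins for three separately tuned extractor configs.
EXTRACTOR_CONFIGS = {
    "Invoice": "invoice config",
    "Receipt": "receipt config",
    "Contract": "contract config",
}

def route(classification: str) -> str:
    """Pick the specialized extractor for a classifier label.
    Unknown labels fail loudly instead of using the wrong config."""
    try:
        return EXTRACTOR_CONFIGS[classification]
    except KeyError:
        raise ValueError(f"No extractor configured for {classification!r}")
```

Failing loudly on an unknown label matters: a misclassified document silently routed through the wrong config produces exactly the "wrong values mixed with right values" symptom from the diagnosis table.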
Worked example: dialing in invoice extraction
A team gets bad results extracting from invoices. Here's the iteration:
v1: columns Number, Date, Amount, Vendor. Multi-row ON. Result: 60% accuracy.
v2: rename to Invoice Number, Invoice Date, Total Amount, Vendor Name. Multi-row OFF (these are summary invoices). Result: 80% accuracy.
v3: add instructions: "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates. Vendor Name is the company sending the invoice." Result: 92% accuracy.
v4: for the remaining 8% (mostly scans), add an OCR PDF node before the extractor. Result: 98% accuracy.
v5: for the remaining 2% (a single foreign-language vendor), split off a second extractor with its own instructions. Result: 99%+.
This is the playbook: tighten names, then add instructions, then OCR, then split. Don't jump to the harder fixes until the easier ones are exhausted.
When extraction is fundamentally wrong, not just inaccurate
A few document patterns just don't work well with column-based extraction:
- Free-form text with no consistent structure (handwritten notes, emails). Use the GPT or CLAUDE formula instead with a custom prompt.
- Documents that are mostly numeric tables (financial statements, balance sheets). The Data Extractor handles these but may miss footnotes; verify a sample by hand.
- Highly variable forms with inconsistent labels. Use Document Classifier to route to specialized extractors, or use AI formulas with custom prompts.
Tips
- Test with at least 5 real samples that span the range of what you'll process.
- Track failure rates over time. If accuracy drops, something upstream changed (vendor changed their template, scanner changed quality).
- Save bad samples for re-testing when you change configuration.
- Use the status_ref parameter to write extraction status to a column so failures are obvious in the sheet.
- Set up an error notification. Wire the Data Extractor's error output to a Send Slack node so you hear about failures in real time.
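The failure-rate tip can be a single function over the status column. A sketch, assuming status_ref writes a string per row and that "ok" marks success; the actual status values are an assumption to check against your sheet:

```python
def failure_rate(statuses: list[str]) -> float:
    """Fraction of extractions whose status is not 'ok'.
    'ok' is an assumed success marker; substitute whatever
    status_ref actually writes in your sheet."""
    if not statuses:
        return 0.0
    return sum(s != "ok" for s in statuses) / len(statuses)
```

Run it over each day's batch; a sudden jump is your signal that something upstream changed.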
Common mistakes
- Treating one bad result as proof the tool doesn't work. Iterate. Most "bad" results are actually configuration issues.
- Adding a thousand instructions instead of fixing column names first. Names do most of the work; instructions handle the edge cases.
- Forgetting that Multi-row inverts the meaning. A summary document with Multi-row ON returns garbage. Always check this when accuracy collapses.
- Not using a Document Classifier when you should. If you're extracting from 5+ types of documents, classify first, route second.
- Skipping OCR on scans because "the AI can read images." It can — just not as well as it can read text.
Related articles
- Extract data from PDFs and documents
- Automate extraction with workflows
- Build your first workflow
- AI columns and formulas
- Nodes reference (overview) — Document Classifier, OCR PDF, Switch
Updated on: 16/04/2026