Improve extraction accuracy
When the Data Extractor returns wrong, missing, or inconsistent values, the cause is almost always one of six things. This article is the troubleshooting checklist; work through it top to bottom.
Quick diagnosis: what does "wrong" look like?
| Symptom | Most likely cause | Fix |
|---|---|---|
| Empty values for some columns | Column name doesn't match document language | Rename columns to match what's actually on the page |
| Wrong values mixed with right values | Multi-row turned on for a summary document | Turn Multi-row OFF |
| Returns "I don't see this" or refuses | Document is encrypted or AI safety filter triggered | OCR first, split the document, or use a different extractor config |
| Bad values from scanned PDFs only | No text layer | Add OCR PDF before extraction |
| Inconsistent across vendors | Single config trying to handle too many formats | Split into one extractor per vendor type, or add a Document Classifier |
| Numbers off by orders of magnitude | Currency or thousands-separator confusion | Add an instruction: "Numbers may use European format with comma as decimal separator" |
| Dates in the wrong format | Date locale ambiguous | Add an instruction: "Use ISO format YYYY-MM-DD for all dates" |
| Random failures with no clear pattern | Source documents vary too widely | Pre-classify with a Document Classifier and route to specialized extractors |
Most issues fix at the configuration level (column names + instructions). Reach for OCR or classifier nodes only after those don't work.
Fix 1: tighten column names
The single highest-leverage thing you can do.
Before: Number, Date, Amount, Name
After: Invoice Number, Invoice Date, Total Amount (incl. tax), Vendor Name
Specific names disambiguate: the model has to guess less, so it's right more often.
If your destination system has a strict schema, name columns to match it exactly. The Data Extractor will produce values that drop right into your sheet without renaming.
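If an extractor already emits vague names and you can't rename in place, a thin mapping layer can align its output to the destination schema. A minimal sketch, assuming rows come out as dicts; the column names are the ones from the example above, and the rename step itself is hypothetical post-processing, not a built-in setting:

```python
# Map vague extractor column names onto the destination schema.
# These names come from the Before/After example; adjust to your sheet.
RENAME = {
    "Number": "Invoice Number",
    "Date": "Invoice Date",
    "Amount": "Total Amount (incl. tax)",
    "Name": "Vendor Name",
}

def align_to_schema(row: dict) -> dict:
    """Rename known columns; pass unknown columns through unchanged."""
    return {RENAME.get(key, key): value for key, value in row.items()}
```

Naming columns correctly in the extractor config is still the better fix; this only bridges the gap when you can't.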
Fix 2: add instructions
Every Data Extractor configuration accepts an Instructions field. This is the most underused setting. Use it for:
- Number formats: "Total Amount is the grand total including tax — not the subtotal." "Numbers may use European format with comma as decimal separator and period as thousands."
- Date formats: "Use ISO format YYYY-MM-DD for all dates." "If the date is ambiguous (e.g., 03/04/2026), assume DD/MM/YYYY."
- Field disambiguation: "Vendor Name is the company sending the invoice — not the recipient."
- What to skip: "Skip subtotals, taxes, and discount lines — only extract individual line items."
- What to do when missing: "If the Invoice Number isn't visible, return an empty value (don't guess)."
Keep instructions specific to the document type. Generic instructions ("be accurate") don't help.
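Pulling the bullets above together, a filled-in Instructions field for a summary invoice might read like this (wording adapted from the examples above; adjust to your own documents):

```
Total Amount is the grand total including tax, not the subtotal.
Use ISO format YYYY-MM-DD for all dates. If a date is ambiguous
(e.g., 03/04/2026), assume DD/MM/YYYY.
Vendor Name is the company sending the invoice, not the recipient.
If the Invoice Number isn't visible, return an empty value; don't guess.
```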
Fix 3: turn Multi-row on or off appropriately
| Document type | Multi-row |
|---|---|
| Single-summary invoice (just totals) | OFF |
| Multi-line invoice (one row per line item) | ON |
| Receipt with itemized purchases | ON |
| Bank statement with many transactions | ON |
| Application form with a single set of fields | OFF |
| Contract with single party info | OFF |
Multi-row ON tells the model: "the document contains a table; produce one extracted row per row in that table." If the document doesn't actually have a table, the model will invent rows.
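To make the difference concrete, here is a sketch of the two output shapes. The field names and the list-of-dicts representation are illustrative assumptions, not the product's actual output format:

```python
# Hypothetical output shapes; the real extractor's format may differ.

# Multi-row OFF on a summary invoice: one extracted row for the whole document.
summary_rows = [
    {"Invoice Number": "INV-1042", "Total Amount": "118.00"},
]

# Multi-row ON on an itemized invoice: one extracted row per table row.
line_item_rows = [
    {"Item": "Widget", "Qty": 2, "Line Total": "40.00"},
    {"Item": "Gadget", "Qty": 1, "Line Total": "78.00"},
]
```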
Fix 4: OCR scanned PDFs first
A scanned PDF looks like text but is actually an image of text. The Data Extractor can work on scans (the AI vision models read images), but quality is much higher when there's a real text layer.
In a workflow: add an OCR PDF node between the trigger and the Data Extractor.
In the spreadsheet: use OCRMYPDF(file_url, "eng", output_ref) to add a text layer first.
Signs you have a scan: file size disproportionately large for page count, you can't select text in the PDF viewer, copying text gives garbage.
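The "can't select text" check can be automated when you're triaging many files. A minimal heuristic, assuming you've already counted extractable characters per page with any PDF text library; the 100-character threshold is an assumption to tune on your own files:

```python
def looks_like_scan(extracted_chars: int, page_count: int,
                    threshold: int = 100) -> bool:
    """Heuristic: a PDF whose text layer yields almost no characters
    per page is probably a scan that needs OCR first."""
    if page_count == 0:
        return False
    return extracted_chars / page_count < threshold
```

A ten-page PDF that yields only 40 characters of text is almost certainly an image of text; route it through OCR before extraction.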
Fix 5: limit page range
If the data you need is always on the first page (or pages 2–5), set the Page Range parameter to 1 (or 2-5). This:
- Speeds up extraction.
- Reduces page consumption against your plan.
- Improves accuracy by keeping the model focused on the relevant content.
Page Range syntax: a range like 1-3, a list like 2,5,7, or a single page like 1.
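For reference, the syntax above expands to page numbers like this. A sketch of the expansion logic, assuming the three forms described; the product's own parser may differ in edge cases:

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a Page Range string like '1-3', '2,5,7', or '1'
    into an explicit list of page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages
```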
Fix 6: split the job by document type
If you're trying to handle invoices, receipts, and contracts with one extractor, you'll get inconsistent results. Each document type has its own structure, terminology, and edge cases.
Pattern:
[Trigger]
│
▼
[Document Classifier] // returns "Invoice", "Receipt", or "Contract"
│
▼
[Switch on classification]
│
├─ Invoice ─→ [Data Extractor: invoice config]
├─ Receipt ─→ [Data Extractor: receipt config]
└─ Contract ─→ [Data Extractor: contract config]
Each extractor is now tightly tuned for one document type. Accuracy goes up across the board.
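The Switch step in the diagram is just a lookup from classifier label to extractor config. A sketch, where the config values are hypothetical stand-ins for three separately tuned extractors:

```python
# Hypothetical stand-ins for three separately tuned extractor configs.
EXTRACTOR_CONFIGS = {
    "Invoice": "invoice config",
    "Receipt": "receipt config",
    "Contract": "contract config",
}

def route(classification: str) -> str:
    """Pick the specialized extractor for a classifier label.
    Unknown labels fail loudly instead of using the wrong config."""
    try:
        return EXTRACTOR_CONFIGS[classification]
    except KeyError:
        raise ValueError(f"No extractor configured for {classification!r}")
```

Failing loudly on an unknown label matters: a misclassified document silently routed through the wrong config produces exactly the "wrong values mixed with right values" symptom from the diagnosis table.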
Worked example: dialing in invoice extraction
A team gets bad results extracting from invoices. Here's the iteration:
v1: columns Number, Date, Amount, Vendor. Multi-row ON. Result: 60% accuracy.
v2: rename to Invoice Number, Invoice Date, Total Amount, Vendor Name. Multi-row OFF (these are summary invoices). Result: 80% accuracy.
v3: add instructions: "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates. Vendor Name is the company sending the invoice." Result: 92% accuracy.
v4: for the remaining 8% (mostly scans), add an OCR PDF node before the extractor. Result: 98% accuracy.
v5: for the remaining 2% (a single foreign-language vendor), split off a second extractor with its own instructions. Result: 99%+.
This is the playbook: tighten names, then add instructions, then OCR, then split. Don't jump to the harder fixes until the easier ones are exhausted.
When extraction is fundamentally wrong, not just inaccurate
A few document patterns just don't work well with column-based extraction:
- Free-form text with no consistent structure (handwritten notes, emails). Use the GPT or CLAUDE formula instead with a custom prompt.
- Documents that are mostly numeric tables (financial statements, balance sheets). The Data Extractor handles these but may miss footnotes; verify a sample by hand.
- Highly variable forms with inconsistent labels. Use Document Classifier to route to specialized extractors, or use AI formulas with custom prompts.
Tips
- Test with at least 5 real samples that span the range of what you'll process.
- Track failure rates over time. If accuracy drops, something upstream changed (vendor changed their template, scanner changed quality).
- Save bad samples for re-testing when you change configuration.
- Use the status_ref parameter to write extraction status to a column so failures are obvious in the sheet.
- Set up an error notification. Wire the Data Extractor's error output to a Send Slack node so you hear about failures in real time.
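The failure-rate tip can be a single function over the status column. A sketch, assuming status_ref writes a string per row and that "ok" marks success; the actual status values are an assumption to check against your sheet:

```python
def failure_rate(statuses: list[str]) -> float:
    """Fraction of extractions whose status is not 'ok'.
    'ok' is an assumed success marker; substitute whatever
    status_ref actually writes in your sheet."""
    if not statuses:
        return 0.0
    return sum(s != "ok" for s in statuses) / len(statuses)
```

Run it over each day's batch; a sudden jump is your signal that something upstream changed.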
Common mistakes
- Treating one bad result as proof the tool doesn't work. Iterate. Most "bad" results are actually configuration issues.
- Adding a thousand instructions instead of fixing column names first. Names do most of the work; instructions handle the edge cases.
- Forgetting that Multi-row inverts the meaning. A summary document with Multi-row ON returns garbage. Always check this when accuracy collapses.
- Not using a Document Classifier when you should. If you're extracting from 5+ types of documents, classify first, route second.
- Skipping OCR on scans because "the AI can read images." It can — just not as well as it can read text.
Related articles
- Extract data from PDFs and documents
- Automate extraction with workflows
- Build your first workflow
- AI columns and formulas
- Nodes reference (overview) — Document Classifier, OCR PDF, Switch
Updated on: 16/04/2026