
Improve extraction accuracy

When the Data Extractor returns wrong, missing, or inconsistent values, the cause is almost always one of six things. This article is the troubleshooting checklist: work through it top to bottom.



Quick diagnosis: what does "wrong" look like?


| Symptom | Most likely cause | Fix |
| --- | --- | --- |
| Empty values for some columns | Column name doesn't match document language | Rename columns to match what's actually on the page |
| Wrong values mixed with right values | Multi-row turned on for a summary document | Turn Multi-row OFF |
| Returns "I don't see this" or refuses | Document is encrypted or AI safety filter triggered | OCR first, or split the document, or use a different extractor config |
| Bad values from scanned PDFs only | No text layer | Add OCR PDF before extraction |
| Inconsistent across vendors | Single config trying to handle too many formats | Split into one extractor per vendor type, or add Document Classifier |
| Numbers off by orders of magnitude | Currency or thousands-separator confusion | Add an instruction: "Numbers may use European format with comma as decimal separator" |
| Dates in the wrong format | Date locale ambiguous | Add an instruction: "Use ISO format YYYY-MM-DD for all dates" |
| Random failures with no clear pattern | Source documents vary too widely | Pre-classify with Document Classifier and route to specialized extractors |


Most issues are fixed at the configuration level (column names + instructions). Reach for OCR or classifier nodes only after those don't work.



Fix 1: tighten column names


The single highest-leverage thing you can do.


Before: Number, Date, Amount, Name


After: Invoice Number, Invoice Date, Total Amount (incl. tax), Vendor Name


Specific names remove ambiguity: the model has to guess less, so it's right more often.


If your destination system has a strict schema, name columns to match it exactly. The Data Extractor will produce values that drop right into your sheet without renaming.
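Matching the schema in the extractor configuration saves you the rename step you'd otherwise have to write downstream. A minimal sketch of that otherwise-needed step, assuming a destination schema with snake_case field names (the mapping below is illustrative, not part of the product):

```python
# Hypothetical mapping from extractor column names to a destination schema.
RENAME = {
    "Invoice Number": "invoice_number",
    "Invoice Date": "invoice_date",
    "Total Amount (incl. tax)": "total_amount",
    "Vendor Name": "vendor_name",
}

def to_schema(extracted: dict) -> dict:
    # Rename extracted columns; naming them to match the schema up front
    # makes this whole step unnecessary.
    return {RENAME.get(k, k): v for k, v in extracted.items()}

row = to_schema({"Invoice Number": "INV-042", "Vendor Name": "Acme GmbH"})
print(row)  # {'invoice_number': 'INV-042', 'vendor_name': 'Acme GmbH'}
```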



Fix 2: add instructions


Every Data Extractor configuration accepts an Instructions field. This is the most underused setting. Use it for:


  • Number formats: "Total Amount is the grand total including tax — not the subtotal." "Numbers may use European format with comma as decimal separator and period as thousands."
  • Date formats: "Use ISO format YYYY-MM-DD for all dates." "If the date is ambiguous (e.g., 03/04/2026), assume DD/MM/YYYY."
  • Field disambiguation: "Vendor Name is the company sending the invoice — not the recipient."
  • What to skip: "Skip subtotals, taxes, and discount lines — only extract individual line items."
  • What to do when missing: "If the Invoice Number isn't visible, return an empty value (don't guess)."
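The two format instructions above can be mirrored in code to show exactly what the model is being asked to do with ambiguous inputs. A sketch in Python (the function names are illustrative):

```python
from datetime import datetime

def parse_european_number(text: str) -> float:
    # European format: "1.234,56" uses period as thousands separator
    # and comma as decimal separator.
    return float(text.replace(".", "").replace(",", "."))

def to_iso_date(text: str) -> str:
    # Ambiguous "03/04/2026": per the instruction above, assume DD/MM/YYYY,
    # then emit ISO format YYYY-MM-DD.
    return datetime.strptime(text, "%d/%m/%Y").strftime("%Y-%m-%d")

print(parse_european_number("1.234,56"))  # 1234.56
print(to_iso_date("03/04/2026"))          # 2026-04-03
```

Without the instruction, the model has to pick a convention on its own, and 1.234,56 read as US-style 1.234 is exactly the "off by orders of magnitude" symptom in the table above.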


Keep instructions specific to the document type. Generic instructions ("be accurate") don't help.



Fix 3: turn Multi-row on or off appropriately


| Document type | Multi-row |
| --- | --- |
| Single-summary invoice (just totals) | OFF |
| Multi-line invoice (one row per line item) | ON |
| Receipt with itemized purchases | ON |
| Bank statement with many transactions | ON |
| Application form with single set of fields | OFF |
| Contract with single party info | OFF |


Multi-row ON tells the model: "the document contains a table; produce one extracted row per row in that table." If the document doesn't actually have a table, the model will invent rows.
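The difference is the shape of the output. An illustrative sketch (field names and values are made up):

```python
# Multi-row OFF: a summary invoice yields one row for the whole document.
summary_row = {"Invoice Number": "INV-042", "Total Amount": 1234.56}

# Multi-row ON: the same extractor pointed at an itemized table yields
# one row per line item.
line_item_rows = [
    {"Description": "Widget", "Amount": 400.00},
    {"Description": "Gadget", "Amount": 834.56},
]

# Turning Multi-row ON for the summary document would force the model to
# invent rows like these where no table exists.
```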



Fix 4: OCR scanned PDFs first


A scanned PDF looks like text but is actually an image of text. The Data Extractor can work on scans (the AI vision models read images), but quality is much higher when there's a real text layer.


In a workflow: add an OCR PDF node between the trigger and the Data Extractor.


In the spreadsheet: use OCRMYPDF(file_url, "eng", output_ref) to add a text layer first.


Signs you have a scan: file size disproportionately large for page count, you can't select text in the PDF viewer, copying text gives garbage.
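The file-size sign can be turned into a crude pre-check before deciding whether to add the OCR step. A sketch only; the threshold below is a rough assumption, not a product setting:

```python
def looks_like_scan(file_size_bytes: int, page_count: int,
                    threshold_kb_per_page: int = 300) -> bool:
    # Image-heavy scans usually weigh far more per page than text-layer PDFs.
    # 300 KB/page is an assumed cutoff; tune it against your own documents.
    return file_size_bytes / max(page_count, 1) > threshold_kb_per_page * 1024

print(looks_like_scan(9_000_000, 3))   # True: ~3 MB per page, likely a scan
print(looks_like_scan(250_000, 10))    # False: ~25 KB per page, likely has text
```

This is a heuristic, not proof; the definitive check is still trying to select or copy text in a PDF viewer.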



Fix 5: limit page range


If the data you need is always on the first page (or pages 2–5), set the Page Range parameter to 1 (or 2-5). This:


  • Speeds up extraction.
  • Reduces page consumption against your plan.
  • Improves accuracy by keeping the model focused on the relevant content.


Page Range syntax: a range like 1-3, a comma-separated list like 2,5,7, or a single page like 1.
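To make the syntax concrete, here's how such a spec expands into page numbers. This is a sketch of the format, not the product's own parser:

```python
def parse_page_range(spec: str) -> list[int]:
    # "1-3" -> [1, 2, 3]; "2,5,7" -> [2, 5, 7]; "1" -> [1]
    pages = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages

print(parse_page_range("1-3"))    # [1, 2, 3]
print(parse_page_range("2,5,7"))  # [2, 5, 7]
print(parse_page_range("1"))      # [1]
```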



Fix 6: split the job by document type


If you're trying to handle invoices, receipts, and contracts with one extractor, you'll get inconsistent results. Each document type has its own structure, terminology, and edge cases.


Pattern:


[Trigger]


[Document Classifier] // returns "Invoice", "Receipt", or "Contract"


[Switch on classification]

├─ Invoice ─→ [Data Extractor: invoice config]
├─ Receipt ─→ [Data Extractor: receipt config]
└─ Contract ─→ [Data Extractor: contract config]


Each extractor is now tightly tuned for one document type. Accuracy goes up across the board.
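The Switch step amounts to a lookup from classifier label to extractor config. A sketch of that routing logic, using the labels and config names from the diagram above (the fallback behavior is an assumption):

```python
# Routing table: classifier label -> extractor config to run.
EXTRACTOR_FOR = {
    "Invoice": "invoice config",
    "Receipt": "receipt config",
    "Contract": "contract config",
}

def route(classification: str) -> str:
    # Labels the classifier wasn't trained for fall through to a
    # manual-review bucket rather than a mismatched extractor.
    return EXTRACTOR_FOR.get(classification, "manual review")

print(route("Invoice"))         # invoice config
print(route("Purchase Order"))  # manual review
```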



Worked example: dialing in invoice extraction


A team gets bad results extracting from invoices. Here's the iteration:


v1: columns Number, Date, Amount, Vendor. Multi-row ON. Result: 60% accuracy.


v2: rename to Invoice Number, Invoice Date, Total Amount, Vendor Name. Multi-row OFF (these are summary invoices). Result: 80% accuracy.


v3: add instructions: "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates. Vendor Name is the company sending the invoice." Result: 92% accuracy.


v4: for the remaining 8% (mostly scans), add an OCR PDF node before the extractor. Result: 98% accuracy.


v5: for the remaining 2% (a single foreign-language vendor), split off a second extractor with its own instructions. Result: 99%+.


This is the playbook: tighten names, then add instructions, then OCR, then split. Don't jump to the harder fixes until the easier ones are exhausted.



When extraction is fundamentally wrong, not just inaccurate


A few document patterns just don't work well with column-based extraction:


  • Free-form text with no consistent structure (handwritten notes, emails). Use the GPT or CLAUDE formula with a custom prompt instead.
  • Documents that are mostly tables of numbers (financial statements, balance sheets). The Data Extractor handles these but may miss footnotes; verify a sample by hand.
  • Highly variable forms with inconsistent labels. Use Document Classifier to route to specialized extractors, or use AI formulas with custom prompts.



Tips


  • Test with at least 5 real samples that span the range of what you'll process.
  • Track failure rates over time. If accuracy drops, something upstream changed (vendor changed their template, scanner changed quality).
  • Save bad samples for re-testing when you change configuration.
  • Use the status_ref parameter to write extraction status to a column so failures are obvious in the sheet.
  • Set up an error notification. Wire the Data Extractor's error output to a Send Slack node so you hear about failures in real time.



Common mistakes


  • Treating one bad result as proof the tool doesn't work. Iterate. Most "bad" results are actually configuration issues.
  • Adding a thousand instructions instead of fixing column names first. Names do most of the work; instructions handle the edge cases.
  • Leaving Multi-row in the wrong state. A summary document with Multi-row ON returns garbage rows. Always check this setting when accuracy collapses.
  • Not using a Document Classifier when you should. If you're extracting from 5+ types of documents, classify first, route second.
  • Skipping OCR on scans because "the AI can read images." It can — just not as well as it can read text.




Related articles

  • Extract data from PDFs and documents
  • Automate extraction with workflows
  • Build your first workflow
  • AI columns and formulas
  • Nodes reference (overview) — Document Classifier, OCR PDF, Switch

Updated on: 16/04/2026
