Node deep dive: Data Extractor
The Data Extractor is the most-used node in Lido workflows. It takes a file or email and returns structured data using the same engine that powers the spreadsheet UI. This article covers every parameter, the most common configurations, and how to debug when extraction goes sideways.
For the higher-level concept (when to use extraction at all, what it can extract), see Extract data from PDFs and documents. This article assumes you've decided to use it and now need to configure it inside a workflow.
What it does
Given a file (PDF, image, Office document) or an email, the Data Extractor:
- Reads the source.
- Runs OCR if needed (for scanned PDFs or images).
- Sends content + your column definitions + your instructions to an AI model.
- Returns structured JSON: one row per extracted item.
Output shape depends on the Response Format parameter (see below).
Source Type
Choose where the document comes from.
Source Type | When to use | Reference expression |
|---|---|---|
File | The trigger or upstream node produced a file (Drive, OneDrive, manual upload, API webhook) | |
Email | The trigger was an email (Outlook Trigger, Lido Mailbox); extract from body and/or attachments | |
Worksheet | The source is a tab in an Excel/Google Sheets file produced upstream | |
For an email with multiple attachments where you want to process each one individually, use a Loop node before the Data Extractor and set Source Type: File with {{$item.data.attachments[$loop.index].url}}.
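To make the Loop pattern concrete, the sketch below mimics in plain Python what that expression resolves to on each iteration. The item structure (a `data.attachments` list whose entries carry a `url`) is read off the expression above. The sample URLs and the `resolve_attachment_url` helper are hypothetical, and Lido evaluates the real expression itself.

```python
# Hypothetical trigger item, shaped after the expression
# {{$item.data.attachments[$loop.index].url}}. The URLs are placeholders.
item = {
    "data": {
        "attachments": [
            {"url": "https://example.com/files/invoice-a.pdf"},
            {"url": "https://example.com/files/invoice-b.pdf"},
        ]
    }
}


def resolve_attachment_url(item: dict, loop_index: int) -> str:
    """Return the URL the expression would point at for one loop iteration."""
    return item["data"]["attachments"][loop_index]["url"]


# One Data Extractor run per attachment, each with Source Type: File.
sources = [
    resolve_attachment_url(item, i)
    for i in range(len(item["data"]["attachments"]))
]
print(sources)
```

Each entry in `sources` stands for the Source value one Data Extractor run would receive inside the loop.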
Columns
A list of field names — one per piece of data you want extracted. Examples:
Vendor Name, Invoice Number, Total Amount, Due Date
Patient Name, DOB, Diagnosis Code, Treatment Date
Counterparty, Effective Date, Termination Date, Notice Period
Naming guidance:
- Use names that match what's on the document (the AI matches semantically, but consistent names help).
- Be specific. Date is ambiguous; Invoice Date, Due Date, Service Date are clearer.
- Avoid all-caps and special characters.
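The naming guidance can be turned into a quick pre-flight check before you paste columns into the node. This is a minimal sketch, not a Lido feature: the `lint_column` helper, the list of ambiguous names, and the rule that short acronyms such as DOB are exempt from the all-caps check are all assumptions made for illustration.

```python
import re

# Names this article calls out as too vague on their own (assumed list).
AMBIGUOUS = {"date", "number"}


def lint_column(name: str) -> list[str]:
    """Return warnings for one column name; an empty list means it looks fine."""
    warnings = []
    stripped = name.strip()
    if stripped.lower() in AMBIGUOUS:
        warnings.append("ambiguous: say which one, e.g. 'Invoice Date'")
    # Assumption: acronyms of four characters or fewer (DOB) are acceptable.
    if stripped.isupper() and len(stripped) > 4:
        warnings.append("all-caps")
    # Anything other than letters, digits, and spaces counts as special.
    if re.search(r"[^A-Za-z0-9 ]", stripped):
        warnings.append("special characters")
    return warnings


for col in ["Date", "TOTAL AMOUNT", "Invoice #", "Due Date", "DOB"]:
    print(col, lint_column(col))
```

Names that come back with an empty list follow the guidance above; anything else is worth renaming or disambiguating in Instructions.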
Instructions
Free-form guidance to the AI alongside the columns. Use sparingly but deliberately. Most accuracy improvements come from instructions, not from re-naming columns.
Good instructions:
- "Total Amount is the grand total including tax, not the subtotal."
- "Use ISO format YYYY-MM-DD for all dates."
- "Phone numbers in E.164 format (+1 555 555 5555 → +15555555555)."
- "For Status, return one of: Active, Pending, Closed."
- "If a field is missing from the document, return an empty string, not 'N/A'."
See Improve extraction accuracy for a deeper treatment.
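One way to confirm that instructions like these are being followed is to check extracted rows downstream. The sketch below is a plain-Python illustration under assumptions: the field names `Due Date`, `Phone`, and `Status` are hypothetical column names, and `check_row` is not part of Lido.

```python
import re
from datetime import date

ALLOWED_STATUS = {"Active", "Pending", "Closed"}
# E.164: a plus sign, then up to 15 digits, the first of which is not zero.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")


def is_iso_date(value: str) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def check_row(row: dict) -> list[str]:
    """Return a list of instruction violations found in one extracted row."""
    problems = []
    for field, value in row.items():
        if value == "N/A":
            problems.append(f"{field}: use an empty string for missing values")
    due = row.get("Due Date", "")
    if due and not is_iso_date(due):
        problems.append("Due Date: not YYYY-MM-DD")
    phone = row.get("Phone", "")
    if phone and not E164.match(phone):
        problems.append("Phone: not E.164")
    status = row.get("Status", "")
    if status and status not in ALLOWED_STATUS:
        problems.append("Status: not one of Active, Pending, Closed")
    return problems


print(check_row({"Due Date": "16/04/2026", "Phone": "+1 555 555 5555", "Status": "Open"}))
```

If a check like this keeps failing on the same field, that is usually a sign the matching instruction needs to be more explicit.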
Multi-row vs. summary
A single boolean. The most consequential setting in the node.
Setting | Returns | Use for |
|---|---|---|
Multi-row OFF (default) | One row of summary fields | Invoices, contracts, receipts — extract Vendor/Total/Date once |
Multi-row ON | One row per item in a table | Line items, transactions, employee rosters — return many rows |
Often you want both: extract header fields (single-row) AND line items (multi-row) from the same document. Run two Data Extractor nodes side by side.
Page Range
Limit which pages to process. This saves money, because the page count you pay for is the number of pages processed, and it improves speed.
Value | Meaning |
|---|---|
(empty) | Process all pages |
| Page 1 only |
| Pages 1, 2, 3 |
| Pages 2, 5, 7 |
| Pages 1, 3, 4, 5, 8 |
If you only need data from page 1 of a 50-page PDF, set Page Range to 1 and you pay for 1 page, not 50.
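The effect on page count can be sketched in a few lines of Python. The literal Page Range values are not shown in the table above, so the syntax used here (comma-separated pages and hyphenated ranges, such as `1,3-5,8`) is an assumption, and `parse_page_range` is a hypothetical helper rather than anything Lido exposes.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1,3-5,8" into sorted page numbers.

    An empty spec means every page, matching the first row of the table.
    The comma-and-hyphen syntax is assumed for illustration.
    """
    if not spec.strip():
        return list(range(1, total_pages + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Ignore pages beyond the end of the document.
    return sorted(p for p in pages if 1 <= p <= total_pages)


# A 50-page PDF: pages processed with and without a Page Range.
print(len(parse_page_range("", 50)))   # all pages
print(len(parse_page_range("1", 50)))  # page 1 only
```

Comparing the two counts before activating a workflow gives a rough idea of how much allowance a Page Range saves per document.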
Response Format
How the extracted data is returned to downstream nodes.
Format | Output shape | When to use |
|---|---|---|
Array | A single item with all extracted rows nested inside it | You want downstream nodes to receive everything as one item and process in bulk |
Objects (default) | One item per extracted row | You want to fan out — e.g., loop over each row and write to a database |
For multi-row extraction with downstream processing per row, Objects is almost always what you want. The downstream node sees N items, runs once per item.
For multi-row extraction where you want to write all rows to a single sheet in one shot, Array is cleaner.
Split Rows as Items
When Response Format is Objects, this further controls fan-out:
- Split ON (recommended for multi-row): each row becomes its own workflow item.
- Split OFF: you get one item with all rows nested.
For 99% of multi-row pipelines, leave it ON.
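The difference between the settings comes down to how many items the next node receives. The sketch below models that in plain Python; it is an illustration of the behavior described above, not Lido's implementation. The sample rows and the `rows` key used for the nested shape are placeholders, since the real field name is not given here.

```python
# Illustrative line items; real values come from your document.
rows = [
    {"Description": "Widget", "Qty": "2", "Price": "9.50"},
    {"Description": "Gadget", "Qty": "1", "Price": "20.00"},
]


def emit_items(rows: list[dict], response_format: str, split: bool = True) -> list[dict]:
    """Model how many workflow items a downstream node would receive."""
    if response_format == "Objects" and split:
        # Fan-out: one workflow item per extracted row.
        return [dict(row) for row in rows]
    # Array, or Objects with Split OFF: one item with every row nested.
    # The "rows" key is a placeholder, not a documented field name.
    return [{"rows": list(rows)}]


for fmt, split in [("Objects", True), ("Objects", False), ("Array", True)]:
    items = emit_items(rows, fmt, split)
    print(f"{fmt}, split={split}: downstream node runs {len(items)} time(s)")
```

With two extracted rows, only Objects with Split ON makes the downstream node run twice; the other combinations hand it a single item.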
Worksheet (Excel and Google Sheets)
When the source is an Excel or Google Sheets file, you can specify which worksheet (tab) to extract from.
- Default: the first/active worksheet.
- Named: specify by name (e.g., Sheet1, 2026 Invoices).
- Range within worksheet: combine with Page Range (treated as row range).
For multi-tab workbooks where each tab is its own document, use a Loop node over the tab list and pass the worksheet name to the extractor.
Worked example: invoice processing pipeline
Goal: new invoices land in a Drive folder; write the extracted data to a sheet and notify Slack for each one.
Google Drive Trigger (folder: "Inbox")
↓
Data Extractor
- Source Type: File
- Source: {{$item.data.file}}
- Columns: Vendor Name, Invoice Number, Total Amount, Due Date
- Instructions: "Use ISO date format. Total includes tax."
- Multi-row: OFF
- Page Range: 1
- Response Format: Objects
↓
Spreadsheet (Add Row to Table)
- Table: Invoices
- Map fields: Vendor Name → A, Invoice Number → B, Total Amount → C, Due Date → D
↓
Send Slack
- Channel: #invoices
- Message: "New invoice: {{$item.previous.data.Vendor Name}} for ${{$item.previous.data.Total Amount}}"
Worked example: line items + header
Goal: for each invoice, extract header (Vendor, Total, Date) AND line items (Description, Qty, Price).
Trigger
↓
Data Extractor #1 (header)
- Multi-row: OFF
- Columns: Vendor Name, Invoice Number, Total Amount, Due Date
↓                                  ↓
Data Extractor #2 (line items)     Spreadsheet (write header to Invoices table)
- Multi-row: ON
- Columns: Description, Qty, Price
↓
Loop over rows
↓
Spreadsheet (write each line item to LineItems table, with reference to invoice ID)
This pattern parallels the Pre-Header + Line Items configuration in the spreadsheet UI.
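The last step, writing each line item with a reference back to its invoice, amounts to a small join. The sketch below shows the idea in plain Python under assumptions: `Invoice Number` is used as the invoice reference, `Source URL` is a hypothetical extra column for the audit trail, and the sample values are illustrative.

```python
def attach_invoice_ref(header: dict, line_items: list[dict], source_url: str) -> list[dict]:
    """Copy the invoice reference and source file URL onto every line item."""
    ref = header["Invoice Number"]
    return [
        {"Invoice Number": ref, "Source URL": source_url, **item}
        for item in line_items
    ]


# Output of Data Extractor #1 (Multi-row OFF), illustrative values.
header = {"Vendor Name": "Acme", "Invoice Number": "INV-001",
          "Total Amount": "39.00", "Due Date": "2026-05-01"}
# Output of Data Extractor #2 (Multi-row ON), illustrative values.
line_items = [
    {"Description": "Widget", "Qty": "2", "Price": "9.50"},
    {"Description": "Gadget", "Qty": "1", "Price": "20.00"},
]

for row in attach_invoice_ref(header, line_items, "https://example.com/files/inv-001.pdf"):
    print(row)
```

Every row destined for the LineItems table then carries enough information to be traced back to its header row and its source file.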
Debugging extraction in workflows
When the Data Extractor returns wrong data:
- Open the workflow run. Find the Data Extractor node's output.
- Check the input. Was the right file passed in? (Wrong file is the most common cause of wrong output.)
- Click "View extracted data". See the raw response.
- Compare to what you expected. Is the field missing? Wrong value? Wrong type?
- Reproduce in the spreadsheet UI. Open a Lido sheet, drop the same file into a Data Extractor with the same configuration, and check whether the issue reproduces. It almost always does, and the UI is faster to iterate in.
- Improve. Adjust columns, instructions, or page range. Test again.
- Carry the change back into the workflow. Re-run.
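The "compare to what you expected" step is easier with a small diff that sorts each field into missing, wrong value, or wrong type. This is a hypothetical helper for use on a copied node output, not a Lido feature, and the sample values are illustrative.

```python
def diff_row(expected: dict, actual: dict) -> dict[str, str]:
    """Classify each expected field as missing, wrong type, or wrong value."""
    findings = {}
    for field, want in expected.items():
        got = actual.get(field)
        if got in ("", None):
            findings[field] = "missing"
        elif type(got) is not type(want):
            findings[field] = f"wrong type: {type(got).__name__}"
        elif got != want:
            findings[field] = f"wrong value: {got!r}"
    return findings


expected = {"Vendor Name": "Acme", "Total Amount": 118.0, "Due Date": "2026-05-01"}
actual = {"Vendor Name": "Acme", "Total Amount": "118.00", "Due Date": ""}
print(diff_row(expected, actual))
```

A missing field usually points at Page Range or the source file, a wrong value at Instructions, and a wrong type at how the format was specified.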
Tips
- Always test a real document first. Don't write 10 columns and pray.
- Set Page Range to limit cost. Most extraction work touches just 1–3 pages.
- Use Objects + Split ON for fan-out, Array for single-batch downstream.
- Persist the source file URL alongside extracted data. Audit trail when something looks wrong.
- For sensitive documents, remember the 23-hour deletion: extracted data stays in your sheet, but the original file is deleted after 23 hours.
- Set instructions in normalized form. "ISO date" beats "Excel date format". The AI understands ISO better.
Common mistakes
- Multi-row OFF when extracting line items (returns one row per page instead of one row per line).
- Multi-row ON when extracting summary fields (returns one row per perceived "item" — usually one — but with weird shape).
- No Page Range on a 100-page PDF that has data on page 1. Burns 100 pages of allowance.
- Vague column names like "Date" or "Number" with no instructions to disambiguate.
- Source Type: File when the trigger was an email. Use Source Type: Email and reference the email; or Loop over attachments and pass each as File.
- Forgetting to test before activating. Misconfigured workflows can burn a week's allowance overnight.
Related articles
- Extract data from PDFs and documents
- Improve extraction accuracy
- Automate extraction with workflows
- Build your first workflow
- Triggers: how workflows start
Updated on: 16/04/2026