
Node deep dive: Data Extractor

The Data Extractor is the most-used node in Lido workflows. It takes a file or email and returns structured data using the same engine that powers the spreadsheet UI. This article covers every parameter, the most common configurations, and how to debug when extraction goes sideways.

For the higher-level concept (when to use extraction at all, what it can extract), see Extract data from PDFs and documents. This article assumes you've decided to use it and now need to configure it inside a workflow.

What it does

Given a file (PDF, image, Office document) or an email, the Data Extractor:

Reads the source.
Runs OCR if needed (for scanned PDFs or images).
Sends content + your column definitions + your instructions to an AI model.
Returns structured JSON: one row per extracted item.

Output shape depends on the Response Format parameter (see below).

Source Type

Choose where the document comes from.

Source Type	When to use	Reference expression
File	The trigger or upstream node produced a file (Drive, OneDrive, manual upload, API webhook)	`{{$item.data.file}}`
Email	The trigger was an email (Outlook Trigger, Lido Mailbox); extract from body and/or attachments	`{{$item.data.email}}`
Worksheet	The source is a tab in an Excel/Google Sheets file produced upstream	`{{$item.data.worksheet}}`

For an email with multiple attachments where you want to process each one individually, use a Loop node before the Data Extractor, set Source Type to File, and reference each attachment with `{{$item.data.attachments[$loop.index].url}}`.

Columns

A list of field names — one per piece of data you want extracted. Examples:

Vendor Name, Invoice Number, Total Amount, Due Date
Patient Name, DOB, Diagnosis Code, Treatment Date
Counterparty, Effective Date, Termination Date, Notice Period

Naming guidance:

Use names that match what's on the document (the AI matches semantically, but consistent names help).
Be specific. Date is ambiguous; Invoice Date, Due Date, Service Date are clearer.
Avoid all-caps and special characters.

Instructions

Free-form guidance to the AI alongside the columns. Use sparingly but deliberately. Most accuracy improvements come from instructions, not from re-naming columns.

Good instructions:

"Total Amount is the grand total including tax, not the subtotal."
"Use ISO format YYYY-MM-DD for all dates."
"Phone numbers in E.164 format (+1 555 555 5555 → +15555555555)."
"For Status, return one of: Active, Pending, Closed."
"If a field is missing from the document, return an empty string, not 'N/A'."

See Improve extraction accuracy for a deeper treatment.
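Instructions like the ones above are easier to trust when a downstream step checks that the output actually follows them. The following is a minimal Python sketch of such a check, not part of Lido itself. The column names `Due Date`, `Phone`, and `Status` and the function `check_row` are hypothetical and chosen only to mirror the example instructions.

```python
import re
from datetime import date

# Patterns for the formats requested in the example instructions.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
E164 = re.compile(r"^\+[1-9]\d{1,14}$")
ALLOWED_STATUS = {"Active", "Pending", "Closed"}


def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one extracted row (hypothetical columns)."""
    problems = []

    due = row.get("Due Date", "")
    if due:
        try:
            # The regex enforces the YYYY-MM-DD shape; fromisoformat rejects
            # impossible calendar dates such as 2026-02-30.
            if not ISO_DATE.match(due):
                raise ValueError(due)
            date.fromisoformat(due)
        except ValueError:
            problems.append(f"Due Date is not a valid YYYY-MM-DD date: {due!r}")

    phone = row.get("Phone", "")
    if phone and not E164.match(phone):
        problems.append(f"Phone is not in E.164 format: {phone!r}")

    status = row.get("Status", "")
    if status and status not in ALLOWED_STATUS:
        problems.append(f"Status is not one of {sorted(ALLOWED_STATUS)}: {status!r}")

    # Missing fields should come back as empty strings, not 'N/A'.
    for field, value in row.items():
        if value == "N/A":
            problems.append(f"{field} should be empty, not 'N/A'")

    return problems
```

A row that passes returns an empty list. Any non-empty result suggests the instructions need tightening before the workflow is activated.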

Multi-row vs. summary

A single boolean. The most consequential setting in the node.

Setting	Returns	Use for
Multi-row OFF (default)	One row of summary fields	Invoices, contracts, receipts — extract Vendor/Total/Date once
Multi-row ON	One row per item in a table	Line items, transactions, employee rosters — return many rows

Often you want both: extract header fields (single-row) AND line items (multi-row) from the same document. Run two Data Extractor nodes side by side.

Page Range

Limits which pages are processed. This saves money, because you are billed only for the pages processed, and speeds up extraction.

Value	Meaning
(empty)	Process all pages
`1`	Page 1 only
`1-3`	Pages 1, 2, 3
`2,5,7`	Pages 2, 5, 7
`1,3-5,8`	Pages 1, 3, 4, 5, 8

If you only need data from page 1 of a 50-page PDF, set Page Range to 1 and you pay for 1 page, not 50.
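To make the Page Range syntax concrete, here is a small Python sketch that expands a range string into the list of pages it selects. It illustrates the syntax in the table above and is not Lido's own parser. The function name `parse_page_range` and the choice to clamp ranges to the document length are assumptions.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a Page Range string such as '1,3-5,8' into sorted page numbers."""
    if not spec.strip():
        # An empty value means "process all pages".
        return list(range(1, total_pages + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {part!r}")
        # Assumption: pages beyond the end of the document are ignored.
        pages.update(range(start, min(end, total_pages) + 1))

    return sorted(pages)


# Page 1 of a 50-page PDF selects a single page.
pages = parse_page_range("1", total_pages=50)
print(len(pages), "page(s) processed instead of 50")
```

The length of the returned list is the number of pages that would be processed for that setting.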

Page Range vs. @exclude_pages — these are billed differently.

Page Range is the cheap option: you specify the pages to process up front, and you're billed only for those pages.
@exclude_pages (a directive in Extra Instructions) still bills you for every page in the document, because Lido has to read each page to decide which ones to exclude.

Reach for @exclude_pages only when the pages you want to drop can't be identified by number, e.g., "skip any page containing 'Terms and Conditions'". If you can describe the pages to keep as a page range, Page Range is always cheaper.

Response Format

How the extracted data is returned to downstream nodes.

Format	Output shape	When to use
Array	A single item with a `rows` field containing all extracted rows	You want downstream nodes to receive everything as one item and process in bulk
Objects (default)	One item per extracted row	You want to fan out — e.g., loop over each row and write to a database

For multi-row extraction with downstream processing per row, Objects is almost always what you want. The downstream node sees N items, runs once per item.

For multi-row extraction where you want to write all rows to a single sheet in one shot, Array is cleaner.
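The difference between the two formats is easiest to see as data. The Python sketch below models both shapes for the same two extracted rows. The `rows` field name comes from the table above, but the surrounding item envelope, the sample values, and the helper names `as_array` and `as_objects` are illustrative assumptions, since the exact JSON Lido emits is not shown in this article.

```python
# Two sample rows, as a multi-row extraction might produce them (values invented).
rows = [
    {"Description": "Widget", "Qty": "2", "Price": "9.50"},
    {"Description": "Gadget", "Qty": "1", "Price": "20.00"},
]


def as_array(rows: list[dict]) -> list[dict]:
    """Array format: a single item whose `rows` field holds every row."""
    return [{"rows": list(rows)}]


def as_objects(rows: list[dict]) -> list[dict]:
    """Objects format: one item per extracted row."""
    return [dict(row) for row in rows]


array_items = as_array(rows)
object_items = as_objects(rows)

# A downstream node runs once per item, so the item count is what matters.
print(len(array_items), "item with", len(array_items[0]["rows"]), "rows")
print(len(object_items), "items, one row each")
```

With Array, a downstream node runs once and sees every row. With Objects, it runs once per row.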

Split Rows as Items

When Response Format is Objects, this further controls fan-out:

Split ON (recommended for multi-row): each row becomes its own workflow item.
Split OFF: you get one item with all rows nested.

For nearly all multi-row pipelines, leave it ON.

Worksheet (Excel and Google Sheets)

When the source is an Excel or Google Sheets file, you can specify which worksheet (tab) to extract from.

Default: the first/active worksheet.
Named: specify by name (e.g., Sheet1, 2026 Invoices).
Range within worksheet: combine with Page Range (treated as row range).

For multi-tab workbooks where each tab is its own document, use a Loop node over the tab list and pass the worksheet name to the extractor.

Worked example: invoice processing pipeline

Goal: new invoices land in a Drive folder, the extracted data is written to a sheet, and Slack is notified for each invoice.

Google Drive Trigger (folder: "Inbox")
   ↓
Data Extractor
   - Source Type: File
   - Source: {{$item.data.file}}
   - Columns: Vendor Name, Invoice Number, Total Amount, Due Date
   - Instructions: "Use ISO date format. Total includes tax."
   - Multi-row: OFF
   - Page Range: 1
   - Response Format: Objects
   ↓
Spreadsheet (Add Row to Table)
   - Table: Invoices
   - Map fields: Vendor Name → A, Invoice Number → B, Total Amount → C, Due Date → D
   ↓
Send Slack
   - Channel: #invoices
   - Message: "New invoice: {{$item.previous.data.Vendor Name}} for ${{$item.previous.data.Total Amount}}"

Worked example: line items + header

Goal: for each invoice, extract header (Vendor, Total, Date) AND line items (Description, Qty, Price).

Trigger
   ↓
Data Extractor #1 (header)
   - Multi-row: OFF
   - Columns: Vendor Name, Invoice Number, Total Amount, Due Date
   ↓                              ↓
Data Extractor #2 (line items)   Spreadsheet (write header to Invoices table)
   - Multi-row: ON
   - Columns: Description, Qty, Price
   ↓
Loop over rows
   ↓
Spreadsheet (write each line item to LineItems table, with reference to invoice ID)

This pattern parallels the Pre-Header + Line Items configuration in the spreadsheet UI.
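The last step, writing each line item "with reference to invoice ID", amounts to copying a key from the header row onto every line-item row. A minimal Python sketch of that join is shown below. Using `Invoice Number` as the invoice ID, the helper name `link_line_items`, and the sample values are all assumptions made for illustration.

```python
def link_line_items(header: dict, line_items: list[dict],
                    key: str = "Invoice Number") -> list[dict]:
    """Copy the header's invoice key onto every line item so rows can be joined later."""
    invoice_id = header[key]
    return [{key: invoice_id, **item} for item in line_items]


# Output of Data Extractor #1 (header) and #2 (line items); values are invented.
header = {"Vendor Name": "Acme", "Invoice Number": "INV-001",
          "Total Amount": "39.00", "Due Date": "2026-06-01"}
line_items = [
    {"Description": "Widget", "Qty": "2", "Price": "9.50"},
    {"Description": "Gadget", "Qty": "1", "Price": "20.00"},
]

for row in link_line_items(header, line_items):
    print(row)
```

Each row destined for the LineItems table then carries the same invoice key as the row in the Invoices table.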

Debugging extraction in workflows

When the Data Extractor returns wrong data:

Open the workflow run. Find the Data Extractor node's output.
Check the input. Was the right file passed in? (Wrong file is the most common cause of wrong output.)
Click "View extracted data". See the raw response.
Compare to what you expected. Is the field missing? Wrong value? Wrong type?
Reproduce in the spreadsheet UI. Open a Lido sheet, drop the same file into a Data Extractor with the same config, and see if the issue reproduces. It almost always does, and the UI is faster to iterate in.
Improve. Adjust columns, instructions, or page range. Test again.
Carry the change back into the workflow. Re-run.
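The comparison step ("Is the field missing? Wrong value? Wrong type?") can be made systematic by diffing the extracted row against a hand-checked expected row. The Python sketch below is one way to do that outside Lido. The function name `diff_row` and the sample values are hypothetical.

```python
def diff_row(expected: dict, actual: dict) -> dict[str, str]:
    """Classify each expected field as missing, wrong type, or wrong value."""
    findings = {}
    for field, want in expected.items():
        got = actual.get(field)
        if got is None or got == "":
            findings[field] = "missing"
        elif type(got) is not type(want):
            findings[field] = (
                f"wrong type: expected {type(want).__name__}, got {type(got).__name__}"
            )
        elif got != want:
            findings[field] = f"wrong value: expected {want!r}, got {got!r}"
    return findings


# Hand-checked values versus what the node returned (both invented).
expected = {"Vendor Name": "Acme", "Total Amount": 118.0, "Due Date": "2026-06-01"}
actual = {"Vendor Name": "Acme", "Total Amount": "118.00", "Due Date": ""}

for field, problem in diff_row(expected, actual).items():
    print(f"{field}: {problem}")
```

Fields that match are omitted from the result, so an empty dictionary means the extraction agrees with the expected row.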

Tips

Always test a real document first. Don't write 10 columns and pray.
Set Page Range to limit cost. Most extraction work touches just 1–3 pages.
Use Objects + Split ON for fan-out, Array for single-batch downstream.
Persist the source file URL alongside extracted data. Audit trail when something looks wrong.
For sensitive documents, remember the 23-hour deletion window: extracted data stays in your sheet, but the original file is gone after 23 hours.
Ask for normalized formats in your instructions. "ISO date" beats "Excel date format". The AI understands ISO better.

Common mistakes

Multi-row OFF when extracting line items (returns one row per page instead of one row per line).
Multi-row ON when extracting summary fields (returns one row per perceived "item", usually one, but in an unexpected shape).
No Page Range on a 100-page PDF that has data on page 1. Burns 100 pages of allowance.
Vague column names like "Date" or "Number" with no instructions to disambiguate.
Source Type: File when the trigger was an email. Use Source Type: Email and reference the email; or Loop over attachments and pass each as File.
Forgetting to test before activating. Misconfigured workflows can burn a week's allowance overnight.

Extract data from PDFs and documents
Improve extraction accuracy
Automate extraction with workflows
Build your first workflow
Triggers: how workflows start

Updated on: 13/05/2026
