Node deep dive: Data Extractor

The Data Extractor is the most-used node in Lido workflows. It takes a file or email and returns structured data using the same engine that powers the spreadsheet UI. This article covers every parameter, the most common configurations, and how to debug when extraction goes sideways.


For the higher-level concept (when to use extraction at all, what it can extract), see Extract data from PDFs and documents. This article assumes you've decided to use it and now need to configure it inside a workflow.



What it does


Given a file (PDF, image, Office document) or an email, the Data Extractor:


  1. Reads the source.
  2. Runs OCR if needed (for scanned PDFs or images).
  3. Sends the content, your column definitions, and your instructions to an AI model.
  4. Returns structured JSON: one row per extracted item.


Output shape depends on the Response Format parameter (see below).



Source Type


Choose where the document comes from.


| Source Type | When to use | Reference expression |
| --- | --- | --- |
| File | The trigger or upstream node produced a file (Drive, OneDrive, manual upload, API webhook) | {{$item.data.file}} |
| Email | The trigger was an email (Outlook Trigger, Lido Mailbox); extract from body and/or attachments | {{$item.data.email}} |
| Worksheet | The source is a tab in an Excel/Google Sheets file produced upstream | {{$item.data.worksheet}} |


For an email with multiple attachments where you want to process each one individually, use a Loop node before the Data Extractor and set Source Type: File with {{$item.data.attachments[$loop.index].url}}.



Columns


A list of field names — one per piece of data you want extracted. Examples:


  • Vendor Name, Invoice Number, Total Amount, Due Date
  • Patient Name, DOB, Diagnosis Code, Treatment Date
  • Counterparty, Effective Date, Termination Date, Notice Period


Naming guidance:


  • Use names that match what's on the document (the AI matches semantically, but consistent names help).
  • Be specific. Date is ambiguous; Invoice Date, Due Date, Service Date are clearer.
  • Avoid all-caps and special characters.
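
The naming guidance above can be checked mechanically before you paste a column list into the node. The following is a minimal sketch, not a Lido feature: `lint_column_names` and the `AMBIGUOUS` set are hypothetical names, and exempting short acronyms such as DOB from the all-caps rule is an assumption made here for illustration.

```python
import re

# Hypothetical set of names too generic to stand alone; adjust for your documents.
AMBIGUOUS = {"date", "number", "amount", "name", "id"}


def lint_column_names(columns):
    """Return (column, warning) pairs for names that break the naming guidance."""
    warnings = []
    for col in columns:
        name = col.strip()
        if name.lower() in AMBIGUOUS:
            warnings.append((col, "ambiguous: add a qualifier, e.g. 'Invoice Date'"))
        # Assumption: acronyms of three letters or fewer (DOB) are acceptable.
        if len(name) > 3 and name.isupper():
            warnings.append((col, "all-caps"))
        # Anything other than letters, digits, and spaces counts as special.
        if re.search(r"[^A-Za-z0-9 ]", name):
            warnings.append((col, "special characters"))
    return warnings


if __name__ == "__main__":
    for col, warning in lint_column_names(["Vendor Name", "Date", "TOTAL AMOUNT", "Due-Date#"]):
        print(f"{col}: {warning}")
```

Running a check like this on a draft column list is cheaper than discovering an ambiguous name after a failed extraction.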



Instructions


Free-form guidance sent to the AI alongside the columns. Use sparingly but deliberately. Most accuracy improvements come from instructions, not from renaming columns.


Good instructions:


  • "Total Amount is the grand total including tax, not the subtotal."
  • "Use ISO format YYYY-MM-DD for all dates."
  • "Phone numbers in E.164 format (+1 555 555 5555 → +15555555555)."
  • "For Status, return one of: Active, Pending, Closed."
  • "If a field is missing from the document, return an empty string, not 'N/A'."


See Improve extraction accuracy for a deeper treatment.
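
Instructions like the ones listed above are easiest to trust when you also verify the output against them. The sketch below shows one way to do that in Python on an extracted row. It is an illustration only: `check_row` is a hypothetical helper, the column names `Phone` and `Status` are assumed, and treating every column whose name ends in "Date" as a date field is a convention chosen here, not something the Data Extractor does.

```python
import re
from datetime import date

ALLOWED_STATUS = {"Active", "Pending", "Closed"}
E164 = re.compile(r"^\+[1-9]\d{1,14}$")  # "+" followed by up to 15 digits, no spaces


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def check_row(row):
    """Return a list of human-readable problems found in one extracted row."""
    problems = []
    for field, value in row.items():
        if value == "":
            continue  # empty string is the agreed marker for a missing field
        if value in ("N/A", "n/a"):
            problems.append(f"{field}: use an empty string instead of {value!r}")
        elif field.endswith("Date") and not (isinstance(value, str) and is_iso_date(value)):
            problems.append(f"{field}: {value!r} is not YYYY-MM-DD")
        elif field == "Phone" and not E164.match(str(value)):
            problems.append(f"{field}: {value!r} is not E.164")
        elif field == "Status" and value not in ALLOWED_STATUS:
            problems.append(f"{field}: {value!r} is not an allowed status")
    return problems


if __name__ == "__main__":
    sample = {"Due Date": "04/30/2026", "Phone": "+1 555 555 5555", "Status": "Open"}
    for problem in check_row(sample):
        print(problem)
```

If a check like this fails repeatedly for the same field, that is usually a sign the matching instruction needs to be more explicit.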



Multi-row vs. summary


A single boolean. The most consequential setting in the node.


| Setting | Returns | Use for |
| --- | --- | --- |
| Multi-row OFF (default) | One row of summary fields | Invoices, contracts, receipts: extract Vendor/Total/Date once |
| Multi-row ON | One row per item in a table | Line items, transactions, employee rosters: return many rows |


Often you want both: extract header fields (single-row) AND line items (multi-row) from the same document. Run two Data Extractor nodes side by side.



Page Range


Limit which pages to process. This saves money (only the pages actually processed count toward your page allowance) and improves speed.


| Value | Meaning |
| --- | --- |
| (empty) | Process all pages |
| 1 | Page 1 only |
| 1-3 | Pages 1, 2, 3 |
| 2,5,7 | Pages 2, 5, 7 |
| 1,3-5,8 | Pages 1, 3, 4, 5, 8 |


If you only need data from page 1 of a 50-page PDF, set Page Range to 1 and you pay for 1 page, not 50.
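
To make the Page Range syntax and its cost effect concrete, here is a small Python sketch that expands a range string into page numbers and counts how many pages would be processed. It mirrors the table above but is not Lido's parser: `parse_page_range` and `pages_processed` are hypothetical names, and ignoring pages beyond the end of the document is an assumption.

```python
def parse_page_range(spec):
    """Expand a string such as '1,3-5,8' into a sorted list of page numbers.

    Returns None for an empty string, meaning "process all pages".
    """
    spec = spec.strip()
    if not spec:
        return None
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


def pages_processed(spec, total_pages):
    """Count pages that would be processed for a document of total_pages pages."""
    pages = parse_page_range(spec)
    if pages is None:
        return total_pages
    # Assumption: requested pages past the end of the document are skipped.
    return sum(1 for p in pages if p <= total_pages)


if __name__ == "__main__":
    print(parse_page_range("1,3-5,8"))   # [1, 3, 4, 5, 8]
    print(pages_processed("1", 50))      # 1
    print(pages_processed("", 50))       # 50
```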



Response Format


How the extracted data is returned to downstream nodes.


| Format | Output shape | When to use |
| --- | --- | --- |
| Array | A single item with a rows field containing all extracted rows | You want downstream nodes to receive everything as one item and process in bulk |
| Objects (default) | One item per extracted row | You want to fan out, e.g., loop over each row and write to a database |


For multi-row extraction with downstream processing per row, Objects is almost always what you want. The downstream node sees N items, runs once per item.


For multi-row extraction where you want to write all rows to a single sheet in one shot, Array is cleaner.
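
The difference between the two formats is easiest to see as data. The Python sketch below models the two output shapes described above using plain dictionaries. Only the `rows` field name comes from this article; the function names and the sample rows are illustrative, and the real items may carry additional fields.

```python
def to_array_format(rows):
    """Array: a single item whose 'rows' field holds every extracted row."""
    return [{"rows": [dict(row) for row in rows]}]


def to_objects_format(rows):
    """Objects: one item per extracted row, so downstream nodes run once per row."""
    return [dict(row) for row in rows]


if __name__ == "__main__":
    extracted = [
        {"Description": "Widget", "Qty": 2, "Price": 9.5},
        {"Description": "Gadget", "Qty": 1, "Price": 20.0},
        {"Description": "Shipping", "Qty": 1, "Price": 5.0},
    ]
    print(len(to_array_format(extracted)))    # 1 item downstream
    print(len(to_objects_format(extracted)))  # 3 items downstream
```

A downstream node that writes in bulk reads `item["rows"]` once; a downstream node that works per row receives each dictionary as its own item.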



Split Rows as Items


When Response Format is Objects, this further controls fan-out:


  • Split ON (recommended for multi-row): each row becomes its own workflow item.
  • Split OFF: you get one item with all rows nested.


For nearly all multi-row pipelines, leave it ON.



Worksheet (Excel and Google Sheets)


When the source is an Excel or Google Sheets file, you can specify which worksheet (tab) to extract from.


  • Default: the first/active worksheet.
  • Named: specify by name (e.g., Sheet1, 2026 Invoices).
  • Range within a worksheet: combine with Page Range, which is then treated as a row range.


For multi-tab workbooks where each tab is its own document, use a Loop node over the tab list and pass the worksheet name to the extractor.



Worked example: invoice processing pipeline


Goal: when a new invoice lands in a Drive folder, extract its data into a sheet and send a Slack notification for each invoice.


Google Drive Trigger (folder: "Inbox")

Data Extractor
- Source Type: File
- Source: {{$item.data.file}}
- Columns: Vendor Name, Invoice Number, Total Amount, Due Date
- Instructions: "Use ISO date format. Total includes tax."
- Multi-row: OFF
- Page Range: 1
- Response Format: Objects

Spreadsheet (Add Row to Table)
- Table: Invoices
- Map fields: Vendor Name → A, Invoice Number → B, Total Amount → C, Due Date → D

Send Slack
- Channel: #invoices
- Message: "New invoice: {{$item.previous.data.Vendor Name}} for ${{$item.previous.data.Total Amount}}"



Worked example: line items + header


Goal: for each invoice, extract header (Vendor, Total, Date) AND line items (Description, Qty, Price).


Trigger
↓
Data Extractor #1 (header)
- Multi-row: OFF
- Columns: Vendor Name, Invoice Number, Total Amount, Due Date
↓ (splits into two branches)

Branch 1:
Spreadsheet (write header to Invoices table)

Branch 2:
Data Extractor #2 (line items)
- Multi-row: ON
- Columns: Description, Qty, Price
↓
Loop over rows
↓
Spreadsheet (write each line item to LineItems table, with reference to invoice ID)


This pattern parallels the Pre-Header + Line Items configuration in the spreadsheet UI.
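
The "reference to invoice ID" step in the last node amounts to stamping each line item with a key from the header row. Here is a minimal Python sketch of that join, assuming Invoice Number is the key; `attach_invoice_id` is a hypothetical helper and the sample values are invented for illustration.

```python
def attach_invoice_id(header, line_items, key="Invoice Number"):
    """Return line items with the header's invoice key copied onto each row."""
    invoice_id = header[key]
    return [{key: invoice_id, **item} for item in line_items]


if __name__ == "__main__":
    header = {"Vendor Name": "Acme", "Invoice Number": "INV-7", "Total Amount": 39.0}
    items = [
        {"Description": "Widget", "Qty": 2, "Price": 9.5},
        {"Description": "Gadget", "Qty": 1, "Price": 20.0},
    ]
    for row in attach_invoice_id(header, items):
        print(row)
```

With the key on every line item, rows in the LineItems table can be traced back to the matching row in the Invoices table.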



Debugging extraction in workflows


When the Data Extractor returns wrong data:


  1. Open the workflow run. Find the Data Extractor node's output.
  2. Check the input. Was the right file passed in? (Wrong file is the most common cause of wrong output.)
  3. Click "View extracted data". See the raw response.
  4. Compare to what you expected. Is the field missing? Wrong value? Wrong type?
  5. Reproduce in the spreadsheet UI. Open Lido sheet, drop the same file in a Data Extractor with the same config, see if the issue reproduces. Almost always it does — and the UI is faster to iterate in.
  6. Improve. Adjust columns, instructions, or page range. Test again.
  7. Carry the change back into the workflow. Re-run.
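
Step 4 above (missing field, wrong value, or wrong type) can be done by hand, but for a document you test repeatedly it helps to script the comparison. The following Python sketch classifies each expected field. `diff_row` is a hypothetical helper, the sample values are invented, and treating an empty string as "missing" follows the empty-string convention suggested in the Instructions section.

```python
def diff_row(expected, actual):
    """Compare an extracted row to the expected one and classify each mismatch."""
    findings = []
    for field, want in expected.items():
        if field not in actual or actual[field] == "":
            findings.append((field, "missing"))
        elif type(actual[field]) is not type(want):
            findings.append((field, "wrong type"))
        elif actual[field] != want:
            findings.append((field, "wrong value"))
    return findings


if __name__ == "__main__":
    expected = {"Vendor Name": "Acme", "Total Amount": 120.5, "Due Date": "2026-05-01"}
    actual = {"Vendor Name": "Acme", "Total Amount": "120.50", "Due Date": "2026-05-10"}
    for field, issue in diff_row(expected, actual):
        print(f"{field}: {issue}")
```

A "wrong type" finding usually points at an instruction to add (for example, a number or date format), while "missing" more often points at the page range or the column name.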



Tips


  • Always test a real document first. Don't write 10 columns and pray.
  • Set Page Range to limit cost. Most extraction work touches just 1–3 pages.
  • Use Objects + Split ON for fan-out, Array for single-batch downstream.
  • Persist the source file URL alongside extracted data. Audit trail when something looks wrong.
  • For sensitive documents, remember the 23-hour deletion: extracted data stays in your sheet, but the original is gone after a day.
  • Phrase instructions in terms of standard formats. "ISO date" beats "Excel date format" because the AI understands ISO better.



Common mistakes


  • Multi-row OFF when extracting line items (returns one row per page instead of one row per line).
  • Multi-row ON when extracting summary fields (returns one row per perceived "item" — usually one — but with weird shape).
  • No Page Range on a 100-page PDF that has data on page 1. Burns 100 pages of allowance.
  • Vague column names like "Date" or "Number" with no instructions to disambiguate.
  • Source Type: File when the trigger was an email. Use Source Type: Email and reference the email; or Loop over attachments and pass each as File.
  • Forgetting to test before activating. Misconfigured workflows can burn a week's allowance overnight.




Related articles


  • Extract data from PDFs and documents
  • Improve extraction accuracy
  • Automate extraction with workflows
  • Build your first workflow
  • Triggers: how workflows start

Updated on: 16/04/2026
