How to extract data from PDFs and documents
Lido turns unstructured documents — PDFs, images, scans, Word files, emails — into structured rows and columns you can use in a spreadsheet, send to another system, or pull via API. This article covers every way to do it and helps you pick the right one for your job.
Which option should I use?
If you want to… | Use this |
|---|---|
Extract data from one or a handful of files, by hand | Data Extractor in the spreadsheet (UI) |
Pull a single field from a single file inside a formula | |
Pull a table of rows from a file inside a formula | |
Automatically extract every file that lands in a Drive/OneDrive folder, mailbox, or webhook | Data Extractor node in a Workflow |
Trigger extraction from your own app or backend | |
Get raw text from a PDF (no AI, no fields) | |
Get table-shaped regions out of a PDF (no AI) | |
Make a scanned PDF searchable before extracting | |
If you're not sure, start with the Data Extractor in the spreadsheet. It's the same engine that powers the formulas, the workflow node, and the API — once it works there, the other paths just reuse the configuration.
When to use Lido for extraction
- You receive recurring documents (invoices, receipts, statements, lab results, court filings, applications) and need them as rows in a sheet, database, or downstream system.
- The fields you need are described in the document but not in a consistent location or format across files.
- You need to process scanned PDFs, photos of documents, or low-quality faxes.
- You want to combine extracted data with other Lido capabilities — formulas, AI follow-up steps, sending email, writing to a CRM.
When NOT to use Lido for extraction
- The data you need is already in a structured file (CSV, JSON, XLSX). Use Get Table or just import it.
- You need OCR only, with no field extraction. Use the OCR PDF node or
OCRMYPDFformula. - You're extracting plain text and want full document content as a string. Use
EXTRACTTEXT. - You need real-time, sub-second extraction. Lido extraction takes 10–30 seconds per document depending on size and page range.
Before you start
You'll need:
- A Lido account on any plan. Extraction page allowance varies by plan — see the pricing page for current limits.
- The document(s) you want to extract from, as a file or as a URL Lido can reach.
- A clear list of the fields ("columns") you want pulled out. Specific names produce better results — "Invoice Total" beats "Amount".
For the workflow and API paths you'll also need:
- A connected file source (Google Drive, OneDrive, Lido Mailbox, Outlook) for triggers, or
- An API key from sheets.lido.app/settings/api-keys for the API.
Step-by-step: extract from one PDF in the spreadsheet
This is the fastest path and the recommended starting point for everyone.
- Open Lido at sheets.lido.app.
- Open the Data Extractor from the toolbar, or open an existing extractor worksheet in your workspace.
- Upload your test document in the extractor panel.
- Add the columns you want to pull out — one row per field. Use specific names (e.g. "Vendor Name", "Invoice Number", "Total Amount", "Due Date").
- Add instructions if the document is unusual. Examples: "Amounts include currency symbols", "Dates are in DD/MM/YYYY", "Skip subtotals — only line items".
- Set Page Range if you only want certain pages (e.g.
1-3or2,5,7). - Toggle Multi-row on if the document contains a table and you want one row per item (line items, transactions). Leave it off if you only want one row of summary data.
- Click Extract.
- Review the extracted rows. Adjust columns or instructions and re-run if anything is wrong.
Once an extractor worksheet is configured, you can re-run it on any new file without rebuilding the configuration.
Step-by-step: automate extraction with a workflow
Use this when documents arrive on their own (uploaded to a folder, sent to a mailbox, posted to a webhook) and you want them processed without anyone clicking a button.
- Build the extractor worksheet first using the steps above. Workflows reuse this configuration.
- Open Workflows → New Workflow.
- Add a trigger:
- Google Drive Trigger or OneDrive Trigger — for files dropped in a folder.
- Lido Mailbox Trigger or Outlook Trigger — for documents arriving by email (works on attachments too).
- Webhook Trigger — for documents pushed by another system.
- Add a Data Extractor node and connect it to the trigger.
- In the Data Extractor node:
- Worksheet Name: pick the extractor worksheet you built in step 1.
- Source Type:
File(orEmailif the trigger is a mailbox). - File:
{{$item.data.file}}(or Email:{{$item.data.email}}). - Response Format:
Objects— easier to work with downstream. - Split Rows as Items: turn this on if you want each extracted row to flow through the rest of the workflow as a separate item (e.g. one workflow item per invoice line).
- Add downstream nodes — typically Insert Rows to write to a sheet, Send Gmail / Send Outlook to notify someone, or API Request to push to your own system.
- Connect the error output of the Data Extractor to handle failures (e.g. send a Slack message when extraction fails).
- Click Test with a sample file, then Activate when it looks right.
Step-by-step: extract via the API
Use this when you want extraction to run from your own backend, app, or another tool that can make HTTP requests.
- Build and test the extractor in the Lido app first (see the spreadsheet steps above). This is the most important step — it's how you fine-tune column names, instructions, and page ranges before any code runs.
- In the Data Extractor, click the API button in the bottom-left corner. Lido generates the exact JSON configuration that matches your tested setup. Copy it.
- Get an API key at sheets.lido.app/settings/api-keys.
POSTthe file and configuration tohttps://sheets.lido.app/api/v1/extract-file-data. Two upload modes are supported:
- JSON + base64 — up to 50 MB per file.
- Multipart form data — up to 500 MB per file.
- The endpoint returns a
jobId. Poll the Job Result endpoint to retrieve the extracted data once processing finishes (typically 10–30 seconds). - Job results are kept for 24 hours. Persist the data in your own system before then.
Rate limit: 5 requests per 30 seconds. Full reference: API Documentation → Extract File Data.
Step-by-step: extract from inside a formula
If you only need data from a file inside a single cell or row of a spreadsheet:
- Single field, one file: use
=GPTVISION(file_url, credential, prompt, output_ref)for OpenAI, or=CLAUDE(...)/=GEMINI(...)/=BEDROCK(...)for other providers. Pass the file URL and a prompt describing what to pull. - Whole table of rows from a file: use
=EXTRACTTOTABLE(credential, input_content, output_table_ref). The output is a structured table. - One row per file: use
=EXTRACTFROMFILETOROW(credential, file_source, output_ref).
Each of these formulas needs an AI provider credential (OpenAI, Anthropic, Google, or AWS). Add credentials in your workspace settings before using the formulas.
Worked example: process incoming invoices automatically
Goal: every invoice PDF dropped in a Google Drive folder lands as a row in your invoice tracking sheet, and the team gets a Slack message.
- Build the extractor in the spreadsheet using one sample invoice. Columns:
Vendor Name,Invoice Number,Invoice Date,Due Date,Total Amount,Currency. Instructions: "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates." - Test with three or four different vendor invoices. Adjust column names and instructions until extraction is consistent.
- Build the workflow:
[Google Drive Trigger] → [Data Extractor] → [Insert Rows] → [Send Slack]
- Configure the trigger to watch your "Invoices Inbox" folder.
- In the Data Extractor, point at your invoice extractor worksheet, set Source Type to
File, set File to{{$item.data.file}}, leave Multi-row off (one row per invoice). - In Insert Rows, target your Invoice Tracker sheet and map each extracted field to a column.
- In Send Slack, post a message like
New invoice from {{$item.data.Vendor Name}} for {{$item.data.Total Amount}}, due {{$item.data.Due Date}}to your#ap-inboxchannel. - Connect the Data Extractor's error output to a second Send Slack node so failures get reported instead of swallowed.
- Activate the workflow.
From now on, every invoice that lands in the folder shows up in your sheet and your Slack channel within about a minute, with no human in the loop.
Tips for accurate extraction
- Be specific with column names. "Invoice Total" beats "Amount". "Due Date (YYYY-MM-DD)" beats "Date".
- Use instructions for unusual layouts. Tell the AI what to skip, how to format numbers, how to handle missing fields.
- Limit page range when you know your data is on specific pages. Faster, cheaper, more accurate.
- OCR scanned PDFs first with the OCR PDF node or
OCRMYPDF. Direct AI extraction works on scans too, but quality is much higher with a real text layer. - Test with multiple real samples — at least 3–5 — before automating. Documents vary more than you think.
- Lock in your settings in the spreadsheet UI before moving to API or workflow. The extractor worksheet IS the configuration; everything else points back to it.
Common mistakes
- Vague column names. "Number" or "Date" makes the AI guess. Use the exact field name as it would appear in your destination system.
- Multi-row turned on for summary documents. This forces the AI to invent rows. Only turn it on for documents that actually contain a table.
- Forgetting to map the file in the workflow's Data Extractor node. The File field needs an expression like
{{$item.data.file}}— leaving it as a static placeholder will fail at runtime. - Not connecting the error output in workflows. When extraction fails (corrupted file, unreadable scan, AI refusal), the item is silently lost unless you handle the error output.
- Polling the API too aggressively. The Job Result endpoint is rate-limited. Wait 10–30 seconds before the first check, then back off if it's still running.
- Storing API job results indefinitely. Results expire after 24 hours. Persist them in your own system before then.
Related articles
- Data Extractor node reference
- AI formulas reference (GPTVISION, CLAUDE, GEMINI, BEDROCK, EXTRACTTOTABLE, EXTRACTFROMFILETOROW)
- PDF & Document formulas (EXTRACTTEXT, EXTRACTTABLESFROMPDF, OCRMYPDF, IMPORTPDF)
- OCR PDF node reference
- Connecting Google Drive
- Connecting Outlook / Lido Mailbox
- Lido API: getting started and managing API keys
- Pricing and page allowances
Updated on: 16/04/2026
Thank you!