Articles on: AI & extraction

How to extract data from PDFs and documents

Lido turns unstructured documents — PDFs, images, scans, Word files, emails — into structured rows and columns you can use in a spreadsheet, send to another system, or pull via API. This article covers every way to do it and helps you pick the right one for your job.



Which option should I use?


If you want to…

Use this

Extract data from one or a handful of files, by hand

Data Extractor in the spreadsheet (UI)

Pull a single field from a single file inside a formula

GPTVISION, CLAUDE, or GEMINI formula

Pull a table of rows from a file inside a formula

EXTRACTTOTABLE or EXTRACTFROMFILETOROW formula

Automatically extract every file that lands in a Drive/OneDrive folder, mailbox, or webhook

Data Extractor node in a Workflow

Trigger extraction from your own app or backend

POST /extract-file-data API

Get raw text from a PDF (no AI, no fields)

EXTRACTTEXT formula

Get table-shaped regions out of a PDF (no AI)

EXTRACTTABLESFROMPDF formula

Make a scanned PDF searchable before extracting

OCRMYPDF formula or OCR PDF workflow node


If you're not sure, start with the Data Extractor in the spreadsheet. It's the same engine that powers the formulas, the workflow node, and the API — once it works there, the other paths just reuse the configuration.



When to use Lido for extraction


  • You receive recurring documents (invoices, receipts, statements, lab results, court filings, applications) and need them as rows in a sheet, database, or downstream system.
  • The fields you need are described in the document but not in a consistent location or format across files.
  • You need to process scanned PDFs, photos of documents, or low-quality faxes.
  • You want to combine extracted data with other Lido capabilities — formulas, AI follow-up steps, sending email, writing to a CRM.


When NOT to use Lido for extraction


  • The data you need is already in a structured file (CSV, JSON, XLSX). Use Get Table or just import it.
  • You need OCR only, with no field extraction. Use the OCR PDF node or OCRMYPDF formula.
  • You're extracting plain text and want full document content as a string. Use EXTRACTTEXT.
  • You need real-time, sub-second extraction. Lido extraction takes 10–30 seconds per document depending on size and page range.



Before you start


You'll need:


  • A Lido account on any plan. Extraction page allowance varies by plan — see the pricing page for current limits.
  • The document(s) you want to extract from, as a file or as a URL Lido can reach.
  • A clear list of the fields ("columns") you want pulled out. Specific names produce better results — "Invoice Total" beats "Amount".


For the workflow and API paths you'll also need:




Step-by-step: extract from one PDF in the spreadsheet


This is the fastest path and the recommended starting point for everyone.


  1. Open Lido at sheets.lido.app.
  2. Open the Data Extractor from the toolbar, or open an existing extractor worksheet in your workspace.
  3. Upload your test document in the extractor panel.
  4. Add the columns you want to pull out — one row per field. Use specific names (e.g. "Vendor Name", "Invoice Number", "Total Amount", "Due Date").
  5. Add instructions if the document is unusual. Examples: "Amounts include currency symbols", "Dates are in DD/MM/YYYY", "Skip subtotals — only line items".
  6. Set Page Range if you only want certain pages (e.g. 1-3 or 2,5,7).
  7. Toggle Multi-row on if the document contains a table and you want one row per item (line items, transactions). Leave it off if you only want one row of summary data.
  8. Click Extract.
  9. Review the extracted rows. Adjust columns or instructions and re-run if anything is wrong.


Once an extractor worksheet is configured, you can re-run it on any new file without rebuilding the configuration.



Step-by-step: automate extraction with a workflow


Use this when documents arrive on their own (uploaded to a folder, sent to a mailbox, posted to a webhook) and you want them processed without anyone clicking a button.


  1. Build the extractor worksheet first using the steps above. Workflows reuse this configuration.
  2. Open Workflows → New Workflow.
  3. Add a trigger:
  • Google Drive Trigger or OneDrive Trigger — for files dropped in a folder.
  • Lido Mailbox Trigger or Outlook Trigger — for documents arriving by email (works on attachments too).
  • Webhook Trigger — for documents pushed by another system.
  1. Add a Data Extractor node and connect it to the trigger.
  2. In the Data Extractor node:
  • Worksheet Name: pick the extractor worksheet you built in step 1.
  • Source Type: File (or Email if the trigger is a mailbox).
  • File: {{$item.data.file}} (or Email: {{$item.data.email}}).
  • Response Format: Objects — easier to work with downstream.
  • Split Rows as Items: turn this on if you want each extracted row to flow through the rest of the workflow as a separate item (e.g. one workflow item per invoice line).
  1. Add downstream nodes — typically Insert Rows to write to a sheet, Send Gmail / Send Outlook to notify someone, or API Request to push to your own system.
  2. Connect the error output of the Data Extractor to handle failures (e.g. send a Slack message when extraction fails).
  3. Click Test with a sample file, then Activate when it looks right.



Step-by-step: extract via the API


Use this when you want extraction to run from your own backend, app, or another tool that can make HTTP requests.


  1. Build and test the extractor in the Lido app first (see the spreadsheet steps above). This is the most important step — it's how you fine-tune column names, instructions, and page ranges before any code runs.
  2. In the Data Extractor, click the API button in the bottom-left corner. Lido generates the exact JSON configuration that matches your tested setup. Copy it.
  3. Get an API key at sheets.lido.app/settings/api-keys.
  4. POST the file and configuration to https://sheets.lido.app/api/v1/extract-file-data. Two upload modes are supported:
  • JSON + base64 — up to 50 MB per file.
  • Multipart form data — up to 500 MB per file.
  1. The endpoint returns a jobId. Poll the Job Result endpoint to retrieve the extracted data once processing finishes (typically 10–30 seconds).
  2. Job results are kept for 24 hours. Persist the data in your own system before then.


Rate limit: 5 requests per 30 seconds. Full reference: API Documentation → Extract File Data.



Step-by-step: extract from inside a formula


If you only need data from a file inside a single cell or row of a spreadsheet:


  • Single field, one file: use =GPTVISION(file_url, credential, prompt, output_ref) for OpenAI, or =CLAUDE(...) / =GEMINI(...) / =BEDROCK(...) for other providers. Pass the file URL and a prompt describing what to pull.
  • Whole table of rows from a file: use =EXTRACTTOTABLE(credential, input_content, output_table_ref). The output is a structured table.
  • One row per file: use =EXTRACTFROMFILETOROW(credential, file_source, output_ref).


Each of these formulas needs an AI provider credential (OpenAI, Anthropic, Google, or AWS). Add credentials in your workspace settings before using the formulas.



Worked example: process incoming invoices automatically


Goal: every invoice PDF dropped in a Google Drive folder lands as a row in your invoice tracking sheet, and the team gets a Slack message.


  1. Build the extractor in the spreadsheet using one sample invoice. Columns: Vendor Name, Invoice Number, Invoice Date, Due Date, Total Amount, Currency. Instructions: "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates."
  2. Test with three or four different vendor invoices. Adjust column names and instructions until extraction is consistent.
  3. Build the workflow:


   [Google Drive Trigger][Data Extractor][Insert Rows][Send Slack]


  1. Configure the trigger to watch your "Invoices Inbox" folder.
  2. In the Data Extractor, point at your invoice extractor worksheet, set Source Type to File, set File to {{$item.data.file}}, leave Multi-row off (one row per invoice).
  3. In Insert Rows, target your Invoice Tracker sheet and map each extracted field to a column.
  4. In Send Slack, post a message like New invoice from {{$item.data.Vendor Name}} for {{$item.data.Total Amount}}, due {{$item.data.Due Date}} to your #ap-inbox channel.
  5. Connect the Data Extractor's error output to a second Send Slack node so failures get reported instead of swallowed.
  6. Activate the workflow.


From now on, every invoice that lands in the folder shows up in your sheet and your Slack channel within about a minute, with no human in the loop.



Tips for accurate extraction


  • Be specific with column names. "Invoice Total" beats "Amount". "Due Date (YYYY-MM-DD)" beats "Date".
  • Use instructions for unusual layouts. Tell the AI what to skip, how to format numbers, how to handle missing fields.
  • Limit page range when you know your data is on specific pages. Faster, cheaper, more accurate.
  • OCR scanned PDFs first with the OCR PDF node or OCRMYPDF. Direct AI extraction works on scans too, but quality is much higher with a real text layer.
  • Test with multiple real samples — at least 3–5 — before automating. Documents vary more than you think.
  • Lock in your settings in the spreadsheet UI before moving to API or workflow. The extractor worksheet IS the configuration; everything else points back to it.



Common mistakes


  • Vague column names. "Number" or "Date" makes the AI guess. Use the exact field name as it would appear in your destination system.
  • Multi-row turned on for summary documents. This forces the AI to invent rows. Only turn it on for documents that actually contain a table.
  • Forgetting to map the file in the workflow's Data Extractor node. The File field needs an expression like {{$item.data.file}} — leaving it as a static placeholder will fail at runtime.
  • Not connecting the error output in workflows. When extraction fails (corrupted file, unreadable scan, AI refusal), the item is silently lost unless you handle the error output.
  • Polling the API too aggressively. The Job Result endpoint is rate-limited. Wait 10–30 seconds before the first check, then back off if it's still running.
  • Storing API job results indefinitely. Results expire after 24 hours. Persist them in your own system before then.




  • Data Extractor node reference
  • AI formulas reference (GPTVISION, CLAUDE, GEMINI, BEDROCK, EXTRACTTOTABLE, EXTRACTFROMFILETOROW)
  • PDF & Document formulas (EXTRACTTEXT, EXTRACTTABLESFROMPDF, OCRMYPDF, IMPORTPDF)
  • OCR PDF node reference
  • Connecting Google Drive
  • Connecting Outlook / Lido Mailbox
  • Lido API: getting started and managing API keys
  • Pricing and page allowances


Updated on: 16/04/2026

Was this article helpful?

Share your feedback

Cancel

Thank you!