Extract data via the API: deep dive

The extract-file-data endpoint is Lido's primary API surface — submit a file, get structured data back. This article goes deeper than the quickstart: the full payload shape, configuration patterns, multi-row vs. single-row, attachments, performance, and code samples in three languages.


For first-time setup (auth, basic submit/poll, getting an API key), see Lido API: quickstart and authentication.



The two-step pattern


Every extraction follows the same shape:


  1. Submit a job: POST /api/v1/extract-file-data → returns jobId.
  2. Poll for results: GET /api/v1/job-result/{jobId} → returns extracted data when complete.


Lido extraction is asynchronous. A typical document takes 10–30 seconds. Don't block; submit, do other work, then poll.



Anatomy of the submit payload


{
"file": {
"type": "base64",
"data": "<base64-encoded file content>",
"name": "invoice.pdf"
},
"columns": ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
"instructions": "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates. Vendor Name should match the invoice header, not the bill-to address.",
"multiRow": false,
"pageRange": "1-3"
}


| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| file | object | Yes (or use multipart upload) | See "File upload methods" |
| columns | array of strings | Yes | The fields to extract. Order matters: output keys preserve order |
| instructions | string | No | Free-form guidance to the AI. Use it to clarify ambiguous fields |
| multiRow | boolean | No (default false) | True for tabular extraction (line items); false for summary fields |
| pageRange | string | No | Pages to process. "1" = page 1 only; "1-3" = pages 1 to 3; "2,5,7" = those specific pages. You are billed only for the pages listed. |
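
As a minimal sketch of assembling this payload in Python, the helper below fills in the fields described above and leaves out optional ones that are unset. The function name build_extract_payload and its parameter names are hypothetical (they are not part of any Lido SDK); only the JSON field names come from the example payload.

```python
import base64

def build_extract_payload(file_bytes, file_name, columns,
                          instructions=None, multi_row=False, page_range=None):
    """Build the JSON body for POST /api/v1/extract-file-data (hypothetical helper)."""
    if not columns:
        raise ValueError("columns is required and must not be empty")
    payload = {
        "file": {
            "type": "base64",
            "data": base64.b64encode(file_bytes).decode("ascii"),
            "name": file_name,
        },
        "columns": list(columns),  # order is preserved in the output keys
        "multiRow": bool(multi_row),
    }
    # Optional fields are only sent when the caller sets them
    if instructions:
        payload["instructions"] = instructions
    if page_range:
        payload["pageRange"] = page_range
    return payload
```

The returned dict can be passed as the JSON body of the submit request.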


The same fields are configurable in the UI's Data Extractor. The fastest path to a correct payload is to build it in the spreadsheet UI, click the API button (bottom-left of the extractor), and copy the generated configuration.


💸 Billing note on pageRange vs. @exclude_pages. Set pageRange when you know which pages to process — you're only billed for the pages it lists. The Extra Instructions directive @exclude_pages is content-based ("skip any page that contains 'Terms and Conditions'") and it still bills for every page in the document, because Lido has to read each page to decide which to exclude. Use pageRange whenever you can identify your target pages by number; reach for @exclude_pages only when you can't.
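
To estimate how many pages a given pageRange will bill for before submitting, a small client-side parser is enough. This is a sketch, not Lido's own parser: it handles the three documented forms ("1", "1-3", "2,5,7") and also accepts mixed forms such as "1-3,5", which is an assumption the article does not confirm the API supports.

```python
def parse_page_range(page_range):
    """Expand a pageRange string into a sorted list of page numbers."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)  # raises ValueError on empty or non-numeric input
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)

def billed_pages(page_range):
    """Number of distinct pages the range covers."""
    return len(parse_page_range(page_range))
```

For example, billed_pages("1-3") is 3 and billed_pages("2,5,7") is also 3, regardless of how long the document is.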



File upload methods


Two ways to send the file. Both end up at the same endpoint; both produce the same jobId and result format.


Method 1: JSON + base64 (max 50 MB)


Encode the file as base64, embed in JSON.


{
"file": {
"type": "base64",
"data": "JVBERi0xLjQKJ...",
"name": "invoice.pdf"
},
"columns": [...]
}


Best for: web frontends, smaller files, simple integrations.


Tradeoff: base64 encoding inflates file size by ~33% in the request body.
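
The ~33% figure follows from base64 emitting 4 output characters for every 3 input bytes. A quick sketch of the arithmetic (the function name is illustrative):

```python
import base64

def base64_size(raw_size_bytes):
    """Length of standard padded base64 output for an input of raw_size_bytes."""
    return 4 * ((raw_size_bytes + 2) // 3)

# Cross-check the formula against the real encoder on a small input
assert base64_size(10) == len(base64.b64encode(b"x" * 10))
```

A 30,000,000-byte file becomes 40,000,000 characters of base64 in the request body. The article does not say whether the 50 MB limit applies to the raw file or to the encoded body, so leave headroom if your files are close to it.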


Method 2: Multipart form data (max 500 MB)


Upload the file as a normal multipart form field; configuration goes alongside as a JSON string.


curl -X POST 'https://sheets.lido.app/api/v1/extract-file-data' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-F 'file=@invoice.pdf' \
-F 'config={"columns":["Vendor Name","Invoice Number","Total Amount","Due Date"],"instructions":"...","multiRow":false,"pageRange":"1"}'


Best for: large files, server-side integrations, anywhere base64 encoding would be wasteful.
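
If your integration handles files of varying size, the choice between the two methods can be made programmatically. The sketch below is one way to do it; the function name is hypothetical, and it assumes "MB" means 2**20 bytes and that the limits apply to the raw file size, neither of which the article states.

```python
BASE64_LIMIT_BYTES = 50 * 1024 * 1024      # Method 1 limit (assumed MiB)
MULTIPART_LIMIT_BYTES = 500 * 1024 * 1024  # Method 2 limit (assumed MiB)

def choose_upload_method(size_bytes):
    """Return "base64" or "multipart" for a file of the given size."""
    if size_bytes > MULTIPART_LIMIT_BYTES:
        raise ValueError("file exceeds the 500 MB multipart limit")
    if size_bytes > BASE64_LIMIT_BYTES:
        return "multipart"
    return "base64"
```

Server-side code may prefer to return "multipart" unconditionally, since it avoids the encoding overhead at any size.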



Single-row vs. multi-row extraction


This is the most important configuration decision.


Single-row (multiRow: false)


Returns ONE result object. Use for: invoices (extract Vendor, Total, Date once); contracts (extract Counterparty, Effective Date, Term); receipts (extract Total, Tax, Date).


Output:


{
"data": [{
"Vendor Name": "Acme Corp",
"Invoice Number": "INV-2026-0042",
"Total Amount": "1234.56",
"Due Date": "2026-05-15"
}]
}


Multi-row (multiRow: true)


Returns ONE row per item. Use for: line items in an invoice (one row per line); transactions in a bank statement (one row per transaction); employees in a roster.


Output:


{
"data": [
{"Description": "Widget A", "Qty": "5", "Price": "10.00"},
{"Description": "Widget B", "Qty": "3", "Price": "20.00"},
{"Description": "Widget C", "Qty": "10", "Price": "5.00"}
]
}


You can run both extractions on the same document with two API calls — one for header fields (single-row), one for line items (multi-row). This is exactly the same pattern as Pre-Header + Line Items in the spreadsheet UI.
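
Once both calls have returned, the two results usually need to be joined. A minimal sketch, assuming the response shapes shown in the Output samples above (the helper name is hypothetical):

```python
def merge_header_and_lines(header_result, line_result):
    """Attach the single header row to every line-item row.

    If a column name appears in both results, the line-item value wins.
    """
    header_rows = header_result["data"]
    if len(header_rows) != 1:
        raise ValueError("expected exactly one row from the single-row call")
    header = header_rows[0]
    return [{**header, **line} for line in line_result["data"]]
```

Each merged row then carries the invoice-level fields (vendor, invoice number) next to its own line-item fields, which is convenient for loading into a flat table.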



Tuning extraction with instructions


instructions is free-form text given to the AI alongside the document. Use it when column names alone are ambiguous. Examples:


  • "Total Amount is the grand total including tax, not the subtotal."
  • "Use ISO format YYYY-MM-DD for all dates. If only a month and year are present, use the first day of the month."
  • "Vendor Name should match the company on the letterhead, not the bill-to address."
  • "Phone numbers should be E.164 format (+15555551234)."
  • "For Status, return one of: Active, Pending, Closed."


A few sentences of instructions usually move accuracy more than reformatting column names. Iterate in the UI first; whatever instructions work there work in the API.
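
While iterating on instructions, it can help to check returned rows against the formats you asked for. The sketch below checks two of the example instructions (ISO dates and a fixed Status list). The column names "Due Date" and "Status" and the helper name are illustrative, and it assumes values come back as strings, as in the sample outputs.

```python
from datetime import datetime

ALLOWED_STATUSES = {"Active", "Pending", "Closed"}

def check_row(row):
    """Return a list of problems found in one extracted row (empty means OK)."""
    problems = []
    due = row.get("Due Date")
    if due:
        try:
            datetime.strptime(due, "%Y-%m-%d")
        except ValueError:
            problems.append(f"Due Date is not YYYY-MM-DD: {due!r}")
    status = row.get("Status")
    if status is not None and status not in ALLOWED_STATUSES:
        problems.append(f"Status is not an allowed value: {status!r}")
    return problems
```

A non-empty list is a signal to tighten the instructions rather than to patch the values downstream.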



Polling pattern


import base64
import time

import requests

API_KEY = "YOUR_API_KEY"
SUBMIT_URL = "https://sheets.lido.app/api/v1/extract-file-data"
RESULT_URL = "https://sheets.lido.app/api/v1/job-result/{}"

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Read the file and base64-encode it for the JSON payload
with open("invoice.pdf", "rb") as f:
    file_b64 = base64.b64encode(f.read()).decode("utf-8")

# Step 1: submit the job
submit = requests.post(SUBMIT_URL, headers=headers, timeout=60, json={
    "file": {"type": "base64", "data": file_b64, "name": "invoice.pdf"},
    "columns": ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
    "instructions": "Use ISO date format.",
    "multiRow": False,
    "pageRange": "1"
})
submit.raise_for_status()  # surface HTTP errors instead of failing on a missing jobId
job_id = submit.json()["jobId"]

# Wait at least 10 seconds before first poll
time.sleep(10)

# Step 2: poll with capped exponential backoff (5s, 7.5s, ... up to 30s)
backoff = 5
while True:
    r = requests.get(RESULT_URL.format(job_id), headers=headers, timeout=30).json()
    status = r.get("status")
    if status == "complete":
        print(r["data"])
        break
    if status == "error":
        raise RuntimeError(f"Extraction failed: {r.get('error')}")
    time.sleep(backoff)
    backoff = min(backoff * 1.5, 30)
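
The loop above polls until the job finishes, however long that takes. One way to bound it is to generate the same delay schedule up to an overall time budget. This is a sketch under assumptions: the function names and the 300-second default are illustrative, and fetch_result stands for whatever callable performs the GET and returns the parsed JSON.

```python
import time

def poll_delays(first_wait=10.0, start=5.0, factor=1.5, cap=30.0, max_total=300.0):
    """Yield sleep durations: one initial wait, then capped exponential backoff.

    Stops once the next delay would push the cumulative wait past max_total.
    """
    elapsed = 0.0
    delay = first_wait
    next_backoff = start
    while elapsed + delay <= max_total:
        yield delay
        elapsed += delay
        delay = next_backoff
        next_backoff = min(next_backoff * factor, cap)

def wait_for_result(fetch_result, sleep=time.sleep, max_total=300.0):
    """Poll via fetch_result() until the job completes, fails, or the budget runs out."""
    for delay in poll_delays(max_total=max_total):
        sleep(delay)
        result = fetch_result()
        status = result.get("status")
        if status == "complete":
            return result["data"]
        if status == "error":
            raise RuntimeError(f"Extraction failed: {result.get('error')}")
    raise TimeoutError(f"job did not finish within {max_total} seconds")
```

Passing sleep as a parameter keeps the function easy to test without real waiting.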



Node.js example


// Requires Node 18+, which provides a global fetch (no node-fetch dependency needed)
const fs = require("fs");

const API_KEY = process.env.LIDO_API_KEY;
const headers = {
  "Authorization": `Bearer ${API_KEY}`,
  "Content-Type": "application/json"
};

const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// CommonJS modules don't support top-level await, so run everything in an async function
async function main() {
  const fileData = fs.readFileSync("invoice.pdf").toString("base64");

  const submit = await fetch("https://sheets.lido.app/api/v1/extract-file-data", {
    method: "POST",
    headers,
    body: JSON.stringify({
      file: { type: "base64", data: fileData, name: "invoice.pdf" },
      columns: ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
      multiRow: false,
      pageRange: "1"
    })
  }).then(r => r.json());

  const jobId = submit.jobId;

  // Wait at least 10 seconds before the first poll
  await sleep(10000);

  let backoff = 5000;
  while (true) {
    const result = await fetch(
      `https://sheets.lido.app/api/v1/job-result/${jobId}`,
      { headers }
    ).then(r => r.json());

    if (result.status === "complete") {
      console.log(result.data);
      break;
    }
    if (result.status === "error") throw new Error(result.error);

    await sleep(backoff);
    backoff = Math.min(backoff * 1.5, 30000);
  }
}

main().catch(err => {
  console.error(err);
  process.exit(1);
});



Ruby example (multipart)


require "net/http"
require "uri"
require "json"

API_KEY = ENV["LIDO_API_KEY"]
SUBMIT_URL = URI("https://sheets.lido.app/api/v1/extract-file-data")

config = {
  columns: ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
  multiRow: false,
  pageRange: "1"
}.to_json

req = Net::HTTP::Post.new(SUBMIT_URL)
req["Authorization"] = "Bearer #{API_KEY}"

# Open in binary mode; the block form closes the file once the request has been sent
res = File.open("invoice.pdf", "rb") do |file|
  req.set_form([
    ["file", file],
    ["config", config]
  ], "multipart/form-data")
  Net::HTTP.start(SUBMIT_URL.host, SUBMIT_URL.port, use_ssl: true) { |http| http.request(req) }
end

job_id = JSON.parse(res.body)["jobId"]
puts "jobId: #{job_id}"
# Poll job-result/{jobId} as in the Python/JS examples



Rate limits


  • 5 requests per 30 seconds per API key, applied to the submit endpoint.
  • The poll endpoint (GET /job-result/...) has its own, more generous limit.
  • Hitting the limit returns HTTP 429. Back off and retry; don't treat 429 as a fatal error.


For high throughput, run multiple API keys (e.g., one per worker) and stagger submit calls.
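
Rather than waiting for a 429, a client can pace its own submit calls. The class below is a sketch of a sliding-window limiter for one API key. The class name is hypothetical, and it assumes the server counts requests over a sliding 30-second window; the article does not say whether the window is sliding or fixed, so keep the 429 handling in place as well.

```python
import time
from collections import deque

class SubmitRateLimiter:
    """Allow at most max_calls submits in any window_seconds span (client side)."""

    def __init__(self, max_calls=5, window_seconds=30.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent submits, oldest first

    def _purge(self, now):
        # Drop timestamps that have aged out of the window
        while self._calls and now - self._calls[0] >= self._window:
            self._calls.popleft()

    def acquire(self):
        """Block until a submit is allowed, then record it."""
        now = self._clock()
        self._purge(now)
        if len(self._calls) >= self._max_calls:
            # Sleep until the oldest recorded submit leaves the window
            self._sleep(self._window - (now - self._calls[0]))
            now = self._clock()
            self._purge(now)
        self._calls.append(now)
```

Call acquire() immediately before each POST to the submit endpoint. With one limiter per API key, each worker stays under its own limit.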



Throughput patterns


Sequential (small jobs)


submit → poll → submit → poll → ...


Simple, slow. Throughput ≈ 1 doc / (extraction time + poll interval).



Batched (medium volume)


submit doc 1 → submit doc 2 → submit doc 3... → poll all


Submit a batch up to the rate limit, then poll for results in parallel. Throughput is capped by the submit rate limit rather than by extraction time: at 5 requests per 30 seconds, that works out to about 10 docs/minute sustained per API key.
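
The two throughput estimates can be compared with a little arithmetic. The function names are illustrative, and the 20-second extraction time and 5-second poll interval in the usage note are example inputs, not measured values.

```python
def sequential_docs_per_minute(extraction_seconds, poll_interval_seconds):
    """Sequential pattern: one document per (extraction time + poll interval)."""
    return 60.0 / (extraction_seconds + poll_interval_seconds)

def batched_docs_per_minute(max_submits=5, window_seconds=30.0):
    """Batched pattern: bounded by the submit rate limit."""
    return max_submits * 60.0 / window_seconds
```

With a 20-second extraction and a 5-second poll interval, the sequential pattern manages about 2.4 docs/minute, against 10 docs/minute for the batched pattern at the documented rate limit.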


Async with job queue (high volume)


Run a job queue worker that submits and a separate poller that resolves results. Persist job IDs immediately. Use exponential backoff on the poller. Plan for the 24-hour result-expiration window.
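
A minimal sketch of the persistence side of this pattern, using SQLite from the standard library. The table layout, class name, and method names are all assumptions made for illustration; any durable store works. The point is that job IDs are written at submit time and the poller only considers jobs still inside the 24-hour window.

```python
import sqlite3
import time

RESULT_TTL_SECONDS = 24 * 60 * 60  # results expire 24 hours after job creation

class JobStore:
    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "job_id TEXT PRIMARY KEY, created_at REAL NOT NULL, "
            "status TEXT NOT NULL DEFAULT 'pending', result TEXT)"
        )
        self._db.commit()

    def record_submitted(self, job_id, now=None):
        """Persist a job ID as soon as the submit call returns it."""
        created = time.time() if now is None else now
        self._db.execute(
            "INSERT OR IGNORE INTO jobs (job_id, created_at) VALUES (?, ?)",
            (job_id, created),
        )
        self._db.commit()

    def pending(self, now=None):
        """Job IDs still worth polling: pending and not yet expired."""
        current = time.time() if now is None else now
        rows = self._db.execute(
            "SELECT job_id FROM jobs WHERE status = 'pending' AND created_at > ? "
            "ORDER BY created_at",
            (current - RESULT_TTL_SECONDS,),
        )
        return [row[0] for row in rows]

    def save_result(self, job_id, result_json):
        """Store the result on receipt so it survives the API's purge."""
        self._db.execute(
            "UPDATE jobs SET status = 'complete', result = ? WHERE job_id = ?",
            (result_json, job_id),
        )
        self._db.commit()
```

The submit worker calls record_submitted; the poller loops over pending() and calls save_result as each job completes.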



Result expiration


  • Results are available at GET /job-result/{jobId} for 24 hours after job creation.
  • After that, the job is purged. The result is gone — but you should have persisted it on receipt.
  • Treat the API as ephemeral. Your application is the system of record.



Error responses


| Code | Meaning | Action |
| --- | --- | --- |
| 400 | Bad request: missing or invalid fields | Check the payload; usually a typo or missing required field |
| 401 | Auth failed | Check Authorization: Bearer YOUR_API_KEY header |
| 413 | File too large | Switch to multipart for >50 MB; max 500 MB |
| 422 | Configuration invalid | Check column names, page range syntax |
| 429 | Rate limited | Back off; retry with exponential backoff |
| 500 / 502 / 503 | Lido server error | Retry with exponential backoff |


Job-level errors (the file submitted but extraction failed) come back through the poll endpoint with status: "error" and an error field. These are not HTTP errors — the request itself succeeded.
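
The table and the note on job-level errors translate into two separate checks in client code. The sketch below is one way to encode them; the function names and the action labels are hypothetical, and the status codes are the ones listed above.

```python
RETRYABLE_STATUSES = {429, 500, 502, 503}
FIX_REQUEST_STATUSES = {400, 413, 422}

def next_action(http_status):
    """Map an HTTP status from the submit or poll call to a coarse action."""
    if 200 <= http_status < 300:
        return "ok"
    if http_status in RETRYABLE_STATUSES:
        return "retry_with_backoff"
    if http_status == 401:
        return "check_api_key"
    if http_status in FIX_REQUEST_STATUSES:
        return "fix_request"
    return "investigate"

def job_failed(poll_body):
    """True when the HTTP request succeeded but the extraction itself failed."""
    return poll_body.get("status") == "error"
```

Check next_action on the HTTP status first, then job_failed on the parsed body of a successful poll response.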



Tips


  • Always test in the UI first. Always.
  • Persist results immediately. They expire in 24 hours.
  • Use multipart for files >50 MB. Lower memory cost than base64.
  • Run two extractions when you need both header and line items. Single-row + multi-row, same file, two calls.
  • Capture jobId in your logs at submit time. When something goes wrong, you'll need it.
  • Set conservative timeouts. Network hiccups happen — use 30+ second connect timeouts and retry on transient failure.
  • Use environment variables for API keys; never check them into git.



Common mistakes


  • Polling every second. Wastes the rate limit on the poll endpoint and does nothing — extraction takes 10+ seconds. Wait at least 10 seconds before the first poll.
  • Forgetting the Bearer prefix. Authorization: YOUR_API_KEY is wrong. Must be Bearer YOUR_API_KEY.
  • Submitting the same file in a tight loop. If the result looks wrong, fix the instructions first; re-submitting the same request unchanged is unlikely to help.
  • Mixing up single-row and multi-row. Returning 50 rows when you wanted one summary, or one row when you wanted 50 line items, almost always means multiRow is wrong.
  • Treating 429 as fatal. It's a "back off" signal, not a failure.
  • Not handling result expiration. A polling worker that retries forever may lose results to the 24-hour window.
  • Hardcoding API keys in client-side code. API keys belong on the server, in a secrets manager.
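
Two of these mistakes (a missing Bearer prefix and hardcoded keys) can be avoided with one small server-side helper. This is a sketch: the function name is hypothetical, and the LIDO_API_KEY variable name follows the Node.js and Ruby examples in this article.

```python
import os

def auth_headers(env_var="LIDO_API_KEY"):
    """Build the Authorization header from an environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return {"Authorization": f"Bearer {key}"}
```

Failing fast on a missing key gives a clearer error than a 401 from the API.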




Related articles


  • Lido API: quickstart and authentication
  • API error reference
  • Webhooks and async processing
  • Extract data from PDFs and documents (UI version of the same logic)
  • Improve extraction accuracy

Updated on: 13/05/2026
