Extract data via the API: deep dive
The extract-file-data endpoint is Lido's primary API surface — submit a file, get structured data back. This article goes deeper than the quickstart: the full payload shape, configuration patterns, multi-row vs. single-row, attachments, performance, and code samples in three languages.
For first-time setup (auth, basic submit/poll, getting an API key), see Lido API: quickstart and authentication.
The two-step pattern
Every extraction follows the same shape:
- Submit a job: POST /api/v1/extract-file-data → returns jobId.
- Poll for results: GET /api/v1/job-result/{jobId} → returns extracted data when complete.
Lido extraction is asynchronous. A typical document takes 10–30 seconds. Don't block; submit, do other work, then poll.
Anatomy of the submit payload
{
  "file": {
    "type": "base64",
    "data": "<base64-encoded file content>",
    "name": "invoice.pdf"
  },
  "columns": ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
  "instructions": "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates. Vendor Name should match the invoice header, not the bill-to address.",
  "multiRow": false,
  "pageRange": "1-3"
}
| Field | Type | Required | Notes |
|---|---|---|---|
| file | object | Yes (or use multipart upload) | See "File upload methods" |
| columns | array of strings | Yes | The fields to extract. Order matters: output keys preserve it |
| instructions | string | No | Free-form guidance to the AI. Use it to clarify ambiguous fields |
| multiRow | boolean | No (default false) | true for tabular extraction (line items); false for summary fields |
| pageRange | string | No | Pages to process, e.g. "1-3" |
The same fields are configurable in the UI's Data Extractor. The fastest path to a correct payload is: build it in the spreadsheet UI, click the API button (bottom-left of the extractor), copy the generated configuration.
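As a rough illustration of how the fields in the table fit together, the sketch below assembles the JSON body for the base64 method. The helper name build_payload and its parameters are hypothetical, not part of Lido's API; only the payload keys come from the table above.

```python
import base64
from pathlib import Path


def build_payload(path, columns, instructions=None, multi_row=False, page_range=None):
    """Assemble a JSON body for POST /api/v1/extract-file-data (base64 method).

    Hypothetical helper: only the payload keys are taken from the article.
    """
    if not columns:
        raise ValueError("columns is required and must not be empty")
    file_path = Path(path)
    payload = {
        "file": {
            "type": "base64",
            "data": base64.b64encode(file_path.read_bytes()).decode("ascii"),
            "name": file_path.name,
        },
        "columns": list(columns),  # order is kept; output keys follow it
        "multiRow": bool(multi_row),
    }
    # Optional fields are left out entirely when unset.
    if instructions:
        payload["instructions"] = instructions
    if page_range:
        payload["pageRange"] = page_range
    return payload
```

Calling build_payload("invoice.pdf", ["Vendor Name", "Total Amount"], page_range="1-3") yields a dict that can be passed as the JSON body of the submit request.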
💸 Billing note on pageRange vs. @exclude_pages. Set pageRange when you know which pages to process — you're only billed for the pages it lists. The Extra Instructions directive @exclude_pages is content-based ("skip any page that contains 'Terms and Conditions'") and it still bills for every page in the document, because Lido has to read each page to decide which to exclude. Use pageRange whenever you can identify your target pages by number; reach for @exclude_pages only when you can't.
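To make the billing difference concrete, here is a small sketch that counts the pages a pageRange value covers. It assumes only the two forms shown in this article, a single page ("1") or a simple range ("1-3"); any other syntax is outside what this sketch handles, and both function names are hypothetical.

```python
def pages_in_range(page_range):
    """Count the pages covered by a pageRange string such as "1" or "1-3"."""
    text = page_range.strip()
    if "-" in text:
        start_text, end_text = text.split("-", 1)
        start, end = int(start_text), int(end_text)
    else:
        start = end = int(text)
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {page_range!r}")
    return end - start + 1


def pages_saved(total_pages, page_range):
    """Pages not billed when pageRange is used instead of reading every page."""
    return total_pages - min(pages_in_range(page_range), total_pages)


# A 40-page document processed with pageRange "1-3" bills 3 pages, not 40.
print(pages_in_range("1-3"), pages_saved(40, "1-3"))  # prints: 3 37
```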
File upload methods
Two ways to send the file. Both end up at the same endpoint; both produce the same jobId and result format.
Method 1: JSON + base64 (max 50 MB)
Encode the file as base64, embed in JSON.
{
  "file": {
    "type": "base64",
    "data": "JVBERi0xLjQKJ...",
    "name": "invoice.pdf"
  },
  "columns": [...]
}
Best for: web frontends, smaller files, simple integrations.
Tradeoff: base64 encoding inflates file size by ~33% in the request body.
Method 2: Multipart form data (max 500 MB)
Upload the file as a normal multipart form field; configuration goes alongside as a JSON string.
curl -X POST 'https://sheets.lido.app/api/v1/extract-file-data' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-F 'file=@invoice.pdf' \
-F 'config={"columns":["Vendor Name","Invoice Number","Total Amount","Due Date"],"instructions":"...","multiRow":false,"pageRange":"1"}'
Best for: large files, server-side integrations, anywhere base64 encoding would be wasteful.
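The choice between the two methods can be sketched as a size check. This is an illustration under assumptions: the limits are treated as decimal megabytes and as applying to the raw file size, neither of which this article specifies, and the function names are hypothetical.

```python
import math

# Assumption: "MB" means 1,000,000 bytes and the limit applies to the raw file.
BASE64_LIMIT_BYTES = 50 * 1000 * 1000
MULTIPART_LIMIT_BYTES = 500 * 1000 * 1000


def base64_size(raw_bytes):
    """Length of the padded base64 text produced from raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def choose_upload_method(raw_bytes):
    """Pick "base64" or "multipart" from the file size, or raise if too large."""
    if raw_bytes <= BASE64_LIMIT_BYTES:
        return "base64"
    if raw_bytes <= MULTIPART_LIMIT_BYTES:
        return "multipart"
    raise ValueError("file exceeds the 500 MB multipart limit")


# A 30 MB file becomes 40 MB of base64 text: the ~33% inflation noted above.
print(base64_size(30_000_000))  # prints: 40000000
```

Even when a file fits under the base64 limit, multipart may still be the better choice on memory-constrained servers, since the encoded copy never has to be held in memory.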
Single-row vs. multi-row extraction
This is the most important configuration decision.
Single-row (multiRow: false)
Returns ONE result object. Use for: invoices (extract Vendor, Total, Date once); contracts (extract Counterparty, Effective Date, Term); receipts (extract Total, Tax, Date).
Output:
{
  "data": [{
    "Vendor Name": "Acme Corp",
    "Invoice Number": "INV-2026-0042",
    "Total Amount": "1234.56",
    "Due Date": "2026-05-15"
  }]
}
Multi-row (multiRow: true)
Returns ONE row per item. Use for: line items in an invoice (one row per line); transactions in a bank statement (one row per transaction); employees in a roster.
Output:
{
  "data": [
    {"Description": "Widget A", "Qty": "5", "Price": "10.00"},
    {"Description": "Widget B", "Qty": "3", "Price": "20.00"},
    {"Description": "Widget C", "Qty": "10", "Price": "5.00"}
  ]
}
You can run both extractions on the same document with two API calls — one for header fields (single-row), one for line items (multi-row). This is exactly the same pattern as Pre-Header + Line Items in the spreadsheet UI.
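Once both calls have completed, the two results can be joined into one record. The sketch below assumes the result shapes shown in the two Output samples above; combine_results is a hypothetical helper, not something the API provides.

```python
def combine_results(header_result, items_result):
    """Join a single-row result and a multi-row result for the same document."""
    header_rows = header_result.get("data") or []
    if len(header_rows) != 1:
        # More or fewer than one row usually means multiRow was set incorrectly.
        raise ValueError(f"expected 1 header row, got {len(header_rows)}")
    return {
        "header": header_rows[0],
        "line_items": list(items_result.get("data") or []),
    }


header = {"data": [{"Vendor Name": "Acme Corp", "Total Amount": "1234.56"}]}
items = {"data": [
    {"Description": "Widget A", "Qty": "5", "Price": "10.00"},
    {"Description": "Widget B", "Qty": "3", "Price": "20.00"},
]}
invoice = combine_results(header, items)
print(invoice["header"]["Vendor Name"], len(invoice["line_items"]))  # prints: Acme Corp 2
```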
Tuning extraction with instructions
instructions is free-form text given to the AI alongside the document. Use it when column names alone are ambiguous. Examples:
- "Total Amount is the grand total including tax, not the subtotal."
- "Use ISO format YYYY-MM-DD for all dates. If only a month and year are present, use the first day of the month."
- "Vendor Name should match the company on the letterhead, not the bill-to address."
- "Phone numbers should be E.164 format (+15555551234)."
- "For Status, return one of: Active, Pending, Closed."
A few sentences of instructions usually moves accuracy more than reformatting column names. Iterate in the UI first; whatever instructions work there work in the API.
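Instructions guide the output but do not guarantee it, so a light check on the returned rows can catch drift while you iterate. The sketch below is one possible approach, not a Lido feature: check_row and its parameters are hypothetical, and it assumes values come back as strings, as in the sample outputs above.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_row(row, date_fields=(), allowed_values=None):
    """Return a list of human-readable problems found in one extracted row."""
    problems = []
    for field in date_fields:
        value = row.get(field)
        ok = isinstance(value, str) and ISO_DATE.fullmatch(value) is not None
        if ok:
            try:
                date.fromisoformat(value)  # rejects impossible dates like 2026-02-30
            except ValueError:
                ok = False
        if not ok:
            problems.append(f"{field}: expected YYYY-MM-DD, got {value!r}")
    for field, allowed in (allowed_values or {}).items():
        if row.get(field) not in allowed:
            problems.append(f"{field}: {row.get(field)!r} is not one of {sorted(allowed)}")
    return problems


row = {"Due Date": "15/05/2026", "Status": "Active"}
print(check_row(row, ["Due Date"], {"Status": {"Active", "Pending", "Closed"}}))
```

A non-empty list is a hint to tighten the instructions text rather than to re-submit the same request unchanged.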
Polling pattern
import base64
import time

import requests

API_KEY = "YOUR_API_KEY"
SUBMIT_URL = "https://sheets.lido.app/api/v1/extract-file-data"
RESULT_URL = "https://sheets.lido.app/api/v1/job-result/{}"
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Read the file and base64-encode it for the JSON body
with open("invoice.pdf", "rb") as f:
    file_b64 = base64.b64encode(f.read()).decode("utf-8")

# Step 1: submit the job
submit = requests.post(SUBMIT_URL, headers=headers, json={
    "file": {"type": "base64", "data": file_b64, "name": "invoice.pdf"},
    "columns": ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
    "instructions": "Use ISO date format.",
    "multiRow": False,
    "pageRange": "1"
})
submit.raise_for_status()  # surface HTTP errors (400, 401, 429, ...) before reading the body
job_id = submit.json()["jobId"]

# Wait at least 10 seconds before first poll
time.sleep(10)

# Step 2: poll with a growing delay, capped at 30 seconds
backoff = 5
while True:
    r = requests.get(RESULT_URL.format(job_id), headers=headers).json()
    status = r.get("status")
    if status == "complete":
        print(r["data"])
        break
    if status == "error":
        raise Exception(f"Extraction failed: {r.get('error')}")
    time.sleep(backoff)
    backoff = min(backoff * 1.5, 30)
Node.js example
// Uses the fetch built into Node 18+. On older versions, install node-fetch v2 and require it.
const fs = require("fs");

const API_KEY = process.env.LIDO_API_KEY;
const BASE_URL = "https://sheets.lido.app/api/v1";
const headers = {
  "Authorization": `Bearer ${API_KEY}`,
  "Content-Type": "application/json"
};
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// await is not allowed at the top level of a CommonJS script, so wrap the flow in a function
async function main() {
  const fileData = fs.readFileSync("invoice.pdf").toString("base64");

  // Step 1: submit the job
  const submit = await fetch(`${BASE_URL}/extract-file-data`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      file: { type: "base64", data: fileData, name: "invoice.pdf" },
      columns: ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
      multiRow: false,
      pageRange: "1"
    })
  }).then((r) => r.json());
  const jobId = submit.jobId;

  // Step 2: wait 10 seconds, then poll with a growing delay capped at 30 seconds
  await sleep(10000);
  let backoff = 5000;
  while (true) {
    const result = await fetch(`${BASE_URL}/job-result/${jobId}`, { headers })
      .then((r) => r.json());
    if (result.status === "complete") {
      console.log(result.data);
      break;
    }
    if (result.status === "error") throw new Error(result.error);
    await sleep(backoff);
    backoff = Math.min(backoff * 1.5, 30000);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
Ruby example (multipart)
require "net/http"
require "uri"
require "json"

API_KEY = ENV["LIDO_API_KEY"]
SUBMIT_URL = URI("https://sheets.lido.app/api/v1/extract-file-data")

# Configuration travels as a JSON string in the "config" form field
config = {
  columns: ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
  multiRow: false,
  pageRange: "1"
}.to_json

req = Net::HTTP::Post.new(SUBMIT_URL)
req["Authorization"] = "Bearer #{API_KEY}"
# Open the PDF in binary mode so its bytes are sent unchanged on every platform
req.set_form([
  ["file", File.open("invoice.pdf", "rb")],
  ["config", config]
], "multipart/form-data")

res = Net::HTTP.start(SUBMIT_URL.host, SUBMIT_URL.port, use_ssl: true) { |http| http.request(req) }
job_id = JSON.parse(res.body)["jobId"]
puts "jobId: #{job_id}"
# Poll job-result/{jobId} as in the Python/JS examples
Rate limits
- 5 requests per 30 seconds per API key, applied to the submit endpoint.
- The poll endpoint (GET /job-result/...) has its own, more generous limit.
- Hitting the limit returns HTTP 429. Back off and retry; don't treat 429 as a fatal error.
For high throughput, run multiple API keys (e.g., one per worker) and stagger submit calls.
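Staying under the submit limit on the client side avoids most 429s in the first place. The sketch below is one way to do that with a sliding window of 5 calls per 30 seconds; the SubmitLimiter class is hypothetical, and the clock and sleep parameters exist only so the behaviour can be exercised without real waiting.

```python
import time
from collections import deque


class SubmitLimiter:
    """Client-side sliding-window limiter for the submit endpoint (one per API key)."""

    def __init__(self, max_calls=5, window_seconds=30.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of recent submits, oldest first

    def wait_for_slot(self):
        """Block until a submit is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget submits that have left the window.
            while self._sent and now - self._sent[0] >= self._window:
                self._sent.popleft()
            if len(self._sent) < self._max_calls:
                break
            # Sleep until the oldest recorded submit leaves the window.
            self._sleep(self._window - (now - self._sent[0]))
        self._sent.append(now)
```

Call limiter.wait_for_slot() immediately before each POST to the submit endpoint. A 429 can still occur (for example when several processes share a key), so the back-off-and-retry handling remains necessary.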
Throughput patterns
Sequential (small jobs)
submit → poll → submit → poll → ...
Simple, slow. Throughput ≈ 1 doc / (extraction time + poll interval).
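The sequential estimate is simple arithmetic; the sketch below just evaluates it. The 20-second extraction time and 5-second poll interval are illustrative values picked from the ranges mentioned in this article, not measurements.

```python
def sequential_docs_per_minute(extraction_seconds, poll_interval_seconds):
    """Throughput of submit -> poll -> submit -> poll, in documents per minute."""
    return 60.0 / (extraction_seconds + poll_interval_seconds)


# One document every 20 s + 5 s = 25 s, i.e. 2.4 documents per minute.
print(sequential_docs_per_minute(20, 5))  # prints: 2.4
```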
Pipelined (recommended for moderate volume)
submit doc 1 → submit doc 2 → submit doc 3 → ... → poll all
Submit a batch up to the rate limit, then poll for results in parallel. Throughput ≈ rate_limit / per-doc cost. Practically, for 5 req/30s, that's ~10 docs/minute sustained.
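A minimal sketch of the pipelined pattern follows. The submit and fetch_result callables are placeholders you would supply (for example, thin wrappers around the two HTTP calls); the function itself is hypothetical and leaves rate limiting of the submit calls to the caller.

```python
import time


def run_pipelined(documents, submit, fetch_result,
                  first_wait=10.0, poll_interval=5.0, sleep=time.sleep):
    """Submit every document first, then poll all outstanding jobs together.

    submit(doc) must return a jobId; fetch_result(job_id) must return the
    parsed poll response. Documents must be hashable (e.g. file names).
    """
    pending = {submit(doc): doc for doc in documents}  # jobId -> document
    results, failures = {}, {}
    sleep(first_wait)  # extraction takes 10+ seconds, so don't poll earlier
    while pending:
        for job_id in list(pending):
            reply = fetch_result(job_id)
            status = reply.get("status")
            if status == "complete":
                results[pending.pop(job_id)] = reply["data"]
            elif status == "error":
                failures[pending.pop(job_id)] = reply.get("error")
            # any other status: the job is still running, keep it pending
        if pending:
            sleep(poll_interval)
    return results, failures
```

Returning failures separately keeps one bad document from aborting the whole batch.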
Async with job queue (high volume)
Run a job queue worker that submits and a separate poller that resolves results. Persist job IDs immediately. Use exponential backoff on the poller. Plan for the 24-hour result-expiration window.
Result expiration
- Results are available at GET /job-result/{jobId} for 24 hours after job creation.
- After that, the job is purged. The result is gone, but you should have persisted it on receipt.
- Treat the API as ephemeral. Your application is the system of record.
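One way to make your application the system of record is a small local table keyed by jobId. The sketch below uses SQLite from the Python standard library; the JobStore class, table layout, and method names are all hypothetical, and the 24-hour window is measured from the local submit time as an approximation of job creation.

```python
import json
import sqlite3
import time

RESULT_TTL_SECONDS = 24 * 60 * 60  # results expire 24 hours after job creation


class JobStore:
    """Persist job IDs at submit time and results as soon as they arrive."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "job_id TEXT PRIMARY KEY, file_name TEXT NOT NULL, "
            "created_at REAL NOT NULL, result_json TEXT)"
        )
        self.db.commit()

    def record_submit(self, job_id, file_name, now=None):
        created = time.time() if now is None else now
        self.db.execute(
            "INSERT OR IGNORE INTO jobs (job_id, file_name, created_at) VALUES (?, ?, ?)",
            (job_id, file_name, created),
        )
        self.db.commit()

    def record_result(self, job_id, data):
        self.db.execute(
            "UPDATE jobs SET result_json = ? WHERE job_id = ?",
            (json.dumps(data), job_id),
        )
        self.db.commit()

    def unresolved(self, now=None):
        """Return (still_fetchable, expired) job IDs that have no stored result."""
        current = time.time() if now is None else now
        rows = self.db.execute(
            "SELECT job_id, created_at FROM jobs "
            "WHERE result_json IS NULL ORDER BY created_at"
        ).fetchall()
        fetchable = [j for j, created in rows if current - created < RESULT_TTL_SECONDS]
        expired = [j for j, created in rows if current - created >= RESULT_TTL_SECONDS]
        return fetchable, expired
```

Jobs in the expired list can no longer be fetched and would need to be re-submitted; jobs in the fetchable list are the ones a poller should still be working on.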
Error responses
| Code | Meaning | Action |
|---|---|---|
| 400 | Bad request: missing/invalid fields | Check the payload; usually a typo or missing required field |
| 401 | Auth failed | Check the Authorization header and API key |
| 413 | File too large | Switch to multipart for >50 MB; max 500 MB |
| 422 | Configuration invalid | Check column names, page range syntax |
| 429 | Rate limited | Back off; retry with exponential backoff |
| 500 / 502 / 503 | Lido server error | Retry with exponential backoff |
Job-level errors (the file submitted but extraction failed) come back through the poll endpoint with status: "error" and an error field. These are not HTTP errors — the request itself succeeded.
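The distinction between HTTP-level and job-level errors can be captured in one dispatch function. The sketch below maps the status codes from the table, plus the status field of a successful poll response, to a next step; the function name and the action labels are hypothetical.

```python
RETRY_STATUSES = {429, 500, 502, 503}        # transient: back off and retry
FIX_REQUEST_STATUSES = {400, 401, 413, 422}  # retrying unchanged will not help


def next_action(http_status, body=None):
    """Decide what to do with a response from the submit or poll endpoint."""
    if http_status in RETRY_STATUSES:
        return "retry"
    if http_status in FIX_REQUEST_STATUSES:
        return "fix_request"
    if 200 <= http_status < 300:
        # The request succeeded; the job itself may still have failed.
        status = (body or {}).get("status")
        if status == "error":
            return "job_failed"
        if status == "complete":
            return "done"
        return "pending"
    return "unexpected"


print(next_action(429))                                           # prints: retry
print(next_action(200, {"status": "error", "error": "bad file"}))  # prints: job_failed
```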
Tips
- Always test in the UI first. Always.
- Persist results immediately. They expire in 24 hours.
- Use multipart for files >50 MB. Lower memory cost than base64.
- Run two extractions when you need both header and line items. Single-row + multi-row, same file, two calls.
- Capture jobId in your logs at submit time. When something goes wrong, you'll need it.
- Set conservative timeouts. Network hiccups happen — use 30+ second connect timeouts and retry on transient failure.
- Use environment variables for API keys; never check them into git.
Common mistakes
- Polling every second. Wastes the poll endpoint's rate limit and gains nothing, because extraction takes 10+ seconds. Wait at least 10 seconds before the first poll.
- Forgetting the Bearer prefix. Authorization: YOUR_API_KEY is wrong. It must be Authorization: Bearer YOUR_API_KEY.
- Submitting the same file in a tight loop. If the result looks wrong, fix instructions; don't re-submit.
- Mixing up single-row and multi-row. Returning 50 rows when you wanted one summary, or one row when you wanted 50 line items, almost always means multiRow is wrong.
- Treating 429 as fatal. It's a "back off" signal, not a failure.
- Not handling result expiration. A polling worker that retries forever may lose results to the 24-hour window.
- Hardcoding API keys in client-side code. API keys belong on the server, in a secrets manager.
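Several of these mistakes (polling too eagerly, retrying forever) can be avoided with a poll loop that has an explicit deadline. The sketch below reuses the 10-second first wait and the 5-to-30-second backoff from the polling example; poll_with_deadline, its parameters, and the one-hour default are hypothetical choices, and fetch_result stands in for your own wrapper around the poll request.

```python
import time


def poll_with_deadline(fetch_result, job_id, give_up_after=3600.0, first_wait=10.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until the job completes, fails, or the deadline would be exceeded."""
    started = clock()
    sleep(first_wait)  # extraction takes 10+ seconds; polling sooner is wasted
    backoff = 5.0
    while True:
        reply = fetch_result(job_id)
        status = reply.get("status")
        if status == "complete":
            return reply["data"]
        if status == "error":
            raise RuntimeError(f"extraction failed for {job_id}: {reply.get('error')}")
        if clock() - started + backoff > give_up_after:
            raise TimeoutError(f"gave up on {job_id} after {give_up_after} seconds")
        sleep(backoff)
        backoff = min(backoff * 1.5, 30.0)  # grow the delay, capped at 30 seconds
```

Keeping give_up_after well below 24 hours leaves time to notice a stuck job and investigate while its result can still be fetched.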
Related articles
- Lido API: quickstart and authentication
- API error reference
- Webhooks and async processing
- Extract data from PDFs and documents (UI version of the same logic)
- Improve extraction accuracy
Updated on: 13/05/2026