Extract data via the API: deep dive
The extract-file-data endpoint is Lido's primary API surface — submit a file, get structured data back. This article goes deeper than the quickstart: the full payload shape, configuration patterns, multi-row vs. single-row, attachments, performance, and code samples in three languages.
For first-time setup (auth, basic submit/poll, getting an API key), see Lido API: quickstart and authentication.
The two-step pattern
Every extraction follows the same shape:
- Submit a job: `POST /api/v1/extract-file-data` → returns a `jobId`.
- Poll for results: `GET /api/v1/job-result/{jobId}` → returns extracted data when complete.
Lido extraction is asynchronous. A typical document takes 10–30 seconds. Don't block; submit, do other work, then poll.
Anatomy of the submit payload
```json
{
  "file": {
    "type": "base64",
    "data": "<base64-encoded file content>",
    "name": "invoice.pdf"
  },
  "columns": ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
  "instructions": "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates. Vendor Name should match the invoice header, not the bill-to address.",
  "multiRow": false,
  "pageRange": "1-3"
}
```
| Field | Type | Required | Notes |
|---|---|---|---|
| `file` | object | Yes (or use multipart upload) | See "File upload methods" |
| `columns` | array of strings | Yes | The fields to extract. Order matters — output keys preserve order |
| `instructions` | string | No | Free-form guidance to the AI. Use it to clarify ambiguous fields |
| `multiRow` | boolean | No (default false) | `true` for tabular extraction (line items); `false` for summary fields |
| `pageRange` | string | No | Pages to process, e.g. `"1-3"` |
The same fields are configurable in the UI's Data Extractor. The fastest path to a correct payload is: build it in the spreadsheet UI, click the API button (bottom-left of the extractor), copy the generated configuration.
File upload methods
Two ways to send the file. Both end up at the same endpoint; both produce the same jobId and result format.
Method 1: JSON + base64 (max 50 MB)
Encode the file as base64, embed in JSON.
```json
{
  "file": {
    "type": "base64",
    "data": "JVBERi0xLjQKJ...",
    "name": "invoice.pdf"
  },
  "columns": [...]
}
```
Best for: web frontends, smaller files, simple integrations.
Tradeoff: base64 encoding inflates file size by ~33% in the request body.
Method 2: Multipart form data (max 500 MB)
Upload the file as a normal multipart form field; configuration goes alongside as a JSON string.
```shell
curl -X POST 'https://sheets.lido.app/api/v1/extract-file-data' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@invoice.pdf' \
  -F 'config={"columns":["Vendor Name","Invoice Number","Total Amount","Due Date"],"instructions":"...","multiRow":false,"pageRange":"1"}'
```
Best for: large files, server-side integrations, anywhere base64 encoding would be wasteful.
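The same multipart call can be made from Python with the third-party `requests` library. This is a sketch, not part of the API: the helper names (`build_config`, `submit_multipart`) are ours.

```python
import json
import requests  # third-party: pip install requests

SUBMIT_URL = "https://sheets.lido.app/api/v1/extract-file-data"

def build_config(columns, instructions=None, multi_row=False, page_range=None):
    """Assemble the JSON config string that rides alongside the file part."""
    config = {"columns": columns, "multiRow": multi_row}
    if instructions:
        config["instructions"] = instructions
    if page_range:
        config["pageRange"] = page_range
    return json.dumps(config)

def submit_multipart(path, api_key, columns, **kwargs):
    """Submit a file as multipart form data; returns the jobId."""
    with open(path, "rb") as f:
        resp = requests.post(
            SUBMIT_URL,
            # No Content-Type header here: requests sets the multipart boundary itself.
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": (path, f)},
            data={"config": build_config(columns, **kwargs)},
        )
    resp.raise_for_status()
    return resp.json()["jobId"]
```

Note that the file never touches base64 here, so memory cost stays close to the file's actual size.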
Single-row vs. multi-row extraction
This is the most important configuration decision.
Single-row (multiRow: false)
Returns ONE result object. Use for: invoices (extract Vendor, Total, Date once); contracts (extract Counterparty, Effective Date, Term); receipts (extract Total, Tax, Date).
Output:
```json
{
  "data": [{
    "Vendor Name": "Acme Corp",
    "Invoice Number": "INV-2026-0042",
    "Total Amount": "1234.56",
    "Due Date": "2026-05-15"
  }]
}
```
Multi-row (multiRow: true)
Returns ONE row per item. Use for: line items in an invoice (one row per line); transactions in a bank statement (one row per transaction); employees in a roster.
Output:
```json
{
  "data": [
    {"Description": "Widget A", "Qty": "5", "Price": "10.00"},
    {"Description": "Widget B", "Qty": "3", "Price": "20.00"},
    {"Description": "Widget C", "Qty": "10", "Price": "5.00"}
  ]
}
```
You can run both extractions on the same document with two API calls — one for header fields (single-row), one for line items (multi-row). This is exactly the same pattern as Pre-Header + Line Items in the spreadsheet UI.
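A sketch of that two-call pattern in Python, assuming the base64 upload method and the third-party `requests` library; the function names and example columns are illustrative, not part of the API.

```python
import base64
import requests  # third-party: pip install requests

SUBMIT_URL = "https://sheets.lido.app/api/v1/extract-file-data"

def build_payload(file_b64, name, columns, multi_row):
    """Assemble a submit payload for one extraction pass over the file."""
    return {
        "file": {"type": "base64", "data": file_b64, "name": name},
        "columns": columns,
        "multiRow": multi_row,
    }

def submit(payload, api_key):
    resp = requests.post(SUBMIT_URL, json=payload,
                         headers={"Authorization": f"Bearer {api_key}"})
    resp.raise_for_status()
    return resp.json()["jobId"]

def extract_header_and_lines(path, api_key):
    """Submit two jobs on the same file: header fields, then line items."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    header_job = submit(build_payload(
        b64, path, ["Vendor Name", "Invoice Number", "Total Amount"], False), api_key)
    lines_job = submit(build_payload(
        b64, path, ["Description", "Qty", "Price"], True), api_key)
    return header_job, lines_job  # poll both jobIds as usual
```

The file is encoded once and reused for both submissions; only `columns` and `multiRow` differ between the two jobs.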
Tuning extraction with instructions
`instructions` is free-form text given to the AI alongside the document. Use it when column names alone are ambiguous. Examples:
- "Total Amount is the grand total including tax, not the subtotal."
- "Use ISO format YYYY-MM-DD for all dates. If only a month and year are present, use the first day of the month."
- "Vendor Name should match the company on the letterhead, not the bill-to address."
- "Phone numbers should be E.164 format (+15555551234)."
- "For Status, return one of: Active, Pending, Closed."
A few sentences of instructions usually moves accuracy more than reformatting column names. Iterate in the UI first; whatever instructions work there work in the API.
Polling pattern
```python
import requests, time, base64

API_KEY = "YOUR_API_KEY"
SUBMIT_URL = "https://sheets.lido.app/api/v1/extract-file-data"
RESULT_URL = "https://sheets.lido.app/api/v1/job-result/{}"

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

with open("invoice.pdf", "rb") as f:
    file_b64 = base64.b64encode(f.read()).decode("utf-8")

submit = requests.post(SUBMIT_URL, headers=headers, json={
    "file": {"type": "base64", "data": file_b64, "name": "invoice.pdf"},
    "columns": ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
    "instructions": "Use ISO date format.",
    "multiRow": False,
    "pageRange": "1"
})
job_id = submit.json()["jobId"]

# Wait at least 10 seconds before the first poll
time.sleep(10)

backoff = 5
while True:
    r = requests.get(RESULT_URL.format(job_id), headers=headers).json()
    status = r.get("status")
    if status == "complete":
        print(r["data"])
        break
    if status == "error":
        raise RuntimeError(f"Extraction failed: {r.get('error')}")
    time.sleep(backoff)
    backoff = min(backoff * 1.5, 30)
```
Node.js example
```javascript
const fs = require("fs");
// Node 18+ ships a global fetch; on older versions, install node-fetch and import it as an ES module.

const API_KEY = process.env.LIDO_API_KEY;
const headers = {
  "Authorization": `Bearer ${API_KEY}`,
  "Content-Type": "application/json"
};

async function main() {
  const fileData = fs.readFileSync("invoice.pdf").toString("base64");
  const submit = await fetch("https://sheets.lido.app/api/v1/extract-file-data", {
    method: "POST",
    headers,
    body: JSON.stringify({
      file: { type: "base64", data: fileData, name: "invoice.pdf" },
      columns: ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
      multiRow: false,
      pageRange: "1"
    })
  }).then(r => r.json());
  const jobId = submit.jobId;

  // Wait before the first poll, then back off between polls
  await new Promise(r => setTimeout(r, 10000));
  let backoff = 5000;
  while (true) {
    const result = await fetch(
      `https://sheets.lido.app/api/v1/job-result/${jobId}`,
      { headers }
    ).then(r => r.json());
    if (result.status === "complete") {
      console.log(result.data);
      break;
    }
    if (result.status === "error") throw new Error(result.error);
    await new Promise(r => setTimeout(r, backoff));
    backoff = Math.min(backoff * 1.5, 30000);
  }
}

main();
```
Ruby example (multipart)
```ruby
require "net/http"
require "uri"
require "json"

API_KEY = ENV["LIDO_API_KEY"]
SUBMIT_URL = URI("https://sheets.lido.app/api/v1/extract-file-data")

config = {
  columns: ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
  multiRow: false,
  pageRange: "1"
}.to_json

req = Net::HTTP::Post.new(SUBMIT_URL)
req["Authorization"] = "Bearer #{API_KEY}"
req.set_form([
  ["file", File.open("invoice.pdf")],
  ["config", config]
], "multipart/form-data")

res = Net::HTTP.start(SUBMIT_URL.host, SUBMIT_URL.port, use_ssl: true) { |http| http.request(req) }
job_id = JSON.parse(res.body)["jobId"]
puts "jobId: #{job_id}"
# Poll job-result/{jobId} as in the Python/JS examples
```
Rate limits
- 5 requests per 30 seconds per API key, applied to the submit endpoint.
- The poll endpoint (`GET /job-result/...`) has its own, more generous limit.
- Hitting the limit returns HTTP 429. Back off and retry; don't treat 429 as a fatal error.
For high throughput, run multiple API keys (e.g., one per worker) and stagger submit calls.
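Client-side, the submit limit can be respected with a small sliding-window throttle checked before each call. A sketch in plain Python; this class is ours, not part of the API.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Client-side throttle for the documented 5-requests-per-30-seconds submit limit."""

    def __init__(self, max_requests=5, window_seconds=30.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.sent = deque()  # monotonic timestamps of recent submits

    def wait_time(self, now=None):
        """Seconds to wait before the next submit is allowed (0.0 if allowed now)."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self, now=None):
        """Call once per successful submit."""
        self.sent.append(time.monotonic() if now is None else now)
```

Usage: before each submit, `time.sleep(limiter.wait_time())`, then `limiter.record()` after the request goes out.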
Throughput patterns
Sequential (small jobs)
submit → poll → submit → poll → ...
Simple, slow. Throughput ≈ 1 doc / (extraction time + poll interval).
Pipelined (recommended for moderate volume)
submit doc 1 → submit doc 2 → submit doc 3 → ... → poll all
Submit a batch up to the rate limit, then poll for results in parallel. Throughput ≈ rate_limit / per-doc cost. Practically, for 5 req/30s, that's ~10 docs/minute sustained.
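A sketch of the pipelined pattern with the third-party `requests` library. The pacing and the `fetch` injection point are our own conventions, not part of the API.

```python
import time
import requests  # third-party: pip install requests

SUBMIT_URL = "https://sheets.lido.app/api/v1/extract-file-data"
RESULT_URL = "https://sheets.lido.app/api/v1/job-result/{}"

def submit_batch(payloads, headers, per_window=5, window=30.0):
    """Submit payloads in rate-limit-sized bursts; returns jobIds in input order."""
    job_ids = []
    for i, payload in enumerate(payloads):
        if i and i % per_window == 0:
            time.sleep(window)  # crude pacing for the 5-per-30s submit limit
        resp = requests.post(SUBMIT_URL, headers=headers, json=payload)
        resp.raise_for_status()
        job_ids.append(resp.json()["jobId"])
    return job_ids

def poll_all(job_ids, fetch, interval=10.0, sleep=time.sleep):
    """Poll every pending job each pass; fetch(job_id) returns the result dict."""
    pending, results = set(job_ids), {}
    while pending:
        for job_id in list(pending):
            result = fetch(job_id)
            if result.get("status") in ("complete", "error"):
                results[job_id] = result
                pending.discard(job_id)
        if pending:
            sleep(interval)
    return results

def fetch_result(job_id, headers):
    return requests.get(RESULT_URL.format(job_id), headers=headers).json()
```

In production you would call `poll_all(job_ids, lambda j: fetch_result(j, headers))` and persist each result as it lands; the injectable `fetch` and `sleep` also make the loop testable offline.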
Async with job queue (high volume)
Run a job queue worker that submits and a separate poller that resolves results. Persist job IDs immediately. Use exponential backoff on the poller. Plan for the 24-hour result-expiration window.
Result expiration
- Results are available at `GET /job-result/{jobId}` for 24 hours after job creation.
- After that, the job is purged. The result is gone — but you should have persisted it on receipt.
- Treat the API as ephemeral. Your application is the system of record.
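The simplest way to honor the 24-hour window is to write each result to durable storage the moment the poll succeeds. A minimal stdlib-only sketch; writing one JSON file per job is our convention, not a requirement.

```python
import json
from pathlib import Path

def persist_result(job_id, result, out_dir="extractions"):
    """Write a job result to disk on receipt; the API purges it after 24 hours."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{job_id}.json"  # one file per job, keyed by jobId
    path.write_text(json.dumps(result, indent=2))
    return path
```

In a real pipeline this would be a database insert instead of a file write; the point is that persistence happens before anything else touches the result.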
Error responses
| Code | Meaning | Action |
|---|---|---|
| 400 | Bad request — missing/invalid fields | Check the payload; usually a typo or missing required field |
| 401 | Auth failed | Check the `Authorization` header; the key must carry the `Bearer` prefix |
| 413 | File too large | Switch to multipart for >50 MB; max 500 MB |
| 422 | Configuration invalid | Check column names and page range syntax |
| 429 | Rate limited | Back off; retry with exponential backoff |
| 500 / 502 / 503 | Lido server error | Retry with exponential backoff |
Job-level errors (the file submitted but extraction failed) come back through the poll endpoint with `status: "error"` and an `error` field. These are not HTTP errors — the request itself succeeded.
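One way to route the two kinds of failure in client code is a small classifier over the poll response. A sketch: the API documents only the `"complete"` and `"error"` statuses, so anything else (e.g. a hypothetical `"processing"`) is treated as "still working".

```python
def classify_poll_response(http_status, body):
    """Map a poll response to one of: "retry", "done", "job_failed", "http_error".

    Job-level failures arrive as HTTP 200 with status == "error" in the body;
    they are final and should not be retried with the same configuration.
    """
    if http_status == 429 or http_status >= 500:
        return "retry"          # transient: back off and poll again
    if http_status != 200:
        return "http_error"     # client-side problem (bad jobId, auth, ...)
    status = body.get("status")
    if status == "complete":
        return "done"
    if status == "error":
        return "job_failed"     # extraction failed; inspect body["error"]
    return "retry"              # any other status: job still in flight
```

Keeping this decision in one pure function makes the retry policy easy to test without touching the network.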
Tips
- Always test in the UI first. Always.
- Persist results immediately. They expire in 24 hours.
- Use multipart for files >50 MB. Lower memory cost than base64.
- Run two extractions when you need both header and line items. Single-row + multi-row, same file, two calls.
- Capture jobId in your logs at submit time. When something goes wrong, you'll need it.
- Set conservative timeouts. Network hiccups happen — use 30+ second connect timeouts and retry on transient failure.
- Use environment variables for API keys, never check them into git.
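The timeout-and-retry tip can be sketched as a thin wrapper around `requests.post`. The wrapper name and the injectable `post`/`sleep` parameters are ours; the `(connect, read)` timeout tuple is standard `requests` usage.

```python
import time
import requests  # third-party: pip install requests

TRANSIENT = {429, 500, 502, 503}

def post_with_retry(url, retries=3, backoff=5.0, post=requests.post,
                    sleep=time.sleep, **kwargs):
    """POST with conservative timeouts, retrying transient failures.

    `post` and `sleep` are injectable for testing; the defaults hit the network.
    """
    kwargs.setdefault("timeout", (30, 60))  # (connect, read) seconds
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            resp = post(url, **kwargs)
        except requests.RequestException as exc:  # network hiccup
            last_error = exc
        else:
            if resp.status_code not in TRANSIENT:
                return resp  # success, or a non-retryable client error
            last_error = RuntimeError(f"HTTP {resp.status_code}")
        sleep(backoff * attempt)  # linear backoff between attempts
    raise RuntimeError(f"gave up after {retries} attempts") from last_error
```

Drop-in usage: `post_with_retry(SUBMIT_URL, headers=headers, json=payload)` in place of a bare `requests.post`.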
Common mistakes
- Polling every second. Wastes the rate limit on the poll endpoint and does nothing — extraction takes 10+ seconds. Wait at least 10 seconds before the first poll.
- Forgetting the Bearer prefix. `Authorization: YOUR_API_KEY` is wrong. It must be `Authorization: Bearer YOUR_API_KEY`.
- Submitting the same file in a tight loop. If the result looks wrong, fix the instructions; don't re-submit.
- Mixing up single-row and multi-row. Returning 50 rows when you wanted one summary, or one row when you wanted 50 line items, almost always means `multiRow` is wrong.
- Treating 429 as fatal. It's a "back off" signal, not a failure.
- Not handling result expiration. A polling worker that retries forever may lose results to the 24-hour window.
- Hardcoding API keys in client-side code. API keys belong on the server, in a secrets manager.
Related articles
- Lido API: quickstart and authentication
- API error reference
- Webhooks and async processing
- Extract data from PDFs and documents (UI version of the same logic)
- Improve extraction accuracy
Updated on: 16/04/2026