
Extract data via the API: deep dive

The extract-file-data endpoint is Lido's primary API surface — submit a file, get structured data back. This article goes deeper than the quickstart: the full payload shape, configuration patterns, multi-row vs. single-row, attachments, performance, and code samples in three languages.


For first-time setup (auth, basic submit/poll, getting an API key), see Lido API: quickstart and authentication.



The two-step pattern


Every extraction follows the same shape:


  1. Submit a job: POST /api/v1/extract-file-data → returns jobId.
  2. Poll for results: GET /api/v1/job-result/{jobId} → returns extracted data when complete.


Lido extraction is asynchronous. A typical document takes 10–30 seconds. Don't block; submit, do other work, then poll.



Anatomy of the submit payload


{
  "file": {
    "type": "base64",
    "data": "<base64-encoded file content>",
    "name": "invoice.pdf"
  },
  "columns": ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
  "instructions": "Total Amount is the grand total including tax. Use ISO format YYYY-MM-DD for dates. Vendor Name should match the invoice header, not the bill-to address.",
  "multiRow": false,
  "pageRange": "1-3"
}


Field | Type | Required | Notes
file | object | Yes (or use multipart upload) | See "File upload methods"
columns | array of strings | Yes | The fields to extract. Order matters; output keys preserve column order
instructions | string | No | Free-form guidance to the AI. Use it to clarify ambiguous fields
multiRow | boolean | No (default false) | true for tabular extraction (line items); false for summary fields
pageRange | string | No | Pages to process. "1" = page 1 only; "1-3" = pages 1 to 3; "2,5,7" = those specific pages


The same fields are configurable in the UI's Data Extractor. The fastest path to a correct payload is: build it in the spreadsheet UI, click the API button (bottom-left of the extractor), copy the generated configuration.
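The payload above can also be assembled programmatically. A minimal Python sketch (the `build_payload` helper is ours, not part of the API; it just mirrors the JSON shape documented in the table):

```python
import base64
import json

def build_payload(path, columns, instructions=None, multi_row=False, page_range=None):
    """Assemble the JSON body for POST /api/v1/extract-file-data."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "file": {"type": "base64", "data": data, "name": path.split("/")[-1]},
        "columns": columns,
        "multiRow": multi_row,
    }
    # Optional fields are omitted entirely rather than sent as null
    if instructions:
        payload["instructions"] = instructions
    if page_range:
        payload["pageRange"] = page_range
    return payload
```

Pass the returned dict straight to your HTTP client's JSON parameter; omitting unset optional fields keeps the request body minimal.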



File upload methods


Two ways to send the file. Both end up at the same endpoint; both produce the same jobId and result format.


Method 1: JSON + base64 (max 50 MB)


Encode the file as base64, embed in JSON.


{
  "file": {
    "type": "base64",
    "data": "JVBERi0xLjQKJ...",
    "name": "invoice.pdf"
  },
  "columns": [...]
}


Best for: web frontends, smaller files, simple integrations.


Tradeoff: base64 encoding inflates file size by ~33% in the request body.
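The ~33% figure follows directly from how base64 works: every 3 input bytes become 4 output characters. A quick check:

```python
import base64

raw = b"\x00" * 300_000  # ~300 KB of arbitrary bytes
encoded = base64.b64encode(raw)

# 3 bytes in -> 4 characters out, so the encoded form is 4/3 the size
print(len(encoded) / len(raw))  # 1.333...
```

That overhead is paid on every request, which is why multipart is the better fit for large files.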


Method 2: Multipart form data (max 500 MB)


Upload the file as a normal multipart form field; configuration goes alongside as a JSON string.


curl -X POST 'https://sheets.lido.app/api/v1/extract-file-data' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-F 'file=@invoice.pdf' \
-F 'config={"columns":["Vendor Name","Invoice Number","Total Amount","Due Date"],"instructions":"...","multiRow":false,"pageRange":"1"}'


Best for: large files, server-side integrations, anywhere base64 encoding would be wasteful.
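The same multipart request can be built in Python. This sketch separates the pure part (building the `file` and `config` fields, which mirror the curl call above) from the network call; the helper name is ours:

```python
import json

def build_multipart_parts(file_path, columns, multi_row=False, page_range=None):
    """Build the two multipart fields: the raw file and the JSON config string."""
    config = {"columns": columns, "multiRow": multi_row}
    if page_range:
        config["pageRange"] = page_range
    files = {"file": open(file_path, "rb")}
    data = {"config": json.dumps(config)}
    return files, data

# Usage with the requests library (network call, shown for illustration):
# import requests
# files, data = build_multipart_parts("invoice.pdf", ["Vendor Name", "Total Amount"])
# r = requests.post(
#     "https://sheets.lido.app/api/v1/extract-file-data",
#     headers={"Authorization": "Bearer YOUR_API_KEY"},  # no Content-Type: the client sets the multipart boundary
#     files=files,
#     data=data,
# )
```

Note that you should not set `Content-Type` yourself here; the HTTP client generates the multipart boundary.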



Single-row vs. multi-row extraction


This is the most important configuration decision.


Single-row (multiRow: false)


Returns ONE result object. Use for: invoices (extract Vendor, Total, Date once); contracts (extract Counterparty, Effective Date, Term); receipts (extract Total, Tax, Date).


Output:


{
  "data": [{
    "Vendor Name": "Acme Corp",
    "Invoice Number": "INV-2026-0042",
    "Total Amount": "1234.56",
    "Due Date": "2026-05-15"
  }]
}


Multi-row (multiRow: true)


Returns ONE row per item. Use for: line items in an invoice (one row per line); transactions in a bank statement (one row per transaction); employees in a roster.


Output:


{
  "data": [
    {"Description": "Widget A", "Qty": "5", "Price": "10.00"},
    {"Description": "Widget B", "Qty": "3", "Price": "20.00"},
    {"Description": "Widget C", "Qty": "10", "Price": "5.00"}
  ]
}


You can run both extractions on the same document with two API calls — one for header fields (single-row), one for line items (multi-row). This is exactly the same pattern as Pre-Header + Line Items in the spreadsheet UI.
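A sketch of that two-call pattern in Python: encode the file once, then reuse it in a single-row payload for the header fields and a multi-row payload for the line items (the helper name is ours):

```python
import base64

def header_and_line_item_payloads(path, header_columns, line_item_columns):
    """Build two submit payloads over the same file: one single-row, one multi-row."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    file_part = {"type": "base64", "data": data, "name": path.split("/")[-1]}
    # Same document, different extraction modes
    header = {"file": file_part, "columns": header_columns, "multiRow": False}
    lines = {"file": file_part, "columns": line_item_columns, "multiRow": True}
    return header, lines  # POST each to extract-file-data; poll both jobIds
```

Each payload is submitted as its own job, so the two extractions run (and can be polled) independently.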



Tuning extraction with instructions


instructions is free-form text given to the AI alongside the document. Use it when column names alone are ambiguous. Examples:


  • "Total Amount is the grand total including tax, not the subtotal."
  • "Use ISO format YYYY-MM-DD for all dates. If only a month and year are present, use the first day of the month."
  • "Vendor Name should match the company on the letterhead, not the bill-to address."
  • "Phone numbers should be E.164 format (+15555551234)."
  • "For Status, return one of: Active, Pending, Closed."


A few sentences of instructions usually improve accuracy more than reformatting column names does. Iterate in the UI first; whatever instructions work there work in the API.



Polling pattern


import requests, time, base64

API_KEY = "YOUR_API_KEY"
SUBMIT_URL = "https://sheets.lido.app/api/v1/extract-file-data"
RESULT_URL = "https://sheets.lido.app/api/v1/job-result/{}"

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

with open("invoice.pdf", "rb") as f:
    file_b64 = base64.b64encode(f.read()).decode("utf-8")

submit = requests.post(SUBMIT_URL, headers=headers, json={
    "file": {"type": "base64", "data": file_b64, "name": "invoice.pdf"},
    "columns": ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
    "instructions": "Use ISO date format.",
    "multiRow": False,
    "pageRange": "1"
})
job_id = submit.json()["jobId"]

# Wait at least 10 seconds before the first poll
time.sleep(10)

backoff = 5
while True:
    r = requests.get(RESULT_URL.format(job_id), headers=headers).json()
    status = r.get("status")
    if status == "complete":
        print(r["data"])
        break
    if status == "error":
        raise Exception(f"Extraction failed: {r.get('error')}")
    time.sleep(backoff)
    backoff = min(backoff * 1.5, 30)



Node.js example


const fs = require("fs");
const fetch = require("node-fetch");

const API_KEY = process.env.LIDO_API_KEY;
const headers = {
  "Authorization": `Bearer ${API_KEY}`,
  "Content-Type": "application/json"
};

async function main() {
  const fileData = fs.readFileSync("invoice.pdf").toString("base64");

  const submit = await fetch("https://sheets.lido.app/api/v1/extract-file-data", {
    method: "POST",
    headers,
    body: JSON.stringify({
      file: { type: "base64", data: fileData, name: "invoice.pdf" },
      columns: ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
      multiRow: false,
      pageRange: "1"
    })
  }).then(r => r.json());

  const jobId = submit.jobId;

  // Wait at least 10 seconds before the first poll
  await new Promise(r => setTimeout(r, 10000));

  let backoff = 5000;
  while (true) {
    const result = await fetch(
      `https://sheets.lido.app/api/v1/job-result/${jobId}`,
      { headers }
    ).then(r => r.json());

    if (result.status === "complete") {
      console.log(result.data);
      break;
    }
    if (result.status === "error") throw new Error(result.error);

    await new Promise(r => setTimeout(r, backoff));
    backoff = Math.min(backoff * 1.5, 30000);
  }
}

main().catch(err => { console.error(err); process.exit(1); });



Ruby example (multipart)


require "net/http"
require "uri"
require "json"

API_KEY = ENV["LIDO_API_KEY"]
SUBMIT_URL = URI("https://sheets.lido.app/api/v1/extract-file-data")

config = {
  columns: ["Vendor Name", "Invoice Number", "Total Amount", "Due Date"],
  multiRow: false,
  pageRange: "1"
}.to_json

req = Net::HTTP::Post.new(SUBMIT_URL)
req["Authorization"] = "Bearer #{API_KEY}"
req.set_form([
  ["file", File.open("invoice.pdf")],
  ["config", config]
], "multipart/form-data")

res = Net::HTTP.start(SUBMIT_URL.host, SUBMIT_URL.port, use_ssl: true) { |http| http.request(req) }
job_id = JSON.parse(res.body)["jobId"]
puts "jobId: #{job_id}"
# Poll job-result/{jobId} as in the Python/JS examples



Rate limits


  • 5 requests per 30 seconds per API key, applied to the submit endpoint.
  • The poll endpoint (GET /job-result/...) has its own, more generous limit.
  • Hitting the limit returns HTTP 429. Back off and retry; don't treat 429 as a fatal error.


For high throughput, run multiple API keys (e.g., one per worker) and stagger submit calls.
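One way to honor 429s is a small retry wrapper around the submit call. This sketch retries any callable that signals "rate limited", with exponential backoff; the helper, exception class, and parameter defaults are ours:

```python
import time

class RateLimitedError(Exception):
    """Raise this from your HTTP layer when the submit endpoint returns 429."""

def submit_with_backoff(do_submit, max_attempts=6, base_delay=5.0, cap=60.0):
    """Call do_submit(); on RateLimitedError, sleep and retry with exponential backoff."""
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return do_submit()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 429 to the caller
            time.sleep(delay)
            delay = min(delay * 2, cap)
```

Wrapping the submit call (rather than sprinkling retries through the code) keeps the 429 handling in one place.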



Throughput patterns


Sequential (small jobs)


submit → poll → submit → poll → ...


Simple, slow. Throughput ≈ 1 doc / (extraction time + poll interval).



Batched (medium volume)


submit doc 1 → submit doc 2 → submit doc 3... → poll all


Submit a batch up to the rate limit, then poll for results in parallel. Throughput ≈ rate_limit / per-doc cost. Practically, for 5 req/30s, that's ~10 docs/minute sustained.
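The batch schedule can be computed up front: with a 5-requests-per-30-seconds limit, document i goes out in window i // 5, i.e. at offset (i // 5) * 30 seconds from the start. A sketch (the helper name is ours):

```python
def submit_offsets(n_docs, window_size=5, window_seconds=30):
    """Seconds-from-start at which each submit call may be made without tripping the limit."""
    return [(i // window_size) * window_seconds for i in range(n_docs)]

# 12 documents: five at t=0, five at t=30, two at t=60
print(submit_offsets(12))
```

Precomputing offsets this way lets a batch worker sleep until each document's slot instead of reacting to 429s.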


Async with job queue (high volume)


Run a job queue worker that submits and a separate poller that resolves results. Persist job IDs immediately. Use exponential backoff on the poller. Plan for the 24-hour result-expiration window.



Result expiration


  • Results are available at GET /job-result/{jobId} for 24 hours after job creation.
  • After that, the job is purged. The result is gone — but you should have persisted it on receipt.
  • Treat the API as ephemeral. Your application is the system of record.
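A simple guard against the 24-hour window: record the submit time alongside each jobId, and have the poller stop retrying jobs that have aged out. A sketch (how you persist the timestamps is up to you):

```python
import time

RESULT_TTL_SECONDS = 24 * 60 * 60  # results are purged 24 hours after job creation

def is_expired(submitted_at, now=None):
    """True once a job's result can no longer be fetched from job-result/{jobId}."""
    now = time.time() if now is None else now
    return now - submitted_at > RESULT_TTL_SECONDS
```

A poller that checks this before each retry fails fast on lost jobs instead of looping forever.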



Error responses


Code | Meaning | Action
400 | Bad request (missing/invalid fields) | Check the payload; usually a typo or a missing required field
401 | Auth failed | Check the Authorization: Bearer YOUR_API_KEY header
413 | File too large | Switch to multipart for files over 50 MB; max 500 MB
422 | Configuration invalid | Check column names and page-range syntax
429 | Rate limited | Back off; retry with exponential backoff
500 / 502 / 503 | Lido server error | Retry with exponential backoff

Job-level errors (the file submitted but extraction failed) come back through the poll endpoint with status: "error" and an error field. These are not HTTP errors — the request itself succeeded.
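A small classifier keeps the two failure modes distinct: HTTP-level errors are retried by the transport layer, while a status: "error" body is surfaced as a permanent extraction failure. A sketch (the names are ours; the response fields match this article):

```python
class ExtractionFailed(Exception):
    """The job ran and failed; re-submitting the same payload will not help."""

def interpret_poll_body(body):
    """Map a job-result response body to (done, data)."""
    status = body.get("status")
    if status == "complete":
        return True, body["data"]
    if status == "error":
        # Job-level failure: the HTTP request succeeded, the extraction did not
        raise ExtractionFailed(body.get("error", "unknown extraction error"))
    return False, None  # still processing; poll again later
```

Raising a distinct exception for job-level failures means retry logic upstream can treat it differently from a transient 429 or 503.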



Tips


  • Always test in the UI first. Always.
  • Persist results immediately. They expire in 24 hours.
  • Use multipart for files >50 MB. Lower memory cost than base64.
  • Run two extractions when you need both header and line items. Single-row + multi-row, same file, two calls.
  • Capture jobId in your logs at submit time. When something goes wrong, you'll need it.
  • Set conservative timeouts. Network hiccups happen — use 30+ second connect timeouts and retry on transient failure.
  • Use environment variables for API keys, never check them into git.



Common mistakes


  • Polling every second. Wastes the poll endpoint's rate limit and gains nothing, because extraction takes 10+ seconds. Wait at least 10 seconds before the first poll.
  • Forgetting the Bearer prefix. Authorization: YOUR_API_KEY is wrong. Must be Bearer YOUR_API_KEY.
  • Submitting the same file in a tight loop. If the result looks wrong, fix instructions; don't re-submit.
  • Mixing up single-row and multi-row. Returning 50 rows when you wanted one summary, or one row when you wanted 50 line items, almost always means multiRow is wrong.
  • Treating 429 as fatal. It's a "back off" signal, not a failure.
  • Not handling result expiration. A polling worker that retries forever may lose results to the 24-hour window.
  • Hardcoding API keys in client-side code. API keys belong on the server, in a secrets manager.




Related articles

  • Lido API: quickstart and authentication
  • API error reference
  • Webhooks and async processing
  • Extract data from PDFs and documents (UI version of the same logic)
  • Improve extraction accuracy

Updated on: 16/04/2026
