Super-powered Extra Instructions (i.e., Directives)

Directives are deterministic instructions you add in the Extra Instructions section of your Data Extractor to fine‑tune extraction.

Each directive has a specific syntax and must appear on its own line at the bottom of the Extra Instructions.
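If you assemble your Extra Instructions text in a script before pasting it into the Data Extractor, the placement rule above can be sketched as follows. This is a minimal illustration only: the helper name `with_directives` is hypothetical and not part of the product, and the sample instruction text is made up.

```python
def with_directives(instructions: str, directives: list[str]) -> str:
    """Return the instructions with each directive on its own line at the bottom.

    Hypothetical helper for illustration; it only arranges text.
    """
    body = instructions.rstrip()
    lines = [d.strip() for d in directives if d.strip()]
    for line in lines:
        # Every directive in this article starts with "@".
        if not line.startswith("@"):
            raise ValueError(f"not a directive: {line!r}")
    return "\n".join([body, *lines]) if body else "\n".join(lines)


text = with_directives(
    "Extract the invoice number and total from each page.",
    ["@parallel:true", "@exclude_pages: skip if page has no dollar amount"],
)
print(text)
```

The printed text has the free-form instructions first, followed by one directive per line.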

File Extractor

@parallel:true / @parallel:false

What is this?

Controls whether the extractor treats each page separately or processes the whole document as one unit.

Use when

@parallel:true: Each page stands on its own. Example: one form or one table per page that does not need data from other pages.

@parallel:false: Pages depend on each other. Example: page 1 has the employee name and later pages list that employee’s hours. You want the extractor to keep that connection.

Why it helps

@parallel:true: In many documents, every page follows the same structure and contains all the data needed for its own spreadsheet row. In that case, parallel mode speeds up processing while keeping rows clean (for example, one spreadsheet row per page, where each row holds only data from its own page).

@parallel:false: Some documents rely on context set earlier. @parallel:false tells the tool to process the uploaded document as a single unit. This prevents missing fields or mismatched rows when data spans pages (for example, when you want to output a spreadsheet row that combines data from page 1 and page 2).

How to use

Add one line to the end of your Extra Instructions:

@parallel:true
or
@parallel:false

If you must use parallel mode but still need context from earlier or later pages when extracting data from a given page, add @parallel_extended_context: <field> (see below).

Notes

Rule of thumb: @parallel:false works best for documents of at most 20 pages in which later pages rely on earlier pages, or vice versa.
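The guidance in this section can be read as a small decision rule. The sketch below is one interpretation of it, not product logic: the function name and parameters are hypothetical, the 20-page threshold comes from the rule of thumb above, and falling back to @parallel:true plus @parallel_extended_context for longer documents is an assumption drawn from the "How to use" paragraph.

```python
def suggest_parallel_directives(
    page_count: int,
    pages_depend_on_each_other: bool,
    shared_field: str = "Employee name",
) -> list[str]:
    """Suggest directive lines for a document (hypothetical helper)."""
    if not pages_depend_on_each_other:
        # Each page stands on its own: process pages separately.
        return ["@parallel:true"]
    if page_count <= 20:
        # Short document with cross-page context: process it as one unit.
        return ["@parallel:false"]
    # Assumption: for longer documents, stay in parallel mode and carry the
    # shared field across pages instead.
    return ["@parallel:true", f"@parallel_extended_context: {shared_field}"]


one_form_per_page = suggest_parallel_directives(40, False)
short_timesheet = suggest_parallel_directives(12, True)
long_timesheet = suggest_parallel_directives(60, True)
```

Treat the result as a starting point and adjust it after checking the output rows.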

@exclude_pages:

What is this?

Skips pages you don’t want turned into spreadsheet rows.

Use when

You are seeing data output from irrelevant pages such as cover sheets, marketing content, blank pages, or Terms and Conditions.

Your output shows junk rows that you would not expect to see.

Why it helps

Reduces false positives and speeds up processing because fewer pages are analyzed.

How to use

Write a short rule on its own line at the bottom of your Extra Instructions that says what makes a page skippable. Examples you can paste and tweak:

@exclude_pages: skip if page contains 'Terms and Conditions'
@exclude_pages: skip if page has no dollar amount
@exclude_pages: skip if page only has a cover title and no tables or totals

@parallel_extended_context:

What is this?

Tells the extractor to remember a field that appears rarely and reuse it when extracting data from later pages, until a new value for that field is found. Example fields: employee name, table headers, category label.

Use when

A key value appears only once or only occasionally but applies to many pages after it.
Later pages need that earlier value to make sense.

Why it helps

Prevents missing or misaligned data when the page being processed does not repeat a header, name, label, or other value that its spreadsheet row depends on (for example, when outputting timesheet data for an employee whose hours are on page 3 but whose name appears only on page 1).

How to use

Add one line of its own at the end of your Extra Instructions naming the field to remember. For example:

@parallel_extended_context: Employee name
@parallel_extended_context: Table headers
@parallel_extended_context: Category

Notes

Works best with @parallel:true.
If the file is short (fewer than 20 pages), @parallel:false can solve the same issue by processing the whole document together, so that the pages before and after a given page are taken into account when its data is extracted.
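The remember-and-reuse behavior described in this section can be pictured as a forward fill over pages. The sketch below is a conceptual model only, not the extractor's implementation: the function `carry_forward`, the page dictionaries, and the sample names and hours are all invented for illustration.

```python
from typing import Optional


def carry_forward(pages: list[dict], field: str) -> list[dict]:
    """Fill `field` on every page from the most recent page that had a value."""
    remembered: Optional[str] = None
    rows = []
    for page in pages:
        value = page.get(field)
        if value:
            # A new value replaces the remembered one from here on.
            remembered = value
        rows.append({**page, field: remembered})
    return rows


# Made-up data: the name appears on the first page of each employee's
# section only; the pages in between list hours without repeating it.
pages = [
    {"Employee name": "Ana", "hours": 8},
    {"hours": 7},
    {"hours": 6},
    {"Employee name": "Ben", "hours": 5},
    {"hours": 4},
]
rows = carry_forward(pages, "Employee name")
names = [row["Employee name"] for row in rows]
```

Every row ends up with a name, and the name switches as soon as a page supplies a new one.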

@deduplicate: files

What is this?

Prevents the same document from being processed multiple times by tracking documents based on their content.

Use when

You are uploading files that may be duplicates (for example, the same invoice or form sent multiple times).
You want to avoid creating duplicate rows from the same document.

Why it helps

Prevents duplicate data in your spreadsheet when the same file is uploaded again.
Saves processing time and page credits by rejecting duplicate documents immediately.

How to use

Add one line at the end of your Extra Instructions:

@deduplicate: files

Notes

Works only with file extractors. Email extraction does not currently support deduplication.
Deduplication is based on file content, not filename. Renaming a file will not bypass deduplication.
When a duplicate is detected, you will receive an error: "Duplicate document: this document has already been processed and deduplication is enabled."
If you want to reprocess a document, just click the Reset button and run the processing again.
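To see why renaming a file does not bypass deduplication, consider a fingerprint computed from the file's bytes alone. The sketch below illustrates that general idea. How the product actually identifies content is not described in this article, so the use of SHA-256 and the names `content_fingerprint`, `is_duplicate`, and `seen` are assumptions made for illustration.

```python
import hashlib

seen: set[str] = set()  # fingerprints of documents already processed


def content_fingerprint(data: bytes) -> str:
    """Fingerprint that depends only on the bytes, never on the filename."""
    return hashlib.sha256(data).hexdigest()


def is_duplicate(data: bytes) -> bool:
    """Return True if identical content was seen before; otherwise record it."""
    fingerprint = content_fingerprint(data)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False


invoice = b"%PDF-1.7 example invoice bytes"
first_upload = is_duplicate(invoice)      # new content
renamed_upload = is_duplicate(invoice)    # same bytes under another filename
other_upload = is_duplicate(b"%PDF-1.7 a different invoice")
```

In this model the filename never enters the calculation, so only a change to the content produces a new fingerprint.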

@deep_thinking: 0 | 1 | 2 | 3

What is this?

Lets the AI spend more time thinking before it answers.

Use when

The extraction requires careful logic (for example, multi‑step totals, conditional rules, reasoning across columns, or reordering data before it is output), or simple extraction is inconsistent because the file is very complex and needs many Extra Instructions to produce the right data.

Why it helps

More reasoning time can improve accuracy on complex files.

How to use

Add one line with a level:

@deep_thinking: 0 (default)
@deep_thinking: 1
@deep_thinking: 2
@deep_thinking: 3

Notes

Higher is not always better. Try level 1 or 2 first.

Email Extractor

@attachments_only

What is this?

Extracts data only from attachments, not from the email body.

Use when

Attachments contain the data you need (for example, invoice PDFs) and the email body is not needed.

Why it helps

Avoids mixing email body text with attachment data. Keeps rows focused and predictable.

How to use

Add one line at the end of your Extra Instructions:

@attachments_only

Notes

If important data is in the email body, do not use this.

@skip_attachments:

What is this?

Skips attachments that are not relevant.

Use when

Some attachments should not be processed (for example, non‑invoices).

Why it helps

Reduces noise and speeds processing by excluding files you do not need.

How to use

Provide a simple filter rule at the end of your Extra Instructions. Examples:

@skip_attachments: skip if not a PDF
@skip_attachments: skip if filename contains 'terms'
@skip_attachments: skip if not an invoice

Notes

Works with or without @attachments_only.

OCR Mode

What is this?

OCR (Optical Character Recognition) turns text you can see in a PDF image into real, machine‑readable text. It helps when the PDF has no reliable text layer (for example, you try to highlight a word and either nothing or a huge block gets highlighted).

Use when

Data is coming out incorrect and you can’t reliably select/copy text in the PDF (nothing highlights or the wrong blocks highlight).
Forms with checkboxes/radio buttons or complex layouts confuse the default extractor.

Why it helps

Makes the visible page content usable as text for extraction.
Two modes:
- @ocr_mode: vision: Use this mode when data is extracted incorrectly AND the PDF has some text you can highlight. Also helpful for PDFs with form layouts (for example, forms with radio buttons or checkboxes).
- @ocr_mode: vision_only: Use this mode when data is extracted incorrectly AND the PDF has NO text you can highlight.

How to use

Add one independent line at the end of your Extra Instructions:

@ocr_mode: vision
or
@ocr_mode: vision_only
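The choice between the two modes can be summarized as a short decision function. This is a sketch of the guidance in this section, not product behavior: the function name and its two boolean inputs are hypothetical, and you answer them yourself by inspecting the output and trying to highlight text in the PDF.

```python
from typing import Optional


def suggest_ocr_mode(data_is_wrong: bool, has_selectable_text: bool) -> Optional[str]:
    """Suggest an OCR directive line, or None if the default extraction is fine."""
    if not data_is_wrong:
        # Extraction already looks right: no OCR directive needed.
        return None
    if has_selectable_text:
        # Some text can be highlighted (this mode also suits form layouts
        # with checkboxes or radio buttons).
        return "@ocr_mode: vision"
    # Nothing in the PDF can be highlighted.
    return "@ocr_mode: vision_only"


scanned_pdf = suggest_ocr_mode(data_is_wrong=True, has_selectable_text=False)
partly_readable_pdf = suggest_ocr_mode(data_is_wrong=True, has_selectable_text=True)
clean_pdf = suggest_ocr_mode(data_is_wrong=False, has_selectable_text=True)
```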

Quick reference

Independent pages → @parallel:true
Cross‑page context → @parallel:false or @parallel_extended_context
Irrelevant pages → @exclude_pages:<instructions>
Complex logic → increase @deep_thinking one level
Email attachments only → @attachments_only
Skip unwanted attachments → @skip_attachments:<rule>
Data is extracted incorrectly AND PDF has some highlightable text OR the PDF has form fields (e.g., checkboxes) → @ocr_mode: vision
Data is extracted incorrectly AND PDF has NO highlightable text → @ocr_mode: vision_only
Prevent duplicate file processing → @deduplicate: files

Updated on: 09/12/2025
