OCR PDF
Adds a searchable text layer to scanned PDF documents.
Overview
The OCR PDF node processes scanned PDF documents using Optical Character Recognition (OCR) to add an invisible text layer. This makes the PDF searchable and selectable while preserving the original layout.
Use it to:
- Make scanned documents searchable
- Preprocess scanned PDFs before data extraction
- Add text layers to image-based PDFs
Parameters
| Parameter | Description | Required |
|---|---|---|
| File | PDF file to process (supports expressions) | Yes |
| File Destination | Where to save the OCR-processed PDF | Yes |
| File Name | Output filename without extension | No |
File
The PDF file to run OCR on. It typically comes from a trigger or a file operation:
{{$item.data.file}}
Settings
| Setting | Description |
|---|---|
| Execution Mode | Once per item (default) or Once |
| Output Mode | How to output results when running once |
| Batch Size | Items to process concurrently (default 5) |
| Stop on Error | Stop workflow on failure |
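One way to picture how Batch Size and Stop on Error might interact is the sketch below. It is an illustration only, not the node's actual implementation: `ocr_pdf` and `run_batches` are hypothetical names, and it assumes that items are grouped into fixed batches, that each batch finishes before the next starts, and that a failure stops the run after the current batch.

```python
import asyncio


async def ocr_pdf(item: dict) -> dict:
    """Hypothetical stand-in for the OCR step (not the real node)."""
    await asyncio.sleep(0)  # yield control, simulating I/O-bound work
    if item.get("corrupt"):
        raise ValueError(f"OCR failed for {item['name']}")
    return {"file": {"name": item["name"] + ".pdf"}}


async def run_batches(items, batch_size=5, stop_on_error=True):
    """Process items in batches of `batch_size`; return (results, errors)."""
    results, errors = [], []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Items within one batch run concurrently; batches run one after another.
        outcomes = await asyncio.gather(
            *(ocr_pdf(item) for item in batch), return_exceptions=True
        )
        for outcome in outcomes:
            if isinstance(outcome, Exception):
                errors.append(str(outcome))
            else:
                results.append(outcome)
        if errors and stop_on_error:
            break  # assumed behavior: skip the remaining batches
    return results, errors


# Seven items with the default batch size of 5: two batches, one failing item.
items = [{"name": f"scan-{n}"} for n in range(7)]
items[5]["corrupt"] = True
results, errors = asyncio.run(run_batches(items, batch_size=5))
```

In this sketch a larger batch size means fewer, wider batches. With `stop_on_error=False`, failures are collected and the remaining batches still run.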
Output
{
  "file": {
    "type": "fileData",
    "name": "document-ocr.pdf",
    "mimeType": "application/pdf",
    "fileInfo": { "type": "..." }
  }
}
Access in expressions:
- File object: {{$item.data.file}}
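If the output is handed to custom code downstream, a small shape check can catch unexpected items early. The sketch below is a hedged example: `is_ocr_pdf_output` is a hypothetical helper, and it checks only the fields visible in the sample output. The contents of `fileInfo` are elided in the sample and are not inspected.

```python
import json

# Sample output copied from this section; "fileInfo" is elided there and kept as-is.
sample = json.loads("""
{
  "file": {
    "type": "fileData",
    "name": "document-ocr.pdf",
    "mimeType": "application/pdf",
    "fileInfo": { "type": "..." }
  }
}
""")


def is_ocr_pdf_output(data: dict) -> bool:
    """Return True if `data` carries the file fields shown in the sample output."""
    file = data.get("file")
    if not isinstance(file, dict):
        return False
    return (
        file.get("type") == "fileData"
        and file.get("mimeType") == "application/pdf"
        and isinstance(file.get("name"), str)
    )


ok = is_ocr_pdf_output(sample)
```

The check is deliberately loose: it makes no assumption about how the output file is named or what `fileInfo` contains.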
Examples
OCR Before Data Extraction
Preprocess scanned invoices for extraction:
[Google Drive Trigger] → [OCR PDF] → [Data Extractor] → [Insert Rows]
Batch OCR Scanned Documents
Process all scanned PDFs from a folder:
[OneDrive Trigger] → [OCR PDF] → [Copy File (processed folder)]
OCR and Archive
OCR email attachments and copy the results to an archive:
[Lido Mailbox Trigger] → [Split (attachments)] → [OCR PDF] → [Copy File (archive)]
Tips
- Use OCR PDF before Data Extractor when processing scanned documents
- Already-searchable PDFs can still be processed; the text layer is added or updated
- This is a long-running operation; processing time depends on document size and page count
- The original PDF layout and visual content are preserved
- Connect the error output to handle OCR failures