
OCR PDF

Adds a searchable text layer to scanned PDF documents.

Overview

The OCR PDF node processes scanned PDF documents using Optical Character Recognition (OCR) to add an invisible text layer. This makes the PDF searchable and selectable while preserving the original layout.

Use it to:

  • Make scanned documents searchable
  • Preprocess scanned PDFs before data extraction
  • Add text layers to image-based PDFs

Parameters

Parameter         | Description                                 | Required
File              | PDF file to process (supports expressions)  | Yes
File Destination  | Where to save the OCR-processed PDF         | Yes
File Name         | Output filename without extension           | No

File

The PDF file to OCR. Typically comes from a trigger or file operation:

{{$item.data.file}}

Settings

Setting         | Description
Execution Mode  | Once per item (default) or Once
Output Mode     | How to output results when running once
Batch Size      | Items to process concurrently (default 5)
Stop on Error   | Stop workflow on failure
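
To build intuition for how Batch Size and Stop on Error interact, here is a minimal Python sketch. It is not the node's implementation. It assumes that Batch Size means "run up to N items at the same time, one group after another" and that Stop on Error aborts on the first failure. The names `ocr_item` and `run_batches` and the `-ocr.pdf` suffix are hypothetical and used for illustration only.

```python
import asyncio


async def ocr_item(item: str) -> str:
    # Hypothetical stand-in for the OCR step; a real run takes much longer.
    await asyncio.sleep(0)
    if not item.endswith(".pdf"):
        raise ValueError(f"not a PDF: {item}")
    # Assumed naming, mirroring the sample output name "document-ocr.pdf".
    return item.removesuffix(".pdf") + "-ocr.pdf"


async def run_batches(items, batch_size=5, stop_on_error=False):
    """Process items in groups of batch_size, each group concurrently."""
    results, errors = [], []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # return_exceptions=True keeps one failure from cancelling the group.
        outcomes = await asyncio.gather(
            *(ocr_item(i) for i in batch), return_exceptions=True
        )
        for item, outcome in zip(batch, outcomes):
            if isinstance(outcome, Exception):
                if stop_on_error:
                    raise outcome  # assumed "Stop on Error" behavior
                errors.append((item, str(outcome)))
            else:
                results.append(outcome)
    return results, errors


results, errors = asyncio.run(
    run_batches(["a.pdf", "b.txt", "c.pdf"], batch_size=2)
)
print(results)  # successfully processed items
print(errors)   # (item, message) pairs for failed items
```

With `stop_on_error=False`, failed items are collected and the remaining items still run, which roughly corresponds to routing failures to an error output instead of stopping the workflow.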

Output

{
  "file": {
    "type": "fileData",
    "name": "document-ocr.pdf",
    "mimeType": "application/pdf",
    "fileInfo": { "type": "..." }
  }
}

Access in expressions:

  • File object: {{$item.data.file}}
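
If the node's output is inspected outside the workflow (for example, in a test script that receives the JSON), a small shape check can catch unexpected payloads early. The sketch below is in Python and relies only on the fields shown in the sample output above. The function name `is_ocr_pdf_output` is hypothetical, and the assumption that the file name always ends in `.pdf` comes from the sample, not from a documented guarantee.

```python
import json


def is_ocr_pdf_output(payload: dict) -> bool:
    """Return True if payload matches the shape of the sample output."""
    file_obj = payload.get("file")
    if not isinstance(file_obj, dict):
        return False
    name = file_obj.get("name")
    return (
        file_obj.get("type") == "fileData"
        and file_obj.get("mimeType") == "application/pdf"
        # Assumption based on the sample: output names end in ".pdf".
        and isinstance(name, str)
        and name.lower().endswith(".pdf")
    )


# The sample output from this page, with the elided fileInfo left as-is.
sample = json.loads("""
{
  "file": {
    "type": "fileData",
    "name": "document-ocr.pdf",
    "mimeType": "application/pdf",
    "fileInfo": {"type": "..."}
  }
}
""")

ok = is_ocr_pdf_output(sample)
bad = is_ocr_pdf_output(
    {"file": {"type": "fileData", "name": "scan.png", "mimeType": "image/png"}}
)
print(ok, bad)
```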

Examples

OCR Before Data Extraction

Preprocess scanned invoices for extraction:

[Google Drive Trigger] → [OCR PDF] → [Data Extractor] → [Insert Rows]

Batch OCR Scanned Documents

Process all scanned PDFs from a folder:

[OneDrive Trigger] → [OCR PDF] → [Copy File (processed folder)]

OCR and Archive

[Lido Mailbox Trigger] → [Split (attachments)] → [OCR PDF] → [Copy File (archive)]

Tips

  • Use OCR PDF before Data Extractor when processing scanned documents
  • Already-searchable PDFs can still be processed; the text layer is added or updated
  • This is a long-running operation: processing time depends on document size and page count
  • The original PDF layout and visual content are preserved
  • Connect error output to handle OCR failures