Image-only PDFs (scanned documents) contain no machine-readable text — just pictures of pages. The upgrade pipeline runs a three-stage process, two stages of which are AI-driven, to transform them into proper archival documents.
Three-stage upgrade pipeline
1. Vision AI transcription — Each page is rasterized at high resolution and sent to a local vision language model (LLaVA). The model reads the page as a human would, outputting structured Markdown that preserves headings, paragraphs, lists, and tables — far more context-aware than traditional character-recognition OCR.
2. OCRmyPDF PDF/A rebuild — The transcribed text is embedded as an invisible layer using OCRmyPDF, which rebuilds the file as PDF/A (ISO 19005-3 compliant) and also deskews pages, corrects rotation, and cleans scan artifacts. If OCRmyPDF is unavailable, a pdf-lib fallback embeds the vision-AI text directly.
3. Metadata enrichment — Claude Haiku analyzes the hosting page and a document excerpt to infer a title, author, and one-sentence description. These are embedded as standard PDF document properties (/Title, /Author, /Subject, /Keywords) readable by any PDF viewer or cataloging tool.
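The transcription step above can be sketched as a request to a local vision-LM server. This is a minimal sketch assuming an Ollama-style `/api/generate` endpoint, which accepts base64-encoded images in an `images` list; the prompt wording and model name are illustrative assumptions, not the pipeline's exact code.

```python
import base64
import json

def build_llava_request(page_png: bytes, model: str = "llava") -> dict:
    """Build a request body for a local vision-LM endpoint.

    Payload shape follows Ollama's /api/generate convention; the
    prompt text and model name are assumptions for illustration.
    """
    prompt = (
        "Transcribe this scanned page as structured Markdown, "
        "preserving headings, paragraphs, lists, and tables."
    )
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(page_png).decode("ascii")],
        "stream": False,  # ask for one complete transcription, not chunks
    }

# One request per rasterized page; `page_png` here is placeholder bytes.
request_body = build_llava_request(b"\x89PNG placeholder")
print(json.dumps(request_body)[:80])
```

The returned Markdown (in the response's text field) would then feed the rebuild step.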
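The PDF/A rebuild in step 2 boils down to an OCRmyPDF invocation. The sketch below assembles one: `--output-type pdfa-3`, `--deskew`, `--rotate-pages`, and `--clean` are real OCRmyPDF flags (`--clean` requires the `unpaper` tool); the file names are placeholders, and injecting the vision-AI text layer in place of OCRmyPDF's own OCR output is not shown here.

```python
import shlex

def build_rebuild_command(input_pdf: str, output_pdf: str) -> list[str]:
    """Assemble the OCRmyPDF command line for the PDF/A rebuild step."""
    return [
        "ocrmypdf",
        "--output-type", "pdfa-3",  # ISO 19005-3 archival profile
        "--deskew",                 # straighten tilted scans
        "--rotate-pages",           # correct page rotation
        "--clean",                  # remove scan artifacts before OCR
        input_pdf,
        output_pdf,
    ]

print(shlex.join(build_rebuild_command("scan.pdf", "scan-pdfa.pdf")))
```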
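The enrichment output from step 3 maps onto the standard PDF Info-dictionary keys named above. A minimal sketch, assuming provenance is packed into `/Keywords` (the exact separator format is an assumption); the resulting mapping could be applied with a library such as pikepdf by assigning each key on the document's info dictionary.

```python
def build_docinfo(title: str, author: str, description: str,
                  source_url: str, processed_on: str) -> dict:
    """Map enrichment results onto standard PDF document properties."""
    return {
        "/Title": title,
        "/Author": author,
        "/Subject": description,  # the one-sentence description
        # provenance: original URL and processing date, as keywords
        "/Keywords": f"source:{source_url} processed:{processed_on}",
    }

info = build_docinfo(
    "Annual Report 2003", "City Archives",
    "Scanned municipal annual report.",
    "https://example.org/report.pdf", "2024-05-01",
)
print(info["/Keywords"])
```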
- 📄 Searchable text layer — invisibly overlaid on each page; the document looks identical but is fully searchable and copy-pasteable.
- 📋 Embedded metadata — title, author, subject, keywords, and source URL baked into the PDF document properties.
- 🗄 PDF/A-3 archival standard (ISO 19005-3) — fonts embedded, no external dependencies, self-contained and renderer-independent for decades-long preservation.
- ♿ Accessibility — screen readers, translation tools, and assistive technology can read the content.
- 🔍 Full-text indexable — search engines, RAG pipelines, and institutional repositories can extract content without manual work.
- 🔗 Provenance — the original URL and processing date are embedded as keywords, permanently recording the document's origin.
Upgraded PDFs are drop-in replacements: they keep the original filename and remain compatible with all standard PDF viewers. Site owners can replace files one by one or batch-rsync the entire upgraded set.