PubMed-OCR: PMC Open Access OCR Annotations
By: Hunter Heidenreich, Yosheb Getachew, Olivia Dinica and more
Potential Business Impact:
Helps computers read science papers accurately.
PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
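The abstract mentions word-level bounding boxes and heuristic line reconstruction without spelling out the heuristic. The sketch below is one plausible way to group word boxes into lines by vertical overlap. It is not the authors' method and does not use the released schema: the `Word` fields, the `group_into_lines` function, and the `min_overlap` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record: text plus an axis-aligned box in page pixels.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(words, min_overlap=0.5):
    """Greedily group words into lines by vertical overlap (illustrative heuristic)."""
    lines = []
    # Visit words top to bottom, then left to right.
    for word in sorted(words, key=lambda w: (w.y0, w.x0)):
        placed = False
        for line in lines:
            top = min(w.y0 for w in line)
            bottom = max(w.y1 for w in line)
            overlap = min(bottom, word.y1) - max(top, word.y0)
            shorter = min(bottom - top, word.y1 - word.y0)
            # Join the line when the overlap covers enough of the shorter extent.
            if shorter > 0 and overlap / shorter >= min_overlap:
                line.append(word)
                placed = True
                break
        if not placed:
            lines.append([word])
    # Emit each line's words in left-to-right reading order.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x0)) for line in lines]


sample = [
    Word("corpus", 85, 11, 130, 23),
    Word("PubMed-OCR", 10, 10, 80, 22),
    Word("articles", 30, 31, 90, 43),
    Word("of", 10, 30, 25, 42),
]
lines = group_into_lines(sample)
print(lines)
```

A greedy overlap rule like this is simple but fragile on multi-column pages, where words from adjacent columns share the same vertical band; that kind of failure is one reason a heuristic line layer is worth treating as a limitation.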