Blazing Fast PDF to PNG Conversion with SIMD and PDFium
The author developed a Python library, fastpdf2png, that converts PDFs to PNGs at up to 1,500 pages per second with 8 workers, which the author reports is significantly faster than existing solutions such as PyMuPDF, MuPDF, and ImageMagick.
Why it matters
fastpdf2png offers a fast, efficient way to convert PDFs to PNGs, a step that document-heavy workflows and machine learning applications depend on.
Key Points
- Frustrated with slow PDF to PNG conversion, the author created fastpdf2png using PDFium (Chrome's PDF engine) and a custom PNG encoder that uses SIMD instructions
- Detects grayscale pages and outputs 8-bit PNGs to reduce file size
- Benchmarked at 323 pages/s in a single process, and up to 1,500 pages/s with 8 workers
- Targeted at data pipelines, ML preprocessing, and document management workflows that require fast PDF processing
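The benchmark figures in the list above imply a speedup of roughly 4.6x from 8 workers, or about 58% parallel efficiency. The short calculation below works that out from the two quoted rates. The 100,000-page corpus size is a hypothetical value chosen only to illustrate wall-clock time and does not come from the author's benchmark.

```python
# Figures quoted in the summary above.
single_process_pps = 323   # pages per second, one process
multi_worker_pps = 1500    # pages per second, 8 workers
workers = 8

# Speedup relative to one process, and how much of the ideal 8x is achieved.
speedup = multi_worker_pps / single_process_pps
efficiency = speedup / workers

print(f"speedup: {speedup:.2f}x")                 # speedup: 4.64x
print(f"parallel efficiency: {efficiency:.0%}")   # parallel efficiency: 58%

# Rough conversion time for a hypothetical 100,000-page corpus at each rate.
corpus_pages = 100_000
single_seconds = corpus_pages / single_process_pps
multi_seconds = corpus_pages / multi_worker_pps
print(f"single process: {single_seconds:.0f} s")  # single process: 310 s
print(f"8 workers: {multi_seconds:.0f} s")        # 8 workers: 67 s
```

Scaling below the ideal 8x is common for multi-process workloads. The summary does not say what limits it in this case.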
Details
The author was working on a document extraction pipeline and found existing PDF to PNG conversion tools such as PyMuPDF, MuPDF, and ImageMagick too slow when processing thousands of documents.
To address this, they developed a new Python library, fastpdf2png, that uses PDFium (the PDF engine from Chrome) under the hood, along with a custom PNG encoder that uses SIMD instructions and a patched compression library. The library also detects when a page is grayscale and automatically outputs an 8-bit grayscale PNG, which results in smaller files.
Benchmarks show fastpdf2png converting PDFs to PNGs at 323 pages per second in a single process and up to 1,500 pages per second with 8 workers, which the author reports is significantly faster than the competition. The tool is aimed at users who handle PDFs at scale, such as in data pipelines, machine learning preprocessing, and document management workflows.
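The grayscale detection described above can be illustrated with a minimal sketch. This is not fastpdf2png's implementation, which the summary does not show. It assumes the rendered page is available as a packed 8-bit RGB buffer (3 bytes per pixel), and the function names `is_grayscale` and `to_gray8` are hypothetical. A page counts as grayscale when every pixel has equal red, green, and blue values, in which case one channel is enough.

```python
def is_grayscale(rgb: bytes) -> bool:
    """Return True if every pixel in a packed RGB buffer has R == G == B."""
    if len(rgb) % 3 != 0:
        raise ValueError("expected packed 8-bit RGB data (3 bytes per pixel)")
    # Extended slices pull out each channel as its own bytes object.
    red, green, blue = rgb[0::3], rgb[1::3], rgb[2::3]
    return red == green and green == blue


def to_gray8(rgb: bytes) -> bytes:
    """Collapse a grayscale RGB buffer to one byte per pixel."""
    if not is_grayscale(rgb):
        raise ValueError("buffer contains colour pixels")
    return rgb[0::3]


# Three pixels each: black, mid-grey, white versus black, red-ish, white.
gray_page = bytes([0, 0, 0, 128, 128, 128, 255, 255, 255])
colour_page = bytes([0, 0, 0, 200, 10, 10, 255, 255, 255])

print(is_grayscale(gray_page))                    # True
print(is_grayscale(colour_page))                  # False
print(len(gray_page), len(to_gray8(gray_page)))   # 9 3
```

Dropping two of the three channels cuts the raw pixel data to a third before compression, which is where the smaller output files come from.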