Scanned PDF to Audiobook: Why It's Hard and How We Fixed It

Five seconds of silence

A reader from Chile signed up, uploaded a 74-page PDF of a children's novel — "¿Quién le tiene miedo a Demetrio Latov?" by Angeles Durini — and clicked Generate audiobook. Fifteen seconds later, an email landed: "Your audiobook is ready."

It wasn't. The file was 5.6 seconds long. It contained nothing.

This is the kind of bug you only catch in production, because the test PDFs we'd been using had embedded text. Hers didn't. The book had been scanned, page by page, in 2012 with Adobe Acrobat Pro — and every page was stored as an image. To our pipeline, the PDF was 74 blank pages.

Key takeaway: When a PDF is a scan (image-only, no embedded text layer), most AI audiobook generators silently produce empty or nonsense audio. Real text extraction needs OCR, and almost no audiobook tool runs it by default.

Why scanned PDFs break audiobook pipelines

A PDF can carry text in two completely different ways:

Text PDFs — words are stored as actual characters with font and position. Any library (pypdf, pdfminer, pypdfium2) can extract them in milliseconds.
Image PDFs (scans) — pages are JPEG or TIFF pictures embedded inside the PDF wrapper. To a text-extraction library, those pages contain zero characters.

Anything that came off a flatbed scanner, a phone camera, or an old digitization project (think Google Books, archive.org, public-domain reprints, used kids' books) tends to be the second kind. They look identical to a human eye. They are radically different to software.

When we examined the failed upload, our PDF parser had returned exactly 17 characters of garbage (control bytes, not letters) for the entire 74-page book. Our pipeline then dutifully fed those 17 bytes to text-to-speech, which produced 5.6 seconds of audio with nothing intelligible in it. Then it emailed the reader to say her audiobook was ready.
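A minimal sketch of how this kind of failure can be spotted, assuming pypdf for native extraction. The helper names (usable_chars, native_text_per_page) are illustrative and not our production code. The idea is to count letters and digits rather than raw characters, so that a handful of control bytes cannot pass for text.

```python
def usable_chars(text: str) -> int:
    """Count letters and digits only; control bytes, whitespace and punctuation do not count."""
    return sum(1 for ch in text if ch.isalnum())


def native_text_per_page(pdf_path: str) -> list[str]:
    """Return the embedded text of each page (empty strings for image-only pages)."""
    # Imported lazily so usable_chars() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

Summing usable_chars() over native_text_per_page("book.pdf") gives a number close to zero for an image-only scan, even if the parser returns a few stray bytes.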

What we built

The fix is conceptually simple: detect when a PDF has no usable text, then OCR every page. The implementation has a few pieces worth describing.

1. A sparse-text detector

Before charging into OCR for every upload, we check the result of native extraction. If the whole PDF returns fewer than 200 characters, or if a PDF of four or more pages averages fewer than 30 characters per page, we treat it as a scan and fall through to OCR. Normal text PDFs, the 90%+ case, never trigger the slow path.
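The rule above can be sketched as a small function. The thresholds come from the description; the function name, the parameter names, and the choice to strip whitespace before counting are assumptions made for illustration.

```python
def looks_like_scan(page_texts: list[str],
                    min_total_chars: int = 200,
                    min_avg_chars_per_page: int = 30,
                    min_pages_for_avg: int = 4) -> bool:
    """Decide whether natively extracted text is too sparse to be a real text PDF."""
    # Assumption: surrounding whitespace is not counted as content.
    total = sum(len(text.strip()) for text in page_texts)

    # Rule 1: the whole document yielded almost nothing.
    if total < min_total_chars:
        return True

    # Rule 2: a multi-page document with very little text per page on average.
    pages = len(page_texts)
    if pages >= min_pages_for_avg and total / pages < min_avg_chars_per_page:
        return True

    return False
```

The failed upload would trip the first rule straight away: 17 characters across 74 pages is far below 200.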

2. pypdfium2 + tesseract (Apache 2.0 all the way)

We render each page with pypdfium2 (a Python wrapper around Google's PDFium engine, Apache 2.0) at 2.5× scale. PDF coordinates are 72 points per inch, so that works out to 2.5 × 72 = 180 DPI, enough for reliable character recognition. Each image goes through tesseract with a multi-language model loaded: eng+spa+por+fra+deu+pol+ita+tur. With all of these models loaded, tesseract can recognize a page without being told its language in advance.
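A rough sketch of this step, assuming pypdfium2 and pytesseract are installed along with the tesseract binary and the listed language packs. The function and constant names are illustrative, and error handling is left out.

```python
OCR_LANGS = ("eng", "spa", "por", "fra", "deu", "pol", "ita", "tur")
PDF_BASE_DPI = 72  # PDF user space: 1 point = 1/72 inch


def scale_to_dpi(scale: float) -> float:
    """Convert a render scale factor to the resulting resolution in DPI."""
    return scale * PDF_BASE_DPI


def ocr_pdf(pdf_path: str, scale: float = 2.5, langs=OCR_LANGS) -> list[str]:
    """Render every page to an image and run tesseract on it; returns one string per page."""
    # Imported lazily so the helpers above work without the OCR stack installed.
    import pypdfium2 as pdfium
    import pytesseract

    lang_arg = "+".join(langs)  # tesseract takes several languages joined with "+"
    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            image = page.render(scale=scale).to_pil()  # PIL image at scale * 72 DPI
            texts.append(pytesseract.image_to_string(image, lang=lang_arg))
    finally:
        pdf.close()
    return texts
```

Loading more languages makes each page slower to recognize, so the language list is a trade-off between coverage and OCR time.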

We deliberately moved off PyMuPDF — the most popular Python PDF library — because its AGPL license is awkward for a hosted service. Switching to pypdfium2 took an afternoon and removed the legal cliff entirely. Worth knowing if you're building anything PDF-related for a commercial product.

3. Language auto-detect

The reader's book was Spanish, but her UI locale was English. Before our fix, even if OCR had worked, the pipeline would have synthesized Spanish text with an English voice (robotic, mispronounced). Now language detection runs on the OCR'd text itself: langdetect classifies a sample of 3 pages and the pipeline picks the matching voice, in this case Latin American Spanish.
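Here is one way this could look, assuming the langdetect package. The voice table, the voice identifiers, and the choice of the first three non-empty pages as the sample are all hypothetical; the article does not say which pages are sampled or how voices are named internally.

```python
def sample_pages(page_texts: list[str], sample_size: int = 3) -> str:
    """Join the first few non-empty pages into one detection sample (assumed strategy)."""
    non_empty = [text for text in page_texts if text.strip()]
    return " ".join(non_empty[:sample_size])


# Hypothetical mapping from ISO 639-1 codes (as returned by langdetect) to voice IDs.
VOICE_BY_LANG = {"es": "es-419", "en": "en-US", "pt": "pt-BR"}


def pick_voice(detected_lang: str, ui_locale_voice: str) -> str:
    """Prefer a voice matching the book's language; fall back to the UI locale's voice."""
    return VOICE_BY_LANG.get(detected_lang, ui_locale_voice)


def detect_language(text: str) -> str:
    """Return an ISO 639-1 code such as 'es' for the given text."""
    # Imported lazily so the helpers above work without langdetect installed.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless a seed is fixed
    return detect(text)
```

A call such as pick_voice(detect_language(sample_pages(ocr_texts)), "en-US") would then choose the voice from the content of the book rather than from the reader's interface language.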

4. Honest failure path

If OCR still recovers fewer than 100 characters (decorative-only PDFs, broken files), we now raise a clear error instead of generating silence: "This PDF has no extractable text. Please upload an EPUB, TXT, or a text-based PDF." The audiobook job is marked failed, no email is sent, no credits charged.
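A small sketch of that guard. The 100-character threshold and the message are the ones quoted above; the exception class and function name are made up for the example.

```python
MIN_OCR_CHARS = 100
NO_TEXT_MESSAGE = (
    "This PDF has no extractable text. "
    "Please upload an EPUB, TXT, or a text-based PDF."
)


class NoExtractableTextError(Exception):
    """Raised when neither native extraction nor OCR recovers usable text."""


def require_usable_text(page_texts: list[str]) -> str:
    """Return the joined text, or raise instead of letting an empty job continue."""
    text = "\n".join(page_texts).strip()
    if len(text) < MIN_OCR_CHARS:
        raise NoExtractableTextError(NO_TEXT_MESSAGE)
    return text
```

Raising before text-to-speech starts is what lets the job be marked failed without sending the "ready" email or charging credits.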

Got a scanned PDF? Try it now.

MimicReader is free for the first hour of audio every month. No card. Drop in any PDF — text or scanned — and we'll figure out the rest.


The result

We reran the reader's PDF through the rebuilt pipeline. Five minutes of OCR extracted 144,710 characters of clean Spanish from her 74 scanned pages. The pipeline then chunked it into 1,127 segments and produced a 2-hour 43-minute audiobook with read-along sync: word-by-word highlights tied to the audio. We emailed her the finished audiobook in Spanish, free of charge, and apologized for the first attempt.

The whole regeneration took about 90 minutes wall-clock: roughly 5 minutes for OCR, 80 minutes for text-to-speech, a few minutes for audio normalization (EBU R128) and M4A finalization. That's slower than a text PDF — but it works. Before, it didn't, at all.
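For a quick sanity check, the figures above can be turned into a few derived rates. This is back-of-the-envelope arithmetic on the numbers quoted in this section, nothing measured separately, and the 80-minute text-to-speech figure is itself approximate.

```python
pages = 74
chars = 144_710
segments = 1_127
audio_minutes = 2 * 60 + 43   # 2 h 43 min of finished audio
tts_minutes = 80              # approximate text-to-speech wall-clock time

chars_per_page = chars / pages                    # how dense the scanned pages were
chars_per_segment = chars / segments              # average chunk size sent to TTS
chars_per_audio_minute = chars / audio_minutes    # narration pace
tts_speed_vs_realtime = audio_minutes / tts_minutes

print(f"{chars_per_page:.0f} chars/page, {chars_per_segment:.0f} chars/segment")
print(f"{chars_per_audio_minute:.0f} chars per minute of audio")
print(f"TTS ran at about {tts_speed_vs_realtime:.1f}x real time")
```

In other words, segments averaged roughly 128 characters, and synthesis ran at about twice real time.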

What this means if you have a shelf of scans

If you've been hoarding scanned books — old reprints, out-of-print novels, public-domain digitizations from archive.org, your own scans of grandma's recipe book — they're not stuck on paper anymore. Upload them. We'll do the OCR.

Currently we OCR in 12 languages: English, Spanish, Portuguese, French, German, Polish, Italian, Turkish, Arabic, Japanese, Korean, Hindi. The audiobook itself can be generated in any of 23 voice languages. If your scanned book is in a language we don't OCR yet, drop us a note — adding a tesseract language pack takes a few minutes.

A note on commercial licensing

The whole OCR stack we use (tesseract, pypdfium2, pytesseract) is available under Apache 2.0. That matters if you're building something similar: PyMuPDF is the easy choice for PDF rendering in Python, but it is licensed under the AGPL, which obliges you to offer the source code of an application that incorporates it to the users of your hosted service, unless you buy a commercial license from Artifex. pypdfium2 + tesseract covers the rendering and OCR we needed with no license hangover.

The bug that made us better

Most bugs are caught in testing. This one wasn't, because the assumption that "PDFs contain text" held for every dev file we'd ever used. It took a real reader from Chile, in her first hour on the platform, to surface it.

So thanks, real readers. You find the bugs we can't.