A Deep Dive Into Page Sync
This article explains how Page Sync works in the Earleaf app: the user photographs a page from a physical book, and the app finds the corresponding position in the audiobook.
Why it matters
Page Sync matters to listeners who switch between a physical book and its audiobook, because it lets them pick up in one format where they left off in the other.
Key Points
- Page Sync matches text extracted from a photo (OCR) against text transcribed from the audiobook (speech recognition)
- OCR and speech recognition make different kinds of errors, so the matching has to tolerate imperfections on both sides
- The audiobook is transcribed offline with the Vosk speech recognition engine, and each word is stored in a database with its timestamp
- The OCR output is filtered to remove unwanted elements such as headers, footers, and bleed-through from the facing page
- Up to 20 query words are selected from the filtered OCR text and used to search the audiobook transcription index for the matching position
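The indexing and query-selection points above can be sketched roughly as follows. This is a minimal illustration, not Earleaf's actual code: the `TimedWord` type, the in-memory dictionary standing in for the database, and the `min_len` heuristic for choosing query words are all assumptions made for the example. Only the 20-word cap comes from the article.

```python
from collections import defaultdict
from typing import NamedTuple


class TimedWord(NamedTuple):
    """One transcribed word and its start time (hypothetical shape)."""
    word: str
    start_ms: int


def normalize(token: str) -> str:
    """Lowercase and drop punctuation so OCR and transcript tokens are comparable."""
    return "".join(ch for ch in token.lower() if ch.isalnum())


def build_index(transcript: list[TimedWord]) -> dict[str, list[int]]:
    """Map each normalized word to every timestamp at which it was spoken."""
    index: dict[str, list[int]] = defaultdict(list)
    for tw in transcript:
        key = normalize(tw.word)
        if key:
            index[key].append(tw.start_ms)
    return dict(index)


def select_query_words(ocr_text: str, max_words: int = 20, min_len: int = 4) -> list[str]:
    """Pick up to max_words distinct words from filtered OCR text.

    Skipping words shorter than min_len is an assumed heuristic: short,
    common words occur everywhere and say little about position.
    """
    seen: set[str] = set()
    chosen: list[str] = []
    for token in ocr_text.split():
        word = normalize(token)
        if len(word) >= min_len and word not in seen:
            seen.add(word)
            chosen.append(word)
            if len(chosen) == max_words:
                break
    return chosen
```

A real implementation would persist the index in a database rather than a dictionary, as the article describes.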
Details
Page Sync is a feature in the Earleaf app that lets users photograph a page from a physical book and have the app find the corresponding position in the audiobook version. This is a challenging problem because it requires matching the output of two different machine learning models: optical character recognition (OCR) on the photo, and speech recognition on the audiobook. The two models make different kinds of errors, so the matching has to be flexible enough to tolerate imperfections on both sides.

The article explains the process in three steps. First, the audiobook is transcribed offline with the Vosk speech recognition engine, and each word is stored in a database with a millisecond-level timestamp. Then, when the user takes a photo, the OCR output is filtered to remove unwanted elements such as headers, footers, and bleed-through from the facing page. Finally, up to 20 query words are selected from the filtered OCR text and used to search the audiobook transcription index for the matching position, which is then used to resume playback at the right spot.