Superhuman Multi-Language OCR Engine
Jun 2019
Building a custom document scanner outperforming Google Tesseract by 10x in accuracy and 100x in speed across 15 European languages simultaneously.
Background and Industry Context
Invoice digitization is critical for enterprise automation across European markets. Real-world documents present unique challenges: multiple scan-print cycles, mixed languages, and varying quality that expose significant limitations in existing OCR solutions.
Our company needed to process invoices from 15 European countries, often containing multiple languages within single documents. Google Tesseract took 30 minutes per document with poor accuracy when the language wasn’t known beforehand. To give an idea of the complexity: the system needed to distinguish between 22 variations of the letter “A” across different diacritics in noisy, low-quality scans.
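To give a feel for the diacritic problem, here is a minimal sketch (not part of the original system) that enumerates precomposed variants of the letter "A" in the Latin Unicode blocks using only the standard library; the exact count depends on which blocks you include:

```python
import unicodedata

def variants_of(base: str, stop: int = 0x250) -> list[str]:
    """Collect precomposed accented variants of `base` from the
    Latin-1 Supplement through Latin Extended-B blocks."""
    out = []
    for cp in range(0x00C0, stop):
        ch = chr(cp)
        # NFD splits a precomposed letter into base char + combining marks
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) > 1 and decomposed[0] == base:
            out.append(ch)
    return out

a_variants = variants_of("A")
print(a_variants)  # À, Á, Â, Ã, Ä, Å, Ā, Ă, Ą, Ǎ, ...
```

An OCR model has to map every one of these (plus their lowercase forms, rendered in noisy low-resolution scans) to the right character, which is where language-agnostic approaches struggle.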
As the sole developer on the OCR engine, I worked directly with the CTO to architect and implement the entire system in 2 months. This included research, model development, training pipeline creation, and production deployment while collaborating with two other engineers on the broader invoice processing system.
Figure: an example of a poorly-scanned source document.
Result
10x accuracy improvement and 100x speed increase over Google Tesseract. Documents that previously took 30 minutes now processed in under 30 seconds, with accurate extraction even from severely degraded scans that challenged human readers.
Solution Architecture
Custom Neural Architecture
Implemented a CRNN (convolutional recurrent neural network) with an ImageNet-pre-trained backbone and a specialized head designed for multi-language character recognition. The breakthrough came from applying Curriculum Learning (gradually increasing example difficulty during training), which accelerated convergence and improved final accuracy.
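The curriculum idea can be sketched as a batch scheduler that linearly widens the pool of available samples, from easiest to hardest, as training progresses. This is a simplified stdlib-only illustration (the actual system used a PyTorch/FastAI training loop); `difficulty` is an assumed scoring function, e.g. based on scan noise or rarity of diacritics:

```python
import random

def curriculum_batches(samples, difficulty, epochs, batch_size=4, seed=0):
    """Yield (epoch, batch) pairs where the difficulty cap grows
    linearly with the epoch: early epochs see only the easiest
    samples, the final epoch sees the entire dataset."""
    rng = random.Random(seed)
    ranked = sorted(samples, key=difficulty)  # easiest first
    for epoch in range(epochs):
        # fraction of the difficulty-sorted pool unlocked this epoch
        frac = (epoch + 1) / epochs
        pool = ranked[: max(batch_size, int(len(ranked) * frac))]
        rng.shuffle(pool)  # shuffle the slice copy; `ranked` stays sorted
        for i in range(0, len(pool), batch_size):
            yield epoch, pool[i : i + batch_size]
```

With 100 samples and 5 epochs, the first epoch draws only from the 20 easiest examples, while the last epoch samples from all 100.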
Synthetic Training Data Pipeline
The supervised approach needed large volumes of training data that closely resembled our test documents: low-quality, pixelated, on wrinkled paper, scanned and printed out several times. For this I used Faker to generate realistic data in our target languages, supplemented with transcripts of European Union meetings. Then, using PIL for image processing, I degraded the images with rotations, noise patterns typical of scanners, contrast shifts, and similar transforms.
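The degradation step can be sketched as follows. This is a minimal illustration using nested lists of grayscale values in place of PIL images (the real pipeline operated on PIL images with rotations and richer noise models); the parameter values are placeholders:

```python
import random

def degrade(image, noise_prob=0.05, contrast=0.6, seed=0):
    """Simulate scanner damage on a grayscale image (nested lists of
    0-255 values): compress contrast toward mid-gray, then sprinkle
    salt-and-pepper noise like the specks reprinting leaves behind."""
    rng = random.Random(seed)
    out = []
    for row in image:
        new_row = []
        for px in row:
            # contrast compression: pull every value toward 128
            px = int(128 + (px - 128) * contrast)
            # salt-and-pepper noise: randomly blow out or black out pixels
            if rng.random() < noise_prob:
                px = rng.choice((0, 255))
            new_row.append(px)
        out.append(new_row)
    return out
```

Chaining several such transforms with randomized parameters, applied to rendered Faker/EU-transcript text, yields an effectively unlimited supply of labeled examples that match the distribution of real degraded invoices.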
Technologies Used
PyTorch, FastAI, Python, Docker, PIL, FastAPI, Flask