I’ve been secretly working on a way to perform efficient and accurate Urdu OCR! I started out working with binary images in C++ but soon moved to Matlab. After in-depth analysis, I realized that character-level OCR would be both harder and less accurate. The main idea behind designing such a system is to convert to text all the Urdu images that are floating around on the web, and maybe scanned book content in the future. The system I’m developing requires training before it can perform the OCR. I’m sure several products already exist which can do a similar task once trained for specialized glyphs, but it’s so much fun to do something from scratch!
Here is some sample text in image form I grabbed off a website:
The number mapped onto each frame corresponds to the ID of a successful match in the library:
I have a lot of ideas to automate the library expansion process but time is an enemy on this one. There is a huge amount of detail which I’m not posting on the blog at this time.
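To give a rough idea of what "the ID of a successful match in the library" means, here is a minimal sketch in Python. It is not my actual implementation: it assumes each frame and each library entry is a small binary image of the same size, and it scores a match by simple pixel agreement. The names `similarity`, `match_frame` and the `threshold` value are made up for illustration.

```python
from typing import Dict, List, Optional

# A binary image: 1 = ink, 0 = background (assumed representation).
Glyph = List[List[int]]


def similarity(a: Glyph, b: Glyph) -> float:
    """Return the fraction of pixels that agree, or 0.0 if the shapes differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 0.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def match_frame(frame: Glyph, library: Dict[int, Glyph],
                threshold: float = 0.9) -> Optional[int]:
    """Return the ID of the best-matching library entry, or None if no
    entry reaches the (hypothetical) similarity threshold."""
    best_id: Optional[int] = None
    best_score = -1.0
    for glyph_id, template in library.items():
        score = similarity(frame, template)
        if score > best_score:
            best_id, best_score = glyph_id, score
    return best_id if best_score >= threshold else None


# Toy library with two 2x2 "glyphs", keyed by ID.
library = {
    1: [[1, 0], [1, 1]],
    2: [[0, 1], [1, 0]],
}

matched_id = match_frame([[1, 0], [1, 1]], library)    # exact copy of entry 1
unmatched_id = match_frame([[0, 0], [0, 0]], library)  # blank frame, no match
```

A frame that comes back as `None` is one the library does not know yet, which is exactly the kind of frame a library expansion step would need to deal with.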