Optical Character Recognition for Rookies

by Abhinav Kaiser on September 3, 2008

On this blog, we have occasionally used the term OCR software when discussing scanner applications. OCR stands for Optical Character Recognition, and quite a few readers have pinged me to ask what it means. Let’s throw some light on OCR, a concept that may be new to many readers but is great for usability.

As most of us know, when we scan a document, it gets stored as an image, or there may be an option to save it as a PDF file. Let’s consider saving a scanned document as an image. The image is just a grid of pixels, so to the machine it is effectively “analog” data: it can display the image exactly as it was scanned, but it cannot read what is written in it. Converting that picture of text into characters the machine can understand, search, and edit is the concept behind Optical Character Recognition, OCR for short.
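To make the idea concrete, here is a toy sketch in Python. It is not how any real OCR product works; it only assumes a made-up 3x5 pixel font (the `TEMPLATES` table) and a "scanned" line where every character sits in a fixed-width cell. Each cell is compared with the known character shapes, and the closest match wins. All names here are hypothetical and chosen for illustration.

```python
# Toy 3x5 "font": each glyph is 5 rows of 3 pixels, '#' = ink, '.' = paper.
# This font is invented for the example; real OCR handles arbitrary fonts.
TEMPLATES = {
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "C": ["###", "#..", "#..", "#..", "###"],
    "R": ["##.", "#.#", "##.", "#.#", "#.#"],
}


def split_glyphs(image, width=3, gap=1):
    """Cut a one-line image into fixed-width glyph cells."""
    step = width + gap
    count = (len(image[0]) + gap) // step
    return [
        [row[i * step : i * step + width] for row in image]
        for i in range(count)
    ]


def pixel_distance(a, b):
    """Count the pixels that differ between two same-sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(image):
    """Return the characters whose templates best match each glyph cell."""
    text = []
    for glyph in split_glyphs(image):
        best = min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))
        text.append(best)
    return "".join(text)


# The "scanned image": to the machine this is only a pattern of pixels.
scanned = [
    "###.###.##.",
    "#.#.#...#.#",
    "#.#.#...##.",
    "#.#.#...#.#",
    "###.###.#.#",
]

print(recognize(scanned))  # prints: OCR
```

Because the closest template wins, a stray speck of ink in one cell does not necessarily change the result. Real OCR software has to cope with much more: skewed pages, varying fonts and sizes, and characters that touch each other.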

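Some experts place OCR in the artificial intelligence category, and they have a point: OCR is generally treated as part of pattern recognition and computer vision. That said, reading clean typewritten text needs far less “intelligence” than the label suggests.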

OCR can digitize typewritten documents and, at times, handwritten ones as well. These days, certain OCR applications can even read musical notation. Isn’t it wonderful that documents written by hand (when computers were not available) can be digitized within seconds?

OCR is a feature of the software, not of the scanner. The scanner’s job ends with capturing the page as an image. One might argue that files stored on a computer are digital and not analog. Strictly speaking that is true, but if the machine can’t read the text in a file, that data is as good as analog.

OCR today achieves close to 100% accuracy on typewritten text in Latin script. Accuracy dips to 80% or below for handwritten documents. OCR accuracy would probably fall below 10% for my handwriting :).
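For the curious, here is one common way such accuracy percentages can be computed; this post does not say which method lies behind the figures above, so treat it as an assumption. The sketch counts how many single-character edits (insertions, deletions, substitutions) turn the OCR output back into the correct text, then divides by the length of the correct text. The function names and the sample strings are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Share of the reference text reproduced correctly, floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical example: the OCR confused "l" with "1" and "o" with "0".
reference = "Optical Character Recognition"
ocr_output = "Optica1 Character Recogniti0n"

print(f"{character_accuracy(reference, ocr_output):.1%}")  # prints: 93.1%
```

Two wrong characters out of 29 already pull accuracy down to about 93%, which shows why even a "99% accurate" OCR pass can leave several errors on a full page.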
