Optical Character Recognition (OCR)

OCR is a technology that converts handwritten, typed, or scanned text, including text embedded in images, into machine-readable text.

You can run OCR on any legible source that contains text: an image file, a PDF document, or a scanned, printed, or handwritten page.

Using OCR

Some common uses of OCR are:

Creating automated workflows by digitizing PDF documents across different business units.
Eliminating manual data entry when digitizing paper documents such as passports, invoices, and bank statements.
Enabling secure access to sensitive information by digitizing ID cards, credit cards, and similar documents.
Digitizing printed books.

Read a PDF file

Here you will read the content of a PDF file. To do this, install the PyPDF2 library (pip install PyPDF2). It is written in pure Python and handles common PDF tasks such as:

Extracting information from the document.
Splitting documents page by page.
Encrypting and decrypting PDF files.
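As a minimal sketch of the first two tasks, the snippet below reads a PDF page by page and joins the extracted text. It assumes PyPDF2 version 2.0 or later, where PdfReader and extract_text() are available. The helper names join_pages and read_pdf and the file name sample.pdf are made up for illustration. Note that PyPDF2 only returns text already stored in the PDF, so a page that is just a scanned image will come back empty and needs OCR instead.

```python
def join_pages(page_texts):
    """Join per-page text into one string, labelling each page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Image-only pages can yield an empty value, so fall back to "".
        parts.append(f"--- Page {number} ---\n{(text or '').strip()}")
    return "\n\n".join(parts)


def read_pdf(path):
    """Return the text of every page in the PDF at `path`."""
    # Imported here so join_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return join_pages(page.extract_text() for page in reader.pages)


# Example call (sample.pdf is a placeholder file name):
# print(read_pdf("sample.pdf"))
```

Keeping the joining step separate from the reading step makes it easy to swap in another text source later, for example OCR output for scanned pages.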

Best practices for OCR using pytesseract

Experiment with different combinations of pytesseract settings to get the best results for your use case, and keep the following in mind:

The text should not be skewed. Leave some white space around the text, and make sure the image is evenly lit so that there are no dark edges.
Scan at a minimum of 300 DPI; resolutions of 300 to 600 DPI work well.
A font size of 12 pt or larger gives better results.
Apply pre-processing techniques such as binarizing the image, removing noise, rotating the image to straighten it, and sharpening it.
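The pre-processing and settings advice above can be sketched roughly as follows, assuming Pillow and pytesseract are installed along with the Tesseract engine itself. The function names, the threshold of 150, and the file name scan.png are illustrative assumptions, not recommended values. Deskewing is left out because it needs an angle estimate that depends on the image.

```python
def build_config(psm=6, oem=3):
    """Build a Tesseract option string.

    --psm sets the page segmentation mode (6 = a single uniform block
    of text) and --oem sets the OCR engine mode (3 = default engine).
    """
    return f"--oem {oem} --psm {psm}"


def preprocess(image, threshold=150):
    """Grayscale, denoise, sharpen, and binarize a Pillow image."""
    from PIL import ImageFilter, ImageOps

    gray = ImageOps.grayscale(image)
    # A small median filter removes speckle noise.
    denoised = gray.filter(ImageFilter.MedianFilter(size=3))
    sharpened = denoised.filter(ImageFilter.SHARPEN)
    # Binarize: pixels brighter than the threshold become white,
    # the rest black. The threshold is an assumed starting point.
    return sharpened.point(lambda p: 255 if p > threshold else 0)


def ocr_image(path, psm=6, lang="eng"):
    """Run Tesseract on the pre-processed image at `path`."""
    # Imported here so build_config works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        cleaned = preprocess(image)
        return pytesseract.image_to_string(
            cleaned, lang=lang, config=build_config(psm)
        )


# Example call (scan.png is a placeholder file name):
# print(ocr_image("scan.png", psm=6))
```

Trying a few psm values and thresholds on a sample of your own documents is usually the quickest way to find a combination that works.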

Jyoti Verma
