OCR PDF in .NET - how to extract text from non-searchable PDF

Text extraction is one of the most common PDF processing tasks. You need to extract text from a PDF document if you want to:

  • index the document for full-text search
  • parse some data like names and prices
  • highlight, delete, or replace a word or phrase

You can extract text manually. Open a document in any PDF viewer, then select and copy some text. This works for most documents. Such documents are known as “searchable PDFs”. A searchable PDF document renders text using dedicated text operators, and the font objects associated with that text contain correct mappings from glyphs to Unicode.
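To make the glyph-to-Unicode mapping more concrete, here is a minimal Python sketch (not .NET, and not taken from any particular PDF library). It assumes the mapping is stored as a ToUnicode CMap made only of `beginbfchar` sections and that character codes are two bytes wide. The names `parse_bfchar`, `decode_hex_string`, and `SAMPLE_CMAP` are hypothetical, and real CMaps may also use `bfrange` sections, which this sketch ignores.

```python
import re

# Hypothetical, hand-written fragment of a ToUnicode CMap.
# Each line maps a character code to a UTF-16BE encoded Unicode value.
SAMPLE_CMAP = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
"""

def parse_bfchar(cmap: str) -> dict[int, str]:
    """Collect code -> Unicode pairs from every beginbfchar section."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_hex_string(hex_string: str, mapping: dict[int, str], code_bytes: int = 2) -> str:
    """Decode the hex operand of a text-showing operator using the mapping.

    Codes without a mapping become U+FFFD, which is roughly what happens
    when a document draws text but lacks correct Unicode mappings.
    """
    width = code_bytes * 2
    codes = [int(hex_string[i:i + width], 16) for i in range(0, len(hex_string), width)]
    return "".join(mapping.get(code, "\ufffd") for code in codes)

if __name__ == "__main__":
    print(decode_hex_string("00010002", parse_bfchar(SAMPLE_CMAP)))  # Hi
```

If the mapping is missing or wrong, the page can still look fine on screen while the copied text comes out as replacement characters or unrelated symbols.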

Many PDF libraries can extract text from searchable PDF documents.

There are also non-searchable PDF documents. Non-searchable documents usually render text as a raster image. A typical example is a scanned PDF document. Non-searchable PDF documents may also render text as vector paths without using fonts or special PDF operators.
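The difference between the two kinds of documents can be approximated by looking at which operators appear in a page's content stream. The following Python sketch is a rough heuristic only: it assumes the stream has already been decompressed into text, and the names `strip_strings` and `classify_content_stream` are hypothetical. It treats the text-showing operators (`Tj`, `TJ`, `'`, `"`) as a sign of a searchable page, and image or path-painting operators without any text as a sign that OCR may be needed.

```python
import re

TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}
# "Do" paints an XObject. On scanned pages that is usually an image,
# but it can also be a form that itself contains text.
IMAGE_OPS = {"Do", "BI"}
PATH_PAINT_OPS = {"f", "F", "f*", "S", "s", "B", "B*", "b", "b*"}

def strip_strings(stream: str) -> str:
    """Remove literal (...) and hex <...> strings so that their
    contents are not mistaken for operators."""
    out = []
    depth = 0
    i = 0
    while i < len(stream):
        ch = stream[i]
        if depth > 0:
            if ch == "\\":      # skip the escaped character
                i += 2
                continue
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
        elif ch == "(":
            depth = 1
            out.append(" ")     # keep neighbouring tokens separated
        else:
            out.append(ch)
        i += 1
    return re.sub(r"<[0-9A-Fa-f\s]*>", " ", "".join(out))

def classify_content_stream(stream: str) -> str:
    """Return 'searchable', 'needs-ocr', or 'empty' for a decoded stream."""
    tokens = set(re.split(r"[\s\[\]]+", strip_strings(stream)))
    if tokens & TEXT_SHOW_OPS:
        return "searchable"
    if tokens & (IMAGE_OPS | PATH_PAINT_OPS):
        return "needs-ocr"
    return "empty"

if __name__ == "__main__":
    print(classify_content_stream("BT /F1 12 Tf 72 720 Td (Hello) Tj ET"))
    print(classify_content_stream("q 612 0 0 792 0 0 cm /Im1 Do Q"))
```

The check is deliberately naive. A page that uses text operators may still lack usable Unicode mappings, and binary inline image data can confuse the tokenizer, so a result from this sketch is a hint rather than a guarantee.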

You need to perform optical character recognition (OCR) to extract text from non-searchable PDF documents. OCR does not guarantee correct results in every case. The results depend on the document’s quality and on the recognition algorithm. OCR is also much slower than extracting text from searchable documents.

Let’s look at how to perform OCR and extract text from PDF documents in a .NET application.
