PDF to text features

You can extract text in different forms from PDF documents written in different languages.

PDF to text conversion process

Docotic.Pdf can extract plain and formatted text from PDF documents. It can also provide detailed information about every single character, such as its font, color, and size.
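As a rough illustration of the plain and formatted extraction modes, here is a minimal C# sketch. It assumes the Docotic.Pdf NuGet package is installed and that the `PdfDocument.GetText`, `PdfDocument.GetTextWithFormatting`, and `PdfPage.GetText` methods are available in your version of the library; check the names against the API reference before relying on them. The file name `input.pdf` is hypothetical.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class ExtractTextSketch
{
    static void Main()
    {
        // "input.pdf" is a hypothetical file name used for illustration only.
        using (var pdf = new PdfDocument("input.pdf"))
        {
            // Plain text of the whole document.
            string plainText = pdf.GetText();
            Console.WriteLine(plainText);

            // Text of the whole document extracted in the formatted mode.
            string formattedText = pdf.GetTextWithFormatting();
            Console.WriteLine(formattedText);

            // Plain text of the first page only.
            string firstPageText = pdf.Pages[0].GetText();
            Console.WriteLine(firstPageText);
        }
    }
}
```

Extracting per page instead of per document is usually the better fit when you need to know which page a piece of text came from.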

You don't have to do anything special to extract Arabic, Hebrew, or Persian text from PDF documents, because Docotic.Pdf extracts right-to-left and bidirectional text properly.

Docotic.Pdf library: 9.5.17615-dev. Regression tests: 14,813 passed. Total NuGet downloads: 4,924,084.

Articles

Below are resources that explain different aspects of PDF to text conversion in C# and VB.NET code.

Blog posts

We have a blog post that explains how to extract text from non-searchable PDF documents.

Non-searchable documents usually render text as a raster image. A typical example is a scanned PDF document. Non-searchable PDF documents may also render text with vector paths without using fonts or special PDF operators.

Sample code

These code samples show different options for PDF to text conversion in C# and VB.NET.

  • Extract text
    Extract plain text from PDFs with or without formatting.

  • Extract text by words
    Extract all words from a PDF with detailed information like position, font, color and other properties for each word.

  • Find and highlight text
    Extract all words from a PDF page. Find a phrase in the collection of words. Then highlight the result using a highlight annotation.

  • OCR PDF and extract plain text
    Extract text from non-searchable PDF documents using the Docotic.Pdf library and the Tesseract OCR engine.

  • Fix garbled text
    Extract text from PDF documents when regular methods and tools produce garbled or unexpected text.

  • Extract text from link target
    Get the first link and extract text from the link's target page below the top offset of the link.