PDF to text features

You can extract text in several forms from PDF documents written in different languages.

Docotic.Pdf can extract plain and formatted text from PDF documents. It can also provide detailed information about every character, such as its font, color, size, and other properties.

You don't have to do anything special to extract Arabic, Hebrew, or Persian text from PDF documents. Docotic.Pdf extracts right-to-left and bidirectional text properly.
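As a rough illustration of the features above, here is a minimal sketch that prints the plain text of a document and then lists the words on its first page with their bounds. It assumes the Docotic.Pdf NuGet package is referenced and uses `PdfDocument.GetText()` and `PdfPage.GetWords()` as we understand them from the library's published samples. The file name `input.pdf` and the class name are hypothetical. Formatted text and character-level details are not shown here, and license setup is left out, so check the linked articles and samples for the exact calls in your version.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class ExtractTextSketch
{
    static void Main(string[] args)
    {
        // Hypothetical input path; pass your own file as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        // Assumption: any required license key has already been registered.
        // That step is omitted from this sketch.
        using (var pdf = new PdfDocument(path))
        {
            // Plain text of the whole document.
            string plainText = pdf.GetText();
            Console.WriteLine(plainText);

            // Words of the first page, each with its bounding rectangle.
            PdfPage firstPage = pdf.Pages[0];
            foreach (PdfTextData word in firstPage.GetWords())
            {
                Console.WriteLine($"{word.GetText()} at {word.Bounds}");
            }
        }
    }
}
```

The same code should apply to Arabic, Hebrew, or Persian documents, since the text above states that no special handling is needed for right-to-left text.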

Docotic.Pdf library 9.7.18373 | Regression tests: 15,244 passed | Total NuGet downloads: 5,976,723

Articles

Below are resources that explain different aspects of PDF to text conversion in C# and VB.NET code.

Extract text from PDF in C# and VB.NET
Extract text from PDF documents in C# and VB.NET using the Docotic.Pdf library. Supports Windows, Linux, macOS, Android, iOS, and cloud environments.
Extract text and images from PDF in C# .NET
Extract text, images, and paths from PDF documents in C# and VB.NET using Docotic.Pdf. Convert PDF to text on Windows, Linux, macOS, Android, iOS, and in cloud environments.

Blog posts

One of our blog posts explains how to extract text from non-searchable PDF documents. Such documents usually render text as a raster image. A typical example is a scanned PDF document.

Non-searchable PDF documents may also render text with vector paths, without using fonts or special PDF operators.

OCR PDF in C# and VB.NET
How to OCR PDF and extract text in C# and VB.NET using Tesseract and Docotic.Pdf.
Extract text from PDF on AWS Lambda in C# .NET
How to extract text from PDF on AWS Lambda in a C# .NET Core application using the Docotic.Pdf library.

Sample code

These code samples show different options for PDF to text conversion in C# and VB.NET.

Extract text
Extract plain text from PDFs with or without formatting.
Extract text by words
Extract all words from a PDF with detailed information like position, font, color and other properties for each word.
Find and highlight text
Extract all words from a PDF page. Find the phrase in the collection of words. Then highlight the result using a highlight annotation.
OCR PDF and extract plain text
Extract text from non-searchable PDF documents using Docotic.Pdf library and Tesseract OCR Engine.
Fix garbled text
Extract text from PDF documents when regular methods and tools produce garbled or unexpected text.
Extract text from link target
Get the first link and extract text from the link's target page below the top offset of the link.