You can extract text in different forms from PDF documents written in different languages.
Docotic.Pdf can extract plain and formatted text from PDF documents. It can also return detailed information about every single character, such as its font, color, size, and other properties.
You don't have to do anything special to extract Arabic, Hebrew, or Persian text from PDF documents. This is because Docotic.Pdf is clever enough to extract right-to-left and bidirectional text properly.
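As a rough illustration of the plain and formatted extraction described above, here is a minimal sketch. It assumes the `BitMiracle.Docotic.Pdf` NuGet package is referenced and that `document.pdf` (a placeholder file name) exists next to the executable. Check the method names against the API reference for your library version, and note that a license may need to be set up first.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class ExtractTextSketch
{
    static void Main()
    {
        // "document.pdf" is a placeholder path, not a file shipped with the library.
        using (var pdf = new PdfDocument("document.pdf"))
        {
            // Plain text of the whole document.
            string plainText = pdf.GetText();

            // Text with formatting, which tries to keep the relative
            // position of text chunks on the page.
            string formattedText = pdf.GetTextWithFormatting();

            Console.WriteLine(plainText);
            Console.WriteLine(formattedText);
        }
    }
}
```

The same calls should work for right-to-left and bidirectional documents without extra options, in line with the note above.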
Below are resources that explain different aspects of PDF to text conversion in C# and VB.NET code.
Extract text from PDF in C# and VB.NET
Extract text from PDF documents in C# and VB.NET using the Docotic.Pdf library. Supports Windows, Linux, macOS, Android, iOS, and cloud environments.
Extract text and images from PDF in C# .NET
Extract text, images, and paths from PDF documents in C# and VB.NET using Docotic.Pdf. Convert PDF to text on Windows, Linux, macOS, Android, iOS, and in cloud environments.
We have a blog post that explains how to extract text from non-searchable PDF documents. Such documents usually render text as a raster image.
A typical example is a scanned PDF document. Non-searchable PDF documents may also render text with vector paths, without using fonts or special PDF operators.
These code samples show different options for PDF to text conversion in C# and VB.NET.
Extract plain text from PDFs with or without formatting.
Extract text by words
Extract all words from a PDF with detailed information for each word, such as position, font, color, and other properties.
Find and highlight text
Extract all words from a PDF page. Find the phrase in the collection of words. Then highlight the result using a highlight annotation.
OCR PDF and extract plain text
Extract text from non-searchable PDF documents using the Docotic.Pdf library and the Tesseract OCR engine.
Fix garbled text
Extract text from PDF documents when regular methods and tools produce garbled or unexpected text.
Extract text from link target
Get the first link and extract text from the link's target page below the top offset of the link.
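To give a feel for the word-level extraction that several of the samples above rely on, the following sketch prints each word on the first page together with its bounding rectangle. It assumes the `BitMiracle.Docotic.Pdf` package and uses a placeholder file name, `document.pdf`. It is not a copy of any linked sample, so treat the exact member names as something to verify against the API reference.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class ExtractWordsSketch
{
    static void Main()
    {
        // "document.pdf" is a placeholder path.
        using (var pdf = new PdfDocument("document.pdf"))
        {
            PdfPage page = pdf.Pages[0];

            // Each PdfTextData item describes one word found on the page.
            foreach (PdfTextData word in page.GetWords())
            {
                // Bounds is the rectangle occupied by the word on the page.
                Console.WriteLine($"{word.GetText()} at {word.Bounds}");
            }
        }
    }
}
```

A collection of words with positions like this one is the usual starting point for tasks such as finding a phrase and highlighting it.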