Extract text, paths and images from PDF files

Extract text from PDFs

You can extract text from PDF files using the Docotic.Pdf library.

You can extract text one page at a time or from the whole document at once.

The library supports extraction of both plain and formatted text. You can also extract individual words, characters, or text chunks together with their coordinates.

If you need to perform a more sophisticated analysis, you can also extract text, paths, and image objects in a single collection.
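A minimal C# sketch of the text extraction options described above. The file name input.pdf is hypothetical, and the method names (GetText, GetTextWithFormatting, GetWords) and the PdfTextData members are assumptions based on the library's public samples; check them against the version you use.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class ExtractTextSketch
{
    static void Main()
    {
        // "input.pdf" is a hypothetical path used for illustration only.
        using (var pdf = new PdfDocument("input.pdf"))
        {
            // Plain text of the whole document at once.
            Console.WriteLine(pdf.GetText());

            // Plain text of a single page.
            PdfPage firstPage = pdf.Pages[0];
            Console.WriteLine(firstPage.GetText());

            // Formatted text, which tries to keep the relative positions of text on the page.
            Console.WriteLine(firstPage.GetTextWithFormatting());

            // Individual words with their coordinates.
            // Assumption: PdfTextData exposes GetText() and Bounds;
            // older versions may expose a Text property instead.
            foreach (PdfTextData word in firstPage.GetWords())
                Console.WriteLine($"{word.GetText()} at {word.Bounds}");
        }
    }
}
```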

Extract images

You can use the library to extract images from PDF files either as is (the image as stored in the file) or as painted (the image as it is drawn on the page).

Extracted images can be saved as TIFF or JPEG files.

The library does not recompress images while extracting them, so you get images with the same quality as in the PDF.

You can also find out where on a page each image is actually drawn.
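A short C# sketch of image extraction, under stated assumptions: the file names are hypothetical, and GetPaintedImages, PdfPaintedImage.Bounds, PdfImage.Save, and SaveAsPainted are taken from the library's public samples and may differ between versions.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class ExtractImagesSketch
{
    static void Main()
    {
        // "input.pdf" and the output names are hypothetical.
        using (var pdf = new PdfDocument("input.pdf"))
        {
            PdfPage page = pdf.Pages[0];
            int index = 0;

            foreach (PdfPaintedImage painted in page.GetPaintedImages())
            {
                // Where on the page this image is drawn.
                Console.WriteLine($"Image {index} is drawn at {painted.Bounds}");

                // Save the image as is. Assumption: Save picks the file
                // extension itself and returns the resulting path.
                string savedPath = painted.Image.Save($"image-{index}");
                Console.WriteLine($"Saved as is: {savedPath}");

                // Save the image as painted on the page.
                painted.SaveAsPainted($"image-{index}-painted.tiff");

                index++;
            }
        }
    }
}
```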

Extract vector paths

You can retrieve information about vector paths using the PdfPage.GetObjects() method. Take a look at the Copy page objects sample for more details.

You can also extract text as vector paths using the PdfPage.GetObjects(PdfObjectExtractionOptions) overload. This feature can be used to flatten text.
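As a rough illustration of working with PdfPage.GetObjects(), the C# sketch below counts the text, path, and image objects on the first page. The file name is hypothetical, and the PdfPageObject.Type property and the PdfPageObjectType values are assumptions based on the Copy page objects sample; the options overload is not shown because its members are not described here.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class ListPageObjectsSketch
{
    static void Main()
    {
        // "input.pdf" is a hypothetical path used for illustration only.
        using (var pdf = new PdfDocument("input.pdf"))
        {
            int texts = 0, paths = 0, images = 0;

            // Text, path, and image objects come back in one collection.
            foreach (PdfPageObject obj in pdf.Pages[0].GetObjects())
            {
                switch (obj.Type)
                {
                    case PdfPageObjectType.Text: texts++; break;
                    case PdfPageObjectType.Path: paths++; break;
                    case PdfPageObjectType.Image: images++; break;
                }
            }

            Console.WriteLine($"Text: {texts}, paths: {paths}, images: {images}");
        }
    }
}
```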