Extract text, paths, and images from PDF documents in C# and VB.NET

Use the Docotic.Pdf library to extract text, images, and vector paths from PDF documents in .NET on Windows, Linux, macOS, Android, iOS, or in a cloud environment.

Extract text and images

Get text from PDF

You can convert PDF documents to text in .NET using Docotic.Pdf. This sample shows how to convert PDF to formatted text in C#:

using System;
using BitMiracle.Docotic.Pdf;

using (var pdf = new PdfDocument("your_document.pdf"))
{
    var options = new PdfTextExtractionOptions
    {
        SkipInvisibleText = true,
        WithFormatting = true
    };
    string formattedText = pdf.GetText(options);
    Console.WriteLine(formattedText);
}

To extract text from only a specific region of a PDF page, set the PdfTextExtractionOptions.Rectangle property.
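
A minimal sketch of region-based extraction. It assumes that PdfPage.GetText accepts the same options object as the document-level call and that the PdfRectangle constructor takes left, top, width, and height in points. The rectangle values are hypothetical; adjust them to your page.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

using (var pdf = new PdfDocument("your_document.pdf"))
{
    var options = new PdfTextExtractionOptions
    {
        // Hypothetical region; argument order (left, top, width, height) is assumed.
        Rectangle = new PdfRectangle(0, 0, 300, 100)
    };

    // Only text inside the rectangle on the first page should be returned.
    string regionText = pdf.Pages[0].GetText(options);
    Console.WriteLine(regionText);
}
```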

You can also get detailed information about every text chunk for sophisticated analysis. Docotic.Pdf allows you to extract PDF text as is, by words, or by characters. This sample shows how to extract PDF text by words in C#:

using System;
using BitMiracle.Docotic.Pdf;

using (var pdf = new PdfDocument("your_document.pdf"))
{
    PdfPage page = pdf.Pages[0];
    foreach (PdfTextData data in page.GetWords())
    {
        Console.WriteLine(
            $"{{\n" +
            $"  text: '{data.GetText()}',\n" +
            $"  bounds: {data.Bounds},\n" +
            $"  font name: '{data.Font.Name}',\n" +
            $"  font size: {data.FontSize},\n" +
            $"  transformation matrix: {data.TransformationMatrix},\n" +
            $"  rendering mode: '{data.RenderingMode}',\n" +
            $"  brush: {data.Brush},\n" +
            $"  pen: {data.Pen}\n" +
            $"}},"
        );
    }
}
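
Extraction by characters can be sketched the same way. This assumes a PdfPage.GetChars() method that mirrors GetWords() and returns PdfTextData items; check the API reference before relying on it.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

using (var pdf = new PdfDocument("your_document.pdf"))
{
    PdfPage page = pdf.Pages[0];

    // GetChars() is assumed to return one PdfTextData item per character.
    foreach (PdfTextData data in page.GetChars())
    {
        Console.WriteLine($"'{data.GetText()}' at {data.Bounds}");
    }
}
```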

Read the Extract text from PDF article to get more samples and information about PDF to text conversion in .NET.

Get images from PDF in .NET

The library can extract images from PDF files either as is or as painted. This sample shows how to extract all images from a PDF in C#:

using BitMiracle.Docotic.Pdf;

using (var pdf = new PdfDocument("your_document.pdf"))
{
    int i = 0;
    foreach (PdfImage image in pdf.GetImages())
    {
        // Save returns the path of the file that was written.
        string imageFile = image.Save(i.ToString());
        ++i;
    }
}

Extracted images can be saved as TIFF or JPEG files.

The library does not recompress images while extracting them, so you get images with the same quality as in the PDF.

You can also get information about where images are actually drawn on a page.
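
A short sketch of reading image placement. It assumes that PdfPage.GetPaintedImages() returns PdfPaintedImage items and that each item exposes a Bounds property; treat both names as assumptions to verify against the API reference.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

using (var pdf = new PdfDocument("your_document.pdf"))
{
    PdfPage page = pdf.Pages[0];

    // Each painted image describes one place where an image is drawn on the page.
    foreach (PdfPaintedImage painted in page.GetPaintedImages())
    {
        Console.WriteLine($"Image drawn at {painted.Bounds}");
    }
}
```

The same image can be drawn several times on a page, so the number of painted images may exceed the number of distinct images.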

Extract vector paths from PDF

You can get information about vector paths in a PDF document using the PdfPage.GetObjects() method. Take a look at the Copy page objects and Extract page objects samples for more details.
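
As a rough illustration, the sketch below counts the vector paths on the first page. It assumes that GetObjects() returns PdfPageObject items and that their Type property uses a PdfPageObjectType enumeration with a Path member; the linked samples show the actual usage.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

using (var pdf = new PdfDocument("your_document.pdf"))
{
    PdfPage page = pdf.Pages[0];
    int pathCount = 0;

    foreach (PdfPageObject obj in page.GetObjects())
    {
        // Type and PdfPageObjectType.Path are assumed names for the object kind.
        if (obj.Type == PdfPageObjectType.Path)
            ++pathCount;
    }

    Console.WriteLine($"Vector paths on page 1: {pathCount}");
}
```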

You can also extract text as vector paths using the PdfPage.GetObjects(PdfObjectExtractionOptions) overload. This feature can be used to flatten text in a PDF in .NET.