OCR PDF in C# and VB.NET - how to extract text from non-searchable PDF

Text extraction is one of the most popular PDF processing tasks. You would need to extract text from a PDF document if you want to:

  • index the document for full-text search
  • parse some data like names and prices
  • highlight, or delete, or replace a word or a phrase

You can extract text manually. Open a document in any PDF viewer, then select and copy some text. This works properly for most documents. Such documents are known as “searchable PDFs”. Searchable PDF documents render text using special PDF operators and contain correct mappings of glyphs to Unicode in font objects associated with the text.

Many PDF libraries can extract text from searchable PDF documents.

There are also non-searchable PDF documents. Non-searchable documents usually render text as a raster image. A typical example is a scanned PDF document. Non-searchable PDF documents may also render text as vector paths without using fonts or special PDF operators.

You need to perform optical character recognition (OCR) to extract text from non-searchable PDF documents. OCR does not guarantee correct results in 100% of cases. Results depend on the document’s quality and the recognition algorithm. Also, optical recognition is much slower than the extraction of text from searchable documents.

Let’s look at how to perform OCR and extract text from PDF documents in C# and VB.NET applications.

Preparations

Suppose you have a non-searchable PDF document in English, such as Partner.pdf. You need to perform OCR automatically and extract the recognized text. The project should target .NET Standard to support Windows, Linux, and macOS. The recognition process should also work without an Internet connection.

You will need to do the following steps:

  1. Check that the PDF document does not contain regular searchable text.
  2. Convert pages of the document to high-resolution images.
  3. Recognize text on the images.

Use the Docotic.Pdf library to perform steps 1 and 2. The library fully supports .NET Standard. In trial mode, the library reads only half of the pages and adds a warning to PDF pages. You may get a free time-limited license key here to try the library without the trial mode restrictions.

Use the Tesseract OCR engine and its .NET wrapper to recognize text in step 3.

Create a new Console App (.NET Core) C# project and add the Docotic.Pdf and Tesseract NuGet packages to the project.

It’s important to use Tesseract version 4.1.0 or newer (https://github.com/tesseract-ocr/tesseract/releases). Version 4.0.0 has many issues and does not work in .NET Core applications.

Tesseract requires additional configuration in the target operating system:

  1. On Windows, install the Microsoft Visual C++ 2015-2019 Redistributable.

  2. On Linux, install or compile “libleptonica-dev” and “libtesseract-dev”, then add the binaries to your project. For example, do the following on Ubuntu 20.04:

     cd ~/YourProject/x64 # Place "x64" directory on the same level with "tessdata"
    
     sudo apt install libleptonica-dev
     ln -s /usr/lib/x86_64-linux-gnu/liblept.so.5 libleptonica-1.78.0.so
    
     sudo apt install libtesseract-dev
     ln -s /usr/lib/x86_64-linux-gnu/libtesseract.so.4.0.1 libtesseract41.so
    

    More details here: https://github.com/charlesw/tesseract/issues/503

  3. On macOS, install Tesseract using brew:

     brew install tesseract
    

    This command will install Tesseract with Leptonica and other dependencies. More details here: https://github.com/tesseract-ocr/tesseract/wiki

    Then add native dependencies to your project:

     cd ~/YourProject
    
     ln -s /usr/lib/libdl.dylib liblibdl.so
    
     mkdir x64
     cd x64
    
     ln -s /usr/local/Cellar/leptonica/1.79.0/lib/liblept.5.dylib libleptonica-1.78.0.so
     ln -s /usr/local/Cellar/tesseract/4.1.1/lib/libtesseract.4.dylib libtesseract41.so
    

    Place liblibdl.so one level higher than libleptonica-1.78.0.so and libtesseract41.so. Set the “Copy to output directory” property to “Copy always” for every *.so file in the project.

    Note that Tesseract might install with dependencies of different versions. For example, at the time of writing, it installs with Leptonica 1.79.0 instead of the 1.78.0 required by the .NET wrapper. That’s acceptable as long as the installed version is compatible with Leptonica 1.78.0 and Tesseract 4.1.

Implementation

Check if OCR is necessary

First, check whether you need to perform OCR at all. For searchable PDFs, you can extract text without recognition.

using System.IO;
using System.Text;
using BitMiracle.Docotic.Pdf;

var documentText = new StringBuilder();
using (var pdf = new PdfDocument("Partner.pdf"))
{
    for (int i = 0; i < pdf.PageCount; ++i)
    {
        if (documentText.Length > 0)
            documentText.Append("\r\n\r\n");

        PdfPage page = pdf.Pages[i];
        string searchableText = page.GetText();
        if (!string.IsNullOrEmpty(searchableText.Trim()))
        {
            documentText.Append(searchableText);
            continue;
        }

        // TODO: This page is not searchable. Perform OCR here
    }
}

using (var writer = new StreamWriter("result.txt"))
    writer.Write(documentText.ToString());

The code above extracts text from the PDF. Usually, you do not need to perform OCR on a page if it already contains any text.

Save a PDF page as an image

Extend the loop with the following snippet:

for (int i = 0; i < pdf.PageCount; ++i)
{
    ...

    if (!string.IsNullOrEmpty(searchableText.Trim()))
    {
        documentText.Append(searchableText);
        continue;
    }
    
    // This page is not searchable.
    
    PdfDrawOptions options = PdfDrawOptions.Create();
    options.BackgroundColor = new PdfRgbColor(255, 255, 255);
    options.HorizontalResolution = 300;
    options.VerticalResolution = 300;

    string pageImage = $"page_{i}.png";
    page.Save(pageImage, options);

    // TODO: Perform OCR here
}

This snippet converts a page to a PNG image with a white background. The image has 300x300 dpi resolution.
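To get a feel for what the resolution setting means in pixels, the sketch below converts a page size in PDF points (1/72 inch) to image dimensions. The `PageImageSize` class and its method are made up for this illustration and are not part of Docotic.Pdf; the library itself may round slightly differently.

```csharp
using System;

// Hypothetical helper for illustration; not part of Docotic.Pdf.
public static class PageImageSize
{
    // PDF page sizes are measured in points, and one point is 1/72 inch.
    public static (int Width, int Height) InPixels(double widthPoints, double heightPoints, double dpi)
    {
        int width = (int)Math.Round(widthPoints / 72.0 * dpi);
        int height = (int)Math.Round(heightPoints / 72.0 * dpi);
        return (width, height);
    }
}
```

For a US Letter page (612 x 792 points), `PageImageSize.InPixels(612, 792, 300)` gives 2550 x 3300 pixels. Doubling the resolution quadruples the number of pixels, which is one reason recognition takes longer at higher resolutions.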

The code saves the image to a file. You can delete the file after recognition. Alternatively, save the image to a memory stream:

using (var pageImage = new MemoryStream())
{
    page.Save(pageImage, options);
    
    // TODO: Perform OCR here
}
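One possible way to complete the in-memory variant is to hand the stream's bytes to `Pix.LoadFromMemory`, so that no temporary file is needed. This is only a sketch: `InMemoryOcr` and `RecognizePage` are illustrative names, and it assumes that `page.Save` writes the image in a format Leptonica can decode (the file-based code in this article uses PNG).

```csharp
using System.IO;
using BitMiracle.Docotic.Pdf;
using Tesseract;

// Illustrative helper; the class and method names are not from either library.
public static class InMemoryOcr
{
    public static string RecognizePage(PdfPage page, PdfDrawOptions options, TesseractEngine engine)
    {
        using (var pageImage = new MemoryStream())
        {
            // Render the page into memory instead of a file on disk
            page.Save(pageImage, options);

            // Pix.LoadFromMemory takes the encoded image bytes
            using (Pix img = Pix.LoadFromMemory(pageImage.ToArray()))
            using (Page recognizedPage = engine.Process(img))
            {
                return recognizedPage.GetText();
            }
        }
    }
}
```

The engine is passed in rather than created inside the method, so that one `TesseractEngine` can be reused for all pages.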

Recognize text

It’s time to use Tesseract to recognize the text on the PDF page image. Tesseract uses trained models for every language during recognition. The version of the model data files must correspond to the version of Tesseract. Use these data files for Tesseract 4.1.0. You only need the eng.traineddata file to recognize English text.

Create a “tessdata” directory in your project and copy eng.traineddata there. Set “Copy to output directory” property for eng.traineddata to “Copy always” or “Copy if newer”. Or you can use the NuGet package https://www.nuget.org/packages/Tesseract.Data.English/ to automate these steps.

Then add the recognition code:

using Tesseract;

using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
{
    for (int i = 0; i < pdf.PageCount; ++i)
    {
        ...
        page.Save(pageImage, options);

        using (Pix img = Pix.LoadFromFile(pageImage))
        {
            using (Page recognizedPage = engine.Process(img))
            {
                Console.WriteLine($"Mean confidence for page #{i}: {recognizedPage.GetMeanConfidence()}");
            
                string recognizedText = recognizedPage.GetText();
                documentText.Append(recognizedText);
            }
        }
        
        File.Delete(pageImage);
    }
}

The full sample code:

using System;
using System.IO;
using System.Text;
using BitMiracle.Docotic.Pdf;
using Tesseract;

namespace OCR
{
    public static class OcrAndExtractText
    {
        public static void Main()
        {
            // BitMiracle.Docotic.LicenseManager.AddLicenseData("temporary or permanent license key here");
        
            var documentText = new StringBuilder();
            using (var pdf = new PdfDocument("Partner.pdf"))
            {
                using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
                {
                    for (int i = 0; i < pdf.PageCount; ++i)
                    {
                        if (documentText.Length > 0)
                            documentText.Append("\r\n\r\n");

                        PdfPage page = pdf.Pages[i];
                        string searchableText = page.GetText();

                        // Simple check if the page contains searchable text.
                        // We do not need to perform OCR in that case.
                        if (!string.IsNullOrEmpty(searchableText.Trim()))
                        {
                            documentText.Append(searchableText);
                            continue;
                        }

                        // This page is not searchable.
                        // Save the page as a high-resolution image
                        PdfDrawOptions options = PdfDrawOptions.Create();
                        options.BackgroundColor = new PdfRgbColor(255, 255, 255);
                        options.HorizontalResolution = 300;
                        options.VerticalResolution = 300;

                        string pageImage = $"page_{i}.png";
                        page.Save(pageImage, options);

                        // Perform OCR
                        using (Pix img = Pix.LoadFromFile(pageImage))
                        {
                            using (Page recognizedPage = engine.Process(img))
                            {
                                Console.WriteLine($"Mean confidence for page #{i}: {recognizedPage.GetMeanConfidence()}");

                                string recognizedText = recognizedPage.GetText();
                                documentText.Append(recognizedText);
                            }
                        }
                        
                        File.Delete(pageImage);
                    }
                }
            }

            using (var writer = new StreamWriter("result.txt"))
                writer.Write(documentText.ToString());
        }
    }
}

The code is available on GitHub

The TesseractEngine parameters are the path to the trained model data files, the document language, and the recognition mode. You usually need one TesseractEngine object for all PDF pages. However, sometimes you may need to use multiple TesseractEngine objects. See an example in the Multilingual text section.

The Pix.LoadFromFile and Pix.LoadFromMemory methods create a Pix object from the PDF page image. The Pix class is a .NET wrapper for image objects in the Leptonica library. Tesseract uses Leptonica for image manipulations. You can use a single Pix object multiple times from one or multiple TesseractEngine objects.

The using (Page recognizedPage = engine.Process(img)) line performs recognition of text on the image. You can test recognition quality using the “confidence” metric. Confidence is a real number in the range from 0 to 1. The values close to 1 mean that Tesseract is confident the recognition was performed correctly. The Page.GetMeanConfidence() method returns a mean confidence for all the words in the recognized text.

Tesseract calculates confidence as a distance between the recognized character in the input image and the data in the trained model. In general, mean confidence values near 1 do not guarantee 100% correct result from a human perspective. The opposite is also true - lower mean confidence values do not always mean incorrect results. However, in most cases, the confidence value is reliable and useful. You will see some practical use cases below.
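One simple way to put mean confidence to work is to record it for every recognized page and flag the pages that fall below some threshold for manual review. The sketch below is a hypothetical helper, not part of the Tesseract wrapper, and any threshold you pass in (0.85 in the usage note) is an arbitrary value you would have to tune on your own documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the Tesseract wrapper.
public static class ConfidenceReport
{
    // Returns the indexes of pages whose mean confidence is below the threshold, in ascending order.
    public static List<int> PagesToReview(IReadOnlyDictionary<int, float> confidenceByPage, float threshold)
    {
        return confidenceByPage
            .Where(pair => pair.Value < threshold)
            .Select(pair => pair.Key)
            .OrderBy(index => index)
            .ToList();
    }
}
```

To use it, store `recognizedPage.GetMeanConfidence()` in a `Dictionary<int, float>` keyed by the page index inside the recognition loop, then call `ConfidenceReport.PagesToReview(confidences, 0.85f)` after the loop.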

Finally, the Page.GetText() method returns the recognized text. Tesseract can also provide detailed information about every recognized chunk through the Page.GetIterator() method. You can use this method to place the recognized text onto the source PDF page. That allows you to convert the original non-searchable document to a searchable PDF. See the sample for this scenario.
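As a rough illustration of what the iterator exposes, the sketch below lists every recognized word with its position and confidence. `WordDump` and `PrintWords` are illustrative names. The position is converted from image pixels to PDF points using the resolution the page image was rendered at, and the sketch does not attempt to draw anything on the PDF page. In the wrapper versions I have seen, the per-word confidence is reported on a 0 to 100 scale rather than 0 to 1, so treat the scale as something to verify.

```csharp
using System;
using Tesseract;

// Illustrative helper; the class and method names are not part of the Tesseract wrapper.
public static class WordDump
{
    // dpi is the resolution used when the page image was rendered (300 in the sample code).
    public static void PrintWords(Page recognizedPage, double dpi)
    {
        const PageIteratorLevel Level = PageIteratorLevel.Word;
        double pixelsToPoints = 72.0 / dpi;

        using (ResultIterator iter = recognizedPage.GetIterator())
        {
            iter.Begin();
            do
            {
                string word = iter.GetText(Level);
                if (string.IsNullOrWhiteSpace(word))
                    continue;

                if (iter.TryGetBoundingBox(Level, out Rect box))
                {
                    // Image coordinates start at the top-left corner of the page image
                    Console.WriteLine(
                        $"{word} at ({box.X1 * pixelsToPoints:F1}, {box.Y1 * pixelsToPoints:F1}) pt, " +
                        $"confidence {iter.GetConfidence(Level)}");
                }
            } while (iter.Next(Level));
        }
    }
}
```

Call `WordDump.PrintWords(recognizedPage, 300)` inside the `using (Page recognizedPage = engine.Process(img))` block, while the page object is still alive.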

Run the full version of the code for Partner.pdf document. You will get the following text with 0.91 confidence: OCR Partner.pdf 300 DPI

Note that Tesseract does not properly recognize the text on the right side:

Microsoft“ Pa rtner
P rog ra m

 

Program

I
l

How to improve recognition quality

There are many ways you can try to improve recognition quality. The methods below may improve or worsen the result. There is no silver bullet. You need to try different methods to find an optimal solution. Test on multiple real PDF documents. A method may improve recognition for one document and worsen the result for another one.

Configure Tesseract

You can improve recognition quality by using Tesseract configuration options.

Start by trying EngineMode.LstmOnly mode. In this mode Tesseract uses the most advanced recognition algorithm based on LSTM networks:

using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.LstmOnly))

The updated code gives the following result with 0.93 confidence: OCR Partner.pdf 300 DPI LSTM

| Before | After | Is quality improved? |
|---|---|---|
| Tesseract recognizes bullet characters in the list as “0” and “o” | Now recognizes as “e” | Yes and no: the text is the same in all cases, but “o” would be better than “e” |
| https://partner.microsoft.com | https://partner. microsoft.com | No |
| up—to—date (dashes) | up-to-date (hyphens) | Yes |
| Microsoft“ Pa rtner | Microsoft \| Partner | Yes |

There is a degradation of quality for the URL, but the other changes are fine. In most cases, the LstmOnly mode provides better results.

Note that confidence changed from 0.91 to 0.93. In other words, Tesseract reports that recognition quality improved by 2%. I would say the difference is not so significant in this case.

Here is one more example of the relativity of mean confidence values. Let’s recognize text in the simple Hello.pdf document. Tesseract recognizes the text 100% correctly in all three modes (Default, TesseractAndLstm, and LstmOnly):

HELLO, PDF!

lorem ipsum dolor sit amet

However, Tesseract returns mean confidence 0.9 for Default/TesseractAndLstm modes and 0.95 for LstmOnly.

You may try the page segmentation mode option if you know the layout of the page content. For example:

using (Page recognizedPage = engine.Process(img, PageSegMode.SingleBlock))

Tesseract provides many parameters to configure the recognition process. https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html lists some useful options. Use the source code to find all supported parameters.

Use the “configFiles” argument of the TesseractEngine constructor to customize “Init only” parameters. Use the TesseractEngine.SetVariable method to change other parameter values, like this:

engine.SetVariable("textord_min_linesize", 2.5);
engine.SetVariable("lstm_choice_mode", 2);

Change the resolution of PDF page images

The sample code uses a resolution of 300 dpi to convert PDF pages to images. You can try higher resolution. For example, 600 dpi.

You will get the following result with 0.93 confidence for the Partner.pdf document in LstmOnly mode with 600x600 dpi. The result becomes worse - redundant empty lines, “https.” instead of “https:”, “eo” and “oe” for bullet characters in the list.

The result becomes even worse if you use 900 dpi resolution. The confidence will drop to 0.81. The 900 dpi image contains more detail and Tesseract mistakenly recognizes the right-bottom image as text.

Therefore, higher resolution does not guarantee a better result.

You can also try decreasing the resolution. Here is the related research.

You will get the following result with 0.94 confidence for 200 dpi resolution. Now Tesseract recognizes the URL “https://partner.microsoft.com” properly. Redundant empty lines exist here too, though.
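If you want to automate this kind of comparison, one option is to try a few candidate resolutions and keep the one with the highest mean confidence. The sketch below is a hypothetical helper: the `meanConfidenceAt` delegate is assumed to render the page at the given dpi, run OCR, and return `Page.GetMeanConfidence()`. Because mean confidence is only a relative signal, treat the winner as a starting point for manual inspection rather than a guaranteed best choice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Docotic.Pdf or the Tesseract wrapper.
public static class ResolutionPicker
{
    // Returns the candidate resolution with the highest reported mean confidence.
    // On a tie, the candidate that appears first wins.
    public static int PickBest(IEnumerable<int> candidateDpis, Func<int, float> meanConfidenceAt)
    {
        int bestDpi = -1;
        float bestConfidence = float.NegativeInfinity;
        foreach (int dpi in candidateDpis)
        {
            float confidence = meanConfidenceAt(dpi);
            if (confidence > bestConfidence)
            {
                bestConfidence = confidence;
                bestDpi = dpi;
            }
        }

        if (bestDpi < 0)
            throw new ArgumentException("No candidate resolutions were provided.", nameof(candidateDpis));

        return bestDpi;
    }
}
```

With the values reported in this section for Partner.pdf in LstmOnly mode (0.94 at 200 dpi, 0.93 at 300 and 600 dpi, 0.81 at 900 dpi), the helper would pick 200 dpi.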

Change PDF content

The code still did not recognize the word “Guide” at the right. The unsuccessful experiment with 900 dpi hints that the right-bottom image might prevent the recognition. Let’s try to replace this image with a 1px transparent image before recognition:

using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.LstmOnly))
{
    ...
    foreach (PdfImage image in page.GetImages())
    {
        // simple hack to replace the right-bottom image only
        if (image.Height == 512)
            image.ReplaceWith("1px.png");
    }

    // Save PDF page as high-resolution image
    PdfDrawOptions options = PdfDrawOptions.Create();
    options.BackgroundColor = new PdfRgbColor(255, 255, 255);
    options.HorizontalResolution = 200;
    options.VerticalResolution = 200;
    ...

After that, the code properly recognizes the phrase “Program Guide” at the right. The result looks acceptable now.

You cannot use this method for an arbitrary document. For example, the removal of an image will not work if an input PDF contains a single scanned image on a page. You cannot replace all images in the Partner.pdf document, too. The gradient at right-top is also an image. You will get unrecognizable white text “Microsoft | Partner Program” on the white background if you replace the image with a transparent one.

You can perform different preparatory actions depending on the PDF content. For example, fix the orientation of an incorrectly rotated page (sample document):

page.Rotation = PdfRotation.None;
page.Save(pageImage, options);

Or crop some unwanted content before image saving:

page.CropBox = new PdfBox(0, 0, 600, 500);
page.Save(pageImage, options);

Pre-process an image

You can help Tesseract to recognize text better by changing a page image. Read more details here: https://tesseract-ocr.github.io/tessdoc/ImproveQuality#image-processing

Use a different trained model

The code uses the default tessdata model. Tesseract 4 comes with two additional models.

Use the tessdata_fast model for faster recognition, but worse quality. Use the tessdata_best model for better quality, but slower recognition.

Try to use the tessdata_best model. Add the eng.traineddata file from tessdata_best to your project. Note that tessdata_best does not help to improve recognition for the test document Partner.pdf.

Existing trained models may not provide good results if a PDF document uses some fancy font or an unsupported language. In that case, you can train a custom model. Read more information here: https://github.com/tesseract-ocr/tessdoc#training-for-tesseract-4

Update Tesseract to the latest version

At the time of publication of the article, Tesseract 5 is in development. The 4.1.1 release is also available. Updating Tesseract to the latest version may help to improve recognition. Watch for updates here: https://github.com/tesseract-ocr/tesseract/releases

Ask the community

You can ask the Tesseract community in the Google group. There you may get good ideas on how to improve recognition quality. Also, a lot of useful topics about improving recognition quality already exist in this group.

Special cases

The code above works fine for most documents. Let’s consider cases that require additional steps.

Multilingual text

PDF documents may contain pages in more than one language. You may not know the language of the text on a specific page.

Tesseract can recognize multiple languages at once. To do this:

  • Add data model files for all document languages to the “tessdata” directory.
  • Pass all languages to the TesseractEngine constructor like this:
      using (var engine = new TesseractEngine(@"tessdata", "eng+rus", EngineMode.LstmOnly))
    

It helps to know the possible document languages in advance. Recognition speed depends on the number of languages, and every additional language slows recognition down.
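Both steps can be combined in a small helper that builds the language string and fails early when a data file is missing. This is a sketch with hypothetical names (`TessLanguages`, `Combine`). It assumes the data files follow the `<language>.traineddata` naming used for eng.traineddata above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not part of the Tesseract wrapper.
public static class TessLanguages
{
    // Builds an "eng+rus" style string after checking that every data file exists.
    public static string Combine(string tessdataDir, IEnumerable<string> languages)
    {
        List<string> list = languages.ToList();
        if (list.Count == 0)
            throw new ArgumentException("At least one language is required.", nameof(languages));

        foreach (string language in list)
        {
            string dataFile = Path.Combine(tessdataDir, language + ".traineddata");
            if (!File.Exists(dataFile))
                throw new FileNotFoundException("Missing trained model data file.", dataFile);
        }

        return string.Join("+", list);
    }
}
```

The result can be passed as the second argument of the TesseractEngine constructor, for example `TessLanguages.Combine("tessdata", new[] { "eng", "rus" })`.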

Sometimes Tesseract incorrectly recognizes multilingual text lines. In such a case, first try the methods for improving quality. If that does not help, use the following workaround: recognize the page once per language, iterate over the recognized words, and for each word choose the result with the higher confidence value. Sample code:

using (var eng = new TesseractEngine(@"tessdata", "eng", EngineMode.LstmOnly))
using (var rus = new TesseractEngine(@"tessdata", "rus", EngineMode.LstmOnly))
{
    ...
    
    using (Pix img = Pix.LoadFromFile(pageImage))
    {
        using (Page rusPage = rus.Process(img))
        using (Page engPage = eng.Process(img))
        {
            using (ResultIterator rusIter = rusPage.GetIterator())
            using (ResultIterator engIter = engPage.GetIterator())
            {
                const PageIteratorLevel Level = PageIteratorLevel.Word;
                rusIter.Begin();
                engIter.Begin();
                do
                {
                    ResultIterator bestIter = rusIter.GetConfidence(Level) > engIter.GetConfidence(Level) ? rusIter : engIter;
                    string text = bestIter.GetText(Level);
                    documentText.AppendLine(text);
                } while (rusIter.Next(Level) && engIter.Next(Level));
            }
        }
    }
}

..

PDF page contains incorrect text

There are searchable PDF documents with incorrect text. This happens when the document does not contain mappings of glyphs to Unicode, or when the mappings are incorrect. For example, U+0007 corresponds to glyph ‘A’, U+00B6 corresponds to glyph ‘B’, and so on.

The first task is to detect that the extracted text is incorrect. You can check if the extracted text corresponds to the document language:

  • Check the text for the presence of popular words (for the English language - “the”, “be”, “to”)
  • Check if characters from the alphabet of the language exist in the text and calculate the character frequency.
  • Use a third-party library to detect the language of the text.

Perform OCR when you detect that most of the text does not correspond to the language of the document. Make sure that the recognized text corresponds to the document language using the same checks you did for the original text.
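The first two checks can be sketched as a small heuristic for English text. The helper below is hypothetical: the word list extends the examples above with two more very common words, and the 0.6 letter ratio is an arbitrary default that you would need to tune on real documents.

```csharp
using System;
using System.Linq;

// Hypothetical heuristic for illustration; tune the word list and threshold on real documents.
public static class TextSanityCheck
{
    private static readonly string[] CommonWords = { "the", "be", "to", "of", "and" };

    private static readonly char[] Separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    public static bool LooksLikeEnglish(string text, double minLetterRatio = 0.6)
    {
        if (string.IsNullOrWhiteSpace(text))
            return false;

        // Share of non-whitespace characters that are basic Latin letters
        int nonSpace = text.Count(c => !char.IsWhiteSpace(c));
        int latinLetters = text.Count(c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
        if ((double)latinLetters / nonSpace < minLetterRatio)
            return false;

        // At least one very common English word must be present
        string[] words = text.ToLowerInvariant().Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        return words.Any(word => CommonWords.Contains(word));
    }
}
```

In the extraction loop, a page would go to OCR when `TextSanityCheck.LooksLikeEnglish(searchableText)` returns false, and the same check can then be applied to the recognized text.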

Conclusion

Use Docotic.Pdf and Tesseract libraries for text recognition in non-searchable PDF documents.

Samples on GitHub:
