Extract text from PDF on AWS Lambda

Since version 5.7.9279, Docotic.Pdf can extract text from PDFs when running in the AWS Lambda environment. This works for PDFs with both embedded and non-embedded fonts. To make this possible, we added the ability to use a custom font loader for non-embedded fonts.

Let’s make a simple .NET Core application that extracts text from a PDF document and publish it to AWS Lambda.

Prerequisites

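To complete the steps described in this article, you will need Visual Studio with the AWS Toolkit for Visual Studio installed (it provides the AWS Lambda project templates and the publish command used below) and an AWS account to deploy the function to.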

Create AWS Lambda project

In Visual Studio, choose the “AWS Lambda Project (.NET Core)” project template from the “Visual C# -> AWS Lambda” group.

Choose “Empty function” in the “Select Blueprint” popup.

Extract text from PDF document using Docotic.Pdf library

Add the Docotic.Pdf NuGet package. Select version 5.7.9279-dev (from the prerelease channel) or newer.

Add a PDF document with non-embedded TrueType/OpenType fonts to the project. For example, you can use this document.

Set the “Copy to Output Directory” property of the PDF document to “Copy always”.

Use the following code in Function.cs:

using Amazon.Lambda.Core;
using BitMiracle.Docotic.Pdf;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]

namespace ExtractTextOnAwsLambda
{
    public class Function
    {
        public string FunctionHandler(ILambdaContext context)
        {
            // NOTE: 
            // When used in trial mode, the library imposes some restrictions.
            // Please visit https://bitmiracle.com/pdf-library/trial-restrictions.aspx
            // for more information.
            
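            // GDI+ is not available on AWS Lambda, so the default GdiFontLoader cannot be used.
            // Load non-embedded fonts from the shared Linux font directory instead.
            PdfConfigurationOptions config = PdfConfigurationOptions.Create();
            config.FontLoader = new DirectoryFontLoader(new[] { "/usr/share/fonts" }, true);
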
            using (var pdf = new PdfDocument("Attachments.pdf", config))
            {
                return pdf.GetTextWithFormatting();
            }
        }
    }
}

Replace “Attachments.pdf” with the name of the PDF file you actually use.

How the code works

Docotic.Pdf uses the GdiFontLoader class by default. However, GDI+ is not installed on AWS Lambda. Because of this, the GdiFontLoader.Load method always throws a TypeInitializationException when the code runs in the AWS Lambda environment.

That is why we use a custom font loader in the code above. The DirectoryFontLoader class scans the specified directories and loads font bytes. In this sample, we use the shared Linux fonts. Alternatively, you can deploy some common fonts with your application and point DirectoryFontLoader to them.
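
The alternative of shipping fonts with the application might look like the following minimal sketch. It assumes a hypothetical “fonts” folder in the project whose files are copied to the output directory (for example, with the same “Copy always” setting used for the PDF). The FontDirectories helper and the folder name are illustrative and are not part of Docotic.Pdf; the only library calls used are the ones already shown in Function.cs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using BitMiracle.Docotic.Pdf;

namespace ExtractTextOnAwsLambda
{
    // Hypothetical helper for this article; not part of Docotic.Pdf.
    public static class FontDirectories
    {
        // Assumption: fonts are deployed in a "fonts" folder next to the function assembly.
        public const string BundledFolderName = "fonts";
        public const string SharedLinuxFonts = "/usr/share/fonts";

        // Lists the bundled folder (only when it exists), followed by the shared Linux fonts.
        public static string[] Resolve(string baseDirectory)
        {
            var directories = new List<string>();

            string bundled = Path.Combine(baseDirectory, BundledFolderName);
            if (Directory.Exists(bundled))
                directories.Add(bundled);

            directories.Add(SharedLinuxFonts);
            return directories.ToArray();
        }

        // Builds configuration options the same way as Function.cs,
        // but passes the resolved directories to DirectoryFontLoader.
        public static PdfConfigurationOptions CreateConfig()
        {
            PdfConfigurationOptions config = PdfConfigurationOptions.Create();
            // The second argument is kept exactly as in the sample above.
            config.FontLoader = new DirectoryFontLoader(Resolve(AppContext.BaseDirectory), true);
            return config;
        }
    }
}
```

With such a helper in place, the handler could open the document with new PdfDocument("Attachments.pdf", FontDirectories.CreateConfig()) instead of building the options inline.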

Deploy and test the function

It’s time to test the function on AWS Lambda. Right-click the project in Solution Explorer and select “Publish to AWS Lambda…”.

In the “Upload Lambda Function” dialog, enter the name of your function. Then click “Upload”.

After a successful deployment, run the uploaded function. You will see that the text is properly extracted.

Source code

The complete project for this article is available on GitHub.