PdfTextExtractionOptions.UnmappedCharacterCodeHandler Property

Gets or sets the handler for character codes that cannot be mapped to Unicode using data from the corresponding PDF font.

Namespace:  BitMiracle.Docotic.Pdf
Assembly:  BitMiracle.Docotic.Pdf (in BitMiracle.Docotic.Pdf.dll)

Syntax

C#
public PdfCharacterCodeToUnicodeMapper UnmappedCharacterCodeHandler { get; set; }
VB
Public Property UnmappedCharacterCodeHandler As PdfCharacterCodeToUnicodeMapper
	Get
	Set

Property Value

Type: PdfCharacterCodeToUnicodeMapper
The handler that maps a PDF character code to Unicode. Cannot be null.

Remarks

PDF font objects usually define how to map character codes to corresponding Unicode values. However, some PDF producers create PDF files where font objects do not include such data. Using this property, you can instruct the library on how to map character codes from incomplete font objects.

For example, you can save the glyph defined by the character code as an image using a PdfTextRasterizer object and then run OCR on that image. See the OCR PDF in C# and VB.NET article for ideas on how to implement OCR.

The default handler returns the input character code as the Unicode value:

(charCode) => ((char)charCode.Value).ToString(CultureInfo.InvariantCulture);
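
To illustrate what the default handler's conversion does, here is a minimal sketch that applies the same cast to a plain integer. The int variable is only a stand-in for charCode.Value; the actual type of that property is an assumption here.

```csharp
using System;
using System.Globalization;

class DefaultHandlerDemo
{
    static void Main()
    {
        // Stand-in for charCode.Value (assumed to be convertible to char).
        int value = 0x41;

        // Same expression as in the default handler: the code is treated
        // as a UTF-16 code unit and turned into a one-character string.
        string text = ((char)value).ToString(CultureInfo.InvariantCulture);

        Console.WriteLine(text); // prints "A"
    }
}
```

In other words, the default handler is only accurate when the font's character codes happen to coincide with Unicode code points.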

You can use the following handler to map character codes to a fixed Unicode value ('?' in this example):

(charCode) => "?";

Use the following handler if you do not want to extract text for unmapped character codes:

(charCode) => null;
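
The handlers above can be plugged into text extraction roughly as follows. This is a minimal sketch, not a definitive implementation: the file name "input.pdf" is hypothetical, and the sketch assumes that PdfTextExtractionOptions can be created with an object initializer and that PdfDocument.GetText has an overload accepting the options object.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class Program
{
    static void Main()
    {
        // "input.pdf" is a hypothetical file name used for illustration.
        using (var pdf = new PdfDocument("input.pdf"))
        {
            var options = new PdfTextExtractionOptions
            {
                // Replace every character code that the font cannot map
                // to Unicode with a question mark.
                UnmappedCharacterCodeHandler = (charCode) => "?"
            };

            // Assumption: GetText accepts the options object.
            string text = pdf.GetText(options);
            Console.WriteLine(text);
        }
    }
}
```

Swapping the lambda for (charCode) => null would omit the unmapped characters from the extracted text instead of marking them.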

See Also