Latest news

OCR PDF in C# and VB.NET

Text extraction is one of the most popular PDF processing tasks. You need to extract text from a PDF document when you want to:

  • index the document for full-text search
  • parse some data like names and prices
  • highlight, delete, or replace a word or a phrase

You can extract text manually: open a document in any PDF viewer, then select and copy some text. This works properly for most documents. Such documents are known as “searchable PDFs”. Searchable PDF documents render text using special PDF operators and contain correct mappings of glyphs to Unicode in the font objects associated with the text.

Many PDF libraries can extract text from searchable PDF documents.

There are also non-searchable PDF documents. Non-searchable documents usually render text as a raster image. A typical example is a scanned PDF document. Non-searchable PDF documents may also render text as vector paths without using fonts or special PDF operators.

You need to perform optical character recognition (OCR) to extract text from non-searchable PDF documents. OCR does not guarantee correct results in 100% of cases. Results depend on the document’s quality and the recognition algorithm. Also, optical recognition is much slower than the extraction of text from searchable documents.

Let’s look at how to perform OCR and extract text from PDF documents in C# and VB.NET applications.

Read more


Article about extracting text

Previously, we had an article about extracting text posted here. That article has since moved to a new location.


Docotic.Pdf 7.0 with support for digital signatures

Hello,

We have published Docotic.Pdf 7.0 on our site and on NuGet.

The main feature of this release is support for digital signatures. The library can sign new and existing PDF documents. To sign a document, use one of the PdfDocument.SignAndSave() methods. You can create signatures of different types, in different formats, using different digest algorithms. For the complete set of properties, take a look at the new PdfSigningOptions type.

The library can also verify existing digital signatures. It can verify whether the digest (hash) is valid, check whether a signature contains embedded OCSP or CRL revocation data, and check whether the signing certificate is revoked. You can also access the properties of the signing and issuer certificates. All of this is available via the PdfSignature.Contents property.

We created the Digital signatures group of samples to demonstrate the new capabilities.

Starting from version 7.0, the library no longer uses System.Drawing.Bitmap when drawing images. This and other improvements increase the stability of ASP.NET applications that perform PDF to image conversion. The library also consumes less memory when drawing PDF documents.

This release also contains bug fixes for text and image extraction, drawing of documents, and processing of forms and annotations.

Read about all new features and improvements in Docotic.Pdf 7.0 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version using e-mail or via the support form. Don’t hesitate to write us your questions, suggest features or ask for help.


New PDF rendering engine in Docotic.Pdf 6.0

Hi,

We have published a new major release of the Docotic.Pdf library.

Docotic.Pdf 6.0 brings a new PDF rendering engine that does not depend on the System.Drawing.Graphics class. The new engine greatly improves PDF to image conversion in ASP.NET applications and in Linux and Mac OS environments. This is a major step toward removing the dependency on System.Drawing. We will continue improving in this area in future releases.

Along with the rendering engine change, we improved the PdfPage.Save() method. The method now produces 24bpp images instead of 32bpp images when the background is opaque. In most cases, that leads to smaller output files.

We marked the methods of PdfCanvas, PdfDocumentView, and PdfPage that accept parameters of types from the System.Drawing namespace as obsolete. Those methods will be removed in the next release of Docotic.Pdf. For each of the now obsolete methods there is at least one replacement overload. Please use those overloads instead of the obsolete methods.

This release also includes a change our customers asked for: we added the PdfTextExtractionOptions.Rectangle property. The property is useful when you want to extract text from only a part of a page.
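As a rough illustration, extracting text from one region of a page might look like the sketch below. Only the PdfTextExtractionOptions.Rectangle property is named in this post. The PdfPage.GetText(options) overload, the PdfRectangle(x, y, width, height) constructor, the file name, and the region values are assumptions made for the example, so please check them against the API reference for your version.

```csharp
using System;
using BitMiracle.Docotic.Pdf;

class ExtractTextFromRegionSketch
{
    static void Main()
    {
        // "input.pdf" is a placeholder path used only for this sketch.
        using (var pdf = new PdfDocument("input.pdf"))
        {
            var options = new PdfTextExtractionOptions
            {
                // Assumed constructor order: x, y, width, height.
                // The values are arbitrary; pick the area you need.
                Rectangle = new PdfRectangle(0, 0, 300, 200)
            };

            // Assumed overload: GetText accepting extraction options.
            string text = pdf.Pages[0].GetText(options);
            Console.WriteLine(text);
        }
    }
}
```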

We made the LicenseManager class thread-safe, so you can now use it from multiple threads at the same time. We still recommend adding all license data at the start of your application. See the remarks for the LicenseManager.AddLicenseData method.

Read about all new features and improvements in Docotic.Pdf 6.0 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version using e-mail or via the support form. Don’t hesitate to write us your questions, suggest features or ask for help.


FIPS compliance, new annotation properties and a lot of bug fixes in Docotic.Pdf 5.10

Hello,

We have released Docotic.Pdf 5.10 on NuGet and on our site.

In this release we changed the library to be as FIPS-compliant as possible. In fact, this is the first release you can actually use in FIPS mode. When running on a machine with FIPS mode enabled, the library cannot use older algorithms that are not FIPS-compliant. This means it cannot encrypt or decrypt documents with the RC4 algorithm. Other functions, like drawing or text extraction, will work just fine.

Version 5.10 brings a lot of new properties for annotation classes. We extended PdfCaretAnnotation, PdfEllipseAnnotation, PdfFreeTextAnnotation, PdfFileAttachmentAnnotation, PdfInkAnnotation, PdfLineAnnotation, PdfPolygonAnnotation, PdfPolylineAnnotation, PdfPopupAnnotation, PdfRectangleAnnotation, PdfSoundAnnotation, PdfStampAnnotation, PdfTextMarkupAnnotation, and PdfTextAnnotation. And we added one property to the base PdfWidget class, too.

As usual, we increased the speed of PDF drawing and improved support for PDFs with broken or incorrect structure. We also added new code samples that show how to OCR PDF documents.

This release also contains a lot of bug fixes. The fixes cover areas such as drawing, text extraction, parsing, and editing of annotations and controls, among others.

Read about all new features and improvements in Docotic.Pdf 5.10 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version using e-mail or via the support form. Don’t hesitate to write us your questions, suggest features or ask for help.


Multithreaded JPEG 2000 decoder

Hello,

We have just released a new version of Jpeg2000.Net on our site and on NuGet.

While working on version 2.1, we focused on decoding speed improvements. As a result, the new version of the library decodes images faster. To make it even faster, we added support for multithreaded decoding. The new J2kDecodingOptions.ThreadCount property is the starting point if you want to decode JPEG 2000 images in multiple threads.

We also fixed some issues related to encoding and decoding of JPEG 2000 images.

We encourage you to download and try the new version of Jpeg2000.Net. The library is also available on NuGet.

Please tell us your thoughts about the library using e-mail or via the support form. Don’t hesitate to write us your questions, suggest features or ask for help.


Ability to replace images, faster JPEG 2000 decoder, and support for drawing more annotation types in Docotic.Pdf 5.9

Hi,

We have a new Docotic.Pdf release ready.

Docotic.Pdf 5.9 adds the ability to replace images. For this we added the PdfImage.ReplaceWith methods. The new Replace image sample should give you enough information about the new ability.

We decided to make it more obvious that inline images cannot be recompressed or replaced by the library. Therefore, the corresponding methods now throw UnsupportedImageException when used on an inline image. You can avoid unnecessary exceptions by checking the PdfImage.IsInline property before trying to modify an image. Or you can move inline images to resources first by using one of the PdfCanvas.MoveInlineImagesToResources methods. Please note that moving inline images to resources can increase file size.
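The check described above could be sketched roughly as follows. PdfImage.IsInline and PdfImage.ReplaceWith are named in this post. The PdfDocument.GetImages() call, the ReplaceWith overload that takes a file path, and the file names are assumptions made for the example, so treat this as an outline rather than the definitive usage.

```csharp
using BitMiracle.Docotic.Pdf;

class ReplaceImagesSketch
{
    static void Main()
    {
        // "input.pdf", "replacement.jpg", and "output.pdf" are placeholders.
        using (var pdf = new PdfDocument("input.pdf"))
        {
            // Assumed way to enumerate the images in the document.
            foreach (PdfImage image in pdf.GetImages())
            {
                // Inline images cannot be replaced; skipping them
                // avoids an UnsupportedImageException.
                if (image.IsInline)
                    continue;

                // Assumed overload: ReplaceWith accepting a file path.
                image.ReplaceWith("replacement.jpg");
            }

            pdf.Save("output.pdf");
        }
    }
}
```

Skipping inline images keeps the file size unchanged for them. Moving them to resources first would make them replaceable, at the cost of a possibly larger file.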

We added support for drawing more annotation types: caret, ellipse, ink, line, movie, 3D, polygon, polyline, printer mark, rich media, screen, signature, text markup, trap network, and watermark annotations.

Version 5.9 decodes JPEG 2000 images faster than any previous version. This is because of the optimizations we made to the JPEG 2000 decoder.

Besides the Replace image sample, we added the Find and highlight text and the Header and Footer samples. We also extended the Copy text, paths and images sample.

In this release we fixed bugs related to drawing and extraction of text and images, along with quite a few other issues. As always, we improved support for broken and incorrect documents.

Read about all new features and improvements in Docotic.Pdf 5.9 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version using e-mail or via the support form. Don’t hesitate to write us your questions, suggest features or ask for help.


Docotic.Pdf 5.8 brings text drawing and extraction improvements

Hello,

We have released Docotic.Pdf 5.8 on our site and on NuGet.

When using fonts embedded in PDFs, the latest version draws and extracts text significantly better. This is because we improved handling of fonts and fixed issues related to text extraction.

The new version adds the ability to provide a custom font loader for non-embedded fonts. This is helpful when the library has no access to GDI+, for example, when running in AWS Lambda and similar environments. Take a look at the new PdfConfigurationOptions.FontLoader property. We also added the DirectoryFontLoader class as an implementation of a directory-based font loader.

There is yet another important improvement. Docotic.Pdf 5.8 brings the new PdfDocument.RemoveUnusedResources() method. This method removes references to unused page and XObject resources. It helps to reduce file size when a document contains pages or XObjects with unused resources.
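A minimal sketch of how the method might fit into a load, clean, and save flow is shown below. The file names are placeholders, and the actual size reduction depends on the document.

```csharp
using BitMiracle.Docotic.Pdf;

class RemoveUnusedResourcesSketch
{
    static void Main()
    {
        // "input.pdf" and "output.pdf" are placeholder paths.
        using (var pdf = new PdfDocument("input.pdf"))
        {
            // Drop references to page and XObject resources
            // that are not used by any content.
            pdf.RemoveUnusedResources();

            pdf.Save("output.pdf");
        }
    }
}
```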

As always, we improved support for broken and incorrect documents. And we fixed some bugs of our own.

Read about all new features and improvements in Docotic.Pdf 5.8 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version using e-mail or via the support form. Don’t hesitate to write us your questions, suggest features or ask for help.


Jpeg2000.Net 2.0 brings speed and memory consumption improvements

Hello,

We have just released a new major version of Jpeg2000.Net on our site and on NuGet.

The new release contains significant improvements. Version 2.0 of the library encodes and decodes images much faster.

In addition to the speed improvements, the library now consumes less memory when it decodes images. Whether you decode a whole image at once or only a part of it, the library completes decoding using less time and memory.

We also fixed some issues related to encoding and decoding of JPEG 2000 images.

We encourage you to download and try the new version of Jpeg2000.Net. The library is also available on NuGet.

Please tell us your thoughts about the library using e-mail or via the support form. Don’t hesitate to write us your questions, suggest features or ask for help.


Extract text from PDF on AWS Lambda in C# .NET

Since version 5.7.9279, Docotic.Pdf can extract text from PDFs when running in the AWS Lambda environment. This is true for PDFs with both embedded and non-embedded fonts. To make this possible, we added the ability to use a custom font loader for non-embedded fonts.

Let’s make a simple C# .NET Core application that extracts text from a PDF document and publish it to AWS Lambda.

Read more
