Archive for the ‘PDF Library’ category

Convert PDF to CMYK images. Or HTML to PDF in Azure

We have released Docotic.Pdf 8.2 and its add-ons on our site and NuGet.

In this version, we added the ability to save PDF pages as CMYK JPEG and CMYK TIFF images, and the ability to specify the JPEG compression quality for both RGB and CMYK output. Please check the Version History document linked below for links to the new methods.

We also added some new capabilities to the free HTML to PDF add-on. You can use a pre-installed Chromium, download Chromium to a custom location, or disable the Chromium sandbox. These options are intended for converting HTML to PDF in Azure Functions and Azure App Services.

In response to a customer request, we implemented methods to remove paths and images from PDF pages. We also added code samples for these methods.

To provide a better API, we had to make some breaking changes. We changed the return type of the PdfPageObject.Layer, PdfXObject.Layer, and PdfWidget.Layer properties. The PdfSaveOptions.Version property and the PdfFont.Unembed() method now work a bit differently. See the updated documentation for these methods and properties for more information.

As always, we fixed quite a few bugs. The fixes relate to font parsing, image extraction, text extraction, memory consumption, and many other areas.

Read about all new features and improvements in Docotic.Pdf 8.2 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version by e-mail or via the support form. Don’t hesitate to send us your questions, suggest features, or ask for help.

Convert HTML to PDF using Docotic.Pdf and the new add-on

Docotic.Pdf 8.1 is ready to use. You can download it and its add-ons from our site and NuGet.

The main news is the introduction of the free BitMiracle.Docotic.Pdf.HtmlToPdf add-on for HTML to PDF conversion. The add-on is available on NuGet and in the zip we distribute from our site.

The add-on can produce PDF documents from even the most complex HTML documents with scripts and styles because it uses Chromium under the hood. Its web standards compliance is great (please try it for yourself). To showcase the capabilities of the new add-on, we prepared an article and a new group of code samples.

In version 8.1 we introduce the ability to access low-level PDF page dictionaries. We have added a new hierarchy of types for COS objects. With this ability, you can implement support for custom keys and values in page dictionaries.

Docotic.Pdf 8.1 adds support for 3D and Rich Media annotations. Now the library can create, read, and modify these types of annotations.

The new version draws pages, extracts page objects, and resizes images faster than before. We also fixed bugs related to the parsing of PDFs larger than 2 GB.

Read about all new features and improvements in Docotic.Pdf 8.1 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version by e-mail or via the support form. Don’t hesitate to send us your questions, suggest features, or ask for help.

Docotic.Pdf no longer depends on System.Drawing

Docotic.Pdf 8.0 is available on our site and NuGet.

This is the first version that does not depend on System.Drawing APIs. This change will ease the use of the library in ASP.NET and ASP.NET Core. Removing this dependency also provides significant improvements for everyone who uses the library in Linux and macOS environments. AWS Lambda functions also benefit from the change.

At the same time, it is still possible for our customers to use the library to draw a PDF document on a System.Drawing.Graphics surface, for example. Alongside this release, we introduce the free BitMiracle.Docotic.Pdf.Gdi add-on for anyone who really needs to interoperate with System.Drawing APIs. The add-on is available on NuGet and in the zip we distribute from our site.

Docotic.Pdf 8.0 brings support for OpenType CFF fonts. This is something our customers asked us about. There are other changes made in response to customer requests.

Read about all new features and improvements in Docotic.Pdf 8.0 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version by e-mail or via the support form. Don’t hesitate to send us your questions, suggest features, or ask for help.

Encryption API changes in Docotic.Pdf 7.5

We have released Docotic.Pdf 7.5 on our site and NuGet.

We made a lot of changes and improvements to the library’s encryption API in this release. And there is one more very important change: the library can now extract right-to-left and bidirectional text in the correct order.

Starting from version 7.5, the library can inspect and decrypt certificate-protected documents. And it is now possible to encrypt any PDF document with one or more certificates.

The new features required us to make a lot of changes to the existing encryption API. We added new classes for different types of encryption and decryption handlers. There is also a new, clearer way to check if a PDF document is encrypted. To ease migration from the older API, we added the 2021 Encryption API Migration Guide.

We changed text extraction methods in PdfDocument, PdfPage, and PdfCanvas to extract right-to-left and bidirectional text according to the logical order. From now on, these methods also normalize Hebrew and Arabic codepoints from Alphabetic and Arabic Presentation Forms. The text extraction methods now better process column-based and tabular layouts.
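
To give a feel for what normalizing presentation forms means, here is a minimal sketch that uses the standard .NET string.Normalize method with Unicode compatibility normalization (NFKC). This is an assumption chosen for illustration: the sketch does not call Docotic.Pdf, and the library's own mapping may differ in details. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

public static class PresentationFormsSketch
{
    // Illustration only: maps presentation-form code points to their base
    // letters with built-in NFKC normalization. This is an assumed stand-in,
    // not the Docotic.Pdf implementation.
    public static string NormalizeForms(string text)
    {
        return text.Normalize(NormalizationForm.FormKC);
    }

    // Formats a string as a list of UTF-16 code units, e.g. "U+0644 U+0627".
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        string arabicLigature = "\uFEFB"; // ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
        string hebrewLigature = "\uFB4F"; // HEBREW LIGATURE ALEF LAMED

        Console.WriteLine(Describe(NormalizeForms(arabicLigature))); // U+0644 U+0627
        Console.WriteLine(Describe(NormalizeForms(hebrewLigature))); // U+05D0 U+05DC
    }
}
```

After normalization, a search for the ordinary letters finds the text even when the PDF stores ligatures or positional forms.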

The new version also includes several smaller features. We added some new code samples and updated some existing ones. And we fixed quite a few bugs.

A lot of properties and methods were marked obsolete in the new version. In all cases, there is a new way to achieve the same result.

Read about all new features and improvements in Docotic.Pdf 7.5 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version by e-mail or via the support form. Don’t hesitate to send us your questions, suggest features, or ask for help.

Image compression improvements in Docotic.Pdf 7.4

Docotic.Pdf 7.4 is now available on our site and on NuGet.

The new release adds the ability to recompress images with stencil and soft masks. It is now also possible to resize masked images. You can now use the JPEG 2000 compression scheme while resizing images. The new version can compress images with Indexed or Gray color spaces more efficiently. We updated the Compress PDF document in .NET and Optimize PDF images in C# and VB.NET code samples to use the latest recommended image optimization approaches.

The new version extracts text faster. We made some important changes to improve performance in this area. Thanks to the customers who sent in great test files!

With Docotic.Pdf 7.4, it is possible to add a timestamp to any digital signature. It is also possible to retrieve and verify embedded timestamps from existing signatures. To illustrate the changes, we added the new Sign PDF document and embed a timestamp in C# and VB.NET code sample. We also updated the existing Read PDF signature properties in C# and VB.NET and Verify PDF signature in C# and VB.NET code samples with the new timestamp-related features.

This release contains bug fixes for processing of Lab*, Indexed, and Separation color spaces, as well as fixes for text measurement, drawing, and extraction. The new version contains other important bug fixes, too.

Read about all new features and improvements in Docotic.Pdf 7.4 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version by e-mail or via the support form. Don’t hesitate to send us your questions, suggest features, or ask for help.

Fixes for handling of disposable objects in Docotic.Pdf 7.3

We released Docotic.Pdf 7.3 on our site and on NuGet.

In this release, we fixed some parts of the library that didn’t properly dispose of streams. These are quite important fixes, so we recommend that everyone update to the latest version of the library.

With the new release, we are moving closer to our goal of getting rid of System.Drawing and GDI+ dependencies in Docotic.Pdf completely. Starting from version 7.3, the library no longer uses System.Drawing and GDI+ when resizing images, detecting which parts of text are invisible, or processing certain soft mask images. Also, we marked some methods, constructors, and operators that depend on System.Drawing types as obsolete. For every entity that is now obsolete, the library provides another way to achieve the same result.

Docotic.Pdf 7.3 can be used from Blazor and HoloLens projects. After some changes on our side, the corresponding tools can properly process the library.

This release also contains bug fixes for text extraction and drawing (including drawing with some tricky CJK fonts).

Read about all new features and improvements in Docotic.Pdf 7.3 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version by e-mail or via the support form. Don’t hesitate to send us your questions, suggest features, or ask for help.

Support for more logging platforms in Docotic.Pdf 7.2

Hello,

We have released Docotic.Pdf 7.2 on our site and on NuGet.

Starting from the new release, the library can automatically detect and attach to logging frameworks. NLog, log4net, Serilog, and Loupe loggers are supported. For example, if your solution uses NLog, you don’t need to do anything extra: Docotic.Pdf will output its log messages to the configured loggers. We also added two new samples, Logging with NLog and Logging with log4net, to illustrate how it works.

We continue our efforts to get rid of System.Drawing and GDI+ dependencies in Docotic.Pdf completely. Starting from version 7.2, the library no longer uses System.Drawing and GDI+ when saving (extracting) images “as painted”. This also improves the quality of the extracted images because there is no more unwanted image scaling. Previously, the images were scaled due to the difference in resolution between PDF and GDI+ (72 vs. 96 dots per inch).
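
The scaling comes from simple arithmetic: PDF user space uses 72 units per inch by default, while GDI+ typically works at 96 dots per inch. Here is a minimal sketch of that conversion. The class and method names are hypothetical and are not part of Docotic.Pdf.

```csharp
using System;

public static class ResolutionSketch
{
    // Default PDF user space resolution: 72 units (points) per inch.
    public const double PdfUnitsPerInch = 72.0;

    // Converts a length in PDF points to device pixels at the given resolution.
    public static double PointsToPixels(double points, double deviceDpi)
    {
        return points * deviceDpi / PdfUnitsPerInch;
    }

    public static void Main()
    {
        // An image that is 300 pixels wide and painted 300 points wide
        // would occupy 400 pixels on a 96 DPI surface, so it would be
        // stretched by a factor of 96 / 72.
        Console.WriteLine(PointsToPixels(300, 96)); // 400
    }
}
```

Saving the image at its own pixel dimensions avoids this resampling step and the quality loss that comes with it.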

This release also contains bug fixes for processing of images and parsing of XMP metadata.

Read about all new features and improvements in Docotic.Pdf 7.2 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version by e-mail or via the support form. Don’t hesitate to send us your questions, suggest features, or ask for help.

Docotic.Pdf 7.1 can compress certain PDFs better. And there are other improvements too.

Hi,

Docotic.Pdf 7.1 is now available on our site and on NuGet.

In this release, we added new PdfDocument.ReplaceDuplicateObjects methods. In addition to the previous ability to replace duplicate fonts, the new methods can deduplicate non-inline images, color spaces, patterns, and shading objects. These methods are useful when you are trying to reduce the output file size. The new methods give good results for documents that were incrementally updated or created by merging several documents that contain the same objects.
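
To illustrate the general idea of deduplication, here is a conceptual sketch that groups objects by a hash of their bytes and points every duplicate at the first identical object. It is only an assumption about how such a technique can work in principle. It does not use Docotic.Pdf and is not a description of how PdfDocument.ReplaceDuplicateObjects is implemented. All names in the sketch are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class DuplicateObjectSketch
{
    // For each object, returns the index of the first object with identical
    // bytes. Objects that map to a smaller index are duplicates and could be
    // dropped once all references are redirected.
    public static int[] MapToFirstOccurrence(IReadOnlyList<byte[]> objects)
    {
        var firstByHash = new Dictionary<string, int>();
        var map = new int[objects.Count];
        using (var sha = SHA256.Create())
        {
            for (int i = 0; i < objects.Count; i++)
            {
                // Use the content hash as the lookup key.
                string key = Convert.ToBase64String(sha.ComputeHash(objects[i]));
                if (!firstByHash.TryGetValue(key, out int first))
                {
                    first = i;
                    firstByHash.Add(key, i);
                }
                map[i] = first;
            }
        }
        return map;
    }

    public static void Main()
    {
        var objects = new List<byte[]>
        {
            new byte[] { 1, 2, 3 }, // e.g. an image stream
            new byte[] { 9, 9 },    // a different object
            new byte[] { 1, 2, 3 }, // same bytes as object 0
        };
        Console.WriteLine(string.Join(", ", MapToFirstOccurrence(objects))); // 0, 1, 0
    }
}
```

Merged documents tend to benefit most because each source file usually brings its own copy of shared fonts, logos, and color spaces.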

We also added new signature appearance options. Now it is possible to add an image to a signature. You can also specify the alignment of the text inside a signature. It is also possible to hide all the text inside a signature if you don’t need it.

The new version can save whole PDF files or individual PDF pages as grayscale images. This usually produces smaller images. If you are interested, please take a look at the new ImageCompressionOptions.CreateGrayscaleJpeg, ImageCompressionOptions.CreateGrayscalePng, and ImageCompressionOptions.CreateGrayscaleTiff methods.

There are two breaking changes in version 7.1. One affects the way the library draws glyphs with zero width, and the other is about background and border colors of a control.

This release also contains bug fixes for text and image extraction, drawing of documents, and other areas.

Read about all new features and improvements in Docotic.Pdf 7.1 in the Version History document.

We encourage you to download and try the new version. This version is also available on NuGet.

Please tell us your thoughts about the new version by e-mail or via the support form. Don’t hesitate to send us your questions, suggest features, or ask for help.

OCR PDF in C# and VB.NET

Text extraction is one of the most popular PDF processing tasks. You would need to extract text from a PDF document if you want to:

  • index the document for full-text search
  • parse some data like names and prices
  • highlight, delete, or replace a word or a phrase

You can extract text manually: open a document in any PDF viewer, then select and copy some text. This works properly for most documents. Such documents are known as “searchable PDFs”. Searchable PDF documents render text using special PDF operators and contain correct mappings of glyphs to Unicode in font objects associated with the text.

Many PDF libraries can extract text from searchable PDF documents.

There are also non-searchable PDF documents. Non-searchable documents usually render text as a raster image. A typical example is a scanned PDF document. Non-searchable PDF documents may also render text as vector paths without using fonts or special PDF operators.

You need to perform optical character recognition (OCR) to extract text from non-searchable PDF documents. OCR does not guarantee correct results in 100% of cases. Results depend on the document’s quality and the recognition algorithm. Also, optical recognition is much slower than the extraction of text from searchable documents.

Let’s look at how to perform OCR and extract text from PDF documents in C# and VB.NET applications.

Read more

Article about extracting text

Previously we had an article about extracting text posted here. That article is now placed here.
