Optimizations in Docotic.Pdf 3.6

Hello!

Our customers told us that Docotic.Pdf does not always behave modestly. The library tends to consume a lot of memory for large files and often spends a long time on some operations.

We’ve done a lot to make the new version of Docotic.Pdf faster and less memory-hungry. Now I want to share some statistics on the results of our efforts.

To see what we achieved, we took five PDF files and ran some tests on them. Here is a description of the files:

File name                                 Page count   File size   Contents
emerging.pdf                              6            94 KB       only text
rdsolr1907.pdf                            111          2.03 MB     mostly text, some images, linearized
official_journal_10022006.pdf             705          20 MB       mostly text, some images, linearized
LargePDFFile.pdf                          4800         34 MB       mostly text, some images, linearized
OReilly.Head.First.C.Sharp.Nov.2007.pdf   765          146 MB      mostly scanned images

For a start, we measured how much time and memory are required just to open a file. The table below shows the results relative to the previous version (negative values mean less time or memory):

Open only
File name                                 Time, %   Memory consumption, %
emerging.pdf                              -13       -51
rdsolr1907.pdf                            -44       -55
official_journal_10022006.pdf             -87       -95
LargePDFFile.pdf                          -91       -83
OReilly.Head.First.C.Sharp.Nov.2007.pdf   -31       -53

It’s nice to see that opening PDF files is now about 2 times faster and takes about 3 times less memory (on average). For larger files, the improvements are even more pronounced.
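For readers who want to check how the percentages relate to the "about 2 times" and "about 3 times" figures, here is a minimal sketch of the arithmetic. It assumes the average is a plain arithmetic mean of the per-file percentages, which is our guess and not something the measurements above state; the helper name improvement_factor is made up for illustration.

```python
from statistics import mean

# Relative changes in percent, copied from the "Open only" table:
# file name -> (time change, memory change)
open_only = {
    "emerging.pdf": (-13, -51),
    "rdsolr1907.pdf": (-44, -55),
    "official_journal_10022006.pdf": (-87, -95),
    "LargePDFFile.pdf": (-91, -83),
    "OReilly.Head.First.C.Sharp.Nov.2007.pdf": (-31, -53),
}


def improvement_factor(change_percent):
    """Convert a relative change such as -50 (%) into a factor such as 2.0."""
    remaining = 1 + change_percent / 100  # fraction of the old cost still needed
    return 1 / remaining


# Assumption: a simple mean over the five files.
avg_time = mean(t for t, _ in open_only.values())
avg_memory = mean(m for _, m in open_only.values())

print(f"time:   {avg_time:.1f}% on average, factor {improvement_factor(avg_time):.1f}")
print(f"memory: {avg_memory:.1f}% on average, factor {improvement_factor(avg_memory):.1f}")
```

Note that a change of -50% corresponds to a factor of 2, while -67% is already a factor of about 3, so small differences in large percentages translate into big differences in the factor.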

But how does the library behave in more complex scenarios?

Next, we took the same files and measured the time and memory required to open a PDF and extract formatted text from all of its pages. Below are the results:

Open and extract all text with formatting
File name                                 Time, %   Memory consumption, %
emerging.pdf                              -10       -33
rdsolr1907.pdf                            -70       -26
official_journal_10022006.pdf             -59       -39
LargePDFFile.pdf                          -66       -39
OReilly.Head.First.C.Sharp.Nov.2007.pdf   -54       -31

And again, the whole process took about half the time (on average). The memory gains are less impressive this time, but about 30% less memory (on average) is still not bad at all.

The last test is simple but represents a real-life scenario. We measured the time and memory required to open a PDF, encrypt it with 128-bit AES, and save it. Below are the results:

Open, encrypt with 128-bit AES and save
File name                                 Time, %   Memory consumption, %
emerging.pdf                              -17       -69
rdsolr1907.pdf                            -42       -9
official_journal_10022006.pdf             -84       -70
LargePDFFile.pdf                          -69       -41
OReilly.Head.First.C.Sharp.Nov.2007.pdf   -19       -69

In this case, the whole process took about half the time and half the memory (on average).

We think such improvements won’t go unnoticed by our customers. We also have some ideas about how to improve the library further, so we will continue to profile and improve Docotic.Pdf.

Please feel free to share your thoughts about the recent improvements.