PDFsharp & MigraDoc Foundation • View topic - Fixing PdfSharp to not load all objects on opening a PDF

All times are UTC

Fixing PdfSharp to not load all objects on opening a PDF

Moderator: Stefan Lange

Gerben Vos

Post subject: Fixing PdfSharp to not load all objects on opening a PDF

Posted: Sun Nov 06, 2016 2:59 pm

Joined: Tue Aug 02, 2016 9:56 am
Posts: 40
Location: Amsterdam, The Netherlands

In our application, we use PdfSharp to open and read PDFs from all kinds of different sources. One major problem that has cropped up with many of these PDFs is that PdfSharp tries to read all objects in a PDF immediately when it opens one. We have found many PDFs that have objects in the xref table that don't actually exist, and the xref table entry points to the middle of some other object's data. Just opening these with PdfSharp gives an error. But Acrobat and other PDF viewers such as mupdf and GSview can open them without any problem.

Some example PDFs are: http://www.stillhq.com/pdfdb/000083/data.pdf and http://www.stillhq.com/pdfdb/000087/data.pdf .

As I already mentioned in another bug report, it is not clear to me why PdfSharp does this. PDF is designed to be easy to lazily load, so why not implement PdfSharp like that?
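To make the difference concrete, here is a minimal sketch of eager versus lazy object loading over a toy PDF-like byte string. It is written in Python rather than C#, and nothing in it is PDFsharp code: `build_xref`, `parse_object_at`, `load_eager` and `LazyDocument` are hypothetical names, and the "xref table" is just a dictionary from object number to byte offset. The bogus entry for object 3 imitates the situation described above, where an xref entry points into the middle of another object's data.

```python
import re

# Toy PDF-like body with two real objects (illustrative only, not a valid PDF).
DATA = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n"
)

OBJ_HEADER = re.compile(rb"(\d+) (\d+) obj\b")


def build_xref(data):
    """Map object number -> byte offset for every object header found."""
    return {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(data)}


def parse_object_at(data, number, offset):
    """Return the raw body of object `number`, or raise if the offset is wrong."""
    m = OBJ_HEADER.match(data, offset)
    if m is None or int(m.group(1)) != number:
        raise ValueError(f"object {number}: no object header at offset {offset}")
    end = data.index(b"endobj", m.end())
    return data[m.end():end].strip()


def load_eager(data, xref):
    """Parse every xref entry up front; one bad entry fails the whole open."""
    return {n: parse_object_at(data, n, off) for n, off in xref.items()}


class LazyDocument:
    """Parse an object only when it is first requested, then cache it."""

    def __init__(self, data, xref):
        self._data = data
        self._xref = xref
        self._cache = {}

    def get(self, number):
        if number not in self._cache:
            self._cache[number] = parse_object_at(
                self._data, number, self._xref[number]
            )
        return self._cache[number]


xref = build_xref(DATA)
xref[3] = 12  # bogus entry: points into the middle of object 1's dictionary
```

With this table, `load_eager(DATA, xref)` raises as soon as it reaches entry 3, while `LazyDocument(DATA, xref)` opens and serves objects 1 and 2; the error only appears if something actually asks for object 3.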

My questions here are:
1. Have the developers already fixed this in a newer version?
2. Do you know why it is implemented this way? Is there a technical reason why lazy load could not be implemented in PdfSharp? Which obstacles would you expect if we tried this?
3. If we were to fix or implement this (which could mean a lot of changes), would you apply our patches to a newly released version (if you think they are okay)?

_________________
Gerben Vos
Developer

Thomas Hoevel

Post subject: Re: Fixing PdfSharp to not load all objects on opening a PDF

Posted: Mon Nov 07, 2016 4:51 pm

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3110
Location: Cologne, Germany

Hi!

Gerben Vos wrote:

1. Have the developers already fixed this in a newer version?

It is a feature, not a bug.
PDFsharp was developed to deal with intact PDF files. And now we have problems reading corrupt PDF files.
It would take a major overhaul to make PDFsharp compatible with most corrupt PDF files.

Gerben Vos wrote:

2. Do you know why it is implemented this way? Is there a technical reason why lazy load could not be implemented in PdfSharp? Which obstacles would you expect if we tried this?

a) It was developed and tested with clean and intact PDF files.
b) I don't think so. PDFsharp followed a different approach.
c) Lazy loading will lead to lazy exceptions. Many new problems may occur.

Gerben Vos wrote:

3. If we were to fix or implement this (which could mean a lot of changes), would you apply our patches to a newly released version (if you think they are okay)?

The hurdle will be convincing Stefan that the changes are OK.
Programs using PDFsharp may require many changes that deal with lazy exceptions.
You're proposing a breaking change with benefits and risks.

_________________
Regards
Thomas Hoevel
PDFsharp Team

Gerben Vos

Post subject: Re: Fixing PdfSharp to not load all objects on opening a PDF

Posted: Mon Nov 07, 2016 5:14 pm

Thomas Hoevel wrote:

1. PDFsharp was developed to deal with intact PDF files. And now we have problems reading corrupt PDF files.

I am explicitly limiting this (at least, for now) to PDFs that Adobe Acrobat opens without complaint. (For most of these PDFs, the non-existing objects are also not referenced anywhere, so they really cannot cause any problem.) Therefore, many of our users, and even the writers of the software that created those PDFs, will see these PDFs as non-corrupt and it will be hard to explain to our users why our software cannot open them.
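The claim that unreferenced entries cannot cause any problem can be checked with a reachability walk from the document root. The following is a rough Python sketch under simplifying assumptions: object bodies are already available as byte strings, indirect references are found with a plain regular expression, and `BODIES`, `XREF_NUMBERS` and `reachable` are made-up names, not part of PDFsharp.

```python
import re

# Matches indirect references such as "2 0 R".
REF = re.compile(rb"(\d+) (\d+) R\b")

# Hypothetical object bodies, keyed by object number.
BODIES = {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: b"<< /Type /Page /Parent 2 0 R >>",
}

# The xref table lists four objects, but entry 4 has no real object behind it.
XREF_NUMBERS = {1, 2, 3, 4}


def reachable(root, bodies):
    """Return the set of object numbers reachable from `root` via references."""
    seen, stack = set(), [root]
    while stack:
        number = stack.pop()
        if number in seen:
            continue
        seen.add(number)
        body = bodies.get(number, b"")
        stack.extend(int(m.group(1)) for m in REF.finditer(body))
    return seen


live = reachable(1, BODIES)
orphans = XREF_NUMBERS - live  # xref entries nothing ever points to
```

Here `orphans` contains only object 4, so a reader that loads objects on demand starting from the catalog would never touch the broken entry.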

Thomas Hoevel wrote:

2c) Lazy loading will lead to lazy exceptions. Many new problems may occur.

If we decide to implement this, we will of course run it against our own test set and make sure that everything we encounter that is fixable within PDFsharp gets fixed. This should shake out the most important of these problems.

Thomas Hoevel wrote:

Programs using PDFsharp may require many changes that deal with lazy exceptions.
You're proposing a breaking change with benefits and risks.

Yes, this is why I thought it was wise to ask you first.

Indeed, this may cause problems for programs using PDFsharp. One possible idea is to add this as an option: by default, read all objects, but allow it to be turned off if you require it. Then it remains a matter of how many PDFsharp users actually need this (and how many maintenance problems it creates).
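As a rough illustration of the opt-in idea, the Python sketch below keeps "read everything on open" as the default and defers loading only when the caller asks for it. The `Document` class and its `read_all_objects` flag are invented for this example and are not an existing or proposed PDFsharp API; each object is represented by a zero-argument callable that either returns the object or raises.

```python
class Document:
    """Toy document whose objects are produced by per-object loader callables."""

    def __init__(self, loaders, read_all_objects=True):
        self._loaders = loaders  # object number -> zero-argument callable
        self._cache = {}
        if read_all_objects:
            # Default behaviour: every object is loaded now, so any broken
            # object makes opening the document fail immediately.
            for number in loaders:
                self.get(number)

    def get(self, number):
        # With read_all_objects=False, this is where a "lazy exception"
        # surfaces: on first access rather than on open.
        if number not in self._cache:
            self._cache[number] = self._loaders[number]()
        return self._cache[number]


def load_catalog():
    return "catalog"


def load_broken():
    raise ValueError("xref entry does not point at an object")


loaders = {1: load_catalog, 2: load_broken}
```

`Document(loaders)` fails on open exactly as before, so existing callers see no change. `Document(loaders, read_all_objects=False)` opens successfully, and only a later `get(2)` raises, which is the kind of deferred error that calling code would then have to handle.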

However, if properly implemented, I think this could greatly improve PDFsharp's quality.

_________________
Gerben Vos
Developer
