PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC


PostPosted: Thu Jul 12, 2018 6:12 pm 
Joined: Mon Dec 21, 2015 7:49 pm
Posts: 13
I have two PDF documents, both generated via DOC1 Generate (versions 6.1.1373.0 and 6.6.6.60), that return incorrect data from an LZW stream when opened with PDFsharp. As I have no issues with LZW streams from other PDF producers, I suspect that the LZW compression they use may be buggy. However, Adobe Acrobat Reader DC opens the file with the correct data, so my software needs to be able to read it as well.

In one particular instance, a "www." is output as "ww.".

I'm still working on getting permission to post the document but in the meantime, any information might be helpful for me to solve this on my own.


PostPosted: Thu Jul 12, 2018 8:01 pm 
PDFsharp Expert
Joined: Sat Mar 14, 2015 10:15 am
Posts: 909
Location: CCAA
PDFsharp uses the SharpZipLib for LZW compression/decompression.

Maybe try your stream with the latest version of SharpZipLib.

If the issue no longer exists with the latest SharpZipLib then PDFsharp needs an update.

If the issue also exists with the latest version, then maybe the SharpZipLib team can help to fix it.
It could be easier to get permission to share just a single stream instead of the complete PDF file.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


PostPosted: Thu Jul 12, 2018 9:13 pm 
Joined: Mon Dec 21, 2015 7:49 pm
Posts: 13
Are you sure about that?
My research led me to LzwDecode.cs, which handles the LZW decoding.
Are there updates that are not yet reflected on GitHub?

I've attached a file containing the problem stream data.
Here is the code I used to decode the data.
Code:
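// Requires: using System; using System.IO; using PdfSharp.Pdf.Filters;
// bytes.txt holds the raw, still LZW-encoded stream bytes.
byte[] data = File.ReadAllBytes(@"bytes.txt");
Console.WriteLine(string.Concat("Length: ", data.Length));

// Decode with PDFsharp's LZW filter and save the result for inspection.
LzwDecode lzw = new LzwDecode();
string raw = lzw.DecodeToString(data);
File.WriteAllText(@"raw.txt", raw);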


Line 133 in the generated raw.txt file, "(ww.ocfl.net/PayUtilities/) Tj", should be "(http://www.ocfl.net/PayUtilities/) Tj".
This line is the only data that I can specifically point out as being incorrect. But I do know that there is incorrect data elsewhere, because when the page is edited and saved, Adobe reports an error and can't display the page correctly, and Chrome displays it but with errors.


Attachments:
File comment: lzw stream
bytes.zip [3.82 KiB]
PostPosted: Fri Jul 13, 2018 5:11 pm 
PDFsharp Expert
Joined: Sat Mar 14, 2015 10:15 am
Posts: 909
Location: CCAA
MJLaukala wrote:
Are you sure about that?
No. Most likely I was on the wrong track, I had FlateDecode on my mind.
I think FlateDecoder also is LZW or maybe LZH or something like that.

So there are two options: Debugging the code or comparing the code with the specifications in the Adobe Reference (if there are any).
Or maybe find another implementation that is open source and compatible with an MIT license.
I can't say when I will have time to look into it. David is not available to look into it either.

There is not much code, but still it can take some time to find out where it goes wrong.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


PostPosted: Thu Aug 02, 2018 9:18 pm 
Joined: Mon Dec 21, 2015 7:49 pm
Posts: 13
No worries. I'll look into it when I can. LZW seems pretty straightforward. There are specifications in the Adobe Reference and, from what I could tell, your code follows that spec. I think the issue I am running into is another case of an out-of-spec document that Adobe PDF Reader reads just fine. Adobe PDF Reader seems to be very robust when it comes to reading out-of-spec and even very badly broken documents. I think Adobe PDF Reader's ability to read these documents has made the waters of the PDF spec very, very muddy. If I come up with a suitable fix, I'll submit it on GitHub as a pull request separate from my standard "PDFSharp Fixes" pull request.
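
For anyone comparing implementations, here is a minimal sketch of an LZW decoder along the lines of the LZWDecode description in the PDF reference: codes are packed high-order bit first, start at 9 bits and grow up to 12 bits, 256 clears the table and 257 marks the end of data. This is not PDFsharp's LzwDecode.cs and not the patch mentioned elsewhere in this thread; the names LzwSketch, Decode and earlyChange are made up for illustration, and tolerating a stream that ends without an end-of-data code is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LzwSketch
{
    const int ClearTable = 256;
    const int EndOfData = 257;

    // earlyChange mirrors the filter's /EarlyChange parameter (default 1).
    public static byte[] Decode(byte[] input, int earlyChange = 1)
    {
        var output = new MemoryStream();
        var table = new List<byte[]>(4096);
        ResetTable(table);
        int codeLength = 9;
        int bitBuffer = 0, bitCount = 0, pos = 0;
        byte[] previous = null;

        while (true)
        {
            // Refill the bit buffer; codes are packed high-order bit first.
            while (bitCount < codeLength && pos < input.Length)
            {
                bitBuffer = (bitBuffer << 8) | input[pos++];
                bitCount += 8;
            }
            if (bitCount < codeLength) break; // data ended without an EOD code

            int code = (bitBuffer >> (bitCount - codeLength)) & ((1 << codeLength) - 1);
            bitCount -= codeLength;
            bitBuffer &= (1 << bitCount) - 1; // keep only the unread bits

            if (code == EndOfData) break;
            if (code == ClearTable)
            {
                ResetTable(table);
                codeLength = 9;
                previous = null;
                continue;
            }

            byte[] entry;
            if (code < table.Count)
            {
                entry = table[code];
            }
            else if (code == table.Count && previous != null)
            {
                // Code not in the table yet: previous string plus its own first byte.
                entry = Append(previous, previous[0]);
            }
            else
            {
                throw new InvalidDataException("Invalid LZW code " + code);
            }
            output.Write(entry, 0, entry.Length);

            // The decoder adds entries one step behind the encoder.
            if (previous != null && table.Count < 4096)
                table.Add(Append(previous, entry[0]));
            previous = entry;

            // With earlyChange = 1 the code length grows one entry early.
            if (codeLength < 12 && table.Count + earlyChange >= (1 << codeLength))
                codeLength++;
        }
        return output.ToArray();
    }

    static byte[] Append(byte[] prefix, byte b)
    {
        var result = new byte[prefix.Length + 1];
        Buffer.BlockCopy(prefix, 0, result, 0, prefix.Length);
        result[prefix.Length] = b;
        return result;
    }

    static void ResetTable(List<byte[]> table)
    {
        table.Clear();
        for (int i = 0; i < 256; i++) table.Add(new[] { (byte)i });
        table.Add(new byte[0]); // 256: clear-table marker (placeholder)
        table.Add(new byte[0]); // 257: end-of-data marker (placeholder)
    }
}
```

Calling LzwSketch.Decode on the nine bytes 80 0B 60 50 22 0C 0C 85 01 (hex) yields the ten bytes of "-----A---B". That input makes the decoder take the "code not in the table yet" branch, which runs of repeated characters produce. Decoding a problem stream with an independent decoder like this and diffing the result against raw.txt is one way to narrow down where two implementations diverge.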


PostPosted: Thu Aug 09, 2018 11:51 am 
Joined: Tue Aug 02, 2016 9:56 am
Posts: 40
Location: Amsterdam, The Netherlands
It looks like the same issue as here: https://forum.pdfsharp.net/viewtopic.php?f=3&t=3410. I found it quite hard to debug, so I translated an LZW decoder from C instead, but I didn't post the patch for reasons. I'll see if I can still find it.

_________________
Gerben Vos
Developer


PostPosted: Thu Aug 09, 2018 5:31 pm 
Joined: Tue Aug 02, 2016 9:56 am
Posts: 40
Location: Amsterdam, The Netherlands
I posted my patched code to the thread mentioned in my previous post.

_________________
Gerben Vos
Developer

