PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC


PostPosted: Mon Feb 10, 2020 5:15 pm 
Joined: Mon Feb 10, 2020 4:38 pm
Posts: 6
I am not sure if this is the intended behavior or a defect.

When a PDF being parsed has an entry in the xRef table beyond array[0] that is invalid (in my case, all zeros), the method ReadXRefTableAndTrailer in Parser.cs throws an exception for the invalid entry, which prevents my customers from processing their uploaded PDF.

Looking at the xRef table for the document, I see an object address of 0000000000 00000 n at array[11]. (I assume the PDF was corrupted when it was converted from a Word document.)

This would seem to be invalid if 0000000000 is reserved for the document header at array[0].

Parsing an address of 0000000000 at an index greater than 0 yields position 0, so the parser returns the header object, which then fails.

I changed the line marked with the comment "//skip start entry" (line 1081), which checks whether the first iterator is 0 (id == 0) and skips parsing the header object.

To that line I added || id > 0 && position == 0, i.e.:

if (id == 0 || (id > 0 && position == 0)) continue;

In this case I simply toss the invalid xRef entry and let the rest of the logic rebuild the xRef table.
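
The skip rule described above can be sketched outside PDFsharp. This is a minimal Python illustration, not PDFsharp's C# code: parse_xref_entries and its return layout are hypothetical names, it assumes the classic "offset generation n|f" entry format, and the sample table is a shortened one modeled on the table posted below.

```python
from typing import Dict, List, Tuple


def parse_xref_entries(first_id: int, lines: List[str]) -> Dict[int, Tuple[int, int]]:
    """Map object id -> (byte offset, generation) for usable in-use entries.

    Hypothetical helper for illustration only; not part of PDFsharp.
    """
    table: Dict[int, Tuple[int, int]] = {}
    for index, line in enumerate(lines):
        offset_text, generation_text, kind = line.split()
        position = int(offset_text)
        generation = int(generation_text)
        obj_id = first_id + index
        if kind != "n":
            continue  # free entries never point at an object
        if obj_id == 0 or position == 0:
            continue  # skip the start entry and any entry that points at offset 0
        table[obj_id] = (position, generation)
    return table


sample = [
    "0000000000 65535 f",
    "0000055206 00000 n",
    "0000008162 00000 n",
    "0000000000 00000 n",  # in-use entry pointing at offset 0: tossed
    "0000045431 00000 n",
]
entries = parse_xref_entries(0, sample)
print(sorted(entries))  # -> [1, 2, 4]
```

Object 3 is simply left out of the map, so a later reference to it would have to be treated as a reference to a missing object rather than as a parse failure.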

I have limited knowledge of the PDF standard, so I am not sure what the desired behavior is here: toss the invalid reference and rebuild the xRef table, or throw a fatal exception and notify the consumer.

It seems to me, however, that an object address of 0000000000 at an index greater than 0 points nowhere, so why not simply toss it?

Or is there a possible reason an xRef address of 0000000000 at an index greater than 0 could reference the document header?

iTextSharp handles this situation gracefully.

PDFsharp throws a fatal exception because of the issue described above.

So I am not sure if this is a defect, an oversight, or desired behavior.

Please advise.


I cannot attach the offending file because it contains confidential information, but below is its xRef table:

xref
0 39
0000000000 65535 f
0000055206 00000 n
0000008162 00000 n
0000045431 00000 n
0000000022 00000 n
0000008142 00000 n
0000008276 00000 n
0000008489 00000 n
0000030529 00000 n
0000045395 00000 n
0000030550 00000 n
0000042530 00000 n
0000000000 00000 n <---- This one. Why not just toss this in Parser.cs ReadXRefTableAndTrailer()?
0000050986 00000 n
0000000000 00000 n
0000045574 00000 n
0000042552 00000 n
0000042605 00000 n
0000042659 00000 n
0000045374 00000 n
0000045524 00000 n
0000046606 00000 n
0000045953 00000 n
0000046586 00000 n
0000046854 00000 n
0000050965 00000 n
0000051818 00000 n
0000051290 00000 n
0000051798 00000 n
0000052071 00000 n
0000054917 00000 n
0000054938 00000 n
0000054965 00000 n
0000055040 00000 n
0000055083 00000 n
0000055102 00000 n
0000055125 00000 n
0000055167 00000 n
0000055186 00000 n
trailer


PostPosted: Mon Feb 10, 2020 5:27 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany
PDFsharp was not designed to repair corrupt or non-standard PDF files.

Adobe Reader does a good job at fixing PDF files.
Does Adobe Reader prompt to save the file when you open it?
If not, try "Save as…" in Adobe Reader and check the XRef table again.

_________________
Regards
Thomas Hoevel
PDFsharp Team


PostPosted: Mon Feb 10, 2020 5:34 pm

Joined: Mon Feb 10, 2020 4:38 pm
Posts: 6
Adobe Reader opens the file just fine.

I did not try a "Save as" and re-open. I can check that and perhaps wrap up some logic that does it for cases like these in an attempt to recover.

I understand this is a case of "you cannot expect us to handle every possible corrupt file situation".

I am focused on my own particular situation and this one particular file. I have not researched what other corrupt xRef tables may look like.

But in this case it seems a no-brainer to me: if we have an address of 0000000000 at an array index of anything but [0], why bother? Just toss it and move on.

So that, I suppose, is really my question.

But thank you for the suggestion. I am going to try the "Save as" and then try parsing the file again with PDFsharp.

Jon


PostPosted: Mon Feb 10, 2020 5:41 pm

Joined: Mon Feb 10, 2020 4:38 pm
Posts: 6
Thomas,
indeed, a "Save as" rebuilds the xRef table and resolves this invalid entry.

So that is a possible solution for us if we can automate that process.

For the PDFsharp developers my question remains: if a position returns 0, why not just toss it?

Perhaps your answer is the answer: because you do not consider any logic to resolve corrupt files at all?


PostPosted: Wed Mar 25, 2020 7:58 am

Joined: Wed Mar 25, 2020 7:27 am
Posts: 3
This is a very common situation, not a "corrupt" file. It is a bug in the PDFsharp implementation.

If you remove an "obj" from the PDF file, you don't need to renumber the remaining ones; you can just set the xref entry of this "obj" (which no longer exists) to the 0000000000 offset.

Furthermore, the "obj"s can appear in any order; the xref table tells you the correct offset.

This is an example of a PDF generated by iText (a very, very common library) with objs in arbitrary order and a missing one (11). I did not find any part of the PDF specification that forbids this!

We need both fixed!


Code:
00000000: %PDF-1.6
00000009: %âãÏÓ
00000015: 3 0 obj <</Length 321/Filter/FlateDecode>>stream
00000064: [321 bytes of FlateDecode-compressed stream data, not reproducible as text]
00000386: endstream
00000396: endobj
00000403: 5 0 obj<</Type/FontDescriptor/StemV 162/FontName/Arial,Bold/ItalicAngle 0/Descent -210/Ascent 728/CapHeight 0/Flags 32/FontBBox[-45 -209 972 903]>>
00000551: endobj
00000558: 4 0 obj<</FontDescriptor 5 0 R/FirstChar 32/Type/Font/BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/LastChar 255/Widths[278 333 474 556 556 889 722 238 333 333 389 584 278 333 278 278 556 556 556 556 556 556 556 556 556 556 333 333 584 584 584 611 975 722 722 722 722 667 611 778 722 278 556 722 611 833 722 778 667 778 722 667 611 722 667 944 667 667 611 333 278 333 584 556 333 556 611 556 611 556 333 611 611 278 278 556 278 889 611 611 611 611 389 556 333 611 556 778 556 556 500 389 280 389 584 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 278 333 556 556 556 556 280 556 333 737 370 556 584 333 737 552 400 549 333 333 333 576 556 333 333 333 365 556 834 834 834 611 722 722 722 722 722 722 1000 722 667 667 667 667 278 278 278 278 722 722 778 778 778 778 778 584 778 722 722 722 722 667 667 611 556 556 556 556 556 556 889 556 556 556 556 556 278 278 278 278 611 611 611 611 611 611 611 549 611 611 611 611 611 556 611 556]/Subtype/TrueType>>
00001594: endobj
00001601: 7 0 obj<</Type/FontDescriptor/StemV 72/FontName/Arial/ItalicAngle 0/Descent -210/Ascent 728/CapHeight 0/Flags 32/FontBBox[-45 -209 979 896]>>
00001743: endobj
00001750: 6 0 obj<</FontDescriptor 7 0 R/FirstChar 32/Type/Font/BaseFont/Arial/Encoding/WinAnsiEncoding/LastChar 255/Widths[278 278 355 556 556 889 667 191 333 333 389 584 278 333 278 278 556 556 556 556 556 556 556 556 556 556 278 278 584 584 584 556 1015 667 667 722 722 667 611 778 722 278 500 667 556 833 722 778 667 778 722 667 611 722 667 944 667 667 611 278 278 278 469 556 333 556 556 500 556 556 278 556 556 222 222 500 222 833 556 556 556 556 333 500 278 556 500 722 500 500 500 334 260 334 584 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 750 278 333 556 556 556 556 260 556 333 737 370 556 584 333 737 552 400 549 333 333 333 576 537 333 333 333 365 556 834 834 834 611 667 667 667 667 667 667 1000 722 667 667 667 667 278 278 278 278 722 722 778 778 778 778 778 584 778 722 722 722 722 667 667 611 556 556 556 556 556 556 889 500 556 556 556 556 278 278 278 278 556 556 556 556 556 556 556 549 611 556 556 556 556 500 556 500]/Subtype/TrueType>>
00002782: endobj
00002789: 2 0 obj<</Type/Page/MediaBox[0 0 595.2 841.68]/Resources<</Font<</f0 4 0 R/f1 6 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/Parent 8 0 R/Contents[3 0 R]>>
00002948: endobj
00002955: 8 0 obj<</Type/Pages/Kids[2 0 R]/Count 1>>
00002998: endobj
00003005: 10 0 obj<</Parent 9 0 R/Dest[11 0 R/Fit]/Title(1014 - RO 90/8 Nahtlose Präzisionsstahlrohre)>>
00003100: endobj
00003107: 9 0 obj<</Last 10 0 R/Count 1/First 10 0 R>>
00003152: endobj
00003159: 12 0 obj<</Type/Catalog/PageLayout/OneColumn/Pages 8 0 R/Outlines 9 0 R/PageMode/UseOutlines>>
00003254: endobj
00003261: 13 0 obj<</CreationDate(D:20161202121338+01'00')/Producer(iTextSharp 4.0.3 \(based on iText 2.0.2\))/ModDate(D:20161202121338+01'00')>>
00003397: endobj
00003404: xref
00003409: 0 14
00003414: 0000000000 65535 f
00003434: 0000000000 65536 n
00003454: 0000002789 00000 n
00003474: 0000000015 00000 n
00003494: 0000000558 00000 n
00003514: 0000000403 00000 n
00003534: 0000001750 00000 n
00003554: 0000001601 00000 n
00003574: 0000002955 00000 n
00003594: 0000003107 00000 n
00003614: 0000003005 00000 n
00003634: 0000000000 65536 n
00003654: 0000003159 00000 n
00003674: 0000003261 00000 n
00003694: trailer
00003702: <</Size 14/Info 13 0 R/ID [<2bbfdb9d97cc23ae6ceba0258d16912a><2ccb1ceef5b01998b4ef8d61411506b1>]/Root 12 0 R>>
00003813: startxref
00003823: 3404
00003828: %%EOF
00003834: ÿ
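
To make the "any order" point concrete, the xref section of the dump above can be read as a lookup table from object number to byte offset. The following is a small Python sketch with the entries copied by hand from that dump; the variable names are only illustrative, and it is not how PDFsharp or iText store the table.

```python
# Entries copied from the xref section above, indexed by object number:
# (byte offset, generation number, entry type).
XREF = [
    (0, 65535, "f"),
    (0, 65536, "n"),
    (2789, 0, "n"),
    (15, 0, "n"),
    (558, 0, "n"),
    (403, 0, "n"),
    (1750, 0, "n"),
    (1601, 0, "n"),
    (2955, 0, "n"),
    (3107, 0, "n"),
    (3005, 0, "n"),
    (0, 65536, "n"),
    (3159, 0, "n"),
    (3261, 0, "n"),
]

# Object number -> offset, keeping only in-use entries that point somewhere.
offsets = {
    obj_id: offset
    for obj_id, (offset, _generation, kind) in enumerate(XREF)
    if kind == "n" and offset != 0
}

# Order in which the objects physically appear in the file.
physical_order = sorted(offsets, key=offsets.get)

# In-use entries with a zero offset, i.e. objects that are not in the file.
missing = [
    obj_id
    for obj_id, (offset, _generation, kind) in enumerate(XREF)
    if kind == "n" and offset == 0
]

print(physical_order)  # -> [3, 5, 4, 7, 6, 2, 8, 10, 9, 12, 13]
print(missing)         # -> [1, 11]
```

The physical order matches the "N 0 obj" lines in the dump, and the zero-offset entries turn out to be objects 1 and 11.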


PostPosted: Wed Mar 25, 2020 8:46 am

Joined: Wed Mar 25, 2020 7:27 am
Posts: 3
As you can read in the pull requests on GitHub, there is a fix from another user, which is integrated in https://github.com/FeichtnerDataGroup/PDFsharp


PostPosted: Wed Apr 01, 2020 4:42 pm

Joined: Mon Feb 10, 2020 4:38 pm
Posts: 6
Hi Jan,
My personal belief is that it is a defect. I applied my fix to a build we use internally to handle the issue. I don't see that PDFsharp has addressed the issue, as the current code in the master branch still seems to check only for an ID of 0 and not for both ID and array position.

Regardless thanks for the input!

jon

PostPosted: Wed Apr 01, 2020 4:43 pm

Joined: Mon Feb 10, 2020 4:38 pm
Posts: 6
I saw no pull requests that addressed this issue, but I also didn't look too hard, as I applied the fix to my own build, which lets me get my customers' PDFs to process.

Thanks for your input

Jon

PostPosted: Wed Apr 01, 2020 4:51 pm

Joined: Wed Mar 25, 2020 7:27 am
Posts: 3
My link points to a fork that contains the fix you need and many more. The pull request is in the original repo:

https://github.com/empira/PDFsharp/pull/39

But as written, the mentioned fork fixes more.


PostPosted: Wed Apr 01, 2020 5:12 pm

Joined: Mon Feb 10, 2020 4:38 pm
Posts: 6
In that pull request I see a commit that modifies the method with if (position == 0) continue;

I would think you would want if (id == 0 || (id > 0 && position == 0)) continue;

This makes the assumption that the ID 0 element will always be the header xRef. But perhaps I am wrong on that assumption?

Or wait, as I look at it now, it doesn't matter what ID it is: if we are parsing a zero position, we always continue?

If that's the case, then yes, I can see how this fix addresses the issue.
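
The two conditions can be compared directly. Below is a small Python sketch, not the PDFsharp source: skip_original is the corrected form of my condition, skip_pull_request is the condition as I read it in the pull request (not re-checked against the commit), and the sample IDs and positions are arbitrary.

```python
from itertools import product


def skip_original(obj_id: int, position: int) -> bool:
    # Condition from my own build: skip entry 0, and any later entry at position 0.
    return obj_id == 0 or (obj_id > 0 and position == 0)


def skip_pull_request(obj_id: int, position: int) -> bool:
    # Condition as described for the pull request: skip any entry at position 0.
    return position == 0


# Collect every (id, position) pair on which the two conditions disagree.
differences = [
    (obj_id, position)
    for obj_id, position in product(range(0, 4), (0, 15, 403))
    if skip_original(obj_id, position) != skip_pull_request(obj_id, position)
]
print(differences)  # -> [(0, 15), (0, 403)]
```

So the two checks only disagree when entry 0 carries a nonzero first field. As far as I understand the PDF specification, that field in a free entry is the number of the next free object, so it is 0 only when no other free objects are listed; whether that case ever reaches this line depends on how the parser treats "f" entries, which I have not checked.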

I am not debugging it and don't have my head in it right now; I am just going off what I recall from stepping through it.

Thanks for your input
