PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

Posted: Wed Jan 17, 2024 2:19 pm

Joined: Sun Apr 27, 2014 7:41 pm
Posts: 4
Hello,

I would like to parse and modify PDF content streams. What exactly gets modified does not matter; my code is generic.
Some content is in the page content stream, and some is inside nested /Form XObjects in the page's resource dictionary.
I am very familiar with the PDF specification and its operators, and I have almost-working code, listed below.
The only part I have trouble with is marked in the source code and summarized in the three questions below.
I have reduced the source to the essential part: replacing the streams of XObjects in the resource dictionary. It is nevertheless complete code.
With the provided sample file, which shows the problem, I get a blank page because the output file is missing the content stream of the form XObject /Xf1, which is object 2 0 R.

I am using PDFsharp version 1.50.5147 on .NET Framework 4.6.2 and 4.8.

My questions are marked in the source below, which is the complete code needed to reproduce the problem:
    1- How do I turn the modified content objects back into a byte array in the method ProcessResources?
    2- How do I attach that byte array back to the stream object? My code sets a new stream, but it is not written to the output file.
    3- How do I persist the modified stream in the resource dictionary, so that pdfdoc.Save saves the changes?

What am I doing wrong? Any help appreciated.
Thank you, and brilliant work on this library.
The processing speed is amazing, and the possibilities offered by the low-level functions are endless.

Sample PDF File 552kb : https://www.dropbox.com/scl/fi/d77iyy9x ... u1w9g&dl=0

To call the main function use Call ParseAndModifyContent("input.pdf", "output.pdf")

Code:
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.Advanced
Imports PdfSharp.Pdf.Content
Imports PdfSharp.Pdf.Content.Objects
Imports PdfSharp.Pdf.IO

    Private Sub ParseAndModifyContent(input As String, output As String)

        Using pdfdoc As PdfDocument = PdfReader.Open(input, PdfDocumentOpenMode.Modify)
            Dim page As PdfPage
            Dim pagecount As Integer = pdfdoc.PageCount
            'loop all pages
            For ipage As Integer = 0 To pagecount - 1
                page = pdfdoc.Pages(ipage)
                Dim contents As CSequence = ContentReader.ReadContent(page)
                Call ProcessContentObjects(contents)
                page.Contents.ReplaceContent(contents)

                Dim pdfres As PdfResources = page.Resources
                Call ProcessResources(pdfres)

            Next
            'save modified file
            pdfdoc.Save(output)
        End Using

    End Sub

    Private Sub ProcessContentObjects(contents As CSequence)
        'loop over all content objects
        For i As Integer = 0 To contents.Count - 1
            'only operators are inspected here; their operands are available via cOp.Operands
            Dim cOp As COperator = TryCast(contents(i), COperator)
            If cOp IsNot Nothing Then
                Debug.WriteLine(cOp.OpCode.Name & " - " & cOp.OpCode.Postscript & " - " & cOp.OpCode.Description)
                'do anything needed with this operator and its operands
                'this part works fine
            End If
        Next
    End Sub

    Private Sub ProcessResources(res As PdfDictionary)
        'check if XObjects exist
        If res IsNot Nothing AndAlso res.Elements.ContainsKey(PdfResources.Keys.XObject) Then
            Dim xObj As PdfDictionary = res.Elements.GetDictionary(PdfResources.Keys.XObject)
            If xObj IsNot Nothing Then
                Dim items As ICollection(Of PdfItem) = xObj.Elements.Values
                For Each item As PdfItem In items
                    If item.GetType Is GetType(PdfReference) Then
                        Dim ref As PdfReference = DirectCast(item, PdfReference)
                        Debug.WriteLine("ObjectNumber = " & ref.ObjectNumber) 'avoid processing endless recursions using this number if needed
                        Dim xObj2 As PdfDictionary = ref.Value
                        'check if Subtype is /Form
                        If xObj2.Elements.GetString("/Subtype") = "/Form" Then
                            'get content bytes of stream
                            Dim stream As PdfDictionary.PdfStream = xObj2.Stream
                            'check if a content stream exists
                            If stream IsNot Nothing Then
                                'get unfiltered/uncompressed bytes
                                Dim contentbytes() As Byte = stream.UnfilteredValue
                                Dim encoder As Internal.RawEncoding = New PdfSharp.Pdf.Internal.RawEncoding()
                                'get stream content as string to check visually
                                Dim content_string As String = encoder.GetString(contentbytes)
                                Debug.WriteLine(content_string)
                                'get content objects
                                Dim contents As CSequence = ContentReader.ReadContent(contentbytes)
                                'process content objects same as page contents
                                Call ProcessContentObjects(contents)
                                '--------PROBLEM STARTS HERE
                                '1-How do I turn the modified content object back into a byte array
                                '2-How do I attach that byte array back to the Stream object
                                '3-How do I persist the modified Stream in the Resource dictionary, so pdfDoc.Save will save changes ?
                                'testing with unmodified content, just writing same bytes back
                                Dim modifiedcontentbytes() As Byte = DirectCast(contentbytes.Clone(), Byte())
                                'write modified content back to stream and compress
                                '-------THIS PART FAILS-------Output PDF has no Stream, but no error in code
                                xObj2.Stream = Nothing
                                xObj2.Stream = xObj2.CreateStream(modifiedcontentbytes)
                                xObj2.Stream.Zip()

                            End If
                            'get nested resources if they exist
                            Dim res2 As PdfDictionary = xObj2.Elements.GetDictionary("/Resources")
                            'recursive call
                            If res2 IsNot Nothing Then Call ProcessResources(res2)

                        ElseIf xObj2.Elements.GetString("/Subtype") = "/Image" Then
                            'process anything for /Image
                        Else
                            'process anything for other Subtypes
                            Debug.WriteLine(xObj2.Elements.GetString("/Subtype"))
                        End If
                    End If
                Next
            End If
        End If
    End Sub
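
The comment in ProcessResources about using ObjectNumber to avoid endless recursion could be sketched roughly as follows. This is only an illustration, not tested against the sample file: the module name, VisitForms, and the visited set are hypothetical names introduced here, and the sketch assumes that remembering object numbers in a HashSet is enough to detect forms that were already processed.

```vbnet
Imports System.Collections.Generic
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.Advanced

Module RecursionGuardSketch
    ' Hypothetical helper: walks nested /Form XObjects and visits each
    ' indirect object at most once, tracked by its object number.
    Public Sub VisitForms(res As PdfDictionary, visited As HashSet(Of Integer))
        If res Is Nothing Then Return
        Dim xObjects As PdfDictionary = res.Elements.GetDictionary("/XObject")
        If xObjects Is Nothing Then Return

        For Each item As PdfItem In xObjects.Elements.Values
            Dim ref As PdfReference = TryCast(item, PdfReference)
            If ref Is Nothing Then Continue For
            ' HashSet.Add returns False when the number is already present,
            ' so a form that was seen before is skipped
            If Not visited.Add(ref.ObjectNumber) Then Continue For

            Dim form As PdfDictionary = TryCast(ref.Value, PdfDictionary)
            If form Is Nothing Then Continue For
            If form.Elements.GetName("/Subtype") = "/Form" Then
                ' process the form's stream here, then descend into its own resources
                VisitForms(form.Elements.GetDictionary("/Resources"), visited)
            End If
        Next
    End Sub
End Module
```

It would be called once per page, for example with VisitForms(page.Resources, New HashSet(Of Integer)), or with one shared set for the whole document if forms are reused across pages.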



Posted: Sat Jan 20, 2024 4:51 pm

Joined: Tue Sep 30, 2014 12:29 pm
Posts: 36
Short answer to all three questions: you have already done everything that is needed.

There is just one small detail missing:
When overwriting stream data, make sure you first remove all filters from the parent dictionary (the /Filter entry).
In the linked document, the stream already has a filter applied (with the value /FlateDecode).
When you attach a new stream with
Code:
xObj2.Stream = xObj2.CreateStream(modifiedcontentbytes)
the stream data is no longer Flate-encoded, although the filter entry still says it is.
The following call to
Code:
xObj2.Stream.Zip()
does nothing, because a filter is already specified in the parent dictionary.
The stream data is still saved, but it is not Flate-encoded.
A PDF reader trying to open the document may fail silently when it tries to decode the stream, resulting in an "empty" document.

This should work (untested):
Code:
'Remove existing filter
xObj2.Elements.Remove("/Filter")
' Set new value for stream (has the same effect as creating a new stream)
xObj2.Stream.Value = modifiedcontentbytes
' Zip it, this creates a new /Filter -entry in the parent-dictionary
xObj2.Stream.Zip()
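
The three steps above could be wrapped in a small helper so they are never applied in the wrong order. This is a minimal, untested sketch: ReplaceStreamBytes and the module name are hypothetical and not part of PDFsharp, and removing /DecodeParms as well is an extra precaution assumed here (old decode parameters would no longer match the re-encoded data), not something the sample file is known to need.

```vbnet
Imports PdfSharp.Pdf

Module StreamReplaceSketch
    ' Hypothetical helper (not a PDFsharp API): replaces the stream data of
    ' "dict" with unencoded bytes and lets PDFsharp Flate-compress them again.
    Public Sub ReplaceStreamBytes(dict As PdfDictionary, newBytes() As Byte)
        ' Drop the stale filter so the dictionary no longer claims an encoding
        dict.Elements.Remove("/Filter")
        ' Assumption: parameters belonging to the old filter are stale as well
        dict.Elements.Remove("/DecodeParms")

        If dict.Stream Is Nothing Then
            ' CreateStream attaches the new stream to the dictionary itself
            dict.CreateStream(newBytes)
        Else
            dict.Stream.Value = newBytes
        End If

        ' With no /Filter present, Zip() compresses the data and adds /Filter /FlateDecode
        dict.Stream.Zip()
    End Sub
End Module
```

In the code from the first post, the call would be ReplaceStreamBytes(xObj2, modifiedcontentbytes).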


Posted: Sat Jan 20, 2024 7:09 pm

Joined: Sun Apr 27, 2014 7:41 pm
Posts: 4
Your suggestion is a giant step in the right direction and very much appreciated, but one piece of the puzzle is still missing.
First the good news: the missing stream data and its filter are now created properly, and the output is a valid PDF with no errors.
So replacing the stream data in the resource dictionary with unmodified bytes was successful.
I would never have discovered that one needs to remove the existing /Filter first. Thank you very much for that guidance.

The last piece of the puzzle was left unsolved in my code: my sample wrote back the unmodified byte array, and that now works.

But how do I convert modified objects of type CSequence back into a byte array so that they can be written back to the stream?

Code:
Dim contents As CSequence = ContentReader.ReadContent(contentbytes) 'works
Call ProcessContentObjects(contents) 'works
'I cannot find a method to convert modified object of type CSequence back to a byte array.
Dim modifiedcontentbytes() As Byte = ContentWriter.WriteContentFromObjects(contents)   'looking for something like this


Where page content is directly accessible, the modification is made through the reference to the page's content object, and the page contents are then replaced with one additional line of code:

Code:
Dim contents As CSequence = ContentReader.ReadContent(page)
Call ProcessContentObjects(contents)
page.Contents.ReplaceContent(contents)

This is not possible for nested /Form objects, because there does not seem to be a corresponding method:

Code:
Dim pdfres As PdfResources = page.Resources  'works
Call ProcessResources(pdfres)  'works
page.Resources.ReplaceResources(pdfres)  'there is no such method


There has to be a simple way to convert an object of type CSequence to a byte array that can be written to the stream, or am I missing something?


Posted: Sat Jan 20, 2024 7:35 pm

Joined: Sun Apr 27, 2014 7:41 pm
Posts: 4
Solved it. It is a one-liner:

Code:
Dim modifiedcontentbytes() As Byte = contents.ToContent


It works brilliantly.
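
Putting the pieces of this thread together, the round trip for one form XObject might look like the sketch below. It is an untested outline under the assumptions discussed above: RewriteFormContent and the module name are hypothetical, and the actual modification of the content objects is left as a placeholder comment.

```vbnet
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.Content
Imports PdfSharp.Pdf.Content.Objects

Module FormRoundTripSketch
    ' Hypothetical helper: decodes a form XObject's stream, lets the caller
    ' modify the parsed content, and writes the result back compressed.
    Public Sub RewriteFormContent(form As PdfDictionary)
        If form.Stream Is Nothing Then Return

        ' Decode the stream and parse it into content objects
        Dim contents As CSequence = ContentReader.ReadContent(form.Stream.UnfilteredValue)

        ' ... modify "contents" here, e.g. with ProcessContentObjects(contents) ...

        ' Serialize the content objects back into unencoded bytes
        Dim modified() As Byte = contents.ToContent()

        ' Remove the old filter first, otherwise Zip() has no effect
        form.Elements.Remove("/Filter")
        form.Stream.Value = modified
        form.Stream.Zip()
    End Sub
End Module
```

Because the form dictionary is modified in place, nothing has to be written back to the resource dictionary; pdfdoc.Save picks up the changed stream.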

In case the development team reads this:
viewtopic.php?f=3&t=3468 describes the same issue of having to remove /Filter first, which is not intuitive to the average user.

Any of the following would make this great library even better:

1- Update the existing summary of the function CreateStream:

' Summary:
' Creates the stream of this dictionary and initializes it with the specified byte
' array. The function must not be called if the dictionary already has a stream.
' You may need to remove the existing /Filter from the parent element object first
Public Function CreateStream(value() As Byte) As PdfStream

2- Update the existing summary of the property Stream:

' Summary:
' Gets or sets the PDF stream belonging to this dictionary. Returns null if the
' dictionary has no stream. To create the stream, call the CreateStream function.
' You may need to remove the existing /Filter from the parent element object first
Public Property Stream As PdfStream

3- Implement a method .CreateStreamWithFilter, or allow an additional parameter, as in
Public Function CreateStream(value() As Byte, updateFilter As Boolean) As PdfStream
and let PDFsharp set or clear the existing /Filter on the parent stream object.

4-
' Summary:
' Compresses the stream with the FlateDecode filter. If a filter is already defined,
' the function has no effect.
Public Sub Zip()

Add an optional parameter or an overload, Public Sub Zip(resetFilter As Boolean), that deletes the existing filter, does the zipping, and sets its own proper filter.

Thank you

