PDFsharp & MigraDoc Foundation
http://forum.pdfsharp.com/

Traversing the PdfDocument structure
http://forum.pdfsharp.com/viewtopic.php?f=2&t=771

Author:  kelthar [ Mon Jun 29, 2009 10:09 am ]
Post subject:  Traversing the PdfDocument structure

I've been sitting here all morning trying to traverse the structure of a PDF (to replace some text in it).

My objective is to have PDFs that are constructed in advance (call 'em templates) and that contain %identifier% placeholders in their text. I've been thinking along the lines of reading all text from the PDF and replacing every %identifier% I find with its corresponding value.

The problem is, I can't seem to find them. I start parsing from PdfDocument.Pages.Element.Values. There I find a PdfArray with a PdfReference that points to the PdfPage. So far so good. One would expect the elements containing the text to reside in the Content. However, I cannot find them there. I'm starting to wonder if there are different ways to save a PDF and if I have done it incorrectly (Save As from Word 2007).

Anyone got any pointers or suggestions?

(I don't iterate through the PdfPages directly because my recursive method takes a PdfItem[] as the parameter for the elements to inspect.)

Author:  peteratoce [ Mon Jun 29, 2009 10:58 am ]
Post subject:  Re: Traversing the PdfDocument structure

Hi,

Text in PDF is essentially a one-way proposition, especially when the font encoding is non-standard. So, the best I have achieved up till now (though not with the current version of PDFsharp) is to extract the sequence of words on a page, with the relative position and size of the respective BoundingBoxes. Then it is possible to cover a BB with a rectangle and write the new string over that.

As you can see, this approach is not well suited for flowing text, as the length of the rendered replacement must be identical to or a little less than the length of the original string.

But you were talking about templates, so I would choose another approach entirely:
Create your PDFs with the help of MigraDoc. As a block of text is a plain ASCII string before actual creation of the PDF, it should be trivial to do replacements in your program (e.g. "Hello %name%!" becomes "Hello Sally!") and then create the PDFs on-the-fly.
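
A minimal sketch of the replace-before-rendering idea, written in Python for brevity rather than as MigraDoc code. The names fill_template and PLACEHOLDER are hypothetical and not part of MigraDoc; the filled-in string would then be handed to whatever creates the PDF.

```python
import re

# Matches %identifier% placeholders: a word between two percent signs.
PLACEHOLDER = re.compile(r"%(\w+)%")

def fill_template(text, values):
    """Replace each %key% in text with values[key]; leave unknown keys untouched."""
    def substitute(match):
        key = match.group(1)
        # Fall back to the original placeholder when no value is supplied.
        return str(values.get(key, match.group(0)))
    return PLACEHOLDER.sub(substitute, text)

print(fill_template("Hello %name%!", {"name": "Sally"}))  # prints "Hello Sally!"
```

Leaving unknown placeholders in place (instead of raising an error) is a design choice of this sketch; it makes missing values easy to spot in the output.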

Peter

Author:  kelthar [ Mon Jun 29, 2009 1:00 pm ]
Post subject:  Re: Traversing the PdfDocument structure

Thank you for your quick reply. Well, yes, that is also an option. The thing is that I'd like to have a PDF as a template containing tags, so the end user can just create a document with %xxx% in it. However, I see that this isn't a good idea =\.

I'm going to try another library (which I've used before, but it's very bloated) and see if it has some support for this. I'll also check the streams and see if I can LZDecode them.
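
A rough sketch of what checking a stream could look like, in Python with only the standard library. It assumes the content stream uses the FlateDecode filter (zlib/deflate); a stream using LZWDecode would need a different decoder, which is not shown. The function name show_text_operands and the sample stream are made up for illustration.

```python
import re
import zlib

def show_text_operands(raw_stream):
    """Inflate a FlateDecode stream and return the literal strings shown with Tj."""
    content = zlib.decompress(raw_stream)
    # Literal strings look like "(text) Tj". This simple pattern ignores
    # escape sequences, nested parentheses, hex strings and TJ arrays.
    found = re.findall(rb"\(([^()\\]*)\)\s*Tj", content)
    return [chunk.decode("latin-1") for chunk in found]

# Hypothetical content stream in which the placeholder is split into two pieces.
sample = zlib.compress(b"BT /F1 12 Tf 72 700 Td (Hello %na) Tj (me%!) Tj ET")
print(show_text_operands(sample))  # prints ['Hello %na', 'me%!']
```

Even in this toy stream the placeholder does not appear as one string, and with a non-standard font encoding the bytes would not be readable text at all.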

Author:  Thomas Hoevel [ Mon Jun 29, 2009 2:05 pm ]
Post subject:  Re: Traversing the PdfDocument structure

kelthar wrote:
The problem is, I can't seem to find them.

Your chance of finding "%identifier%" will be greater if you use a fixed-pitch font for the text (e.g. Courier).

With proportional fonts, words will often be drawn in small parts (one to three letters) for best results with letter spacing and kerning pairs and whatever.
To find "%identifier%" you'll have to concatenate the text fragments that sit side by side on one line. Most tools emit letters from left to right, but you cannot even be sure of that.
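
A small sketch of that concatenation step, in Python. It assumes some extractor has already produced (x, y, text) fragments with page coordinates; the join_lines function, the tolerance value and the sample data are all hypothetical. Fragments are sorted by x instead of relying on the order in which they were emitted.

```python
from collections import defaultdict

def join_lines(fragments, y_tolerance=1.0):
    """Group (x, y, text) fragments by baseline and join each group left to right."""
    lines = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so fragments on roughly the same baseline group together.
        lines[round(y / y_tolerance)].append((x, text))
    # In PDF coordinates a larger y is higher on the page, so sort descending.
    return ["".join(text for _, text in sorted(parts))
            for _, parts in sorted(lines.items(), reverse=True)]

# Fragments in arbitrary order, with the placeholder split across three pieces.
fragments = [
    (118.0, 700.2, "me"),
    (72.0, 680.0, "welcome."),
    (72.0, 700.0, "Dear "),
    (130.0, 700.0, "%,"),
    (100.0, 700.0, "%na"),
]
print(join_lines(fragments))  # prints ['Dear %name%,', 'welcome.']
```

The sketch joins fragments with no separator, so it only works when spaces are part of the emitted text; it makes no attempt to infer word gaps from positions. Bucketing by rounded y is also a simplification that can split a line whose fragments straddle a bucket boundary.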

Telling letter spacing from word spacing is a category where many PDF2RTF (or PDF2DOC) converters fail.

All times are UTC