Hi folks,
I'm trying to split a PDF file at its bookmarks using PDFSharp. I can get the list of bookmarks, but I can't figure out which page number each bookmark points to.
Back story: one of the engineering software packages we use generates a single PDF file that actually consists of three separate documents. In its infinite wisdom, the enterprise software company behind it doesn't let you split these and save them as separate PDFs. We already post-process a couple of other quirks, so I have a small utility that engineers run their output files through, and I'd like to add the ability to split that combined PDF into separate documents.
The example PDF file I am working with has three top-level bookmarks, on pages 1, 5 and 6. I can see the bookmarks with the snippet below, but I couldn't find a way to map a bookmark to a page number.
Splitting the PDF seems to be fairly well documented; what I'm stuck on is how to map bookmarks to page numbers.
Test Code:
using (PdfDocument document = PdfReader.Open("test.pdf", PdfDocumentOpenMode.Import))
{
    PdfDictionary outline = document.Internals.Catalog.Elements.GetDictionary("/Outlines");
    Console.WriteLine("Page count: " + document.PageCount);
    foreach (var page in document.Pages)
    {
        // any hierarchy info on the page itself? doesn't seem to have any.
        Console.WriteLine(page.ToString());
    }
    for (PdfDictionary child = outline.Elements.GetDictionary("/First"); child != null; child = child.Elements.GetDictionary("/Next"))
    {
        Console.WriteLine(child.Elements.GetString("/Title"));
        // FIXME: get page numbers?
    }
}
Results in:
Page count: 9
<< /Contents [ 1019 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 3874 2667 ] /Parent 1 0 R /Resources 1018 0 R /Type /Page >>
<< /Contents [ 1022 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 3874 2667 ] /Parent 1 0 R /Resources 1021 0 R /Type /Page >>
<< /Contents [ 1025 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 3874 2667 ] /Parent 1 0 R /Resources 1024 0 R /Type /Page >>
<< /Contents [ 1028 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 3874 2667 ] /Parent 1 0 R /Resources 1027 0 R /Type /Page >>
<< /Contents [ 1032 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 842 595 ] /Parent 1 0 R /Resources 1031 0 R /Type /Page >>
<< /Annots [ 46 0 R 48 0 R 50 0 R 52 0 R 54 0 R 56 0 R 58 0 R 60 0 R 62 0 R 64 0 R 66 0 R 68 0 R 70 0 R 72 0 R 74 0 R ] /Contents [ 1043 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 1130 799 ] /Parent 1 0 R /Resources 1042 0 R /Type /Page >>
<< /Annots [ 82 0 R 84 0 R 86 0 R 88 0 R 90 0 R 92 0 R 94 0 R 96 0 R 98 0 R 100 0 R 102 0 R 104 0 R 106 0 R 108 0 R 110 0 R 112 0 R 114 0 R 116 0 R 118 0 R 120 0 R 122 0 R 124 0 R 126 0 R 128 0 R 130 0 R 132 0 R 134 0 R 136 0 R 138 0 R 140 0 R 142 0 R 144 0 R 146 0 R 148 0 R 150 0 R 152 0 R 154 0 R 156 0 R 158 0 R ] /Contents [ 1048 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 1130 799 ] /Parent 1 0 R /Resources 1047 0 R /Type /Page >>
<< /Annots [ 166 0 R 168 0 R 170 0 R 172 0 R 174 0 R 176 0 R 178 0 R 180 0 R 182 0 R ] /Contents [ 1053 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 1130 799 ] /Parent 1 0 R /Resources 1052 0 R /Type /Page >>
<< /Annots [ 190 0 R 192 0 R 194 0 R 196 0 R ] /Contents [ 1058 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 1130 799 ] /Parent 1 0 R /Resources 1057 0 R /Type /Page >>
Bookmark 1
Bookmark 2
Bookmark 3
Looking at the file manually, I know the three top-level bookmarks are on pages 1 (Bookmark 1), 5 (Bookmark 2) and 6 (Bookmark 3). How can I extract this information using PDFSharp?
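For reference, here is an untested sketch of the direction I would expect to work; it has not been run against this file. Per the PDF specification, an outline item points at its target either through a /Dest entry or through a /A GoTo action carrying a /D entry, and for an explicit destination the first array element is a reference to the page object. The idea is to compare that reference with each page's own reference. FindPageNumber and BookmarkPages are hypothetical names of mine, not part of PDFSharp, and the sketch assumes explicit array destinations only: named destinations (a name or string that has to be looked up in the catalog's /Dests dictionary or /Names tree) are not handled and simply yield -1.

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

static class BookmarkPages
{
    // Hypothetical helper: returns the 1-based page number an outline item
    // points at, or -1 if no explicit array destination could be resolved.
    static int FindPageNumber(PdfDocument document, PdfDictionary outlineItem)
    {
        // Destination stored directly on the outline item...
        PdfArray dest = outlineItem.Elements.GetArray("/Dest");
        if (dest == null)
        {
            // ...or inside a GoTo action under /A, in its /D entry.
            PdfDictionary action = outlineItem.Elements.GetDictionary("/A");
            if (action != null)
                dest = action.Elements.GetArray("/D");
        }
        if (dest == null || dest.Elements.Count == 0)
            return -1; // assumption: named destinations are not handled here

        // The first element of an explicit destination is the page reference.
        PdfReference target = dest.Elements[0] as PdfReference;
        if (target == null)
            return -1;

        // Find the page whose object ID matches the destination's target.
        for (int i = 0; i < document.PageCount; i++)
        {
            PdfReference pageRef = document.Pages[i].Reference;
            if (pageRef != null && pageRef.ObjectID.Equals(target.ObjectID))
                return i + 1;
        }
        return -1;
    }

    static void Main()
    {
        using (PdfDocument document = PdfReader.Open("test.pdf", PdfDocumentOpenMode.Import))
        {
            PdfDictionary outline = document.Internals.Catalog.Elements.GetDictionary("/Outlines");
            if (outline == null)
            {
                Console.WriteLine("No /Outlines entry in the catalog.");
                return;
            }
            // Walk the top-level outline items only.
            for (PdfDictionary child = outline.Elements.GetDictionary("/First");
                 child != null;
                 child = child.Elements.GetDictionary("/Next"))
            {
                string title = child.Elements.GetString("/Title");
                Console.WriteLine(title + " -> page " + FindPageNumber(document, child));
            }
        }
    }
}
```

If every bookmark comes back as -1, the file most likely uses named destinations, and the lookup would need an extra step to resolve the name first.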
Thanks for any pointers.