PDFsharp & MigraDoc Foundation
http://forum.pdfsharp.com/

Images not on a page, but in PDF document
http://forum.pdfsharp.com/viewtopic.php?f=2&t=4002

Author:  bonds007 [ Wed Aug 14, 2019 3:34 pm ]
Post subject:  Images not on a page, but in PDF document

Hi,

I've been using the provided example http://www.pdfsharp.net/wiki/ExportImages-sample.ashx to extract images from each page in a PDF. I've extended it to also support /FlateDecode (in the case where the colour space is RGB), and this is working fine (although I'd love to know how to handle CMYK, since loads of our PDFs use it).
But I have a few PDF documents where some or all of the images are not detected at all: no /XObject /Image items are found when processing the PDF page by page, even though the images are clearly there if you open the file in Adobe Reader. If I open the PDF in Notepad++ I can see the /XObject /Image items, so I know they are present in the PDF.

So I approached the problem in a different manner. I used the "Internals" class to access "GetAllObjects()" and read through each object without a care about which page they were on. Code snippet below:

Code:
        // Get a list of all indirect objects in the document, regardless of page
        PdfObject[] arrPDFObjects = objPDFDocument.Internals.GetAllObjects();
        if (arrPDFObjects != null) {
          Console.WriteLine("Number of objects: " + arrPDFObjects.Length);
          foreach (PdfObject objThisPDFObject in arrPDFObjects) {
            PdfReference objThisPdfObjectReference = objThisPDFObject.Reference;
            if (objThisPdfObjectReference != null) {
              // Only dictionaries (including streams) can be image XObjects
              PdfDictionary xObject = objThisPdfObjectReference.Value as PdfDictionary;
              // Is the object an image?
              if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image") {
                Console.WriteLine("Image found. Id = " + objThisPdfObjectReference.ObjectID);
                // Export the image (ExportImage comes from the wiki sample)
                ExportImage(xObject, ref valImageCount);
              }
            }
          }
        } else {
          Console.WriteLine("No objects");
        }

So while I can extract the images, I don't understand why they aren't found when I process the PDF page by page (using the sample on your Wiki). These images are clearly on the page since you can see them in Adobe Reader (i.e. they aren't orphaned objects).

So I guess my question is:
Is there some other method by which an image can be on a page which isn't detected by the provided example and how should I be detecting these images?

Author:  Thomas Hoevel [ Thu Aug 15, 2019 8:56 am ]
Post subject:  Re: Images not on a page, but in PDF document

bonds007 wrote:
Is there some other method by which an image can be on a page which isn't detected by the provided example and how should I be detecting these images?

Each page has a list of resources. The sample exports images listed as resources of that page.

IIRC you can also draw form XObjects on pages, and those XObjects can in turn contain images in their own resources.
In the case of such nested objects you have to search for images recursively.
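
The recursive search described above could look roughly like the following. This is a minimal sketch, not a tested implementation: it assumes the nested objects are form XObjects (/Subtype /Form) that carry their own /Resources dictionary, and the names NestedImageFinder and CollectImages are made up for illustration. It uses the same PDFsharp calls as the wiki sample (Elements.GetDictionary, Elements.GetString, PdfReference.Value) and only counts the images instead of exporting them.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

static class NestedImageFinder  // hypothetical name, not part of PDFsharp
{
    // Collects the image XObjects reachable from a /Resources dictionary,
    // descending into form XObjects, which may list further XObjects
    // in their own /Resources dictionary.
    static void CollectImages(PdfDictionary resources,
                              List<PdfDictionary> images,
                              HashSet<PdfDictionary> visited)
    {
        if (resources == null) return;
        PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
        if (xObjects == null) return;

        foreach (PdfItem item in xObjects.Elements.Values)
        {
            // Entries are normally indirect references, as in the wiki sample.
            PdfReference reference = item as PdfReference;
            PdfDictionary xObject = reference != null
                ? reference.Value as PdfDictionary
                : item as PdfDictionary;

            // Skip non-dictionaries and anything already seen
            // (guards against shared or cyclic form XObjects).
            if (xObject == null || !visited.Add(xObject)) continue;

            string subtype = xObject.Elements.GetString("/Subtype");
            if (subtype == "/Image")
                images.Add(xObject);
            else if (subtype == "/Form")
                CollectImages(xObject.Elements.GetDictionary("/Resources"), images, visited);
        }
    }

    static void Main(string[] args)
    {
        PdfDocument document = PdfReader.Open(args[0]);
        int pageNumber = 0;
        foreach (PdfPage page in document.Pages)
        {
            pageNumber++;
            var images = new List<PdfDictionary>();
            CollectImages(page.Elements.GetDictionary("/Resources"),
                          images, new HashSet<PdfDictionary>());
            Console.WriteLine("Page " + pageNumber + ": " + images.Count + " image(s)");
        }
    }
}
```

Each dictionary in the images list can be passed to the ExportImage method from the wiki sample. The visited set is created per page, so an image shared by several pages is reported once for each page that references it.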

Author:  bonds007 [ Thu Aug 15, 2019 6:31 pm ]
Post subject:  Re: Images not on a page, but in PDF document

Thanks for the information. I will try to discover how a PDF viewer "knows" that these images are on page 1 even though the image isn't listed as a resource of page 1.

Simon
