PDFsharp & MigraDoc Foundation
http://forum.pdfsharp.com/

Elements of PDFs Using Powershell
http://forum.pdfsharp.com/viewtopic.php?f=2&t=4317
Page 1 of 1

Author:  PDFsFun546 [ Sat Jan 22, 2022 4:19 am ]
Post subject:  Elements of PDFs Using Powershell

I have a lot of PDFs and I'm looking for a way to identify what items are in them, such as the number of text blocks, forms, images, etc. We only need a list of the elements so we can decide which PDFs to look at further. Is there a way to determine the actual elements in a PDF using PDFsharp? I'm using PowerShell to scan the documents. These are the properties we tried, but they do not appear to give much detail.

Code:
$input.Internals.Catalog.Elements
$input.Contents.Elements
$input.Info.Elements

Author:  TH-Soft [ Mon Jan 24, 2022 8:38 am ]
Post subject:  Re: Elements of PDFs Using Powershell

PDFsFun546 wrote:
"These are the methods we were trying but they do not appear to give much detail."

Not many details in your question.
On SO you mention Sitecore, but do not provide other details there either.
https://stackoverflow.com/q/70790714/162529

Where does "$input" come from?

Author:  PDFsFun546 [ Mon Feb 21, 2022 10:26 pm ]
Post subject:  Re: Elements of PDFs Using Powershell

We tried writing out the elements shown in the post above and did not see a way to get the actual elements.

$input comes from the following code:

$input = [PdfSharp.Pdf.IO.PdfReader]::Open($stream, [PdfSharp.Pdf.IO.PdfDocumentOpenMode]::ReadOnly)

Author:  TH-Soft [ Tue Feb 22, 2022 8:57 am ]
Post subject:  Re: Elements of PDFs Using Powershell

There are no text block objects in a PDF. Text is part of the page contents, drawn by operators in each page's content stream.

Here is a C# sample that searches for images and exports JPEG images:
http://pdfsharp.net/wiki/ExportImages-sample.ashx
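
For PowerShell, the image search in that sample could be translated roughly as follows. This is only a sketch that counts images instead of exporting them. It assumes PdfSharp.dll is already loaded and that $stream is the same stream as in the post above. The names $doc and $imageCount are made up for this example; $doc is used instead of $input because $input is an automatic variable in PowerShell.

```powershell
# Sketch only: assumes PdfSharp.dll is already loaded (e.g. with Add-Type -Path)
# and that $stream is a readable stream containing the PDF.
$doc = [PdfSharp.Pdf.IO.PdfReader]::Open($stream, [PdfSharp.Pdf.IO.PdfDocumentOpenMode]::ReadOnly)

$imageCount = 0
foreach ($page in $doc.Pages) {
    # Images are referenced from the /XObject dictionary in the page resources.
    $resources = $page.Elements.GetDictionary('/Resources')
    if ($null -eq $resources) { continue }

    $xObjects = $resources.Elements.GetDictionary('/XObject')
    if ($null -eq $xObjects) { continue }

    foreach ($item in $xObjects.Elements.Values) {
        # Entries are normally indirect references; resolve them to dictionaries.
        $xObject = $null
        if ($item -is [PdfSharp.Pdf.Advanced.PdfReference]) {
            $xObject = $item.Value -as [PdfSharp.Pdf.PdfDictionary]
        }
        if ($null -ne $xObject -and $xObject.Elements.GetString('/Subtype') -eq '/Image') {
            $imageCount++
        }
    }
}

"Pages: $($doc.Pages.Count), image XObjects: $imageCount"
```

Note that this only looks at resources stored directly on each page. Resources inherited from a parent node, inline images, and images nested inside form XObjects are not counted.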

There are samples for extracting text on this forum and elsewhere on the Internet.

I hope this helps to get you started.

All times are UTC