PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC


Posted: Sat Jan 22, 2022 4:19 am
I have a lot of PDFs and I'm looking for a way to identify what items are in them, such as the number of text blocks, forms, images, etc. A simple list of the elements would let us determine which PDFs to look at further. Is there a way to determine the actual elements in a PDF using PDFsharp? I'm using PowerShell to scan the documents. These are the methods we were trying but they do not appear to give much detail.

Code:
$input.Internals.Catalog.Elements
$input.Contents.Elements
$input.Info.Elements


Posted: Mon Jan 24, 2022 8:38 am
PDFsharp Expert
PDFsFun546 wrote:
These are the methods we were trying but they do not appear to give much detail.
Your question does not give many details either.
On Stack Overflow you mention Sitecore, but you do not provide any further details there:
https://stackoverflow.com/q/70790714/162529

Where does "$input" come from?

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Mon Feb 21, 2022 10:26 pm
We tried writing out the elements listed in the post above, but did not see a way to get at the actual elements.


$input comes from the following code.

Code:
$input = [PdfSharp.Pdf.IO.PdfReader]::Open($stream, [PdfSharp.Pdf.IO.PdfDocumentOpenMode]::ReadOnly)


Posted: Tue Feb 22, 2022 8:57 am
PDFsharp Expert
There are no text block objects. Text is part of the page contents.

Here is a C# sample that searches for images and exports JPEG images:
http://pdfsharp.net/wiki/ExportImages-sample.ashx

There are samples for extracting text on this forum and elsewhere on the Internet.
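
As a rough starting point for the inventory asked about in the first post, the approach of the ExportImages sample can be ported to PowerShell. The sketch below has not been tested against your files and rests on several assumptions: the DLL path and the folder are placeholders, "text" is approximated by counting the Tj and TJ text-showing operators in each page's content stream, "forms" is read as both form XObjects and top-level AcroForm fields, and inline images or images nested inside form XObjects are not counted. It uses $doc instead of $input because $input is a PowerShell automatic variable.

Code:
```powershell
# Assumption: both paths are placeholders; adjust them for your environment.
Add-Type -Path 'C:\libs\PdfSharp.dll'
$pdfFolder = 'C:\pdfs'

$report = foreach ($file in Get-ChildItem -Path $pdfFolder -Filter *.pdf) {
    # $doc, not $input: $input is reserved by PowerShell for pipeline input.
    $doc = [PdfSharp.Pdf.IO.PdfReader]::Open(
        $file.FullName, [PdfSharp.Pdf.IO.PdfDocumentOpenMode]::ReadOnly)

    $images = 0
    $formXObjects = 0
    $textOps = 0

    foreach ($page in $doc.Pages) {
        # Images and form XObjects are listed under /Resources -> /XObject.
        $resources = $page.Elements.GetDictionary('/Resources')
        if ($null -ne $resources) {
            $xObjects = $resources.Elements.GetDictionary('/XObject')
            if ($null -ne $xObjects) {
                foreach ($item in $xObjects.Elements.Values) {
                    if ($item -is [PdfSharp.Pdf.Advanced.PdfReference]) {
                        $xObject = $item.Value
                        if ($xObject -is [PdfSharp.Pdf.PdfDictionary]) {
                            $subtype = $xObject.Elements.GetString('/Subtype')
                            if ($subtype -eq '/Image') { $images++ }
                            elseif ($subtype -eq '/Form') { $formXObjects++ }
                        }
                    }
                }
            }
        }

        # There are no text block objects, so count text-showing operators
        # in the page content instead (a proxy, not a count of paragraphs).
        $content = [PdfSharp.Pdf.Content.ContentReader]::ReadContent($page)
        foreach ($op in $content) {
            if ($op -is [PdfSharp.Pdf.Content.Objects.COperator] -and
                $op.OpCode.Name -cin 'Tj', 'TJ') {
                $textOps++
            }
        }
    }

    # Top-level interactive form fields, if the document has an AcroForm.
    $acroForm = $doc.AcroForm
    $formFields = 0
    if ($null -ne $acroForm) { $formFields = $acroForm.Fields.Elements.Count }

    [pscustomobject]@{
        File         = $file.Name
        Pages        = $doc.PageCount
        Images       = $images
        FormXObjects = $formXObjects
        FormFields   = $formFields
        TextOps      = $textOps
    }

    $doc.Dispose()
}

$report | Format-Table -AutoSize
```

The table gives one row per PDF, which should be enough to decide which files deserve a closer look. Parsing the content stream can fail on unusual files, so it may be worth wrapping the ReadContent call in try/catch when scanning a large batch.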

I hope this helps to get you started.


