PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC


Posted: Mon Jun 24, 2019 9:12 pm

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5
Hi all,

I want to use PDFsharp for a project I'm developing. For now I only need two basic tasks:

1.- Read the content of a PDF page as plain text
2.- Split a PDF by pages.

I've managed the second, but I'm unable to get the content of the PDF into a string (which should be the easiest part).

I've spent hours reading possible solutions and trying all of them. I did this in the past using Adobe's own library from an Excel macro, but now I can't get it to work :(

I've tried this:

Code:
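// Open the document read-only
PdfDocument SamplePdf = PdfReader.Open(@"T:\Thisfile.pdf", PdfDocumentOpenMode.ReadOnly);

int NumPages = SamplePdf.Pages.Count;

// Take the first page (the index is zero-based)
PdfPage SamplePage = SamplePdf.Pages[0];

// Parse the page's content stream into content objects
var content = ContentReader.ReadContent(SamplePage);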


But I don't know how to get the text from content into a string.

I've also tried many other ways, for example replacing the last line with

Code:
PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;


But then I can't get the text from the stream.

Can anyone lend me a hand? I think it must be just a line or two more, but I can't find it.

Many thanks in advance


Posted: Tue Jun 25, 2019 7:46 am

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5
Hi everyone,

According to what I've read, PDFsharp can't extract plain text from a PDF file by itself; it needs additional code to turn the page data into a readable string.

My problem now is that I don't know how to implement this code correctly. This is one of the many examples I've found:

https://stackoverflow.com/questions/101 ... p/23667589

I've created the PdfSharpExtensions class and run this code:

Code:
PdfDocument SamplePdf = PdfReader.Open(@"T:\samplepdf.pdf", PdfDocumentOpenMode.ReadOnly);
PdfPage SamplePage = SamplePdf.Pages[0];
PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;
var content = ContentReader.ReadContent(SamplePage);
var text = PdfSharpExtensions.ExtractText(content);


And also this one:

Code:
PdfDocument SamplePdf = PdfReader.Open(@"T:\samplepdf.pdf", PdfDocumentOpenMode.ReadOnly);
PdfPage SamplePage = SamplePdf.Pages[0];
PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;
var text = PdfSharpExtensions.ExtractText(SamplePage);


In both cases, instead of the actual PDF content I get this:

MYPROYECT.PdfSharpExtensions+<ExtractText>d__1

I can't find a way to retrieve the content of the PDF file as a string.

Regards and many thanks in advance to anyone who can guide me a bit.


Posted: Tue Jun 25, 2019 8:08 am
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3095
Location: Cologne, Germany
The format "var text" is cool for writers, but bad for readers.
What is "var" in this case?
It looks as if you call "ToString()" for a class that does not have a useful override for that method.

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Tue Jun 25, 2019 8:30 am

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5
Thomas Hoevel wrote:
The format "var text" is cool for writers, but bad for readers.
What is "var" in this case?
It looks as if you call "ToString()" for a class that does not have a useful override for that method.


Many thanks for your answer, Thomas Hoevel.

In my own code I never use var; I prefer declaring variables with their actual type. The code I used comes from the examples I found.

var text turns out to be 'System.Collections.Generic.IEnumerable<string>', so, as you say, I have to call Convert.ToString(text). Do you mean the problem could be in this conversion? To be honest, I don't understand exactly what you mean or how to fix it.

In the meantime, I've also tried this solution:

https://github.com/DavidS/PdfTextract/b ... tractor.cs

In this case the sample's class has a GetText method that should take the PDF path and return the content. I've tried this:

Code:
string text = PdfTextExtractor.GetText(@"T:\samplepdf.pdf");


But it does not work; it returns an empty string.

If you could help me find where my problem is, I would really appreciate it. I've spent many hours trying every solution I've found and none of them works; perhaps someone with more knowledge can find the solution.

Regards and many thanks for your help


Posted: Tue Jun 25, 2019 8:49 am
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3095
Location: Cologne, Germany
"IEnumerable<string>" means you have a list of strings, so you should use e.g. a foreach loop to get all the strings.

_________________
Regards
Thomas Hoevel
PDFsharp Team
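
For readers following along, here is a minimal sketch of the point made above. It does not use PDFsharp at all: FakeExtractText is a hypothetical stand-in for the ExtractText iterator from the linked example, assumed to yield text fragments one by one. It shows why Convert.ToString prints a compiler-generated type name, and how a foreach loop collects the actual text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for PdfSharpExtensions.ExtractText:
// an iterator method that yields text fragments.
static IEnumerable<string> FakeExtractText()
{
    yield return "Hello";
    yield return " ";
    yield return "World";
}

IEnumerable<string> fragments = FakeExtractText();

// The iterator object has no useful ToString() override, so this prints
// the name of a compiler-generated class, not the text.
Console.WriteLine(Convert.ToString(fragments));

// Enumerating the sequence is what actually produces the strings.
var builder = new StringBuilder();
foreach (string fragment in fragments)
{
    builder.Append(fragment);
}
string text = builder.ToString();
Console.WriteLine(text);
```

string.Concat(fragments) would give the same result in one call; the explicit loop is shown because it matches the foreach suggestion.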


Posted: Tue Jun 25, 2019 11:31 am

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5
Thomas Hoevel wrote:
"IEnumerable<string>" means you have a list of strings, so you should use e.g. a foreach loop to get all the strings.


Many thanks for your reply, Thomas Hoevel.

I've used the foreach to get all the strings and it works with some PDF files. The problem is that it only works with simple PDF files; for most of them I get an empty string. I've also tried getting the content of those files with other libraries, such as XpdfViewer or the Adobe Acrobat type library, and they do return the text.

I'm sure the problem is in the way I convert the PdfPage to a string, because I'm able to get pages and create a new PDF using PDFsharp (it lets me split PDF files).

To be honest, this looks really odd to me. I had the idea that getting the content of a PDF into a string was something very basic for PDFsharp, but I'm starting to think I'm wrong. Have you ever tried to do this using PDFsharp?

I've looked at the features page:
http://www.pdfsharp.net/PDFsharpFeatures.ashx

It doesn't say that PDFsharp is useful for what I need; it seems to me that its purpose is creating your own PDFs, not reading ones created by other tools.

But on MigraDoc's page:
http://www.pdfsharp.net/MigraDocFeatures.ashx

And I see this:
Import data from various sources via XML files or direct interfaces (any data source that can be used with .NET)

I'm not sure if MigraDoc could do this.

I'm quite desperate. Can you help me? :(

Regards


Posted: Tue Jun 25, 2019 11:41 am
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3095
Location: Cologne, Germany
PDFsharp cannot read text from PDF pages, that's why you need to add code that does that.
MigraDoc cannot read text from PDF pages.

PDF files can be simple, PDF files can be complicated. Simple text extraction code will work with simple PDF files only.
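
To illustrate the difference, here is a rough sketch that works on hand-written content stream snippets rather than on real files. ExtractLiteralTj is a hypothetical helper, not part of PDFsharp, and both sample strings are made up. It only looks for literal strings followed by the Tj operator, which is roughly what the simplest extraction code does; it ignores TJ arrays, escape sequences, and font encodings.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: collects literal strings "( ... )" shown with Tj.
static string ExtractLiteralTj(string contentStream)
{
    var result = new StringBuilder();
    foreach (Match m in Regex.Matches(contentStream, @"\((?<s>(?:\\.|[^\\()])*)\)\s*Tj"))
    {
        result.Append(m.Groups["s"].Value);
    }
    return result.ToString();
}

// A "simple" page: the text is stored as a readable literal string.
string simplePage = "BT /F1 12 Tf 72 720 Td (Hello World) Tj ET";

// A "complicated" page: the string is a hex sequence of font-specific codes
// that can only be decoded through the font's own character mapping.
string complicatedPage = "BT /F2 12 Tf 72 700 Td <002B0048004F004F0052> Tj ET";

string simpleText = ExtractLiteralTj(simplePage);
string complicatedText = ExtractLiteralTj(complicatedPage);

Console.WriteLine($"simple: '{simpleText}'");
Console.WriteLine($"complicated: '{complicatedText}'");
```

The second snippet yields an empty string with this approach, which would be consistent with the blank results reported earlier in the thread, although the actual cause in those files is not known.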

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Tue Jun 25, 2019 11:53 am

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5
Thomas Hoevel wrote:
PDFsharp cannot read text from PDF pages, that's why you need to add code that does that.
MigraDoc cannot read text from PDF pages.

PDF files can be simple, PDF files can be complicated. Simple text extraction code will work with simple PDF files only.


Again, many thanks for your answer, Thomas Hoevel. From what you say, it seems PDFsharp is not a good option for what I need. I have pdftotext, which works perfectly, but I thought I couldn't use it in my project (I think it's not open source the way PDFsharp is):

https://en.wikipedia.org/wiki/Pdftotext

But now I'm not sure. I'm going to do some research, and if it's legal to use it in my program, I'll use it.
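
If pdftotext turns out to be usable, one possible way to call it from C# is to start it as an external process and read its standard output. This is only a sketch: it assumes a pdftotext executable is on the PATH, BuildArguments and RunPdfToText are hypothetical helper names, and the licence of the particular pdftotext build still has to be checked separately.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Builds the command line: -f/-l select the first and last page to convert,
// -enc sets the output encoding, and "-" as output file sends text to stdout.
static string BuildArguments(string pdfPath, int page)
{
    return $"-enc UTF-8 -f {page} -l {page} \"{pdfPath}\" -";
}

// Runs pdftotext (assumed to be on the PATH) and returns the page text.
static string RunPdfToText(string pdfPath, int page)
{
    var startInfo = new ProcessStartInfo
    {
        FileName = "pdftotext",
        Arguments = BuildArguments(pdfPath, page),
        RedirectStandardOutput = true,
        StandardOutputEncoding = Encoding.UTF8,
        UseShellExecute = false,
        CreateNoWindow = true
    };

    using (Process process = Process.Start(startInfo))
    {
        string output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        if (process.ExitCode != 0)
        {
            throw new InvalidOperationException("pdftotext exited with code " + process.ExitCode);
        }
        return output;
    }
}

// Show the command line that would be used for page 1 of the sample file.
string arguments = BuildArguments(@"T:\samplepdf.pdf", 1);
Console.WriteLine("pdftotext " + arguments);

// Uncomment once pdftotext is installed:
// Console.WriteLine(RunPdfToText(@"T:\samplepdf.pdf", 1));
```

PDFsharp could still be used for the splitting part, with the external tool handling only the text extraction.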

Many thanks again for your time

