PDFsharp & MigraDoc Foundation • View topic - Please some help to extract text from PDF page.

All times are UTC

Please some help to extract text from PDF page.

Moderator: Stefan Lange

Tassadar

Post subject: Please some help to extract text from PDF page.

Posted: Mon Jun 24, 2019 9:12 pm

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5

Hi all,

I want to use PDFsharp for a project I'm developing. For now I only need two basic tasks:

1. Read the content of a PDF page as plain text.
2. Split a PDF by pages.

I've got the second one working, but I'm unable to get the content of the PDF into a string (which I expected to be the easiest part).

I've spent hours reading possible solutions and trying them all. I did this in the past from an Excel macro using Adobe's own library, but now I can't get it to work.

I've tried this:

Code:

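// Open the document read-only and take the first page.
PdfDocument SamplePdf = PdfReader.Open(@"T:\Thisfile.pdf", PdfDocumentOpenMode.ReadOnly);
int NumPages = SamplePdf.Pages.Count;
PdfPage SamplePage = SamplePdf.Pages[0];
// ReadContent returns the parsed content stream (operators and operands), not plain text.
var content = ContentReader.ReadContent(SamplePage);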

But I don't know how to get the text from the content variable into a string.

I've also tried many other approaches, for example replacing the last line with:

Code:

PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;

But then I can't get the text from the stream.

Can anyone lend me a hand? I think it must take just one or two more lines, but I can't find them.

Many thanks in advance

Tassadar

Post subject: Re: Please some help to extract text from PDF page.

Posted: Tue Jun 25, 2019 7:46 am

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5

Hi everyone,

According to what I've read, PDFsharp can't extract plain text from a PDF file by itself; it needs some additional code to turn the raw content data into a readable string.

My problem now is that I don't know how to implement this code correctly. This is one of the many examples I've found:

https://stackoverflow.com/questions/101 ... p/23667589

I've created the PdfSharpExtensions class and run this code:

Code:

PdfDocument SamplePdf = PdfReader.Open(@"T:\samplepdf.pdf", PdfDocumentOpenMode.ReadOnly);
PdfPage SamplePage = SamplePdf.Pages[0];
PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;
var content = ContentReader.ReadContent(SamplePage);
var text = PdfSharpExtensions.ExtractText(content);

And also this one:

Code:

PdfDocument SamplePdf = PdfReader.Open(@"T:\samplepdf.pdf", PdfDocumentOpenMode.ReadOnly);
PdfPage SamplePage = SamplePdf.Pages[0];
PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;
var text = PdfSharpExtensions.ExtractText(SamplePage);

In both cases, instead of the actual PDF content, I get this:

MYPROYECT.PdfSharpExtensions+<ExtractText>d__1

I can't find a way to retrieve the content of the PDF file as a string.

Regards and many thanks in advance to anyone who can guide me a bit.

Thomas Hoevel

Post subject: Re: Please some help to extract text from PDF page.

Posted: Tue Jun 25, 2019 8:08 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

The format "var text" is cool for writers, but bad for readers.
What is "var" in this case?
It looks as if you call "ToString()" for a class that does not have a useful override for that method.

_________________
Regards
Thomas Hoevel
PDFsharp Team

Tassadar

Post subject: Re: Please some help to extract text from PDF page.

Posted: Tue Jun 25, 2019 8:30 am

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5

Thomas Hoevel wrote:

The format "var text" is cool for writers, but bad for readers.
What is "var" in this case?
It looks as if you call "ToString()" for a class that does not have a useful override for that method.

Many thanks for your answer, Thomas Hoevel,

In my own code I never use var; I prefer declaring variables with their actual type, but the code I used comes from the examples I found.

var text turns out to be 'System.Collections.Generic.IEnumerable<string>', so, as you say, I have to call Convert.ToString(text). Do you mean the problem could be in this conversion? To be honest, I don't understand exactly what you mean or how to fix it.

In the meantime, I've also tried this solution:

https://github.com/DavidS/PdfTextract/b ... tractor.cs

In this case the sample's class has a GetText method that should take the PDF path and return the content. I've tried this:

Code:

string text = PdfTextExtractor.GetText(@"T:\samplepdf.pdf");

But it does not work; it returns an empty string.

If you could help me find where my problem is, I would really appreciate it. I've spent many hours trying every solution I've found and none of them works; perhaps someone with more knowledge of this can find the solution.

Regards and many thanks for your help

Thomas Hoevel

Post subject: Re: Please some help to extract text from PDF page.

Posted: Tue Jun 25, 2019 8:49 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

"IEnumerable<string>" means you have a list of strings, so you should use e.g. a foreach loop to get all the strings.

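To make this concrete, here is a minimal sketch of what is probably happening. A method written with "yield return" gives back a lazy sequence object, and converting that object to a string prints the name of its compiler-generated class (something like "...+<ExtractText>d__1") instead of its items. FakeExtractText below is a hypothetical stand-in for the ExtractText method from the linked example, so the snippet runs without PDFsharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Program
{
    // Hypothetical stand-in for an iterator-based ExtractText method.
    static IEnumerable<string> FakeExtractText()
    {
        yield return "Hello";
        yield return "World";
    }

    public static void Main()
    {
        IEnumerable<string> parts = FakeExtractText();

        // Prints the compiler-generated iterator type name, not the text.
        Console.WriteLine(Convert.ToString(parts));

        // Option 1: walk the sequence with foreach and collect the pieces.
        var builder = new StringBuilder();
        foreach (string part in parts)
        {
            builder.Append(part).Append(' ');
        }
        Console.WriteLine(builder.ToString().TrimEnd());

        // Option 2: let string.Join enumerate the sequence.
        string text = string.Join(" ", parts);
        Console.WriteLine(text);
    }
}
```

Both options print "Hello World"; which separator to use (space, newline, or none) depends on how the extractor splits the text.
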
_________________
Regards
Thomas Hoevel
PDFsharp Team

Tassadar

Post subject: Re: Please some help to extract text from PDF page.

Posted: Tue Jun 25, 2019 11:31 am

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5

Thomas Hoevel wrote:

"IEnumerable<string>" means you have a list of strings, so you should use e.g. a foreach loop to get all the strings.

Many thanks for your reply, Thomas Hoevel.

I've used a foreach to get all the strings and it works with some PDF files. The problem is that it only works with simple PDF files; for most of them I get an empty string (it retrieves nothing). Of course, I've tried extracting the content of those files with other libraries, such as XpdfViewer or the Adobe Acrobat type library, and they do get the text.

I'm sure the problem is in how I convert the PdfPage to a string, because I am able to get pages and create a new PDF using PDFsharp (it lets me split PDF files).

The truth is that this looks really weird to me. I assumed that getting the content of a PDF into a string was something very basic for PDFsharp, but I'm starting to think I'm wrong. Have you ever tried to do this using PDFsharp?

I went to the features web page:
http://www.pdfsharp.net/PDFsharpFeatures.ashx

It does not say that it's useful for what I need; it seems to me that the purpose of PDFsharp is creating your own PDFs, not reading PDFs created by other tools.

But then I looked at MigraDoc's page:
http://www.pdfsharp.net/MigraDocFeatures.ashx

And I see this:
Import data from various sources via XML files or direct interfaces (any data source that can be used with .NET)

I'm not sure if MigraDoc could do this.

I'm quite desperate. Can you help me?

Regards

Thomas Hoevel

Post subject: Re: Please some help to extract text from PDF page.

Posted: Tue Jun 25, 2019 11:41 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

PDFsharp cannot read text from PDF pages, that's why you need to add code that does that.
MigraDoc cannot read text from PDF pages.

PDF files can be simple, PDF files can be complicated. Simple text extraction code will work with simple PDF files only.
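
For illustration, "simple text extraction code" usually looks something like the sketch below, written along the lines of the Stack Overflow example linked earlier in this thread. It walks the parsed content stream of a page and collects the string operands of the Tj and TJ text-showing operators. The class name SimpleTextExtractor is hypothetical, and the sketch assumes the PDFsharp 1.5x namespaces shown in the using lines.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

public static class SimpleTextExtractor
{
    // Returns the raw string operands found on one page.
    public static IEnumerable<string> ExtractText(PdfPage page)
    {
        CSequence content = ContentReader.ReadContent(page);
        return ExtractText(content);
    }

    private static IEnumerable<string> ExtractText(CObject obj)
    {
        if (obj is COperator op)
        {
            // Only the Tj and TJ operators are handled here.
            if (op.OpCode.Name == OpCodeName.Tj.ToString() ||
                op.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (CObject operand in op.Operands)
                    foreach (string s in ExtractText(operand))
                        yield return s;
            }
        }
        else if (obj is CSequence sequence)
        {
            // Covers the page content itself and the arrays passed to TJ.
            foreach (CObject element in sequence)
                foreach (string s in ExtractText(element))
                    yield return s;
        }
        else if (obj is CString str)
        {
            yield return str.Value;
        }
    }

    public static void Main(string[] args)
    {
        // args[0] is the path to a PDF file.
        PdfDocument doc = PdfReader.Open(args[0], PdfDocumentOpenMode.ReadOnly);
        Console.WriteLine(string.Join("", ExtractText(doc.Pages[0])));
    }
}
```

A sketch like this returns the strings exactly as they are stored in the content stream. It does not decode font-specific encodings, does not reconstruct spaces or line breaks, and does not look inside form XObjects, which are plausible reasons why it yields nothing readable for more complicated files.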

_________________
Regards
Thomas Hoevel
PDFsharp Team

Tassadar

Post subject: Re: Please some help to extract text from PDF page.

Posted: Tue Jun 25, 2019 11:53 am

Joined: Mon Jun 24, 2019 8:58 pm
Posts: 5

Again, many thanks for your answer, Thomas Hoevel. From what you say, it seems PDFsharp is not a good option for what I need. I have pdftotext, which works perfectly, but I thought I couldn't use it in my project (I think it's not open source the way PDFsharp is):

https://en.wikipedia.org/wiki/Pdftotext

But now I'm not sure. I'm going to do some research; if it's legal to use it in my program, I'll use it.
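
If pdftotext turns out to be usable, one possible way to call it from C# is to start it as an external process and capture its standard output. This is only a sketch under some assumptions: the pdftotext executable is on the PATH, "-" as the output file name makes it write to standard output, and the class and method names (PdfToTextRunner, RunPdfToText) are hypothetical. The licensing question still has to be checked separately.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class PdfToTextRunner
{
    // Runs pdftotext on the given file and returns the extracted text.
    public static string RunPdfToText(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",
            // -layout keeps the physical layout; "-" sends the text to stdout.
            Arguments = "-enc UTF-8 -layout \"" + pdfPath + "\" -",
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read everything before waiting, so a full pipe cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext failed with exit code " + process.ExitCode);
            }
            return output;
        }
    }

    public static void Main(string[] args)
    {
        // args[0] is the path to a PDF file.
        Console.WriteLine(RunPdfToText(args[0]));
    }
}
```

PDFsharp can still be used for the splitting task, with the external tool handling only the text extraction.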

Many thanks again for your time
