PDFsharp & MigraDoc Foundation
http://forum.pdfsharp.com/

Please some help to extract text from PDF page.
http://forum.pdfsharp.com/viewtopic.php?f=2&t=3988

Author:  Tassadar [ Mon Jun 24, 2019 9:12 pm ]
Post subject:  Please some help to extract text from PDF page.

Hi all,

I want to use PDFSharp for a project I'm developing. For now I only need some basic tasks:

1.- Read content of PDF page in txt/plain text
2.- Split pdf by pages.

I've got the second one working, but I'm unable to get the content of the PDF into a string (which should be the easiest thing).

I've spent hours reading possible solutions and trying all of them. I did this in the past using Adobe's own library from an Excel macro, but now I can't get it to work :(

I've tried this:

Code:
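// Open the document read-only and take the first page.
PdfDocument SamplePdf = PdfReader.Open(@"T:\Thisfile.pdf", PdfDocumentOpenMode.ReadOnly);

int NumPages = SamplePdf.Pages.Count;

PdfPage SamplePage = SamplePdf.Pages[0];

// Parse the page's content stream into content objects.
var content = ContentReader.ReadContent(SamplePage);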


But I don't know how to get the text into a string from the var content.

I've also tried many other ways, for example replacing the last line with

Code:
PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;


But then I can't get the text from the stream.

Can anyone lend me a hand? I think it must be just one or two more lines, but I can't find them.

Many thanks in advance

Author:  Tassadar [ Tue Jun 25, 2019 7:46 am ]
Post subject:  Re: Please some help to extract text from PDF page.

Hi everyone,

According to what I've read, PDFSharp can't extract plain text from a PDF file by itself and needs some additional code to turn the raw data into a readable string.

My problem now is I don't know how to implement this code correctly. This is one of the many examples I've found:

https://stackoverflow.com/questions/101 ... p/23667589

I've created the PdfSharpExtensions class and run this code:

Code:
PdfDocument SamplePdf = PdfReader.Open(@"T:\samplepdf.pdf", PdfDocumentOpenMode.ReadOnly);
PdfPage SamplePage = SamplePdf.Pages[0];
PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;
var content = ContentReader.ReadContent(SamplePage);
var text = PdfSharpExtensions.ExtractText(content);


And also this one:

Code:
PdfDocument SamplePdf = PdfReader.Open(@"T:\samplepdf.pdf", PdfDocumentOpenMode.ReadOnly);
PdfPage SamplePage = SamplePdf.Pages[0];
PdfDictionary.PdfStream stream = SamplePage.Contents.Elements.GetDictionary(0).Stream;
var text = PdfSharpExtensions.ExtractText(SamplePage);


In both cases, what I get instead of the actual PDF content is this:

MYPROYECT.PdfSharpExtensions+<ExtractText>d__1

I can't find a way to retrieve the content of the PDF file as a string.

Regards and many thanks in advance to anyone who can guide me a bit.

Author:  Thomas Hoevel [ Tue Jun 25, 2019 8:08 am ]
Post subject:  Re: Please some help to extract text from PDF page.

The format "var text" is cool for writers, but bad for readers.
What is "var" in this case?
It looks as if you call "ToString()" for a class that does not have a useful override for that method.

Author:  Tassadar [ Tue Jun 25, 2019 8:30 am ]
Post subject:  Re: Please some help to extract text from PDF page.

Thomas Hoevel wrote:
The format "var text" is cool for writers, but bad for readers.
What is "var" in this case?
It looks as if you call "ToString()" for a class that does not have a useful override for that method.


Many thanks for your answer, Thomas Hoevel,

In my own code I never use var; I like declaring variables with their actual type. The code I've used comes from the examples I've found.

var text turns out to be 'System.Collections.Generic.IEnumerable<string>', so, as you say, I have to do a Convert.ToString(text). Do you mean the problem could be in this conversion? To be honest, I don't understand exactly what you mean or how to fix it.

In the meantime, I've also tried this solution:

https://github.com/DavidS/PdfTextract/b ... tractor.cs

In this case the class in the sample has a GetText method that should take the PDF path and return the content. I've tried this:

Code:
string text = PdfTextExtractor.GetText(@"T:\samplepdf.pdf");


But it does not work, it returns a blank string (nothing).

If you could help me find where my problem is, I would really appreciate it. I've spent many hours trying every solution I've found and none of them works; perhaps someone with more knowledge of this can find the solution.

Regards and many thanks for your help

Author:  Thomas Hoevel [ Tue Jun 25, 2019 8:49 am ]
Post subject:  Re: Please some help to extract text from PDF page.

"IEnumerable<string>" means you have a list of strings, so you should use e.g. a foreach loop to get all the strings.
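
To illustrate the point, here is a minimal, self-contained sketch. It does not use PDFsharp at all: the class EnumerableToStringDemo and its ExtractText method are hypothetical stand-ins for any extractor written with "yield return". Calling Convert.ToString (or ToString) on the returned sequence only gives the name of the compiler-generated iterator type, which is the kind of output shown earlier in the thread. Joining or looping over the items gives the actual text.

```csharp
using System;
using System.Collections.Generic;

public static class EnumerableToStringDemo
{
    // Hypothetical stand-in for a text extractor implemented as an iterator.
    public static IEnumerable<string> ExtractText()
    {
        yield return "Hello";
        yield return " ";
        yield return "PDF";
    }

    // Wrong: returns the iterator's type name, not the text it produces.
    public static string Describe(IEnumerable<string> fragments)
    {
        return Convert.ToString(fragments);
    }

    // Right: enumerate the fragments and concatenate them.
    public static string JoinFragments(IEnumerable<string> fragments)
    {
        return string.Join("", fragments);
    }
}
```

With this sketch, EnumerableToStringDemo.JoinFragments(EnumerableToStringDemo.ExtractText()) returns "Hello PDF". A foreach loop that appends each item to a StringBuilder would do the same job.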

Author:  Tassadar [ Tue Jun 25, 2019 11:31 am ]
Post subject:  Re: Please some help to extract text from PDF page.

Thomas Hoevel wrote:
"IEnumerable<string>" means you have a list of strings, so you should use e.g. a foreach loop to get all the strings.


Many thanks for your reply, Thomas Hoevel.

I've used the foreach to get all the characters and it works with some PDF files. The problem is that it works only with simple PDF files; for most of them I get a blank string (it retrieves nothing). Of course, I've tried to get the content of those files using other libraries such as XpdfViewer or the Adobe Acrobat type library, and those do get the text.

I'm sure the problem is the way I convert from the PdfPage to the string, because I'm able to get pages and create a new pdf using PDFSharp (it allows me to split PDF files).

The truth is it looks really weird to me. I had the idea that getting the content of a PDF into a string was something very basic for PDFSharp, but I'm starting to think I'm wrong. Have you ever tried to do this using PDFSharp?

I've gone to the features webpage:
http://www.pdfsharp.net/PDFsharpFeatures.ashx

It does not say that it's useful for what I need. It seems to me that PDFSharp's purpose is creating your own PDFs, not reading PDFs created by other tools.

But on MigraDoc's page:
http://www.pdfsharp.net/MigraDocFeatures.ashx

And I see this:
Import data from various sources via XML files or direct interfaces (any data source that can be used with .NET)

I'm not sure if MigraDoc could do this.

I'm quite desperate, can you help me? :(

Regards

Author:  Thomas Hoevel [ Tue Jun 25, 2019 11:41 am ]
Post subject:  Re: Please some help to extract text from PDF page.

PDFsharp cannot read text from PDF pages, that's why you need to add code that does that.
MigraDoc cannot read text from PDF pages.

PDF files can be simple, PDF files can be complicated. Simple text extraction code will work with simple PDF files only.
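
For readers wondering what such "simple text extraction code" looks like, here is a hedged sketch modelled on the kind of extension method linked earlier in the thread. The class name ContentTextSketch is hypothetical. It assumes the PDFsharp package is referenced and that the content object types (COperator, CSequence, CString, OpCodeName) live in the PdfSharp.Pdf.Content.Objects namespace, as they do in the PDFsharp 1.5 sources as far as I know. It walks the parsed content stream and returns the string operands of the Tj and TJ text-showing operators.

```csharp
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

// Hypothetical helper, not part of PDFsharp itself.
public static class ContentTextSketch
{
    // Parse the page's content stream, then collect the raw strings.
    public static IEnumerable<string> ExtractText(PdfPage page)
    {
        CSequence content = ContentReader.ReadContent(page);
        return ExtractText(content);
    }

    public static IEnumerable<string> ExtractText(CObject obj)
    {
        if (obj is COperator op)
        {
            // Only the Tj and TJ text-showing operators are handled here.
            if (op.OpCode.OpCodeName == OpCodeName.Tj ||
                op.OpCode.OpCodeName == OpCodeName.TJ)
            {
                foreach (CObject operand in op.Operands)
                    foreach (string s in ExtractText(operand))
                        yield return s;
            }
        }
        else if (obj is CSequence seq)
        {
            // Covers the top-level sequence and the arrays used by TJ.
            foreach (CObject element in seq)
                foreach (string s in ExtractText(element))
                    yield return s;
        }
        else if (obj is CString str)
        {
            // Raw string as stored in the content stream, not decoded via the font.
            yield return str.Value;
        }
    }
}
```

The result would be combined with string.Join("", ContentTextSketch.ExtractText(page)). The sketch returns the strings exactly as they are stored in the content stream. When a font uses a custom encoding or glyph IDs (common with embedded subset fonts), those values are not readable text until they are mapped through the font's encoding or ToUnicode table, which this code does not do. It also ignores the ' and " text operators and all positioning, so word and line breaks are lost. That is one plausible reason why code like this returns nothing useful for more complicated files.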

Author:  Tassadar [ Tue Jun 25, 2019 11:53 am ]
Post subject:  Re: Please some help to extract text from PDF page.

Thomas Hoevel wrote:
PDFsharp cannot read text from PDF pages, that's why you need to add code that does that.
MigraDoc cannot read text from PDF pages.

PDF files can be simple, PDF files can be complicated. Simple text extraction code will work with simple PDF files only.


Again, many thanks for your answer, Thomas Hoevel. From what you say, it seems PDFSharp is not a good option for what I need. I have pdftotext, which works perfectly, but I thought I couldn't use it in my project (I think it's not open source as PDFSharp is):

https://en.wikipedia.org/wiki/Pdftotext

But now I'm not sure. I'm going to do some research, and if it's legal to use it in my program, I'll use it.

Many thanks again for your time

All times are UTC