Failure to retrieve text of PDF documents

Moderator: Stefan Lange

rdunnill

Post subject: Failure to retrieve text of PDF documents

Posted: Fri May 17, 2019 6:17 am

Joined: Fri May 17, 2019 6:00 am
Posts: 2

We use PDFSharp to process PDFs created with a Crystal Reports report. Our processing involves opening a PDF document with PDFSharp and using it to extract the text from each page of the document, then searching that text for specific markers inserted during creation as delineators. This process works fine with PDFs created with Crystal Reports versions prior to SP23; however, with documents created with SP23, the read returns garbled text and hence the delineators cannot be found.

Chrome, Firefox and Beyond Compare can read these new documents without issue. What can be done to fix this problem so that we can continue to use PDFSharp for our processing?

rdunnill

Post subject: ContentReader.ReadContent() returns garbled text

Posted: Fri May 17, 2019 3:58 pm

Joined: Fri May 17, 2019 6:00 am
Posts: 2

We use PDFSharp to process PDFs created with a Crystal Reports report. Our processing involves opening a PDF document with PDFSharp and using PDFSharp (ContentReader.ReadContent()) to extract the text from each page of the document, then searching that text for specific markers inserted during creation as delineators. This process works fine with PDFs created with Crystal Reports versions prior to SP23; however, with documents created with SP23, the read returns garbled text and hence the delineators cannot be found.

Chrome, Firefox and Beyond Compare can read these new documents without issue. What can be done to fix this problem so that we can continue to use PDFSharp for our processing?

rjdunnill

Post subject: Re: Failure to retrieve text of PDF documents

Posted: Tue May 21, 2019 11:18 pm

Joined: Wed May 15, 2019 8:30 pm
Posts: 3

On further analysis, this seems to be happening because the operands of the document's Tj operators consist of indexes rather than ASCII characters.

Is there a setting or parameter that tells PdfSharp to interpret the document text as such?

Thomas Hoevel

Post subject: Re: Failure to retrieve text of PDF documents

Posted: Wed May 22, 2019 8:59 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3097
Location: Cologne, Germany

rjdunnill wrote:

Is there a setting or parameter that tells PdfSharp to interpret the document text as such?

Since PDFsharp does not render PDF it has only limited support for analyzing the instructions that draw the page.
Not my area of expertise, but I'm afraid you'll have to write code to decode the Tj parameters.
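
To illustrate what such decoding could look like, here is a minimal Python sketch (not PDFsharp code). It assumes the Tj operands are hex strings made of fixed-width two-byte codes and that a code-to-character table has already been built, for example from the font's ToUnicode CMap. The function name `decode_tj_hex` and the sample table are hypothetical.

```python
def decode_tj_hex(hex_operand, to_unicode, code_bytes=2):
    """Decode a hex-string Tj operand such as '<00030011>' into text.

    to_unicode maps integer character codes to Unicode strings.
    code_bytes is the assumed fixed width of one code (2 is common
    for Identity-H fonts, but this is an assumption, not a rule).
    """
    # Drop the angle brackets and any whitespace inside the hex string.
    digits = "".join(hex_operand.strip().strip("<>").split())
    if len(digits) % 2:
        digits += "0"  # PDF pads an odd-length hex string with a trailing 0
    step = code_bytes * 2
    chars = []
    for i in range(0, len(digits), step):
        code = int(digits[i:i + step], 16)
        # Unknown codes become U+FFFD instead of raising an error.
        chars.append(to_unicode.get(code, "\ufffd"))
    return "".join(chars)


# Hypothetical table: in a real document it comes from the font's ToUnicode CMap.
sample_table = {0x0003: "T", 0x0011: "A", 0x0024: "G"}
print(decode_tj_hex("<000300110024>", sample_table))  # prints TAG
```

A real implementation would also need to pick the table belonging to the font selected by the preceding Tf operator, since each font can carry its own mapping.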

_________________
Regards
Thomas Hoevel
PDFsharp Team

rjdunnill

Post subject: Re: Failure to retrieve text of PDF documents

Posted: Wed May 22, 2019 5:15 pm

Joined: Wed May 15, 2019 8:30 pm
Posts: 3

Our algorithm opens the PDF document, and reads the content of each page, searching said content for a particular tag. Our problem is that the new-format documents are Unicode, and hence the read content consists of indexes and not text. Shouldn't PDFSharp be internally converting the indexes to their respective characters?
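
A small Python sketch of the symptom, assuming the page's content stream has already been decompressed into text. The sample stream and the `hex_tj_operands` helper are made up for illustration; the regular expression only handles hex-string operands of Tj, not literal strings or TJ arrays.

```python
import re

# A hex-string operand immediately followed by the Tj operator.
HEX_TJ = re.compile(r"<([0-9A-Fa-f\s]*)>\s*Tj")


def hex_tj_operands(content):
    """Return the hex digits of every hex-string Tj operand, whitespace removed."""
    return ["".join(m.split()).upper() for m in HEX_TJ.findall(content)]


# Hypothetical content stream: the marker "TAG" is drawn using codes, not ASCII.
content = "BT /F1 12 Tf 72 700 Td <000300110024> Tj ET"

print(hex_tj_operands(content))  # prints ['000300110024']
print("TAG" in content)          # prints False: a plain text search misses the marker
```

The operands only become searchable once each code has been mapped back to a character.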

Thomas Hoevel

Post subject: Re: Failure to retrieve text of PDF documents

Posted: Thu May 23, 2019 8:13 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3097
Location: Cologne, Germany

rjdunnill wrote:

Shouldn't PDFsharp be internally converting the indexes to their respective characters?

I fully understand that this would be convenient for you, but since PDFsharp does not do anything with the strings (yet), this functionality is not yet included in PDFsharp.
Feel free to share your code if you implement this conversion.

_________________
Regards
Thomas Hoevel
PDFsharp Team

rjdunnill

Post subject: Re: Failure to retrieve text of PDF documents

Posted: Fri May 31, 2019 1:48 am

Joined: Wed May 15, 2019 8:30 pm
Posts: 3

Management approval would be required to share the code; I'll ask. Meanwhile, our addition will consist of adding a method to extract condensed text from a page (without spaces), similar to IronPDF's ExtractTextFromPage() method, to PdfPage. (ExtractTextFromPage() extracts the text, sans spaces, but doesn't work properly with Unicode-encoded documents.)

With regard to adding this functionality to PdfSharp, is there currently any support inside PdfSharp for ToUnicode CMaps? Do I have to create my own class, or can I use an existing one within PdfSharp? And is there any internal support for parsing the ToUnicode maps?
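
For reference, this is roughly what parsing a ToUnicode CMap involves, as a minimal Python sketch rather than anything taken from PdfSharp. It assumes the CMap stream has already been decompressed to text and only handles the `bfchar` and `bfrange` sections. The `parse_to_unicode` name and the sample CMap are hypothetical, and edge cases such as malformed hex strings are not handled.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def _utf16(hex_digits):
    """Destination values in a ToUnicode CMap are UTF-16BE code units."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_to_unicode(cmap_text):
    """Build a {character code: Unicode string} table from CMap text."""
    mapping = {}
    # bfchar entries map one source code to one destination: <src> <dst>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = _utf16(dst)
    # bfrange entries are <lo> <hi> <dst> or <lo> <hi> [<dst1> <dst2> ...]
    entry = re.compile(HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])", re.S)
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst, array in entry.findall(section):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if dst:
                # Consecutive codes map to consecutive destination values.
                base, width = int(dst, 16), len(dst)
                for offset, code in enumerate(codes):
                    mapping[code] = _utf16(format(base + offset, "0%dX" % width))
            else:
                # Each code in the range has its own destination in the array.
                for code, target in zip(codes, re.findall(HEX, array)):
                    mapping[code] = _utf16(target)
    return mapping


# Hypothetical CMap fragment for illustration only.
sample_cmap = """
2 beginbfchar
<0003> <0054>
<0011> <0041>
endbfchar
1 beginbfrange
<0024> <0026> <0047>
endbfrange
"""
table = parse_to_unicode(sample_cmap)
print(table[0x0003], table[0x0011], table[0x0024])  # prints T A G
```

The resulting table has the shape a Tj decoder needs: look up each code from the operand and concatenate the mapped strings.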
