PDFsharp & MigraDoc Foundation
http://forum.pdfsharp.com/

CMaps for Text Extraction
http://forum.pdfsharp.com/viewtopic.php?f=2&t=2906

Author:  cscangarella [ Wed Aug 20, 2014 6:53 pm ]
Post subject:  CMaps for Text Extraction

I've been working on a text extractor for PDFs using the PDFsharp library. First and foremost, I'd like to thank everyone who has worked on this library. It's been a ton of help, and I would have given up on this project a long time ago without it.

Things are coming along quite well, and for the most part I've finished this task. However, any content that uses fonts that require a CMap doesn't extract correctly (understandably, as its bytes are character codes that have to be mapped to Unicode values). Are there any PDFsharp classes that can help out with this? I can always go into the ToUnicode stream and parse it out myself, but I don't believe in reinventing the wheel, so I figured that I'd ask. I've noticed PdfSharp.Fonts.CMapInfo but am unsure of its usage.
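
For reference, parsing the ToUnicode stream by hand is fairly contained. Below is a minimal sketch of the idea, written in Python rather than C# and using no PDFsharp API. It assumes the font's /ToUnicode stream has already been located and decompressed into text, that every code has the width given by the first codespace range, and that all hex strings have an even number of digits. The names parse_to_unicode and decode are hypothetical helpers, not anything the library provides.

```python
import re

_H = r"<([0-9A-Fa-f]+)>"  # one hex string token, e.g. <0041>
_BFCHAR = re.compile(_H + r"\s*" + _H)
_BFRANGE = re.compile(_H + r"\s*" + _H + r"\s*(?:" + _H + r"|\[(.*?)\])", re.S)


def _utf16(hex_digits):
    """Decode a hex string holding UTF-16BE code units."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_to_unicode(cmap_text):
    """Return (code_length_in_bytes, {character_code: unicode_string})."""
    mapping = {}

    # Assumption: all codes share the width of the first codespace range.
    m = re.search(r"begincodespacerange\s*" + _H, cmap_text)
    code_len = len(m.group(1)) // 2 if m else 1

    # bfchar: one source code maps to one destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in _BFCHAR.findall(block):
            mapping[int(src, 16)] = _utf16(dst)

    # bfrange: <lo> <hi> <start> or <lo> <hi> [<dst0> <dst1> ...]
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, start, array in _BFRANGE.findall(block):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if start:
                # Destination increments by one for each code in the range.
                base, width = int(start, 16), len(start)
                for i, code in enumerate(codes):
                    mapping[code] = _utf16(format(base + i, "0{}X".format(width)))
            else:
                # Explicit destination per code, listed in an array.
                for code, dst in zip(codes, re.findall(_H, array)):
                    mapping[code] = _utf16(dst)
    return code_len, mapping


def decode(raw, code_len, mapping):
    """Map the raw bytes of a shown string to text; unknown codes become U+FFFD."""
    out = []
    for i in range(0, len(raw), code_len):
        code = int.from_bytes(raw[i:i + code_len], "big")
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)


# Made-up CMap fragment for illustration only.
SAMPLE = """
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
2 beginbfrange
<0025> <0027> <0042>
<0030> <0032> [<0061> <0062> <0063>]
endbfrange
endcmap
"""

code_len, table = parse_to_unicode(SAMPLE)
print(decode(bytes.fromhex("0024002500030031"), code_len, table))  # AB b
```

The same table would then be applied to the operand bytes of each text-showing operator for that font. A real extractor would also need to handle mixed-width codespace ranges and fonts with no ToUnicode entry at all, which this sketch leaves out.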

Author:  Caivs [ Sun Jan 27, 2019 7:46 pm ]
Post subject:  Re: CMaps for Text Extraction

Maybe something has changed since the original post. Does PDFsharp have any features to parse the /ToUnicode stream and get a character map from it?

All times are UTC