PDFsharp & MigraDoc Foundation http://forum.pdfsharp.com/ |
CMaps for Text Extraction http://forum.pdfsharp.com/viewtopic.php?f=2&t=2906 |
Page 1 of 1 |
Author: | cscangarella [ Wed Aug 20, 2014 6:53 pm ] |
Post subject: | CMaps for Text Extraction |
I've been working on a text extractor for PDFs using the PDFsharp library. First and foremost, I'd like to thank everyone who has worked on this library. It's been a ton of help, and I would have given up on this project a long time ago without it. Things are coming along quite well, and for the most part I've finished this task. However, any content that uses fonts requiring a CMap doesn't extract correctly (understandably, since the bytes in the content stream are character codes that have to be mapped to Unicode values). Are there any PDFsharp classes that can help out with this? I can always go into the ToUnicode stream and parse it out myself, but I don't believe in reinventing the wheel, so I figured I'd ask. I've noticed PdfSharp.Fonts.CMapInfo but am unsure of its usage. |
Author: | Caivs [ Sun Jan 27, 2019 7:46 pm ] |
Post subject: | Re: CMaps for Text Extraction |
Maybe something has changed since the original post. Does PDFsharp have any features to parse the /ToUnicode stream and get a character map from it? |
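For anyone who ends up parsing the /ToUnicode stream by hand, here is a rough sketch of what that can look like. It is not a PDFsharp API, and it is written in Python only to keep it short; the same logic ports to C# directly. The names `parse_tounicode` and `decode_codes` are made up for this example. It assumes the stream has already been decompressed and turned into text, that character codes have a fixed width (2 bytes by default), and that every hex string has an even number of digits. It reads only the `bfchar` and `bfrange` sections and ignores everything else in the CMap.

```python
import re

# One hex string such as <0041>; even digit count is assumed.
HEX = re.compile(r"<([0-9A-Fa-f]+)>")
# One bfrange entry: <lo> <hi> <dst>  or  <lo> <hi> [<d1> <d2> ...]
RANGE = re.compile(
    r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*"
    r"(?:<([0-9A-Fa-f]+)>|\[(.*?)\])",
    re.S,
)


def _utf16(hex_digits):
    # Destination values in a ToUnicode CMap are UTF-16BE code units.
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_tounicode(cmap_text):
    """Return {character code (int): Unicode string} (hypothetical helper)."""
    mapping = {}

    # bfchar sections: pairs of <source code> <destination>.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        tokens = HEX.findall(block)
        for src, dst in zip(tokens[0::2], tokens[1::2]):
            mapping[int(src, 16)] = _utf16(dst)

    # bfrange sections: a run of codes mapped to consecutive values,
    # or to an explicit array with one destination per code.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, single, array in RANGE.findall(block):
            lo_i, hi_i = int(lo, 16), int(hi, 16)
            if single:
                base = bytes.fromhex(single)
                start = int.from_bytes(base, "big")
                for offset in range(hi_i - lo_i + 1):
                    value = (start + offset).to_bytes(len(base), "big")
                    mapping[lo_i + offset] = value.decode("utf-16-be")
            else:
                dsts = HEX.findall(array)[: hi_i - lo_i + 1]
                for offset, dst in enumerate(dsts):
                    mapping[lo_i + offset] = _utf16(dst)
    return mapping


def decode_codes(raw, mapping, code_bytes=2):
    """Map the raw bytes of a shown string to text; unknown codes become U+FFFD."""
    out = []
    for i in range(0, len(raw), code_bytes):
        code = int.from_bytes(raw[i:i + code_bytes], "big")
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)
```

The idea would be to build one mapping per font that has a /ToUnicode entry, then run the bytes of each string operand shown with that font through `decode_codes`. Fonts with single-byte codes would need `code_bytes=1`, and a more careful version would take the code width from the `codespacerange` section instead of assuming it.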
All times are UTC |