PDFsharp & MigraDoc Foundation • View topic - accessing text in a pdf document

All times are UTC

accessing text in a pdf document

Moderator: Stefan Lange

gary

Post subject: accessing text in a pdf document

Posted: Mon Feb 26, 2007 10:00 pm

Joined: Sun Feb 25, 2007 1:19 pm
Posts: 1

Hi there all... I've tried creating a few test applications to answer this question, but cannot figure it out!

Can someone give me a simple example showing how to extract all the text in a pdf document into a single string? I would *greatly* appreciate any help you can provide!

aknuth

Posted: Wed May 16, 2007 5:46 pm

Joined: Fri Mar 23, 2007 11:37 pm
Posts: 16
Location: Berlin

Hello,
this is a very dirty solution, but it shows one way to get what you want. You do need to handle the encoding properly: the example assumes that the PDF text is encoded in the system default encoding.

It extracts text from the first page only.

Code:

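using System.Diagnostics;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

// Matches one of the operators Tw, Td, Tc, Tm or T*, then a string operand, then Tj,
// all on the same line; the literal between parentheses is captured in the "text" group.
string pdfTextRegexp = @"(T[wdcm*])[\s]*(\[([^\]]*)\]|\((?<text>[^\)]*)\))[\s]*Tj";

// file: path of the PDF to read
PdfDocument r = PdfReader.Open(file);
PdfContents contents = r.Pages[0].Contents;   // first page only
foreach (PdfReference o in contents.Elements) {
   PdfContent c = o.Value as PdfContent;
   if (c != null) {
      // Assumes the content stream bytes are in the system default encoding
      string content = Encoding.Default.GetString(c.Stream.Value);
      using (StringReader sr = new StringReader(content)) {
         string line;
         while ((line = sr.ReadLine()) != null) {
            Match m = Regex.Match(line, pdfTextRegexp, RegexOptions.Compiled);
            if (m.Success) {
               Debug.WriteLine(m.Groups["text"].Value);
            }
         }
      }
   }
}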
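
The original question asked for all the text in a single string, while the snippet above only writes matches from the first page to the debug output. A minimal sketch of how it might be extended, using only the PDFsharp calls already shown above: it loops over every page, lets the caller pick the encoding, and collects the matches in a StringBuilder. The names PdfTextSketch, AppendText and ExtractAllText are made up for illustration and are not part of PDFsharp.

```csharp
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

// Hypothetical helper class, not part of the PDFsharp API.
static class PdfTextSketch
{
    // Same pattern as in the snippet above, compiled once and reused.
    static readonly Regex TextPattern = new Regex(
        @"(T[wdcm*])[\s]*(\[([^\]]*)\]|\((?<text>[^\)]*)\))[\s]*Tj",
        RegexOptions.Compiled);

    // Scans one content stream line by line and appends the "text" group
    // of the first match on each matching line, one result per output line.
    public static void AppendText(string content, StringBuilder output)
    {
        using (StringReader reader = new StringReader(content))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Match m = TextPattern.Match(line);
                if (m.Success)
                    output.AppendLine(m.Groups["text"].Value);
            }
        }
    }

    // Walks all pages instead of only Pages[0] and returns one string.
    // The caller chooses the encoding instead of relying on Encoding.Default.
    public static string ExtractAllText(string file, Encoding encoding)
    {
        StringBuilder output = new StringBuilder();
        PdfDocument document = PdfReader.Open(file);
        for (int i = 0; i < document.Pages.Count; i++)
        {
            foreach (PdfItem item in document.Pages[i].Contents.Elements)
            {
                // Skip anything that is not a reference to a content stream.
                PdfReference reference = item as PdfReference;
                PdfContent c = reference == null ? null : reference.Value as PdfContent;
                if (c != null)
                    AppendText(encoding.GetString(c.Stream.Value), output);
            }
        }
        return output.ToString();
    }
}
```

A call could look like `string text = PdfTextSketch.ExtractAllText(file, Encoding.Default);`. The sketch inherits the limits of the pattern: a line only matches when the operator, the string and Tj all appear on that same line, so text written by many real-world PDF producers will be missed.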

Anyone who has a better solution, hopefully using the PDFsharp API, please contribute.

Regards,
André
