Hello,
this is a very dirty solution, but it shows one way to get what you want. You have to take care of the encoding yourself: the example assumes that the PDF text is encoded in the default system encoding.
It extracts text from the first page only.
Code:
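using System.Diagnostics;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

// Looks for "(...) Tj" preceded by one of the operators Tw, Td, Tc, Tm or T*.
string pdfTextRegexp = @"(T[wdcm*])[\s]*(\[([^\]]*)\]|\((?<text>[^\)]*)\))[\s]*Tj";
Regex pdfTextRegex = new Regex(pdfTextRegexp, RegexOptions.Compiled); // built once, reused per line
// 'file' is the path of the PDF to read.
PdfDocument r = PdfReader.Open(file);
PdfContents contents = r.Pages[0].Contents; // first page only
foreach (PdfReference o in contents.Elements) {
    PdfContent c = o.Value as PdfContent;
    if (c != null) {
        // Assumes the stream bytes are in the default system encoding.
        string content = Encoding.Default.GetString(c.Stream.Value);
        using (StringReader sr = new StringReader(content)) {
            string line;
            while ((line = sr.ReadLine()) != null) {
                Match m = pdfTextRegex.Match(line);
                if (m.Success) {
                    Debug.WriteLine(m.Groups["text"].Value);
                }
            }
        }
    }
}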
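To see what the pattern does and does not catch, here is a small sketch that runs the same expression against a few content-stream lines. The sample lines are made up for illustration and the helper name ExtractText is hypothetical; neither comes from a real PDF or from PDFsharp, and the sketch needs no PDFsharp reference at all.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexDemo
{
    // Same pattern as in the snippet above.
    const string PdfTextRegexp =
        @"(T[wdcm*])[\s]*(\[([^\]]*)\]|\((?<text>[^\)]*)\))[\s]*Tj";

    // Hypothetical helper: returns the captured text, or null if the line does not match.
    static string ExtractText(string line)
    {
        Match m = Regex.Match(line, PdfTextRegexp);
        return m.Success ? m.Groups["text"].Value : null;
    }

    static void Main()
    {
        // Made-up sample lines written in content-stream syntax.
        string[] samples =
        {
            "1 0 0 1 50 700 Tm (Hello World) Tj", // prints "Hello World"
            "/F1 12 Tf (Hi) Tj",                  // prints "<no match>": Tf is not in T[wdcm*]
            "[(Hel) -20 (lo)] TJ",                // prints "<no match>": the pattern only knows Tj
        };

        foreach (string s in samples)
        {
            Console.WriteLine(ExtractText(s) ?? "<no match>");
        }
    }
}
```

So text is only found when the string and its Tj sit on the same line, directly after one of Tw, Td, Tc, Tm or T*. Strings shown with TJ, or preceded by another operator such as Tf, are skipped.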
If anyone has a better solution, preferably one that uses the PDFsharp API, please contribute.
Regards,
André