PDFsharp & MigraDoc Foundation :: View topic - Reading PDF contents?

PDFsharp & MigraDoc Foundation http://forum.pdfsharp.com/

Reading PDF contents? http://forum.pdfsharp.com/viewtopic.php?f=2&t=452

Author:	Megidolaon [ Tue Aug 19, 2008 10:54 am ]
Post subject:	Reading PDF contents?
Hello, I've just started using PDFsharp and I was wondering how to read the content of a PDF. I tried looping through the Pages.Elements property of the PdfDocument class, but I get an error saying that a DictionaryEntry cannot be converted to the type DictionaryElements. Alternatively, I tried the PdfContent object returned by the CreateSingleContent method of a PdfPage, but all I get is a handful of cryptic values (something like "7 0 R" or "120 B") as the whole content of a PDF that contains text and a table with at least 50 values. Also, is there a difference between reading normal text and reading the contents of a table? Thanks in advance.

Author:	gkataria [ Wed Aug 20, 2008 11:26 am ]
Post subject:
I was able to get the images of a page with the code below, but I am still unable to find the text. Put the following code in any click event handler:

    // Requires: using System.IO; using PdfSharp.Pdf; using PdfSharp.Pdf.Advanced; using PdfSharp.Pdf.IO;
    PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);
    int imageCount = 0;
    // Iterate pages
    foreach (PdfPage page in document.Pages)
    {
        // Get resources dictionary
        PdfDictionary resources = page.Elements.GetDictionary("/Resources");
        if (resources != null)
        {
            // Get external objects dictionary
            PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
            if (xObjects != null)
            {
                PdfItem[] items = xObjects.Elements.Values;
                // Iterate references to external objects
                foreach (PdfItem item in items)
                {
                    PdfReference reference = item as PdfReference;
                    if (reference != null)
                    {
                        PdfDictionary xObject = reference.Value as PdfDictionary;
                        // Is external object an image?
                        if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                        {
                            imageCount++;
                            ExportImage(xObject, imageCount);
                        }
                    }
                }
            }
        }
    }

The following functions are used:

    /// <summary>
    /// Currently extracts only JPEG images.
    /// </summary>
    static void ExportImage(PdfDictionary image, int count)
    {
        string filter = image.Elements.GetName("/Filter");
        switch (filter)
        {
            case "/DCTDecode":
                ExportJpegImage(image, count);
                break;
            case "/FlateDecode":
                // ExportAsPngImage is not included in this post.
                ExportAsPngImage(image, count);
                break;
        }
    }

    /// <summary>
    /// Exports a JPEG image.
    /// </summary>
    static void ExportJpegImage(PdfDictionary image, int count)
    {
        // Fortunately JPEG has native support in PDF, so exporting an image
        // is just writing the stream to a file.
        byte[] stream = image.Stream.Value;
        File.WriteAllBytes("C:\\poc_image_" + count.ToString() + ".jpeg", stream);
    }
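For the text part that the code above does not cover, here is a minimal sketch. It assumes a PDFsharp release that ships the PdfSharp.Pdf.Content namespace (ContentReader, COperator, CSequence, CString), which may not be the version used in this thread. The class name TextSketch and the ExtractText helper are made up for illustration. The sketch walks each page's content stream and prints the string operands of the Tj and TJ text-showing operators. These are the raw strings from the stream, so for fonts with custom encodings the output may not be readable text, and no position information is computed.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

// Hypothetical helper class, not part of PDFsharp.
static class TextSketch
{
    // Recursively collects the string operands of the Tj and TJ operators.
    static IEnumerable<string> ExtractText(CObject obj)
    {
        if (obj is COperator)
        {
            COperator op = (COperator)obj;
            if (op.OpCode.Name == "Tj" || op.OpCode.Name == "TJ")
            {
                foreach (CObject operand in op.Operands)
                    foreach (string s in ExtractText(operand))
                        yield return s;
            }
        }
        else if (obj is CSequence) // also covers CArray, the operand of TJ
        {
            foreach (CObject element in (CSequence)obj)
                foreach (string s in ExtractText(element))
                    yield return s;
        }
        else if (obj is CString)
        {
            yield return ((CString)obj).Value;
        }
    }

    static void Main()
    {
        // Same sample path as in the post above.
        PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);
        foreach (PdfPage page in document.Pages)
        {
            // Parse the page's content stream into operators and operands.
            CSequence content = ContentReader.ReadContent(page);
            foreach (string s in ExtractText(content))
                Console.WriteLine(s);
        }
    }
}
```

Numeric operands inside TJ arrays (kerning adjustments) are skipped, so words may come out split into fragments.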

Author:	blackjack2150 [ Thu Aug 21, 2008 7:41 am ]
Post subject:
Hi. For text extraction you can use the PDFBox library. For .NET you also have to add a reference to IKVM to your project. An easy solution is the Text Mining Tool (which uses PDFBox). Just google it.
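A minimal sketch of what that could look like from C#, assuming an IKVM-compiled PDFBox build whose classes live in the org.pdfbox.* namespaces (as in the 0.7.x .NET builds; later builds use org.apache.pdfbox.* and may differ in other details). The class name PdfBoxSketch and the file path are placeholders.

```csharp
using System;
// Assumption: namespaces as exposed by an IKVM-compiled PDFBox 0.7.x build.
// Newer builds use org.apache.pdfbox.* instead.
using org.pdfbox.pdmodel;
using org.pdfbox.util;

// Hypothetical class name, for illustration only.
static class PdfBoxSketch
{
    static void Main()
    {
        // Placeholder path, reusing the sample file name from this thread.
        PDDocument doc = PDDocument.load("C:\\HelloWorld.pdf");
        try
        {
            // PDFTextStripper returns the document text as one string.
            PDFTextStripper stripper = new PDFTextStripper();
            Console.WriteLine(stripper.getText(doc));
        }
        finally
        {
            // Release the file handle held by the Java-side document.
            doc.close();
        }
    }
}
```

This returns plain text only. It does not report where each string sits on the page.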

Author:	gkataria [ Tue Aug 26, 2008 12:56 pm ]
Post subject:
But I actually need to find the position of each text and image object as well.

All times are UTC