PDFsharp & MigraDoc Foundation • View topic

All times are UTC

Reading PDF contents?

Moderator: Stefan Lange

Megidolaon

Post subject: Reading PDF contents?

Posted: Tue Aug 19, 2008 10:54 am

Joined: Tue Aug 19, 2008 9:55 am
Posts: 1

Hello, I've just started using PDFsharp and I was wondering how you can read the content of a PDF.

I tried looping through the Pages.Elements property of the PdfDocument class, but I get an error saying that it cannot convert from DictionaryEntry to type DictionaryElements.
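One way to avoid the cast, sketched here under the assumption of a PDFsharp 1.3x or later build: enumerate the keys of each page's Elements dictionary and look the values up by key. The file path and the class name ListPageElements are placeholders, not part of the original post.

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class ListPageElements
{
    static void Main()
    {
        // Placeholder path; point this at a real file.
        PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);

        foreach (PdfPage page in document.Pages)
        {
            // Elements is the page's low-level dictionary: name keys such as
            // /Contents or /Resources mapped to PdfItem values.
            foreach (string key in page.Elements.Keys)
            {
                PdfItem value = page.Elements[key];
                Console.WriteLine("{0} = {1}", key, value);
            }
        }
    }
}
```

Note that these entries describe the structure of the page. A value such as "7 0 R" is an indirect reference to another object in the file, not text that appears on the page.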

Alternatively, I tried using the PdfContent class returned by the CreateSingleContent method of a PdfPage, but all I get is a handful of cryptic values (something like "7 0 R" or "120 B") as the entire content of a PDF that contains text and a table with at least 50 values.
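A minimal sketch of reading the text operators from a page's content stream, assuming a PDFsharp build that ships the PdfSharp.Pdf.Content namespace (1.3x or later). The file path and the names PageTextSketch and ExtractText are illustrative only. It collects the string operands of the Tj and TJ operators and nothing else, so it is a starting point rather than a complete text extractor.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

static class PageTextSketch
{
    // Recursively collects the string operands of the Tj and TJ operators.
    static IEnumerable<string> ExtractText(CObject obj)
    {
        if (obj is COperator)
        {
            COperator op = (COperator)obj;
            if (op.OpCode.OpCodeName == OpCodeName.Tj || op.OpCode.OpCodeName == OpCodeName.TJ)
            {
                foreach (CObject operand in op.Operands)
                    foreach (string s in ExtractText(operand))
                        yield return s;
            }
        }
        else if (obj is CSequence) // also covers CArray, the operand type of TJ
        {
            foreach (CObject child in (CSequence)obj)
                foreach (string s in ExtractText(child))
                    yield return s;
        }
        else if (obj is CString)
        {
            yield return ((CString)obj).Value;
        }
    }

    static void Main()
    {
        // Placeholder path; point this at a real file.
        PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);

        foreach (PdfPage page in document.Pages)
        {
            // Parse the page's content stream(s) into operators and operands.
            CSequence content = ContentReader.ReadContent(page);
            foreach (string fragment in ExtractText(content))
                Console.WriteLine(fragment);
        }
    }
}
```

The strings come back as the raw bytes stored in the stream, so fonts with a custom or CID encoding will print as character codes rather than readable text. The content stream also has no notion of a table: table cells are ordinary text fragments placed at particular positions, so they are read the same way as any other text.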

Also, is there a difference between reading normal text and the contents of a table?

Thanks in advance.

gkataria

Posted: Wed Aug 20, 2008 11:26 am

Joined: Wed Aug 20, 2008 11:21 am
Posts: 3

I was able to get the images of a page with the code below, but I am still unable to find the text.

Put the code below in any click event handler:

// Needs: using PdfSharp.Pdf; using PdfSharp.Pdf.Advanced; using PdfSharp.Pdf.IO;
PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);

int imageCount = 0;
// Iterate pages
foreach (PdfPage page in document.Pages)
{
    // Get resources dictionary
    PdfDictionary resources = page.Elements.GetDictionary("/Resources");
    if (resources != null)
    {
        // Get external objects dictionary
        PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
        if (xObjects != null)
        {
            // Iterate references to external objects
            foreach (PdfItem item in xObjects.Elements.Values)
            {
                PdfReference reference = item as PdfReference;
                if (reference != null)
                {
                    PdfDictionary xObject = reference.Value as PdfDictionary;
                    // Is external object an image?
                    if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                    {
                        imageCount++;
                        ExportImage(xObject, imageCount);
                    }
                }
            }
        }
    }
}

The following functions are used:

/// <summary>
/// Currently extracts only JPEG images.
/// </summary>
static void ExportImage(PdfDictionary image, int count)
{
    string filter = image.Elements.GetName("/Filter");
    switch (filter)
    {
        case "/DCTDecode":
            ExportJpegImage(image, count);
            break;

        case "/FlateDecode":
            // ExportAsPngImage is not included in this post; remove this case
            // if you do not have an implementation of it.
            ExportAsPngImage(image, count);
            break;
    }
}

/// <summary>
/// Exports a JPEG image.
/// </summary>
static void ExportJpegImage(PdfDictionary image, int count)
{
    // Fortunately JPEG has native support in PDF, so exporting an image is
    // just writing the stream to a file. Needs: using System.IO;
    byte[] stream = image.Stream.Value;
    File.WriteAllBytes("C:\\poc_image_" + count.ToString() + ".jpeg", stream);
}

blackjack2150

Posted: Thu Aug 21, 2008 7:41 am

Joined: Thu Aug 21, 2008 7:23 am
Posts: 5

Hi. For text extraction you can use the PDFBox library. To use it from .NET you also have to add a reference to IKVM to your project. An easy solution is the Text Mining Tool (which uses PDFBox). Just google it.

gkataria

Posted: Tue Aug 26, 2008 12:56 pm

Joined: Wed Aug 20, 2008 11:21 am
Posts: 3

But I actually need to find the position of each text and image object as well.
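A rough sketch of where position information lives, again assuming a PDFsharp build with the PdfSharp.Pdf.Content namespace; the path and the names PlacementDump and JoinOperands are made up for illustration. In a content stream, cm changes the current transformation matrix, Do paints an XObject (such as an image) into the unit square mapped by that matrix, and Tm and Td set or move the text position before Tj or TJ shows text. The code only prints these operators in order with their operands. It does not multiply the matrices together or track the q/Q graphics state stack, which a real position calculation would need.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

static class PlacementDump
{
    static void Main()
    {
        // Placeholder path; point this at a real file.
        PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);

        foreach (PdfPage page in document.Pages)
        {
            CSequence content = ContentReader.ReadContent(page);
            foreach (CObject obj in content)
            {
                COperator op = obj as COperator;
                if (op == null)
                    continue;

                switch (op.OpCode.OpCodeName)
                {
                    case OpCodeName.cm: // matrix used to place whatever is drawn next
                    case OpCodeName.Do: // paint a named XObject, for example an image
                    case OpCodeName.Tm: // set the text matrix
                    case OpCodeName.Td: // move the text position
                    case OpCodeName.Tj: // show a string
                    case OpCodeName.TJ: // show an array of strings and spacing values
                        Console.WriteLine("{0} {1}", JoinOperands(op), op.OpCode.Name);
                        break;
                }
            }
        }
    }

    // Formats the operands of an operator as a space-separated string.
    static string JoinOperands(COperator op)
    {
        List<string> parts = new List<string>();
        foreach (CObject operand in op.Operands)
            parts.Add(operand.ToString());
        return string.Join(" ", parts.ToArray());
    }
}
```

For an unrotated image, the six operands a b c d e f of the cm that precedes its Do typically give the lower-left corner (e, f) and the size (a, d) in page units, provided no outer cm is still in effect.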
