PdfPageTextObject.chars() returns wrong results for text objects with overlapping bounding boxes #98

@cemerick

I'd like to use pdfium-render to access all "primitive" elements (characters, paths, images) in the order that they are rendered, so that I can determine visibility for each such element (accounting for occlusion of primitives rendered earlier due to simple obstruction, clipping paths, etc).

I figured that I would be able to do this by iterating through PdfPage.objects(), and within that, iterating through each PdfPageTextObject.chars(). However, the latter doesn't retrieve individual chars specifically associated with a given text object; rather, it grounds out in a bounding-box search:

    pub fn chars_for_object(
        &self,
        object: &PdfPageTextObject,
    ) -> Result<PdfPageTextChars, PdfiumError> {
        self.chars_inside_rect(object.bounds()?)
            .map_err(|_| PdfiumError::NoCharsInPageObject)
    }

Of course, this doesn't reflect the original rendering order at all. Worse, when text objects have overlapping bounding boxes, the same character is visited multiple times: once for each object whose bounds contain it.
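To make the double-visiting concrete, here is a minimal standalone sketch of the problem. It does not use pdfium-render at all: `Rect`, `PageChar`, `chars_inside_rect`, and `demo` are hypothetical stand-ins I made up for illustration, and I'm assuming a character counts as "inside" a rectangle when its origin point falls within it, which may not match what the library actually does.

```rust
/// Hypothetical axis-aligned rectangle in page coordinates (not a pdfium-render type).
#[derive(Clone, Copy)]
struct Rect {
    left: f32,
    bottom: f32,
    right: f32,
    top: f32,
}

impl Rect {
    /// True if the point (x, y) lies inside or on the edge of the rectangle.
    fn contains(&self, x: f32, y: f32) -> bool {
        x >= self.left && x <= self.right && y >= self.bottom && y <= self.top
    }
}

/// Hypothetical page character: glyph, origin, and the index of the text
/// object that actually drew it.
#[derive(Clone, Copy)]
struct PageChar {
    glyph: char,
    x: f32,
    y: f32,
    owner: usize,
}

/// Stand-in for a bounding-box search: returns every character whose origin
/// falls inside `bounds`, regardless of which text object drew it.
fn chars_inside_rect(page_chars: &[PageChar], bounds: Rect) -> Vec<PageChar> {
    page_chars
        .iter()
        .copied()
        .filter(|c| bounds.contains(c.x, c.y))
        .collect()
}

/// Returns (characters on the page, characters visited via per-object
/// bounding boxes, visits attributed to an object that did not draw the char).
fn demo() -> (usize, usize, usize) {
    // Two text objects whose bounding boxes overlap between x = 10 and x = 20.
    let objects = [
        Rect { left: 0.0, bottom: 0.0, right: 20.0, top: 10.0 },
        Rect { left: 10.0, bottom: 0.0, right: 30.0, top: 10.0 },
    ];
    let page_chars = [
        PageChar { glyph: 'A', x: 5.0, y: 5.0, owner: 0 },
        PageChar { glyph: 'B', x: 15.0, y: 5.0, owner: 0 }, // sits in the overlap
        PageChar { glyph: 'C', x: 25.0, y: 5.0, owner: 1 },
    ];

    let mut visited = 0;
    let mut misattributed = 0;
    for (index, bounds) in objects.iter().enumerate() {
        for c in chars_inside_rect(&page_chars, *bounds) {
            visited += 1;
            if c.owner != index {
                misattributed += 1;
                println!("'{}' reported for object {} but drawn by object {}", c.glyph, index, c.owner);
            }
        }
    }
    (page_chars.len(), visited, misattributed)
}

fn main() {
    let (actual, visited, misattributed) = demo();
    println!("on page: {actual}, visited: {visited}, misattributed: {misattributed}");
}
```

In this toy setup the character in the overlap is returned for both objects, so iterating objects and then their chars yields more characters than exist on the page, and one of them is attributed to an object that never drew it.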

Is there a way to access primitives, down to the character level, in rendered order (or with a render-order property if direct iteration isn't possible)?

(Thanks so much for this library, the work is greatly appreciated. 🙇)
