Page cropbox is not used for bbox if present #1054
Comments
Hi @stefanw, and thank you for flagging this! I've now spent a couple of hours looking into it and wanted to send you this update. Although I think
Hey @jsvine, thanks for looking into this. I think the cropbox is meant to represent the visible user space and is used in other implementations as the transformation base (e.g. pdfbox). The PDF standard says:
Hi @stefanw, this should now be fixed via 07d9997 (now available on
As a demonstration of the fixes, now the character and line bounding boxes are rendered as expected:
Your interpretation of
Thanks again for opening this issue, which helped me to identify inconsistencies in how the various boxes were handled. Closing it for now, but feel free to continue the conversation, point out edge cases I might have missed, et cetera.
Describe the bug
When a PDF page contains a cropbox that differs from the mediabox, the positions of the extracted text will not be correct, and drawing them via page.to_image().draw_rects(...) will not work as expected.
Code to reproduce the problem
PDF file
pdfplumber-cropbox.pdf
Expected behavior
I would expect pdfplumber to take the cropbox as the bbox.
Actual behavior
Even when the cropbox is present it is not taken as the page's bbox.
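To illustrate why the choice of base box matters for positions, here is a minimal sketch of the arithmetic only. It is not pdfplumber's code: the helper name to_top_left and all box values are hypothetical, and it assumes boxes are given as (x0, y0, x1, y1) in PDF user space with the origin at the bottom left.

```python
from typing import Tuple

# (x0, y0, x1, y1) in PDF user space, origin at the bottom-left corner.
Box = Tuple[float, float, float, float]


def to_top_left(rect: Box, base: Box) -> Box:
    """Express rect relative to the top-left corner of base (y grows downward).

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    x0, y0, x1, y1 = rect
    bx0, _by0, _bx1, by1 = base
    return (x0 - bx0, by1 - y1, x1 - bx0, by1 - y0)


# Made-up values, chosen only to show the offset.
mediabox: Box = (0.0, 0.0, 612.0, 792.0)
cropbox: Box = (50.0, 100.0, 562.0, 692.0)
char: Box = (72.0, 600.0, 80.0, 612.0)

relative_to_media = to_top_left(char, mediabox)
relative_to_crop = to_top_left(char, cropbox)
```

With these values the same character lands 50 units further left and 100 units higher when measured against the cropbox than against the mediabox, which is the kind of shift that makes rectangles drawn on the rendered page miss the text.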
Screenshots
Environment
Additional context
The problem seems to be this line in the Page class:
The mediabox is a required page attribute, so self.cropbox will never be assigned as the mediabox.
The cropbox is optional, so my guess is that the code should be:
Indeed this change improves the positioning significantly, although it's still not perfectly aligned.
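For readers following along, the fallback described above (use the cropbox when present, otherwise the mediabox) can be sketched roughly as follows. This is only an illustration under assumptions: the function name choose_page_box is hypothetical, the page attributes are assumed to be a plain mapping with a "MediaBox" key and an optional "CropBox" key, and it is not the actual change made in pdfplumber.

```python
from typing import Mapping, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]


def choose_page_box(attrs: Mapping[str, Optional[Sequence[float]]]) -> Box:
    """Return the CropBox if the page defines one, else the MediaBox.

    Hypothetical helper; attrs stands in for already-resolved page attributes.
    """
    media = attrs.get("MediaBox")
    if media is None:
        # MediaBox is a required page attribute in PDF.
        raise ValueError("page has no MediaBox")
    crop = attrs.get("CropBox")
    chosen = crop if crop is not None else media
    x0, y0, x1, y1 = (float(v) for v in chosen)
    return (x0, y0, x1, y1)


# Made-up attribute dictionaries for illustration.
page_without_crop = {"MediaBox": [0, 0, 612, 792]}
page_with_crop = {"MediaBox": [0, 0, 612, 792], "CropBox": [50, 100, 562, 692]}

bbox_a = choose_page_box(page_without_crop)
bbox_b = choose_page_box(page_with_crop)
```

The sketch checks the optional box first because the required one is always present, so a fallback written the other way round would never take effect.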