Plumbing pdf results in mixed characters of neighbouring words #764
Comments
Hi @XuShanJiang, have you tried adjusting the |
Hi @XuShanJiang, just checking back on this.
Hi @jsvine,
Thank you for letting me know. A few observations:
Hello @jsvine, thank you for the solution. Okay, adding these options works for |
Right now, that's not possible with For the specific PDF discussed above, however, I don't think it'd work, due to the character-positioning issues. (I.e., many characters that should be inside a particular table cell are not.) |
@Rustemhak, in the latest version of |
Describe the bug
The text of the PDF is not extracted correctly. Words are incomplete, and characters of neighbouring words are mixed together.
Code to reproduce the problem
import pdfplumber

with pdfplumber.open("woo-besluit-contacten-rabo-pveu.pdf") as pdf:
    # Print the extracted text of every page.
    for page in pdf.pages:
        print(page.extract_text())
PDF file
woo-besluit-contacten-rabo-pveu.pdf
If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.
Expected behavior
I expected the text to be extracted correctly. I have processed several similar documents in which the extracted words come out normally, matching the PDF file.
Actual behavior
Some similar PDF files (including this one) are extracted very strangely: the characters of neighbouring words are mixed together.
For example, Pagina 7 is read as Pag7iv naa7n.
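To illustrate how this kind of interleaving can arise, here is a toy sketch. It is not pdfplumber's actual algorithm, only a simplified model that assumes text is rebuilt by grouping characters into lines by vertical position and then ordering them left to right. The keys `text`, `x0` and `top` mirror the ones pdfplumber exposes on entries of `page.chars`; the helpers `make_run` and `lines_by_position` are hypothetical names made up for this example. If two runs of text on the same line overlap horizontally, their characters end up shuffled together.

```python
def make_run(text, x_start, step, top):
    """Build fake character dicts for one run of text (hypothetical helper)."""
    return [
        {"text": ch, "x0": x_start + i * step, "top": top}
        for i, ch in enumerate(text)
    ]


def lines_by_position(chars, y_tolerance=3.0):
    """Group chars into lines by 'top', then order each line by 'x0'.

    Simplified model for illustration only, not pdfplumber's implementation.
    """
    lines = []
    for ch in sorted(chars, key=lambda c: c["top"]):
        # Same line if the char sits within y_tolerance of the line's first char.
        if lines and ch["top"] - lines[-1][0]["top"] <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    ]


# Two runs placed on the same line whose x-ranges overlap.
overlapping = make_run("abc", 0, 10, top=100) + make_run("XYZ", 5, 10, top=100)
# The same two runs with the second one shifted clear of the first.
separated = make_run("abc", 0, 10, top=100) + make_run("XYZ", 40, 10, top=100)

print(lines_by_position(overlapping))
print(lines_by_position(separated))
```

Running the same grouping over the real `page.chars` of an affected page may help show whether neighbouring words really do overlap in their x-coordinates.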
Screenshots
Environment
Additional context
I also tried copying the whole PDF and pasting it into a text editor manually, which works totally fine...
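Since manual copy and paste recovers the text, one thing that might be worth checking is whether the output changes with the `x_tolerance` and `y_tolerance` arguments of `extract_text()`. The sketch below is only a diagnostic aid, with no guarantee that any setting fixes this particular file. The helper names `extract_with_tolerances` and `compare_on_file` and the chosen tolerance values are assumptions made for this example.

```python
def extract_with_tolerances(page, tolerances=(1, 2, 3, 5)):
    """Return {tolerance: extracted text} for one page.

    `page` is expected to behave like a pdfplumber page, i.e. to offer
    extract_text(x_tolerance=..., y_tolerance=...).
    """
    results = {}
    for tol in tolerances:
        results[tol] = page.extract_text(x_tolerance=tol, y_tolerance=tol)
    return results


def compare_on_file(path, page_number=0):
    """Open a PDF and run the comparison on one page (hypothetical helper)."""
    import pdfplumber  # imported lazily; only needed when a real file is used

    with pdfplumber.open(path) as pdf:
        return extract_with_tolerances(pdf.pages[page_number])
```

Calling `compare_on_file("woo-besluit-contacten-rabo-pveu.pdf")` and printing each entry side by side would show whether any tolerance yields readable words.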