Document rendered as (cid:..) sequence #122
Comments
I'd start here: |
PDFMiner writes strings of this kind when it cannot recognise the font or its encoding. Other files relevant to rendering the characters are latin_enc.py, glyphlist.py and encodingdb.py. There you can follow the whole flow, in which PDFMiner uses the cid to identify which character it should render. Most of the time, when a character is not rendered, there is an issue with the font. A couple of very useful questions on Stack Overflow (at least for me): |
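To get a feel for how badly a given document is affected, it can help to count the placeholders in the extracted text before digging into the fonts. The following is only a rough sketch: it assumes the text has already been extracted to a string and that the placeholders have the form `(cid:N)`, as in the issue title. The names `find_cids` and `cid_fraction` are made up for illustration and are not part of PDFMiner.

```python
import re

# Placeholder text of the form "(cid:123)"; the group captures the number.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the sorted list of distinct cid numbers found in `text`."""
    return sorted({int(number) for number in CID_PATTERN.findall(text)})


def cid_fraction(text):
    """Return the fraction of characters in `text` taken up by placeholders."""
    if not text:
        return 0.0
    placeholder_chars = sum(len(m.group(0)) for m in CID_PATTERN.finditer(text))
    return placeholder_chars / len(text)


if __name__ == "__main__":
    # Hypothetical extracted text, not output from the sample PDF in this issue.
    sample = "(cid:3)(cid:20)(cid:3) ok"
    print(find_cids(sample))
    print(cid_fraction(sample))
```

A fraction close to 1 suggests that the whole document uses a font PDFMiner cannot map, while a small fraction points at a few unmapped glyphs.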
Have you guys figured out how to deal with this? |
@lucadealfaro - I would start digging into the fonts, try |
I found out how it really works: https://github.com/adobe-type-tools/cmap-resources#cmap-resources-versus-cmap-tables
Which led to: |
@luizvaz Can you give some pointers to how the issue can be solved? |
It's really awkward, and at the same time very common today, to face this problem. The only way I found was to use OCR. But you can also go in the opposite direction: run OCR on each glyph ID (GID) and replace every occurrence with the recognised letter. |
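The replacement step of that idea could look roughly like the sketch below. It assumes that a table from glyph number to letter has already been built somehow (for example by OCR on each rendered glyph, which is not shown here) and that the number inside each `(cid:N)` placeholder can be used as the key. Both the function `replace_cids` and the table `hypothetical_map` are invented for illustration; none of this is PDFMiner API.

```python
import re

# Placeholder text of the form "(cid:123)"; the group captures the number.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, cid_to_char):
    """Replace each (cid:N) placeholder with cid_to_char[N] when it is known.

    Placeholders whose number is missing from the table are left untouched,
    so the glyphs that still need to be identified remain visible.
    """
    def substitute(match):
        cid = int(match.group(1))
        return cid_to_char.get(cid, match.group(0))

    return CID_PATTERN.sub(substitute, text)


if __name__ == "__main__":
    # Hypothetical table; in practice it would come from OCR on each glyph.
    hypothetical_map = {43: "H", 72: "e", 79: "l", 82: "o"}
    raw = "(cid:43)(cid:72)(cid:79)(cid:79)(cid:82) (cid:999)"
    print(replace_cids(raw, hypothetical_map))
```

The table is only valid for the font it was built from, so a document with several broken fonts would presumably need one table per font.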
@luizvaz it seems to be the only viable solution for this issue. |
The apparently simple file hosted at: https://storage.googleapis.com/lucadealfaro-share/sample_pdf_fails_convert.pdf when parsed and dumped via
results in:
Any idea of where to look for the problem?