This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Document rendered as (cid:..) sequence #122

Open
lucadealfaro opened this issue Sep 21, 2015 · 8 comments

Comments

@lucadealfaro

The apparently simple file hosted at: https://storage.googleapis.com/lucadealfaro-share/sample_pdf_fails_convert.pdf when parsed and dumped via

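import StringIO  # Python 2 module, as used in the original post
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def get_text(f):
    # f is a PDF file object opened in binary mode
    mgr = PDFResourceManager()
    s = StringIO.StringIO()
    d = TextConverter(mgr, s)
    interpreter = PDFPageInterpreter(mgr, d)
    pagenos = set()  # an empty set means "all pages"
    for page in PDFPage.get_pages(f, pagenos):
        interpreter.process_page(page)
    return s.getvalue()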

results in:

(cid:28)(cid:18)(cid:20)(cid:25)(cid:18)(cid:20) ... 

Any idea of where to look for the problem?

@jrussell999

I'd start here:
http://stackoverflow.com/search?q=cid+font
http://stackoverflow.com/search?q=pdfminer
I haven't had to figure out the cid thing myself.

@lucanaso
Contributor

lucanaso commented Dec 9, 2015

PDFMiner writes strings of this kind when it cannot recognise a character's font or encoding.
This check is performed in converter.py, line 106.
In the same file, at line 118, you can find the function that handles the case where the character is not recognised (and the (cid:..) string is written instead).

Other relevant files for rendering the characters are latin_enc.py, glyphlist.py and encodingdb.py. Here you can follow the whole flow, where PDFMiner uses the cid to identify which character it should render.

Most of the time, when a character is not rendered, there is a problem with the font.

A couple of very useful questions on Stack Overflow (at least for me):
Why character ID 160 is not recognised as Unicode in PDFMiner?
What is this (cid:51) in the output of pdf2txt?
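
To check how badly a given document is affected, one option is to measure how much of the extracted text consists of (cid:N) placeholders. The sketch below is only an illustration: the helper name `cid_ratio` is hypothetical and not part of PDFMiner, and it assumes the placeholders have exactly the `(cid:N)` form shown in this issue.

```python
import re

# Matches the placeholder written for an unmapped character, e.g. "(cid:28)".
CID_RE = re.compile(r"\(cid:(\d+)\)")

def cid_ratio(text):
    """Return the fraction of extracted tokens that are (cid:N) placeholders."""
    placeholders = len(CID_RE.findall(text))
    # Count the non-whitespace characters left once placeholders are removed.
    readable = len(re.sub(r"\s", "", CID_RE.sub("", text)))
    total = placeholders + readable
    return placeholders / total if total else 0.0
```

A ratio close to 1.0 suggests that the font has no usable Unicode mapping, so tweaking the extraction code alone is unlikely to help.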

@wanghaisheng

Have you guys figured out how to deal with this?

@macmania314

@lucadealfaro - I would start by digging into the fonts. Try
pdffonts <file_location> and see what it outputs (note that pdffonts uses -f for the first page number, so it should not precede the file path).

@luizvaz

luizvaz commented Nov 13, 2020

I found out how it really works:

https://github.com/adobe-type-tools/cmap-resources#cmap-resources-versus-cmap-tables

CMap resources should not be confused with 'cmap' tables of sfnt-based fonts, such as OpenType and TrueType. While they are functionally similar, in that both unidirectionally map character codes, a 'cmap' table maps them to GIDs (Glyph IDs). For some fonts, such as OpenType fonts that are based on one of these character collections and include every glyph, CIDs can equal GIDs, but it is not guaranteed, thus the importance of the distinction.

Which leads to:
https://docs.microsoft.com/en-us/typography/opentype/spec/cmap#cmap-header

@syka14

syka14 commented Nov 24, 2020

@luizvaz Can you give some pointers on how the issue can be solved?

@luizvaz

luizvaz commented Dec 5, 2020

Facing this problem is really awkward and, at the same time, very common today.
From my findings, it happens as part of an effort to prevent automated import of PDF data.
PDFs created this way do not even allow word search in well-known PDF clients.

The only way I found was to use OCR.
Out of sheer laziness, I converted the PDF pages to images with Poppler's pdftoppm and then used Tesseract OCR to get back to text, which solved my problem.
Poppler can be installed on Windows with Chocolatey.
It worked very well.
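
A minimal sketch of that pipeline in Python, assuming the `pdftoppm` and `tesseract` executables are on the PATH. The function names, the 300 dpi default, and the `page` file prefix are illustrative choices, not something taken from this thread.

```python
import glob
import os
import subprocess

def pdftoppm_cmd(pdf_path, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page>.png
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]

def tesseract_cmd(image_path):
    # Using "stdout" as the output base makes tesseract print the text
    return ["tesseract", image_path, "stdout"]

def ocr_pdf(pdf_path, workdir):
    """Rasterize every page of pdf_path into workdir, then OCR each image."""
    prefix = os.path.join(workdir, "page")
    subprocess.run(pdftoppm_cmd(pdf_path, prefix), check=True)
    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order
    for image in sorted(glob.glob(prefix + "-*.png")):
        result = subprocess.run(
            tesseract_cmd(image), check=True, capture_output=True, text=True
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

For example, `ocr_pdf("sample_pdf_fails_convert.pdf", tempfile.mkdtemp())` would return the recognized text of all pages as one string. The OCR quality depends heavily on the rendering resolution and on the language data installed for Tesseract.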

But you can also go in the opposite direction: use OCR on each glyph ID (GID) and replace every occurrence with the recognised letter.
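
The substitution step of that per-glyph approach could look roughly like the sketch below. It assumes you have already built a table from glyph number to letter (for example by running OCR on each rendered glyph once); the function name `replace_cids` and the table contents are hypothetical, and the sketch works on the (cid:N) placeholders that appear in the extracted text.

```python
import re

# Matches the placeholder written for an unmapped character, e.g. "(cid:28)".
CID_RE = re.compile(r"\(cid:(\d+)\)")

def replace_cids(text, glyph_map):
    """Replace each (cid:N) with glyph_map[N]; unknown numbers stay as-is."""
    def substitute(match):
        return glyph_map.get(int(match.group(1)), match.group(0))
    return CID_RE.sub(substitute, text)

# The mapping below is made up for illustration; it is NOT the real
# mapping of the sample PDF in this issue.
example_map = {28: "H", 18: "i"}
print(replace_cids("(cid:28)(cid:18) (cid:99)", example_map))
```

Leaving unknown placeholders untouched makes it easy to see which glyphs still need to be identified.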

@wj-Mcat

wj-Mcat commented May 26, 2021

@luizvaz It seems this is the only viable solution for this issue.
