Document rendered as (cid:..) sequence #122

lucadealfaro · 2015-09-21T03:56:51Z

The apparently simple file hosted at: https://storage.googleapis.com/lucadealfaro-share/sample_pdf_fails_convert.pdf when parsed and dumped via

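import StringIO
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def get_text(f):
    # f is a PDF file object opened in binary mode.
    mgr = PDFResourceManager()
    s = StringIO.StringIO()
    d = TextConverter(mgr, s)
    interpreter = PDFPageInterpreter(mgr, d)
    pagenos = set()  # an empty set means "process all pages"
    for page in PDFPage.get_pages(f, pagenos):
        interpreter.process_page(page)
    return s.getvalue()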

results in:

(cid:28)(cid:18)(cid:20)(cid:25)(cid:18)(cid:20) ...

Any idea of where to look for the problem?


jrussell999 · 2015-09-21T21:10:35Z

I'd start here:
http://stackoverflow.com/search?q=cid+font
http://stackoverflow.com/search?q=pdfminer
I haven't had to figure out the cid thing myself.

lucanaso · 2015-12-09T15:31:29Z

PDFMiner writes strings of this kind when it is not able to recognise the letter font or encoding.
This check is performed in converter.py, line 106.
In the same file, at line 118, you can find the function that handles the case where the character is not recognised (and writes the (cid:N) string).

Other relevant files for rendering the characters are latin_enc.py, glyphlist.py and encodingdb.py. Here you can follow the whole flow, where PDFMiner uses the cid to identify which character it should render.

Most of the time, when a character is not rendered, there is an issue with the font.

A couple of very useful questions on Stack Overflow (at least for me):
Why character ID 160 is not recognised as Unicode in PDFMiner?
What is this (cid:51) in the output of pdf2txt?
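
To check whether an extracted string is affected, the (cid:N) placeholders can be counted directly. This is only a rough sketch: the helper names `find_cids` and `looks_unmapped` and the 0.5 threshold are my own assumptions and are not part of PDFMiner.

```python
import re

# Matches the placeholder PDFMiner writes for an unmapped character, e.g. "(cid:28)".
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the numeric ids of all (cid:N) placeholders, in order of appearance."""
    return [int(n) for n in CID_TOKEN.findall(text)]


def looks_unmapped(text, threshold=0.5):
    """Guess whether most of the text is (cid:N) placeholders (hypothetical heuristic).

    Each placeholder counts as one glyph; every other non-whitespace
    character counts as one glyph as well.
    """
    n_cids = len(CID_TOKEN.findall(text))
    remainder = CID_TOKEN.sub("", text)
    n_chars = sum(1 for c in remainder if not c.isspace())
    total = n_cids + n_chars
    return total > 0 and float(n_cids) / total >= threshold
```

If `looks_unmapped(get_text(f))` is true, the font most likely lacks a usable mapping to Unicode, and changing converter options will probably not help.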

wanghaisheng · 2016-08-13T08:06:31Z

Have you guys figured out how to deal with this?

macmania314 · 2016-12-17T10:45:42Z

@lucadealfaro - i would start digging into the fonts, try
pdffonts <file_location> and see what it outputs
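
A small sketch of how the pdffonts table could be inspected from Python. It assumes the usual column layout (name, type, encoding, emb, sub, uni, object ID) and font names without spaces. The function names and the "suspect" rule (no ToUnicode map combined with an Identity or Custom encoding) are my own assumptions, not something pdffonts reports itself.

```python
import subprocess


def run_pdffonts(pdf_path):
    """Run Poppler's pdffonts on a file and return its output as text."""
    return subprocess.check_output(["pdffonts", pdf_path]).decode("utf-8", "replace")


def parse_pdffonts(output):
    """Parse the pdffonts table into a list of dicts, one per font."""
    fonts = []
    for line in output.splitlines()[2:]:  # skip the header and the dashed separator
        tokens = line.split()
        if len(tokens) < 8:
            continue  # not a complete font row
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),  # the type may contain spaces, e.g. "CID TrueType"
            "encoding": tokens[-6],
            "emb": tokens[-5],
            "sub": tokens[-4],
            "uni": tokens[-3],  # "yes" if the font has a ToUnicode map
        })
    return fonts


def suspect_fonts(output):
    """Names of fonts likely to produce (cid:N) output (hypothetical heuristic)."""
    return [
        f["name"]
        for f in parse_pdffonts(output)
        if f["uni"] == "no"
        and (f["encoding"].startswith("Identity") or f["encoding"] == "Custom")
    ]
```

Calling `suspect_fonts(run_pdffonts(path))` then lists the fonts worth looking at first.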

luizvaz · 2020-11-13T01:48:06Z

I found out how it really works:

https://github.com/adobe-type-tools/cmap-resources#cmap-resources-versus-cmap-tables

CMap resources should not be confused with 'cmap' tables of sfnt-based fonts, such as OpenType and TrueType. While they are functionally similar, in that both unidirectionally map character codes, a 'cmap' table maps them to GIDs (Glyph IDs). For some fonts, such as OpenType fonts that are based on one of these character collections and include every glyph, CIDs can equal GIDs, but it is not guaranteed, thus the importance of the distinction.

Which leads to:
https://docs.microsoft.com/en-us/typography/opentype/spec/cmap#cmap-header

syka14 · 2020-11-24T17:34:22Z

@luizvaz Can you give some pointers to how the issue can be solved?

luizvaz · 2020-12-05T01:19:33Z

This problem is really awkward and, at the same time, very common today.
From my findings, it happens as an effort to prevent automated import of PDF data.
PDFs created this way do not even allow text search in well-known PDF clients.

The only way I found was using OCR.
Out of laziness, I converted the PDF pages to images with Poppler's pdftoppm and then used Tesseract OCR to get the text back, which solved my problem.
Poppler can be installed on Windows with Chocolatey.
It worked very well.
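
Roughly, the workaround can be scripted like this. It is a minimal sketch, assuming the `pdftoppm` and `tesseract` executables are on the PATH; the function names, the 300 dpi resolution and the `eng` language are my own choices for illustration, not the exact commands I ran.

```python
import glob
import os
import subprocess
import tempfile


def pdftoppm_cmd(pdf_path, out_prefix, dpi=300):
    """Command that renders every page of the PDF to <out_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def tesseract_cmd(image_path, lang="eng"):
    """Command that OCRs one image and prints the recognised text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Render the PDF to images, OCR each page, and return the joined text."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "page")
        subprocess.run(pdftoppm_cmd(pdf_path, prefix, dpi), check=True)
        pages = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(glob.glob(prefix + "-*.png")):
            result = subprocess.run(
                tesseract_cmd(image, lang),
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)
```

Higher resolutions usually help recognition but make the conversion slower.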

But you can also go in the opposite direction: use OCR on each glyph ID (GID) and replace every occurrence with the recognised letter.

wj-Mcat · 2021-05-26T06:50:50Z

@luizvaz it seems to be the only viable solution for this issue.

lucanaso mentioned this issue Dec 9, 2015

pdf2txt.py get (cid:%d) unknown char #102

Open

jsvine mentioned this issue Mar 20, 2017

text': '(cid:0) instead of character jsvine/pdfplumber#29

Closed

samkit-jain mentioned this issue Nov 30, 2019

When I use extract PDF content, it's all CID: XXXX jsvine/pdfplumber#159

Closed

Some1Somewhere mentioned this issue Oct 29, 2024

Text extraction issue with extract_text_to_fp - Uncleaned CID characters pdfminer/pdfminer.six#1056

Open
