Incorrect extraction in tables with overlapping columns #912
Thanks for opening this issue, @gnadlr; and thanks for your contributions to the related discussion and other recent ones, @cmdlineluser! Some observations:
> You can check out these 2 sample PDFs. Issue with `extract_tables()`: text mingling; it is very apparent in the 2nd PDF. Issue with `extract_words()`: it does not recognize the space between the last word of a column and the first word of the next column; this is very apparent in the 2nd PDF (columns overlap) and also in the 1st PDF (columns do not overlap, but words of the 2 columns are very close).
Apologies for any confusion regarding this, @jsvine.
Using row 1 from your updated page 2 sample as an example:

```python
page2.search('Desloratadin.*')[0]['text']
page2.search('Desloratadin.*', use_text_flow=True)[0]['text']
page2.search('Desloratadin.*', use_text_flow=True, keep_blank_chars=True)[0]['text']
```
I'm not sure if all the text there is now correct? There are still spacing issues. It does appear you can pass the table-level text settings:

```python
pd.DataFrame(page2.extract_table()).iloc[[1], :10]
#    0  1             2     3                4           5            6            7             8                  9
# 1  1  Desloratadin  Uống  0,5mg/ml x 60 m  lDestacure  VN-16773-13  VN-16773-13  Gracure Phar  mIancdeiuatical L  tHd  ộp 1 lọ 60 ml

pd.DataFrame(page2.extract_table({"text_use_text_flow": True})).iloc[[1], :10]
#    0  1             2     3                4           5            6            7             8                  9
# 1  1  Desloratadin  Uống  0,5mg/ml x 60 m  lDestacure  VN-16773-13  VN-16773-13  Gracure Phar  maceutical LIndia  td   Hộp 1 lọ 60 ml
```

There does seem to be something else going on though.
@cmdlineluser is on point. Here are some more comparisons so it is easier to see the issue, against the expected result:

- `extract_tables()`: text is split where columns overlap; the split text merges with the next column (`lDestacure`) and mingles (`Gracure Pharmaceutical Ltd` becomes 3 columns: `Gracure Phar | mIancdeiuatical L | tHd`).
- `extract_tables({"text_use_text_flow": True, "text_keep_blank_chars": True})`: same split behavior (`lDestacure`). Text mingling is better but still not correct (`Gracure Pharmaceutical Ltd` becomes 3 columns: `Gracure Phar | maceutical LIndia | td`).
- `extract_words(use_text_flow=True, keep_blank_chars=True)`: all text is correct (no mingling), but several spaces are not detected, so it is not possible to parse the output into tables (spaces are missed when the text of adjacent columns is too close or when columns overlap).
I was able to extract individual characters with their coordinates using `extract_text_lines()`. Now the only thing left to do is to parse them into columns.
As noted in #912, `use_text_flow` was not being handled consistently, as characters and words were being re-sorted without checking first if this parameter was set to `True`.
FYI v0.10.0, now available, contains this fix. Hopefully it helps with this issue more broadly. I'll be eager to know what you think.
Thank you very much for the fix. Space is now correctly detected when the text of 2 columns physically overlaps. However, space is still not detected when the text of 2 columns is very close but does not overlap.

Using `extract_words`, this comes out as a single word, "mlDestacure" (after the "m" is an "l", but it is hidden and occupies the empty space before the "D"). It should have been "ml" and "Destacure" separately.

Because of the above, my current best option is to extract individual characters. But I'm not sure how to fix this based on character coordinates alone, since the coordinates in this case are close and continuous, very similar to any standalone word. My idea is that pdfplumber could detect which characters are "visible" and which are "hidden"; then I could write a parser to split words when this attribute changes from "hidden" to "visible". If you have a better idea, please let me know. Hope it makes sense. Thank you.
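(Editorial aside: one way to approximate that "hidden/visible" split with coordinates alone is to break words where a character starts before the previous one ends, since a hidden character tends to sit under its neighbor. This is a hypothetical heuristic sketch operating on plain dicts shaped like pdfplumber's `page.chars`; the `split_on_overlap` helper and its tolerance are not part of pdfplumber.)

```python
# Sketch: split a run of characters into words wherever the next character
# horizontally overlaps the previous one by more than a small tolerance
# (normal kerning produces tiny overlaps, a hidden character a large one).
# Each dict mimics pdfplumber's page.chars entries: "text", "x0", "x1".

def split_on_overlap(chars, overlap_tol=0.5):
    words, current = [], []
    prev_x1 = None
    for ch in chars:
        # Start a new word when this char begins well before the previous one ended.
        if prev_x1 is not None and ch["x0"] < prev_x1 - overlap_tol:
            words.append("".join(current))
            current = []
        current.append(ch["text"])
        prev_x1 = ch["x1"]
    if current:
        words.append("".join(current))
    return words

# Coordinates loosely based on the trace later in this thread: the final "l"
# of "ml" overlaps the "D" of "Destacure".
chars = [
    {"text": "m", "x0": 208.20, "x1": 212.86},
    {"text": "l", "x0": 212.64, "x1": 214.30},  # hidden "l" under the gap
    {"text": "D", "x0": 213.26, "x1": 217.59},  # starts well before the "l" ends
    {"text": "e", "x0": 217.59, "x1": 220.25},
]
print(split_on_overlap(chars))  # ['ml', 'De']
```

The tolerance would need tuning per font size, since kerning pairs routinely overlap slightly.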
I wonder why that is. Trying all the various tools/libraries for extracting text, they all seem to extract it the same way. In looking for debugging options, I found the following trace output for this region of the page:

<fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 595.32">
<span font="Times New Roman" wmode="0" bidi="0" trm="6 0 0 6">
<g unicode="0" glyph="zero" x="172.1" y="479.74" adv=".5"/>
<g unicode="," glyph="comma" x="175.1" y="479.74" adv=".25"/>
<g unicode="5" glyph="five" x="176.654" y="479.74" adv=".5"/>
<g unicode="m" glyph="m" x="179.654" y="479.74" adv=".778"/>
<g unicode="g" glyph="g" x="184.09401" y="479.74" adv=".5"/>
<g unicode="/" glyph="slash" x="187.09401" y="479.74" adv=".278"/>
<g unicode="m" glyph="m" x="188.76201" y="479.74" adv=".778"/>
<g unicode="l" glyph="l" x="193.214" y="479.74" adv=".278"/>
<g unicode=" " glyph="space" x="194.654" y="479.74" adv=".25"/>
<g unicode="x" glyph="x" x="196.20801" y="479.74" adv=".5"/>
<g unicode=" " glyph="space" x="199.08802" y="479.74" adv=".25"/>
<g unicode="6" glyph="six" x="200.64202" y="479.74" adv=".5"/>
<g unicode="0" glyph="zero" x="203.64202" y="479.74" adv=".5"/>
<g unicode=" " glyph="space" x="206.64202" y="479.74" adv=".25"/>
<g unicode="m" glyph="m" x="208.19602" y="479.74" adv=".778"/>
<g unicode="l" glyph="l" x="212.63602" y="479.74" adv=".278"/>
</span>
</fill_text>
<pop_clip/>
<end_layer/>
<layer name="P"/>
<clip_path winding="eofill" transform="1 0 0 -1 0 595.32">
<moveto x="51.96" y="59.28"/>
<lineto x="782.62" y="59.28"/>
<lineto x="782.62" y="540.6"/>
<lineto x="51.96" y="540.6"/>
<closepath/>
</clip_path>
<fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 595.32">
<span font="Times New Roman" wmode="0" bidi="0" trm="6 0 0 6">
<g unicode="D" glyph="D" x="213.26" y="479.74" adv=".722"/>
<g unicode="e" glyph="e" x="217.592" y="479.74" adv=".444"/>
<g unicode="s" glyph="s" x="220.22" y="479.74" adv=".389"/>
<g unicode="t" glyph="t" x="222.5" y="479.74" adv=".278"/>
<g unicode="a" glyph="a" x="224.294" y="479.74" adv=".444"/>
<g unicode="c" glyph="c" x="226.934" y="479.74" adv=".444"/>
<g unicode="u" glyph="u" x="229.574" y="479.74" adv=".5"/>
<g unicode="r" glyph="r" x="232.574" y="479.74" adv=".333"/>
<g unicode="e" glyph="e" x="234.608" y="479.74" adv=".444"/>
</span>
</fill_text>
<pop_clip/>
<end_layer/> The However, $ mutool convert -O preserve-spans -o 2a.txt Downloads/2a.pdf
$ grep -C 3 Dest 2a.txt
Desloratadin
Uống
0,5mg/ml x 60 ml
Destacure
VN-16773-13
VN-16773-13
Gracure Pharmaceutical Ltd

From what I can find, the clipping commands are currently no-ops in pdfminer: pdfminer/pdfminer.six#414. I'm not sure if this is something that would need to be supported in order for pdfplumber to be able to handle this?
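(Editorial aside: if pdfminer ever did expose clip rectangles, one could imagine dropping characters whose boxes fall mostly outside the active clip before word-grouping. A minimal sketch of that idea; the `visible_chars` helper and the clip coordinates are hypothetical, and the vertical check is omitted for brevity.)

```python
# Sketch: filter out characters that lie mostly outside a clip rectangle.
# `clip_box` is a hypothetical (x0, top, x1, bottom) rectangle; pdfminer does
# not currently expose clip paths, which is the blocker discussed above.
# Only horizontal containment is checked here; vertical would be analogous.

def visible_chars(chars, clip_box, min_overlap=0.5):
    cx0, ctop, cx1, cbottom = clip_box
    out = []
    for ch in chars:
        # Horizontal overlap between the char box and the clip box.
        overlap = min(ch["x1"], cx1) - max(ch["x0"], cx0)
        width = ch["x1"] - ch["x0"]
        # Keep the char only if at least half of it is inside the clip.
        if width > 0 and overlap / width >= min_overlap:
            out.append(ch)
    return out

chars = [
    {"text": "m", "x0": 208.2, "x1": 212.9},
    {"text": "l", "x0": 212.6, "x1": 214.3},  # mostly outside the clip
]
# Suppose the active clip for this span ended at x = 213 (invented value):
kept = visible_chars(chars, clip_box=(51.96, 59.28, 213.0, 540.6))
print("".join(c["text"] for c in kept))  # "m"
```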
Thanks for this extra context, @cmdlineluser, and for flagging the pdfminer no-op. Unfortunately, that no-op blocks pdfplumber from making use of clipping paths, so I am not sure we can do much with this here. I keep a fairly close eye on pdfminer.six releases; if/when a future release includes clipping path information, I'll aim to incorporate it.
It's possible I am the one misunderstanding things, or using the wrong terminology, @jsvine. I was just wondering how come it doesn't extract as two separate words.
Ah, I see; this is a good motivation for me to write more comprehensive documentation about how word segmentation works in pdfplumber. Until then:
Because the trailing "l" and the "D" of "Destacure" overlap horizontally, the gap between them never exceeds `x_tolerance`, so they are grouped into the same word. But because the characters already arrive in reading order here, `use_text_flow` does not change that grouping. Note: technically, both the horizontal and vertical criteria are tested when characters are grouped into words.
This is exactly what I was trying to say (though one detail there should be corrected). If detection of clipping is not possible, another idea is to check character spacing: with a certain font and font size, the spacing between 2 specific characters should be consistent (in theory). For example, in our sample the gap between the hidden "l" and the "D" would not match the normal spacing for that pair.

However, I haven't figured out the rule of spacing in PDF files (sometimes characters even have negative spacing). Since these PDF files have a consistent font and font size, if I can figure out the spacing rule, I can write a parser that checks each individual gap to see whether two characters are part of the same word.

Note: by "spacing" I mean the horizontal distance between where one character ends and the next begins.
Thanks! Updated the comment to fix that.
Yes, I think the difficulty here is the "(in theory)" part. In practice, I think we'd see a lot of unexpected violations of this theory - enough that it'd create a whole class of edge cases perhaps more common than the thing it's trying to fix. That said, I'm quite open to being persuaded otherwise with examples and testing!
Ah, I see - thanks for the explanation, @jsvine. I noticed from the trace output that the two overlapping spans sit in separate layers. (I'm not sure if that is just a property of this particular PDF?) From some poking around, it looks like these are marked-content sections. Adding in some debug prints, I get:

```
[LAYER]
[SHOW_TEXT] seq=[b'0,', -9, b'5m', 38, b'g/m', 36, b'l', 38, b' ', -9, b'x', 20, b' ', -9, b'60 ', -9, b'm', 38, b'l']
[LAYER]
[SHOW_TEXT] seq=[b'De', 6, b's', 9, b't', -21, b'a', 4, b'c', 4, b'ur', -6, b'e']
```

Perhaps you know if it's somehow possible to use this layer information to help with this?
Really interesting, thanks for sharing @cmdlineluser. I think you're right about those layers being created by marked-content commands. As it happens, @dhdaines is doing some experimentation with extracting those sections in #937. Forcing a word-split when crossing in/out of a marked content section makes sense; certainly something worth trying out if we're able to merge that info, perhaps as an option that defaults to off. Unfortunately, getting access to that information from pdfminer is not straightforward today.
Ironically, I have a similar problem, where a space character appears for unknown reasons just above a line of text and causes a word break due to the sorting of characters - in this PDF I get "63 5" instead of "635" at the bottom of the page. The solution is either to use `use_text_flow=True` or to filter out the stray space character.
This can be problematic because marked content section boundaries can show up just about anywhere - take this PDF for example. Running:

```python
import sys

import pdfplumber

pdf = pdfplumber.open(sys.argv[1])
page = pdf.pages[0]
for word in page.extract_words(extra_attrs=["mcid"]):
    print(word["mcid"], word["text"])
```

you will see that basically every word has its own MCID, but also that many words are split into multiple marked content sections.
I already do this in #937 ;-) - there is really no other option, particularly since pdfminer doesn't expose that information otherwise. It kind of seems to me like the subset of functionality in pdfminer that pdfplumber actually needs is fairly small.
As for switching to a different library ... there doesn't seem to exist one that has everything needed here. Maybe pdf-rs could be interesting in the future ... binding Python to Rust is relatively painless.
Actually - sorry for the spam here ... but in this case the MCIDs correspond to inline Span elements in the structure tree, so they should be expected not to force word breaks.
So basically, no - we should not put word breaks at marked content section boundaries unless we know that they are block elements.
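(Editorial aside: the rule being proposed can be illustrated with a toy decision helper that splits only when the boundary belongs to a block-level structure element. The `BLOCK_TAGS` set and the `mcid_to_tag` mapping are hypothetical stand-ins for real structure-tree lookups, not pdfplumber API.)

```python
# Sketch: decide whether an MCID boundary should force a word break.
# Only block-level structure elements (paragraphs, headings, cells, ...)
# should break words; inline Spans should not. Tag names follow PDF's
# standard structure types, but the lookup table is invented for the example.

BLOCK_TAGS = {"P", "H1", "H2", "TD", "TH", "LI", "Div"}

def breaks_word(prev_mcid, next_mcid, mcid_to_tag):
    """True if crossing from prev_mcid to next_mcid should split a word."""
    if prev_mcid == next_mcid:
        return False
    prev_tag = mcid_to_tag.get(prev_mcid)
    next_tag = mcid_to_tag.get(next_mcid)
    # Split only when either side of the boundary is a block element.
    return prev_tag in BLOCK_TAGS or next_tag in BLOCK_TAGS

# Inline spans inside one paragraph: no break, even though the MCID changes.
tags = {0: "Span", 1: "Span", 2: "P"}
print(breaks_word(0, 1, tags))  # False
print(breaks_word(1, 2, tags))  # True
```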
Thanks for the notes, @dhdaines. Thoughts/responses below:
Ah, very interesting, thanks. I wouldn't want MCIDs to be incorporated into word-splitting by default, but it might be a nice option to have available.
Well... it's only sort of monkey-patching, since the relevant pdfminer classes are designed to be subclassed and extended.
Ah, yes, indeed. The Python bindings won't let you do this, but it is easy to call the underlying C API, which, at least for text, seems to give you everything you need to get individual characters and all their attributes: https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_text.h. You can see how to do this in the pypdfium2 documentation, as well as in my code to read the structure tree. I shouldn't let my allergy to Google-origin software cloud my judgement here :) and anyway, PDFium wasn't originally created by Google and doesn't seem to have been infected by their software engineering practices and tools (monorepo, bazel, abseil, and that whole bestiary).
Hrm, seems I'm back here again. I've run into the issue of text made invisible by setting a clipping path and will make a PR to pdfminer to support some common cases. |
(also, pdfminer is being maintained again! hooray!)
This is a continuation of a discussion posted here; please check it for more info.
Describe the bug
When the PDF has overlapping columns (i.e. the columns do not wrap text), all extraction methods (`extract_tables`, `extract_text`, `extract_words`) give incorrect results.
Original text
'aaaa b|bbb' and '1111' (the | is the separator line between the columns)
Expected behavior
'aaaa bbbb' and '1111'
Actual behavior
'aaaa b' and 'b1b1b11' when using extract_tables() or extract_text()
'aaaa' and 'bbbb1111' when using extract_words(use_text_flow=True)
Sample pdf
https://github.com/jsvine/pdfplumber/files/11782271/sample.2.pdf