-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16 (issue 18099) #18390
Conversation
Could you add a unit test ? something similar to: Lines 3202 to 3216 in 790470c
You can just test that the retrieved text starts with Yýsledková listina přijímacích zkoušek .
|
OK, I will. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When addressing the comments, please remember to squash the commits.
OK, I made the necessary changes and squashed the commits. |
The CI failed on linting. You can automatically fix that by running |
/botio test |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @calixteman received. Current queue size: 0 Live output at: http://54.241.84.105:8877/c8ea0e506839555/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_test from @calixteman received. Current queue size: 0 Live output at: http://54.193.163.58:8877/db23d9ada3a63cd/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.241.84.105:8877/c8ea0e506839555/output.txt Total script time: 28.99 mins
Image differences available at: http://54.241.84.105:8877/c8ea0e506839555/reftest-analyzer.html#web=eq.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please improve the commit message a bit, such that all relevant information is available on the Git command line as well without having to read GitHub?
Generally the first line of the commit message should contain a summary of the changes, with a reference to the bug/issue, and then any other relevant details below. You included a bunch of nice context in #18390 (comment) that belongs in the commit message too. My suggestion would be something like this:
Handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16 (issue 18099)
In the PDF in question, the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding unicode code points. The syntax to map a range of char codes to a range of unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the unicode code points are supposed to be given in UTF 16, the PDF's line SHOULD have probably read
<00> <7F> <0000>
Instead it omitted two leading zeros from the UTF 16 like this
<00> <7F> <00>
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding high bytes ( 01 becomes \u0100, 02 becomes \u0200) which ended up turning latin text in the PDF into chinese when it was copied.
I'm not sure if the specification actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, PDF.js should probably support this.
From: Bot.io (Windows)FailedFull output at http://54.193.163.58:8877/db23d9ada3a63cd/output.txt Total script time: 42.46 mins
Image differences available at: http://54.193.163.58:8877/db23d9ada3a63cd/reftest-analyzer.html#web=eq.log |
…(issue 18099) Add unit test to check compatability with such cmaps In the PDF in issue 18099. the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding code points. The syntax to map a range of char codes to a range of unicode code points is <start_char_code> <end_char_code> <start_unicode_codepoint> As the unicode code points are supposed to be given in UTF-16 BE, the PDF's line SHOULD have probably read <00> <7F> <0000> Instead it omitted two leading zeros from the UTF-16 like this <00> <7F> <00> This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding HIGH bytes (01 became \u0100, 02 became \u0200, et cetera), which ended up turning latin text in the PDF into chinese when it was copied I'm not sure if the PDF spec actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, PDF.js should probably support this
@Snuffleupagus I changed the commit message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r=me, thank you!
Thank YOU for all your help! |
Modifies partialEvaluator.readToUnicode() to handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16, as seen in the PDF in [https://github.com//issues/18099](issue 18099).
In the PDF in question, the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding unicode code points. The syntax to map a range of char codes to a range of unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the unicode code points are supposed to be given in UTF 16, the PDF's line SHOULD have probably read
<00> <7F> <0000>
Instead it omitted two leading zeros from the UTF 16 like this
<00> <7F> <00>
This confused pdf.js into mapping these character codes to the UTF-16 characters with the corresponding high bytes ( 01 becomes \u0100, 02 becomes \u0200) which ended up turning latin text in the PDF into chinese when it was copied.
I'm not sure if the specification actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, pdf.js should probably support this.