Unit test for extraction of Thai text from PDF broken? #1098

vk-github18 · 2024-03-10T12:20:15Z

The following test checks the extraction of Thai text from a PDF file generated with LayoutProcessor:
src/test/java/com/lowagie/text/pdf/TextExtractTest.java textCreateAndExtractTest2()
The expected text in the test differs from the input text: it contains additional spaces.

With the proposed change in https://github.com/vk-github18/OpenPDF-vk2 (Change24),
the test fails because the generated string contains fewer spaces.

In detail:

TestText กขน้ำตา ญูญูิ่ ก้กิ้
Expected ก ข น ํ้ า ต า ญูญูิ่ ก้กิ้
Change24 กขน ํ้ าตา ญูญูิ่ ก้กิ้

TestText 0e 01 0e 02 0e 19 0e 49 0e 33 0e 15 0e 32 00 20 0e 0d 0e 39 0e 0d 0e 39 0e 34 0e 48 00 20 0e 01 0e 49 0e 01 0e 34 0e 49 
Expected 0e 01 00 20 0e 02 00 20 0e 19 00 20 0e 4d 0e 49 00 20 0e 32 00 20 0e 15 00 20 0e 32 00 20 0e 0d 0e 39 0e 0d 0e 39 0e 34 0e 48 00 20 0e 01 0e 49 0e 01 0e 34 0e 49 
Change24 0e 01 0e 02 0e 19 00 20 0e 4d 0e 49 00 20 0e 32 0e 15 0e 32 00 20 0e 0d 0e 39 0e 0d 0e 39 0e 34 0e 48 00 20 0e 01 0e 49 0e 01 0e 34 0e 49
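For anyone who wants to reproduce the comparison, here is a minimal sketch of how such dumps can be produced and how the two extracted strings can be compared with spaces ignored. The class name `ThaiHexDump` and its helper methods are hypothetical and not part of OpenPDF; the two string constants are transcribed from the Expected and Change24 hex lines above.

```java
public class ThaiHexDump {
    // Transcribed from the "Expected" and "Change24" hex dumps in this issue.
    static final String EXPECTED =
            "\u0e01 \u0e02 \u0e19 \u0e4d\u0e49 \u0e32 \u0e15 \u0e32 "
            + "\u0e0d\u0e39\u0e0d\u0e39\u0e34\u0e48 \u0e01\u0e49\u0e01\u0e34\u0e49";
    static final String CHANGE24 =
            "\u0e01\u0e02\u0e19 \u0e4d\u0e49 \u0e32\u0e15\u0e32 "
            + "\u0e0d\u0e39\u0e0d\u0e39\u0e34\u0e48 \u0e01\u0e49\u0e01\u0e34\u0e49";

    /** Formats every UTF-16 code unit as two hex bytes, high byte first. */
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x %02x", c >> 8, c & 0xff));
        }
        return sb.toString();
    }

    /** Removes all U+0020 spaces so that only the Thai characters remain. */
    static String withoutSpaces(String s) {
        return s.replace(" ", "");
    }

    public static void main(String[] args) {
        System.out.println("Expected " + toHex(EXPECTED));
        System.out.println("Change24 " + toHex(CHANGE24));
        // Do the two strings differ only in where the spaces are placed?
        System.out.println(withoutSpaces(EXPECTED).equals(withoutSpaces(CHANGE24)));
    }
}
```

With the strings as transcribed here, the space-insensitive comparison succeeds, so Expected and Change24 contain the same Thai characters in the same order and differ only in the number and position of spaces.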

Could an expert in Thai please review whether the extracted string in Change24 is correct and clarify the significance of the spaces in this example?
Is it acceptable to change the test so that it expects the string from Change24?

vk-github18 · 2024-03-10T14:15:00Z

@forfin Could you please look at this issue?

asturio · 2024-03-10T14:56:57Z

@vk-github18, nice find. Tests that use non-Latin scripts are quite difficult to check, so any contribution in this area is always a help. This also applies to Chinese, Japanese, Arabic, Hindi and other languages.

vk-github18 added the bug label Mar 10, 2024

vk-github18 mentioned this issue Mar 10, 2024

LayoutProcessor with inline images generates incorrect text offsets #1051

Closed

vk-github18 mentioned this issue Mar 24, 2024

Use operators Ts and TJ for glyph layout. Some refactorings. #1114

Merged

asturio linked a pull request Mar 27, 2024 that will close this issue

Use operators Ts and TJ for glyph layout. Some refactorings. #1114

Merged

asturio closed this as completed in #1114 Mar 27, 2024
