Unit test for extraction of Thai text from PDF broken? #1098

Closed
vk-github18 opened this issue Mar 10, 2024 · 2 comments · Fixed by #1114

@vk-github18
Contributor

vk-github18 commented Mar 10, 2024

The following test checks the extraction of Thai text from a PDF file generated with LayoutProcessor:
src/test/java/com/lowagie/text/pdf/TextExtractTest.java textCreateAndExtractTest2()
The expected text differs from the input text and contains additional spaces.

With the proposed change in https://github.com/vk-github18/OpenPDF-vk2 (Change24)
the test fails because the generated string contains fewer spaces.

In detail:

TestText กขน้ำตา ญูญูิ่ ก้กิ้
Expected ก ข น ํ้ า ต า ญูญูิ่ ก้กิ้
Change24 กขน ํ้ าตา ญูญูิ่ ก้กิ้

TestText 0e 01 0e 02 0e 19 0e 49 0e 33 0e 15 0e 32 00 20 0e 0d 0e 39 0e 0d 0e 39 0e 34 0e 48 00 20 0e 01 0e 49 0e 01 0e 34 0e 49 
Expected 0e 01 00 20 0e 02 00 20 0e 19 00 20 0e 4d 0e 49 00 20 0e 32 00 20 0e 15 00 20 0e 32 00 20 0e 0d 0e 39 0e 0d 0e 39 0e 34 0e 48 00 20 0e 01 0e 49 0e 01 0e 34 0e 49 
Change24 0e 01 0e 02 0e 19 00 20 0e 4d 0e 49 00 20 0e 32 0e 15 0e 32 00 20 0e 0d 0e 39 0e 0d 0e 39 0e 34 0e 48 00 20 0e 01 0e 49 0e 01 0e 34 0e 49 
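
For anyone who wants to reproduce the hex rows above, here is a minimal sketch, assuming the dumps are the UTF-16BE bytes of each string. The class name `ThaiHexDump` and both helper methods are hypothetical and not part of OpenPDF. It also compares the first word of the Expected and Change24 strings with U+0020 removed, using the code points listed in the dumps.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of OpenPDF: dumps strings the way the rows above appear to be dumped.
class ThaiHexDump {

    // Render every UTF-16BE byte of the string as two lowercase hex digits, separated by spaces.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16BE)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Remove only the plain space character U+0020; all other code points are kept.
    static String withoutSpaces(String s) {
        return s.replace(" ", "");
    }

    public static void main(String[] args) {
        // First word of each extracted string, copied from the hex rows in this issue.
        String expected = "\u0e01 \u0e02 \u0e19 \u0e4d\u0e49 \u0e32 \u0e15 \u0e32";
        String change24 = "\u0e01\u0e02\u0e19 \u0e4d\u0e49 \u0e32\u0e15\u0e32";

        System.out.println("Expected " + toHex(expected));
        System.out.println("Change24 " + toHex(change24));

        // Checks whether the two strings differ only in the placement of spaces.
        System.out.println(withoutSpaces(expected).equals(withoutSpaces(change24)));
    }
}
```

A comparison like this only isolates the whitespace difference between the two extracted strings. It does not address the other difference visible in the dumps, where the input contains `0e 49 0e 33` and both extracted strings contain `0e 4d 0e 49 ... 0e 32`.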

Could a Thai expert please review whether the string extracted in Change24 is correct and clarify the significance of the spaces in this example?
Is it acceptable to change the test so that it expects the string from Change24?

@vk-github18
Contributor Author

@forfin Could you please look at this issue?

@asturio
Member

asturio commented Mar 10, 2024

@vk-github18, nice find. Tests that use non-Latin scripts are quite difficult to check, so any contribution on that point is always a help. This also applies to Chinese, Japanese, Arabic, Hindi and other languages.
