Rebuild ImFontAtlas::GetGlyphRangesJapanese offset table #3627

vaiorabbit · 2020-11-28T12:51:30Z

Hello!

This PR makes ImFontAtlas::GetGlyphRangesJapanese() support more Japanese characters (the kanji defined by the Government of Japan) out of the box:

- 2136 Joyo (meaning "for regular use" or "for common use") characters
- 863 Jinmeiyo (meaning "for personal name") characters

What this PR does

Commit 0e6b84c rebuilds the internal offset table in ImFontAtlas::GetGlyphRangesJapanese().
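To illustrate what an "offset table" means here, the following is a minimal sketch of how a table of accumulative offsets can be expanded back into code points. It assumes each entry stores the distance from the previous code point, starting from a base of 0x4E00. The function name UnpackAccumulativeOffsets is hypothetical and is not the ImGui implementation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Dear ImGui): expands accumulative offsets
// into absolute code points. Each entry is the distance from the previous
// code point; the first entry is the distance from `base`.
std::vector<std::uint32_t> UnpackAccumulativeOffsets(std::uint32_t base,
                                                     const std::vector<short>& offsets)
{
    std::vector<std::uint32_t> code_points;
    code_points.reserve(offsets.size());
    std::uint32_t current = base;
    for (short offset : offsets)
    {
        current += static_cast<std::uint32_t>(offset); // accumulate the distance
        code_points.push_back(current);
    }
    return code_points;
}
```

For example, the illustrative offsets {0, 1, 2, 4} with base 0x4E00 expand to 0x4E00, 0x4E01, 0x4E03 and 0x4E07. Storing small distances instead of full code points is what lets the table use the short type.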

Source of the offset table

As a reliable source of this offset table, I chose the character information database of the Information-technology Promotion Agency (IPA, an administrative entity of Japan).

IPA provides a REST API to access its database: https://mojikiban.ipa.go.jp/mji/ .
The information acquired from the database is freely available under the terms of the Creative Commons Attribution-ShareAlike 2.1 Japan license (CC BY-SA 2.1 JP).

Supplemental scripts

I made a repository, https://github.com/vaiorabbit/everyday_use_kanji , that contains several Ruby scripts to

- query the IPA database, and
- generate a GetGlyphRangesJapanese() implementation (e.g. https://github.com/vaiorabbit/everyday_use_kanji/blob/master/imgui/GetGlyphRangesJapanese.cpp ).

These scripts will be useful when we want to keep GetGlyphRangesJapanese() up to date in the future.

Motivation

The current GetGlyphRangesJapanese() implementation supports 1946 characters, but this is not enough to cover the 2136 Joyo (common-use) characters and 863 Jinmeiyo (for personal names) characters defined by the Government of Japan.

So we often see garbled characters in relatively simple Japanese sentences and in people's names
(shown as the fallback replacement character "?" in this screenshot).

Though GetGlyphRangesChineseFull() is sometimes recommended as a replacement,

- it tends to produce a larger texture than GetGlyphRangesJapanese(). Though it depends on the configuration, GetGlyphRangesChineseFull() produces a 4096 x 4096 font texture internally, which is quite large compared to the 1024 x 2048 texture produced by GetGlyphRangesJapanese().
  - Font texture (GetGlyphRangesChineseFull) displayed in RenderDoc
  - Font texture (GetGlyphRangesJapanese) displayed in RenderDoc
- it still fails to display several Joyo characters.
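To put the two texture sizes quoted above into perspective, here is a small arithmetic sketch. The helper TextureBytes is hypothetical, and the bytes-per-pixel values (1 for an alpha-only texture, 4 for RGBA) are assumptions about the pixel format rather than measurements from this PR.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed for a width x height texture
// with the given number of bytes per pixel.
constexpr std::uint64_t TextureBytes(std::uint64_t width, std::uint64_t height,
                                     std::uint64_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

// Assumed alpha-only format (1 byte per pixel).
constexpr std::uint64_t kChineseFullAlpha8 = TextureBytes(4096, 4096, 1); // 16 MiB
constexpr std::uint64_t kJapaneseAlpha8    = TextureBytes(1024, 2048, 1); //  2 MiB

// The 4096 x 4096 texture has 8 times as many pixels as the 1024 x 2048 one.
static_assert(kChineseFullAlpha8 == 8 * kJapaneseAlpha8, "expected an 8x ratio");
```

With an assumed 4 bytes per pixel, the same sizes come to 64 MiB and 8 MiB respectively.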

There is another alternative called GetGlyphRangesChineseSimplifiedCommon() that supports 2500 characters,

- but it covers ranges that Japanese text does not use, which results in more garbled characters.

I thought it would be easy and reasonable to rebuild the internal tables in GetGlyphRangesJapanese() to support Japanese characters defined by the government.

Limitations

What you are about to read below is a topic that is difficult even for Japanese people, but I will try to explain it.

In short:

- In the current Joyo kanji table, there is only one character whose code point cannot be represented in a 2-byte variable.
- To avoid/alleviate the problem, I made a tweak that most Japanese readers would not notice.
- For those who want to handle this character correctly, IMGUI_USE_WCHAR32 and ImFontGlyphRangesBuilder easily solve the problem.

Limitation and workaround due to the code point of "𠮟"

In a commit in the previous similar PR ( #1650 ), we can see a line that says:

// FIXME: We lost U+20B9F because it's out of range.

This means the character corresponding to the code point 0x20B9F (== 134047) exceeds the range of a 2-byte variable (short or ImWchar16), so it cannot be displayed.

The actual character is "𠮟" (scold, rebuke or reprimand, etc.).

𠮟 (code point 0x20B9F (== 134047))
- is encoded as F0 A0 AE 9F in UTF-8
- was added to the Joyo kanji in 2010
- is the only one of the 2136 Joyo characters whose code point requires more than 2 bytes

"𠮟" can still cause garbled text. When we try to type "𠮟" on Windows, Microsoft's standard Japanese IME displays the notice "環境依存" (environment-dependent), which means "this character may be garbled because some environments cannot handle this character code".

So this character is often substituted with the variant character "叱" (U+53F1).

叱 (code point 0x53F1 (== 21489))
- is encoded as E5 8F B1 in UTF-8
- is the traditional form of 「𠮟」
- means "scold, rebuke or reprimand", etc., so the only difference between the two kanji is in design
- has a code point that fits in a 2-byte variable (short or ImWchar16)
- had been in use for a long time before the modern form 「𠮟」 was added in 2010, and is still used
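The UTF-8 byte sequences listed for the two characters can be reproduced with a small encoder. This is a minimal sketch; EncodeUtf8 is a hypothetical helper written for illustration, and it does not validate its input (for example, it does not reject surrogates or values above 0x10FFFF).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a single Unicode code point as UTF-8.
// No validation is performed; the caller is assumed to pass a valid scalar value.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 1 byte
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

EncodeUtf8(0x20B9F) yields the four bytes F0 A0 AE 9F, while EncodeUtf8(0x53F1) yields the three bytes E5 8F B1. The four-byte form is the one that signals a code point above 0xFFFF.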

Actual history of this problem is a bit more complex, but in terms of actual use cases, these two characters can be recognized as the same character, differing only in design.

So in this PR, I intentionally used "叱" (U+53F1) everywhere "𠮟" (U+20B9F) should appear but cannot be used.

(∵) 0x20B9F - 0xFFFF == 134047 - 65535 == 68512 > 0xFFFF.
- It's impossible to store the offset into static const short accumulative_offsets_from_0x4E00[].
I used the list of Joyo characters "regular_use_force_2byte_codepoint_utf8.csv" to generate GetGlyphRangesJapanese(). In this list, as a workaround, the character "𠮟" (U+20B9F, modern form) is replaced with "叱" (U+53F1, traditional form) so that all characters can be represented in 2 bytes.
- This CSV file was generated by a Ruby script, which also performs the character substitution:
  - https://github.com/vaiorabbit/everyday_use_kanji/blob/master/scripts/generate_csv.rb#L20-L23
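The substitution described above is performed by a Ruby script in the linked repository. As a rough illustration of the same idea in C++, the sketch below replaces U+20B9F with U+53F1 in a list of code points and then checks that everything fits in 16 bits. The function and constant names are hypothetical and do not come from that script.

```cpp
#include <cstdint>
#include <vector>

// Code points of the two forms discussed in this PR.
constexpr std::uint32_t kModernForm      = 0x20B9F; // 𠮟, does not fit in 16 bits
constexpr std::uint32_t kTraditionalForm = 0x53F1;  // 叱, fits in 16 bits

// Hypothetical helper: returns a copy of the list in which the modern form
// is replaced by the traditional form.
std::vector<std::uint32_t> ForceTwoByteCodePoints(std::vector<std::uint32_t> code_points)
{
    for (std::uint32_t& cp : code_points)
        if (cp == kModernForm)
            cp = kTraditionalForm;
    return code_points;
}

// Hypothetical helper: true if every code point fits in a 16-bit character type
// such as ImWchar16.
bool AllFitIn16Bits(const std::vector<std::uint32_t>& code_points)
{
    for (std::uint32_t cp : code_points)
        if (cp > 0xFFFF)
            return false;
    return true;
}
```

A list containing 0x20B9F fails AllFitIn16Bits before the substitution and passes it afterwards, which is the property the generated table relies on.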

Even after this PR is merged, GetGlyphRangesJapanese() can display "叱" (U+53F1) but cannot display "𠮟" (U+20B9F).
Users who want to display "𠮟" (modern form) should follow these steps:

- Build ImGui with IMGUI_USE_WCHAR32 enabled
- Prepare an appropriate font (e.g. Google Noto Fonts)
- Write code like:

ImGuiIO& io = ImGui::GetIO();
ImFontGlyphRangesBuilder builder;
builder.AddRanges(io.Fonts->GetGlyphRangesJapanese());
#ifdef IMGUI_USE_WCHAR32
builder.AddText(u8"𠮟"); // code point 0x20B9F (== 134047, exceeds the range of ImWchar16), encoded as F0 A0 AE 9F in UTF-8
#endif
ImVector<ImWchar> out_ranges; // must stay alive until the font atlas has been built
builder.BuildRanges(&out_ranges);
ImFont* font = io.Fonts->AddFontFromFileTTF("/font/NotoSansMonoCJKjp-Regular.otf", 20.0f, nullptr, out_ranges.Data);

References
- 𠮟 (modern form)
  - https://en.wiktionary.org/wiki/%F0%A0%AE%9F
- 叱 (traditional form)
  - https://en.wiktionary.org/wiki/%E5%8F%B1

Test and Performance

I wrote a small test program that tries to display all 2136 Joyo characters and 863 Jinmeiyo characters.

The test code can be found here:
- https://github.com/vaiorabbit/imgui_fork/blob/feature/japanese_glyph_range_test/examples/example_sdl_opengl3/japanese_glyph_test.cpp
I used the List of jōyō kanji (Wikipedia) as the source of the test data and split it into several text files by hand:
- https://github.com/vaiorabbit/imgui_fork/tree/feature/japanese_glyph_range_test/examples/example_sdl_opengl3/kanji

Screenshots

Screenshot (with current GetGlyphRangesJapanese(), IMGUI_USE_WCHAR32 disabled)

shows several garbled characters.

Screenshot (with new GetGlyphRangesJapanese(), IMGUI_USE_WCHAR32 disabled)

can display all 2999 characters, except for 「𠮟」 (modern form).

Screenshot (with new GetGlyphRangesJapanese(), IMGUI_USE_WCHAR32 enabled and ImFontGlyphRangesBuilder::AddText used)

builder.AddText(u8"𠮟") solves the problem and we're done!

Performance issue

Size of font texture

Though it depends on the configuration, both the current GetGlyphRangesJapanese() and the new implementation created a 1024x2048 font texture internally in the test code. The increase in texture size was small.

GetGlyphRangesJapanese[Current]
GetGlyphRangesJapanese[New]

Memory consumption

The test code reports the memory consumed by ImGui when the macro MEASURE_MEMORY_ALLOCATION is defined
(by using the allocator hooks provided by ImGui::SetAllocatorFunctions).
The increase in memory consumption due to the new implementation is less than 100 KB (72,324 to 75,768 bytes in the measurements below).

[Windows x64 / Visual Studio 2019 Version 16.7.4 / ImGui 1.80 WIP]
GetGlyphRangesJapanese[Current]
  (Debug, IMGUI_USE_WCHAR32 undefined)   -> GetAllocatedSize=27718242
  (Debug, IMGUI_USE_WCHAR32 defined)     -> GetAllocatedSize=28537544
  (Release, IMGUI_USE_WCHAR32 undefined) -> GetAllocatedSize=27730578
  (Release, IMGUI_USE_WCHAR32 defined)   -> GetAllocatedSize=28549880

GetGlyphRangesJapanese[New]
  (Debug, IMGUI_USE_WCHAR32 undefined)   -> GetAllocatedSize=27790566
  (Debug, IMGUI_USE_WCHAR32 defined)     -> GetAllocatedSize=28613312
  (Release, IMGUI_USE_WCHAR32 undefined) -> GetAllocatedSize=27802902
  (Release, IMGUI_USE_WCHAR32 defined)   -> GetAllocatedSize=28625648

GetGlyphRangesChineseFull
  (Debug, IMGUI_USE_WCHAR32 undefined)   -> GetAllocatedSize=102034930
  (Debug, IMGUI_USE_WCHAR32 defined)     -> GetAllocatedSize=102847380
  (Release, IMGUI_USE_WCHAR32 undefined) -> GetAllocatedSize=102034924
  (Release, IMGUI_USE_WCHAR32 defined)   -> GetAllocatedSize=102847374
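For readers who want to take a similar measurement, the following is a minimal sketch of counting allocator hooks. The two functions match the callback shapes accepted by ImGui::SetAllocatorFunctions (an allocation function taking a size and a user pointer, and a free function taking a pointer and a user pointer). Everything else, including the AllocationStats struct and the size header stored in front of each block, is an assumption made for this illustration and is not taken from the test code linked in this PR.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical bookkeeping record; not part of Dear ImGui.
struct AllocationStats
{
    std::size_t current_bytes = 0; // bytes currently allocated through the hooks
    std::size_t peak_bytes    = 0; // highest value current_bytes has reached
};

// Each block is prefixed with a header holding its size. The header is as large
// as the strictest fundamental alignment so the user pointer stays aligned.
constexpr std::size_t kHeaderSize = alignof(std::max_align_t);
static_assert(kHeaderSize >= sizeof(std::size_t), "header must be able to hold the size");

// Allocation hook: records the requested size, then returns the user pointer.
void* CountingAlloc(std::size_t size, void* user_data)
{
    AllocationStats* stats = static_cast<AllocationStats*>(user_data);
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(kHeaderSize + size));
    if (raw == nullptr)
        return nullptr;
    std::memcpy(raw, &size, sizeof(size));
    stats->current_bytes += size;
    if (stats->current_bytes > stats->peak_bytes)
        stats->peak_bytes = stats->current_bytes;
    return raw + kHeaderSize;
}

// Free hook: reads the size back from the header and updates the counters.
void CountingFree(void* ptr, void* user_data)
{
    if (ptr == nullptr)
        return;
    AllocationStats* stats = static_cast<AllocationStats*>(user_data);
    unsigned char* raw = static_cast<unsigned char*>(ptr) - kHeaderSize;
    std::size_t size = 0;
    std::memcpy(&size, raw, sizeof(size));
    stats->current_bytes -= size;
    std::free(raw);
}
```

In an application, these hooks would be registered with ImGui::SetAllocatorFunctions(CountingAlloc, CountingFree, &stats) before creating the ImGui context, and the counters read after the font atlas has been built. Whether the figures above correspond to a current or a peak value is not stated, so the sketch tracks both.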

- GetGlyphRangesJapanese now supports
  - 2136 'Joyo' (meaning "for regular use" or "for common use") Kanji
  - 863 'Jinmeiyo' (meaning "for personal name") Kanji

ocornut · 2020-12-02T11:23:08Z

Thank you for this incredible amount of details.
I took the liberty to add a line of comment under the comments for GetGlyphRangesJapanese(), which says:
"- Missing 1 Joyo Kanji: U+20B9F (Kun'yomi: Shikaru, On'yomi: Shitsu,shichi), see #3627 for details."

We'll later take inspiration from some of your tests to include in our test suite!

Rebuild ImFontAtlas::GetGlyphRangesJapanese offset table

684f43e

- GetGlyphRangesJapanese now supports
  - 2136 'Joyo' (meaning "for regular use" or "for common use") Kanji
  - 863 'Jinmeiyo' (meaning "for personal name") Kanji

ocornut added the font/text label Dec 2, 2020

ocornut closed this Dec 2, 2020

ocornut mentioned this pull request Dec 2, 2020

Update GetGlyphRangesJapanese and Add GetGlyphRangesJoyoKanji #1650

Closed

ocornut mentioned this pull request Feb 13, 2021

Improve on automatic circle segment count calculation #3808

Closed

vaiorabbit mentioned this pull request Jan 9, 2023

Update comments in ImFontAtlas::GetGlyphRangesJapanese #6066

Merged
