Skip to content

Commit

Permalink
[browser][non-icu] HybridGlobalization compare (#84249)
Browse files Browse the repository at this point in the history
* Added support for hybrid globalization comparison.

* Clean-up. Added exception test cases to docs.

* Nit changes from @kg's review.

* Applied @kg's review.

* Revert unintentional change.

* Refactor: all hybrid globalization js methods in one file.

* Color the syntax.

* Undeline the fact that this PR is only for WASM.

* Move interop functions to proper location.
  • Loading branch information
ilonatommy authored Apr 7, 2023
1 parent f18c88d commit d34a2a2
Show file tree
Hide file tree
Showing 15 changed files with 658 additions and 191 deletions.
154 changes: 154 additions & 0 deletions docs/design/features/hybrid-globalization.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,157 @@ Affected public APIs:
- TextInfo.ToTitleCase.

Case change with invariant culture uses `toUpperCase` / `toLoweCase` functions that do not guarantee a full match with the original invariant culture.

**String comparison**

Affected public APIs:
- CompareInfo.Compare,
- String.Compare,
- String.Equals.

The number of `CompareOptions` and `StringComparison` combinations is limited. Originally supported combinations can be found [here for CompareOptions](https://learn.microsoft.com/dotnet/api/system.globalization.compareoptions) and [here for StringComparison](https://learn.microsoft.com/dotnet/api/system.stringcomparison).

- `IgnoreWidth` is not supported because there is no equivalent in Web API. Throws `PlatformNotSupportedException`.
``` JS
let high = String.fromCharCode(65281) // %uff83 = テ
let low = String.fromCharCode(12486) // %u30c6 = テ
high.localeCompare(low, "ja-JP", { sensitivity: "case" }) // -1 ; case: a ≠ b, a = á, a ≠ A; expected: 0

let wide = String.fromCharCode(65345) // %uFF41 = a
let narrow = "a"
wide.localeCompare(narrow, "en-US", { sensitivity: "accent" }) // 0; accent: a ≠ b, a ≠ á, a = A; expected: -1
```

For comparison where "accent" sensitivity is used, ignoring some type of character widths is applied and cannot be switched off (see: point about `IgnoreCase`).

- `IgnoreKanaType`:

It is always switched on for comparison with locale "ja-JP", even if this comparison option was not set explicitly.

``` JS
let hiragana = String.fromCharCode(12353) // %u3041 = ぁ
let katakana = String.fromCharCode(12449) // %u30A1 = ァ
let enCmp = hiragana.localeCompare(katakana, "en-US") // -1
let jaCmp = hiragana.localeCompare(katakana, "ja-JP") // 0
```

For locales different than "ja-JP" it cannot be used separately (no equivalent in Web API) - throws `PlatformNotSupportedException`.

- `None`:

No equivalent in Web API for "ja-JP" locale. See previous point about `IgnoreKanaType`. For "ja-JP" it throws `PlatformNotSupportedException`.

- `IgnoreCase`, `CurrentCultureIgnoreCase`, `InvariantCultureIgnoreCase`

For `IgnoreCase | IgnoreKanaType`, argument `sensitivity: "accent"` is used.

``` JS
let hiraganaBig = `${String.fromCharCode(12353)} A` // %u3041 = ぁ
let katakanaSmall = `${String.fromCharCode(12449)} a` // %u30A1 = ァ
hiraganaBig.localeCompare(katakanaSmall, "en-US", { sensitivity: "accent" }) // 0; accent: a ≠ b, a ≠ á, a = A
```

Known exceptions:

| **character 1** | **character 2** | **CompareOptions** | **hybrid globalization** | **icu** | **comments** |
|:---------------:|:---------------:|--------------------|:------------------------:|:-------:|:-------------------------------------------------------:|
| a | `\uFF41`| IgnoreKanaType | 0 | -1 | applies to all wide-narrow chars |
| `\u30DC`| `\uFF8E`| IgnoreCase | 1 | -1 | 1 is returned in icu when we additionally ignore width |
| `\u30BF`| `\uFF80`| IgnoreCase | 0 | -1 | |


For `IgnoreCase` alone, a comparison with default option: `sensitivity: "variant"` is used after string case unification.

``` JS
let hiraganaBig = `${String.fromCharCode(12353)} A` // %u3041 = ぁ
let katakanaSmall = `${String.fromCharCode(12449)} a` // %u30A1 = ァ
let unchangedLocale = "en-US"
let unchangedStr1 = hiraganaBig.toLocaleLowerCase(unchangedLocale);
let unchangedStr2 = katakanaSmall.toLocaleLowerCase(unchangedLocale);
unchangedStr1.localeCompare(unchangedStr2, unchangedLocale) // -1;
let changedLocale = "ja-JP"
let changedStr1 = hiraganaBig.toLocaleLowerCase(changedLocale);
let changedStr2 = katakanaSmall.toLocaleLowerCase(changedLocale);
changedStr1.localeCompare(changedStr2, changedLocale) // 0;
```

From this reason, comparison with locale `ja-JP` `CompareOption` `IgnoreCase` and `StringComparison`: `CurrentCultureIgnoreCase` and `InvariantCultureIgnoreCase` behave like a combination `IgnoreCase | IgnoreKanaType` (see: previous point about `IgnoreKanaType`). For other locales the behavior is unchanged with the following known exceptions:

| **character 1** | **character 2** | **CompareOptions** | **hybrid globalization** | **icu** |
|:------------------------------------------------:|:----------------------------------------------------------:|-----------------------------------|:------------------------:|:-------:|
| `\uFF9E` (HALFWIDTH KATAKANA VOICED SOUND MARK) | `\u3099` (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) | None / IgnoreCase / IgnoreSymbols | 1 | 0 |

- `IgnoreNonSpace`

`IgnoreNonSpace` cannot be used separately without `IgnoreKanaType`. Argument `sensitivity: "case"` is used for comparison and it ignores both types of characters. Option `IgnoreNonSpace` alone throws `PlatformNotSupportedException`.

``` JS
let hiraganaAccent = `${String.fromCharCode(12353)} á` // %u3041 = ぁ
let katakanaNoAccent = `${String.fromCharCode(12449)} a` // %u30A1 = ァ
hiraganaAccent.localeCompare(katakanaNoAccent, "en-US", { sensitivity: "case" }) // 0; case: a ≠ b, a = á, a ≠ A
```

- `IgnoreNonSpace | IgnoreCase`
Combination of `IgnoreNonSpace` and `IgnoreCase` cannot be used without `IgnoreKanaType`. Argument `sensitivity: "base"` is used for comparison and it ignores three types of characters. Combination `IgnoreNonSpace | IgnoreCase` alone throws `PlatformNotSupportedException`.

``` JS
let hiraganaBigAccent = `${String.fromCharCode(12353)} A á` // %u3041 = ぁ
let katakanaSmallNoAccent = `${String.fromCharCode(12449)} a a` // %u30A1 = ァ
hiraganaBigAccent.localeCompare(katakanaSmallNoAccent, "en-US", { sensitivity: "base" }) // 0; base: a ≠ b, a = á, a = A
```

- `IgnoreSymbols`

The subset of ignored symbols is limited to the symbols ignored by `string1.localeCompare(string2, locale, { ignorePunctuation: true })`. E.g. currency symbols, & are not ignored

``` JS
let hiraganaAccent = `${String.fromCharCode(12353)} á` // %u3041 = ぁ
let katakanaNoAccent = `${String.fromCharCode(12449)} a` // %u30A1 = ァ
hiraganaBig.localeCompare(katakanaSmall, "en-US", { sensitivity: "base" }) // 0; base: a ≠ b, a = á, a = A
```

- List of all `CompareOptions` combinations always throwing `PlatformNotSupportedException`:

`IgnoreCase`,

`IgnoreNonSpace`,

`IgnoreNonSpace | IgnoreCase`,

`IgnoreSymbols | IgnoreCase`,

`IgnoreSymbols | IgnoreNonSpace`,

`IgnoreSymbols | IgnoreNonSpace | IgnoreCase`,

`IgnoreWidth`,

`IgnoreWidth | IgnoreCase`,

`IgnoreWidth | IgnoreNonSpace`,

`IgnoreWidth | IgnoreNonSpace | IgnoreCase`,

`IgnoreWidth | IgnoreSymbols`

`IgnoreWidth | IgnoreSymbols | IgnoreCase`

`IgnoreWidth | IgnoreSymbols | IgnoreNonSpace`

`IgnoreWidth | IgnoreSymbols | IgnoreNonSpace | IgnoreCase`

`IgnoreKanaType | IgnoreWidth`

`IgnoreKanaType | IgnoreWidth | IgnoreCase`

`IgnoreKanaType | IgnoreWidth | IgnoreNonSpace`

`IgnoreKanaType | IgnoreWidth | IgnoreNonSpace | IgnoreCase`

`IgnoreKanaType | IgnoreWidth | IgnoreSymbols`

`IgnoreKanaType | IgnoreWidth | IgnoreSymbols | IgnoreCase`

`IgnoreKanaType | IgnoreWidth | IgnoreSymbols | IgnoreNonSpace`

`IgnoreKanaType | IgnoreWidth | IgnoreSymbols | IgnoreNonSpace | IgnoreCase`
13 changes: 13 additions & 0 deletions src/libraries/Common/src/Interop/Browser/Interop.CompareInfo.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System.Runtime.CompilerServices;

internal static partial class Interop
{
internal static unsafe partial class JsGlobalization
{
[MethodImplAttribute(MethodImplOptions.InternalCall)]
internal static extern unsafe int CompareString(out string exceptionMessage, in string culture, char* str1, int str1Len, char* str2, int str2Len, global::System.Globalization.CompareOptions options);
}
}
15 changes: 15 additions & 0 deletions src/libraries/Common/src/Interop/Browser/Interop.TextInfo.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System.Runtime.CompilerServices;

internal static partial class Interop
{
internal static unsafe partial class JsGlobalization
{
[MethodImplAttribute(MethodImplOptions.InternalCall)]
internal static extern unsafe void ChangeCaseInvariant(out string exceptionMessage, char* src, int srcLen, char* dstBuffer, int dstBufferCapacity, bool bToUpper);
[MethodImplAttribute(MethodImplOptions.InternalCall)]
internal static extern unsafe void ChangeCase(out string exceptionMessage, in string culture, char* src, int srcLen, char* dstBuffer, int dstBufferCapacity, bool bToUpper);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -348,10 +348,14 @@ public static string GetDistroVersionString()
private static readonly Lazy<bool> m_isInvariant = new Lazy<bool>(()
=> (bool?)Type.GetType("System.Globalization.GlobalizationMode")?.GetProperty("Invariant", BindingFlags.NonPublic | BindingFlags.Static)?.GetValue(null) == true);

private static readonly Lazy<bool> m_isHybrid = new Lazy<bool>(()
=> (bool?)Type.GetType("System.Globalization.GlobalizationMode")?.GetProperty("Hybrid", BindingFlags.NonPublic | BindingFlags.Static)?.GetValue(null) == true);

private static readonly Lazy<Version> m_icuVersion = new Lazy<Version>(GetICUVersion);
public static Version ICUVersion => m_icuVersion.Value;

public static bool IsInvariantGlobalization => m_isInvariant.Value;
public static bool IsHybridGlobalizationOnWasm => m_isHybrid.Value && (IsBrowser || IsWasi);
public static bool IsNotInvariantGlobalization => !IsInvariantGlobalization;
public static bool IsIcuGlobalization => ICUVersion > new Version(0, 0, 0, 0);
public static bool IsNlsGlobalization => IsNotInvariantGlobalization && !IsIcuGlobalization;
Expand Down
Loading

0 comments on commit d34a2a2

Please sign in to comment.