-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving default string globalization experience #207
base: main
Are you sure you want to change the base?
Conversation
|
||
To solve this problem for the .NET ecosystem at large, we propose to take the following actions on `string`. | ||
|
||
1. Obsolete (as warning) all linguistic-by-default methods on `string` with a new __SYSLIBxxxx__ identifier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really like to see 1. be Create fixers to migrate call sites
and 2. to be Obsolete (as warning)
Having a fixer is critical IMO and the most important part considering how widespread the existing methods are.
|
||
## A note on `StringComparison` and `StringComparer` | ||
|
||
One notable aspect of this proposal is the complete restructuring of the `StringComparison` and `StringComparer` types. It is important to note that all members of these types will be preserved. The proposal here is to hide all existing fields / static properties and to create new "simpler" names _Normal_ and _IgnoreCase_. Both of these additions alias their ordinal siblings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be it is worth mentioning whoever want to create a linguistic StringComparer, can call the extension method CompareInfo.GetStringComparer
. This will be explicit as it will be created from the CompareInfo which show the intend of using the linguistic functionality.
- Ordinal = 4, | ||
- OrdinalIgnoreCase = 5, | ||
+ Normal = 4, // aliases Ordinal | ||
+ IgnoreCase = 5, // aliases OrdinalIgnoreCase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the replacement for CurrentCulture
and InvariantCulture
?
How will this impact cases like CompareOptions.NumericOrdering
: dotnet/runtime#13979 (comment)?
How will this impact cases like RemoveStart
and RemoveEnd
: dotnet/runtime#14386
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To use CompareInfo instead for the operation. e.g. CultureInfo.CurrentCulture.CompareInfo.Compare(...)
. I think we need to revisit dotnet/runtime#13979 or at least will make NumericOrdering is cultural independent.
In reply to: 615181346 [](ancestors = 615181346)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we would make this so much harder when you do want CurrentCulture{IgnoreCase}. The concern expressed in this doc is around the defaults and the inconsistency with defaults we currently expose. But with StringComparison/StringComparer, a developer is explicitly specifying which they want. I see only downside to changing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind with @stephentoub. Having Normal
and IgnoreCase
present on StringComparer
seems sufficient as it maps to the methods. Using the enum/comparer is already the advanced mode, so hiding the other concepts feels a bit like we're trying too hard to be helpful.
|
||
The second characteristic requires a bit more explanation. Popular programming languages like Java, JavaScript, Go, Python, C/C++, and others treat string APIs as ordinal. .NET is unique in that we build globalization concepts directly into many basic `string` APIs. Unfortunately, evidence shows that .NET's behavior in this regard is largely unexpected and confusing to our developer audience. | ||
|
||
[Java 16's `String` class](https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/lang/String.html) (which is the closest to .NET's `string` class) is a good example of this principle. Per their own documentation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good that we're looking to learn from other ecosystems. Is there anything we can learn from the other ecosystems you mention above, beyond your note that they separate string from globalization? (Go, Rust etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, absolutely. Most other ecosystems also enforce opinionated behaviors regarding UTF-8 string well-formedness. But that continues to be a point of contention for us. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an important thing to keep in mind about scenarios like Rust
is that they design most of their types around the principle of "zero cost abstraction": https://doc.rust-lang.org/book/ch00-00-introduction.html#people-who-value-speed-and-stability
Rust is for people who crave speed and stability in a language. By speed, we mean the speed of the programs that you can create with Rust and the speed at which Rust lets you write them. The Rust compiler’s checks ensure stability through feature additions and refactoring. This is in contrast to the brittle legacy code in languages without these checks, which developers are often afraid to modify. By striving for zero-cost abstractions, higher-level features that compile to lower-level code as fast as code written manually, Rust endeavors to make safe code be fast code as well.
So while things like str
are always sliced to valid UTF-8 (unless using one of the unchecked
slicing methods), they ensure the relevant other components are exposed to allow working with and dealing with "raw parts", that is the bytes
rather than runes
(to put it into .NET terms) so power users have the ability to do things themselves when and as appropriate.
Likewise, I don't believe rust has a case insensitive comparer in std
today. You need to pull it in through an external crate (basically nupkg
) and specify it as the comparer to use in your match
expressions (in this way, its more like IComparer
, but where the implementation is a struct
and done via generic specialization so its "zero cost" and inlining/etc happen).
1. Mark all linguistic-related APIs as `[EditorBrowsable(EditorBrowsableState.Never)]`, but do not obsolete them. (Exception: do not mark APIs if they're in the _System.Globalization_ namespace.) | ||
2. Create analyzers to detect `Comparer<string>.Default`, `new SortedSet<string>()`, and similar code patterns that implicitly rely on the current culture. | ||
|
||
### List of API obsoletions and additions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to release an updated version of netstandard or are the developers targetting multiple versions going to be faced with warnings everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the challenge here is we are going to have a new APIs which will be in .NET 6.0 and up. I don't think we can add these to the netstandard.
CC @terrajobst
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These wouldn't be in standard. If you're multi-targeting, I believe per the way this proposal is written, the only methods that would be marked [Obsolete]
are methods that the code analyzers CA1307 and CA1310 would have flagged anyway as likely incorrect usage. A significant number of methods would be marked EB-never but wouldn't produce warnings.
Can't we have e.g. |
With the naming convention above, we can omit repeating |
Maybe we can even provide |
Also |
The issue here is if you are developer new to .NET, would the name |
Co-authored-by: Alexander Köplinger <[email protected]>
- public static int Compare(System.String? strA, System.String? strB, bool ignoreCase) { throw null; } | ||
- public static int Compare(System.String? strA, System.String? strB, bool ignoreCase, System.Globalization.CultureInfo? culture) { throw null; } | ||
- public static int Compare(System.String? strA, System.String? strB, System.Globalization.CultureInfo? culture, System.Globalization.CompareOptions options) { throw null; } | ||
+ public static int CompareString(System.String? strA, System.String? strB) { throw null; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels weird that string
will be two times in the sentence when calling these APIs (string.CompareString
) I guess there is no better naming/options?... I definitely like this better than including Ordinal
in the name though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Downside is that the good names were taken by the method we want to steer people away from. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is the thing that urks me. Our goal is to present the developer with a simpler view of the world, but the new names look fairly misplaced and odd :-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PowerShell repo we follow simple rule - "use explicitly culture/ordinal/ignorecase everywhere".
You will find that this code is full of explicit Ordinal and OrdinalIgnorecase names
This makes the code incredibly understandable and maintainable.
Despite this, we've fixed a few case of unexpected use of culture. So the PR with new conception is great!
But the new names looks weird for me too.
As I said, having Ordinal and OrdinalIgnorecase names in PowerShell code makes it better in my opinion.
Notice, we already have CompareOrdinal() (and internal EqualsOrdinal/EqualsOrdinalIgnoreCase). Why not follow this pattern?
CompareToString -> CompareToOrdinal
CompareToStringIgnoreCase -> CompareToOrdinalIgnoreCase
And this is completely consistent with the new concept.
It also makes it easier to learn and understand for people coming from other languages.
This concept (must) be consistent with the fact that the best practice is to explicitly express intent. Here this means using explicit method names or options with Ordinal/OrdinalIgnorecase.
This makes creating fixers easier. We can't just say "replace CompareTo() with CompareToOrdinal()" because the source code may contain a bug. But we can recommend both options (ordinal and culture).
|
||
1. `string` and related APIs should default to ordinal behavior unless the caller makes an explicit request for a different behavior. | ||
|
||
2. Developers should not be exposed to any globalization-related concepts unless they explicitly seek globalization support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this claim specific to APIs on String? For example, how does this relate to formatting/parsing APIs on types like Int32, DateTime, etc., which all factor in the current culture by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I should clarify this. The concept of "ordinal" only applies to APIs which operate on or transform strings / sequences of chars. (Think: ReadOnlySpan<char>
APIs in MemoryExtensions
.)
For APIs which are about changing representation between string and non-string types (e.g., string <-> int), I don't suggest any behavioral change. These APIs necessarily have to perform some kind of linguistic translation, so they may as well fall back to current culture as they do today. That also helps solidify Parse
and ToString
as helpers for "please convert this data from / to a representation which is understandable by the current user."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concept of "ordinal" only applies to APIs which operate on or transform strings / sequences of chars.
Yes. I was commenting on the second goal/requirement: "Developers should not be exposed to any globalization-related concepts unless they explicitly seek globalization support." There are many globalization-related concepts unrelated to ordinal.
|
||
And we propose to take the following actions on `string`-like APIs. | ||
|
||
1. Mark all linguistic-related APIs as `[EditorBrowsable(EditorBrowsableState.Never)]`, but do not obsolete them. (Exception: do not mark APIs if they're in the _System.Globalization_ namespace.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to my previous comment, how does this relate to methods that implement interfaces or override virtuals? e.g. Int32.ToString() factoring in the current current. The point earlier talked about not needing to understand globalization concepts... was that too broad, and this is all really just about linguistics-related globalization concepts?
|
||
While searching through internal and public sources, we discovered significant usage of the members `StringComparison.InvariantCulture` and `StringComparison.InvariantCultureIgnoreCase`. The usage numbers for these APIs is much higher than would be reasonably expected given their scenario of performing linguistic (somewhat "fuzzy") text matching. This may be explained as an interesting byproduct of having drilled [__CA1305__](https://docs.microsoft.com/dotnet/fundamentals/code-analysis/quality-rules/ca1305) (specify `CultureInfo.InvariantCulture` when calling `ToString(IFormatProvider, ...)`) and ["prefer `string.ToUpperInvariant` over `string.ToUpper`"](https://docs.microsoft.com/dotnet/api/system.string.toupper) into developers' muscle memory over the past decades. This can form a positive association that makes developers believe that __"invariant"__ is always an appropriate default to use. | ||
|
||
So when rules like [__CA1307__](https://docs.microsoft.com/dotnet/fundamentals/code-analysis/quality-rules/ca1307) (pass an explicit `StringComparison` value) are enforced, the developer is now forced to choose between ordinal, invariant, or current-culture. It is not reasonable to expect the average developer to understand the difference between ordinal and linguistic behavior. But it _is_ reasonable for an average developer to think to themselves, "I was told _invariant_ was correct in all other situations where I encountered it. It must also be the appropriate default here.". (It would further be reasonable that this decision may be made via muscle memory, with the developer never consciously realizing that they were presented a choice.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the argument for hiding CurrentCulture/CurrentCultureIgnoreCase. If someone explicitly types "CurrentCulture", that's what they get; it's very clear.
|
||
## Future work and ecosystem implications | ||
|
||
This work has implications for GUI stacks, such as WPF and WinForms. GUI scenarios often involve performing linguistic, current-culture collation of data, such as alphabetically sorting entries in a dropdown list. It should be expected that WinForms, WPF, and similar code bases will be disproportionately hit with warnings. They have a few options available to them. In descending order of preference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar code bases will be disproportionately hit with warnings
Not only hit with warnings, but there's a good chance pushing them to use ordinal behavior will actually introduce bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the weird part about all of this. The rule of thumb is: if you're sorting data for display directly in client UI, the default (current culture) behavior is probably appropriate. If you're on a server, a current culture comparison is probably not appropriate. So it largely depends on the context.
However, there is one universal truth: the pattern if (string.Compare(...) == 0)
is almost certainly a bug, regardless of context. There are exceptions, sure, but they're very rare.
Maybe the answer is that we don't hide the original methods, but we instead query what you're doing with the result. If we see a comparison result == 0
or != 0
we flag it. This will catch fewer misusages of the comparison APIs, but experience tells me that the warnings will be more actionable / of higher quality.
@GrabYourPitchforks should this be marked with api-ready-for-review so we can discuss this? |
I expect @stephentoub would want to have a chance to be part of API review before anything's definitely decided, but no doubt helpful to start discussions. |
@terrajobst I suspect if we wanted to have a public discussion, it should be a design-specific discussion with no expectation that we would finalize an API immediately. We'd want Steve and usability experts to be involved prior to finalization. This document also suggests usability testing to help test our theories. |
While I appreciate the goal of having it be easy to do the correct thing and avoid accidental use of lingusitic operations, I have a lot of concerns here about ergonomics, and poor user experience. You already mention the new developer using an old tutotrial and seeing warnings. But what about experienced developers who move to a newer .NET and are confused/frustrated about overloads they know exist (and are use to using, because they are the "correct" way to do things as of right now) don't show up in Intellisense because they have been made EBNever. I do have to wonder if for application code, users would not be better served by have an easy way to set the Default culture of all threads to some special variant of the Invariant culture that does ordinal comparisons/equality etc instead of the unicode default algorithms. This would actually make the existing APIs generally do the right thing, and would not not be seriously regressing things for people creating GUI apps, Sadly such users will always need to decide for each case if they want a linguistic operation, or a ordinal one. This design doc says "It is not reasonable to expect the average developer to understand the difference between ordinal and linguistic behavior", but for GUI app developers kind of do need to understand. For things like parsing file formats, they often want ordinal, but for things related to user input, they often want linguistic. Obviously libraries would still need to avoid accidentally using linguistic methods if they really want ordinal, unless the libraries will only be used by programs that use this special default culture. |
I love the goal of this proposal but the new |
We considered but rejected this proposal. The reason is that it would affect more fundamental APIs like |
I am positive towards increasing the pit of success when it comes to string APIs. No developers expect strings API to behave differently depending on global thread static variable (current culture) until the first long-night debugging session due to a localization bug. I am concerned about the string interpolation feature in C# and F# and how that can be made less dependent on current culture. I realize there's a If possible, I would like to see an optional analyzer that warns for all usages string interpolation without |
Being culture aware in those members are so problematic:
With my personal bias, as a Chinese speaker, casing and sorting are totally nothing in Chinese. The only thing I benefit from being culture aware is displaying DateTime. |
We will soon be planning for .NET 8. Is it an appropriate time to pick this up again? |
Pinging @dotnet/project-system as we are considering modernising the resource editor in Visual Studio. |
Minor suggestion that doesn't change anything fundamental about the discussion: how about |
Draft design document for how we can flatten the learning curve for using
string
, removing globalization concepts for developers who needn't be exposed to them, and widening the pit of success for people calling string-based APIs.This is a draft design document and does not represent a final plan or committed work.
See also: