Add support for Unicode codepoints #217

smarr · 2018-01-24T17:28:44Z

These changes add support for converting characters, i.e., elements of a string, into their Integer representation as Unicode codepoints. They also add support for creating characters from Integers representing codepoints.

Overview:

- convert an array of characters to a String
- convert an integer into a character
- convert a one-element String into a 16-bit code point
- signal ArgumentError, i.e., a violated expectation on a method argument
- fix parsing issues with unicode character literals (couldn't parse multi-part characters)
- document design of Exceptions vs Errors
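As a rough illustration of the two conversions listed above, here is a minimal Java sketch. The class and method names (`CodepointSketch`, `fromCodepoint`, `toCodepoint`) are hypothetical and not taken from the SOMns code base, and `IllegalArgumentException` only stands in for the ArgumentError discussed in this PR.

```java
// Hypothetical sketch: names are illustrative, not the SOMns implementation.
final class CodepointSketch {

    /** Builds a one-character string from a Unicode code point. */
    static String fromCodepoint(long codepoint) {
        // Accept only U+0000..U+10FFFF; IllegalArgumentException is an
        // assumed stand-in for an ArgumentError-style signal.
        if (codepoint < 0 || codepoint > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a valid code point: " + codepoint);
        }
        return new String(Character.toChars((int) codepoint));
    }

    /** Returns the code point of a string that holds exactly one character. */
    static long toCodepoint(String s) {
        // Count code points, not 16-bit chars, so emojis count as one character.
        if (s.codePointCount(0, s.length()) != 1) {
            throw new IllegalArgumentException("expected exactly one character");
        }
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println(toCodepoint("A"));    // prints 65
        System.out.println(fromCodepoint(97));   // prints a
    }
}
```

Using a wide integer type for the code point keeps the full range U+0000..U+10FFFF available, which a single 16-bit char cannot represent.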

@daumayr, you didn't say anything with respect to the ArgumentError naming question.
Which system should we use as a general style guide: Java, Dart, or something else?
ArgumentError is not in Newspeak.

TODOs

- figure out whether it is sufficient to return char when converting length-1 Strings to numbers. How do we deal with characters that are 2x 16-bit, i.e., how do we support the full unicode code point range?
  - The solution is to use a full Integer for this, and not mess around with it on the SOMns level
- add tests for all methods, and exercise some unicode characters, emojis, etc.
- add helper for signaling exception with receiver class
- document style and naming preference in /docs/
- add comment on Errors vs. Exceptions to Kernel
- fix issue about character literals and emojis not parsing correctly.
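To make the "2x 16-bit" concern from the first TODO item concrete, the following Java sketch shows how a character outside the Basic Multilingual Plane occupies two UTF-16 code units but is still a single code point. The class name `SurrogateSketch` and its helpers are made up for this illustration.

```java
// Hypothetical sketch showing why a 16-bit char is not enough for emojis.
final class SurrogateSketch {

    /** Number of 16-bit UTF-16 code units the string occupies. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Code point that starts at the first code unit of the string. */
    static int firstCodepoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(codeUnits(emoji));   // prints 2
        System.out.println(codePoints(emoji));  // prints 1
        // charAt only sees the high surrogate, not the character itself.
        System.out.println(Integer.toHexString(emoji.charAt(0)));          // prints d83d
        System.out.println(Integer.toHexString(firstCodepoint(emoji)));    // prints 1f600
    }
}
```

Returning the code point as a full integer, as the TODO suggests, avoids exposing the surrogate halves to the language level.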

smarr · 2018-01-24T19:32:18Z

@daumayr hmmm, is there a good reason why we have a Character class? It currently doesn't do anything besides providing factory methods. Might those belong in String?

smarr · 2018-01-24T22:17:09Z

@daumayr can we debate the current design, and restrictions?

I think it would be less restrictive if String>>#from: were String>>#fromStringArray: and accepted strings of any length in the array.

smarr · 2018-01-24T22:33:28Z

I would argue we want to have something similar to this:

class String = ()() : (
  public fromStringArray: arr = (
    ^ arr inject: '' into: [:a :b | a + b ]
  )
)

Possibly with a check that b is really a string though, because we don't want implicit conversions.
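For comparison, the type check could look roughly like the Java sketch below. This is only an illustration of the idea of rejecting non-string elements instead of converting them implicitly. `StringArraySketch` is a hypothetical name, and `IllegalArgumentException` is an assumed stand-in for whatever error SOMns would signal.

```java
// Hypothetical sketch: concatenate array elements, rejecting non-strings.
final class StringArraySketch {

    /** Concatenates the elements of arr; every element must be a String. */
    static String fromStringArray(Object[] arr) {
        StringBuilder result = new StringBuilder();
        for (Object element : arr) {
            // No implicit conversion: anything that is not a String is an error.
            if (!(element instanceof String)) {
                throw new IllegalArgumentException("expected a String but got: " + element);
            }
            result.append((String) element);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Elements may have any length, including zero.
        System.out.println(fromStringArray(new Object[] {"ab", "c", ""})); // prints abc
    }
}
```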

daumayr · 2018-01-25T08:39:39Z

@smarr ArgumentError is from Dart, I don't have a strong preference. Generally, Errors should not be caught by programs. In Java there is IllegalArgumentException, so whether we want users to catch it plays a role in the final decision.

String>>#fromStringArray: sounds reasonable

The methods from Character could be moved to String.
Maybe String>>#charValueAt: and a static #fromCharValue: ?

smarr · 2018-01-25T10:59:54Z

Ok good, I should document the Dart preference as part of a style guide in the docs, I think.

And, I probably need to document that character literals don't support long unicode code points (like for emojis).

smarr · 2018-02-02T11:25:12Z

@daumayr could you have a look at this again and see whether it satisfies all your requirements?

@Richard-Roberts this might be interesting to you. We now have more robust unicode support.

daumayr · 2018-02-07T09:09:15Z

looks good


@fallback

Since our integers can encode all unicode characters, it would be sad if we didn’t actually use them. This implementation extracts some bits of java.lang.Character to enable Truffle-based specialization.
- added tests
- turn Character class>>from: to String class>>fromCodepoint:
  - also fix implementation, make sure specializations are not overlapping
- added tests for argument error
- make sure we have specialization for Symbol>>#charAt:
  - otherwise, it falls back to the substring-based implementation in String, which failed with SSymbol objects
- fixed literal parsing, became relevant for unicode character literals
- use @fallback instead of @specialization without guard
  - fallback implies negation, specialization without guard is just the most generic case

Signed-off-by: Stefan Marr <[email protected]>
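The difference between a guarded case and a fallback can be pictured with a plain-Java analogy. This sketch does not use the Truffle DSL; it only mimics guard-based dispatch by hand, and `DispatchSketch`, its nested `Symbol` class, and the 0-based `charAt` are assumptions made for the example.

```java
// Hypothetical plain-Java analogy of guarded cases plus a fallback.
final class DispatchSketch {

    /** Made-up stand-in for a symbol object that wraps a string. */
    static final class Symbol {
        final String name;

        Symbol(String name) {
            this.name = name;
        }
    }

    /** Returns the character at a 0-based char index as a string. */
    static String charAt(Object receiver, int index) {
        // Case 1: guarded by "receiver is a String".
        if (receiver instanceof String) {
            String s = (String) receiver;
            return new String(Character.toChars(s.codePointAt(index)));
        }
        // Case 2: guarded by "receiver is a Symbol"; does not overlap with case 1.
        if (receiver instanceof Symbol) {
            String s = ((Symbol) receiver).name;
            return new String(Character.toChars(s.codePointAt(index)));
        }
        // Fallback: reached only when every guard above is false,
        // i.e. it is the negation of all other cases.
        throw new IllegalArgumentException("unsupported receiver: " + receiver);
    }

    public static void main(String[] args) {
        System.out.println(charAt("abc", 1));               // prints b
        System.out.println(charAt(new Symbol("xyz"), 0));   // prints x
    }
}
```

An unguarded final case would instead accept every receiver, which is why it is only "the most generic case" and not a true fallback.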



smarr added the enhancement and language-design labels Jan 24, 2018

smarr added this to the v0.6.0 - Black Diamonds milestone Jan 24, 2018

smarr self-assigned this Jan 24, 2018

smarr changed the title ~~Add support for Characters~~ Add support for Unicode codepoints Feb 1, 2018

smarr force-pushed the da-characters branch from 494b2e4 to 10921be on February 2, 2018 11:18

daumayr and others added 4 commits February 10, 2018 16:55

Add Character/String Conversion

d874c77

- added KernelObj.signalExceptionWithClass helper method - avoids code duplication and is explicit about what is done Signed-off-by: Stefan Marr <[email protected]>

Set JVMCI_VERSION_CHECK=ignore in launcher

072675d

Signed-off-by: Stefan Marr <[email protected]>

Added documentation on naming conventions

e4cd701

This includes general suggestions on errors vs. exceptions. Signed-off-by: Stefan Marr <[email protected]>

smarr force-pushed the da-characters branch from 527050f to e4cd701 on February 10, 2018 16:55

smarr merged commit 35f8b68 into smarr:dev Feb 11, 2018

smarr deleted the da-characters branch February 11, 2018 00:41

smarr mentioned this pull request Feb 14, 2018

Redesign Streams and Character Encoding #213

Open
