Add support for Unicode codepoints #217

Merged
merged 4 commits into from
Feb 11, 2018

Conversation


@smarr smarr commented Jan 24, 2018

These changes introduce support for converting characters, i.e., elements of a string, into their Integer representation as Unicode codepoints. They also introduce support for creating characters from Integers that represent codepoints.

Overview:

  • convert an array of characters to a String
  • convert an integer into a character
  • convert a one-element String into a 16-bit code point
  • signal ArgumentError, i.e., violations of expectations on method arguments
  • fix parsing issues with Unicode character literals (couldn't parse multi-part characters)
  • document design of Exceptions vs Errors

@daumayr, you didn't say anything with respect to the ArgumentError naming question.
Which system should we use as a general style guide: Java, Dart, or something else?
ArgumentError is not in Newspeak.

TODOs

  • figure out whether it is sufficient to return a char when converting length-1 Strings to a number. How do we deal with characters that are 2x 16-bit, i.e., how do we support the full Unicode code point range?
    • The solution is to use a full Integer for this, and not mess around with it at the SOMns level
  • add tests for all methods, and exercise some Unicode characters, emojis, etc.
  • add helper for signaling exception with receiver class
  • document style and naming preference in /docs/
  • add comment on Errors vs. Exceptions to Kernel
  • fix issue about character literals and emojis not parsing correctly.
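The 16-bit question in the first TODO can be illustrated on the Java side. The following is only a minimal sketch, not code from this PR; the class and method names are made up for illustration. It shows that a character outside the Basic Multilingual Plane occupies two 16-bit units in a Java String, while its code point still fits into a single int.

```java
// Illustrative sketch only; CodePointRangeDemo is a hypothetical name.
final class CodePointRangeDemo {
    // Returns the full code point that starts at UTF-16 index `index`.
    static int codePointAt(String s, int index) {
        return s.codePointAt(index);
    }

    // Counts code points rather than 16-bit units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the 16-bit range and needs a surrogate pair.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());                              // 2
        System.out.println(codePointLength(emoji));                      // 1
        System.out.println(Integer.toHexString(codePointAt(emoji, 0)));  // 1f600
        System.out.println(Character.isHighSurrogate(emoji.charAt(0)));  // true
    }
}
```

Returning a 16-bit char for a length-1 String would therefore only ever expose one half of such a pair, which is why a full Integer is the more robust representation.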

@smarr added the enhancement and language-design labels Jan 24, 2018
@smarr added this to the v0.6.0 - Black Diamonds milestone Jan 24, 2018
@smarr self-assigned this Jan 24, 2018

smarr commented Jan 24, 2018

@daumayr hmmm, is there a good reason why we have a Character class? It currently doesn't do anything besides providing factory methods. Might those belong in String?


smarr commented Jan 24, 2018

@daumayr can we debate the current design, and restrictions?

I think it would be less restrictive if String>>#from: were String>>#fromStringArray: and accepted strings of any length in the array.


smarr commented Jan 24, 2018

I would argue we want to have something similar to this:

class String = ()() : (
  public fromStringArray: arr = (
    ^ arr inject: '' into: [:a :b | a + b ]
  )
)

Possibly with a check that b is really a string though, because we don't want implicit conversions.
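For comparison, here is roughly what the strict variant could look like on the Java side. This is a sketch under assumptions: the class name, the method name, and the choice of IllegalArgumentException are illustrative and not taken from the PR. Each element is checked to be a String before it is appended, so nothing is converted implicitly.

```java
// Illustrative sketch only; a hypothetical Java analogue of String>>#fromStringArray:.
final class StringArrayDemo {
    // Concatenates the elements of `arr`, rejecting anything that is not a String.
    static String fromStringArray(Object[] arr) {
        StringBuilder sb = new StringBuilder();
        for (Object element : arr) {
            if (!(element instanceof String)) {
                // No implicit conversion: a non-String element is an argument error.
                throw new IllegalArgumentException(
                    "expected a String, but got: " + element);
            }
            sb.append((String) element);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Elements may have any length, including zero.
        System.out.println(fromStringArray(new Object[] {"ab", "", "c"})); // abc
    }
}
```

Accepting strings of any length keeps the method usable for single characters as a special case, since those are just length-1 strings.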


daumayr commented Jan 25, 2018

@smarr ArgumentError is from Dart; I don't have a strong preference. Generally, Errors should not be caught by programs. In Java there is IllegalArgumentException, so whether we want users to catch it plays a role in the final decision.

String>>#fromStringArray: sounds reasonable

The methods from Character could be moved to String.
Maybe String>>#charValueAt: and a static #fromCharValue: ?
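The Error vs. Exception point above can be made concrete in Java terms. The sketch below is illustrative only: the nested ArgumentError class is a hypothetical, Dart-style error defined just for this example, and classify is a made-up helper. It shows that a handler written for Exception does not intercept a Throwable that extends Error, whereas IllegalArgumentException is caught.

```java
// Illustrative sketch only; ArgumentError here is a hypothetical class, not part of Java.
final class ErrorVsExceptionDemo {
    // Dart-style: extends Error, so catch (Exception e) will not see it.
    static final class ArgumentError extends Error {
        ArgumentError(String message) {
            super(message);
        }
    }

    // Runs `action` and reports which kind of handler observed the failure.
    static String classify(Runnable action) {
        try {
            try {
                action.run();
                return "no failure";
            } catch (Exception e) {
                return "caught as Exception: " + e.getClass().getSimpleName();
            }
        } catch (Error e) {
            return "escaped as Error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> { throw new IllegalArgumentException("bad"); }));
        System.out.println(classify(() -> { throw new ArgumentError("bad"); }));
    }
}
```

Which of the two behaviors is wanted for argument violations is exactly the naming and style question under discussion here.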


smarr commented Jan 25, 2018

Ok good, I should document the Dart preference as part of a style guide in the docs, I think.

And I probably need to document that character literals don't support long Unicode code points (such as those for emojis).

@smarr changed the title from "Add support for Characters" to "Add support for Unicode codepoints" Feb 1, 2018

smarr commented Feb 2, 2018

@daumayr could you have a look at this again and see whether it satisfies all your requirements?

@Richard-Roberts this might be interesting to you. We now have more robust Unicode support.


daumayr commented Feb 7, 2018

looks good

daumayr and others added 4 commits February 10, 2018 16:55
- added KernelObj.signalExceptionWithClass helper method
  - avoids code duplication and is explicit about what is done

Signed-off-by: Stefan Marr <[email protected]>
Since our integers can encode all Unicode characters,
it would be sad if we don’t actually use it.

This implementation extracts some bits of java.lang.Character
to enable Truffle-based specialization.

- added tests
- turn Character class>>from: into String class>>fromCodepoint:
  - also fix implementation, make sure specializations are not overlapping
  - added tests for argument error

- make sure we have specialization for Symbol>>#charAt:
  - otherwise, it falls back to the substring-based
    implementation in String when it failed with SSymbol objects

- fixed literal parsing, became relevant for Unicode character literals

- use @Fallback instead of @Specialization without guard
  - fallback implies negation, specialization without guard is just the
    most generic case

Signed-off-by: Stefan Marr <[email protected]>
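As a rough illustration of what a fromCodepoint: primitive with argument checking involves, consider the sketch below. It is not the code from this commit: it calls java.lang.Character directly rather than extracting parts of it for Truffle, and the class name, method name, and use of IllegalArgumentException are assumptions made for the example.

```java
// Illustrative sketch only; not the SOMns implementation.
final class FromCodepointDemo {
    // Builds a one-character String from a code point given as a long,
    // rejecting values outside the Unicode range U+0000..U+10FFFF.
    static String fromCodepoint(long value) {
        if (value < Character.MIN_CODE_POINT || value > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a Unicode code point: " + value);
        }
        // toChars yields one char for the BMP and a surrogate pair otherwise.
        return new String(Character.toChars((int) value));
    }

    public static void main(String[] args) {
        System.out.println(fromCodepoint(97));                // a
        System.out.println(fromCodepoint(0x1F600).length());  // 2
    }
}
```

Taking a long and checking the range before narrowing to int mirrors the idea that the language-level Integer carries the code point, and that out-of-range values are reported as an argument error rather than silently truncated.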
This includes general suggestions on errors vs. exceptions.

Signed-off-by: Stefan Marr <[email protected]>
@smarr smarr merged commit 35f8b68 into smarr:dev Feb 11, 2018
@smarr smarr deleted the da-characters branch February 11, 2018 00:41