Leaving ICU behind (proof of concept) #68

lemire · 2023-01-19T02:55:50Z

This is not yet quite correct, but it passes 'most' of our tests (it is currently only failing toascii_encoding). It requires much more work but this PR shows how we could do away with the ICU dependency by implementing just what we need. Essentially, we need these components...

Validating existing punycode.
Transcribing non-ASCII into punycode.
An up-to-date list of code points that are either invalid (must be rejected) or that must be omitted, along with other fancy internationalization issues. For Unicode 15, the table is https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt

RFC 3454 tells us that we need to

Map characters (see section 3),
Normalize (see section 4),
Reject forbidden characters,
Check for right-to-left characters and if so, check all requirements (see section 6),
Optionally reject based on unassigned code points (section 7).

Reference to the standard: https://www.unicode.org/reports/tr46/#ToUnicode

You can verify that it is 'relatively' far along... comment out toascii_encoding() and it works...

index 4dd7dd8..f3d79bc 100644
--- a/tests/wpt_tests.cpp
+++ b/tests/wpt_tests.cpp
@@ -312,7 +312,7 @@ bool urltestdata_encoding() {
 int main() {
   std::cout << "Running WPT tests.\n" << std::endl;
 
-  if (percent_encoding() && setters_tests_encoding() && toascii_encoding() &&
+  if (percent_encoding() && setters_tests_encoding() && //toascii_encoding() &&
       urltestdata_encoding()) {
     std::cout << "WPT tests are ok." << std::endl;
     return EXIT_SUCCESS;

However, I stress that the unicode validation and processing is only a sketch. It needs to be completed. It is not conceptually very hard. I did not look into how people implement these things. But it is mostly a matter of having the right tables. I rolled my own for now.

Whether we want to go down this route is up to us. Having fewer dependencies is clearly a positive. However, rolling our own requires us to do more work and more testing.

anonrig · 2023-01-19T03:17:06Z

src/puny.cpp

+  while (!input.empty()) {
+    size_t loc_dot = input.find('.');
+    size_t loc_full_stop = input.find("\u3002"); // complete as needed
+    size_t loc = std::min(loc_dot, loc_full_stop);


This makes two find operations and a std::min. We can reduce it by first finding either character and, depending on the result, looking for the other. I wonder whether it would be faster...

@anonrig

That's the least of your concerns. My code is only an illustration. The specification says...

The domain labels of a domain domain are the result of strictly splitting domain on U+002E (.).

But that is not sufficient; the tests say...

"Ideographic full stop (full-width period for Chinese, etc.) should be treated as a dot. U+3002 is mapped to U+002E (dot)"

And this refers to the following specification...

https://www.unicode.org/reports/tr46/#ToUnicode

I did not implement it, I only sketched it (to show how it might be done).

The ICU implementation is here...

https://github.com/unicode-org/icu/blob/bb0e745e25c99cc57055caf45c81b95ef63b25d4/icu4c/source/common/uts46.cpp#L1298

The list of characters that are mapped to the dot includes...

U+3002 (ideographic full stop)

U+FF0E (fullwidth full stop)

U+FF61 (halfwidth ideographic full stop)

I only included U+3002 here.

The specification says...

In this document, a label is a substring of a domain name. That substring is bounded on both sides by either the start or the end of the string, or any of the following characters, called label-separators: U+002E ( . ) FULL STOP U+FF0E ( ． ) FULLWIDTH FULL STOP U+3002 ( 。 ) IDEOGRAPHIC FULL STOP U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
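To make the label-separator rule concrete, here is a minimal sketch of a single-pass splitter that treats all four separators alike. It is not the code in this PR: the names `split_labels` and `separators` are hypothetical, and it assumes the input is valid UTF-8, so the three non-ASCII separators are matched as their 3-byte encodings.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <vector>

namespace sketch {

// UTF-8 encodings of the four UTS #46 label separators:
// U+002E, U+3002, U+FF0E, U+FF61.
constexpr std::array<std::string_view, 4> separators = {
    ".", "\xE3\x80\x82", "\xEF\xBC\x8E", "\xEF\xBD\xA1"};

// Splits a UTF-8 domain into labels in one pass over the bytes.
// The returned views point into `domain`; empty labels are preserved.
inline std::vector<std::string_view> split_labels(std::string_view domain) {
  std::vector<std::string_view> labels;
  std::size_t start = 0;
  std::size_t pos = 0;
  while (pos < domain.size()) {
    std::size_t sep_len = 0;
    for (std::string_view sep : separators) {
      if (domain.substr(pos, sep.size()) == sep) {
        sep_len = sep.size();
        break;
      }
    }
    if (sep_len == 0) {
      ++pos;  // not a separator, keep scanning
      continue;
    }
    labels.push_back(domain.substr(start, pos - start));
    pos += sep_len;
    start = pos;
  }
  labels.push_back(domain.substr(start));  // last (possibly empty) label
  return labels;
}

}  // namespace sketch
```

Because UTF-8 is self-synchronizing, a byte-wise scan cannot match a separator in the middle of another character. Whether this beats two find calls would need to be measured.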

anonrig · 2023-01-19T03:20:54Z

src/puny.cpp

+      if (!label_string.has_value()) {
+        return std::nullopt;
+      }
+      if (label_string.value().empty()) {


This could be simplified to label_string.value_or("").empty(), which removes the previous branch, provided a missing value and an empty label are meant to be handled the same way.
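For reference, a small sketch of how the two forms behave on a `std::optional<std::string>`. The `outcome` enum and both function names are hypothetical and exist only to make the difference visible: the two-branch form can distinguish a failed conversion from an empty label, while the `value_or` form folds them together.

```cpp
#include <optional>
#include <string>

namespace sketch {

enum class outcome { failed, empty_label, non_empty_label };

// Two explicit branches: a missing value and an empty string differ.
inline outcome classify_two_branches(const std::optional<std::string>& label) {
  if (!label.has_value()) return outcome::failed;
  if (label.value().empty()) return outcome::empty_label;
  return outcome::non_empty_label;
}

// Single branch: a missing value is treated exactly like an empty string.
inline outcome classify_value_or(const std::optional<std::string>& label) {
  if (label.value_or("").empty()) return outcome::empty_label;
  return outcome::non_empty_label;
}

}  // namespace sketch
```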

anonrig · 2023-01-19T03:23:19Z

src/puny.cpp

+  // and labels cannot exceed 64 characters. However, wpt_tests seems to want
+  // to allow more general cases?
+  // Though wpt_tests does not want limits, let us put one anyhow. If someone
+  // has a domain with over 1MB, we refuse to work on it (safety!).


Should we add this to readme? Sounds like a good feature.

ICU will check for violation of these conditions...

https://github.com/unicode-org/icu/blob/bb0e745e25c99cc57055caf45c81b95ef63b25d4/icu4c/source/common/uts46.cpp#L32

ada/src/parser.cpp

Line 79 in 2a6208e

// A domain name label is longer than 63 bytes.

But the tests (WPT) seem to expect that this be ignored... Search the tests for...

"Label longer than 63 code points",

So it wants to allow longer labels. I am not sure I understand.

anonrig · 2023-01-19T03:26:22Z

src/puny.cpp

+  auto loc = input.rfind('-');
+  uint32_t count = 0;
+  if (loc != std::string_view::npos) {
+    for(uint8_t c : input.substr(0, loc)) {


std::any_of would make this easier to read

anonrig · 2023-01-19T03:27:20Z

src/puny.cpp

+  for (auto iterator = input.begin(); iterator != input.end();) {
+    int start_i = i;
+    int w = 1;
+    for (int k = 36;; k += 36) {


We can use simd for this. Nice 👍

The punycode algorithm is not easily vectorizable. It is designed from the ground up for character-by-character processing. I am not saying it could not be accelerated, but it would be a major challenge.
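To make the sequential dependency concrete, here is a sketch of a decoder written from the pseudocode in RFC 3492 (section 6.2). It is not the code in this PR, and the function names are hypothetical. Each digit updates `i` and `w`, the threshold `t` depends on a `bias` that is only recomputed after the previous code point is finished, and each insertion position depends on everything decoded so far. The sketch does not reject surrogates or values above U+10FFFF, which a real implementation would have to do.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <string_view>

namespace sketch {

// Parameters from RFC 3492, section 5.
constexpr uint32_t base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700;
constexpr uint32_t initial_bias = 72, initial_n = 128;

// Bias adaptation (RFC 3492, section 6.1).
inline uint32_t adapt(uint32_t delta, uint32_t num_points, bool first_time) {
  delta = first_time ? delta / damp : delta / 2;
  delta += delta / num_points;
  uint32_t k = 0;
  while (delta > ((base - tmin) * tmax) / 2) {
    delta /= base - tmin;
    k += base;
  }
  return k + ((base - tmin + 1) * delta) / (delta + skew);
}

// Maps a basic code point to its digit value; returns base if invalid.
inline uint32_t decode_digit(char c) {
  if (c >= 'a' && c <= 'z') return uint32_t(c - 'a');
  if (c >= 'A' && c <= 'Z') return uint32_t(c - 'A');
  if (c >= '0' && c <= '9') return uint32_t(c - '0') + 26;
  return base;
}

// Decodes one punycode label (without the "xn--" prefix) to code points.
inline std::optional<std::u32string> punycode_decode(std::string_view input) {
  constexpr uint32_t maxint = std::numeric_limits<uint32_t>::max();
  std::u32string out;
  std::size_t pos = 0;
  // Everything before the last '-' is copied verbatim (basic code points).
  std::size_t delim = input.rfind('-');
  if (delim != std::string_view::npos) {
    for (char c : input.substr(0, delim)) {
      if (static_cast<unsigned char>(c) >= 0x80) return std::nullopt;
      out.push_back(static_cast<char32_t>(c));
    }
    pos = delim + 1;
  }
  uint32_t n = initial_n, i = 0, bias = initial_bias;
  while (pos < input.size()) {
    uint32_t old_i = i, w = 1;
    // Decode one variable-length integer, digit by digit.
    for (uint32_t k = base;; k += base) {
      if (pos == input.size()) return std::nullopt;  // truncated integer
      uint32_t digit = decode_digit(input[pos++]);
      if (digit >= base) return std::nullopt;
      if (digit > (maxint - i) / w) return std::nullopt;  // overflow
      i += digit * w;
      uint32_t t = k <= bias ? tmin : (k >= bias + tmax ? tmax : k - bias);
      if (digit < t) break;
      if (w > maxint / (base - t)) return std::nullopt;  // overflow
      w *= base - t;
    }
    uint32_t len = uint32_t(out.size()) + 1;
    bias = adapt(i - old_i, len, old_i == 0);
    if (i / len > maxint - n) return std::nullopt;  // overflow
    n += i / len;
    i %= len;
    out.insert(out.begin() + i, static_cast<char32_t>(n));
    ++i;
  }
  return out;
}

}  // namespace sketch
```

For example, `punycode_decode("bcher-kva")` yields the code points of "bücher".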

anonrig · 2023-01-19T03:29:43Z

src/parser.cpp

-
-    out.value().resize(length);
+    out = ada::puny::convert_domain_to_puny(input, be_strict);
+    if(!out.has_value()) { return false; }
    if(std::any_of(out.value().begin(), out.value().end(), ada::unicode::is_forbidden_domain_code_point)) {


Can we move this inside the punycode?

Yes, this line is probably useless.

anonrig · 2023-01-19T03:30:44Z

src/puny.cpp

+  return false;
+}
+
+bool is_ignorable(uint32_t code_point) noexcept {


Can you add documentation for this?

It depends on the unicode version you want to implement. For Unicode 15, the table is there:

https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt

So you have mapped values (you need to transform them from one code point to another), valid values (you just keep them) and forbidden values (you must produce an error).
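As an illustration of what a table-driven `is_ignorable` might look like, here is a sketch using a sorted range table and a binary search. The table is a tiny hand-picked excerpt of IdnaMappingTable.txt, not the full data; the names `status`, `range`, `table` and `lookup` are hypothetical; and treating every unlisted code point as disallowed is a placeholder that only makes sense because the excerpt is incomplete. The real table also has statuses such as "ignored" and "deviation", and mapped entries carry a target sequence that this sketch leaves out.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

namespace sketch {

enum class status { valid, ignored, mapped, disallowed };

struct range {
  uint32_t first;
  uint32_t last;
  status st;
};

// Sorted, non-overlapping excerpt of the Unicode 15 IDNA mapping table.
constexpr std::array<range, 5> table = {{
    {0x0041, 0x005A, status::mapped},      // A-Z map to a-z
    {0x0061, 0x007A, status::valid},       // a-z
    {0x0080, 0x009F, status::disallowed},  // C1 controls
    {0x00AD, 0x00AD, status::ignored},     // soft hyphen is dropped
    {0x3002, 0x3002, status::mapped},      // ideographic full stop maps to '.'
}};

// Finds the last range starting at or before cp and checks that cp is in it.
inline status lookup(uint32_t cp) {
  auto it = std::upper_bound(
      table.begin(), table.end(), cp,
      [](uint32_t value, const range& r) { return value < r.first; });
  if (it == table.begin()) return status::disallowed;  // placeholder default
  --it;
  return cp <= it->last ? it->st : status::disallowed;  // placeholder default
}

inline bool is_ignorable(uint32_t cp) { return lookup(cp) == status::ignored; }

}  // namespace sketch
```

A generated table covering the whole file would follow the same lookup pattern, with the table produced by a script from the Unicode data rather than written by hand.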

lemire · 2023-01-19T18:41:41Z

I am going to close this for now.

lemire added 2 commits January 19, 2023 02:44

Leaving ICU behind.

51d52bd

Adding empty line.

1229621

anonrig reviewed Jan 19, 2023


lemire closed this Jan 19, 2023

lemire mentioned this pull request Jan 27, 2023

Write our to_ascii function (drop ICU dependency) #89

Closed

anonrig deleted the dlemire/leaving_icu branch May 25, 2023 17:54
