-
-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Leaving ICU behind (proof of concept) #68
Conversation
while (!input.empty()) { | ||
size_t loc_dot = input.find('.'); | ||
size_t loc_full_stop = input.find("\u3002"); // complete as needed | ||
size_t loc = std::min(loc_dot, loc_full_stop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes 2 find operations and a std::min. We can reduce it. First finding either characters and depending on that finding the other. Wonder it will be faster...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the least of your concerns. My code is only an illustration. The specification says...
The domain labels of a domain domain are the result of strictly splitting domain on U+002E (.).
But it is not sufficient, the tests say...
"Ideographic full stop (full-width period for Chinese, etc.) should be treated as a dot. U+3002 is mapped to U+002E (dot)"
And this refers to the following specification...
https://www.unicode.org/reports/tr46/#ToUnicode
I did not implement it, I only sketched it (to show how it might be done).
The ICU implementation is here...
The list of characters that are mapped to the dot include...
- U+3002 (ideographic full stop)
- U+FF0E (fullwidth full stop)
- U+FF61 (halfwidth ideographic full stop)
I only included 3002 here.
The specification says...
In this document, a label is a substring of a domain name. That substring is bounded on both sides by either the start or the end of the string, or any of the following characters, called label-separators:
U+002E ( . ) FULL STOP
U+FF0E ( . ) FULLWIDTH FULL STOP
U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
U+FF61 ( 。 ) HALFWIDTH IDEOGRAPHIC FULL STOP
if (!label_string.has_value()) { | ||
return std::nullopt; | ||
} | ||
if (label_string.value().empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be simplified to label_string.value_or.empty(). Removes the previous branch
// and labels cannot exceed 64 characters. However, wpt_tests seems to want | ||
// to allow more general cases? | ||
// Though wpt_tests does not want limits, let us put one anyhow. If someone | ||
// has a domain with over 1MB, we refuse to work on it (safety!). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add this to readme? Sounds like a good feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ICU will check for violation of these conditions...
Line 79 in 2a6208e
// A domain name label is longer than 63 bytes. |
But the tests (WPT) seem to expect that this be ignored... Search for the tests...
"Label longer than 63 code points",
So it wants to allow longer labels. I am not sure I understand.
auto loc = input.rfind('-'); | ||
uint32_t count = 0; | ||
if (loc != std::string_view::npos) { | ||
for(uint8_t c : input.substr(0, loc)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::any_of would make this easier to read
for (auto iterator = input.begin(); iterator != input.end();) { | ||
int start_i = i; | ||
int w = 1; | ||
for (int k = 36;; k += 36) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use simd for this. Nice 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The punycode algorithm is not easily vectorizable. It is designed from the ground up for character-by-character processing. I am not saying it could not be accelerated, but it would be a major challenge.
|
||
out.value().resize(length); | ||
out = ada::puny::convert_domain_to_puny(input, be_strict); | ||
if(!out.has_value()) { return false; } | ||
if(std::any_of(out.value().begin(), out.value().end(), ada::unicode::is_forbidden_domain_code_point)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this inside the punycode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this line is probably useless.
return false; | ||
} | ||
|
||
bool is_ignorable(uint32_t code_point) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a documentation for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the unicode version you want to implement. For Unicode 15, the table is there:
https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt
So you have mapped values (you need to transform it from one code point to another), valid values (you just keep it) and forbidden values (you must produce an error).
I am going to close this for now. |
This is not yet quite correct, but it passes 'most' of our tests (it is currently only failing
toascii_encoding
). It requires much more work but this PR shows how we could do away with the ICU dependency by implementing just what we need. Essentially, we need these components...RFC 3454 tells us that we need to
Reference to the standard: https://www.unicode.org/reports/tr46/#ToUnicode
You can verify that it is 'relatively' far along... comment out
toascii_encoding()
and it works...However, I stress that the unicode validation and processing is only a sketch. It needs to be completed. It is not conceptually very hard. I did not look into how people implement these things. But it is mostly a matter of having the right tables. I rolled my own for now.
Whether we want to go down this route is up to us. Having fewer dependencies is clearly a positive. However, rolling our own requires us to do more work and more testing.