Add short regex pattern compatible with ES2018 that matches whatever emoji are supported natively #3

Closed
slevithan opened this issue Jun 21, 2024 · 3 comments

@slevithan
slevithan commented Jun 21, 2024

I'm trying to write a version of \p{RGI_Emoji} (for use in fabian-hiller/valibot#666 and elsewhere) that is compatible with ES2018 and does not rely on a giant listing of code points. I'm okay with the list of emoji being tied to whatever version of Unicode that the JS environment supports natively. I'm finding the Unicode spec not very easy to follow for this purpose.

Here's what I have so far:

/(?:(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F\u20E3?)(?:\u200D(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F))*|[\u{1F1E6}-\u{1F1FF}]{2}|\u{1F3F4}[\u{E0061}-\u{E007A}]{5}\u{E007F})/u

This matches every emoji from the full RGI_Emoji list in this repo (here). It also correctly excludes things that are matched by \p{Emoji} like digits, *, and symbols (e.g. 👁 and 🏳) that are only emoji when followed by U+FE0F (VS16). And it correctly excludes things like bare U+200D (ZWJ) matched by \p{Emoji_Component}.
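
As a rough check of the claims above, the following sketch wraps the pattern in anchors so that only whole-string matches count, then tries a few inputs. The names `wholeEmoji` and `isSingleEmoji` are hypothetical helpers for this illustration and are not part of any library. Results assume a JS engine with ES2018 Unicode property escapes.

```javascript
// Pattern copied verbatim from the comment above.
const emoji = /(?:(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F\u20E3?)(?:\u200D(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F))*|[\u{1F1E6}-\u{1F1FF}]{2}|\u{1F3F4}[\u{E0061}-\u{E007A}]{5}\u{E007F})/u;

// Hypothetical helper: anchor the pattern so a partial match does not count.
const wholeEmoji = new RegExp('^(?:' + emoji.source + ')$', 'u');
const isSingleEmoji = (s) => wholeEmoji.test(s);

// Expected to match:
console.log(isSingleEmoji('\u{1F600}'));                             // grinning face
console.log(isSingleEmoji('\u2764\uFE0F'));                          // red heart with VS16
console.log(isSingleEmoji('1\uFE0F\u20E3'));                         // keycap 1
console.log(isSingleEmoji('\u{1F441}\uFE0F'));                       // eye with VS16
console.log(isSingleEmoji('\u{1F469}\u200D\u{1F469}\u200D\u{1F467}')); // family ZWJ sequence

// Expected not to match:
console.log(isSingleEmoji('1'));          // bare digit
console.log(isSingleEmoji('*'));          // bare asterisk
console.log(isSingleEmoji('\u{1F441}'));  // eye without VS16
console.log(isSingleEmoji('\u{1F3F3}'));  // white flag without VS16
console.log(isSingleEmoji('\u200D'));     // bare ZWJ
```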

The one issue seems to be that it matches some \p{Emoji_Modifier_Base} code points that are not technically emoji without a following VS16, which \p{RGI_Emoji} therefore does not match on their own. The one example I've found is U+1F575 (🕵). I don't know if there are other cases like this, but I suspect there are. The complicating factor is that there are other emoji like 👂 (U+1F442), 🤘 (U+1F918), and 💃 (U+1F483) that are matched by \p{Emoji_Modifier_Base} and do not use/require a following VS16.

So, two questions:

  1. Is there a general way to fix this pattern to exclude symbols like U+1F575 that are matched by \p{Emoji_Modifier_Base} but are not matched by \p{RGI_Emoji}?
  2. If so, would it make sense to include such an ES2018-compatible regex in this library?
@mathiasbynens
Owner

This is a fascinating puzzle :)

> Is there a general way to fix this pattern to exclude symbols like U+1F575 that are matched by \p{Emoji_Modifier_Base} but are not matched by \p{RGI_Emoji}?

I can’t think of one. cc @markusicu

I’ll note that for production apps, I’ve learned the hard way that RGI_Emoji does not fully meet user expectations w.r.t. what is / isn’t an emoji, and so I’ve stopped using this package in favor of emoji-test-regex-pattern based on Unicode’s emoji-test.txt. Here’s its current list of special cases that cannot easily be expressed generally: https://github.com/mathiasbynens/emoji-test-regex-pattern/blob/b702b8672f70010966305501791a5cb1f7ba07a5/script/get-sequences.js#L1-L53 The same might apply to your use case.

@slevithan
Author

slevithan commented Jun 27, 2024

That's helpful! Okay, so I guess I'm actually more interested in matching the same emoji list as emoji-test-regex-pattern, but in a general way. And cool, my pattern already matches all of the special cases in emoji-test-regex-pattern, because they all follow standard emoji sequence patterns. And while the special cases in that list allow e.g. '\u{1F93C}\u{1F3FB}' (wrestlers: light skin), my pattern additionally allows things like '\u{1F93C}\u{1F3FB}\u200D\u2640\uFE0F' (women wrestling: light skin), which indeed has emoji designs on many platforms.
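
To illustrate why the wrestlers examples above match, here is a minimal sketch that rebuilds only the ZWJ-sequence part of the pattern from a single repeated "element". The names `element` and `zwjSequence` are made up for this illustration, and the keycap, flag, and tag-sequence alternatives are deliberately left out.

```javascript
// One emoji "element": a modifier base with optional skin tone,
// a default-emoji-presentation character, or any emoji followed by VS16.
const element = String.raw`(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F)`;

// An element followed by zero or more ZWJ-joined elements, anchored to the whole string.
const zwjSequence = new RegExp('^' + element + '(?:\\u200D' + element + ')*$', 'u');

console.log(zwjSequence.test('\u{1F93C}\u{1F3FB}'));                   // wrestlers: light skin
console.log(zwjSequence.test('\u{1F93C}\u{1F3FB}\u200D\u2640\uFE0F')); // women wrestling: light skin
console.log(zwjSequence.test('\u{1F93C}\u200D'));                      // trailing ZWJ, not a full sequence
```

Because any element may follow a ZWJ, this structure accepts sequences beyond those listed as RGI, which is the tradeoff described above.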

As for the special case I highlighted of U+1F575 (🕵) without a following U+FE0F (VS16) even though the VS16 is required by \p{RGI_Emoji}, I tested the 134 code points in \p{Emoji_Modifier_Base} as of Unicode 15.1.0, and it turns out this is one of just nine code points like it:

'\u261d \u26f9 \u270c \u270d \u{1f3cb} \u{1f3cc} \u{1f574} \u{1f575} \u{1f590}'
// '☝ ⛹ ✌ ✍ 🏋 🏌 🕴 🕵 🖐'

These are all common-sense emoji despite the lack of a following VS16, and all render as colorful images as compared to monochrome text variants even without VS16 (at least on Windows 11 where I'm currently viewing them). So I think it makes sense for me to include them as "underqualified emoji" exceptions, similar to emoji-test-regex-pattern's list of "overqualified emoji" exceptions (where VS16 is added when it's not expected/needed). Aside: All of these are already matched by emoji-regex.
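
A list like the nine code points above can be approximated without the ES2024 `v` flag. The sketch below is not the method used in this thread (which compared against \p{RGI_Emoji}); it assumes that "modifier base without default emoji presentation" is a reasonable stand-in for "needs VS16 to be RGI", and the variable names are hypothetical. The output depends on the Unicode version built into the JS engine.

```javascript
// Assumption: Emoji_Presentation=No is used as a proxy for "requires VS16".
const isModifierBase = /^\p{Emoji_Modifier_Base}$/u;
const hasEmojiPresentation = /^\p{Emoji_Presentation}$/u;

const underqualified = [];
for (let cp = 0; cp <= 0x10FFFF; cp++) {
  if (cp >= 0xD800 && cp <= 0xDFFF) continue; // skip surrogate code points
  const ch = String.fromCodePoint(cp);
  if (isModifierBase.test(ch) && !hasEmojiPresentation.test(ch)) {
    underqualified.push(cp);
  }
}

// Print the candidates as U+XXXX for comparison with the list above.
console.log(underqualified.map((cp) => 'U+' + cp.toString(16).toUpperCase()).join(' '));
```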

One adjustment I'll make, though: To support all emoji tag sequences (including the Texas flag supported by WhatsApp and OpenMoji), I will change \u{1F3F4}[\u{E0061}-\u{E007A}]{5}\u{E007F} to \u{1F3F4}[\u{E0061}-\u{E007A}]{2}[\u{E0030}-\u{E0039}\u{E0061}-\u{E007A}]{1,3}\u{E007F}.
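
The effect of that change can be seen with a small sketch that compares the two tag-sequence subpatterns in isolation. `tagFlag` is a hypothetical helper that maps ASCII letters and digits into the tag block by adding 0xE0000; `oldTagFlag` and `newTagFlag` are the two subpatterns quoted above with anchors added.

```javascript
const oldTagFlag = /^\u{1F3F4}[\u{E0061}-\u{E007A}]{5}\u{E007F}$/u;
const newTagFlag = /^\u{1F3F4}[\u{E0061}-\u{E007A}]{2}[\u{E0030}-\u{E0039}\u{E0061}-\u{E007A}]{1,3}\u{E007F}$/u;

// Hypothetical helper: black flag + tag characters for `code` + cancel tag.
const tagFlag = (code) =>
  '\u{1F3F4}' +
  [...code].map((c) => String.fromCodePoint(0xE0000 + c.codePointAt(0))).join('') +
  '\u{E007F}';

const england = tagFlag('gbeng'); // five tag letters
const texas = tagFlag('ustx');    // four tag letters

console.log(oldTagFlag.test(england), newTagFlag.test(england)); // both accept England
console.log(oldTagFlag.test(texas), newTagFlag.test(texas));     // only the new pattern accepts Texas
```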

@slevithan
Author

I've published a version of the regex here (with some updates/fixes) as emoji-regex-xs, in case it also represents the right tradeoffs for anyone else's use cases.
