fix: use utf8proc to detect emojis #29645

Closed
wants to merge 1 commit into from

dundargoc (Member) commented Jul 10, 2024

More accurately, it will check if the codepoint has the
"Extended_Pictographic" property, which according to
https://unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files is
described as:

"The Extended_Pictographic characters contain all the Emoji characters
except for some Emoji_Component characters." This should in most cases
align with what people refer to when they say "emoji".

dundargoc (Member, Author) commented

I think this new solution should be superior, but it wouldn't hurt if some emoji experts/connoisseurs tested this PR.

github-actions bot added the build and breaking-change labels Jul 10, 2024
clason (Member) commented Jul 10, 2024

How is this change breaking precisely?

dundargoc (Member, Author) commented

> How is this change breaking precisely?

Previously, Neovim counted a codepoint as an emoji if it had any of the properties "Emoji", "Emoji_Presentation", "Emoji_Modifier", "Emoji_Modifier_Base" or "Emoji_Component" from unicode.org/reports/tr51. In this PR, Neovim counts a codepoint as an emoji if it has the "Extended_Pictographic" property.

I am not 100% sure what this entails in practice. Here's a list of changed codepoints. Note that this list is not exhaustive, as not all emojis in emoji-data.txt are explicitly listed.
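To make the difference between the two rules concrete, here is a minimal sketch in C. It does not use Neovim's tables or utf8proc; the `sample` table, the `P_*` flags and the two predicate names are hypothetical, and the three entries were copied by hand from emoji-data.txt purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Bit flags for the TR51 properties discussed above (hypothetical names).
enum {
  P_EMOJI                 = 1 << 0,
  P_EMOJI_PRESENTATION    = 1 << 1,
  P_EMOJI_MODIFIER        = 1 << 2,
  P_EMOJI_MODIFIER_BASE   = 1 << 3,
  P_EMOJI_COMPONENT       = 1 << 4,
  P_EXTENDED_PICTOGRAPHIC = 1 << 5,
};

typedef struct {
  uint32_t cp;
  unsigned props;
} SampleEntry;

// Hand-copied sample for illustration only, NOT a complete property table.
static const SampleEntry sample[] = {
  // U+0023 NUMBER SIGN
  { 0x0023, P_EMOJI | P_EMOJI_COMPONENT },
  // U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2
  { 0x1F3FB, P_EMOJI | P_EMOJI_PRESENTATION | P_EMOJI_MODIFIER | P_EMOJI_COMPONENT },
  // U+1F600 GRINNING FACE
  { 0x1F600, P_EMOJI | P_EMOJI_PRESENTATION | P_EXTENDED_PICTOGRAPHIC },
};

// Look up the flags for a codepoint; anything outside the sample yields 0.
static unsigned sample_props(uint32_t cp)
{
  for (size_t i = 0; i < sizeof sample / sizeof sample[0]; i++) {
    if (sample[i].cp == cp) {
      return sample[i].props;
    }
  }
  return 0;
}

// Old rule as described above: any of the five emoji properties.
bool is_emoji_old(uint32_t cp)
{
  const unsigned mask = P_EMOJI | P_EMOJI_PRESENTATION | P_EMOJI_MODIFIER
                        | P_EMOJI_MODIFIER_BASE | P_EMOJI_COMPONENT;
  return (sample_props(cp) & mask) != 0;
}

// New rule proposed in this PR: Extended_Pictographic only.
bool is_emoji_new(uint32_t cp)
{
  return (sample_props(cp) & P_EXTENDED_PICTOGRAPHIC) != 0;
}
```

In this sample, `#` and the skin-tone modifier are emoji under the old rule but not under the new one, while the grinning face is an emoji under both.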

clason (Member) commented Jul 10, 2024

Could also count as "fix" (not breaking) if we declare utf8proc as the source of truth for this (which arguably we should)?

dundargoc changed the title from "build!: use utf8proc to detect emojis" to "fix: use utf8proc to detect emojis" Jul 10, 2024
zeertzjq (Member) commented

It's strange to replace emoji_all without replacing emoji_wide.

dundargoc (Member, Author) commented

So it turns out it's not possible to determine from the codepoint alone whether a character is an emoji, as there are codepoints that can be either text or emoji depending on the variation selector(?). Closing for the time being until we have a better solution in mind.
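As a rough illustration of why a per-codepoint lookup is not enough, the sketch below looks one codepoint ahead for U+FE0E (VARIATION SELECTOR-15, text presentation) or U+FE0F (VARIATION SELECTOR-16, emoji presentation). The type and function names are hypothetical and not taken from Neovim or utf8proc, and the sketch assumes the base codepoint is one for which a variation sequence is defined; it does not validate that.

```c
#include <stddef.h>
#include <stdint.h>

#define VS15 0xFE0Eu  // VARIATION SELECTOR-15: requests text presentation
#define VS16 0xFE0Fu  // VARIATION SELECTOR-16: requests emoji presentation

typedef enum { PRES_DEFAULT, PRES_TEXT, PRES_EMOJI } Presentation;

// Returns the presentation requested for seq[i] by inspecting the codepoint
// that follows it. Assumption: seq[i] is a valid base for a variation
// sequence; this function does not check that.
Presentation requested_presentation(const uint32_t *seq, size_t len, size_t i)
{
  if (i + 1 < len) {
    if (seq[i + 1] == VS15) {
      return PRES_TEXT;
    }
    if (seq[i + 1] == VS16) {
      return PRES_EMOJI;
    }
  }
  // No selector follows: the codepoint's default presentation applies.
  return PRES_DEFAULT;
}
```

For example, U+2764 (HEAVY BLACK HEART) yields `PRES_TEXT` when followed by U+FE0E and `PRES_EMOJI` when followed by U+FE0F, although the base codepoint is the same in both cases.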

dundargoc closed this Aug 10, 2024
dundargoc deleted the build/utf8proc/emoji branch August 10, 2024 12:13
justinmk added the unicode 💩 label Aug 11, 2024