[espeak-ng:master] reported: Added initial disability related emojis from Unicode 12.0. #github

espeak-ng@groups.io Integration <espeak-ng@...>
 

[espeak-ng:master] New Comment on Pull Request #638 Added initial disability related emojis from Unicode 12.0.
By valdisvi:

List of all emoji descriptions can be found here: https://github.com/unicode-org/cldr/blob/master/common/annotations/ (should filter for type="tts" entries).
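
As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes the annotation files use `<annotation cp="..." type="tts">` elements; the sample XML and the function name `tts_entries` are made up for the example and are not part of espeak-ng or the CLDR tooling.

```python
import xml.etree.ElementTree as ET

# Illustrative sample in the shape of a CLDR annotations file (not real data).
SAMPLE = """<ldml>
  <annotations>
    <annotation cp="😀">face | grin | grinning face</annotation>
    <annotation cp="😀" type="tts">grinning face</annotation>
    <annotation cp="🐶">dog | face | pet</annotation>
    <annotation cp="🐶" type="tts">dog face</annotation>
  </annotations>
</ldml>"""


def tts_entries(xml_text):
    """Return {character: spoken name} for the type="tts" annotations only."""
    root = ET.fromstring(xml_text)
    return {
        node.get("cp"): (node.text or "").strip()
        for node in root.iter("annotation")
        if node.get("type") == "tts"  # skip the keyword-only annotations
    }


if __name__ == "__main__":
    for character, name in tts_entries(SAMPLE).items():
        print(character, name)
```

The keyword annotations (the ones without a type attribute) are dropped, so each character maps to a single name.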

espeak-ng@groups.io Integration <espeak-ng@...>
 

[espeak-ng:master] New Comment on Pull Request #638 Added initial disability related emojis from Unicode 12.0.
By valdisvi:

I found there is already an emoji script to generate the lists. I'll try to update all the lists.

espeak-ng@groups.io Integration <espeak-ng@...>
 

[espeak-ng:master] New Comment on Pull Request #638 Added initial disability related emojis from Unicode 12.0.
By rhdunn:

I generate the initial list in the English files from e.g. https://www.unicode.org/Public/emoji/12.0/emoji-test.txt, then use that script with a stable version of CLDR downloaded locally. I haven't found an easy way to maintain this other than comparing it with the previous version (e.g. via the diff command).
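
For illustration, extracting the initial list from emoji-test.txt could look roughly like the Python sketch below. It assumes the 12.0 line layout of "code points ; status # emoji name" and keeps only fully-qualified entries; other versions of the file may be laid out differently. The function name `parse_emoji_test` and the sample lines are hypothetical, and this is not the script used in the repository.

```python
# Sample lines in the assumed emoji-test.txt 12.0 layout.
SAMPLE = """\
# group: Smileys & Emotion
# subgroup: face-smiling
1F600                                      ; fully-qualified     # 😀 grinning face
263A FE0F                                  ; fully-qualified     # ☺️ smiling face
263A                                       ; unqualified         # ☺ smiling face
"""


def parse_emoji_test(text):
    """Return a list of (emoji, name) pairs for fully-qualified entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or group/subgroup comment
        fields, _, comment = line.partition("#")
        codepoints, _, status = fields.partition(";")
        if status.strip() != "fully-qualified":
            continue
        emoji = "".join(chr(int(cp, 16)) for cp in codepoints.split())
        # The comment looks like " 😀 grinning face": drop the emoji, keep the name.
        name = comment.strip().split(" ", 1)[1]
        entries.append((emoji, name))
    return entries


if __name__ == "__main__":
    for emoji, name in parse_emoji_test(SAMPLE):
        print(f"{emoji}\t{name}")
```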

NOTE: I would prefer the different versions of the emoji file (currently 12.0 and 12.1) to be added in separate commits to make it easier to track updates. Likewise for the CLDR updates (although the various translations should all be added together). New emoji files should be added separately using the last version of the CLDR used.

The emoji script is called in https://github.com/espeak-ng/espeak-ng/blob/master/Makefile.am#L436 for creating the other _emoji files.

You should be able to do something like:

mv dictsource/en_emoji{,.bak}
rm dictsource/*_emoji
mv dictsource/en_emoji{.bak,}
CLDR_PATH=/opt/cldr/35.1 make

That updates all the translations, including any new ones added in the en_emoji file.

It will also use the data/annotationsEspeak file to pick up additional symbol translations. My intention is ultimately to have all the various symbol names defined there for the supported languages. It could also be used for emoji files for languages not supported in the CLDR (although I would prefer they get added there).

espeak-ng@groups.io Integration <espeak-ng@...>
 

[espeak-ng:master] New Comment on Pull Request #638 Added initial disability related emojis from Unicode 12.0.
By valdisvi:

That seems quite complicated (probably for historical reasons) and mixes compilation of dictionaries with creation of source (i.e. xx_emoji) files. Why not generate the xx_emoji files directly from the XML files of a cloned CLDR project, using a script in tools? E.g. for every language in the dictsource folder, concatenate the type="tts" entries from the ../annotations/xx.xml and .../annotationsDerived/xx.xml files into the xx_emoji file. Compilation would then just compile the xx_emoji files into dictionary files, like all the other source files (e.g. xx_list, xx_rules, xx_extra).
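
A minimal Python sketch of such a script is shown below, under several assumptions: languages are discovered from the xx_list files in dictsource, the CLDR file has the same base name as the espeak language, and the output is one tab-separated "character, name" line per entry. The function names and the output layout are hypothetical and are not taken from the existing emoji script.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_tts(xml_path):
    """Return {character: name} for type="tts" entries, or {} if the file is missing."""
    if not xml_path.is_file():
        return {}
    root = ET.parse(xml_path).getroot()
    return {
        node.get("cp"): (node.text or "").strip()
        for node in root.iter("annotation")
        if node.get("type") == "tts"
    }


def generate_emoji_files(dictsource, cldr_common):
    """Write xx_emoji for each xx_list language that has CLDR tts data."""
    dictsource, cldr_common = Path(dictsource), Path(cldr_common)
    written = []
    for list_file in sorted(dictsource.glob("*_list")):
        lang = list_file.name[: -len("_list")]
        entries = {}
        # Assumption: the CLDR file name matches the espeak language name.
        for folder in ("annotations", "annotationsDerived"):
            entries.update(read_tts(cldr_common / folder / f"{lang}.xml"))
        if not entries:
            continue  # no CLDR data under this name
        lines = [f"{cp}\t{name}" for cp, name in sorted(entries.items())]
        out_path = dictsource / f"{lang}_emoji"
        out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(lang)
    return written
```

A call such as `generate_emoji_files("dictsource", "/path/to/cldr/common")` (paths are placeholders) would return the list of languages for which a file was written.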

espeak-ng@groups.io Integration <espeak-ng@...>
 

[espeak-ng:master] New Comment on Pull Request #638 Added initial disability related emojis from Unicode 12.0.
By rhdunn:

The generation of the xx_emoji files is effectively what the Makefile is doing -- those rules are there for maintainers when generating the files. If you can make that process easier, go ahead.

IIRC, not all CLDR files represent languages supported in espeak/espeak-ng, and they can have different names in the CLDR to the ones in espeak. Thus, if a new CLDR file appears with tts entries, it is not necessarily easy to automate adding it.

As for basing them on the en_emoji file, that is so they are in the correct order and grouped. That could be handled by the script, as the ordering is in unicode character order and the grouping is by the unicode Block property. The additional groups (country flags, keycaps, family, gendered sequences, etc.) are harder to group in an automated way, but could be doable.
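
To illustrate the ordering and grouping idea, a small Python sketch follows. Python's unicodedata module does not expose the Block property, so the block ranges are hard-coded here for a handful of blocks; a real script would more likely read them from the Unicode Blocks.txt file. The names `BLOCKS`, `block_of` and `group_by_block` are hypothetical, and multi-character sequences (flags, keycaps, families) are simply grouped by their first character, which is not how the hand-maintained groups work.

```python
# A few Unicode block ranges, hard-coded for the sketch (start, end, block name).
BLOCKS = [
    (0x2600, 0x26FF, "Miscellaneous Symbols"),
    (0x2700, 0x27BF, "Dingbats"),
    (0x1F300, 0x1F5FF, "Miscellaneous Symbols and Pictographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
    (0x1F680, 0x1F6FF, "Transport and Map Symbols"),
    (0x1F900, 0x1F9FF, "Supplemental Symbols and Pictographs"),
]


def block_of(emoji):
    """Return the block name of the first character, or "Other" if not listed."""
    first = ord(emoji[0])
    for start, end, name in BLOCKS:
        if start <= first <= end:
            return name
    return "Other"


def group_by_block(entries):
    """Sort (emoji, name) pairs in code point order and group them by block."""
    groups = {}
    # Python compares strings by code point, so sorted() gives character order.
    for emoji, name in sorted(entries):
        groups.setdefault(block_of(emoji), []).append((emoji, name))
    return groups


if __name__ == "__main__":
    sample = [("🚀", "rocket"), ("😀", "grinning face"), ("☂", "umbrella")]
    for block, items in group_by_block(sample).items():
        print(f"// {block}")
        for emoji, name in items:
            print(f"{emoji}\t{name}")
```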

I would start by generating the en_emoji file from the CLDR 33.1 data, as that is the version espeak currently uses. You should ideally not see any differences. Likewise when extending it to the other languages. You should then be able to update it.
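
One way to run that check is with the diff command, or with a few lines of Python as sketched below. The helper name `emoji_file_diff` and the file labels are hypothetical; an empty result means the regenerated text matches the current file.

```python
import difflib


def emoji_file_diff(current_text, regenerated_text):
    """Return unified-diff lines between the two texts ([] if identical)."""
    return list(
        difflib.unified_diff(
            current_text.splitlines(),
            regenerated_text.splitlines(),
            fromfile="en_emoji (current)",
            tofile="en_emoji (regenerated)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    current = "😀\tgrinning face\n"
    regenerated = "😀\tgrinning face\n🦮\tguide dog\n"
    for line in emoji_file_diff(current, regenerated):
        print(line)
```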

Another thing to note is that the CLDR has various locales for the emoji names (e.g. en_AU, fr_CA, and de_CH). Ideally, the ones that are different from the base locale should be included as variants in the xx_emoji file. Currently, there are no variant numbers allocated for some of the locales, so these would need to be added as appropriate.
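
Finding the names that differ from the base locale can be sketched in Python as below. The data is invented for illustration and is not taken from the CLDR, and `locale_overrides` is a hypothetical helper; how the result would be written into the xx_emoji file as variants is left open.

```python
def locale_overrides(base, variant):
    """Return the variant-locale entries whose name differs from the base locale."""
    return {cp: name for cp, name in variant.items() if base.get(cp) != name}


if __name__ == "__main__":
    # Invented example names, standing in for e.g. en and en_AU data.
    base = {"A": "first name", "B": "second name"}
    variant = {"A": "first name", "B": "regional name", "C": "extra name"}
    print(locale_overrides(base, variant))
```

Entries that exist only in the variant locale are kept as well, since they have no base name to fall back on.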

What I would like to be able to do is have lines like "test tEst $lang=en-CA" (I'm not sure about the syntax for that) and use them in the emoji files, so that country-dependent variants of the emoji names (and other things like letter pronunciations) could be defined more easily. I have various issues open relating to supporting this.