[espeak-ng:master] reported: Fixing sequences of ? and ! #github


espeak-ng@groups.io Integration <espeak-ng@...>
 

[espeak-ng:master] New Comment on Pull Request #747 Fixing sequences of ? and !
By BenTalagan:

I do agree with you, all the more so because the tokenization source code is currently really hard to read and often mixes language-specific logic together. A clean and customizable tokenization algorithm would be great, as it would also make it possible to provide structured output from espeak-ng. This PR is really an emergency hack to fix an annoying corner case.


espeak-ng@groups.io Integration <espeak-ng@...>
 

[espeak-ng:master] New Comment on Pull Request #747 Fixing sequences of ? and !
By rhdunn:

I think it makes sense to patch the current logic to address current bugs. My main comment on this fix is that it would be better to check the ucd_properties value, so that other Unicode question and exclamation marks are handled too. That value is read/used by punct_data below this check, so it would be better to make use of it: test punct_data & (CLAUSE_QUESTION | CLAUSE_EXCLAMATION) in the if statement, then in the while loop use clause_type_from_codepoint(c2) instead of punct_data when checking the clause flags.
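
For readers following along, the suggested shape of the check could be sketched roughly as below. This is only an illustration, not the code from the PR: the DEMO_ flags are simplified single-bit stand-ins rather than the real CLAUSE_QUESTION / CLAUSE_EXCLAMATION definitions, demo_clause_type_from_codepoint() is a hypothetical hard-coded substitute for clause_type_from_codepoint(), and demo_terminator_run() is an invented helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: simplified single-bit flags, NOT the real
 * espeak-ng CLAUSE_QUESTION / CLAUSE_EXCLAMATION values. */
enum {
    DEMO_CLAUSE_NONE        = 0,
    DEMO_CLAUSE_QUESTION    = 1 << 0,
    DEMO_CLAUSE_EXCLAMATION = 1 << 1
};

/* Hypothetical stand-in for clause_type_from_codepoint(): a small
 * hard-coded table instead of a lookup in Unicode property data. */
static int demo_clause_type_from_codepoint(uint32_t c)
{
    switch (c) {
    case 0x003F: /* QUESTION MARK */
    case 0x061F: /* ARABIC QUESTION MARK */
    case 0xFF1F: /* FULLWIDTH QUESTION MARK */
        return DEMO_CLAUSE_QUESTION;
    case 0x0021: /* EXCLAMATION MARK */
    case 0xFF01: /* FULLWIDTH EXCLAMATION MARK */
        return DEMO_CLAUSE_EXCLAMATION;
    default:
        return DEMO_CLAUSE_NONE;
    }
}

/* Return how many codepoints at the start of text form one run of
 * question/exclamation marks (0 if the first one is not such a mark). */
static size_t demo_terminator_run(const uint32_t *text, size_t len)
{
    const int mask = DEMO_CLAUSE_QUESTION | DEMO_CLAUSE_EXCLAMATION;
    size_t i = 0;
    int punct_data;

    if (len == 0)
        return 0;

    /* The "if" check: test the clause flags of the first character,
     * not the literal characters '?' and '!'. */
    punct_data = demo_clause_type_from_codepoint(text[0]);
    if (punct_data & mask) {
        i = 1;
        /* The "while" loop: re-classify each following codepoint
         * instead of reusing punct_data from the first one. */
        while (i < len && (demo_clause_type_from_codepoint(text[i]) & mask))
            i++;
    }
    return i;
}
```

With this shape, a mixed run such as "?!" followed by a fullwidth question mark is consumed as a single group, because each codepoint is classified by its clause flags rather than compared against ASCII literals.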

The more general changes will still need C code to process the logic (as with number processing, tokenizing SSML/HTML, etc.). It makes sense for the configuration (like the current tr_languages.c code) to live in language configuration files. And the espeak code definitely needs improving w.r.t. Unicode processing. What I'm not sure about is where the balance between configuration and general processing logic will lie.


espeak-ng@groups.io Integration <espeak-ng@...>
 

[espeak-ng:master] New Comment on Pull Request #747 Fixing sequences of ? and !
By BenTalagan:

OK, here's a second attempt based on your remarks. Tests are passing. If the PR looks OK to you, don't forget to squash-merge all commits into one to get rid of the first attempts :-)