Skip to content
This repository has been archived by the owner on May 10, 2023. It is now read-only.

[WIP] additionnal lib/cleanup for French language to improve quality of inputs #635

Closed
wants to merge 12 commits into from

Conversation

CapitainFlam
Copy link

better french input for better french output ?

Hi !
⚠️ Firsts commits in GitHub, first pull request ever in a repo, so please neither shoot or shout at me 😅 ⚠️

This commit comes from different discussion, that I propose you to have a look to understand the context.

and it is somehow connected to :

TL;DR:
I'm trying to improve input quality for french sentence collector, to avoid
A/ garbage IN - garbage OUT, and
B/ avoid to remove in the end (in the CorporaCreator) the garbage that went through collecting, recording, review... to finally being dropped and not being included in the final release for training batchs.
Better to remove it as soon as sentence collector.

What have you done ? 😱

To do so, I duplicated a EN to FR file in sentence collector, and modified it according to a previous job made by Nicolas Panel for CorporaCreator (and sadly not commited).

According to discussion (links here... Did you follow it from the list above ?!), it's recommanded that I create a WIP pull request, to allow everyone to comment, throw tomatos and/or additionnal commits to it.

footnote : it can be hard (...it IS hard !!!) to understand REGEX (REGular EXpressions). Do not hesitate to catch up with https://regex101.com/ to understand and test it.

CapitainFlam and others added 2 commits September 6, 2022 22:54
...some of my first commits, and my first steps in JS.
reviews are welcome.
(...please neither shoot or shout at me ;-D )

Co-Authored-By: Nicolas Panel <[email protected]>
@CapitainFlam CapitainFlam changed the title [WIP] [WIP] additionnal lib/cleanup for French language to improve quality of inputs Sep 13, 2022
roman + century numerals,
roman numerals,
full convertion of all ACRONYMES
eleventh, ...
first, second,...
1/2 => 1 sur 2
Comment on lines 75 to 96
.replace(/(^|\s)Ie(r)? s.(\s|\.|,|\?|!|$)/g, ' premier siècle ')
.replace(/(^|\s)II(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' deuxième siècle ')
.replace(/(^|\s)III(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' troisième siècle ')
.replace(/(^|\s)IV(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quatrième siècle ')
.replace(/(^|\s)V(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' cinquième siècle ')
.replace(/(^|\s)VI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' sixième siècle ')
.replace(/(^|\s)VII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' septième siècle ')
.replace(/(^|\s)VIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' huitième siècle ')
.replace(/(^|\s)(VIIII|IX)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' neuvième siècle ')
.replace(/(^|\s)X(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dixième siècle ')
.replace(/(^|\s)XI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' onzième siècle ')
.replace(/(^|\s)XII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' douxième siècle ')
.replace(/(^|\s)XIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' treizième siècle ')
.replace(/(^|\s)(XIIII|XIV)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quatorzième siècle ')
.replace(/(^|\s)XV(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quinzième siècle ')
.replace(/(^|\s)XVI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' seixième siècle ')
.replace(/(^|\s)XVII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix-septième siècle ')
.replace(/(^|\s)XVIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix-huitième siècle ')
.replace(/(^|\s)(XIX|XVIIII)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix_neuvième siècle ')
.replace(/(^|\s)XX(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingtième siècle ')
.replace(/(^|\s)XXI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingt-et-unième siècle ')
.replace(/(^|\s)XXII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingt-deuxième siècle ')
Copy link

@drzraf drzraf Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use a dedicated function like
https://stackoverflow.com/a/9083076 in combination with passing a callback to replace() (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace)

  • You could drop the test for the s. suffix.
  • It would be reusable for the non-ordinal case you handled below

Copy link
Author

@CapitainFlam CapitainFlam Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well,

First,

I have my regex to work with :
-- kudos to gregory-r in this post in stackoverflow, licence CC BY-SA

function deromanize(roman) {
  var r = 0;
  // regular expressions to check if valid Roman Number.
  if (!/^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$/.test(roman))
    throw new Error('Invalid Roman Numeral.');

  roman.replace(/[MDLV]|C[MD]?|X[CL]?|I[XV]?/g, function(i) {
    r += {M:1000, CM:900, D:500, CD:400, C:100, XC:90, L:50, XL:40, X:10, IX:9, V:5, IV:4, I:1}[i]; 
  });

  return r;
}

So, I'll put it here to not forget it, because :

Second,

I don't have a clue of how to implement a function in JS ! (I know Pascal and VB ✌️ , but never learned JS 😭... I have to jump in the rabbit hole 🕳️ 🐇 (it is this way) )

Edit: My feeling about that ? "Why JS? because!" xkcd explined it well.

...To however follow, either the white rabbit, or crumbs, including future me, well, this link
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement
could be a good start to look at.

/me : jumps in the 🕳️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before you jump into that rabbit hole, you might want to check out my other comment regarding the roman numerals, as I don't think we need this at all. I'll also have a look at your other issue and add the documentation to the README for better understanding

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I don't understand that, collecting sentences, we can NOT "correct and clean" before "validation".

no problem, I'll wait for the issue #636 😸

Comment on lines 122 to 135
//first, second, etc.
.replace(/(^|\s)1er?s?(\s|\.|,|\?|!|$)/g, ' premier ')
.replace(/(^|\s)1(e|è)res?(\s|\.|,|\?|!|$)/g, ' premier ')
.replace(/(^|\s)2(e|è)?me?s?(\s|\.|,|\?|!|$)/g, ' deuxième ')
.replace(/(^|\s)2n?ds?(\s|\.|,|\?|!|$)/g, ' second ')
.replace(/(^|\s)2n?des?(\s|\.|,|\?|!|$)/g, ' seconde ')
.replace(/(^|\s)3i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' troisième ')
.replace(/(^|\s)4i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' quatrième ')
.replace(/(^|\s)5i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' cinquième ')
.replace(/(^|\s)6i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' sixième ')
.replace(/(^|\s)7i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' septième ')
.replace(/(^|\s)8i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' huitième ')
.replace(/(^|\s)9i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' neuvième ')
.replace(/(^|\s)10i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' dixième ')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth using a library like this one or this one (although these are not actually handle ordinal but be useful for plain numbers)

That may help other languages too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't know how to import external library, I think it's a bit out of my league right now.

Comment on lines 140 to 167
.replace(/(^|\s)ANPE(\s|\.|,|\?|!|$)/g, ' Agence Nationale Pour l\'Emploi ')
.replace(/(^|\s)APL(\s|\.|,|\?|!|$)/g, ' Aide personnalisée au logement ')
.replace(/(^|\s)CDI(\s|\.|,|\?|!|$)/g, ' Contrat à Durée Indéterminée ')
.replace(/(^|\s)CICE(\s|\.|,|\?|!|$)/g, ' Crédit d\'impôt pour la compétitivité et l\'emploi ')
.replace(/(^|\s)DRH(\s|\.|,|\?|!|$)/g, ' Direction des Ressources Humaines ')
.replace(/(^|\s)EDF(\s|\.|,|\?|!|$)/g, ' Electricité de France ')
.replace(/(^|\s)FN(\s|\.|,|\?|!|$)/g, ' Front National ')
.replace(/(^|\s)HLM(\s|\.|,|\?|!|$)/g, ' Habitation à Loyer Modéré ')
.replace(/(^|\s)IGN(\s|\.|,|\?|!|$)/g, ' Institut Géographique National ')
.replace(/(^|\s)INPI(\s|\.|,|\?|!|$)/g, ' Institut National de la Propriété Intellectuelle ')
.replace(/(^|\s)ISF(\s|\.|,|\?|!|$)/g, ' Impôt sur la fortune ')
.replace(/(^|\s)IUT(\s|\.|,|\?|!|$)/g, ' Institut Universitaire de Technologie ')
.replace(/(^|\s)LREM(\s|\.|,|\?|!|$)/g, ' La Réplublique En Marche ')
.replace(/(^|\s)NUPES(\s|\.|,|\?|!|$)/g, ' Nupès ')
.replace(/(^|\s)PHP(\s|\.|,|\?|!|$)/g, ' Protocole Hypertexte Protocolaire ')
.replace(/(^|\s)PMA(\s|\.|,|\?|!|$)/g, ' Procréation médicalement assistée ')
.replace(/(^|\s)PME(\s|\.|,|\?|!|$)/g, ' Petite et Moyenne Entreprise ')
.replace(/(^|\s)RN(\s|\.|,|\?|!|$)/g, ' Rassemblement National ')
.replace(/(^|\s)RSA(\s|\.|,|\?|!|$)/g, ' Revenu de Solidarité Active ')
.replace(/(^|\s)RSA(\s|\.|,|\?|!|$)/g, ' Revenu de Solidarité Active ')
.replace(/(^|\s)RSI(\s|\.|,|\?|!|$)/g, ' Régime Social des Indépendants ')
.replace(/(^|\s)RTE(\s|\.|,|\?|!|$)/g, ' Réseau de Transport d\'Électricité ')
.replace(/(^|\s)SNCF(\s|\.|,|\?|!|$)/g, ' Société Nationale des Chemins de Fer ')
.replace(/(^|\s)TGV(\s|\.|,|\?|!|$)/g, ' Train à Grande Vitesse ')
.replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ')
.replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ')
.replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ')
.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the last character was a ? or a ! I don't think it should be removed. It serves for the intonation purpose and isn't part of the acronym.

Copy link
Author

@CapitainFlam CapitainFlam Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Est ce que le commit aef4284 répond au problème ?

EDIT : sorry, replied in FR. Translation : does the commit answer the issue ?

.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')

//replace fraction 1/2 => '1 sur 2'
¨.replace(/(^| )(\d+)(\s)?(\/)(\s)?(\d+)(\s|\.|,|\?|!|$)/g, '$2 sur $6')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this one. This could very well be a date. Il est né le 25/12 qui ne devrait pas être remplacé de la sorte et prend priorité sur la regexp suivante (date format mm/yy)

Copy link
Author

@CapitainFlam CapitainFlam Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh shoot ! you're right !
To solve it, we shall test numbers, and see "if it's in date [1..31]/[1..12] (dd/mm) or [1..12]/[00..99] or [1..12]/[1700..2030] (mm/yy(yy)?) range, it shall be a date stamp, otherwise it's a fraction ?!"
Wow! It escalated quickly!
Sometimes, only context will show us that "mettre 1/4 de litre de lait" will go to "mettre un avril de litre de lait"... But it shall be more readable than "mettre de litre de lait".
On the other hand, if we manage poolry date (not checking mm/yy but only mm/yyyy), it will go like : "il est né en 12/88." to "il est né en douze sur quatre-vingt huit." instead of "il est née en décembre quatre-vingt huit."... Again, it's more readable than ""il est né en."

IMHO, [1..31]/[1..12] (dd/mm) or [1..12]/[1700..2030] (mm/yyyy) range shall work fine. The reste will go through 'fraction replacement', and will always be better that 'full removal'.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's first wait for issue #636 resolution, 'coz I don't want to work for nothing :-/

Comment on lines 174 to 175
.replace((^|\s)\d{1,2}\/\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$), ' ') //date format dd/mm/yy ou dd/mm/yyyy
.replace((^|\s)\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$), ' ') //date format mm/yy ou mm/yyyy
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the impact of creating blank in the middle of a sentence after the replacement of a date, I'm a proponent of either:

  • omitting these sentence altogether
  • making deep replacement (spelling date, month, year) which may or may not be practicable.

error: 'Sentence should not contain numbers - les phrases ne doivent pas contenir de nombres',
}, {
regex: /[<>+*#@%^[\]()/]/,
error: 'Sentence should not contain symbols - les phrases de doivent pas contenir de symboles \(\*, \#, \(, etc\)',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

­I suggested substituting "–" (long hyphen) with the common ("-") counterpart and testing against it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit 18fdf37

// as users wouldn't know how to pronounce the uppercase letters.
regex: /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/,
error: 'Sentence should not contain abbreviations - Les phrases ne doivent pas contenir des abréviations ou sigles',
}];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we missing something about the french apostrophe (´)? I don't know how the model would benefit (or not) from this situation. I'm pretty sure text processing would do the uniformisation (before TTS for example) so I believe we should make it too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit 1a0cff5

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, sources files like https://github.com/common-voice/commonvoice-fr/blob/master/CommonVoice-Data/data/debats-assemblee-nationale/20130718093000000.txt are using french apostrophe ?!
(it only shows in "ui-monospace" code font, not the usual "Segoe UI" font.

line 2 : J’appelle maintenant, dans le texte de la commission, les articles du projet de loi.
if I change it myself : 
line 2 : J'appelle maintenant, dans le texte de la commission, les articles du projet de loi.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so... commit 8d7d973

server/lib/cleanup/languages/fr.js Outdated Show resolved Hide resolved
.replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ')
.replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ')
.replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ')
.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: Acronyms are a strange case : They are actually spelled as distincts letters (except for NUPES).

But we don't want "STT" to process R.S.A. as erre et ça but we currently have nothing to train it that way.

Should we expect a specific future dictionary or just not strip some of them in the first place (but remove the dots)?

Copy link
Member

@MichaelKohler MichaelKohler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Firsts commits in GitHub, first pull request ever in a repo, so please neither shoot or shout at me 😅 ⚠️

No worries, you're doing fine :) Thanks for your contributions!

I have a few comments, but generally I think it's a good thing to update these files.

server/lib/cleanup/languages/fr.js Outdated Show resolved Hide resolved
server/lib/cleanup/languages/fr.js Outdated Show resolved Hide resolved
.replace(/(^|\s)(XIX|XVIIII)(\s|\.|,|\?|!|$)/g, ' dix-neuf ')
.replace(/(^|\s)XX(\s|\.|,|\?|!|$)/g, ' vingt ')
.replace(/(^|\s)XXI(\s|\.|,|\?|!|$)/g, ' vingt-et-un ')
.replace(/(^|\s)XXII(\s|\.|,|\?|!|$)/g, ' vingt-deux ')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove all of the roman numerals, as they are not allowed from the beginning (see the abbreviation pattern in the validation file).

.replace(/(^|\s)7i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' septième ')
.replace(/(^|\s)8i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' huitième ')
.replace(/(^|\s)9i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' neuvième ')
.replace(/(^|\s)10i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' dixième ')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be removed, as numbers are not allowed anyway in the Sentence Collector.

.replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ')
.replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ')
.replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ')
.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be removed, as abbreviations are not allowed in the validation file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think I understand wrongly how it works. Let's solve this #636 issue first 😸

.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')

//replace fraction 1/2 => '1 sur 2'
¨.replace(/(^| )(\d+)(\s)?(\/)(\s)?(\d+)(\s|\.|,|\?|!|$)/g, '$2 sur $6')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numbers are not allowed by the validation file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we keep it for "future-proof rules" and/or "bulk load rules if you don't have yours" ?
Let's discuss it here : https://discourse.mozilla.org/t/sentence-collector-cleanup-before-export-vs-cleanup-on-upload/105411/15

//dates, digits and numbers fr-FR cleanup
//todo : CONVERT TO TEXT instead of removing it
.replace((^|\s)\d{1,2}\/\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$), ' ') //date format dd/mm/yy ou dd/mm/yyyy
.replace((^|\s)\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$), ' ') //date format mm/yy ou mm/yyyy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numbers are not allowed by the validation file.

server/lib/validation/languages/fr.js Outdated Show resolved Hide resolved
@MichaelKohler
Copy link
Member

Also note https://discourse.mozilla.org/t/sentence-collector-cleanup-before-export-vs-cleanup-on-upload/105411, though that probably doesn't matter too much for this here :)

of course :)

Co-authored-by: Michael Kohler <[email protected]>
Notepad++ > Edit > transform TABS in SPACE.
Yeahhhhhh
as requested, only keeping FR and removing EN error description...
But adding EN comment, for ^FR readers sake.
We do not remove PUNCTUATION (?!, etc.) after finding them, we just put them back  as we found them.
remplacing – (long hyphen) by - (short hyphen)
Normalize ´ (french apostroph) into ' (usual apostroph)
@drzraf
Copy link

drzraf commented Mar 15, 2023

Any hope to get this in, in one shape or another?

@MichaelKohler
Copy link
Member

The Sentence Collector has now been integrated into the Common Voice platform. Therefore I'm archiving this project here. The validation files now live here and I'm sure it would still benefit from the validation rules being added there: https://github.com/common-voice/common-voice/tree/main/server/src/core/sentences/validation. Unfortunately moving this PR over there is way harder than manually recreating it. Would you mind creating a new PR for this? Sorry for the troubles.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants