['m', 'ɔ', 'b', 'i', 'l', 'i', 'z', 'ə', '̀', 'ʀ', 'ɑ̃']
m ɔ b i l i z ə ̀ ʀ ɑ̃
m~ɔ~b~i~l~i~z~ə~̀~ʀ~ɑ̃
When using a space as the delimiter, the diacritic attaches itself to the next letter. When using any other delimiter, such as a tilde, an extra delimiter is output and the diacritic then modifies that delimiter (a tilde in this case, but the same happens with any chosen delimiter).
This happens in other languages as well; so far I've tried Portuguese and Italian, with the same result.
Is this expected behavior, or is there some kind of trick I am unaware of? To my understanding, a diacritic is not an additional phoneme but a modifier. I also understand that Unicode places combining diacritics after the base character they modify, so is this perhaps an encoding issue?
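One way to check the encoding question is to inspect the code points directly. The sketch below is only an illustration: it assumes the schwa and the accent in the list output above are U+0259 and U+0300 (COMBINING GRAVE ACCENT), and the `tokens` list is a hand-built excerpt, not real library output. It shows that the accent is a separate nonspacing mark (category `Mn`) and that joining the items with a delimiter leaves the mark directly after the delimiter, which is why it renders on the tilde.

```python
import unicodedata

# Hand-built excerpt of the list output above (assumed code points):
# the schwa and its combining grave accent arrive as two separate items.
tokens = ["z", "\u0259", "\u0300", "\u0280"]

# Print code point, general category, and Unicode name of every character.
for tok in tokens:
    for ch in tok:
        print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

# Joining with a delimiter puts "~" right before the combining mark,
# so the mark visually attaches to the delimiter instead of the schwa.
joined = "~".join(tokens)
mark_index = joined.index("\u0300")
preceding = joined[mark_index - 1]
print(repr(preceding))
```

If the mark were emitted in the same item as its base character, the delimiter would never sit between the two.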
Ok, for anyone facing this same issue, I have written a solution for postprocessing the delimited strings:
import unicodedata

def split_ipa(transliterated_text, delimiter='|'):
    # Split the string based on the specified delimiter
    parts = transliterated_text.split(delimiter)
    # Initialize an empty list to hold the corrected segments
    corrected_parts = []
    # Loop through the parts to reattach any diacritics to their base character
    for part in parts:
        # Guard against empty parts (e.g. from doubled delimiters) before
        # indexing; 'Mn' is the Unicode category for nonspacing combining marks
        if part and corrected_parts and unicodedata.category(part[0]) == 'Mn':
            # If the part starts with a diacritic, attach it to the previous part
            corrected_parts[-1] += part
        else:
            # Otherwise, add the part to the list as a new segment
            corrected_parts.append(part)
    return corrected_parts
Now if you run the following code the delimited string is correctly split:
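A minimal sketch of such a check, reusing the French example with `~` as the delimiter. The input string is rebuilt from explicit escapes, which is an assumption about the exact code points in the output shown earlier, and `split_ipa` is repeated here only so the snippet runs on its own.

```python
import unicodedata

def split_ipa(transliterated_text, delimiter='|'):
    """Split on the delimiter, then glue stray combining marks back on."""
    corrected_parts = []
    for part in transliterated_text.split(delimiter):
        if part and corrected_parts and unicodedata.category(part[0]) == 'Mn':
            # Part starts with a nonspacing mark: attach it to the previous segment
            corrected_parts[-1] += part
        else:
            corrected_parts.append(part)
    return corrected_parts

# Assumed code points for the tilde-delimited French output:
# U+0254 open o, U+0259 schwa, U+0300 combining grave,
# U+0280 small capital R, U+0251 + U+0303 nasalized alpha.
raw_items = ["m", "\u0254", "b", "i", "l", "i", "z",
             "\u0259", "\u0300", "\u0280", "\u0251\u0303"]
delimited = "~".join(raw_items)

segments = split_ipa(delimited, delimiter="~")
print(segments)
print(len(raw_items), "items in,", len(segments), "segments out")
```

The combining tilde on the final vowel (U+0303) is a different character from the ASCII `~` delimiter, so it is not affected by the split.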
I've run into issues in several languages where diacritics lead to strange behavior. The outputs shown at the top of this issue come from a French example.