Revisit rules to add a lang to ZIM Language metadata #212

Closed
benoit74 opened this issue Jun 28, 2024 · 0 comments · Fixed by #230

The languages advertised in a TED ZIM are now based on whether the videos have audio or subtitles in a given language.

This computation is done in the compute_zim_languages function at

ted/src/ted2zim/scraper.py

Lines 343 to 397 in 9fc26d4

def compute_zim_languages(self):
    """Compute the ZIM language metadata based on expected videos"""
    # count the number of videos per audio language
    audio_lang_counts = {
        lang: len(list(group))
        for lang, group in groupby(
            [video["native_talk_language"] for video in self.videos]
        )
    }
    # count the number of videos per subtitle language
    subtitle_lang_counts = {
        lang: len(list(group))
        for lang, group in groupby(
            [
                subtitle["languageCode"]
                for video in self.videos
                for subtitle in video["subtitles"]
            ]
        )
    }
    # Attribute 10 "points" score to language in video audio and 1 "point" score
    # to language in video subtitle
    scored_languages = {
        k: 10 * audio_lang_counts.get(k, 0) + subtitle_lang_counts.get(k, 0)
        for k in list(audio_lang_counts.keys()) + list(subtitle_lang_counts.keys())
    }
    sorted_ted_languages = [
        lang_code
        for lang_code, _ in sorted(
            scored_languages.items(), key=lambda item: -item[1]
        )
    ]
    # compute the mappings from TED to ISO639-3 code and set ZIM language
    mapping = tedlang.ted_to_iso639_3_langcodes(sorted_ted_languages)
    self.zim_languages = ",".join(
        [mapping[code] for code in sorted_ted_languages if mapping[code]]
    )
    # Display a clear warning on languages which have been ignored due to missing
    # ISO639-3 codes
    ignored_ted_codes = [code for code in sorted_ted_languages if not mapping[code]]
    if len(ignored_ted_codes):
        logger.warning(
            "Some languages have not been added to ZIM metadata due to missing "
            f"ISO639-3 code: {ignored_ted_codes}"
        )
    if not self.disable_metadata_checks:
        # Validate ZIM languages
        validate_language("Language", self.zim_languages)

To sort the languages in the list, we attribute 10 points to each audio occurrence and 1 point to each subtitle occurrence.
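To make the scoring concrete, here is a minimal standalone sketch of the same 10/1 weighting. It is an illustration, not the scraper's code: the function name `score_languages`, the sample data, and the alphabetical tie-break are assumptions. It counts with `collections.Counter`, because `itertools.groupby` only merges consecutive equal items and therefore counts correctly only on sorted input.

```python
from collections import Counter


def score_languages(videos):
    """Rank language codes: 10 points per audio occurrence, 1 per subtitle."""
    audio_counts = Counter(video["native_talk_language"] for video in videos)
    subtitle_counts = Counter(
        subtitle["languageCode"]
        for video in videos
        for subtitle in video["subtitles"]
    )
    scores = {
        lang: 10 * audio_counts[lang] + subtitle_counts[lang]
        for lang in audio_counts.keys() | subtitle_counts.keys()
    }
    # Highest score first; ties broken alphabetically (an assumption made
    # here only to keep the output deterministic).
    return sorted(scores, key=lambda lang: (-scores[lang], lang))


# Hypothetical sample data using the same keys as the snippet above.
sample_videos = [
    {
        "native_talk_language": "en",
        "subtitles": [{"languageCode": "fr"}, {"languageCode": "es"}],
    },
    {"native_talk_language": "en", "subtitles": [{"languageCode": "fr"}]},
    {"native_talk_language": "fr", "subtitles": [{"languageCode": "en"}]},
]
ranked = score_languages(sample_videos)
print(ranked)
```

With this sample, "en" scores 21 (two audio, one subtitle), "fr" scores 12 and "es" scores 1.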

We want to revisit this so that only languages present in at least 30% or 50% of the videos are considered. The appropriate percentage is still undecided.

The plan is therefore:

  • add a new CLI argument taking a float between 0 and 1 (0.5 meaning 50%, and so on); the default value should be 0.5 (50%)
  • use this threshold in compute_zim_languages so that only languages present in at least that share of the videos (50% by default) are used to compute the list of languages; a language counts as present in a video if it is available as audio or as a subtitle
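The two steps of the plan could look roughly like the sketch below. It assumes the same video dictionaries as compute_zim_languages; the flag name `--languages-threshold` and the helpers `threshold_type` and `languages_above_threshold` are hypothetical, chosen only for illustration.

```python
import argparse
from collections import Counter


def threshold_type(value):
    """argparse type: accept a float between 0 and 1 inclusive."""
    number = float(value)
    if not 0 <= number <= 1:
        raise argparse.ArgumentTypeError("threshold must be between 0 and 1")
    return number


def languages_above_threshold(videos, threshold):
    """Return languages present (audio or subtitle) in at least
    `threshold` (a ratio between 0 and 1) of the videos."""
    if not videos:
        return set()
    presence = Counter()
    for video in videos:
        # Use a set so a language is counted at most once per video,
        # even if it is both the audio and a subtitle language.
        langs = {video["native_talk_language"]}
        langs.update(sub["languageCode"] for sub in video["subtitles"])
        presence.update(langs)
    return {
        lang for lang, count in presence.items() if count / len(videos) >= threshold
    }


# Hypothetical flag name; the real CLI may choose a different one.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--languages-threshold",
    type=threshold_type,
    default=0.5,
    help="minimum share of videos (0 to 1) a language must appear in",
)

# Hypothetical sample data: "en" is in 4/4 videos, "fr" in 3/4, "es" in 1/4.
sample = [
    {"native_talk_language": "en", "subtitles": [{"languageCode": "fr"}]},
    {
        "native_talk_language": "en",
        "subtitles": [{"languageCode": "fr"}, {"languageCode": "es"}],
    },
    {"native_talk_language": "en", "subtitles": []},
    {"native_talk_language": "fr", "subtitles": [{"languageCode": "en"}]},
]
args = parser.parse_args([])  # no flag given, so the default 0.5 applies
kept = languages_above_threshold(sample, args.languages_threshold)
print(sorted(kept))
```

The surviving set would then be used to filter the scored languages before sorting, so the existing 10/1 ordering still applies to whatever remains.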