Revisit rules to add a lang to ZIM Language metadata #212

benoit74 · 2024-06-28T12:49:57Z

The languages advertised in a TED ZIM are now based on whether its videos have audio or subtitles in a given language.

This computation is done in the compute_zim_languages function at

Lines 343 to 397 in 9fc26d4

    
    def compute_zim_languages(self):
        """Compute the ZIM language metadata based on expected videos"""
        # count the number of videos per audio language
        audio_lang_counts = {
            lang: len(list(group))
            for lang, group in groupby(
                [video["native_talk_language"] for video in self.videos]
            )
        }
        # count the number of videos per subtitle language
        subtitle_lang_counts = {
            lang: len(list(group))
            for lang, group in groupby(
                [
                    subtitle["languageCode"]
                    for video in self.videos
                    for subtitle in video["subtitles"]
                ]
            )
        }
        # Attribute 10 "points" score to language in video audio and 1 "point" score
        # to language in video subtitle
        scored_languages = {
            k: 10 * audio_lang_counts.get(k, 0) + subtitle_lang_counts.get(k, 0)
            for k in list(audio_lang_counts.keys()) + list(subtitle_lang_counts.keys())
        }
        sorted_ted_languages = [
            lang_code
            for lang_code, _ in sorted(
                scored_languages.items(), key=lambda item: -item[1]
            )
        ]
        # compute the mappings from TED to ISO639-3 code and set ZIM language
        mapping = tedlang.ted_to_iso639_3_langcodes(sorted_ted_languages)
        self.zim_languages = ",".join(
            [mapping[code] for code in sorted_ted_languages if mapping[code]]
        )
        # Display a clear warning on languages which have been ignored due to missing
        # ISO639-3 codes
        ignored_ted_codes = [code for code in sorted_ted_languages if not mapping[code]]
        if len(ignored_ted_codes):
            logger.warning(
                "Some languages have not been added to ZIM metadata due to missing "
                f"ISO639-3 code: {ignored_ted_codes}"
            )
        if not self.disable_metadata_checks:
            # Validate ZIM languages
            validate_language("Language", self.zim_languages)

To sort the languages in the list, each video contributes 10 points to its audio language and 1 point to each of its subtitle languages; languages are then listed by descending score.
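As a worked illustration of the scoring rule, here is a minimal standalone sketch. The `score_languages` helper and the sample `videos` list are hypothetical and not part of the scraper; only the `native_talk_language` and `subtitles` / `languageCode` keys come from the excerpt above. The sketch counts with `collections.Counter` because `itertools.groupby` only merges consecutive equal items, so `Counter` gives per-language totals regardless of video order. The alphabetical tie-break is an assumption added to make the output deterministic.

```python
from collections import Counter


def score_languages(videos):
    """Return TED language codes sorted by descending score.

    Each video gives 10 points to its audio language and 1 point to
    each of its subtitle languages (hypothetical helper, for illustration).
    """
    audio_counts = Counter(video["native_talk_language"] for video in videos)
    subtitle_counts = Counter(
        subtitle["languageCode"]
        for video in videos
        for subtitle in video["subtitles"]
    )
    # Counter returns 0 for missing keys, so no .get() is needed
    scores = {
        lang: 10 * audio_counts[lang] + subtitle_counts[lang]
        for lang in audio_counts.keys() | subtitle_counts.keys()
    }
    # Assumption: break ties alphabetically so the order is deterministic
    return sorted(scores, key=lambda lang: (-scores[lang], lang))


# Made-up sample data, shaped like the video dicts in the excerpt
videos = [
    {
        "native_talk_language": "en",
        "subtitles": [{"languageCode": "en"}, {"languageCode": "fr"}],
    },
    {"native_talk_language": "fr", "subtitles": [{"languageCode": "en"}]},
    {"native_talk_language": "en", "subtitles": [{"languageCode": "es"}]},
]

# en: 2 audio + 2 subtitles = 22, fr: 1 audio + 1 subtitle = 11, es: 1 subtitle = 1
print(score_languages(videos))  # ['en', 'fr', 'es']
```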

We want to revisit this to consider only languages that are present in at least 30% or 50% of the videos. We are not yet sure which percentage is appropriate.

The plan is hence:

- Add a new CLI argument taking a float between 0 and 1 (0.5 meaning 50%, ...); the default value should be 0.5 (50%).
- Use this threshold in compute_zim_languages so that only languages present in at least that fraction of the videos are considered when computing the list of languages. A language is considered present in a given video if it is available as audio or as a subtitle.
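The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the change that was eventually merged: the `--language-threshold` flag name, the `threshold_type` validator and the `languages_above_threshold` helper are all hypothetical, and the sample data is made up.

```python
import argparse
from collections import Counter


def threshold_type(value):
    """argparse type: parse a float and require 0 <= value <= 1."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError(f"{value} is not between 0 and 1")
    return number


def languages_above_threshold(videos, threshold):
    """Return the languages present in at least `threshold` of the videos."""
    if not videos:
        return set()
    presence = Counter()
    for video in videos:
        # A set ensures a language available as both audio and subtitle
        # is counted only once per video
        langs = {video["native_talk_language"]}
        langs.update(sub["languageCode"] for sub in video["subtitles"])
        presence.update(langs)
    return {
        lang
        for lang, count in presence.items()
        if count / len(videos) >= threshold
    }


parser = argparse.ArgumentParser()
# Hypothetical flag name; the issue does not specify one
parser.add_argument("--language-threshold", type=threshold_type, default=0.5)
args = parser.parse_args(["--language-threshold", "0.5"])

# Made-up sample data, shaped like the video dicts in compute_zim_languages
videos = [
    {
        "native_talk_language": "en",
        "subtitles": [{"languageCode": "en"}, {"languageCode": "fr"}],
    },
    {"native_talk_language": "fr", "subtitles": [{"languageCode": "en"}]},
    {"native_talk_language": "en", "subtitles": [{"languageCode": "es"}]},
]

# en is in 3/3 videos, fr in 2/3, es in 1/3, so es is dropped at 0.5
print(sorted(languages_above_threshold(videos, args.language_threshold)))
```

In such a design, the existing 10/1 scoring would presumably still be used to order the languages that pass the threshold; the filter only decides which languages are eligible.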


benoit74 added the enhancement label Jun 28, 2024

benoit74 added this to the 3.1.0 milestone Jun 28, 2024

benoit74 added the good first issue label Jun 28, 2024

benoit74 mentioned this issue Jun 28, 2024

Better support of multi lang (Metadata) #210

Open

kelson42 mentioned this issue Jun 30, 2024

How many different language tracks are available #211

Closed

elfkuzco mentioned this issue Oct 24, 2024

use language threshold to compute zim language metadata #230

Merged

benoit74 closed this as completed in #230 Oct 25, 2024
