Proposal: Music lyrics file format #519
-
FYI, karaoke-style SRT files are now a supported output of Whisper (git version) using the --highlight_words option (together with --word_timestamps True).
-
Just bumping this up since I remember hearing somewhere recently (I can't remember where now) that "Karaoke" was starting to be used incorrectly within Podcasting 2.0 to describe bare word timestamps. To clarify: lyrics are called "Karaoke-style" when you have both word timestamps and the lines themselves. The lines are the fundamental unit, and the word timestamps are added on top of that. This is especially important because Karaoke is all about singing along to the song with the correct rhythm and phrasing, so you don't merely want a block of wrapped text with line breaks at random or unspecified positions; it is important that the lines reflect the phrasing of the song so that the phrases can easily be read and sung.

I will link a video below which contains an hour's worth of Karaoke songs from an old CD (large sample sizes can be helpful for analysis), to help give an idea of what they are. The Karaoke-style lyrics were encoded on the CD in subcode using the CD+G extension, and Karaoke machines used that extra data on the CD to display them. These days, Karaoke-style lyrics have been adapted to many popular subtitle formats, including SRT and ASS as mentioned earlier.

https://www.youtube.com/watch?v=6ZI0fNdkAWI

I thought it important to clarify this before people take a word that has a particular meaning and use it (e.g. in marketing campaigns) to mean something different. That would cause confusion for people who are searching for this term, particularly for music artists who may be searching for a transcription service that actually supports true Karaoke-style lyrics editing.
-
Background
The discussion of transcripts so far has focused on regular speech transcription. We have the well-supported SRT format, which traditionally gives publishers some level of control over where to insert line breaks so that captions are more readable, subject to publisher guidelines on maximum line widths or hyphens. And we have a new JSON format which, following proposal #484, could be redefined more usefully into a format specifically for word timestamps, allowing apps to highlight words as they're being spoken (among other flexible use cases). The distinction here is that the JSON format lets the app display the words however it wants to, while the SRT format lets the publisher indicate how lines are arranged.
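For concreteness, here is a rough sketch of what a word-level JSON transcript might look like, loosely modelled on the existing Podcasting 2.0 JSON transcript shape; the exact fields are illustrative, not something #484 has settled on:

```json
{
  "version": "1.0.0",
  "segments": [
    { "startTime": 0.50, "endTime": 0.72, "body": "Well," },
    { "startTime": 0.72, "endTime": 0.98, "body": "you" },
    { "startTime": 0.98, "endTime": 1.25, "body": "can" },
    { "startTime": 1.25, "endTime": 1.60, "body": "tell" }
  ]
}
```

Note that nothing in this shape says where a display line begins or ends; that is exactly the gap discussed below.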
Now, music lyrics are a distinct use case with their own considerations. We definitely want publishers to have control over where to insert line breaks so that they align with phrases in the song, but we also want word timestamps so that we can achieve karaoke-style rendering (e.g. the bouncing ball landing on each word at the moment you're supposed to sing it). For this use case, the proposed JSON format is not the solution: while it does capture word-level timestamps, it gives the publisher no control over what part of the lyrics goes on line 1, what goes on line 2, and when a new subtitle block should begin. Nor is this purely about publisher control; publishers provide extremely useful information about the most appropriate places to break lines, which players could not easily infer on their own.
Example of karaoke-style captions: https://www.youtube.com/watch?v=niYRQNSIAgI
Existing standards
There are various existing formats that people use to encode timestamped lyrics or karaoke-style lyrics, including LRC, SRT, VTT and ASS. What all of these have in common is that they were designed for pre-formatted output, so the publisher has control over how lyrics should be rendered, i.e. where to insert line breaks (and optionally over font colours and styles). The second thing they have in common is that they all optionally allow word timestamps to be embedded within the same file, as sketched below for LRC.
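As one illustration, here is a sketch of the "enhanced" LRC style (the A2 extension), where each line carries a line timestamp in square brackets and, optionally, inline word timestamps in angle brackets; the timings here are made up for illustration:

```
[00:12.00] <00:12.00> Whether <00:12.40> you're <00:12.70> a <00:12.85> brother
[00:14.00] <00:14.00> or <00:14.30> whether <00:14.75> you're <00:15.05> a <00:15.20> mother
```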
The ASS format has a lot of flexibility to control rendering, including positioning, which is probably more flexibility than we want. There is a tradeoff between letting the publisher choose the font colour for the currently highlighted word and letting the app choose it: quite likely the app designer is going for a certain aesthetic and may want control over the colour scheme used in their app.
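For reference, ASS karaoke timing looks roughly like the sketch below: each {\k} tag gives the duration of the following syllable in centiseconds, and styling and positioning overrides share the same tag syntax (timings illustrative):

```
Dialogue: 0,0:00:12.00,0:00:16.00,Default,,0,0,0,,{\k40}Whe{\k35}ther {\k30}you're {\k15}a {\k60}bro{\k60}ther
```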
The proposal
With these considerations in mind, I propose that we go with the simplest format that lets the publisher tell the app where the line breaks should go and what the timestamp is for each word, and then let the app decide on the font styles and colours appropriate for its colour palette. This can be done with minimal extensions to the SRT and VTT formats, following practices that are already widely adopted for karaoke captioning.
Simply use the <u>underline</u> tag to surround the currently highlighted word within the current SRT or VTT block. For example, take Stayin' Alive: at a minimum, the publisher may want to tell us that lines should be broken up precisely as in the first sketch below (given a 32-character line limit). To instead embed timestamps for each word, and thereby allow for karaoke-style rendering, we would have a separate block for each highlighted word, as in the second sketch.
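A sketch of what this could look like, with illustrative cue timings. First, plain SRT with publisher-chosen line breaks:

```
1
00:00:12,000 --> 00:00:16,000
Whether you're a brother
or whether you're a mother

2
00:00:16,000 --> 00:00:20,000
you're stayin' alive,
stayin' alive
```

And the word-timestamped variant, where each cue repeats the same two lines but advances the <u> highlight to the next word:

```
1
00:00:12,000 --> 00:00:12,400
<u>Whether</u> you're a brother
or whether you're a mother

2
00:00:12,400 --> 00:00:12,700
Whether <u>you're</u> a brother
or whether you're a mother

3
00:00:12,700 --> 00:00:12,850
Whether you're <u>a</u> brother
or whether you're a mother
```

(and so on for each subsequent word).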
The <u> tag is one of the few HTML tags accepted within both SRT and VTT, and is traditionally used to highlight words one at a time while maintaining information about line breaks. Although the original semantic meaning of <u> is underline, the spec would allow apps to have a say in how they render the current word (i.e. rather than simply underlining it, a bouncing ball could hit the word marked by <u>).

Note that if we were to invent a new format from scratch, we could encode the same information more efficiently, although in terms of adoption there is a benefit to sticking to existing standards. E.g. by using SRT, we benefit from the fact that many existing subtitling tools and transcription services already support this format.
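To make the efficiency point concrete, a hypothetical from-scratch encoding might store each display line once with per-word start times and durations, instead of repeating the full cue text for every word; all field names here are invented purely for illustration:

```json
{
  "lines": [
    {
      "start": 12.0,
      "text": "Whether you're a brother",
      "words": [
        { "t": 12.0,  "d": 0.40 },
        { "t": 12.4,  "d": 0.30 },
        { "t": 12.7,  "d": 0.15 },
        { "t": 12.85, "d": 0.60 }
      ]
    }
  ]
}
```

The SRT encoding above repeats the full cue text once per word, so its size grows roughly with the square of the number of words per block, whereas this shape grows linearly; even so, existing tool support arguably outweighs that cost.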
Implications
Since the same SRT format can be used to encode captions with or without word timestamps, apps might need special logic to detect the presence of <u> tags around each word if they want to activate special rendering in that case (e.g. bouncing balls); a detection sketch follows below. If the tags are absent, the default rendering of SRT should continue to work just fine, since it's still just an SRT file.

Otherwise, we may want to consider a transcript tag attribute to indicate whether word timestamps are embedded, since the content type itself doesn't carry that information. The spec already provides rel="captions", although as currently worded this only indicates the presence of timestamps, not the presence specifically of "word" timestamps.
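A minimal sketch of such detection logic, in Python, assuming the cues have already been parsed out of the SRT file; the heuristic (treat the file as word-timestamped if most cues contain exactly one <u>...</u> span) is just one possible choice:

```python
import re

U_TAG = re.compile(r"<u>(.*?)</u>", re.DOTALL)

def has_word_timestamps(cues, threshold=0.8):
    """Heuristic: the file is karaoke-style if most cues highlight
    exactly one word with <u>...</u>."""
    if not cues:
        return False
    highlighted = sum(1 for text in cues if len(U_TAG.findall(text)) == 1)
    return highlighted / len(cues) >= threshold

def render_cue(text):
    """Strip the <u> markers and return (plain_text, highlighted_word).
    The app decides how to style the highlighted word: underline,
    bouncing ball, colour change, etc."""
    match = U_TAG.search(text)
    word = match.group(1) if match else None
    return U_TAG.sub(r"\1", text), word

# Example: two cues advancing the highlight by one word.
cues = [
    "<u>Whether</u> you're a brother\nor whether you're a mother",
    "Whether <u>you're</u> a brother\nor whether you're a mother",
]
if has_word_timestamps(cues):
    for cue in cues:
        plain, word = render_cue(cue)
        print(f"highlight {word!r} in:\n{plain}\n")
```

If the heuristic fails, the app simply falls back to its default SRT caption rendering.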