Proposal: Music lyrics file format #519
-
FYI, karaoke-style SRT files are now a supported output of Whisper (git version) using the --highlight_words option (together with --word_timestamps True).
-
Just bumping this up since I remember hearing somewhere recently (I can't remember where now) that "Karaoke" was starting to be used incorrectly within Podcasting 2.0 to describe bare word timestamps. To clarify: lyrics are called "Karaoke-style" when you have both word timestamps and the lines themselves. The lines are the fundamental unit, and the word timestamps are added on top of that. This is especially important because Karaoke is all about singing along to the song with the correct rhythm and phrasing, so you don't merely want a block of wrapped text with line breaks at random or unspecified positions; it is important that the lines reflect the phrasing of the song so that the phrases can easily be read and sung.

I will link a video below which contains an hour's worth of Karaoke songs from an old CD (large sample sizes can be helpful for analysis), to help give an idea of what they are. The Karaoke-style lyrics were encoded on the CD in subcode using the CD+G extension, and Karaoke machines used that extra data on the CD to display them. These days, Karaoke-style lyrics have been adapted to many popular subtitle formats, including SRT and ASS as mentioned earlier.

https://www.youtube.com/watch?v=6ZI0fNdkAWI

I thought it important to clarify this before people take a word that has a particular meaning and use it (e.g. in marketing campaigns) to mean something different. That would cause confusion for people who are searching for this term, particularly for music artists who may be searching for a transcription service that actually supports true Karaoke-style lyrics editing.
-
Background
The discussion of transcripts so far has focused on regular speech transcription. We have the well-supported SRT format, which traditionally gives publishers some level of control over where to insert line breaks so that captions are more readable, subject to publisher guidelines on maximum line widths or hyphens. And we have a new JSON format which, following proposal #484, could be redefined more usefully into a format specifically for word timestamps, allowing apps to highlight words as they're being spoken (among other flexible use cases). The distinction here is that the JSON format lets the app display the words however it wants to, while the SRT format lets the publisher indicate how lines are arranged.
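For concreteness, here is a rough sketch of what a word-level JSON transcript might look like, loosely modelled on the existing Podcasting 2.0 JSON transcript shape; the exact fields are illustrative, not something #484 has settled on:

```json
{
  "version": "1.0.0",
  "segments": [
    { "startTime": 0.50, "endTime": 0.72, "body": "Well," },
    { "startTime": 0.72, "endTime": 0.98, "body": "you" },
    { "startTime": 0.98, "endTime": 1.25, "body": "can" },
    { "startTime": 1.25, "endTime": 1.60, "body": "tell" }
  ]
}
```

Note that nothing in this shape says where a display line begins or ends; that is exactly the gap discussed below.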
Now, music lyrics are a distinct use case with their own considerations. We definitely want publishers to have control over where to insert line breaks so that they align with phrases in the song, but we also want word timestamps so that we can achieve karaoke-style rendering (e.g. the bouncing ball landing on each word at the moment you're supposed to sing it). For this use case, the proposed JSON format is not the solution: while it does capture word-level timestamps, it gives the publisher no control over what part of the lyrics goes on line 1, what goes on line 2, and when a new subtitle block should begin. Nor is this purely about publisher control; publishers provide extremely useful information about the most appropriate places to break lines, which players could not easily infer on their own.
Example of karaoke-style captions: https://www.youtube.com/watch?v=niYRQNSIAgI
Existing standards
There are various existing formats that people use to encode timestamped lyrics or karaoke-style lyrics, including LRC, SRT, VTT and ASS. What all of these have in common is that they were designed for pre-formatted output, so the publisher has control over how lyrics should be rendered, i.e. where to insert line breaks (and optionally over font colours and styles). The second thing they have in common is that they all optionally allow word timestamps to be embedded within the same file, as sketched below for LRC.
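As one illustration, here is a sketch of the "enhanced" LRC style (the A2 extension), where each line carries a line timestamp in square brackets and, optionally, inline word timestamps in angle brackets; the timings here are made up for illustration:

```
[00:12.00] <00:12.00> Whether <00:12.40> you're <00:12.70> a <00:12.85> brother
[00:14.00] <00:14.00> or <00:14.30> whether <00:14.75> you're <00:15.05> a <00:15.20> mother
```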
The ASS format has a lot of flexibility to control rendering, including positioning, which is probably more flexibility than we want. There is a tradeoff between letting the publisher choose the font colour for the currently highlighted word and letting the app choose it: quite likely the app designer is going for a certain aesthetic and may want control over the colour scheme used in their app.
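For reference, ASS karaoke timing looks roughly like the sketch below: each {\k} tag gives the duration of the following syllable in centiseconds, and styling and positioning overrides share the same tag syntax (timings illustrative):

```
Dialogue: 0,0:00:12.00,0:00:16.00,Default,,0,0,0,,{\k40}Whe{\k35}ther {\k30}you're {\k15}a {\k60}bro{\k60}ther
```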
The proposal
With these considerations in mind, I propose that we go with the simplest format that lets the publisher tell the app where the line breaks should go and what the timestamp is for each word, and then let the app decide on the font styles and colours appropriate for its colour palette. This can be done with minimal extensions to the SRT and VTT formats, following practices that are already widely adopted for karaoke captioning.
Simply use the <u>underline</u> tag to surround the currently highlighted word within the current SRT or VTT block. For example, take Stayin' Alive: at a minimum, the publisher may want to tell us that lines should be broken up precisely as in the first sketch below (given a 32-character line limit). To instead embed timestamps for each word, and thereby allow for karaoke-style rendering, we would have a separate block for each highlighted word, as in the second sketch.
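A sketch of what this could look like, with illustrative cue timings. First, plain SRT with publisher-chosen line breaks:

```
1
00:00:12,000 --> 00:00:16,000
Whether you're a brother
or whether you're a mother

2
00:00:16,000 --> 00:00:20,000
you're stayin' alive,
stayin' alive
```

And the word-timestamped variant, where each cue repeats the same two lines but advances the <u> highlight to the next word:

```
1
00:00:12,000 --> 00:00:12,400
<u>Whether</u> you're a brother
or whether you're a mother

2
00:00:12,400 --> 00:00:12,700
Whether <u>you're</u> a brother
or whether you're a mother

3
00:00:12,700 --> 00:00:12,850
Whether you're <u>a</u> brother
or whether you're a mother
```

(and so on for each subsequent word).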
The <u> tag is one of the few HTML tags accepted within both SRT and VTT, and is traditionally used to highlight words one at a time while maintaining information about line breaks. Although the original semantic meaning of <u> is underline, the spec would allow apps to have a say in how they render the current word (i.e. rather than simply underlining it, a bouncing ball could hit the word marked by <u>).

Note that if we were to invent a new format from scratch, we could encode the same information more efficiently, although in terms of adoption there is a benefit to sticking to existing standards. E.g. by using SRT, we benefit from the fact that many existing subtitling tools and transcription services already support this format.
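To make the efficiency point concrete, a hypothetical from-scratch encoding might store each display line once with per-word start times and durations, instead of repeating the full cue text for every word; all field names here are invented purely for illustration:

```json
{
  "lines": [
    {
      "start": 12.0,
      "text": "Whether you're a brother",
      "words": [
        { "t": 12.0,  "d": 0.40 },
        { "t": 12.4,  "d": 0.30 },
        { "t": 12.7,  "d": 0.15 },
        { "t": 12.85, "d": 0.60 }
      ]
    }
  ]
}
```

The SRT encoding above repeats the full cue text once per word, so its size grows roughly with the square of the number of words per block, whereas this shape grows linearly; even so, existing tool support arguably outweighs that cost.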
Implications
Since the same SRT format can be used to encode captions with or without word timestamps, apps might need special logic to detect the presence of <u> tags around each word if they want to activate special rendering in that case (e.g. bouncing balls); a detection sketch follows below. If the tags are absent, the default rendering of SRT should continue to work just fine, since it's still just an SRT file.

Otherwise, we may want to consider a transcript tag attribute to indicate whether word timestamps are embedded, since the content type itself doesn't carry that information. The spec already provides rel="captions", although as currently worded this only indicates the presence of timestamps, not the presence specifically of "word" timestamps.
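A minimal sketch of such detection logic, in Python, assuming the cues have already been parsed out of the SRT file; the heuristic (treat the file as word-timestamped if most cues contain exactly one <u>...</u> span) is just one possible choice:

```python
import re

U_TAG = re.compile(r"<u>(.*?)</u>", re.DOTALL)

def has_word_timestamps(cues, threshold=0.8):
    """Heuristic: the file is karaoke-style if most cues highlight
    exactly one word with <u>...</u>."""
    if not cues:
        return False
    highlighted = sum(1 for text in cues if len(U_TAG.findall(text)) == 1)
    return highlighted / len(cues) >= threshold

def render_cue(text):
    """Strip the <u> markers and return (plain_text, highlighted_word).
    The app decides how to style the highlighted word: underline,
    bouncing ball, colour change, etc."""
    match = U_TAG.search(text)
    word = match.group(1) if match else None
    return U_TAG.sub(r"\1", text), word

# Example: two cues advancing the highlight by one word.
cues = [
    "<u>Whether</u> you're a brother\nor whether you're a mother",
    "Whether <u>you're</u> a brother\nor whether you're a mother",
]
if has_word_timestamps(cues):
    for cue in cues:
        plain, word = render_cue(cue)
        print(f"highlight {word!r} in:\n{plain}\n")
```

If the heuristic fails, the app simply falls back to its default SRT caption rendering.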