Proposal: Music lyrics format and general word timestamp format (WebVTT version) #599

ryan-lp · 2024-01-29T13:52:30Z

This proposal aims to integrate word timestamps within WebVTT transcripts, addressing the specific requirements of apps dealing with music lyrics.

An important aspect of music lyrics is that publishers typically want to indicate the phrasing via line breaks:

00:44.000 --> 00:48.000
Ah, ha, ha, ha,
stayin' alive, stayin' alive

For karaoke-style lyrics, we also need word timestamps in order to highlight the current word. The JSON format can give us only word timestamps and not line breaks. VTT, however, can encode both line breaks and word timestamps together in a single format:

00:44.000 --> 00:48.000
Ah, <00:44.500>ha, <00:45.000>ha, <00:45.500>ha,
<00:46.000>stayin' <00:46.500>alive, <00:47.000>stayin' <00:47.500>alive
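
To make the consumer side concrete, here is a minimal sketch of how an app might read such a cue. The names `parse_cue`, `to_seconds` and `TIMESTAMP_TAG` are made up for illustration, and the sketch assumes that text before the first timestamp tag starts at the cue's own start time and that each tag applies to the text that follows it. It is not a full WebVTT parser.

```python
import re
from typing import List, Optional, Tuple

# Matches inline timestamp tags such as <00:44.500> or <01:02:03.250>.
TIMESTAMP_TAG = re.compile(r"<(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})>")

def to_seconds(hours: Optional[str], minutes: str, seconds: str, millis: str) -> float:
    """Convert the captured tag fields to seconds (hours are optional)."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000

def parse_cue(cue_start: float, payload: str) -> List[List[Tuple[float, str]]]:
    """Return one list of (start_seconds, text) pairs per lyric line."""
    lines = []
    current_time = cue_start  # assumption: untagged leading text starts with the cue
    for raw_line in payload.splitlines():
        words = []
        pos = 0
        for tag in TIMESTAMP_TAG.finditer(raw_line):
            text = raw_line[pos:tag.start()].strip()
            if text:
                words.append((current_time, text))
            current_time = to_seconds(*tag.groups())
            pos = tag.end()
        text = raw_line[pos:].strip()  # text after the last tag on this line
        if text:
            words.append((current_time, text))
        lines.append(words)
    return lines
```

Calling `parse_cue(44.0, payload)` on the cue above yields two lists, one per lyric line, so the line breaks chosen by the publisher survive alongside the word timings.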

See this video for a demonstration of how Karaoke-style lyrics can be rendered.

Given the relatively low adoption of VTT compared to SRT, this is an opportune moment to consider such a change to the VTT spec and to get it right from the beginning, before too many people adopt the standard. It is crucial to avoid the issue encountered with the JSON format: apps cannot rely on the fidelity of its timestamps, because a large number of JSON files have already been published with widely varying degrees of fidelity. By strictly defining the conditions for timestamp tag usage in VTT, we can establish a robust standard from the outset. In particular, we should require that if timestamp tags are present, they must be at word-level fidelity.

Handling of long lines and word wrap

Unlike SRT, VTT specifies line wrapping behaviour. In practice, this means that SRT subtitles rely on the author to manually insert line breaks to avoid overflow, while VTT allows long lines to wrap automatically. The specification should still define standard limits to prevent overflow: cues should adhere to a maximum line width of 32 characters when two lines are present, and 64 characters when a single line is present (see #370 for an important caveat). A survey of current VTT usage in practice shows that many are doing this already, which should make the proposal easier to adopt officially.
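
As a rough illustration of the proposed limits, a validator could check each cue like this. The function name `check_cue_width` is hypothetical. The sketch assumes that inline tags do not count towards the width, that width is measured in Unicode code points, and that cues with more than two lines (which the proposal does not cover) fall under the 32-character limit.

```python
import re
from typing import List

# Strips timestamp tags and any other inline cue tags before measuring.
TAG = re.compile(r"<[^>]*>")

MAX_WIDTH_ONE_LINE = 64     # limit when the cue has a single line
MAX_WIDTH_MULTI_LINE = 32   # limit when the cue has two lines (assumed for more)

def check_cue_width(payload: str) -> List[str]:
    """Return a list of problems; an empty list means the cue fits."""
    lines = [TAG.sub("", line) for line in payload.splitlines()]
    limit = MAX_WIDTH_ONE_LINE if len(lines) == 1 else MAX_WIDTH_MULTI_LINE
    problems = []
    for number, line in enumerate(lines, start=1):
        if len(line) > limit:
            problems.append(
                f"line {number} is {len(line)} characters wide (limit {limit})"
            )
    return problems
```

A publishing tool could run this over every cue and warn the author, rather than reject the file, since the caveat in #370 may justify exceptions.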

Minimum duration

To improve accessibility for those who rely on captions, we should recommend that apps render cues for a minimum duration. For instance, a cue with two lines of 32 characters typically stays on screen long enough to be read, while a cue with a single line of 32 characters may pass by too quickly. In practice, the minimum duration can be satisfied by scrolling. Unlike SRT, VTT specifies scrolling behaviour: when a short cue appears briefly before the next cue takes its place, the first cue doesn't actually disappear; it scrolls up to make room for the new cue. In a karaoke music app, this could perhaps be flipped so that you can see the next lyric before it arrives.
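
One way an app might combine a minimum duration with scrolling is sketched below. The `Cue` type, the `visible_cues` function and the `min_duration` parameter are all hypothetical, and the proposal does not fix a particular minimum value. The idea is only that a cue stays on screen until its own end time or until the minimum has elapsed, whichever is later, with older cues drawn above newer ones.

```python
from typing import List, NamedTuple

class Cue(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str

def visible_cues(cues: List[Cue], t: float, min_duration: float) -> List[Cue]:
    """Cues to draw at time t, oldest first (older cues have scrolled up)."""
    return [
        cue
        for cue in sorted(cues, key=lambda c: c.start)
        # Keep a cue visible past its end time if it was shorter than the minimum.
        if cue.start <= t < max(cue.end, cue.start + min_duration)
    ]
```

With a half-second cue followed immediately by a longer one and a 2-second minimum, both cues are returned shortly after the second one starts, and only the second remains once the first has had its 2 seconds.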

Note: This proposal is the VTT version of my earlier SRT proposal, #519. Compared to the SRT proposal, the VTT proposal is more storage-efficient, since it does not require repeating whole cues/blocks to highlight each individual word in the cue/block. This also makes it less likely to cause issues for Apple.
