Proposal: Music lyrics format and general word timestamp format (WebVTT version) #599
ryan-lp
started this conversation in
Spec Proposal
This proposal aims to integrate word timestamps within WebVTT transcripts, addressing the specific requirements of apps dealing with music lyrics.
An important aspect of music lyrics is that publishers typically want to indicate the phrasing via line breaks.
For Karaoke-style lyrics, we also need word timestamps in order to highlight the current word. The JSON format can give us only word timestamps and not line breaks; however, a single VTT file can encode both line breaks and word timestamps together.
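To make the idea concrete, here is a minimal Python sketch of how an app might read such a cue. It assumes the word timestamps are written as standard WebVTT cue timestamp tags (for example `<00:00:01.500>`) placed before each word, and that the first word of a cue starts at the cue's own start time. The sample lyric, the function names `to_seconds` and `parse_payload`, and the output shape are all illustrative assumptions, not part of the proposal.

```python
import re

# Matches a WebVTT cue timestamp tag such as <00:00:01.500> or <00:01.500>.
TS_TAG = re.compile(r"<((?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3})>")


def to_seconds(ts: str) -> float:
    """Convert 'hh:mm:ss.ttt' or 'mm:ss.ttt' to seconds."""
    parts = ts.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def parse_payload(payload: str, cue_start: float):
    """Return one list per lyric line; each item is (start_seconds, text).

    Assumption: text that appears before the first tag starts at cue_start.
    """
    lines = []
    current = cue_start
    for raw_line in payload.splitlines():
        # With one capture group, split() yields [text, ts, text, ts, text, ...].
        pieces = TS_TAG.split(raw_line)
        words = []
        if pieces[0].strip():
            words.append((current, pieces[0].strip()))
        for ts, text in zip(pieces[1::2], pieces[2::2]):
            current = to_seconds(ts)
            if text.strip():
                words.append((current, text.strip()))
        lines.append(words)
    return lines


# Hypothetical cue payload: the line break carries the phrasing,
# the inline tags carry the word timing.
payload = (
    "Twinkle <00:00:01.500>twinkle <00:00:02.000>little <00:00:02.500>star\n"
    "<00:00:03.000>How <00:00:03.500>I <00:00:04.000>wonder"
)
lyric_lines = parse_payload(payload, cue_start=1.0)
```

Each inner list corresponds to one line of the lyric, so a renderer can keep the publisher's phrasing while highlighting words as their start times pass.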
See this video for a demonstration of how Karaoke-style lyrics can be rendered.
Because VTT is less widely adopted than SRT, now is an opportune moment to consider such a change to the VTT spec and to get it right from the beginning, before too many people adopt the standard. It is crucial to avoid the problem encountered with the JSON format, where apps cannot rely on the fidelity of the timestamps because so many JSON files have already been published with widely varying degrees of fidelity. By strictly defining the conditions for timestamp tag usage in VTT, we can establish a robust standard from the outset. In particular, we should require that if timestamp tags are present, they must be at word-level fidelity.
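One way a validator might enforce the word-level requirement is sketched below in Python. It assumes that a "word" is a whitespace-separated token (which will not hold for every language) and that the first word of a cue may go untagged because it starts at the cue's start time. The function name `is_word_level` is hypothetical.

```python
import re

# Non-capturing pattern, so split() returns only the text between tags.
TS_TAG = re.compile(r"<(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}>")


def is_word_level(payload: str) -> bool:
    """Check that a cue either has no timestamp tags or tags every word.

    Assumption: words are whitespace-separated tokens.
    """
    if not TS_TAG.search(payload):
        return True  # No tags at all: the rule does not apply.
    segments = TS_TAG.split(payload)
    # Every stretch of text between two tags may hold at most one word.
    return all(len(segment.split()) <= 1 for segment in segments)


ok = is_word_level("Twinkle <00:00:01.500>twinkle <00:00:02.000>little")
too_coarse = is_word_level("Twinkle twinkle <00:00:02.000>little star")
```

A cue that only tags every second word, or only the start of each line, fails the check, which is the situation the requirement is meant to rule out.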
Handling of long lines and word wrap
Unlike SRT, VTT specifies line wrapping behaviour. In practice, this means that SRT subtitles rely on the author to manually insert line breaks to avoid overflow, while VTT allows long lines to wrap automatically. The specification should still define standard limits to prevent overflow: cues should adhere to a maximum line width of 32 characters when two lines are present, and 64 characters when a single line is present (see #370 for an important caveat). A survey of current VTT usage shows that many are already doing this, which should make the proposal easier to adopt officially.
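The limits above could be checked with something like the following Python sketch. It makes several assumptions that the proposal does not settle: width is counted in Unicode code points after stripping inline tags, a cue with more than two lines (or no text) is treated as outside the limits, and the caveat in #370 is not modelled. The name `within_width_limits` is hypothetical.

```python
import re

# Strips any inline tag (timestamp tags, <c>, <v ...>, etc.) before counting.
TAG = re.compile(r"<[^>]*>")

MAX_SINGLE_LINE = 64  # limit when the cue has one line
MAX_DOUBLE_LINE = 32  # per-line limit when the cue has two lines


def within_width_limits(payload: str) -> bool:
    """Apply the 64 / 32 character limits to a cue payload."""
    lines = [TAG.sub("", line) for line in payload.splitlines()]
    if len(lines) == 1:
        return len(lines[0]) <= MAX_SINGLE_LINE
    if len(lines) == 2:
        return all(len(line) <= MAX_DOUBLE_LINE for line in lines)
    # Assumption: empty cues and cues with three or more lines are rejected.
    return False


single_ok = within_width_limits("a" * 64)
double_too_wide = within_width_limits("a" * 33 + "\n" + "b")
```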
Minimum duration
To improve accessibility for those who rely on captions, we should recommend that apps render cues with a minimum duration. For instance, cues with two lines of 32 characters provide ample time for viewers to read the content, while cues with a single line of 32 characters may pass by too quickly to be read. In practice, the minimum duration for cue rendering can be satisfied by scrolling. Unlike SRT, VTT specifies scrolling behaviour such that when a short cue appears briefly before the next cue takes its place, the first cue doesn't actually disappear; it just scrolls up to make room for the new cue. In a Karaoke music app, this could perhaps be flipped so that you can see the next lyric before it arrives.
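A renderer could approximate this with a rule like the one in the Python sketch below: each cue stays visible until at least a minimum duration has passed, even if the next cue has already started, so the two are shown together (for example by scrolling). The proposal does not name a minimum duration, so the 2.0 seconds used here is purely an illustrative value, and `display_windows` and `visible_at` are hypothetical names.

```python
def display_windows(cues, min_duration):
    """Extend each cue's visibility to at least min_duration seconds.

    cues: list of (start, end, text) tuples, times in seconds.
    Returns a list of (start, visible_until, text) tuples.
    """
    return [
        (start, max(end, start + min_duration), text)
        for start, end, text in cues
    ]


def visible_at(windows, t):
    """Texts that should be on screen at time t, oldest first."""
    return [text for start, until, text in windows if start <= t < until]


# A very short cue followed immediately by a longer one.
cues = [(0.0, 0.5, "Hey"), (0.5, 3.0, "How I wonder")]
windows = display_windows(cues, min_duration=2.0)  # 2.0 is illustrative only
on_screen = visible_at(windows, 1.0)
```

At one second the short first cue is still on screen alongside the second, which matches the scroll-up behaviour described above. The flipped Karaoke variant would instead extend each window backwards from its start time.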
Note: This proposal is the VTT version of my earlier SRT proposal #519. Compared to the SRT proposal, the VTT proposal is more storage-efficient since it does not require repeating whole cues/blocks to highlight each of the individual words in the cue/block. This also makes it less likely to cause issues for Apple.