Replies: 5 comments 15 replies
-
@eonglints yup! i can add the prepended way this week! (and make sure it works with @crowsonkb's version of classifier-free guidance)
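For what it's worth, here is a minimal sketch of how classifier-free guidance might combine with a prepended text prefix. The names (`model`, `null_text_tokens`, `cond_scale`) are illustrative assumptions, not the actual API of this repo:

```python
import torch

def cfg_logits(model, audio_tokens, text_tokens, null_text_tokens, cond_scale = 3.0):
    # conditional pass: text tokens prepended to the audio token sequence
    cond_logits = model(torch.cat((text_tokens, audio_tokens), dim = -1))

    # unconditional pass: text prefix swapped for a learned "null" prefix
    # (during training the text prefix would be dropped with some probability, e.g. 10%)
    uncond_logits = model(torch.cat((null_text_tokens, audio_tokens), dim = -1))

    # classifier-free guidance: push the conditional prediction away from the unconditional one
    return uncond_logits + (cond_logits - uncond_logits) * cond_scale
```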
-
Pretty much exactly what I've been hoping to do with text-conditioned AudioLM has just been demonstrated with a new TTS paper from Microsoft: https://valle-demo.github.io/. Pretty impressive results.
-
Speaking of text conditioning, it looks like Google's just published a new paper on this: https://google-research.github.io/seanet/musiclm/examples/ |
-
Wow, they just demonstrated so many things I've wanted to try with this code... I can't access the paper yet; I wonder how similar it is to what @lucidrains has built here? I'm excited for the MusicCaps dataset too.
-
You are legendary! Yes, it is very interesting how they overcame the problem of scarce training data by using a shared embedding space (MuLan).
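To illustrate the trick, here is a rough sketch of how a shared text-audio embedding space sidesteps the need for paired captions. The `mulan.embed_audio` / `mulan.embed_text` names and the `cond_embed` keyword are hypothetical placeholders, assuming only that both embedders map into the same space:

```python
def train_step(music_lm, mulan, audio):
    # training needs no captions: condition on the audio's own MuLan-style embedding
    cond = mulan.embed_audio(audio)
    return music_lm(audio, cond_embed = cond)

def generate(music_lm, mulan, text):
    # at inference, swap in the text embedding, which lives in the same space
    cond = mulan.embed_text(text)
    return music_lm.generate(cond_embed = cond)
```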
-
Hey, so I'm wondering about the various options for text conditioning. At the moment, it would appear we're set up to condition using cross-attention in each of the transformers. I was wondering whether we should also consider simply pre-pending text tokens before the audio tokens? Similar to the approach taken (using image tokens) in DALL-E 1 (paper, code 🙏 @lucidrains), Taming Transformers (paper, code) and Make-A-Scene (paper, unofficial code) as well as Tortoise TTS (blog post, code) with audio tokens.
I think pre-pending the text tokens would only be required for the semantic transformer as the resulting semantic tokens should encode all the phonetic content, leaving the SoundStream transformers to deal with audio decoding and further conditioning on speaker identity and other acoustic attributes.
Thoughts?
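For concreteness, here is a minimal sketch of what prepending could look like for the semantic transformer. This is not the repo's actual architecture; the separator token, module names, and hyperparameters are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class PrependedTextSemanticTransformer(nn.Module):
    # illustrative only: a causal transformer whose input is
    # [text tokens] + [separator] + [semantic audio tokens],
    # trained with the usual next-token objective
    def __init__(self, num_text_tokens, num_semantic_tokens, dim = 512, depth = 6, heads = 8):
        super().__init__()
        self.text_emb = nn.Embedding(num_text_tokens, dim)
        self.semantic_emb = nn.Embedding(num_semantic_tokens + 1, dim)  # +1 for the separator
        self.sep_id = num_semantic_tokens
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first = True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, num_semantic_tokens)

    def forward(self, text_ids, semantic_ids):
        b = text_ids.shape[0]
        sep = torch.full((b, 1), self.sep_id, device = semantic_ids.device)

        # prepend text embeddings to the (separator + semantic token) embeddings
        # (positional embeddings omitted for brevity; a real model would add them)
        x = torch.cat((
            self.text_emb(text_ids),
            self.semantic_emb(torch.cat((sep, semantic_ids), dim = 1)),
        ), dim = 1)

        n = x.shape[1]
        causal_mask = torch.triu(torch.ones(n, n, device = x.device, dtype = torch.bool), 1)
        x = self.transformer(x, mask = causal_mask)

        # logits for every position; the training loss would typically be
        # computed only over the semantic (audio) portion of the sequence
        return self.to_logits(x)
```

One open design choice is whether to compute the next-token loss over the text prefix as well or mask it out and train only on the audio tokens.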