Replies: 5 comments 15 replies
-
@eonglints yup! i can add the prepended way this week! (and make sure it works with @crowsonkb's version of classifier-free guidance)
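For what it's worth, here is a minimal sketch of how classifier-free guidance might combine with a prepended text prefix. The names (`model`, `null_text_tokens`, `cond_scale`) are illustrative assumptions, not the actual API of this repo:

```python
import torch

def cfg_logits(model, audio_tokens, text_tokens, null_text_tokens, cond_scale = 3.0):
    # conditional pass: text tokens prepended to the audio token sequence
    cond_logits = model(torch.cat((text_tokens, audio_tokens), dim = -1))

    # unconditional pass: text prefix swapped for a learned "null" prefix
    # (during training the text prefix would be dropped with some probability, e.g. 10%)
    uncond_logits = model(torch.cat((null_text_tokens, audio_tokens), dim = -1))

    # classifier-free guidance: push the conditional prediction away from the unconditional one
    return uncond_logits + (cond_logits - uncond_logits) * cond_scale
```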
-
Pretty much exactly what I've been hoping to do with text-conditioned AudioLM has just been demonstrated with a new TTS paper from Microsoft: https://valle-demo.github.io/. Pretty impressive results.
-
Speaking of text conditioning, it looks like Google's just published a new paper on this: https://google-research.github.io/seanet/musiclm/examples/ |
-
Wow, they just demonstrated so many things I've wanted to try with this code... I can't access the paper yet; I wonder how similar it is to what @lucidrains has built here? I'm excited for the MusicCaps dataset too.
-
You are legendary! Yes, it is very interesting how they overcame the problem of scarce training data by using a shared embedding space (MuLan).
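To illustrate the trick, here is a rough sketch of how a shared text-audio embedding space sidesteps the need for paired captions. The `mulan.embed_audio` / `mulan.embed_text` names and the `cond_embed` keyword are hypothetical placeholders, assuming only that both embedders map into the same space:

```python
def train_step(music_lm, mulan, audio):
    # training needs no captions: condition on the audio's own MuLan-style embedding
    cond = mulan.embed_audio(audio)
    return music_lm(audio, cond_embed = cond)

def generate(music_lm, mulan, text):
    # at inference, swap in the text embedding, which lives in the same space
    cond = mulan.embed_text(text)
    return music_lm.generate(cond_embed = cond)
```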
-
Hey, so I'm wondering about the various options for text conditioning. At the moment, it would appear we're set up to condition using cross-attention in each of the transformers. I was wondering whether we should also consider simply pre-pending text tokens before the audio tokens? Similar to the approach taken (using image tokens) in DALL-E 1 (paper, code 🙏 @lucidrains), Taming Transformers (paper, code) and Make-A-Scene (paper, unofficial code) as well as Tortoise TTS (blog post, code) with audio tokens.
I think pre-pending the text tokens would only be required for the semantic transformer as the resulting semantic tokens should encode all the phonetic content, leaving the SoundStream transformers to deal with audio decoding and further conditioning on speaker identity and other acoustic attributes.
Thoughts?
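For concreteness, here is a minimal sketch of what prepending could look like for the semantic transformer. This is not the repo's actual architecture; the separator token, module names, and hyperparameters are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class PrependedTextSemanticTransformer(nn.Module):
    # illustrative only: a causal transformer whose input is
    # [text tokens] + [separator] + [semantic audio tokens],
    # trained with the usual next-token objective
    def __init__(self, num_text_tokens, num_semantic_tokens, dim = 512, depth = 6, heads = 8):
        super().__init__()
        self.text_emb = nn.Embedding(num_text_tokens, dim)
        self.semantic_emb = nn.Embedding(num_semantic_tokens + 1, dim)  # +1 for the separator
        self.sep_id = num_semantic_tokens
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first = True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, num_semantic_tokens)

    def forward(self, text_ids, semantic_ids):
        b = text_ids.shape[0]
        sep = torch.full((b, 1), self.sep_id, device = semantic_ids.device)

        # prepend text embeddings to the (separator + semantic token) embeddings
        # (positional embeddings omitted for brevity; a real model would add them)
        x = torch.cat((
            self.text_emb(text_ids),
            self.semantic_emb(torch.cat((sep, semantic_ids), dim = 1)),
        ), dim = 1)

        n = x.shape[1]
        causal_mask = torch.triu(torch.ones(n, n, device = x.device, dtype = torch.bool), 1)
        x = self.transformer(x, mask = causal_mask)

        # logits for every position; the training loss would typically be
        # computed only over the semantic (audio) portion of the sequence
        return self.to_logits(x)
```

One open design choice is whether to compute the next-token loss over the text prefix as well or mask it out and train only on the audio tokens.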