Mandarin support? #10
I did try training on other languages, including Mandarin, Japanese, Hindi, etc., though it requires a few changes:
```python
import torch
import torch.nn as nn

# DurationEncoder, LinearNorm and AdainResBlk1d come from models.py in this repo.

class ProsodyPredictor(nn.Module):
    def __init__(self, n_prods, prod_embd, style_dim, d_hid, nlayers, dropout=0.1):
        super().__init__()

        # New: embedding for the prosody (tone) labels.
        self.embedding = nn.Embedding(n_prods, prod_embd * 2)

        self.text_encoder = DurationEncoder(sty_dim=style_dim,
                                            d_model=d_hid,
                                            nlayers=nlayers,
                                            dropout=dropout)

        # Original (English-only) version for reference:
        #   self.lstm = nn.LSTM(d_hid + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        # The modified layers take the extra prod_embd * 2 channels from the tone embedding:
        self.lstm = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1,
                            batch_first=True, bidirectional=True)
        self.duration_proj = LinearNorm(d_hid, 1)

        self.shared = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1,
                              batch_first=True, bidirectional=True)

        self.F0 = nn.ModuleList()
        self.F0.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))

        self.N = nn.ModuleList()
        self.N.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))

        self.F0_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
        self.N_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)

    def forward(self, texts, prosody, style, text_lengths, alignment, m):
        # Embed the tone labels and concatenate them with the text features.
        prosody = self.embedding(prosody)
        texts = torch.cat([texts, prosody], axis=1)

        d = self.text_encoder(texts, style, text_lengths, m)

        batch_size = d.shape[0]
        text_size = d.shape[1]

        # predict duration
        input_lengths = text_lengths.cpu().numpy()
        x = nn.utils.rnn.pack_padded_sequence(
            d, input_lengths, batch_first=True, enforce_sorted=False)

        m = m.to(text_lengths.device).unsqueeze(1)

        self.lstm.flatten_parameters()
        x, _ = self.lstm(x)
        x, _ = nn.utils.rnn.pad_packed_sequence(
            x, batch_first=True)

        x_pad = torch.zeros([x.shape[0], m.shape[-1], x.shape[-1]])
        x_pad[:, :x.shape[1], :] = x
        x = x_pad.to(x.device)

        duration = self.duration_proj(nn.functional.dropout(x, 0.5, training=self.training))

        en = (d.transpose(-1, -2) @ alignment)

        return duration.squeeze(-1), en

    def F0Ntrain(self, x, s):
        x, _ = self.shared(x.transpose(-1, -2))

        F0 = x.transpose(-1, -2)
        for block in self.F0:
            F0 = block(F0, s)
        F0 = self.F0_proj(F0)

        N = x.transpose(-1, -2)
        for block in self.N:
            N = block(N, s)
        N = self.N_proj(N)

        return F0.squeeze(1), N.squeeze(1)

    def length_to_mask(self, lengths):
        mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
        mask = torch.gt(mask + 1, lengths.unsqueeze(1))
        return mask
```
where X and $ represent the SOS and EOS tokens. I'll leave this issue open for someone to fork the repo and modify it for Mandarin and Japanese support; I'm unfortunately too busy to work on it now.
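As a rough illustration of the data plumbing (an assumption, not code from the repo), the tone column of such a filelist (see the example line further below) could be turned into the `prosody` index tensor that the modified ProsodyPredictor above embeds:

```python
import torch

# Hedged sketch: tone labels mapped to integer IDs for the `prosody` argument
# of the modified ProsodyPredictor. The vocabulary (X = no tone, 1-5 = Mandarin
# tones) and mapping spaces to the "no tone" ID are assumptions, not repo code.
tone_to_id = {"X": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, " ": 0}

def tones_to_tensor(tone_column: str) -> torch.Tensor:
    # Character by character, so the result stays aligned with the phoneme
    # column of the same filelist line.
    ids = [tone_to_id[c] for c in tone_column]
    return torch.LongTensor(ids).unsqueeze(0)  # shape (1, n_symbols)

# With this vocabulary, n_prods passed to ProsodyPredictor would be 6.
```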
For Japanese, you can do the same thing. The conversion table from kana to IPA is the following (again, phonemizer doesn't work for me):

```python
from collections import OrderedDict

kana_mapper = OrderedDict([
("ゔぁ","bˈa"),
("ゔぃ","bˈi"),
("ゔぇ","bˈe"),
("ゔぉ","bˈo"),
("ゔゃ","bˈʲa"),
("ゔゅ","bˈʲɯ"),
("ゔゃ","bˈʲa"),
("ゔょ","bˈʲo"),
("ゔ","bˈɯ"),
("あぁ","aː"),
("いぃ","iː"),
("いぇ","je"),
("いゃ","ja"),
("うぅ","ɯː"),
("えぇ","eː"),
("おぉ","oː"),
("かぁ","kˈaː"),
("きぃ","kˈiː"),
("くぅ","kˈɯː"),
("くゃ","kˈa"),
("くゅ","kˈʲɯ"),
("くょ","kˈʲo"),
("けぇ","kˈeː"),
("こぉ","kˈoː"),
("がぁ","gˈaː"),
("ぎぃ","gˈiː"),
("ぐぅ","gˈɯː"),
("ぐゃ","gˈʲa"),
("ぐゅ","gˈʲɯ"),
("ぐょ","gˈʲo"),
("げぇ","gˈeː"),
("ごぉ","gˈoː"),
("さぁ","sˈaː"),
("しぃ","ɕˈiː"),
("すぅ","sˈɯː"),
("すゃ","sˈʲa"),
("すゅ","sˈʲɯ"),
("すょ","sˈʲo"),
("せぇ","sˈeː"),
("そぉ","sˈoː"),
("ざぁ","zˈaː"),
("じぃ","dʑˈiː"),
("ずぅ","zˈɯː"),
("ずゃ","zˈʲa"),
("ずゅ","zˈʲɯ"),
("ずょ","zˈʲo"),
("ぜぇ","zˈeː"),
("ぞぉ","zˈeː"),
("たぁ","tˈaː"),
("ちぃ","tɕˈiː"),
("つぁ","tsˈa"),
("つぃ","tsˈi"),
("つぅ","tsˈɯː"),
("つゃ","tɕˈa"),
("つゅ","tɕˈɯ"),
("つょ","tɕˈo"),
("つぇ","tsˈe"),
("つぉ","tsˈo"),
("てぇ","tˈeː"),
("とぉ","tˈoː"),
("だぁ","dˈaː"),
("ぢぃ","dʑˈiː"),
("づぅ","dˈɯː"),
("づゃ","zˈʲa"),
("づゅ","zˈʲɯ"),
("づょ","zˈʲo"),
("でぇ","dˈeː"),
("どぉ","dˈoː"),
("なぁ","nˈaː"),
("にぃ","nˈiː"),
("ぬぅ","nˈɯː"),
("ぬゃ","nˈʲa"),
("ぬゅ","nˈʲɯ"),
("ぬょ","nˈʲo"),
("ねぇ","nˈeː"),
("のぉ","nˈoː"),
("はぁ","hˈaː"),
("ひぃ","çˈiː"),
("ふぅ","ɸˈɯː"),
("ふゃ","ɸˈʲa"),
("ふゅ","ɸˈʲɯ"),
("ふょ","ɸˈʲo"),
("へぇ","hˈeː"),
("ほぉ","hˈoː"),
("ばぁ","bˈaː"),
("びぃ","bˈiː"),
("ぶぅ","bˈɯː"),
("ふゃ","ɸˈʲa"),
("ぶゅ","bˈʲɯ"),
("ふょ","ɸˈʲo"),
("べぇ","bˈeː"),
("ぼぉ","bˈoː"),
("ぱぁ","pˈaː"),
("ぴぃ","pˈiː"),
("ぷぅ","pˈɯː"),
("ぷゃ","pˈʲa"),
("ぷゅ","pˈʲɯ"),
("ぷょ","pˈʲo"),
("ぺぇ","pˈeː"),
("ぽぉ","pˈoː"),
("まぁ","mˈaː"),
("みぃ","mˈiː"),
("むぅ","mˈɯː"),
("むゃ","mˈʲa"),
("むゅ","mˈʲɯ"),
("むょ","mˈʲo"),
("めぇ","mˈeː"),
("もぉ","mˈoː"),
("やぁ","jˈaː"),
("ゆぅ","jˈɯː"),
("ゆゃ","jˈaː"),
("ゆゅ","jˈɯː"),
("ゆょ","jˈoː"),
("よぉ","jˈoː"),
("らぁ","ɽˈaː"),
("りぃ","ɽˈiː"),
("るぅ","ɽˈɯː"),
("るゃ","ɽˈʲa"),
("るゅ","ɽˈʲɯ"),
("るょ","ɽˈʲo"),
("れぇ","ɽˈeː"),
("ろぉ","ɽˈoː"),
("わぁ","ɯˈaː"),
("をぉ","oː"),
("う゛","bˈɯ"),
("でぃ","dˈi"),
("でぇ","dˈeː"),
("でゃ","dˈʲa"),
("でゅ","dˈʲɯ"),
("でょ","dˈʲo"),
("てぃ","tˈi"),
("てぇ","tˈeː"),
("てゃ","tˈʲa"),
("てゅ","tˈʲɯ"),
("てょ","tˈʲo"),
("すぃ","sˈi"),
("ずぁ","zˈɯa"),
("ずぃ","zˈi"),
("ずぅ","zˈɯ"),
("ずゃ","zˈʲa"),
("ずゅ","zˈʲɯ"),
("ずょ","zˈʲo"),
("ずぇ","zˈe"),
("ずぉ","zˈo"),
("きゃ","kˈʲa"),
("きゅ","kˈʲɯ"),
("きょ","kˈʲo"),
("しゃ","ɕˈʲa"),
("しゅ","ɕˈʲɯ"),
("しぇ","ɕˈʲe"),
("しょ","ɕˈʲo"),
("ちゃ","tɕˈa"),
("ちゅ","tɕˈɯ"),
("ちぇ","tɕˈe"),
("ちょ","tɕˈo"),
("とぅ","tˈɯ"),
("とゃ","tˈʲa"),
("とゅ","tˈʲɯ"),
("とょ","tˈʲo"),
("どぁ","dˈoa"),
("どぅ","dˈɯ"),
("どゃ","dˈʲa"),
("どゅ","dˈʲɯ"),
("どょ","dˈʲo"),
("どぉ","dˈoː"),
("にゃ","nˈʲa"),
("にゅ","nˈʲɯ"),
("にょ","nˈʲo"),
("ひゃ","çˈʲa"),
("ひゅ","çˈʲɯ"),
("ひょ","çˈʲo"),
("みゃ","mˈʲa"),
("みゅ","mˈʲɯ"),
("みょ","mˈʲo"),
("りゃ","ɽˈʲa"),
("りぇ","ɽˈʲe"),
("りゅ","ɽˈʲɯ"),
("りょ","ɽˈʲo"),
("ぎゃ","gˈʲa"),
("ぎゅ","gˈʲɯ"),
("ぎょ","gˈʲo"),
("ぢぇ","dʑˈe"),
("ぢゃ","dʑˈa"),
("ぢゅ","dʑˈɯ"),
("ぢょ","dʑˈo"),
("じぇ","dʑˈe"),
("じゃ","dʑˈa"),
("じゅ","dʑˈɯ"),
("じょ","dʑˈo"),
("びゃ","bˈʲa"),
("びゅ","bˈʲɯ"),
("びょ","bˈʲo"),
("ぴゃ","pˈʲa"),
("ぴゅ","pˈʲɯ"),
("ぴょ","pˈʲo"),
("うぁ","ɯˈa"),
("うぃ","ɯˈi"),
("うぇ","ɯˈe"),
("うぉ","ɯˈo"),
("うゃ","ɯˈʲa"),
("うゅ","ɯˈʲɯ"),
("うょ","ɯˈʲo"),
("ふぁ","ɸˈa"),
("ふぃ","ɸˈi"),
("ふぅ","ɸˈɯ"),
("ふゃ","ɸˈʲa"),
("ふゅ","ɸˈʲɯ"),
("ふょ","ɸˈʲo"),
("ふぇ","ɸˈe"),
("ふぉ","ɸˈo"),
("あ","a"),
("い","i"),
("う","ɯ"),
("え","e"),
("お","o"),
("か","kˈa"),
("き","kˈi"),
("く","kˈɯ"),
("け","kˈe"),
("こ","kˈo"),
("さ","sˈa"),
("し","ɕˈi"),
("す","sˈɯ"),
("せ","sˈe"),
("そ","sˈo"),
("た","tˈa"),
("ち","tɕˈi"),
("つ","tsˈɯ"),
("て","tˈe"),
("と","tˈo"),
("な","nˈa"),
("に","nˈi"),
("ぬ","nˈɯ"),
("ね","nˈe"),
("の","nˈo"),
("は","hˈa"),
("ひ","çˈi"),
("ふ","ɸˈɯ"),
("へ","hˈe"),
("ほ","hˈo"),
("ま","mˈa"),
("み","mˈi"),
("む","mˈɯ"),
("め","mˈe"),
("も","mˈo"),
("ら","ɽˈa"),
("り","ɽˈi"),
("る","ɽˈɯ"),
("れ","ɽˈe"),
("ろ","ɽˈo"),
("が","gˈa"),
("ぎ","gˈi"),
("ぐ","gˈɯ"),
("げ","gˈe"),
("ご","gˈo"),
("ざ","zˈa"),
("じ","dʑˈi"),
("ず","zˈɯ"),
("ぜ","zˈe"),
("ぞ","zˈo"),
("だ","dˈa"),
("ぢ","dʑˈi"),
("づ","zˈɯ"),
("で","dˈe"),
("ど","dˈo"),
("ば","bˈa"),
("び","bˈi"),
("ぶ","bˈɯ"),
("べ","bˈe"),
("ぼ","bˈo"),
("ぱ","pˈa"),
("ぴ","pˈi"),
("ぷ","pˈɯ"),
("ぺ","pˈe"),
("ぽ","pˈo"),
("や","jˈa"),
("ゆ","jˈɯ"),
("よ","jˈo"),
("わ","ɯˈa"),
("ゐ","i"),
("ゑ","e"),
("ん","ɴ"),
("っ","ʔ"),
("ー","ː"),
("ぁ","a"),
("ぃ","i"),
("ぅ","ɯ"),
("ぇ","e"),
("ぉ","o"),
("ゎ","ɯˈa"),
("ぉ","o"),
("を","o")
])
nasal_sound = OrderedDict([
# before m, p, b
("ɴm","mm"),
("ɴb", "mb"),
("ɴp", "mp"),
# before k, g
("ɴk","ŋk"),
("ɴg", "ŋg"),
# before t, d, n, s, z, ɽ
("ɴt","nt"),
("ɴd", "nd"),
("ɴn","nn"),
("ɴs", "ns"),
("ɴz","nz"),
("ɴɽ", "nɽ"),
("ɴɲ", "ɲɲ"),
])
def hiragana2IPA(text):
    orig = text
    for k, v in kana_mapper.items():
        text = text.replace(k, v)
    for k, v in nasal_sound.items():
        text = text.replace(k, v)
    return text
```

You also need to add the intonations for each word with Open JTalk, where L and H represent low tone and high tone, respectively.
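As a quick sanity check of the mapping, here is what the function above yields on two example words (outputs follow purely from the kana_mapper and nasal_sound tables):

```python
print(hiragana2IPA("ありがとう"))  # aɽˈigˈatˈoɯ
print(hiragana2IPA("さんぽ"))      # sˈampˈo (ɴ before p becomes m via nasal_sound)
```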
```
data/VCTK-Corpus/VCTK-Corpus/wav24/p275/p275_380.wav|$ɪts ɐ ɹˈiːəl pɹˈɑːbləm$$|XXXX X XXXXXX XXXXXXXXXXX|155
```
@c9412600 That was a typo that should not be included; I have fixed it. 155 is the speaker ID (never used during training, just for clarification), and X means no intonation (in contrast to 1, 2, 3, 4, 5, which represent the actual tones in Mandarin).
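A minimal sketch of reading one line of this filelist format; the field names are descriptive assumptions based on the explanation above (path, IPA phonemes, per-character tone labels, speaker ID):

```python
def parse_filelist_line(line: str):
    # path | IPA phonemes | per-character tone labels | speaker id
    path, phonemes, tones, speaker = line.strip().split("|")
    return path, phonemes, tones, int(speaker)
```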
@yl4579 Thank you for sharing so many ideas! Using the AISHELL-3 dataset, I can synthesize normal audio and it sounds good. But when generating speech for an unseen speaker, the timbre doesn't sound like the original; is there any way to improve timbre similarity for unseen speakers?
@yl4579 I would like to ask whether any changes are needed for the Vietnamese language.
@CONGLUONG12 I don't think any change is needed for Vietnamese. You only need to find a conversion table between chu quoc ngu and IPA (maybe phonemizer works for this case?) and label the tones (there should be six of them, so
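Regarding the phonemizer suggestion, a minimal sketch of what one could try for Vietnamese; whether espeak's 'vi' voice yields IPA that fits this pipeline is not verified here:

```python
from phonemizer import phonemize

# Hedged sketch: espeak-ng ships a Vietnamese voice ('vi'); output quality for
# this use case is untested.
ipa = phonemize("xin chào", language="vi", backend="espeak",
                strip=True, with_stress=True)
print(ipa)
```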
I have some questions about how to do inference in Mandarin. First: `_pad = "$"`? Second: if my ASR was trained with pinyin (like 'wo3 shi4 shui2'), not IPA, is it OK for inference?
I use pinyin for both ASR and StyleTTS, and can generate normal, good results.
Could you share some details, like `_pad = "$"`?
For Mandarin, I didn't use IPA phonemes; I used pinyin initials and finals as the phoneme set.
`_pause = ["sil", "eos", "sp", ...]`
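A hypothetical sketch of how such a pinyin initial/final inventory might be assembled into a symbol table; the concrete lists below are illustrative, not liuhuang31's actual set:

```python
_pad = "$"
_pause = ["sil", "eos", "sp"]
# Standard pinyin initials; the finals list is truncated for brevity.
_initials = ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
             "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s"]
_finals = ["a", "o", "e", "i", "u", "v", "ai", "ei", "ao", "ou",
           "an", "en", "ang", "eng", "ong", "er"]

symbols = [_pad] + _pause + _initials + _finals
symbol_to_id = {s: i for i, s in enumerate(symbols)}
```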
Thank you very much!
Sorry, I forgot to reply. I didn't change `class ProsodyPredictor(nn.Module)` in models.py.
Hi, liuhuang31
Hi, JohnHerry. For example, given the text “去上学校”:
(2) As for me, I just use the open AISHELL-3 dataset (the zhvoice dataset can also be used, but its quality is very poor). (3) Yes, "pypinyin-generated pinyin are error-prone", but in my view, if the dataset is big enough the errors will average out and be largely eliminated. Also, in my experiments with the AISHELL-3 dataset I can generate normal audio, which sounds not bad.
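As a rough illustration of the pypinyin step being discussed (a sketch, not liuhuang31's actual frontend), numbered pinyin for the example text “去上学校” can be obtained like this:

```python
from pypinyin import lazy_pinyin, Style

syllables = lazy_pinyin("去上学校", style=Style.TONE3)
print(syllables)  # e.g. ['qu4', 'shang4', 'xue2', 'xiao4']
```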
Thanks for the detailed information. It helps me a lot.
Hello, pypinyin does not perform well in some ways, so I use another phoneme set, not pypinyin's. In that case, how should I prepare the filelists, and how do I train or fine-tune?
Hi, zdj97. Pypinyin, like any other phoneme set, just converts text into phonemes, so simply use the new phoneme set. And remember to re-train the ASR model with the new phoneme set.
Hello, what tools did you use to convert the LJSpeech and LibriTTS databases to IPA?
Hi, I did not convert LJSpeech or VCTK to IPA, so I did not use the pretrained models in this repo.
Hi @yl4579, a quick question: how can I convert
to
Is the conversion done by meldataset.py during training, or do I need to write a preprocessor to convert it before training? Thanks.
@yihuitang You need to code it yourself because meldataset.py was written for English only. I have provided the conversion table, so it should not be difficult to convert it to the desired format. Unfortunately I couldn't find the exact code I used to generate the dataset, but all you need to do is split the text by spaces, get the number (tone), convert the pinyin to IPA using the table I provided, and repeat the number (tone) N times, where N is the length of the phoneme.
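A minimal sketch of that recipe, with a hypothetical `pinyin_to_ipa` dict standing in for the full conversion table:

```python
# Illustrative subset only; a real table must cover every pinyin syllable.
pinyin_to_ipa = {"zhang": "ʈʂˈɑŋ", "hui": "xˈweɪ", "yi": "ˈi"}

def convert_syllables(text: str):
    ipa_out, tone_out = [], []
    for syllable in text.split():
        base, tone = syllable[:-1], syllable[-1]   # "zhang1" -> "zhang", "1"
        ipa = pinyin_to_ipa[base]
        ipa_out.append(ipa)
        tone_out.append(tone * len(ipa))           # repeat the tone N times
    return "".join(ipa_out), "".join(tone_out)

phonemes, tones = convert_syllables("zhang1 hui4 yi2")
print(phonemes)  # ʈʂˈɑŋxˈweɪˈi
print(tones)     # 111114444422
```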
@yl4579 Thanks for your prompt reply. I'll start with the code for converting the format.
@yihuitang In my case, I separated words because I used a PL-BERT trained jointly on Chinese, Japanese, and English, and word boundaries were used when pre-training the PL-BERT, but you may not need to do that. If you do not plan to use any language model, or if your language model works at the character level (for example, your grapheme in PL-BERT is the character instead of the word), I don't think there is any difference. Note that words were separated by "%" in the AiShell dataset, so "zhang1 hui4 yi2" is one word and "chu1 yan3 de5" is another. This is why they were converted to "ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ" in my case, where the only space is between these two words, not between syllables.
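A small sketch of the "%" word-boundary handling described above; the per-word IPA conversion itself is assumed to be handled by a table covering every syllable:

```python
line = "zhang1 hui4 yi2 % chu1 yan3 de5"
words = [w.split() for w in line.split("%")]
print(words)  # [['zhang1', 'hui4', 'yi2'], ['chu1', 'yan3', 'de5']]
# Each inner list would be converted to IPA as one space-free chunk, and the
# chunks joined with a single space, e.g. "ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ".
```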
@yl4579 Thanks for your guidance. I do plan to use your PL-BERT later if I can successfully implement Mandarin in StyleTTS. I would also like to train StyleTTS with customized data, which has no "%" separators. So for the customized dataset with PL-BERT, I should use option 1 (no space). Am I right?
Hi @yl4579, a quick update: I've created a script to convert pinyin to IPA and produce filelists in the desired format for Mandarin. Here are the train and val lists for the AISHELL-3 dataset: The ProsodyPredictor class is also updated with your code above. I then tried to update meldataset.py but got stuck.
@liuhuang31 Hi, liuhuang31, thank you very much. It is helpful.
Hi, yihui
@JohnHerry Did you try adding a space in the mapping? For example:
Hello @liuhuang31,
@skysbird Hi skysbird,
If I want to use punctuation to control pauses, I must have a dataset that contains punctuation, right? And yes, I'm Chinese; maybe we are in the same WeChat group :)
@skysbird Hi, yes, your dataset must have punctuation or other pause symbols to control the pauses.
Thanks for the advice. I have noticed that there is a stress symbol in the IPA mapped from pinyin. I think Chinese utterances do not need the stress symbol, but it can be viewed as a splitter between the shengmu (initial) and yunmu (final).
We had tried a BERT-based text-prosody predictor, but it was not good enough, especially for PP (#2) and IP (#3), which got low precision and recall. What is more, the predicted text prosody (pauses) tends to be averaged, which I think is not good for multi-speaker models.
@liuhuang31 Excuse me, I see the symbols include "_pause"; does that mean the ASR text labels come from MFA results?
@sunnnnnnnny Hi, sunnnnnnnny. First use a frontend to predict the text's prosody (pauses), then use the MFA results to adjust that prosody. For ASR, #1 is not used and is removed, while #2 and #3 are treated as phonemes.
Thank you for the quick reply; I see.
@liuhuang31 Excuse me, can you share some of the first-stage training loss curves?
Thanks a lot!
Hi, liuhuang31. In the StyleTTS paper there is a "style vector" to help predict duration, speaking speed, and emotion. Does that mean there is no need for the frontend prosody anymore? Have you tried training without those prosody symbols, and how did it go?
Understood, thank you very much.
"q u sh ang x ue x iao|#1 #1 #1 #1 #0 #0 #4 #4|5 5 5 5 3 3 3 3" -> the symbols you use are YunMu without tones (tones are placed in the last column) |
@zhouyong64 You can use: Finally, I used (5) to train the model.
Hi, thanks for your helpful information. Could you also provide an inference code sample for other languages, like Chinese, for StyleTTS? Many thanks in advance.
Thank you very much for sharing. I have a question about the ˈ symbols in the phoneme sequences: are there any linguistic considerations behind them? The ˈ symbols appear to correspond to stress marks in English.
Thank you for providing the open-source StyleTTS2 code and the related work. Using the mapping table to convert pinyin to IPA phonemes seems to be inconsistent with the Chinese phonemes in the multilingual PL-BERT training dataset (https://huggingface.co/datasets/styletts2-community/multilingual-phonemes-10k-alpha/viewer/zh). The Chinese phonemes in the multilingual-phonemes-10k-alpha dataset are in a different format. Does this mean that if I want to use multilingual PL-BERT in StyleTTS2, I need to use a different mapping? text:
phonemes: