About Sentence Encoders #3
Hello,

To calculate sentence embeddings for arbitrary texts, just use the existing pipeline, e.g. as it is used in `tasks/similarity/sim.sh`. It should be straightforward: you only need to call the bash functions `Tokenize` and `Embed` on your data. There is no need to calculate new BPE or binarization vocabularies. Don't hesitate to contact me again if you need further assistance.
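The `Embed` step writes its output as a raw float32 binary matrix, one row per sentence (1024 dimensions for the standard LASER encoder). As a minimal sketch of reading that output back, assuming the 1024-dimensional format and using a synthetic file in place of real pipeline output:

```python
import os
import tempfile
import numpy as np

# LASER's Embed step stores sentence embeddings as raw float32 values,
# 1024 floats per sentence (the dimension of the LASER encoder output).
DIM = 1024

def load_embeddings(path, dim=DIM):
    """Load a raw float32 embedding file into an (n_sentences, dim) matrix."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, dim)

# Round-trip demo with synthetic data (no LASER installation needed):
fake = np.random.rand(3, DIM).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "sentences.raw")
fake.tofile(path)
emb = load_embeddings(path)
```

This assumes the default embedding dimension; if you use a different model, adjust `dim` accordingly.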
Is there anything new regarding this?

Greetings,
Seb
I know that this issue is closed, but I'd like to continue the discussion. This year we had at least two big events that are expected to shift vocabulary context: BLM and, especially, everything related to COVID-19. Since embeddings essentially encode a word through the contexts it appears in, and real-world context has changed drastically, retraining on texts available today would not only add new words but also change the encodings of existing words. Do you plan to retrain the embeddings, or to give users the ability to do it themselves?
I'll reopen the issue. If necessary, it can be closed and locked again.
Hi @Fethbita! Also, there is code for training new LASER models from scratch (https://github.com/facebookresearch/fairseq/tree/nllb/examples/laser) and for distilling them for new languages (https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/laser_distillation). I hope this satisfies the request both for new embeddings and for the tools to update them yourself.
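The distillation example linked above trains a student encoder for a new language to reproduce the teacher's sentence embeddings on parallel data. As a toy illustration of that objective only (not the actual fairseq code), here is a linear "student" fitted by gradient descent on the mean-squared error to synthetic "teacher" embeddings; all names and dimensions are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: teacher embeddings T for N parallel sentences (in
# practice produced by the frozen LASER teacher), and input features X for
# the new-language side. The student is a linear map W, a stand-in for the
# transformer student trained in the real distillation recipe.
N, F, D = 64, 32, 16
X = rng.normal(size=(N, F))        # student-side input features
W_true = rng.normal(size=(F, D))
T = X @ W_true                     # teacher embeddings to imitate

W = np.zeros((F, D))               # student parameters, trained from scratch
lr = 0.01

def mse(A, B):
    return float(((A - B) ** 2).mean())

loss_before = mse(X @ W, T)
for _ in range(200):
    E = X @ W - T                  # residual against the teacher
    W -= lr * (2.0 / N) * X.T @ E  # gradient step on the MSE objective
loss_after = mse(X @ W, T)
```

The point is only the training signal: the student never sees labels, just the teacher's embedding space to match.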
The `install_models.sh` file downloads three files: the `blstm.ep7.9langs-v1.bpej20k.model.py` file and the two files `ep7.9langs-v1.bpej20k.bin.9xx` and `ep7.9langs-v1.bpej20k.codes.9xx`.

The `mlenc.py` file says that `bpe_codes` is a "File with BPE codes (created by learn_bpe.py)", and the research paper mentions a "20k joint vocabulary for all the nine languages". I created this with `learn_bpe.py` on my own data as described, but I don't quite understand how to create the other two: `hash_table` ("File with hash table for binarization.") and `model` ("File with trained model used for encoding."). Any idea on how I can create `hash_table` and `model`? I couldn't find any documentation about them or sample code to train them. Thanks in advance.