About Sentence Encoders #3
Hello,

To calculate sentence embeddings for arbitrary texts, just use the existing pipeline, e.g. as it is used in `tasks/similarity/sim.sh`. It should be straightforward: you only need to call the bash functions `Tokenize` and `Embed` on your data. There is no need to calculate new BPE or binarization vocabularies. Don't hesitate to contact me again if you need further assistance.
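The `Embed` step writes its output as a raw float32 binary matrix, one row per sentence (1024 dimensions for the standard LASER encoder). As a minimal sketch of reading that output back, assuming the 1024-dimensional format and using a synthetic file in place of real pipeline output:

```python
import os
import tempfile
import numpy as np

# LASER's Embed step stores sentence embeddings as raw float32 values,
# 1024 floats per sentence (the dimension of the LASER encoder output).
DIM = 1024

def load_embeddings(path, dim=DIM):
    """Load a raw float32 embedding file into an (n_sentences, dim) matrix."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, dim)

# Round-trip demo with synthetic data (no LASER installation needed):
fake = np.random.rand(3, DIM).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "sentences.raw")
fake.tofile(path)
emb = load_embeddings(path)
```

This assumes the default embedding dimension; if you use a different model, adjust `dim` accordingly.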
Is there anything new regarding this?

Greetings,
Seb
I know that this issue is closed, but I'd like to continue the discussion. This year we had at least two big events that are expected to shift vocabulary context: BLM and, especially, everything related to COVID-19. Since embeddings essentially encode a word through the contexts it appears in, and real-world context has changed drastically, retraining on texts available today would not only add new words but also change the encodings of existing words. Do you plan to retrain the embeddings, or to give users the ability to do it themselves?
I'll reopen the issue. If necessary, it can be closed and locked again.
Hi @Fethbita! Also, there is code for training new LASER models from scratch (https://github.com/facebookresearch/fairseq/tree/nllb/examples/laser) and for distilling them for new languages (https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/laser_distillation). I hope this satisfies the request both for new embeddings and for the tools to update them yourself.
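The distillation example linked above trains a student encoder for a new language to reproduce the teacher's sentence embeddings on parallel data. As a toy illustration of that objective only (not the actual fairseq code), here is a linear "student" fitted by gradient descent on the mean-squared error to synthetic "teacher" embeddings; all names and dimensions are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: teacher embeddings T for N parallel sentences (in
# practice produced by the frozen LASER teacher), and input features X for
# the new-language side. The student is a linear map W, a stand-in for the
# transformer student trained in the real distillation recipe.
N, F, D = 64, 32, 16
X = rng.normal(size=(N, F))        # student-side input features
W_true = rng.normal(size=(F, D))
T = X @ W_true                     # teacher embeddings to imitate

W = np.zeros((F, D))               # student parameters, trained from scratch
lr = 0.01

def mse(A, B):
    return float(((A - B) ** 2).mean())

loss_before = mse(X @ W, T)
for _ in range(200):
    E = X @ W - T                  # residual against the teacher
    W -= lr * (2.0 / N) * X.T @ E  # gradient step on the MSE objective
loss_after = mse(X @ W, T)
```

The point is only the training signal: the student never sees labels, just the teacher's embedding space to match.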
The `install_models.sh` file downloads three files: the `blstm.ep7.9langs-v1.bpej20k.model.py` file and the two files `ep7.9langs-v1.bpej20k.bin.9xx` and `ep7.9langs-v1.bpej20k.codes.9xx`.

The `mlenc.py` file says that `bpe_codes` is a "File with BPE codes (created by learn_bpe.py)", and the research paper mentions a "20k joint vocabulary for all the nine languages". I created this with `learn_bpe.py` on my own data as described, but I don't quite understand how to create the other two: `hash_table` ("File with hash table for binarization.") and `model` ("File with trained model used for encoding."). Any idea on how I can create `hash_table` and `model`? I couldn't find any documentation about them or sample code to train them. Thanks in advance.