Comparing changes

base repository: yannvgn/laserembeddings
base: v0.1.3
head repository: yannvgn/laserembeddings
compare: v1.0.0
Commits on Oct 3, 2019

  1. add romanization; also, fix preprocessing steps order (yannvgn, b89ed5c)
  2. re-add ROMAN_LC comment (yannvgn, 8fa8774)
  3. add Chinese language support (yannvgn, 557f4c1)
  4. update readme (yannvgn, 0874afa)
  5. update readme (yannvgn, 4961513)

Commits on Oct 10, 2019

  1. eb51a26
  2. d42e1e7
  3. 2555da7

Commits on Nov 1, 2019

  1. a33e397
  2. update travis build steps (yannvgn, b84e338)
  3. fix travis builds (yannvgn, 7480063)
  4. fix travis build (windows) (yannvgn, c0f26db)
  5. fix travis builds (yannvgn, dea1c81)
  6. fix travis build (windows) (yannvgn, a7296ce)
  7. update travis builds (yannvgn, 265cfd3)
  8. fix travis builds (yannvgn, c79c33c)
  9. 75d134d
  10. 386cc6c
  11. Merge pull request #10 from yannvgn/update-travis-build: Update travis build (yannvgn, 158d24e)
  12. e5f9012
  13. Merge pull request #7 from yannvgn/add-romanization: add romanization (yannvgn, 80479a5)
  14. update readme (yannvgn, 3e1a197)
  15. merge branch next (yannvgn, 0fb3719)
  16. Merge pull request #8 from yannvgn/zh-support: add Chinese language support (yannvgn, 0121919)
  17. merge branch next (yannvgn, 543d364)

Commits on Nov 3, 2019

  1. Merge pull request #11 from yannvgn/ja-support: add Japanese language support (yannvgn, 8b13f54)

Commits on Dec 5, 2019

  1. Merge pull request #9 from chiragjn/use_fastBPE_pypi: Use fastBPE package available from pypi (yannvgn, 3c3f297)
  2. revert fastBPE switch (yannvgn, 4c090f2)
  3. Merge pull request #12 from yannvgn/revert-switch-to-fastbpe: revert fastBPE switch (yannvgn, 757a9a3)
  4. c656a47

Commits on Dec 18, 2019

  1. dac8cfc
  2. 4048636
  3. 8413a13

Commits on Dec 19, 2019

  1. Merge pull request #15 from yannvgn/fix-travis-build-poetry-1: fix travis configuration (poetry 1.0.0) (yannvgn, 1df0cc2)
  2. c724ebc
  3. 65f9c11
  4. Merge pull request #14 from yannvgn/embed-sentences-multiple-langs: Allow multiple languages in Laser.embed_sentences (yannvgn, c58ea4f)
  5. update readme (yannvgn, 3c03d98)
  6. update readme (yannvgn, dbf6972)
  7. bcf6097
  8. Merge pull request #13 from yannvgn/next: improve language support (yannvgn, 6934ded)
  9. update readme (yannvgn, 1fe5e2a)
  10. v1.0.0 (yannvgn, 54dc6b4)
3 changes: 3 additions & 0 deletions .gitignore
@@ -130,3 +130,6 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# PyCharm files
+.idea/*
59 changes: 50 additions & 9 deletions .travis.yml
@@ -1,16 +1,57 @@
 dist: xenial
 language: python
-python:
-- "3.6"
-- "3.7"
-before_install:
-- pip install poetry
+jobs:
+  include:
+    - name: "Python 3.7 on Xenial Linux"
+      python: 3.7
+      before_install:
+        - python -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - pip3 install torch==1.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+    - name: "Python 3.6 on Xenial Linux"
+      python: 3.6
+      before_install:
+        - python -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - pip3 install torch==1.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+    - name: "Python 3.7 on macOS"
+      os: osx
+      osx_image: xcode11.2
+      language: shell
+      before_install:
+        - python3 -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - pip3 install virtualenv
+        - virtualenv .env
+        - source .env/bin/activate
+        - pip3 install torch
+    - name: "Python 3.7 on Windows"
+      os: windows
+      language: shell
+      before_install:
+        - choco install python --version 3.7.0
+        - python -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - poetry config virtualenvs.create false
+        - pip3 install torch==1.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+      env: PATH=/c/Python37:/c/Python37/Scripts:$PATH
+    - name: "Python 3.7 on Xenial Linux (wheel installation)"
+      python: 3.7
+      before_install:
+        - python -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - pip3 install torch==1.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+      install:
+        - poetry build
+        - pip3 install dist/laserembeddings-*.whl
+        - python -m laserembeddings download-models
+      script:
+        - python -c 'from laserembeddings import Laser; laser = Laser(); laser.embed_sentences(["test"], lang="en")'
+
 install:
-- if [[ `python --version` =~ 'Python 3.6' ]]; then pip install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp36-cp36m-linux_x86_64.whl; fi
-- if [[ `python --version` =~ 'Python 3.7' ]]; then pip install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp37-cp37m-linux_x86_64.whl; fi
-- poetry remove torch -n # fix: latest torch wheel (1.1.0.post2) not available for linux
 - poetry install -n
-- python -m laserembeddings download-models
+- python3 -m laserembeddings download-models || python -m laserembeddings download-models
+
 script:
- poetry run pylint laserembeddings
- poetry run pytest
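The new "wheel installation" job ends with a one-line smoke test. Expanded for readability, it amounts to the following minimal sketch, assuming the models have already been fetched with `python -m laserembeddings download-models`:

```python
# Sketch of the CI smoke test from the wheel-installation job above.
# Assumes `python -m laserembeddings download-models` has been run first.
from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(["test"], lang="en")
print(embeddings.shape)  # a N * 1024 array, here (1, 1024)
```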
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,11 @@
+<a name="1.0.0"></a>
+# [1.0.0](https://github.com/yannvgn/laserembeddings/compare/v0.1.3...v1.0.0) (2019-12-19)
+
+- Greek, Chinese and Japanese are now supported 🇬🇷 🇨🇳 🇯🇵
+- Some languages that were only partially supported are now fully supported (New Norwegian, Swedish, Tatar) 🌍
+- It should work on Windows now 🙄
+- Sentences in different languages can now be processed in the same batch ⚡️
+
 <a name="0.1.3"></a>
 # [0.1.3](https://github.com/yannvgn/laserembeddings/compare/v0.1.2...v0.1.3) (2019-10-03)

52 changes: 40 additions & 12 deletions README.md
@@ -7,8 +7,11 @@
 
 laserembeddings is a pip-packaged, production-ready port of Facebook Research's [LASER](https://github.com/facebookresearch/LASER) (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.
 
-🎁 **Version 0.1.3 is out. What's new?**
-- A lot of languages that were only partially supported are now fully supported (br, bs, ceb, fr, gl, oc, ug, vi) 🌍
+**Version 1.0.0 is here! What's new?**
+- Greek, Chinese and Japanese are now supported 🇬🇷 🇨🇳 🇯🇵
+- Some languages that were only partially supported are now fully supported (New Norwegian, Swedish, Tatar) 🌍
+- It should work on Windows now 🙄
+- Sentences in different languages can now be processed in the same batch ⚡️
 
 ## Context
 
@@ -32,6 +35,19 @@ You'll need Python 3.6 or higher.
 pip install laserembeddings
 ```
 
+To install laserembeddings with extra dependencies:
+
+```
+# if you need Chinese support:
+pip install laserembeddings[zh]
+
+# if you need Japanese support:
+pip install laserembeddings[ja]
+
+# or both:
+pip install laserembeddings[zh,ja]
+```
+
 ### Downloading the pre-trained models
 
 ```
@@ -47,14 +63,25 @@ from laserembeddings import Laser
 
 laser = Laser()
 
+# if all sentences are in the same language:
+
 embeddings = laser.embed_sentences(
     ['let your neural network be polyglot',
      'use multilingual embeddings!'],
-    lang='en') # lang is used for tokenization
+    lang='en') # lang is only used for tokenization
 
 # embeddings is a N*1024 (N = number of sentences) NumPy array
 ```
 
+If the sentences are not in the same language, you can pass a list of language codes:
+```python
+embeddings = laser.embed_sentences(
+    ['I love pasta.',
+     "J'adore les pâtes.",
+     'Ich liebe Pasta.'],
+    lang=['en', 'fr', 'de'])
+```
+
 If you downloaded the models into a specific directory:
 
 ```python
@@ -96,11 +123,7 @@ Here's a summary of the differences:
 |----------------------|-------------------------------------|----------------------------------------|--------|
 | Normalization / tokenization | [Moses](https://github.com/moses-smt/mosesdecoder) | [Sacremoses](https://github.com/alvations/sacremoses) | Moses is implemented in Perl |
 | BPE encoding | [fastBPE](https://github.com/glample/fastBPE) | [subword-nmt](https://github.com/rsennrich/subword-nmt) | fastBPE cannot be installed via pip and requires compiling C++ code |
-
-The following features have not been implemented yet:
-- romanize, needed to process Greek (el)
-- Chinese text segmentation, needed to process Chinese (zh, cmn, wuu and yue)
-- Japanese text segmentation, needed to process Japanese (ja, jpn)
+| Japanese segmentation (optional) | [MeCab](https://github.com/taku910/mecab) / [JapaneseTokenizer](https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers) | [mecab-python3](https://github.com/SamuraiT/mecab-python3) | mecab-python3 comes with wheels for major platforms (no compilation needed) |
 
 ## Will I get the exact same embeddings?
 
@@ -124,14 +147,14 @@ A big thanks to the creators of [Sacremoses](https://github.com/alvations/sacrem
 
 ## Testing
 
-First you'll need to checkout this repository and install it (in a virtual environment if you want). Also make sure to have [Poetry](https://github.com/sdispater/poetry) installed.
+The first thing you'll need is [Poetry](https://github.com/sdispater/poetry). Please refer to the [installation guidelines](https://poetry.eustace.io/docs/#installation).
 
+Clone this repository and install the project:
 ```
-peotry install
+poetry install
 ```
 
-Then, to run the tests:
-
+To run the tests:
 ```
 poetry run pytest
 ```
 
@@ -144,6 +167,11 @@ First, download the test data.
 python -m laserembeddings download-test-data
 ```
 
+Install extra dependencies (Chinese and Japanese support):
+```
+poetry install -E zh -E ja
+```
+
 👉 If you want to know more about the contents and the generation of the test data, check out the [laserembeddings-test-data](https://github.com/yannvgn/laserembeddings-test-data) repository.
 
 Then, run the test with `SIMILARITY_TEST` env. variable set to `1`.
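Taken together, the README changes introduce optional zh/ja extras and per-sentence language codes. A minimal sketch combining both (the sentences are illustrative; assumes `pip install laserembeddings[zh,ja]` and downloaded models):

```python
from laserembeddings import Laser

laser = Laser()

# One batch, three languages: lang is given per sentence.
embeddings = laser.embed_sentences(
    ['I love pasta.',
     '我喜欢意大利面。',     # Chinese, requires the zh extra
     'パスタが大好きです。'],  # Japanese, requires the ja extra
    lang=['en', 'zh', 'ja'])

print(embeddings.shape)  # (3, 1024)
```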
2 changes: 1 addition & 1 deletion laserembeddings/__init__.py
@@ -1,5 +1,5 @@
 from .laser import Laser
 
-__version__ = '0.1.3'
+__version__ = '1.0.0'
 
 __all__ = ['Laser']
42 changes: 29 additions & 13 deletions laserembeddings/__main__.py
@@ -3,37 +3,53 @@
 import urllib.request
 import tarfile
 
+IS_WIN = os.name == 'nt'
+
+
+def non_win_string(s):
+    return s if not IS_WIN else ''
+
+
+CONSOLE_CLEAR = non_win_string('\033[0;0m')
+CONSOLE_BOLD = non_win_string('\033[0;1m')
+CONSOLE_WAIT = non_win_string('⏳')
+CONSOLE_DONE = non_win_string('✅')
+CONSOLE_STARS = non_win_string('✨')
+CONSOLE_ERROR = non_win_string('❌')
+
 
 def print_usage():
     print('Usage:')
     print('')
     print(
-        '\033[0;1mpython -m laserembeddings download-models [OUTPUT_DIRECTORY]\033[0;0m'
+        f'{CONSOLE_BOLD}python -m laserembeddings download-models [OUTPUT_DIRECTORY]{CONSOLE_CLEAR}'
     )
     print(
         ' Downloads LASER model files. If OUTPUT_DIRECTORY is omitted,'
         '\n'
-        ' the models will be placed into the \033[0;1mdata\033[0;0m directory of the module'
+        f' the models will be placed into the {CONSOLE_BOLD}data{CONSOLE_CLEAR} directory of the module'
     )
     print('')
-    print('\033[0;1mpython -m laserembeddings download-test-data\033[0;0m')
+    print(
+        f'{CONSOLE_BOLD}python -m laserembeddings download-test-data{CONSOLE_CLEAR}'
+    )
     print(' downloads data needed to run the tests')
     print('')
 
 
 def download_file(url, dest):
-    print(f' Downloading {url}...', end='')
+    print(f'{CONSOLE_WAIT} Downloading {url}...', end='')
     sys.stdout.flush()
     urllib.request.urlretrieve(url, dest)
-    print(f'\r Downloaded {url} ')
+    print(f'\r{CONSOLE_DONE} Downloaded {url} ')
 
 
 def extract_tar(tar, output_dir):
-    print(f' Extracting archive...', end='')
+    print(f'{CONSOLE_WAIT} Extracting archive...', end='')
     sys.stdout.flush()
     with tarfile.open(tar) as t:
         t.extractall(output_dir)
-    print(f'\r Extracted archive ')
+    print(f'\r{CONSOLE_DONE} Extracted archive ')
 
 
 def download_models(output_dir):
@@ -49,22 +65,22 @@ def download_models(output_dir):
         os.path.join(output_dir, 'bilstm.93langs.2018-12-26.pt'))
 
     print('')
-    print("✨ You\'re all set!")
+    print(f'{CONSOLE_STARS} You\'re all set!')
 
 
 def download_and_extract_test_data(output_dir):
     print(f'Downloading test data into {output_dir}')
     print('')
 
     download_file(
-        'https://github.com/yannvgn/laserembeddings-test-data/releases/download/v1.0.0/laserembeddings-test-data.tar.gz',
+        'https://github.com/yannvgn/laserembeddings-test-data/releases/download/v1.0.1/laserembeddings-test-data.tar.gz',
         os.path.join(output_dir, 'laserembeddings-test-data.tar.gz'))
 
     extract_tar(os.path.join(output_dir, 'laserembeddings-test-data.tar.gz'),
                 output_dir)
 
     print('')
-    print("✨ Ready to test all that!")
+    print(f'{CONSOLE_STARS} Ready to test all that!')
 
 
 def main():
@@ -90,12 +106,12 @@ def main():
         repository_root = os.path.dirname(
             os.path.dirname(os.path.realpath(__file__)))
 
-        if os.path.basename(repository_root) != 'laserembeddings':
+        if not os.path.isfile(os.path.join(repository_root, 'pyproject.toml')):
            print(
-                "❌ Looks like you're not running laserembeddings from its source code"
+                f"{CONSOLE_ERROR} Looks like you're not running laserembeddings from its source code"
            )
            print(
-                " → please checkout https://github.com/yannvgn/laserembedings.git"
+                " → please checkout https://github.com/yannvgn/laserembeddings.git"
            )
            print(
                ' then run "python -m laserembeddings download-test-data" from the root of the repository'
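The Windows-related changes above gate all console decorations behind a small helper. The pattern, isolated as a standalone sketch:

```python
# Emoji and ANSI escape codes are emitted only on non-Windows platforms,
# where the terminal is expected to handle them.
import os

IS_WIN = os.name == 'nt'


def non_win_string(s):
    return s if not IS_WIN else ''


CONSOLE_BOLD = non_win_string('\033[0;1m')
CONSOLE_CLEAR = non_win_string('\033[0;0m')

# On Linux/macOS this prints "hello" in bold; on Windows, plain "hello".
print(f'{CONSOLE_BOLD}hello{CONSOLE_CLEAR}')
```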
6 changes: 3 additions & 3 deletions laserembeddings/embedding.py
@@ -12,11 +12,11 @@ class BPESentenceEmbedding:
     LASER embeddings computation from BPE-encoded sentences.
 
     Args:
-        encoder (str or BinaryIO): the path to LASER's encoder PyToch model,
+        encoder (str or BinaryIO): the path to LASER's encoder PyTorch model,
             or a binary-mode file object.
         max_sentences (int, optional): see ``.encoder.SentenceEncoder``.
         max_tokens (int, optional): see ``.encoder.SentenceEncoder``.
-        max_tokens (bool, optional): if True, mergesort sorting algorithm will be used,
+        stable (bool, optional): if True, mergesort sorting algorithm will be used,
             otherwise quicksort will be used. Defaults to False. See ``.encoder.SentenceEncoder``.
         cpu (bool, optional): if True, forces the use of the CPU even a GPU is available. Defaults to False.
     """
@@ -40,7 +40,7 @@ def embed_bpe_sentences(self, bpe_sentences: List[str]) -> np.ndarray:
         Computes the LASER embeddings of BPE-encoded sentences
 
         Args:
-            sentences (List[str]): The list of BPE-encoded sentences
+            bpe_sentences (List[str]): The list of BPE-encoded sentences
 
         Returns:
             np.ndarray: A N * 1024 NumPy array containing the embeddings, N being the number of sentences provided.
18 changes: 11 additions & 7 deletions laserembeddings/laser.py
@@ -57,19 +57,19 @@ def __init__(self,
         if bpe_codes is None:
             if not os.path.isfile(self.DEFAULT_BPE_CODES_FILE):
                 raise FileNotFoundError(
-                    '93langs.fcodes is missing, run "python -m laserembeddings download-models" to fix that 🔧'
+                    '93langs.fcodes is missing, run "python -m laserembeddings download-models" to fix that'
                 )
             bpe_codes = self.DEFAULT_BPE_CODES_FILE
         if bpe_vocab is None:
             if not os.path.isfile(self.DEFAULT_BPE_VOCAB_FILE):
                 raise FileNotFoundError(
-                    '93langs.fvocab is missing, run "python -m laserembeddings download-models" to fix that 🔧'
+                    '93langs.fvocab is missing, run "python -m laserembeddings download-models" to fix that'
                 )
             bpe_vocab = self.DEFAULT_BPE_VOCAB_FILE
         if encoder is None:
             if not os.path.isfile(self.DEFAULT_ENCODER_FILE):
                 raise FileNotFoundError(
-                    'bilstm.93langs.2018-12-26.pt is missing, run "python -m laserembeddings download-models" to fix that 🔧'
+                    'bilstm.93langs.2018-12-26.pt is missing, run "python -m laserembeddings download-models" to fix that'
                 )
             encoder = self.DEFAULT_ENCODER_FILE
 
@@ -88,21 +88,25 @@ def _get_tokenizer(self, lang: str) -> Tokenizer:
 
         return self.tokenizers[lang]
 
-    def embed_sentences(self, sentences: List[str], lang: str) -> np.ndarray:
+    def embed_sentences(self, sentences: Union[List[str], str],
+                        lang: Union[str, List[str]]) -> np.ndarray:
         """
         Computes the LASER embeddings of provided sentences using the tokenizer for the specified language.
 
         Args:
             sentences (List[str]): the sentences to compute the embeddings from.
-            lang (str): the language code (ISO 639-1) used to tokenize the sentences.
+            lang (str or List[str]): the language code(s) (ISO 639-1) used to tokenize the sentences
+                (either as a string - same code for every sentence - or as a list of strings - one code per sentence).
 
         Returns:
             np.ndarray: A N * 1024 NumPy array containing the embeddings, N being the number of sentences provided.
         """
+        sentences = [sentences] if isinstance(sentences, str) else sentences
+        lang = [lang] * len(sentences) if isinstance(lang, str) else lang
         with sre_performance_patch():  # see https://bugs.python.org/issue37723
             sentence_tokens = [
-                self._get_tokenizer(lang).tokenize(sentence)
-                for sentence in sentences
+                self._get_tokenizer(sentence_lang).tokenize(sentence)
+                for sentence, sentence_lang in zip(sentences, lang)
             ]
             bpe_encoded = [
                 self.bpe.encode_tokens(tokens) for tokens in sentence_tokens
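The two coercion lines added to `embed_sentences` are what allow both a single string and a list for each argument. Isolated into a hypothetical helper (`normalize_args` is illustrative only, not part of the library):

```python
from typing import List, Tuple, Union


def normalize_args(sentences: Union[List[str], str],
                   lang: Union[str, List[str]]) -> List[Tuple[str, str]]:
    # A lone sentence is wrapped into a list; a lone language code is
    # repeated so sentences and codes can be zipped pairwise.
    sentences = [sentences] if isinstance(sentences, str) else sentences
    lang = [lang] * len(sentences) if isinstance(lang, str) else lang
    return list(zip(sentences, lang))


print(normalize_args('hello', 'en'))             # [('hello', 'en')]
print(normalize_args(['a', 'b'], 'en'))          # [('a', 'en'), ('b', 'en')]
print(normalize_args(['a', 'b'], ['en', 'fr']))  # [('a', 'en'), ('b', 'fr')]
```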