Comparing changes

base repository: yannvgn/laserembeddings
base: v0.1.3
head repository: yannvgn/laserembeddings
compare: v1.0.0
Commits on Oct 3, 2019

  1. add romanization; also, fix preprocessing steps order (yannvgn, b89ed5c)
  2. re-add ROMAN_LC comment (yannvgn, 8fa8774)
  3. add Chinese language support (yannvgn, 557f4c1)
  4. update readme (yannvgn, 0874afa)
  5. update readme (yannvgn, 4961513)

Commits on Oct 10, 2019

  1. eb51a26
  2. d42e1e7
  3. 2555da7

Commits on Nov 1, 2019

  1. a33e397
  2. update travis build steps (yannvgn, b84e338)
  3. fix travis builds (yannvgn, 7480063)
  4. fix travis build (windows) (yannvgn, c0f26db)
  5. fix travis builds (yannvgn, dea1c81)
  6. fix travis build (windows) (yannvgn, a7296ce)
  7. update travis builds (yannvgn, 265cfd3)
  8. fix travis builds (yannvgn, c79c33c)
  9. 75d134d
  10. 386cc6c
  11. Merge pull request #10 from yannvgn/update-travis-build: Update travis build (yannvgn, 158d24e)
  12. e5f9012
  13. Merge pull request #7 from yannvgn/add-romanization: add romanization (yannvgn, 80479a5)
  14. update readme (yannvgn, 3e1a197)
  15. merge branch next (yannvgn, 0fb3719)
  16. Merge pull request #8 from yannvgn/zh-support: add Chinese language support (yannvgn, 0121919)
  17. merge branch next (yannvgn, 543d364)

Commits on Nov 3, 2019

  1. Merge pull request #11 from yannvgn/ja-support: add Japanese language support (yannvgn, 8b13f54)

Commits on Dec 5, 2019

  1. Merge pull request #9 from chiragjn/use_fastBPE_pypi: Use fastBPE package available from pypi (yannvgn, 3c3f297)
  2. revert fastBPE switch (yannvgn, 4c090f2)
  3. Merge pull request #12 from yannvgn/revert-switch-to-fastbpe: revert fastBPE switch (yannvgn, 757a9a3)
  4. c656a47

Commits on Dec 18, 2019

  1. dac8cfc
  2. 4048636
  3. 8413a13

Commits on Dec 19, 2019

  1. Merge pull request #15 from yannvgn/fix-travis-build-poetry-1: fix travis configuration (poetry 1.0.0) (yannvgn, 1df0cc2)
  2. c724ebc
  3. 65f9c11
  4. Merge pull request #14 from yannvgn/embed-sentences-multiple-langs: Allow multiple languages in Laser.embed_sentences (yannvgn, c58ea4f)
  5. update readme (yannvgn, 3c03d98)
  6. update readme (yannvgn, dbf6972)
  7. bcf6097
  8. Merge pull request #13 from yannvgn/next: improve language support (yannvgn, 6934ded)
  9. update readme (yannvgn, 1fe5e2a)
  10. v1.0.0 (yannvgn, 54dc6b4)
3 changes: 3 additions & 0 deletions .gitignore
@@ -130,3 +130,6 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# PyCharm files
+.idea/*
59 changes: 50 additions & 9 deletions .travis.yml
@@ -1,16 +1,57 @@
 dist: xenial
 language: python
-python:
-- "3.6"
-- "3.7"
-before_install:
-- pip install poetry
+jobs:
+  include:
+    - name: "Python 3.7 on Xenial Linux"
+      python: 3.7
+      before_install:
+        - python -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - pip3 install torch==1.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+    - name: "Python 3.6 on Xenial Linux"
+      python: 3.6
+      before_install:
+        - python -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - pip3 install torch==1.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+    - name: "Python 3.7 on macOS"
+      os: osx
+      osx_image: xcode11.2
+      language: shell
+      before_install:
+        - python3 -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - pip3 install virtualenv
+        - virtualenv .env
+        - source .env/bin/activate
+        - pip3 install torch
+    - name: "Python 3.7 on Windows"
+      os: windows
+      language: shell
+      before_install:
+        - choco install python --version 3.7.0
+        - python -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - poetry config virtualenvs.create false
+        - pip3 install torch==1.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+      env: PATH=/c/Python37:/c/Python37/Scripts:$PATH
+    - name: "Python 3.7 on Xenial Linux (wheel installation)"
+      python: 3.7
+      before_install:
+        - python -m pip install --upgrade pip
+        - pip3 install poetry==1.0.*
+        - pip3 install torch==1.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+      install:
+        - poetry build
+        - pip3 install dist/laserembeddings-*.whl
+        - python -m laserembeddings download-models
+      script:
+        - python -c 'from laserembeddings import Laser; laser = Laser(); laser.embed_sentences(["test"], lang="en")'
+
 install:
-- if [[ `python --version` =~ 'Python 3.6' ]]; then pip install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp36-cp36m-linux_x86_64.whl; fi
-- if [[ `python --version` =~ 'Python 3.7' ]]; then pip install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp37-cp37m-linux_x86_64.whl; fi
-- poetry remove torch -n # fix: latest torch wheel (1.1.0.post2) not available for linux
 - poetry install -n
-- python -m laserembeddings download-models
+- python3 -m laserembeddings download-models || python -m laserembeddings download-models
+
 script:
- poetry run pylint laserembeddings
- poetry run pytest
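The new "wheel installation" job ends with a one-line smoke test. Expanded for readability, it amounts to the following minimal sketch, assuming the models have already been fetched with `python -m laserembeddings download-models`:

```python
# Sketch of the CI smoke test from the wheel-installation job above.
# Assumes `python -m laserembeddings download-models` has been run first.
from laserembeddings import Laser

laser = Laser()
embeddings = laser.embed_sentences(["test"], lang="en")
print(embeddings.shape)  # a N * 1024 array, here (1, 1024)
```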
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,11 @@
+<a name="1.0.0"></a>
+# [1.0.0](https://github.com/yannvgn/laserembeddings/compare/v0.1.3...v1.0.0) (2019-12-19)
+
+- Greek, Chinese and Japanese are now supported 🇬🇷 🇨🇳 🇯🇵
+- Some languages that were only partially supported are now fully supported (New Norwegian, Swedish, Tatar) 🌍
+- It should work on Windows now 🙄
+- Sentences in different languages can now be processed in the same batch ⚡️
+
 <a name="0.1.3"></a>
 # [0.1.3](https://github.com/yannvgn/laserembeddings/compare/v0.1.2...v0.1.3) (2019-10-03)

52 changes: 40 additions & 12 deletions README.md
@@ -7,8 +7,11 @@
 
 laserembeddings is a pip-packaged, production-ready port of Facebook Research's [LASER](https://github.com/facebookresearch/LASER) (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.
 
-🎁 **Version 0.1.3 is out. What's new?**
-- A lot of languages that were only partially supported are now fully supported (br, bs, ceb, fr, gl, oc, ug, vi) 🌍
+**Version 1.0.0 is here! What's new?**
+- Greek, Chinese and Japanese are now supported 🇬🇷 🇨🇳 🇯🇵
+- Some languages that were only partially supported are now fully supported (New Norwegian, Swedish, Tatar) 🌍
+- It should work on Windows now 🙄
+- Sentences in different languages can now be processed in the same batch ⚡️
 
 ## Context
 
@@ -32,6 +35,19 @@ You'll need Python 3.6 or higher.
 pip install laserembeddings
 ```
 
+To install laserembeddings with extra dependencies:
+
+```
+# if you need Chinese support:
+pip install laserembeddings[zh]
+
+# if you need Japanese support:
+pip install laserembeddings[ja]
+
+# or both:
+pip install laserembeddings[zh,ja]
+```
+
 ### Downloading the pre-trained models
 
 ```
@@ -47,14 +63,25 @@ from laserembeddings import Laser
 
 laser = Laser()
 
+# if all sentences are in the same language:
+
 embeddings = laser.embed_sentences(
     ['let your neural network be polyglot',
      'use multilingual embeddings!'],
-    lang='en') # lang is used for tokenization
+    lang='en') # lang is only used for tokenization
 
 # embeddings is a N*1024 (N = number of sentences) NumPy array
 ```
 
+If the sentences are not in the same language, you can pass a list of language codes:
+```python
+embeddings = laser.embed_sentences(
+    ['I love pasta.',
+     "J'adore les pâtes.",
+     'Ich liebe Pasta.'],
+    lang=['en', 'fr', 'de'])
+```
+
 If you downloaded the models into a specific directory:
 
 ```python
@@ -96,11 +123,7 @@ Here's a summary of the differences:
 |----------------------|-------------------------------------|----------------------------------------|--------|
 | Normalization / tokenization | [Moses](https://github.com/moses-smt/mosesdecoder) | [Sacremoses](https://github.com/alvations/sacremoses) | Moses is implemented in Perl |
 | BPE encoding | [fastBPE](https://github.com/glample/fastBPE) | [subword-nmt](https://github.com/rsennrich/subword-nmt) | fastBPE cannot be installed via pip and requires compiling C++ code |
-
-The following features have not been implemented yet:
-- romanize, needed to process Greek (el)
-- Chinese text segmentation, needed to process Chinese (zh, cmn, wuu and yue)
-- Japanese text segmentation, needed to process Japanese (ja, jpn)
+| Japanese segmentation (optional) | [MeCab](https://github.com/taku910/mecab) / [JapaneseTokenizer](https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers) | [mecab-python3](https://github.com/SamuraiT/mecab-python3) | mecab-python3 comes with wheels for major platforms (no compilation needed) |
 
 ## Will I get the exact same embeddings?
 
@@ -124,14 +147,14 @@ A big thanks to the creators of [Sacremoses](https://github.com/alvations/sacrem
 
 ## Testing
 
-First you'll need to checkout this repository and install it (in a virtual environment if you want). Also make sure to have [Poetry](https://github.com/sdispater/poetry) installed.
+The first thing you'll need is [Poetry](https://github.com/sdispater/poetry). Please refer to the [installation guidelines](https://poetry.eustace.io/docs/#installation).
 
+Clone this repository and install the project:
 ```
-peotry install
+poetry install
 ```
 
-Then, to run the tests:
-
+To run the tests:
 ```
 poetry run pytest
 ```
 
@@ -144,6 +167,11 @@ First, download the test data.
 python -m laserembeddings download-test-data
 ```
 
+Install extra dependencies (Chinese and Japanese support):
+```
+poetry install -E zh -E ja
+```
+
 👉 If you want to know more about the contents and the generation of the test data, check out the [laserembeddings-test-data](https://github.com/yannvgn/laserembeddings-test-data) repository.
 
 Then, run the test with `SIMILARITY_TEST` env. variable set to `1`.
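Taken together, the README changes introduce optional zh/ja extras and per-sentence language codes. A minimal sketch combining both (the sentences are illustrative; assumes `pip install laserembeddings[zh,ja]` and downloaded models):

```python
from laserembeddings import Laser

laser = Laser()

# One batch, three languages: lang is given per sentence.
embeddings = laser.embed_sentences(
    ['I love pasta.',
     '我喜欢意大利面。',     # Chinese, requires the zh extra
     'パスタが大好きです。'],  # Japanese, requires the ja extra
    lang=['en', 'zh', 'ja'])

print(embeddings.shape)  # (3, 1024)
```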
2 changes: 1 addition & 1 deletion laserembeddings/__init__.py
@@ -1,5 +1,5 @@
 from .laser import Laser
 
-__version__ = '0.1.3'
+__version__ = '1.0.0'
 
 __all__ = ['Laser']
42 changes: 29 additions & 13 deletions laserembeddings/__main__.py
@@ -3,37 +3,53 @@
 import urllib.request
 import tarfile
 
+IS_WIN = os.name == 'nt'
+
+
+def non_win_string(s):
+    return s if not IS_WIN else ''
+
+
+CONSOLE_CLEAR = non_win_string('\033[0;0m')
+CONSOLE_BOLD = non_win_string('\033[0;1m')
+CONSOLE_WAIT = non_win_string('⏳')
+CONSOLE_DONE = non_win_string('✅')
+CONSOLE_STARS = non_win_string('✨')
+CONSOLE_ERROR = non_win_string('❌')
+
 
 def print_usage():
     print('Usage:')
     print('')
     print(
-        '\033[0;1mpython -m laserembeddings download-models [OUTPUT_DIRECTORY]\033[0;0m'
+        f'{CONSOLE_BOLD}python -m laserembeddings download-models [OUTPUT_DIRECTORY]{CONSOLE_CLEAR}'
     )
     print(
         ' Downloads LASER model files. If OUTPUT_DIRECTORY is omitted,'
         '\n'
-        ' the models will be placed into the \033[0;1mdata\033[0;0m directory of the module'
+        f' the models will be placed into the {CONSOLE_BOLD}data{CONSOLE_CLEAR} directory of the module'
     )
     print('')
-    print('\033[0;1mpython -m laserembeddings download-test-data\033[0;0m')
+    print(
+        f'{CONSOLE_BOLD}python -m laserembeddings download-test-data{CONSOLE_CLEAR}'
+    )
     print(' downloads data needed to run the tests')
     print('')
 
 
 def download_file(url, dest):
-    print(f' Downloading {url}...', end='')
+    print(f'{CONSOLE_WAIT} Downloading {url}...', end='')
     sys.stdout.flush()
     urllib.request.urlretrieve(url, dest)
-    print(f'\r Downloaded {url} ')
+    print(f'\r{CONSOLE_DONE} Downloaded {url} ')
 
 
 def extract_tar(tar, output_dir):
-    print(f' Extracting archive...', end='')
+    print(f'{CONSOLE_WAIT} Extracting archive...', end='')
     sys.stdout.flush()
     with tarfile.open(tar) as t:
         t.extractall(output_dir)
-    print(f'\r Extracted archive ')
+    print(f'\r{CONSOLE_DONE} Extracted archive ')
 
 
 def download_models(output_dir):
@@ -49,22 +65,22 @@ def download_models(output_dir):
         os.path.join(output_dir, 'bilstm.93langs.2018-12-26.pt'))
 
     print('')
-    print("✨ You\'re all set!")
+    print(f'{CONSOLE_STARS} You\'re all set!')
 
 
 def download_and_extract_test_data(output_dir):
     print(f'Downloading test data into {output_dir}')
     print('')
 
     download_file(
-        'https://github.com/yannvgn/laserembeddings-test-data/releases/download/v1.0.0/laserembeddings-test-data.tar.gz',
+        'https://github.com/yannvgn/laserembeddings-test-data/releases/download/v1.0.1/laserembeddings-test-data.tar.gz',
         os.path.join(output_dir, 'laserembeddings-test-data.tar.gz'))
 
     extract_tar(os.path.join(output_dir, 'laserembeddings-test-data.tar.gz'),
                 output_dir)
 
     print('')
-    print("✨ Ready to test all that!")
+    print(f'{CONSOLE_STARS} Ready to test all that!')
 
 
 def main():
@@ -90,12 +106,12 @@ def main():
         repository_root = os.path.dirname(
             os.path.dirname(os.path.realpath(__file__)))
 
-        if os.path.basename(repository_root) != 'laserembeddings':
+        if not os.path.isfile(os.path.join(repository_root, 'pyproject.toml')):
            print(
-                "❌ Looks like you're not running laserembeddings from its source code"
+                f"{CONSOLE_ERROR} Looks like you're not running laserembeddings from its source code"
            )
            print(
-                " → please checkout https://github.com/yannvgn/laserembedings.git"
+                " → please checkout https://github.com/yannvgn/laserembeddings.git"
            )
            print(
                ' then run "python -m laserembeddings download-test-data" from the root of the repository'
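The Windows-related changes above gate all console decorations behind a small helper. The pattern, isolated as a standalone sketch:

```python
# Emoji and ANSI escape codes are emitted only on non-Windows platforms,
# where the terminal is expected to handle them.
import os

IS_WIN = os.name == 'nt'


def non_win_string(s):
    return s if not IS_WIN else ''


CONSOLE_BOLD = non_win_string('\033[0;1m')
CONSOLE_CLEAR = non_win_string('\033[0;0m')

# On Linux/macOS this prints "hello" in bold; on Windows, plain "hello".
print(f'{CONSOLE_BOLD}hello{CONSOLE_CLEAR}')
```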
6 changes: 3 additions & 3 deletions laserembeddings/embedding.py
@@ -12,11 +12,11 @@ class BPESentenceEmbedding:
     LASER embeddings computation from BPE-encoded sentences.
 
     Args:
-        encoder (str or BinaryIO): the path to LASER's encoder PyToch model,
+        encoder (str or BinaryIO): the path to LASER's encoder PyTorch model,
             or a binary-mode file object.
         max_sentences (int, optional): see ``.encoder.SentenceEncoder``.
         max_tokens (int, optional): see ``.encoder.SentenceEncoder``.
-        max_tokens (bool, optional): if True, mergesort sorting algorithm will be used,
+        stable (bool, optional): if True, mergesort sorting algorithm will be used,
             otherwise quicksort will be used. Defaults to False. See ``.encoder.SentenceEncoder``.
         cpu (bool, optional): if True, forces the use of the CPU even a GPU is available. Defaults to False.
     """
@@ -40,7 +40,7 @@ def embed_bpe_sentences(self, bpe_sentences: List[str]) -> np.ndarray:
         Computes the LASER embeddings of BPE-encoded sentences
 
         Args:
-            sentences (List[str]): The list of BPE-encoded sentences
+            bpe_sentences (List[str]): The list of BPE-encoded sentences
 
         Returns:
             np.ndarray: A N * 1024 NumPy array containing the embeddings, N being the number of sentences provided.
18 changes: 11 additions & 7 deletions laserembeddings/laser.py
@@ -57,19 +57,19 @@ def __init__(self,
         if bpe_codes is None:
             if not os.path.isfile(self.DEFAULT_BPE_CODES_FILE):
                 raise FileNotFoundError(
-                    '93langs.fcodes is missing, run "python -m laserembeddings download-models" to fix that 🔧'
+                    '93langs.fcodes is missing, run "python -m laserembeddings download-models" to fix that'
                 )
             bpe_codes = self.DEFAULT_BPE_CODES_FILE
         if bpe_vocab is None:
             if not os.path.isfile(self.DEFAULT_BPE_VOCAB_FILE):
                 raise FileNotFoundError(
-                    '93langs.fvocab is missing, run "python -m laserembeddings download-models" to fix that 🔧'
+                    '93langs.fvocab is missing, run "python -m laserembeddings download-models" to fix that'
                 )
             bpe_vocab = self.DEFAULT_BPE_VOCAB_FILE
         if encoder is None:
             if not os.path.isfile(self.DEFAULT_ENCODER_FILE):
                 raise FileNotFoundError(
-                    'bilstm.93langs.2018-12-26.pt is missing, run "python -m laserembeddings download-models" to fix that 🔧'
+                    'bilstm.93langs.2018-12-26.pt is missing, run "python -m laserembeddings download-models" to fix that'
                 )
             encoder = self.DEFAULT_ENCODER_FILE
 
@@ -88,21 +88,25 @@ def _get_tokenizer(self, lang: str) -> Tokenizer:
 
         return self.tokenizers[lang]
 
-    def embed_sentences(self, sentences: List[str], lang: str) -> np.ndarray:
+    def embed_sentences(self, sentences: Union[List[str], str],
+                        lang: Union[str, List[str]]) -> np.ndarray:
         """
         Computes the LASER embeddings of provided sentences using the tokenizer for the specified language.
 
         Args:
             sentences (List[str]): the sentences to compute the embeddings from.
-            lang (str): the language code (ISO 639-1) used to tokenize the sentences.
+            lang (str or List[str]): the language code(s) (ISO 639-1) used to tokenize the sentences
+                (either as a string - same code for every sentence - or as a list of strings - one code per sentence).
 
         Returns:
             np.ndarray: A N * 1024 NumPy array containing the embeddings, N being the number of sentences provided.
         """
+        sentences = [sentences] if isinstance(sentences, str) else sentences
+        lang = [lang] * len(sentences) if isinstance(lang, str) else lang
         with sre_performance_patch():  # see https://bugs.python.org/issue37723
             sentence_tokens = [
-                self._get_tokenizer(lang).tokenize(sentence)
-                for sentence in sentences
+                self._get_tokenizer(sentence_lang).tokenize(sentence)
+                for sentence, sentence_lang in zip(sentences, lang)
             ]
             bpe_encoded = [
                 self.bpe.encode_tokens(tokens) for tokens in sentence_tokens
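The two coercion lines added to `embed_sentences` are what allow both a single string and a list for each argument. Isolated into a hypothetical helper (`normalize_args` is illustrative only, not part of the library):

```python
from typing import List, Tuple, Union


def normalize_args(sentences: Union[List[str], str],
                   lang: Union[str, List[str]]) -> List[Tuple[str, str]]:
    # A lone sentence is wrapped into a list; a lone language code is
    # repeated so sentences and codes can be zipped pairwise.
    sentences = [sentences] if isinstance(sentences, str) else sentences
    lang = [lang] * len(sentences) if isinstance(lang, str) else lang
    return list(zip(sentences, lang))


print(normalize_args('hello', 'en'))             # [('hello', 'en')]
print(normalize_args(['a', 'b'], 'en'))          # [('a', 'en'), ('b', 'en')]
print(normalize_args(['a', 'b'], ['en', 'fr']))  # [('a', 'en'), ('b', 'fr')]
```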