
Simplify processors - add Fasttokenizers #649

Merged · 77 commits into master · Dec 23, 2020

Conversation

@Timoeller (Contributor) commented Dec 4, 2020

Simplifying the processor by:

  • moving functions into dataset_from_dicts
  • unnesting functions
  • cleaning up code

Some of the older commits were made by Bogdan and me to get FARM working with transformers 3.5.1 and fast tokenizers.

For progress descriptions, see the comments below.

@Timoeller (Contributor, Author) commented:

270996c introduces multiprocessing after the multithreading done by the Rust tokenizers.

Although that multithreading should be finished by the time we fork, forking processes in Python results in:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

When the env var is set to false, Python multiprocessing won't start.
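For context, a minimal sketch of the pattern that triggers the warning (with the default "fork" start method on Linux). It assumes a stock transformers fast tokenizer; the warm-up call and pool setup are illustrative, not FARM's actual preprocessing code:

```python
import multiprocessing as mp

from transformers import AutoTokenizer

# A fast (Rust) tokenizer; batch-encoding here already exercises the
# Rust-side thread pool, so parallelism has "been used".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
tokenizer(["warm-up text a", "warm-up text b"])

def encode(text):
    return tokenizer(text)["input_ids"]

if __name__ == "__main__":
    # Forking after the batch call above is what triggers the
    # "process just got forked" warning quoted earlier.
    with mp.Pool(2) as pool:
        print(pool.map(encode, ["first passage", "second passage"]))
```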

My next experiment will be to add multiprocessing again at a higher level (around the call to dataset_from_dicts).

@Timoeller Timoeller changed the title WIP: Refactor processor qa WIP: Simplify processors - add Fasttokenizers Dec 4, 2020
@Timoeller (Contributor, Author) commented:

I did some performance benchmarking and found the culprit: the function offset_to_token_idx() in samples.py takes up 93% of compute time. Vectorizing the function reduced preprocessing from 199 seconds to 21 seconds!
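For illustration, the vectorized version can replace a per-offset Python loop with a single numpy searchsorted call. This is a sketch under assumed inputs (sorted token start offsets, as produced by the fast tokenizer); the function and argument names are hypothetical, not the exact code in samples.py:

```python
import numpy as np

def offset_to_token_idx_vectorized(token_offsets, char_offsets):
    """For each character offset, return the index of the token that
    contains it, i.e. the last token starting at or before the offset."""
    token_offsets = np.asarray(token_offsets)  # sorted token start positions
    char_offsets = np.asarray(char_offsets)    # character positions to map
    # side="right" finds the first token start strictly greater than the
    # offset; subtracting 1 yields the containing token.
    return np.searchsorted(token_offsets, char_offsets, side="right") - 1

# e.g. tokens starting at chars 0, 4, 9: offsets 0, 5, 10 map to tokens 0, 1, 2
print(offset_to_token_idx_vectorized([0, 4, 9], [0, 5, 10]))  # [0 1 2]
```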


I will now work on separating the create_samples_qa and sample_to_features_qa functions into more meaningful functions (rough sketch after the list below):

  1. split_documents_into_passages
  2. featurize_text_and_labels
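As a rough sketch of the intended split (the signatures and the sliding-window defaults are hypothetical, not the final FARM API):

```python
def split_documents_into_passages(doc_text, passage_len=384, stride=128):
    """Step 1: chop a document into overlapping windows so each passage
    fits the model's max sequence length. Window sizes are illustrative."""
    passages, start = [], 0
    step = passage_len - stride
    while start < len(doc_text):
        passages.append(doc_text[start:start + passage_len])
        start += step
    return passages

def featurize_text_and_labels(passage, tokenizer, max_seq_len=384):
    """Step 2: turn one passage (plus its labels, omitted here) into
    model-ready features."""
    return tokenizer(passage, truncation=True, max_length=max_seq_len)
```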

"context": f"{context}",
"label": f"{tag}",
"probability": prob,
"probability": np.float32(0.0),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandenchan
is the prob completly gone or was this just a quick fix that you forgot to revert?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old probability calculation was wrong. I have opened issue #658 to address this.

Timoeller and others added 5 commits December 18, 2020 17:55
* WIP lm finetuning refactoring

* WIP refactoring bert style lm

* first working version of bert_style_lm

* optimize speed of mask_random_words

* move get_start_of_words to tokenization module

* Update docstrings. fix estimation

* add multithreading_rust arg

* fix import. fix vocab index out of range

* fix empty sequence b

* make bert-style to new default for lm finetuning. disable eval_report

* change evaluate_every to 1000
@Timoeller (Contributor, Author) commented:

I tested this branch with haystack in the following ways:

  • running all haystack tests
  • Tutorial 1
  • Tutorial 5; when using FarmReader.eval, we need to adjust the return value of processor.dataset_from_dicts
  • Colab Tutorial 5

Working on fixing the remaining tests.

@Timoeller (Contributor, Author) commented:

I cannot reproduce the failing s3 test, and nothing has changed code-wise.

I presume it is some CI problem that we can fix later.

Merging now.

@Timoeller Timoeller merged commit 18e7fc7 into master Dec 23, 2020
@Timoeller Timoeller changed the title WIP: Simplify processors - add Fasttokenizers Simplify processors - add Fasttokenizers Dec 23, 2020
Timoeller added a commit that referenced this pull request Dec 23, 2020
* increase transformers version

* Make fast tokenizers possible

* refactor QA processing

* Move all fcts into dataset from dicts for QA

* refactor doc classification

* refactor bert_style_lm

* refactor inference_processor


Co-authored-by: Bogdan Kostić <[email protected]>
Co-authored-by: brandenchan <[email protected]>
Co-authored-by: Malte Pietsch <[email protected]>
Former-commit-id: 18e7fc7
Former-commit-id: 4fdadbe87ea1a0dbfdb02959a23e56a653d1aed2