
Simplify processors - add Fasttokenizers #649

Merged · 77 commits into master · Dec 23, 2020

Conversation

@Timoeller (Contributor) commented Dec 4, 2020

Simplifying the processor by:

  • moving functions into dataset_from_dicts
  • unnesting functions
  • cleaning up code

Some of the older commits were made by Bogdan and me to get FARM working with transformers 3.5.1 and fast tokenizers.

For progress descriptions, see the comments below.

@Timoeller (Contributor, Author) commented:

270996c introduces multiprocessing after the multithreading done by the Rust tokenizers.

Although that multithreading should be finished by the time we fork, forking processes in Python results in:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

When the env var is set to false, Python multiprocessing won't start.
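For context, a minimal sketch of the pattern that triggers the warning (with the default "fork" start method on Linux). It assumes a stock transformers fast tokenizer; the warm-up call and pool setup are illustrative, not FARM's actual preprocessing code:

```python
import multiprocessing as mp

from transformers import AutoTokenizer

# A fast (Rust) tokenizer; batch-encoding here already exercises the
# Rust-side thread pool, so parallelism has "been used".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
tokenizer(["warm-up text a", "warm-up text b"])

def encode(text):
    return tokenizer(text)["input_ids"]

if __name__ == "__main__":
    # Forking after the batch call above is what triggers the
    # "process just got forked" warning quoted earlier.
    with mp.Pool(2) as pool:
        print(pool.map(encode, ["first passage", "second passage"]))
```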

My next experiment will be to add multiprocessing again at a higher level (around the call to dataset_from_dicts).

@Timoeller Timoeller changed the title WIP: Refactor processor qa WIP: Simplify processors - add Fasttokenizers Dec 4, 2020
@Timoeller (Contributor, Author) commented:

I did some performance benchmarking and found the culprit: the function offset_to_token_idx() in samples.py takes up 93% of compute time. Vectorizing the function reduced preprocessing from 199 seconds to 21 seconds!
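For illustration, the vectorized version can replace a per-offset Python loop with a single numpy searchsorted call. This is a sketch under assumed inputs (sorted token start offsets, as produced by the fast tokenizer); the function and argument names are hypothetical, not the exact code in samples.py:

```python
import numpy as np

def offset_to_token_idx_vectorized(token_offsets, char_offsets):
    """For each character offset, return the index of the token that
    contains it, i.e. the last token starting at or before the offset."""
    token_offsets = np.asarray(token_offsets)  # sorted token start positions
    char_offsets = np.asarray(char_offsets)    # character positions to map
    # side="right" finds the first token start strictly greater than the
    # offset; subtracting 1 yields the containing token.
    return np.searchsorted(token_offsets, char_offsets, side="right") - 1

# e.g. tokens starting at chars 0, 4, 9: offsets 0, 5, 10 map to tokens 0, 1, 2
print(offset_to_token_idx_vectorized([0, 4, 9], [0, 5, 10]))  # [0 1 2]
```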


I will now work on separating the create_samples_qa and sample_to_features_qa functions into more meaningful functions (rough sketch after the list below):

  1. split_documents_into_passages
  2. featurize_text_and_labels
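As a rough sketch of the intended split (the signatures and the sliding-window defaults are hypothetical, not the final FARM API):

```python
def split_documents_into_passages(doc_text, passage_len=384, stride=128):
    """Step 1: chop a document into overlapping windows so each passage
    fits the model's max sequence length. Window sizes are illustrative."""
    passages, start = [], 0
    step = passage_len - stride
    while start < len(doc_text):
        passages.append(doc_text[start:start + passage_len])
        start += step
    return passages

def featurize_text_and_labels(passage, tokenizer, max_seq_len=384):
    """Step 2: turn one passage (plus its labels, omitted here) into
    model-ready features."""
    return tokenizer(passage, truncation=True, max_length=max_seq_len)
```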

"context": f"{context}",
"label": f"{tag}",
"probability": prob,
"probability": np.float32(0.0),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandenchan
is the prob completly gone or was this just a quick fix that you forgot to revert?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old probability calculation was wrong. I have opened issue #658 to address this.

Timoeller and others added 5 commits December 18, 2020 17:55
* WIP lm finetuning refactoring

* WIP refactoring bert style lm

* first working version of bert_style_lm

* optimize speed of mask_random_words

* move get_start_of_words to tokenization module

* Update docstrings. fix estimation

* add multithreading_rust arg

* fix import. fix vocab index out of range

* fix empty sequence b

* make bert-style to new default for lm finetuning. disable eval_report

* change evaluate_every to 1000
@Timoeller (Contributor, Author) commented:

I tested this branch with haystack in the following ways:

  • running all haystack tests
  • Tutorial 1
  • Tutorial 5; when using FarmReader.eval, we need to adjust the return value of processor.dataset_from_dicts
  • Colab Tutorial 5

Working on fixing the remaining tests.

@Timoeller (Contributor, Author) commented:

I cannot reproduce the failing s3 test, and nothing has changed code-wise.

I presume it is some CI problem that we can fix later.

Merging now.

@Timoeller Timoeller merged commit 18e7fc7 into master Dec 23, 2020
@Timoeller Timoeller changed the title WIP: Simplify processors - add Fasttokenizers Simplify processors - add Fasttokenizers Dec 23, 2020
Timoeller added a commit that referenced this pull request Dec 23, 2020
* increase transformers version

* Make fast tokenizers possible

* refactor QA processing

* Move all fcts into dataset from dicts for QA

* refactor doc classification

* refactor bert_style_lm

* refactor inference_processor


Co-authored-by: Bogdan Kostić <[email protected]>
Co-authored-by: brandenchan <[email protected]>
Co-authored-by: Malte Pietsch <[email protected]>
Former-commit-id: 18e7fc7
Former-commit-id: 4fdadbe87ea1a0dbfdb02959a23e56a653d1aed2