Simplify processors - add Fasttokenizers #649
Conversation
270996c introduces multiprocessing after the multithreading done by the Rust tokenizers. Although the multithreading should be finished by then, forking processes in Python results in:
When setting the env var to false, Python multiprocessing won't start. My next experiment will be adding multiprocessing again at a higher level (when calling dataset_from_dicts).
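For reference, the usual way to avoid the deadlock between the Rust tokenizers' thread pool and Python's fork-based multiprocessing is to disable tokenizer parallelism before any fork happens. A minimal sketch, assuming the env var mentioned above is HuggingFace's `TOKENIZERS_PARALLELISM` (an assumption; the comment does not name it):

```python
import os

# Assumption: "the env var" refers to HuggingFace's TOKENIZERS_PARALLELISM.
# Disabling it before any fork keeps forked workers from inheriting a
# locked Rust thread pool.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from multiprocessing import Pool


def preprocess(doc):
    # Placeholder for per-chunk work such as dataset_from_dicts
    return doc.lower()


if __name__ == "__main__":
    with Pool(2) as pool:
        print(pool.map(preprocess, ["Foo", "BAR"]))  # ['foo', 'bar']
```

The key point is ordering: the environment variable must be set before the tokenizer is first used, otherwise the thread pool may already be running when the fork occurs.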
I did some performance benchmarking and found the culprit: I will now work on separating the create_samples_qa and sample_to_features_qa functions into more meaningful functions:
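One way such a culprit can be located is with Python's built-in profiler. A minimal sketch, where `create_samples_qa` is only a stand-in placeholder, not FARM's real implementation:

```python
import cProfile
import io
import pstats


def create_samples_qa(dicts):
    # Placeholder workload standing in for the real FARM function;
    # the real one splits documents into passage-level QA samples.
    return [d for d in dicts for _ in range(10)]


profiler = cProfile.Profile()
profiler.enable()
samples = create_samples_qa([{"context": "some text"}] * 1000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumtime").print_stats(5)
print(stream.getvalue())  # top 5 functions by cumulative time
```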
"context": f"{context}", | ||
"label": f"{tag}", | ||
"probability": prob, | ||
"probability": np.float32(0.0), |
@brandenchan
Is the prob completely gone, or was this just a quick fix that you forgot to revert?
The old probability calculation was wrong. I have opened issue #658 to address this.
…/FARM into refactor_processor_qa
…to refactor_processor_qa
…adjust qa benchmark to new values
* WIP lm finetuning refactoring
* WIP refactoring bert style lm
* first working version of bert_style_lm
* optimize speed of mask_random_words
* move get_start_of_words to tokenization module
* Update docstrings. fix estimation
* add multithreading_rust arg
* fix import. fix vocab index out of range
* fix empty sequence b
* make bert-style the new default for lm finetuning. disable eval_report
* change evaluate_every to 1000
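The `get_start_of_words` step in the commit list above is not shown here. As a rough illustration of the idea, start-of-word flags can be derived from the character offsets a fast tokenizer returns per token. The function name, offsets, and example below are illustrative assumptions, not FARM's actual code:

```python
def start_of_word_flags(text, offsets):
    """Mark tokens that begin a whitespace-delimited word, given
    (start, end) character offsets as produced by a fast tokenizer.

    Hypothetical helper for illustration; not FARM's get_start_of_words.
    """
    # A token starts a word if it begins the text or follows whitespace.
    return [start == 0 or text[start - 1].isspace() for start, _ in offsets]


# e.g. "unbelievable results" split into sub-word pieces
text = "unbelievable results"
offsets = [(0, 2), (2, 9), (9, 12), (13, 20)]  # un / believ / able / results
print(start_of_word_flags(text, offsets))  # [True, False, False, True]
```

Whole-word masking needs exactly these flags, so all sub-word pieces of one word can be masked together.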
…to refactor_processor_qa
I tested this branch with haystack in the following ways:
Working on fixing the remaining tests.
I cannot reproduce the failing s3 test, and nothing has changed code-wise. I presume it is some CI problem that we can fix later. Merging now.
* increase transformers version
* Make fast tokenizers possible
* refactor QA processing
* Move all fcts into dataset from dicts for QA
* refactor doc classification
* refactor bert_style_lm
* refactor inference_processor

Co-authored-by: Bogdan Kostić <[email protected]>
Co-authored-by: brandenchan <[email protected]>
Co-authored-by: Malte Pietsch <[email protected]>
Former-commit-id: 18e7fc7
Former-commit-id: 4fdadbe87ea1a0dbfdb02959a23e56a653d1aed2
Simplifying the processor by:
Some of the older commits were made by Bogdan and me to make FARM work with transformers 3.5.1 and fast tokenizers.
For descriptions of the progress, see the comments below.