Simplify processors - add Fasttokenizers #649

Merged · 77 commits · Dec 23, 2020

Commits
6b5e0a1
increase transformers version
Timoeller Nov 13, 2020
a96aca7
use correct version
Timoeller Nov 13, 2020
1ea854d
Add tholors proposed changes for ed0243bac6cb2897e15ec8444c8c9701ce51…
Timoeller Nov 13, 2020
f5c77bc
Remove unused imports
Timoeller Nov 13, 2020
2f93100
Adjust tests
Timoeller Nov 13, 2020
6663847
Remove test
Timoeller Nov 16, 2020
6a9c723
Add transformers bugfix in 3.5.1
Timoeller Nov 17, 2020
374e362
Make fast tokenizers possible (not finished)
bogdankostic Nov 22, 2020
b3cb744
Refactor initializing and featurizing samples for FastTokenizers
bogdankostic Nov 25, 2020
8d2152b
Merge branch 'master' into update_transformers_3.5.0
bogdankostic Nov 25, 2020
53d533a
Make code more readable
bogdankostic Nov 25, 2020
a40d0d1
Add transformers bugfix in 3.5.1
Timoeller Nov 17, 2020
07847aa
Merge commit '6a9c723c8a6be090dcfb99f3c5b18d08afc19d9f' into refactor…
Timoeller Dec 3, 2020
10ecdb6
Merge commit '53d533a65cd1200573bd0f5f97e48c7b74ec7f8d' into refactor…
Timoeller Dec 3, 2020
38a79d6
Remove tokenization into strings, directly convert text to ids
Timoeller Dec 3, 2020
2917335
Enable slow mode besides fast
Timoeller Dec 3, 2020
a6171ec
Move all fcts into dataset from dicts for QA
Timoeller Dec 3, 2020
270996c
Add python multiprocessing draft
Timoeller Dec 4, 2020
a23daab
refactor doc classification
brandenchan Dec 4, 2020
4779077
Add inference flag
brandenchan Dec 4, 2020
49d5a7c
Enable multiprocessing on high level again
Timoeller Dec 7, 2020
801108b
Fix dataset duplication
brandenchan Dec 7, 2020
7d10f74
Remove inference flag
brandenchan Dec 7, 2020
a0bd77a
Trigger CI for PR
brandenchan Dec 7, 2020
17eeb9e
Refactored fill baskets, vectorized offset mappings, cleaned dataset …
Timoeller Dec 7, 2020
c2a0448
Implement reviewer suggestions
brandenchan Dec 7, 2020
49afce7
Fix test
brandenchan Dec 7, 2020
7ccbd40
Small rename
Timoeller Dec 7, 2020
253f8f0
Merge branch 'doc_cls_refactor' into refactor_processor_qa
Timoeller Dec 7, 2020
7090f13
Merge remote-tracking branch 'origin/master' into refactor_processor_qa
Timoeller Dec 7, 2020
26b06b5
Change CI
Timoeller Dec 7, 2020
6eaa5bf
Factor out label creation - not finished
Timoeller Dec 10, 2020
024c899
Separate NQ from SQuAD processing, so that NQ continues to work
Timoeller Dec 11, 2020
c78d663
WIP: Refactor NER
brandenchan Dec 11, 2020
4a2eb37
Isolate label computations into one subfunction of dataset_from_dicts
Timoeller Dec 12, 2020
abfd335
Merge branch 'refactor_processor_qa' of github.com:deepset-ai/FARM in…
Timoeller Dec 12, 2020
9e602c0
Bugfix label creation
Timoeller Dec 12, 2020
3f69bea
Squad working, saving state before removing process_answer fct
Timoeller Dec 12, 2020
53c5e0b
Simplify label creation by unnesting operations
Timoeller Dec 12, 2020
0c3c9d6
remove QAProcessor inheritance, add token strings to processing
Timoeller Dec 12, 2020
4c8021e
Bugfix tokenization - special tokens should only be added when combin…
Timoeller Dec 12, 2020
330c8db
Simplify sample to features qa
Timoeller Dec 13, 2020
db38ced
Remove unused code, move homeless functions into NQProcessor
Timoeller Dec 13, 2020
83d5244
Move conversion functions, add docstrings
Timoeller Dec 13, 2020
1d40623
Use baskets already in tokenization, add docstrings
Timoeller Dec 14, 2020
dd89a3c
Revert "WIP: Refactor NER"
brandenchan Dec 14, 2020
dc54e38
Improve id handling, improve error msgs
Timoeller Dec 14, 2020
4ec59ed
Set fast tokenizers to default
brandenchan Dec 14, 2020
05ebdb0
Neaten logging statement
brandenchan Dec 15, 2020
fa94474
Turn off fast tokenizer tests
brandenchan Dec 15, 2020
ee60467
Turn off slow tokenizers
brandenchan Dec 15, 2020
6cf6fb1
Fix tokenization tests
brandenchan Dec 15, 2020
0b57863
Refactor NER for fast tokenizers (#656)
brandenchan Dec 15, 2020
5d53ce0
Refactor problematic IDs, add problematic ids to QA processing
Timoeller Dec 15, 2020
a3a5f3e
Merge branch 'fix_tests' into refactor_processor_qa
brandenchan Dec 16, 2020
491058a
WIP: Fix NER inference
brandenchan Dec 16, 2020
6d6c5ce
Simplify NER pre and post processing
brandenchan Dec 16, 2020
7c65126
Fix Roberta tokenization test
brandenchan Dec 16, 2020
c790b98
Fix NER tests, only test fast tokenizers
brandenchan Dec 16, 2020
d063df1
Fix processor saving loading test
brandenchan Dec 16, 2020
becd4e2
Fix tokenization tests
brandenchan Dec 16, 2020
4309719
Turn off slow tokenization tests qa
brandenchan Dec 16, 2020
d0a6f36
Refactor InferenceProcessor
brandenchan Dec 17, 2020
dbd973a
Fix tests in test QA: test_save_load and test_inference_dicts
Timoeller Dec 17, 2020
d77d89c
Make all tests work in test_question_answering
Timoeller Dec 17, 2020
ab4e40c
Fix tokenizer test
brandenchan Dec 17, 2020
ddf02b3
Merge branch 'refactor_processor_qa' of https://github.com/deepset-ai…
brandenchan Dec 17, 2020
3095b42
Simplify tokenization test logging
brandenchan Dec 17, 2020
fceaf65
Make s3pooling work by adding slow tokenizer mode for Inferenceproces…
Timoeller Dec 17, 2020
ecec503
Merge branch 'refactor_processor_qa' of github.com:deepset-ai/FARM in…
Timoeller Dec 17, 2020
ca7372a
Disable NQ tests
Timoeller Dec 17, 2020
909ea9d
Fix onnx conversion test
Timoeller Dec 17, 2020
691dddb
Add assert for parameter checks in data validation, change num cpus, …
Timoeller Dec 18, 2020
803b41b
Refactoring Processor for LM Finetuning (FastTokenizers) (#659)
tholor Dec 21, 2020
91bea62
fix streaming data silo for new signature of dataset_from_dicts
tholor Dec 21, 2020
bd12771
Fix doc format
Timoeller Dec 22, 2020
04cabf6
Merge branch 'refactor_processor_qa' of github.com:deepset-ai/FARM in…
Timoeller Dec 22, 2020
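
Before the file-by-file changes: the user-facing upshot of these commits is that Tokenizer.load now hands back a Rust-backed HuggingFace "fast" tokenizer by default (commit 4ec59ed, "Set fast tokenizers to default"), with the slow pure-Python path kept as an opt-out. A minimal sketch of both modes; the exact keyword defaults are inferred from the commits and diffs below, not quoted verbatim from the API:

    from farm.modeling.tokenization import Tokenizer

    # Fast (Rust-backed) tokenizer is now the default
    tokenizer = Tokenizer.load(
        pretrained_model_name_or_path="bert-base-cased",
        do_lower_case=False,
    )

    # The slow, pure-Python tokenizer remains available for code paths that
    # still need it, e.g. the Natural Questions example below
    slow_tokenizer = Tokenizer.load(
        pretrained_model_name_or_path="bert-base-cased",
        do_lower_case=False,
        use_fast=False,
    )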
3 changes: 1 addition & 2 deletions azure-pipelines.yml
@@ -10,8 +10,7 @@ trigger:
 pr:
   branches:
     include:
-      - '*'
-
+      - '*'
 jobs:
 - job: 'Test'
   pool:
15 changes: 8 additions & 7 deletions examples/lm_finetuning.py
@@ -19,22 +19,22 @@ def lm_finetuning():
         datefmt="%m/%d/%Y %H:%M:%S",
         level=logging.INFO,
     )
-
+    next_sent_pred_style = "bert-style"
+    next_sent_pred=True
     set_all_seeds(seed=42)
     ml_logger = MLFlowLogger(tracking_uri="https://public-mlflow.deepset.ai/")
     ml_logger.init_experiment(
-        experiment_name="Public_FARM", run_name="Run_minimal_example_lm"
+        experiment_name="LM_refactoring", run_name=f"new, nsp: {next_sent_pred}, {next_sent_pred_style}"
     )
     ##########################
     ########## Settings
     ##########################
-    device, n_gpu = initialize_device_settings(use_cuda=False)
+    device, n_gpu = initialize_device_settings(use_cuda=True)
     n_epochs = 1
     batch_size = 32
-    evaluate_every = 30
+    evaluate_every = 1000
     lang_model = "bert-base-cased"
     do_lower_case = False
-    next_sent_pred_style = "bert-style"

     # 1. Create a tokenizer
     tokenizer = Tokenizer.load(
@@ -46,7 +46,7 @@ def lm_finetuning():
         data_dir=Path("../data/lm_finetune_nips"),
         tokenizer=tokenizer,
         max_seq_len=128,
-        max_docs=20,  # We have set max_docs to 20 to speed up data processing
+        max_docs=None,  # You can set max_docs here to limit the number of docs in the dataset and speed up this example
         next_sent_pred_style=next_sent_pred_style
     )
@@ -74,7 +74,7 @@ def lm_finetuning():
         learning_rate=2e-5,
         device=device,
         n_batches=len(data_silo.loaders["train"]),
-        n_epochs=n_epochs,
+        n_epochs=n_epochs
     )

     # 6. Feed everything to the Trainer, which takes care of growing our model into a powerful plant and evaluates it from time to time
@@ -87,6 +87,7 @@ def lm_finetuning():
         lr_schedule=lr_schedule,
         evaluate_every=evaluate_every,
         device=device,
+        eval_report=False
     )

     # 7. Let it grow! Watch the tracked metrics live on the public mlflow server: https://public-mlflow.deepset.ai
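
For context on the new variables in this example: next_sent_pred toggles the next-sentence-prediction head, and next_sent_pred_style selects how sentence pairs are sampled ("bert-style" pairs a sentence with its true successor or a random one, as in original BERT pretraining). A hedged sketch of the processor call the refactored example builds; BertStyleLMProcessor is FARM's LM-finetuning processor, but treat the exact parameter set as an assumption read off the diff above rather than a verbatim signature:

    from pathlib import Path

    from farm.data_handler.processor import BertStyleLMProcessor

    processor = BertStyleLMProcessor(
        data_dir=Path("../data/lm_finetune_nips"),
        tokenizer=tokenizer,                # result of Tokenizer.load above
        max_seq_len=128,
        max_docs=None,                      # e.g. 20 to subsample for a quick run
        next_sent_pred_style="bert-style",  # sample NSP pairs as in original BERT
    )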
2 changes: 1 addition & 1 deletion examples/natural_questions.py
@@ -42,7 +42,7 @@ def question_answering():

     # 1. Create a tokenizer
     tokenizer = Tokenizer.load(
-        pretrained_model_name_or_path=lang_model, do_lower_case=do_lower_case
+        pretrained_model_name_or_path=lang_model, do_lower_case=do_lower_case, use_fast=False,
     )

     # Add HTML tag tokens to the tokenizer vocabulary, so they do not get split apart
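
Why use_fast=False here: the commits above split Natural Questions processing off from SQuAD (024c899, "Separate NQ from SQuAD processing, so that NQ continues to work") and disable the NQ tests for fast tokenizers (ca7372a), so this example pins the slow tokenizer. A sketch under those assumptions; the model name and the concrete tag list are illustrative, not taken from the example:

    from farm.modeling.tokenization import Tokenizer

    tokenizer = Tokenizer.load(
        pretrained_model_name_or_path="bert-base-uncased",  # illustrative model
        do_lower_case=True,
        use_fast=False,  # NQ still runs through the slow tokenizer path
    )
    # Register HTML markup as atomic tokens so it is not split into subwords
    tokenizer.add_tokens(["<P>", "</P>", "<Li>", "</Li>"])  # illustrative tag set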
13 changes: 6 additions & 7 deletions farm/data_handler/data_silo.py
@@ -129,7 +129,7 @@ def _dataset_from_chunk(cls, chunk, processor):
         """
         dicts = [d[1] for d in chunk]
         indices = [x[0] for x in chunk]
-        dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts, indices=indices, return_problematic=True)
+        dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts, indices=indices)
         return dataset, tensor_names, problematic_sample_ids

     def _get_dataset(self, filename, dicts=None):
@@ -176,6 +176,7 @@ def _get_dataset(self, filename, dicts=None):
         results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, num_dicts))

         datasets = []
+        problematic_ids_all = set()

         desc = f"Preprocessing Dataset"
         if filename:
@@ -185,8 +186,9 @@
             datasets.append(dataset)
             # update progress bar (last step can have less dicts than actual chunk_size)
             pbar.update(min(multiprocessing_chunk_size, pbar.total-pbar.n))
-            self.processor.problematic_sample_ids.update(problematic_samples)
-            self.processor.log_problematic()
+            problematic_ids_all.update(problematic_samples)
+
+        self.processor.log_problematic(problematic_ids_all)
         # _dataset_from_chunk can return a None in cases where downsampling has occurred
         datasets = [d for d in datasets if d]
         concat_datasets = ConcatDataset(datasets)
@@ -221,7 +223,6 @@ def _load_data(self, train_dicts=None, dev_dicts=None, test_dicts=None):
         else:
             logger.info("No train set is being loaded")
             self.data["train"] = None
-            self.processor.log_problematic()

         # dev data
         logger.info("")
@@ -243,7 +244,6 @@
         else:
             logger.info("No dev set is being loaded")
             self.data["dev"] = None
-            self.processor.log_problematic()

         logger.info("")
         logger.info("LOADING TEST DATA")
@@ -264,7 +264,6 @@
         else:
             logger.info("No test set is being loaded")
             self.data["test"] = None
-            self.processor.log_problematic()

         if self.caching:
             self._save_dataset_to_cache()
@@ -724,7 +723,7 @@ def _dataset_from_chunk(self, chunk):
             logger.info("Skipping a dict chunk as it contains less than 2 documents ...")
             return None, None
         indices = [x[0] for x in chunk]
-        datasets, tensor_names = self.processor.dataset_from_dicts(dicts=dicts, indices=indices)
+        datasets, tensor_names, _ = self.processor.dataset_from_dicts(dicts=dicts, indices=indices)
         return datasets, tensor_names

     def shuffle_files(self, files, seed=None):
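
The diff above settles on one contract: dataset_from_dicts always returns the triple (dataset, tensor_names, problematic_sample_ids), and callers that don't care about problematic samples simply discard the third element. Problematic ids are now aggregated across multiprocessing chunks and logged once, instead of after every chunk. A simplified stand-in for the DataSilo loop, assuming a processor instance and a list of (index, dict) chunks are in scope:

    problematic_ids_all = set()
    datasets = []
    for chunk in chunks:  # each chunk is a list of (index, sample_dict) pairs
        dicts = [d[1] for d in chunk]
        indices = [d[0] for d in chunk]
        dataset, tensor_names, problematic = processor.dataset_from_dicts(
            dicts=dicts, indices=indices
        )
        datasets.append(dataset)
        problematic_ids_all.update(problematic)

    # Report offending sample ids once, after all chunks have been processed
    processor.log_problematic(problematic_ids_all)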