Data Out: Moving text from Hub to NLP models #492
-
The goal is to design APIs to allow popular NLP models (such as transformers) to easily consume text data from Hub. It should improve the current text processing pipeline, which looks something like this:

```python
# an existing repo of text data
ds = Dataset("mynameisvinn/preprocessing")

# fetch from Hub and return a generator X
X = (ds['tweet', i].compute() for i in range(3400))

# return another generator, this time with text data processed according to a user-defined function
X_clean = (process(x) for x in X)

# optional step to determine maximum length of sentences in the corpus

# return another generator, this time with text converted to tokens
X_tokens = (tokenize(x) for x in X_clean)

# finally, create a pytorch dataloader
train_loader = DataLoader(X_tokens)
```
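For context, `process` and `tokenize` in the pipeline above are user-defined; a minimal sketch of what they might look like (the URL-stripping cleanup and the `bert-base-cased` tokenizer are illustrative assumptions, not part of the proposal):

```python
import re
from transformers import AutoTokenizer

# hypothetical helpers assumed by the pipeline above
_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def process(text):
    # user-defined cleanup, e.g. strip URLs and lowercase
    return re.sub(r"https?://\S+", "", text).lower().strip()

def tokenize(text):
    # convert cleaned text to a list of token ids
    return _tokenizer(text, add_special_tokens=False)["input_ids"]
```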
I see two possible design patterns.

The first will be familiar to Hub users:

```python
ds = Dataset("activeloop/datadatadata", ...)
ds = ds.to_pytorch(transform=tokenize)  # pulls data and applies the necessary preprocessing
ds = torch.utils.data.DataLoader(ds, ...)
```

This API approach would presumably subclass `torch.utils.data.Dataset`, so the resulting object can be handed straight to a `DataLoader`.

The second will be familiar to Hugging Face users:

```python
from datasets import load_dataset
squad_dataset = load_dataset('squad')  # specify dataset name
```

The latter has the appeal of simplicity and familiarity for downstream users, who are ultimately the ones who matter. The former has the flexibility to deal with different models from different frameworks (not just those working with Hugging Face transformers).
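To make the first pattern concrete, here is a minimal sketch of the kind of object `to_pytorch` might return. `HubTextDataset`, its constructor arguments, and the `transform` hook are hypothetical names for illustration, not Hub's actual implementation:

```python
import torch
from torch.utils.data import Dataset as TorchDataset, DataLoader

class HubTextDataset(TorchDataset):
    """Hypothetical wrapper that makes a Hub dataset look like a torch Dataset."""

    def __init__(self, hub_ds, column, transform=None):
        self.hub_ds = hub_ds        # the underlying Hub dataset
        self.column = column        # which text column to read, e.g. 'tweet'
        self.transform = transform  # optional preprocessing, e.g. a tokenizer

    def __len__(self):
        return len(self.hub_ds)

    def __getitem__(self, i):
        x = self.hub_ds[self.column, i].compute()  # lazy fetch from Hub
        return self.transform(x) if self.transform else x

# usage: to_pytorch(transform=tokenize) could return such an object,
# which DataLoader then batches as usual:
# loader = DataLoader(HubTextDataset(ds, 'tweet', transform=tokenize), batch_size=32)
```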
Who should be responsible for applying preprocessing (e.g. tokenizing, generating attention masks): the user or Hub? And where should this step occur?
-
Maybe Hub is a good option to do preprocessing on behalf of the user.
-
Another approach (as suggested by @AbhinavTuli) is to convert text to tokens during assignment. This happens if the user passes a tokenizer while instantiating a `Dataset`:

```python
schema = {'sentence': Text(shape=(None,), max_shape=(500,))}  # we still rely on the Text schema
ds = hub.Dataset(tag, shape=(10,), schema=schema, mode="w", tokenizer=some_tokenizer)  # user specifies tokenizer

for i, sentence in enumerate(sentences):
    ds['tokens', i] = sentence  # words are converted to tokens during assignment
```

In this case, the actual tokenization occurs before data is pushed into a Dataset, not after it is pulled out. The user-provided tokenizer is then invoked by `str_to_int`:

```python
def str_to_int(assign_value, tokenizer):
    ...
    if tokenizer is not None:
        ...
        tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")
        assign_value = (
            np.array(tokenizer(assign_value, add_special_tokens=False)["input_ids"])
            if isinstance(assign_value, str)
            else assign_value
        )
```
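As a standalone, runnable illustration of the same idea (using a real transformers tokenizer; the Dataset plumbing around it, and the helper name `to_tokens`, are assumptions for the sketch):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def to_tokens(assign_value, tokenizer):
    # mirror str_to_int's core logic: strings become arrays of token ids,
    # everything else passes through unchanged
    if isinstance(assign_value, str):
        return np.array(tokenizer(assign_value, add_special_tokens=False)["input_ids"])
    return assign_value

tokens = to_tokens("Moving text from Hub to NLP models", tokenizer)
print(tokens)  # a 1-D array of vocabulary ids
```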