Data Out: Moving text from Hub to NLP models #492
-
The goal is to design APIs to allow popular NLP models (such as transformers) to easily consume text data from Hub. It should improve the current text processing pipeline, which looks something like this:

```python
# an existing repo of text data
ds = Dataset("mynameisvinn/preprocessing")

# fetch from Hub and return a generator X
X = (ds['tweet', i].compute() for i in range(3400))

# return another generator, this time with text data processed according to a user-defined function
X_clean = (process(x) for x in X)

# optional step to determine maximum length of sentences in the corpus

# return another generator, this time with text converted to tokens
X_tokens = (tokenize(x) for x in X_clean)

# finally, create a pytorch dataloader
train_loader = DataLoader(X_tokens)
```
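For context, `process` and `tokenize` in the pipeline above are user-defined; a minimal sketch of what they might look like (the URL-stripping cleanup and the `bert-base-cased` tokenizer are illustrative assumptions, not part of the proposal):

```python
import re
from transformers import AutoTokenizer

# hypothetical helpers assumed by the pipeline above
_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def process(text):
    # user-defined cleanup, e.g. strip URLs and lowercase
    return re.sub(r"https?://\S+", "", text).lower().strip()

def tokenize(text):
    # convert cleaned text to a list of token ids
    return _tokenizer(text, add_special_tokens=False)["input_ids"]
```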
I see two possible design patterns.

The first will be familiar to Hub users:

```python
ds = Dataset("activeloop/datadatadata", ...)
ds = ds.to_pytorch(transform=tokenize)  # pulls data and applies the necessary preprocessing
ds = torch.utils.data.DataLoader(ds, ...)
```

This API approach would presumably subclass `torch.utils.data.Dataset`, so the resulting object can be handed straight to a `DataLoader`.

The second will be familiar to Hugging Face users:

```python
from datasets import load_dataset
squad_dataset = load_dataset('squad')  # specify dataset name
```

The latter has the appeal of simplicity and familiarity for downstream users, who are ultimately the ones who matter. The former has the flexibility to deal with different models from different frameworks (not just those working with Hugging Face transformers).
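To make the first pattern concrete, here is a minimal sketch of the kind of object `to_pytorch` might return. `HubTextDataset`, its constructor arguments, and the `transform` hook are hypothetical names for illustration, not Hub's actual implementation:

```python
import torch
from torch.utils.data import Dataset as TorchDataset, DataLoader

class HubTextDataset(TorchDataset):
    """Hypothetical wrapper that makes a Hub dataset look like a torch Dataset."""

    def __init__(self, hub_ds, column, transform=None):
        self.hub_ds = hub_ds        # the underlying Hub dataset
        self.column = column        # which text column to read, e.g. 'tweet'
        self.transform = transform  # optional preprocessing, e.g. a tokenizer

    def __len__(self):
        return len(self.hub_ds)

    def __getitem__(self, i):
        x = self.hub_ds[self.column, i].compute()  # lazy fetch from Hub
        return self.transform(x) if self.transform else x

# usage: to_pytorch(transform=tokenize) could return such an object,
# which DataLoader then batches as usual:
# loader = DataLoader(HubTextDataset(ds, 'tweet', transform=tokenize), batch_size=32)
```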
Who should be responsible for applying preprocessing (e.g. tokenizing, generating attention masks): the user or Hub? And where should this step occur?
-
Maybe Hub is a good option to do preprocessing on behalf of the user.
-
Another approach (as suggested by @AbhinavTuli) is to convert text to tokens during assignment. This happens if the user passes a tokenizer while instantiating a `Dataset`:

```python
schema = {'sentence': Text(shape=(None,), max_shape=(500,))}  # we still rely on the Text schema
ds = hub.Dataset(tag, shape=(10,), schema=schema, mode="w", tokenizer=some_tokenizer)  # user specifies tokenizer

for i, sentence in enumerate(sentences):
    ds['tokens', i] = sentence  # words are converted to tokens during assignment
```

In this case, the actual tokenization occurs before data is pushed into a Dataset, not after it is pulled out. The user-provided tokenizer is then invoked by `str_to_int`:

```python
def str_to_int(assign_value, tokenizer):
    ...
    if tokenizer is not None:
        ...
        tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")
        assign_value = (
            np.array(tokenizer(assign_value, add_special_tokens=False)["input_ids"])
            if isinstance(assign_value, str)
            else assign_value
        )
```
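As a standalone, runnable illustration of the same idea (using a real transformers tokenizer; the Dataset plumbing around it, and the helper name `to_tokens`, are assumptions for the sketch):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def to_tokens(assign_value, tokenizer):
    # mirror str_to_int's core logic: strings become arrays of token ids,
    # everything else passes through unchanged
    if isinstance(assign_value, str):
        return np.array(tokenizer(assign_value, add_special_tokens=False)["input_ids"])
    return assign_value

tokens = to_tokens("Moving text from Hub to NLP models", tokenizer)
print(tokens)  # a 1-D array of vocabulary ids
```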