
Update data.py #396

Merged
merged 2 commits into huggingface:main on Jul 25, 2023
Conversation

grofte (Contributor) commented Jul 18, 2023

If a column has been defined as a ClassLabel then sample_dataset strips that information away and you lose .names.

Test code

```python
from datasets import load_dataset
import datasets
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer, sample_dataset


# Load a dataset from the Hugging Face Hub
dataset: datasets.DatasetDict = load_dataset("SetFit/sst5")
dataset = dataset.class_encode_column("label_text")

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset: datasets.Dataset = sample_dataset(dataset["train"], label_column="label_text", num_samples=8)
eval_dataset: datasets.Dataset = dataset["validation"]

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20, # The number of text pairs to generate for contrastive learning
    num_epochs=1, # The number of epochs to use for contrastive learning
    column_mapping={"text": "text", "label_text": "label"} # Map dataset columns to text/label expected by trainer
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()

# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])

print(preds)
print(list(map(lambda x: train_dataset.features["label_text"].names[x], preds)))
```

I'm sure there's a better way of mapping predictions back but I am not that familiar with HuggingFace or even PyTorch.
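
For what it's worth, `datasets.ClassLabel` also provides `int2str`, so once the feature survives sampling the mapping back could be written like this (a sketch; it assumes `preds` holds integer class ids):

```python
label_feature = train_dataset.features["label_text"]
# int2str maps integer class ids back to their original string names.
print(label_feature.int2str([int(p) for p in preds]))
```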

You can run the test code without the changes and see that `train_dataset.features["label_text"]` is a `Value` class and doesn't have the `.names` attribute.
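
To see the issue concretely, inspecting the feature before and after sampling on the unpatched version prints roughly the following (the exact `names` come from `class_encode_column`):

```python
print(dataset["train"].features["label_text"])
# ClassLabel(names=[...]) -- carries the .names attribute
print(train_dataset.features["label_text"])
# Value(dtype='int64') with the unpatched sample_dataset -- no .names
```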

grofte (Contributor, Author) commented Jul 19, 2023

To be honest, it might have been better to copy all the `.features` before converting to pandas and then reapply them once the transformations are done. It would look cleaner too and be more future-proof.

grofte (Contributor, Author) commented Jul 19, 2023

For my own usage I have now wrapped `sample_dataset` like this:

```python
import datasets
from setfit import sample_dataset


def conserved_sampling(dataset: datasets.Dataset, label_column: str, num_samples: int) -> datasets.Dataset:
    """Sample a dataset such that the feature classes are conserved."""
    features = dataset.features.copy()
    dataset_sample = sample_dataset(dataset, label_column=label_column, num_samples=num_samples)
    # Re-cast each column to its original feature type (e.g. ClassLabel) after sampling.
    for feature, feature_type in features.items():
        dataset_sample = dataset_sample.cast_column(feature, feature_type)
    return dataset_sample
```

You could use this inside `sample_dataset` instead of what I wrote yesterday.
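
For reference, a hypothetical usage of that wrapper, reusing the sst5 setup from the test code above:

```python
train_dataset = conserved_sampling(dataset["train"], label_column="label_text", num_samples=8)
# The ClassLabel feature is preserved, so .names is still available after sampling.
print(train_dataset.features["label_text"].names)
```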

tomaarsen added the `enhancement` (New feature or request) label on Jul 25, 2023
tomaarsen (Member) commented

Hello!

I think this is a really great idea, and definitely an oversight in the original implementation. I've discovered that `from_pandas` accepts an optional `features` parameter, so we can use that with the original `dataset.features`.

I've locally prepared this, and it's looking good. I'll push it to this PR if that's okay with you - that way you still get credited for the eventual merged commit, well deserved for spotting this oversight.
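
As a rough sketch of that idea (not the exact merged code; the sampling logic and `random_state` here are simplified placeholders):

```python
import datasets


def sample_dataset_sketch(dataset: datasets.Dataset, label_column: str, num_samples: int) -> datasets.Dataset:
    """Illustration only: round-trip through pandas while keeping the original features."""
    df = dataset.to_pandas()
    # Sample up to num_samples rows per class.
    df = df.groupby(label_column, group_keys=False).apply(
        lambda group: group.sample(min(num_samples, len(group)), random_state=42)
    )
    df = df.reset_index(drop=True)
    # Passing the original features to from_pandas keeps ClassLabel columns intact.
    return datasets.Dataset.from_pandas(df, features=dataset.features)
```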

  • Tom Aarsen

tomaarsen merged commit 41ad3a2 into huggingface:main on Jul 25, 2023
grofte (Contributor, Author) commented Jul 26, 2023

Aw, I totally forgot to write a test for this. And what you did looks much better, Tom - thanks!
