
Update data.py #396

Merged
merged 2 commits into huggingface:main on Jul 25, 2023
Conversation

grofte (Contributor) commented Jul 18, 2023

If a column has been defined as a ClassLabel then sample_dataset strips that information away and you lose .names.

Test code

```python
from datasets import load_dataset
import datasets
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer, sample_dataset


# Load a dataset from the Hugging Face Hub
dataset: datasets.DatasetDict = load_dataset("SetFit/sst5")
dataset = dataset.class_encode_column("label_text")

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset: datasets.Dataset = sample_dataset(dataset["train"], label_column="label_text", num_samples=8)
eval_dataset: datasets.Dataset = dataset["validation"]

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20, # The number of text pairs to generate for contrastive learning
    num_epochs=1, # The number of epochs to use for contrastive learning
    column_mapping={"text": "text", "label_text": "label"} # Map dataset columns to text/label expected by trainer
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()

# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])

print(preds)
print(list(map(lambda x: train_dataset.features["label_text"].names[x], preds)))
```

I'm sure there's a better way of mapping predictions back but I am not that familiar with HuggingFace or even PyTorch.
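
For what it's worth, `datasets.ClassLabel` also provides `int2str`, so once the feature survives sampling the mapping back could be written like this (a sketch; it assumes `preds` holds integer class ids):

```python
label_feature = train_dataset.features["label_text"]
# int2str maps integer class ids back to their original string names.
print(label_feature.int2str([int(p) for p in preds]))
```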

You can run the test code without the changes and see that `train_dataset.features["label_text"]` is a `Value` class and doesn't have the `.names` attribute.
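
To see the issue concretely, inspecting the feature before and after sampling on the unpatched version prints roughly the following (the exact `names` come from `class_encode_column`):

```python
print(dataset["train"].features["label_text"])
# ClassLabel(names=[...]) -- carries the .names attribute
print(train_dataset.features["label_text"])
# Value(dtype='int64') with the unpatched sample_dataset -- no .names
```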

grofte (Contributor, Author) commented Jul 19, 2023

To be honest, it might have been better to copy all the `.features` before converting to pandas and then reapply them once the transformations are done. It would look cleaner too and be more future-proof.

grofte (Contributor, Author) commented Jul 19, 2023

For my own usage I have now wrapped `sample_dataset` like this:

```python
import datasets
from setfit import sample_dataset


def conserved_sampling(dataset: datasets.Dataset, label_column: str, num_samples: int) -> datasets.Dataset:
    """Sample a dataset such that the feature classes are conserved."""
    features = dataset.features.copy()
    dataset_sample = sample_dataset(dataset, label_column=label_column, num_samples=num_samples)
    # Re-cast each column to its original feature type (e.g. ClassLabel) after sampling.
    for feature, feature_type in features.items():
        dataset_sample = dataset_sample.cast_column(feature, feature_type)
    return dataset_sample
```

You could use this inside `sample_dataset` instead of what I wrote yesterday.
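
For reference, a hypothetical usage of that wrapper, reusing the sst5 setup from the test code above:

```python
train_dataset = conserved_sampling(dataset["train"], label_column="label_text", num_samples=8)
# The ClassLabel feature is preserved, so .names is still available after sampling.
print(train_dataset.features["label_text"].names)
```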

tomaarsen added the `enhancement` (New feature or request) label on Jul 25, 2023
tomaarsen (Member) commented

Hello!

I think this is a really great idea, and definitely an oversight in the original implementation. I've discovered that `from_pandas` accepts an optional `features` parameter, so we can use that with the original `dataset.features`.

I've locally prepared this, and it's looking good. I'll push it to this PR if that's okay with you - that way you still get credited for the eventual merged commit, well deserved for spotting this oversight.
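
As a rough sketch of that idea (not the exact merged code; the sampling logic and `random_state` here are simplified placeholders):

```python
import datasets


def sample_dataset_sketch(dataset: datasets.Dataset, label_column: str, num_samples: int) -> datasets.Dataset:
    """Illustration only: round-trip through pandas while keeping the original features."""
    df = dataset.to_pandas()
    # Sample up to num_samples rows per class.
    df = df.groupby(label_column, group_keys=False).apply(
        lambda group: group.sample(min(num_samples, len(group)), random_state=42)
    )
    df = df.reset_index(drop=True)
    # Passing the original features to from_pandas keeps ClassLabel columns intact.
    return datasets.Dataset.from_pandas(df, features=dataset.features)
```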

  • Tom Aarsen

tomaarsen merged commit 41ad3a2 into huggingface:main on Jul 25, 2023
grofte (Contributor, Author) commented Jul 26, 2023

Aw, I totally forgot to write a test for this. And what you did looks much better, Tom - thanks!
