Customizable tokenizer for RULER #1731

Merged on Dec 19, 2024 (2 commits)

@changlan (Contributor) commented Dec 3, 2024

This PR adds an optional environment variable, TOKENIZER_MODEL, which controls the tokenizer model used for RULER data generation. With this option, the generated dataset lengths will be more accurate when evaluating models that do not use the GPT-4 tokenizer.
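
For readers unfamiliar with the pattern, here is a minimal sketch of how an optional environment variable like this could be read with a fallback. It is an illustration only, not the code from this PR: the helper name `resolve_tokenizer_model` is hypothetical, and the "gpt-4" default is an assumption based on the description above.

```python
import os

# Assumption: "gpt-4" is used as the fallback because the description implies
# the GPT-4 tokenizer is what RULER data generation uses when nothing is set.
DEFAULT_TOKENIZER_MODEL = "gpt-4"


def resolve_tokenizer_model(env=None):
    """Return the tokenizer model name, preferring TOKENIZER_MODEL if set.

    `env` is any mapping of environment variables; it defaults to os.environ
    and is a parameter only so the helper is easy to test.
    """
    env = os.environ if env is None else env
    # An empty value is treated as unset, so `TOKENIZER_MODEL=` falls back too.
    return env.get("TOKENIZER_MODEL") or DEFAULT_TOKENIZER_MODEL
```

With a helper along these lines, a dataset config could pick up the tokenizer name at import time, so the variable only needs to be exported in the shell before the evaluation is launched.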

@MaiziXiao (Collaborator) commented

https://github.com/open-compass/opencompass/blob/main/configs/eval_ruler.py
We already provide a way to use a model's own tokenizer to build model-specific datasets; please have a look at this config.

On the other hand, the **_gen.py configurations are the standard configurations for general evaluations. You are of course welcome to try your own configurations.

@changlan (Contributor, Author) commented Dec 4, 2024

Thanks for the review. We generally use opencompass via the CLI: opencompass --models [custom_model_config] --datasets ruler_4k_gen.py ... However, it does not seem possible to specify a tokenizer for the configs passed to --datasets. Do you think this is a reasonable use case?

@MaiziXiao (Collaborator) commented

That sounds like a reasonable use case.

@MaiziXiao (Collaborator) left a comment

LGTM

@MaiziXiao MaiziXiao merged commit d70100c into open-compass:main Dec 19, 2024
8 checks passed