From e8be74b88030ca4b2f1643c3a66eba6d21fe654e Mon Sep 17 00:00:00 2001
From: Myhs-phz
Date: Wed, 15 Jan 2025 14:31:20 +0000
Subject: [PATCH 1/3] create new branch

---
 docs/zh_cn/advanced_guides/new_dataset.md | 28 ++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/docs/zh_cn/advanced_guides/new_dataset.md b/docs/zh_cn/advanced_guides/new_dataset.md
index 0ce7d9e26..0f06c6cf8 100644
--- a/docs/zh_cn/advanced_guides/new_dataset.md
+++ b/docs/zh_cn/advanced_guides/new_dataset.md
@@ -54,5 +54,31 @@
            eval_cfg=mydataset_eval_cfg)
    ]
    ```
 
+
+   - To make your dataset easier for other users to obtain, specify in the configuration file where the dataset can be downloaded. First, fill in the `path` field of the `mydataset_datasets` configuration with a dataset name of your choosing. For example:
+
+   ```python
+   mmlu_datasets = [
+       dict(
+           abbr=f'lukaemon_mmlu_{_name}',
+           type=MMLUDataset,
+           path='opencompass/mmlu',
+           ...,
+       )
+   ]
+   ```
+
+   - Next, create a dictionary entry with the same name in `opencompass/utils/datasets_info.py`. If you have already hosted the dataset on HuggingFace or ModelScope, add an entry with that name to the `DATASETS_MAPPING` dictionary and fill in the ModelScope and HuggingFace dataset addresses under `ms_id` and `hf_id`, respectively; you may also specify a default `local` address. For example:
+
+   ```python
+   "opencompass/mmlu": {
+       "ms_id": "opencompass/mmlu",
+       "hf_id": "opencompass/mmlu",
+       "local": "./data/mmlu/",
+   }
+   ```
+
+   - If you want other users to be able to fetch your dataset directly from the official OpenCompass OSS repository, submit the dataset files to us during the Pull Request stage. We will transfer the dataset to OSS on your behalf and create a new entry in `DATASET_URL`.
+
-   For detailed dataset configuration files and other required configuration files, see the [Configuration Files](../user_guides/config.md) tutorial; for launching tasks, see the [Quick Start](../get_started/quick_start.md) tutorial.
+   For detailed dataset configuration files and other required configuration files, see the [Configuration Files](../user_guides/config.md) tutorial; for launching tasks, see the [Quick Start](../get_started/quick_start.md) tutorial.

From a5a030122f5d202e22b2c5734f9c495a2dbb47e1 Mon Sep 17 00:00:00 2001
From: Myhs-phz
Date: Thu, 16 Jan 2025 09:25:51 +0000
Subject: [PATCH 2/3] docs new_dataset.md zh

---
 docs/zh_cn/advanced_guides/new_dataset.md | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/docs/zh_cn/advanced_guides/new_dataset.md b/docs/zh_cn/advanced_guides/new_dataset.md
index 0f06c6cf8..5b87e193e 100644
--- a/docs/zh_cn/advanced_guides/new_dataset.md
+++ b/docs/zh_cn/advanced_guides/new_dataset.md
@@ -55,13 +55,12 @@
    ]
    ```
 
-   - To make your dataset easier for other users to obtain, specify in the configuration file where the dataset can be downloaded. First, fill in the `path` field of the `mydataset_datasets` configuration with a dataset name of your choosing. For example:
+   - To make your dataset easier for other users to obtain, specify in the configuration file where the dataset can be downloaded. First, fill in the `path` field of the `mydataset_datasets` configuration with a dataset name of your choosing; this name is mapped to the actual download path in `opencompass/utils/datasets_info.py`. For example:
 
    ```python
    mmlu_datasets = [
        dict(
-           abbr=f'lukaemon_mmlu_{_name}',
-           type=MMLUDataset,
+           ...,
            path='opencompass/mmlu',
            ...,
        )
@@ -78,6 +77,20 @@
    }
    ```
 
+   - To keep the data source selectable, extend the `load` method in the dataset script `mydataset.py` according to the kinds of download paths your dataset provides. Specifically, implement switching between download sources based on the value of the environment variable `DATASET_SOURCE`. Here is an example from `opencompass/dataset/cmmlu.py`:
+
+   ```python
+   def load(path: str, name: str, **kwargs):
+       ...
+       if environ.get('DATASET_SOURCE') == 'ModelScope':
+           ...
+       else:
+           ...
+       return dataset
+   ```
+
+
+
    - If you want other users to be able to fetch your dataset directly from the official OpenCompass OSS repository, submit the dataset files to us during the Pull Request stage. We will transfer the dataset to OSS on your behalf and create a new entry in `DATASET_URL`.

From f361bb0d14fe4a7163e2316dc02cd50136b71b2d Mon Sep 17 00:00:00 2001
From: Myhs-phz
Date: Thu, 16 Jan 2025 10:37:16 +0000
Subject: [PATCH 3/3] docs new_dataset.md zh and en

---
 docs/en/advanced_guides/new_dataset.md    | 36 +++++++++++++++++++++++
 docs/zh_cn/advanced_guides/new_dataset.md | 11 ++++---
 2 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/docs/en/advanced_guides/new_dataset.md b/docs/en/advanced_guides/new_dataset.md
index 271a89a25..65184dcb3 100644
--- a/docs/en/advanced_guides/new_dataset.md
+++ b/docs/en/advanced_guides/new_dataset.md
@@ -53,5 +53,41 @@ Although OpenCompass has already included most commonly used datasets, users nee
            eval_cfg=mydataset_eval_cfg)
    ]
    ```
+
+   - To make your dataset easier for other users to obtain, specify in the configuration file where the dataset can be downloaded. First, fill in the `path` field of the `mydataset_datasets` configuration with a dataset name of your choosing; this name is mapped to the actual download path in `opencompass/utils/datasets_info.py`. Here's an example:
+
+   ```python
+   mmlu_datasets = [
+       dict(
+           ...,
+           path='opencompass/mmlu',
+           ...,
+       )
+   ]
+   ```
+   - Next, create a dictionary key in `opencompass/utils/datasets_info.py` with the same name as the one you provided above. If you have already hosted the dataset on HuggingFace or ModelScope, add that key to the `DATASETS_MAPPING` dictionary and fill in the HuggingFace or ModelScope dataset address under `hf_id` or `ms_id`, respectively. You can also specify a default `local` address. Here's an example:
+
+   ```python
+   "opencompass/mmlu": {
+       "ms_id": "opencompass/mmlu",
+       "hf_id": "opencompass/mmlu",
+       "local": "./data/mmlu/",
+   }
+   ```
+
+   - If you want other users to be able to fetch your dataset directly from the OpenCompass OSS repository, submit the dataset files during the Pull Request phase. We will then transfer the dataset to OSS on your behalf and create a new dictionary key in `DATASET_URL`.
+
+   - To keep the data source selectable, extend the `load` method in the dataset script `mydataset.py`. Specifically, implement switching between download sources based on the value of the environment variable `DATASET_SOURCE`. Note that if `DATASET_SOURCE` is not set, the dataset is downloaded from the OSS repository by default. Here's an example from `opencompass/dataset/cmmlu.py`:
+
+   ```python
+   def load(path: str, name: str, **kwargs):
+       ...
+       if environ.get('DATASET_SOURCE') == 'ModelScope':
+           ...
+       else:
+           ...
+       return dataset
+   ```
+
 
    Detailed dataset configuration files and other required configuration files can be referred to in the [Configuration Files](../user_guides/config.md) tutorial. For guides on launching tasks, please refer to the [Quick Start](../get_started/quick_start.md) tutorial.
diff --git a/docs/zh_cn/advanced_guides/new_dataset.md b/docs/zh_cn/advanced_guides/new_dataset.md
index 5b87e193e..8c21f0e22 100644
--- a/docs/zh_cn/advanced_guides/new_dataset.md
+++ b/docs/zh_cn/advanced_guides/new_dataset.md
@@ -77,10 +77,12 @@
    }
    ```
 
-   - To keep the data source selectable, extend the `load` method in the dataset script `mydataset.py` according to the kinds of download paths your dataset provides. Specifically, implement switching between download sources based on the value of the environment variable `DATASET_SOURCE`. Here is an example from `opencompass/dataset/cmmlu.py`:
+   - If you want other users to be able to fetch your dataset directly from the official OpenCompass OSS repository, submit the dataset files to us during the Pull Request stage. We will transfer the dataset to OSS on your behalf and create a new entry in `DATASET_URL`.
+
+   - To keep the data source selectable, extend the `load` method in the dataset script `mydataset.py` according to the kinds of download paths your dataset provides. Specifically, implement switching between download sources based on the value of the environment variable `DATASET_SOURCE`. Note that if `DATASET_SOURCE` is not set, the data is downloaded from the OSS repository by default. Here is an example from `opencompass/dataset/cmmlu.py`:
 
    ```python
-   def load(path: str, name: str, **kwargs):
+   def load(path: str, name: str, **kwargs):
        ...
        if environ.get('DATASET_SOURCE') == 'ModelScope':
            ...
@@ -88,10 +90,7 @@
            ...
        return dataset
    ```
-
-
-
-   - If you want other users to be able to fetch your dataset directly from the official OpenCompass OSS repository, submit the dataset files to us during the Pull Request stage. We will transfer the dataset to OSS on your behalf and create a new entry in `DATASET_URL`.
+
 
    For detailed dataset configuration files and other required configuration files, see the [Configuration Files](../user_guides/config.md) tutorial; for launching tasks, see the [Quick Start](../get_started/quick_start.md) tutorial.
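
The source switching that the `load` hunks describe can be sketched in isolation. What follows is a minimal, standard-library-only illustration, not the code in `opencompass/dataset/cmmlu.py`: `resolve_source` is a hypothetical helper, the `DATASETS_MAPPING` literal simply copies the entry shown in the patch, and treating every value other than `'ModelScope'` as "use the `local` path" is an assumption made for the example.

```python
from os import environ
from typing import Tuple

# Copied from the example entry in the patch; a real mapping would live in
# opencompass/utils/datasets_info.py.
DATASETS_MAPPING = {
    "opencompass/mmlu": {
        "ms_id": "opencompass/mmlu",
        "hf_id": "opencompass/mmlu",
        "local": "./data/mmlu/",
    },
}


def resolve_source(path: str) -> Tuple[str, str]:
    """Hypothetical helper: pick a data source for a dataset name.

    Mirrors the two branches of the `load` example in the patch:
    'ModelScope' selects the ModelScope ID, anything else (including an
    unset variable) falls back to the local path. The fallback choice is
    an assumption for this sketch.
    """
    entry = DATASETS_MAPPING[path]
    if environ.get("DATASET_SOURCE") == "ModelScope":
        return "ModelScope", entry["ms_id"]
    return "local", entry["local"]


if __name__ == "__main__":
    # The printed result depends on whether DATASET_SOURCE is set in the shell.
    print(resolve_source("opencompass/mmlu"))
```

Keeping the lookup separate from the actual download call means a dataset's `load` method only has to branch on the returned source label, which is the shape the patch's `if ... else` skeleton suggests.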