
Introduction

This Fine-Tuning Dataset Creation Toolkit helps you create the JSONL dataset files needed to fine-tune Completions models such as babbage-002 or davinci-002, and Chat Completions models such as gpt-35-turbo-0613, from XLSX and CSV files. The resulting JSONL files can be used to fine-tune models on OpenAI or Azure OpenAI.
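
The two model families use different JSONL record formats (these match the formats documented by OpenAI; the example values are illustrative). For Completions models, each line pairs a prompt with a completion:

     {"prompt": "What is the capital of France?", "completion": "Paris"}

For Chat Completions models, each line is a complete conversation:

     {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]}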

Prerequisites

  1. Python 3.10+ installed
  2. Ability to create virtual environments (the venv module used below ships with Python 3.10+, so no separate virtualenv package is required)

Initial Setup

  1. Create and activate a virtual environment
  2. Install dependencies
  3. Create a dataset in XLSX or CSV file

1. Create virtual environment

  1. To create a virtual environment, open a terminal in your working directory and execute this command:

     python -m venv .venv

  2. To activate the virtual environment, execute this command in the terminal (shown for Windows; on macOS/Linux, use source .venv/bin/activate):

     ./.venv/Scripts/activate
    

2. Install dependencies

  1. To install the dependencies needed to run this kit, execute this command in the terminal:

     pip install -r requirements.txt
    

3. Create a dataset in XLSX or CSV file

  1. To create a dataset, create an XLSX or CSV file. You can use the XLSX and CSV files inside the Sample folder as a reference (an illustrative layout also follows below).
  2. [IMPORTANT] The XLSX or CSV file must be in your working directory.
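
As a rough illustration (the column names here are hypothetical; check the files inside the Sample folder for the exact layout the formatter scripts expect), a Completions dataset CSV might look like this:

     prompt,completion
     What is the capital of France?,Paris
     What is 2 + 2?,4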

How to use?

  1. Convert XLSX/CSV dataset file to JSONL dataset file
  2. Validate JSONL dataset file
  3. Analyze JSONL dataset file [FOR COMPLETIONS JSONL DATASET ONLY]
  4. Convert JSONL dataset files to XLSX and CSV files [EXTRA]

1. Convert XLSX/CSV to JSONL

  1. Based on the model your JSONL dataset file will be targeting for fine-tuning, there are different scripts you can use.

  2. If you are creating a dataset for Completions models like babbage-002 or davinci-002, use the Completions Dataset Formatter.py script. This is how you should execute the script:

     python 'Completions Dataset Formatter.py' [XLSX/CSV Filename]
    
  3. If you are creating a dataset for Chat Completions models like gpt-35-turbo-0613, use the Chat Completions Dataset Formatter.py script. This is how you should execute the script:

     python 'Chat Completions Dataset Formatter.py' [XLSX/CSV Filename]
    
  4. Both scripts will create a JSONL dataset file in your working directory with the same name as the input XLSX/CSV file; a sketch of the conversion follows below.
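
As a rough sketch of the kind of conversion the Completions formatter performs (assuming pandas with openpyxl and hypothetical prompt/completion column names; this is not the toolkit's actual code):

     # Sketch of a CSV/XLSX -> Completions JSONL conversion.
     # Assumes pandas (plus openpyxl for .xlsx input) and hypothetical
     # "prompt"/"completion" columns; see the Sample folder for the real layout.
     import json
     import sys

     import pandas as pd

     path = sys.argv[1]
     df = pd.read_excel(path) if path.lower().endswith(".xlsx") else pd.read_csv(path)

     out_path = path.rsplit(".", 1)[0] + ".jsonl"
     with open(out_path, "w", encoding="utf-8") as f:
         for _, row in df.iterrows():
             record = {"prompt": str(row["prompt"]), "completion": str(row["completion"])}
             f.write(json.dumps(record, ensure_ascii=False) + "\n")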

2. Validate JSONL dataset file

  1. To validate JSONL files, you can make use of the JSONL Validator.py script.

  2. This script will report the input file as Valid or Invalid.

  3. This is how you should execute the script:

     python 'JSONL Validator.py' [JSONL Filename]
    
  4. This script can validate JSONL dataset files created for both Completions and Chat Completions models.
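
As a rough illustration of what such a check involves (a minimal sketch, not the toolkit's actual code), JSONL validation boils down to confirming that every non-empty line parses as a standalone JSON object:

     # Minimal JSONL validity check: every non-empty line must parse as a JSON object.
     # The toolkit's validator may apply stricter, model-specific rules on top of this.
     import json
     import sys

     def is_valid_jsonl(path):
         with open(path, encoding="utf-8") as f:
             for line_no, line in enumerate(f, start=1):
                 line = line.strip()
                 if not line:
                     continue
                 try:
                     record = json.loads(line)
                 except json.JSONDecodeError:
                     return False, line_no
                 if not isinstance(record, dict):
                     return False, line_no
         return True, None

     valid, bad_line = is_valid_jsonl(sys.argv[1])
     print("Valid" if valid else f"Invalid (line {bad_line})")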

3. Analyze JSONL dataset file

  1. This is only applicable to the datasets created for Completions models like babbage-002 or davinci-002.

  2. To analyze the JSONL dataset files, execute this command in your terminal:

     openai tools fine_tunes.prepare_data -f [JSONL Filename]
    
  3. More details can be found in the OpenAI fine-tuning documentation.
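
Note that fine_tunes.prepare_data is part of the legacy OpenAI CLI, which was removed in version 1.0 of the openai Python package. If the command is unavailable in your environment, installing a pre-1.0 version should restore it:

     pip install "openai<1.0"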

4. Convert JSONL dataset files to XLSX and CSV files

  1. Based on the model the JSONL dataset file was created for, there are different scripts you can use.

  2. To create XLSX and CSV files from a JSONL file that was created for fine-tuning Completions models like babbage-002 or davinci-002, use the Completions - JSONL to CSV and XLSX.py script. This is how you should execute the script:

     python 'Completions - JSONL to CSV and XLSX.py' [JSONL Filename]
    
  3. To create XLSX and CSV files from a JSONL file that was created for fine-tuning Chat Completions models like gpt-35-turbo-0613, use the Chat Completions - JSONL to CSV and XLSX.py script. This is how you should execute the script:

     python 'Chat Completions - JSONL to CSV and XLSX.py' [JSONL Filename]
    
  4. Both scripts will create an XLSX and a CSV file in your working directory with the same name as the input JSONL file; a sketch of the conversion follows below.
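
The reverse conversion is essentially reading each JSONL record back into tabular form. A minimal sketch for the Completions case (assuming pandas with openpyxl; not the toolkit's actual code):

     # Sketch of a Completions JSONL -> CSV and XLSX conversion.
     # Assumes pandas (plus openpyxl for .xlsx output); not the toolkit's actual code.
     import json
     import sys

     import pandas as pd

     path = sys.argv[1]
     with open(path, encoding="utf-8") as f:
         records = [json.loads(line) for line in f if line.strip()]

     df = pd.DataFrame(records)
     stem = path.rsplit(".", 1)[0]
     df.to_csv(stem + ".csv", index=False)
     df.to_excel(stem + ".xlsx", index=False)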

Thank you!

This toolkit saved me a lot of time creating dataset files for fine-tuning jobs. If it saves you time too, please share it with your friends and colleagues, and don't forget to give it a 🌟. Feel free to raise an issue or send a PR with improvements.
