This Fine Tuning Dataset Creation Toolkit helps you create the JSONL dataset files needed for fine-tuning Completions models like `babbage-002` or `davinci-002` and Chat Completions models like `gpt-35-turbo-0613` from XLSX and CSV files. These JSONL dataset files can be used to fine-tune models in OpenAI or Azure OpenAI.
**Prerequisites:**

- Python 3.10+ installed
- `virtualenv` package installed in Python

**Steps:**

- Create and activate a virtual environment
- Install dependencies
- Create a dataset in an XLSX or CSV file
- To create a virtual environment, open a terminal in your working directory and execute this command: `python -m venv .venv`
- To activate the virtual environment, execute this command in the terminal: `./.venv/Scripts/activate` (on Linux/macOS: `source .venv/bin/activate`)
- To install the dependencies needed to run this kit, execute this command in the terminal: `pip install -r requirements.txt`
- To create a dataset, create an XLSX or CSV file. You can take reference from the XLSX and CSV files inside the `Sample` folder; a hypothetical example is sketched below.
- This XLSX or CSV file needs to be in your working directory. [IMPORTANT]
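As an illustration only (the authoritative column layout is whatever the files in the `Sample` folder use; the column names here are hypothetical), a Completions-style CSV could look like this:

```csv
prompt,completion
What is the capital of France?,Paris
Who wrote Hamlet?,William Shakespeare
```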
**Features:**

- Convert an XLSX/CSV dataset file to a JSONL dataset file
- Validate a JSONL dataset file
- Analyze a JSONL dataset file [FOR COMPLETIONS JSONL DATASET ONLY]
- Convert JSONL dataset files to XLSX and CSV files [EXTRA]
**Convert XLSX/CSV dataset file to JSONL dataset file:**

- Based on the model your JSONL dataset file will be targeting for fine-tuning, there are different scripts you can use.
- If you are creating a dataset for Completions models like `babbage-002` or `davinci-002`, the script you should use is `Completions Dataset Formatter.py`. Execute it like this: `python 'Completions Dataset Formatter.py' [XLSX/CSV Filename]`
- If you are creating a dataset for a Chat Completions model like `gpt-35-turbo-0613`, the script you should use is `Chat Completions Dataset Formatter.py`. Execute it like this: `python 'Chat Completions Dataset Formatter.py' [XLSX/CSV Filename]`
- Both scripts will create a JSONL dataset file in your working directory with the same name as the input XLSX/CSV file; example records are shown below.
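For reference, the standard OpenAI fine-tuning record shapes such JSONL files contain are shown below, one JSON object per line (the texts are placeholder values; the formatter scripts' exact output depends on your XLSX/CSV data). A Completions record:

```jsonl
{"prompt": "What is the capital of France?", "completion": "Paris"}
```

A Chat Completions record:

```jsonl
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]}
```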
**Validate JSONL dataset file:**

- To validate JSONL files, you can make use of the `JSONL Validator.py` script.
- This script will return `Valid` or `Invalid` as output, based on the input file.
- Execute it like this: `python 'JSONL Validator.py' [JSONL Filename]`
- This script can validate JSONL dataset files created for both Completions and Chat Completions models; a sketch of the underlying idea follows.
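Validating JSONL essentially means checking that every non-empty line parses as a standalone JSON object. A minimal sketch of that idea, not the toolkit's actual implementation:

```python
import json
import sys

def is_valid_jsonl(path: str) -> bool:
    """Return True if every non-empty line is a standalone JSON object."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                return False
            if not isinstance(record, dict):
                return False  # each record must be a JSON object, not a bare value
    return True

if __name__ == "__main__":
    print("Valid" if is_valid_jsonl(sys.argv[1]) else "Invalid")
```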
**Analyze JSONL dataset file:**

- This is only applicable to datasets created for Completions models like `babbage-002` or `davinci-002`.
- To analyze the JSONL dataset files, execute this command in your terminal: `openai tools fine_tunes.prepare_data -f [JSONL Filename]`
- More details can be found [here].
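For example, with a hypothetical file named `dataset.jsonl`, you would run `openai tools fine_tunes.prepare_data -f dataset.jsonl`. Note that this command belongs to the legacy OpenAI CLI (`openai-python` versions prior to 1.0); it prints statistics about your examples and interactively suggests fixes.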
**Convert JSONL dataset files to XLSX and CSV files:**

- Based on the model your JSONL dataset file was created for, there are different scripts you can use.
- To create XLSX and CSV files from a JSONL file that was created for fine-tuning Completions models like `babbage-002` or `davinci-002`, the script you should use is `Completions - JSONL to CSV and XLSX.py`. Execute it like this: `python 'Completions - JSONL to CSV and XLSX.py' [JSONL Filename]`
- To create XLSX and CSV files from a JSONL file that was created for fine-tuning a Chat Completions model like `gpt-35-turbo-0613`, the script you should use is `Chat Completions - JSONL to CSV and XLSX.py`. Execute it like this: `python 'Chat Completions - JSONL to CSV and XLSX.py' [JSONL Filename]`
- Both scripts will create an XLSX and a CSV file in your working directory with the same name as the input JSONL file; a rough sketch of such a conversion is shown below.
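To give a sense of what this conversion involves, here is a minimal sketch for the Completions case. It is not the toolkit's implementation; it assumes the standard `prompt`/`completion` record shape and that `pandas` and `openpyxl` are installed:

```python
import json
import sys

import pandas as pd

def jsonl_to_tables(jsonl_path: str) -> None:
    # Read one JSON object per line into a list of dicts.
    with open(jsonl_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    # Flatten into a table; for Completions data the columns
    # become "prompt" and "completion".
    df = pd.DataFrame(records)
    stem = jsonl_path.rsplit(".", 1)[0]
    df.to_csv(f"{stem}.csv", index=False)
    df.to_excel(f"{stem}.xlsx", index=False)  # requires openpyxl

if __name__ == "__main__":
    jsonl_to_tables(sys.argv[1])
```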
This toolkit saved me a lot of time when creating dataset files for fine-tuning jobs. If it saves you time too, please share it with your friends and colleagues, and don't forget to give it a 🌟. Feel free to raise an issue or send a PR for improvements.