ChatMGL: A Large Language Model Fine-tuned for Data Science Questions

Manos Chatzakis ([email protected]), Ioannis Bantzis ([email protected]), Lluka Stojollari ([email protected])

Getting started

To get started, install all required Python packages:

pip install -U -r requirements.txt

Our implementation should work with any Python version >= 3.7, but we have only tested it against the versions listed in "python.txt".
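
To verify programmatically that your interpreter meets the floor mentioned above, a check such as the following is enough:

import sys

# ChatMGL needs Python >= 3.7; abort early if the interpreter is older.
assert sys.version_info >= (3, 7), f"Python {sys.version.split()[0]} is too old for ChatMGL"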

Model Download

The final ChatMGL model is larger than 3 GB, so it cannot be hosted in this GitHub repository. Instead, we provide a Google Drive link containing ChatMGL:

https://drive.google.com/drive/folders/1Klcx6gJHiJIj-BS6vB2-OM9Gx49qyxPd

We also provide a UNIX script that downloads the model into the /models/chatMGL/ directory. To run it, use:

chmod u+x get_chatMGL.sh
./get_chatMGL.sh

This script requires the gdown package, which is included in "requirements.txt".
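
If the shell script is not an option on your system, the same download can be done from Python with gdown. A minimal sketch, assuming the Drive folder link above and the default model directory:

import gdown

# Google Drive folder containing the ChatMGL model (link from the section above)
url = "https://drive.google.com/drive/folders/1Klcx6gJHiJIj-BS6vB2-OM9Gx49qyxPd"

# Download the folder contents into models/chatMGL/
gdown.download_folder(url, output="models/chatMGL", quiet=False)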

Testing run

We provide a bash script to get started. The script generates ChatMGL responses for the questions in "prompts.json". To run it:

chmod u+x run.sh
./run.sh

This script requires ChatMGL to be located under /models/chatMGL and prompts the model with the questions contained in "prompts.json". By default, the model generates up to 150 new tokens. This parameter is tunable, but it greatly affects the script's running time.

The script is a wrapper around gen_script_chatMGL.py, which can also be run directly:

cd src
python3 gen_script_chatMGL.py --model_path ../models/chatMGL/ --input_questions_path ../prompts.json --output_filename ../answers_chatMGL.json --generation_tokens 150
cd ..

The script sets all seeds to 42 so that the provided answers are reproducible. Generation also uses the default values for top-k, top-p, and temperature to further aid reproducibility.
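
For reference, pinning all seeds to 42 usually amounts to something like the sketch below; the exact calls in gen_script_chatMGL.py may differ, and random, NumPy, and PyTorch are assumed to be the relevant sources of randomness:

import random
import numpy as np
import torch

SEED = 42

# Seed every source of randomness so repeated runs give identical answers.
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)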

Because of the way we trained the model, it has learned to repeat the initial prompt before giving the actual answer, so the generation token limit should be set with this in mind.
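
If you post-process generations yourself, you may want to strip the echoed prompt before reading the answer. The helper below is a hypothetical sketch, not part of the repository:

def strip_echoed_prompt(prompt: str, generation: str) -> str:
    # Drop the leading copy of the prompt, if the model repeated it verbatim.
    if generation.startswith(prompt):
        return generation[len(prompt):].lstrip()
    return generation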

The generations currently in this repository were produced with a maximum of 300 generation tokens.

Using ChatMGL

You can load ChatMGL in Python and prompt it with any data science question. An indicative example is listed below.

from generative_model import GenerativeModel

path = "PathToTheModel"  # e.g. /models/chatMGL
model = GenerativeModel(path)

question = "Your data science question here"
response = model.generate(question)
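
Building on this, the behaviour of run.sh can be reproduced from Python. The sketch below assumes that prompts.json is a JSON list whose entries carry a "question" field; the actual schema may differ:

import json
from generative_model import GenerativeModel

model = GenerativeModel("models/chatMGL")

with open("prompts.json") as f:
    prompts = json.load(f)

# Answer every question and store the result next to the original.
answers = [{"question": p["question"], "answer": model.generate(p["question"])}
           for p in prompts]

with open("answers_chatMGL.json", "w") as f:
    json.dump(answers, f, indent=2)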

If you have a hardware accelerator, ChatMGL can run its computations there:

from generative_model import GenerativeModel

path = "PathToTheModel"  # e.g. /models/chatMGL
device = "YourHardwareAccelerator"  # e.g. cuda
model = GenerativeModel(path, device)
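
Assuming the model is backed by PyTorch, the device can also be chosen automatically:

import torch
from generative_model import GenerativeModel

# Use a CUDA GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GenerativeModel("PathToTheModel", device)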

Datasets

We make our datasets openly available in this repository in JSON format. The complete generative dataset is provided in gen_dataset_chatMGL.json and the complete reward dataset in reward_dataset_chatMGL.json. We also provide the train, validation, and test splits under the /dataset/ directory.
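
Since the datasets are plain JSON files, they can be inspected with the standard library alone. A quick sketch using the file name from above (no assumptions about the internal field names):

import json

# Load the complete generative dataset and report its top-level structure.
with open("gen_dataset_chatMGL.json") as f:
    data = json.load(f)

print(type(data).__name__, "with", len(data), "entries")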

Other Models

We provide all other models (generative and reward) described in our report under the /models/ directory.

About

ChatMGL was developed as the project for the EPFL MSc-level course "Modern Natural Language Processing".
