Working #32

Merged on Oct 19, 2023 (27 commits)

Commits
1f96ed4
modified: todo.md
Daethyra Oct 11, 2023
d6a9afb
Merge branch 'working' of https://github.com/Daethyra/LLM-Utilikit in…
Daethyra Oct 11, 2023
c1ea6c9
Tried adding more checks for stability
Daethyra Oct 11, 2023
f2e7bc8
Upgraded qa_local_docs. See changes below:
Daethyra Oct 11, 2023
20c7477
Enhanced logic of `qa_local_docs.py`
Daethyra Oct 12, 2023
241cc2d
Corrected faulty GPT updates in TODO
Daethyra Oct 12, 2023
830c60a
Refactored qa_local_docs and reorganized dirs
Daethyra Oct 12, 2023
e38f879
Removed useless initialization + print statements
Daethyra Oct 12, 2023
9034a6b
Add test module
Daethyra Oct 12, 2023
d60d308
++ .env & easy configuration for multiple variables
Daethyra Oct 12, 2023
bb64923
new file: LangChain/Chatbots/__init__.py
Daethyra Oct 12, 2023
0a2d0d8
Moved a dir, renamed a dir; Updated README
Daethyra Oct 12, 2023
e787961
Small README changes
Daethyra Oct 12, 2023
dbb90c3
Added HuggingFace section to 'todo.md'
Daethyra Oct 12, 2023
7ba795e
Attempted fine-tuning of a sequence classification model using Huggin…
Daethyra Oct 12, 2023
826d118
Update README.md
Daethyra Oct 12, 2023
5386e79
Removed stateful_chatbot, replaced w/ chroma_memory.py
Daethyra Oct 12, 2023
27840d1
Moved Prompts dir to root
Daethyra Oct 19, 2023
719778c
Renamed to Embedding-Upsertion to better represent its contents
Daethyra Oct 19, 2023
4403ff6
Filled in variables for readme
Daethyra Oct 19, 2023
04292a0
Finalizing readme edits, just need new mind map
Daethyra Oct 19, 2023
3ab6108
modified: README.md
Daethyra Oct 19, 2023
f2c7b7f
Archived some graphics
Daethyra Oct 19, 2023
27296b3
new file: .github/2023-10-18_Mindmap.jpg
Daethyra Oct 19, 2023
49c0745
modified: README.md
Daethyra Oct 19, 2023
1f1b403
Fixing #28
Daethyra Oct 19, 2023
75c666b
Merge branch 'master' into working
Daethyra Oct 19, 2023
File renamed without changes
File renamed without changes
File renamed without changes
Binary file added .github/2023-10-18_Mindmap.jpg
26 changes: 26 additions & 0 deletions HuggingFace/Accelerate/.env.template
@@ -0,0 +1,26 @@
# Checkpoint to use for the model
CHECKPOINT=distilbert-base-uncased

# Number of epochs to train the model
NUM_EPOCHS=3

# Learning rate for the optimizer
LR=3e-5

# Path to the data directory
DATA_PATH=data_path

# Tokenizer to use for the model
TOKENIZER=distilbert-base-uncased

# Train, validation, and test split ratios
TRAIN_RATIO=0.8
EVAL_RATIO=0.1
VAL_RATIO=0.05
TEST_RATIO=0.05

# Seed for reproducibility
SEED=42

# Batch size for training and evaluation
BATCH_SIZE=16
49 changes: 49 additions & 0 deletions HuggingFace/Accelerate/README.md
@@ -0,0 +1,49 @@
# Getting Started with Sequence Classification

Welcome to the Sequence Classification example! This guide will help you get started with training a sequence classification model using the Hugging Face Transformers library.

## Installation

To install the required packages, you can use pip:

`pip install torch transformers accelerate tqdm python-dotenv`

## Usage

To use the Sequence Classification example, run the `fine_tune_sequence_classification_model.py` script:

`python fine_tune_sequence_classification_model.py`

This will train a sequence classification model on a dataset and evaluate its performance on the validation and test sets.

## Configuration

The behavior of the Sequence Classification example can be configured using environment variables. Here are the available environment variables and their default values:

- `CHECKPOINT`: The path or identifier of the pre-trained checkpoint to use. Default is `distilbert-base-uncased`.
- `NUM_EPOCHS`: The number of epochs to train for. Default is `3`.
- `LR`: The learning rate to use for the optimizer. Default is `3e-5`.
- `DATA_PATH`: The path to the dataset. This is a required environment variable.
- `TOKENIZER`: The path or identifier of the tokenizer to use. Default is `distilbert-base-uncased`.
- `TRAIN_RATIO`: The ratio of examples to use for training. Default is `0.8`.
- `EVAL_RATIO`: The ratio of examples to use for evaluation. Default is `0.1`.
- `VAL_RATIO`: The ratio of examples to use for validation. Default is `0.05`.
- `TEST_RATIO`: The ratio of examples to use for testing. Default is `0.05`.
- `SEED`: The random seed to use for shuffling the dataset. Default is `42`.
- `BATCH_SIZE`: The batch size to use for training, evaluation, and validation. Default is `16`.

You can set these environment variables in a `.env` file placed in the same directory as the script; a `.env.template` with default values is provided. Here's an example `.env` file:

```
DATA_PATH=data.csv
TRAIN_RATIO=0.7
EVAL_RATIO=0.15
VAL_RATIO=0.05
TEST_RATIO=0.1
```
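
The training script imports a `MyDataset` class from a local `my_dataset` module that is not included in this pull request. The sketch below shows one minimal way such a class could look, assuming `DATA_PATH` points at a CSV file with `text` and `label` columns; the module name comes from the script, but the column names and `max_length` value are illustrative assumptions.

```
# my_dataset.py (hypothetical sketch; adapt to your data format)
import csv
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data_path, tokenizer, max_length=128):
        with open(data_path, newline="") as f:
            self.rows = list(csv.DictReader(f))  # expects 'text' and 'label' columns
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        encoding = self.tokenizer(
            row["text"],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        # Drop the batch dimension added by return_tensors="pt"
        item = {k: v.squeeze(0) for k, v in encoding.items()}
        item["labels"] = torch.tensor(int(row["label"]))
        return item
```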

---

# GPT Description

This Python script defines a `Trainer` class for fine-tuning a pre-trained sequence classification model with the Hugging Face Transformers library. The class provides `prepare()` and `train()` methods that initialize the model, optimizer, and learning rate scheduler and run the training loop; evaluation is performed directly on the held-out data loaders. The script also defines a `split_dataset` function that splits a dataset into training, evaluation, validation, and test subsets.
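
In condensed form, the flow implemented in the script's example usage section looks like this (the batch size and other values shown are the script's defaults):

```
dataset = MyDataset(data_path, tokenizer)
train_subset, eval_subset, val_subset, test_subset = split_dataset(dataset)

trainer = Trainer(
    train_dataloader=DataLoader(train_subset, batch_size=16, shuffle=True),
    eval_dataloader=DataLoader(eval_subset, batch_size=16),
    val_dataloader=DataLoader(val_subset, batch_size=16),
    test_dataloader=DataLoader(test_subset, batch_size=16),
)
trainer.prepare()  # builds the model, optimizer, scheduler, and progress bar
trainer.train()    # runs the fine-tuning loop
```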

The script includes an example usage section that demonstrates the full workflow with a custom dataset: it loads a pre-trained model and tokenizer, splits the dataset, builds the data loaders, fine-tunes the model, and reports accuracy on the validation and test sets. Saving the fine-tuned model to disk for later use is left to the user; a sketch follows below.
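
One way to persist the fine-tuned weights, shown here as a sketch rather than something the script currently does (the output directory name is an arbitrary choice), is to unwrap the model from Accelerate and call `save_pretrained`:

```
# After trainer.train(); not part of the current script
output_dir = "fine_tuned_model"
unwrapped_model = trainer.accelerator.unwrap_model(trainer.model)
unwrapped_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)  # keep the matching tokenizer with the weights
```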

Finally, the script includes a unit test class, `TestFineTuneSequenceClassificationModel`, which exercises `split_dataset` and the `prepare`, `train`, and evaluation steps of the `Trainer` class. Running these tests with a framework such as `unittest` helps verify that the implementation behaves as expected.
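
Assuming a `my_dataset` module and the data file referenced in the tests are available, the tests can be run with the standard unittest runner:

`python -m unittest fine_tune_sequence_classification_model`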

To improve readability, comments could be added to explain the purpose of each method and variable. Breaking the `Trainer` class into smaller, more focused components would improve modularity, and additional error handling and input validation would make the code more robust.
246 changes: 246 additions & 0 deletions HuggingFace/Accelerate/fine_tune_sequence_classification_model.py
@@ -0,0 +1,246 @@
import os
import random
import torch
from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler, AutoTokenizer
from torch.utils.data import DataLoader, Subset
from tqdm import tqdm
from dotenv import load_dotenv
import unittest

load_dotenv()

class Trainer:
"""
A class for training a sequence classification model using the Hugging Face Transformers library.

Args:
checkpoint (str): The path or identifier of the pre-trained checkpoint to use.
train_dataloader (DataLoader): The data loader for the training set.
eval_dataloader (DataLoader): The data loader for the evaluation set.
val_dataloader (DataLoader): The data loader for the validation set.
test_dataloader (DataLoader): The data loader for the test set.
num_epochs (int, optional): The number of epochs to train for. Defaults to 3.
lr (float, optional): The learning rate to use for the optimizer. Defaults to 3e-5.
"""
def __init__(self, checkpoint=None, train_dataloader=None, eval_dataloader=None, val_dataloader=None, test_dataloader=None, num_epochs=None, lr=None):
"""
Initializes a new instance of the Trainer class.

Args:
checkpoint (str): The path or identifier of the pre-trained checkpoint to use.
train_dataloader (DataLoader): The data loader for the training set.
eval_dataloader (DataLoader): The data loader for the evaluation set.
val_dataloader (DataLoader): The data loader for the validation set.
test_dataloader (DataLoader): The data loader for the test set.
num_epochs (int, optional): The number of epochs to train for. Defaults to 3.
lr (float, optional): The learning rate to use for the optimizer. Defaults to 3e-5.
"""
self.checkpoint = checkpoint or os.getenv("CHECKPOINT", "distilbert-base-uncased")
self.train_dataloader = train_dataloader
self.eval_dataloader = eval_dataloader
self.val_dataloader = val_dataloader
self.test_dataloader = test_dataloader
self.num_epochs = num_epochs or int(os.getenv("NUM_EPOCHS", 3))
self.lr = lr or float(os.getenv("LR", 3e-5))
self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
self.accelerator = Accelerator()
self.model = None
self.optimizer = None
self.lr_scheduler = None
self.progress_bar = None

def prepare(self):
"""
Initializes the model, optimizer, and learning rate scheduler.
"""
if self.train_dataloader is None or self.eval_dataloader is None or self.val_dataloader is None or self.test_dataloader is None:
raise ValueError("Data loaders not defined. Cannot prepare trainer.")
self.model = AutoModelForSequenceClassification.from_pretrained(self.checkpoint, num_labels=2)
self.optimizer = AdamW(self.model.parameters(), lr=self.lr)
self.model.to(self.device)
self.train_dataloader, self.eval_dataloader, self.val_dataloader, self.test_dataloader, self.model, self.optimizer = self.accelerator.prepare(
self.train_dataloader, self.eval_dataloader, self.val_dataloader, self.test_dataloader, self.model, self.optimizer
)
num_training_steps = self.num_epochs * len(self.train_dataloader)
self.lr_scheduler = get_scheduler(
"linear",
optimizer=self.optimizer,
num_warmup_steps=0,
num_training_steps=num_training_steps
)
self.progress_bar = tqdm(range(num_training_steps))

def train(self):
"""
Trains the model for the specified number of epochs.

Raises:
ValueError: If the model, optimizer, learning rate scheduler, or progress bar is not initialized.
"""
if self.model is None or self.optimizer is None or self.lr_scheduler is None or self.progress_bar is None:
raise ValueError("Trainer not prepared. Call prepare() method first.")
self.model.train()
for epoch in range(self.num_epochs):
for batch in self.train_dataloader:
batch = {k: v.to(self.device) for k, v in batch.items()}
outputs = self.model(**batch)
loss = outputs.loss
                # Let Accelerate handle the backward pass; calling loss.backward() as well
                # would accumulate gradients twice.
                self.accelerator.backward(loss)

self.optimizer.step()
self.lr_scheduler.step()
self.optimizer.zero_grad()
self.progress_bar.update(1)

def split_dataset(dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42):
"""
Splits a dataset into training, evaluation, validation, and test subsets.

Args:
dataset (Dataset): The dataset to split.
train_ratio (float, optional): The ratio of examples to use for training. Defaults to 0.8.
eval_ratio (float, optional): The ratio of examples to use for evaluation. Defaults to 0.1.
val_ratio (float, optional): The ratio of examples to use for validation. Defaults to 0.05.
test_ratio (float, optional): The ratio of examples to use for testing. Defaults to 0.05.
seed (int, optional): The random seed to use for shuffling the dataset. Defaults to 42.

Returns:
Tuple[Subset]: A tuple of four subsets for training, evaluation, validation, and test.
"""
num_examples = len(dataset)
indices = list(range(num_examples))
random.seed(seed)
random.shuffle(indices)
train_size = int(train_ratio * num_examples)
eval_size = int(eval_ratio * num_examples)
val_size = int(val_ratio * num_examples)
test_size = int(test_ratio * num_examples)
train_indices = indices[:train_size]
eval_indices = indices[train_size:train_size+eval_size]
val_indices = indices[train_size+eval_size:train_size+eval_size+val_size]
test_indices = indices[train_size+eval_size+val_size:train_size+eval_size+val_size+test_size]
train_subset = Subset(dataset, train_indices)
eval_subset = Subset(dataset, eval_indices)
val_subset = Subset(dataset, val_indices)
test_subset = Subset(dataset, test_indices)
return train_subset, eval_subset, val_subset, test_subset

# `MyDataset` is a user-provided dataset module (not included in this repository);
# it is imported at module level because the test class below also depends on it.
from my_dataset import MyDataset

# Example usage
if __name__ == "__main__":

# Load dataset
data_path = os.getenv("DATA_PATH")
tokenizer = AutoTokenizer.from_pretrained(os.getenv("TOKENIZER", "distilbert-base-uncased"))
dataset = MyDataset(data_path, tokenizer)

# Split dataset
train_ratio = float(os.getenv("TRAIN_RATIO", 0.8))
eval_ratio = float(os.getenv("EVAL_RATIO", 0.1))
val_ratio = float(os.getenv("VAL_RATIO", 0.05))
test_ratio = float(os.getenv("TEST_RATIO", 0.05))
seed = int(os.getenv("SEED", 42))
train_subset, eval_subset, val_subset, test_subset = split_dataset(dataset, train_ratio, eval_ratio, val_ratio, test_ratio, seed)

# Create data loaders
batch_size = int(os.getenv("BATCH_SIZE", 16))
train_dataloader = DataLoader(train_subset, batch_size=batch_size, shuffle=True)
eval_dataloader = DataLoader(eval_subset, batch_size=batch_size, shuffle=False)
val_dataloader = DataLoader(val_subset, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_subset, batch_size=batch_size, shuffle=False)

# Create trainer
trainer = Trainer(train_dataloader=train_dataloader, eval_dataloader=eval_dataloader, val_dataloader=val_dataloader, test_dataloader=test_dataloader)

# Prepare trainer
trainer.prepare()

# Train model
trainer.train()

# Evaluate model on validation set
trainer.model.eval()
with torch.no_grad():
total_correct = 0
total_samples = 0
for batch in val_dataloader:
batch = {k: v.to(trainer.device) for k, v in batch.items()}
outputs = trainer.model(**batch)
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)
labels = batch["labels"]
total_correct += (predictions == labels).sum().item()
total_samples += len(labels)
accuracy = total_correct / total_samples
print(f"Validation accuracy: {accuracy:.4f}")

# Evaluate model on test set
trainer.model.eval()
with torch.no_grad():
total_correct = 0
total_samples = 0
for batch in test_dataloader:
batch = {k: v.to(trainer.device) for k, v in batch.items()}
outputs = trainer.model(**batch)
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)
labels = batch["labels"]
total_correct += (predictions == labels).sum().item()
total_samples += len(labels)
accuracy = total_correct / total_samples
print(f"Test accuracy: {accuracy:.4f}")

class TestFineTuneSequenceClassificationModel(unittest.TestCase):
def setUp(self):
self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
self.dataset = MyDataset("data_path", self.tokenizer)
self.train_subset, self.eval_subset, self.val_subset, self.test_subset = split_dataset(self.dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42)
self.batch_size = 16
self.train_dataloader = DataLoader(self.train_subset, batch_size=self.batch_size, shuffle=True)
self.eval_dataloader = DataLoader(self.eval_subset, batch_size=self.batch_size, shuffle=False)
self.val_dataloader = DataLoader(self.val_subset, batch_size=self.batch_size, shuffle=False)
self.test_dataloader = DataLoader(self.test_subset, batch_size=self.batch_size, shuffle=False)
self.trainer = Trainer(train_dataloader=self.train_dataloader, eval_dataloader=self.eval_dataloader, val_dataloader=self.val_dataloader, test_dataloader=self.test_dataloader)

def test_split_dataset(self):
train_subset, eval_subset, val_subset, test_subset = split_dataset(self.dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42)
self.assertEqual(len(train_subset), 80)
self.assertEqual(len(eval_subset), 10)
self.assertEqual(len(val_subset), 5)
self.assertEqual(len(test_subset), 5)

def test_prepare(self):
self.trainer.prepare()
self.assertIsNotNone(self.trainer.model)
self.assertIsNotNone(self.trainer.optimizer)
self.assertIsNotNone(self.trainer.lr_scheduler)
self.assertIsNotNone(self.trainer.progress_bar)

def test_train(self):
self.trainer.prepare()
self.trainer.train()
self.assertIsNotNone(self.trainer.model)

def test_evaluate(self):
self.trainer.prepare()
self.trainer.train()
self.trainer.model.eval()
with torch.no_grad():
total_correct = 0
total_samples = 0
for batch in self.val_dataloader:
batch = {k: v.to(self.trainer.device) for k, v in batch.items()}
outputs = self.trainer.model(**batch)
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)
labels = batch["labels"]
total_correct += (predictions == labels).sum().item()
total_samples += len(labels)
accuracy = total_correct / total_samples
self.assertGreaterEqual(accuracy, 0.0)
self.assertLessEqual(accuracy, 1.0)

if __name__ == '__main__':
unittest.main()
File renamed without changes.
45 changes: 45 additions & 0 deletions LangChain/Chatbots/chroma_memory.py
@@ -0,0 +1,45 @@
import logging
from typing import List, Any, Dict
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.storage import LocalFileStore
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA
import chromadb
from langchain.vectorstores import Chroma

logging.basicConfig(level=logging.ERROR)

class ChromaMemory:
def __init__(self, model_name: str, cache_dir: str, max_history_len: int, vectorstore: Chroma):
"""
Initialize the ChromaMemory with a model name, cache directory, maximum history length, and a vectorstore.
Args:
            model_name (str): The name of the chat model to use.
            cache_dir (str): The path to the directory used to cache embeddings.
            max_history_len (int): The maximum number of conversation turns to remember.
            vectorstore (Chroma): The vectorstore to use for similarity matching.

        Example:
            chroma_memory = ChromaMemory(model_name, cache_dir, max_history_len, vectorstore)
        """
        # Note: the calls below target the langchain 0.0.x API available when this
        # module was written; newer releases may relocate these classes.
        try:
            # Cache embeddings on disk so repeated texts are not re-embedded.
            underlying_embeddings = OpenAIEmbeddings()
            self.embeddings = CacheBackedEmbeddings.from_bytes_store(
                underlying_embeddings,
                LocalFileStore(cache_dir),
                namespace=underlying_embeddings.model,
            )
            # Drop near-duplicate documents based on embedding similarity.
            self.filter = EmbeddingsRedundantFilter(embeddings=self.embeddings)
            # Chat model used to answer questions.
            self.chat_model = ChatOpenAI(model_name=model_name)
            # Keep only the last `max_history_len` exchanges in memory.
            self.memory = ConversationBufferWindowMemory(k=max_history_len)
            # Retrieval-augmented QA chain over the Chroma vectorstore.
            self.retrieval = RetrievalQA.from_chain_type(
                llm=self.chat_model,
                retriever=vectorstore.as_retriever(),
                memory=self.memory,
            )
        except Exception as e:
            logging.error(f"Error initializing ChromaMemory: {e}")
            raise ValueError(f"Error initializing ChromaMemory: {e}") from e
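
# --- Example usage (illustrative sketch; not part of the original module) ---
# Assumes OPENAI_API_KEY is set and `docs` is a list of langchain Document objects.
#
#     from langchain.embeddings import OpenAIEmbeddings
#     from langchain.vectorstores import Chroma
#
#     vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
#     chroma_memory = ChromaMemory(
#         model_name="gpt-3.5-turbo",
#         cache_dir="./embedding_cache",
#         max_history_len=5,
#         vectorstore=vectorstore,
#     )
#     answer = chroma_memory.retrieval.run("What do the local documents cover?")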