Project Overhaul (#32)
- Rearranged directory structure
- Updated necessary documentation
- Enhanced programming logic
- Closes issues #28, #30, #31.
Daethyra authored Oct 19, 2023
2 parents 6c8122d + 75c666b commit d68f517
Showing 38 changed files with 717 additions and 473 deletions.
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
Binary file added .github/2023-10-18_Mindmap.jpg
26 changes: 26 additions & 0 deletions HuggingFace/Accelerate/.env.template
@@ -0,0 +1,26 @@
# Checkpoint to use for the model
CHECKPOINT=distilbert-base-uncased

# Number of epochs to train the model
NUM_EPOCHS=3

# Learning rate for the optimizer
LR=3e-5

# Path to the data directory
DATA_PATH=data_path

# Tokenizer to use for the model
TOKENIZER=distilbert-base-uncased

# Train, validation, and test split ratios
TRAIN_RATIO=0.8
EVAL_RATIO=0.1
VAL_RATIO=0.05
TEST_RATIO=0.05

# Seed for reproducibility
SEED=42

# Batch size for training and evaluation
BATCH_SIZE=16
49 changes: 49 additions & 0 deletions HuggingFace/Accelerate/README.md
@@ -0,0 +1,49 @@
# Getting Started with Sequence Classification

Welcome to the Sequence Classification example! This guide will help you get started with training a sequence classification model using the Hugging Face Transformers library.

## Installation

To install the required packages, you can use pip:

`pip install torch transformers accelerate tqdm python-dotenv`

## Usage

To run the Sequence Classification example, execute the `fine_tune_sequence_classification_model.py` script:

`python fine_tune_sequence_classification_model.py`

This will train a sequence classification model on a dataset and evaluate its performance on the validation and test sets.

## Configuration

The behavior of the Sequence Classification example can be configured using environment variables. Here are the available environment variables and their default values:

- `CHECKPOINT`: The path or identifier of the pre-trained checkpoint to use. Default is `distilbert-base-uncased`.
- `NUM_EPOCHS`: The number of epochs to train for. Default is `3`.
- `LR`: The learning rate to use for the optimizer. Default is `3e-5`.
- `DATA_PATH`: The path to the dataset. This is a required environment variable.
- `TOKENIZER`: The path or identifier of the tokenizer to use. Default is `distilbert-base-uncased`.
- `TRAIN_RATIO`: The ratio of examples to use for training. Default is `0.8`.
- `EVAL_RATIO`: The ratio of examples to use for evaluation. Default is `0.1`.
- `VAL_RATIO`: The ratio of examples to use for validation. Default is `0.05`.
- `TEST_RATIO`: The ratio of examples to use for testing. Default is `0.05`.
- `SEED`: The random seed to use for shuffling the dataset. Default is `42`.
- `BATCH_SIZE`: The batch size to use for training, evaluation, and validation. Default is `16`.

You can set these environment variables using a `.env` file in the same directory as the `fine_tune_sequence_classification_model.py` script. Here's an example `.env` file:

```
DATA_PATH=data.csv
TRAIN_RATIO=0.7
EVAL_RATIO=0.15
VAL_RATIO=0.05
TEST_RATIO=0.1
```
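
For reference, the script reads these values with `python-dotenv` and `os.getenv`; a minimal sketch of that pattern (the variable names and defaults match the template above):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

checkpoint = os.getenv("CHECKPOINT", "distilbert-base-uncased")
num_epochs = int(os.getenv("NUM_EPOCHS", 3))
lr = float(os.getenv("LR", 3e-5))
data_path = os.getenv("DATA_PATH")  # required; no default
batch_size = int(os.getenv("BATCH_SIZE", 16))
```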

---

# GPT Description

The script defines a `Trainer` class for fine-tuning a pre-trained sequence classification model with the Hugging Face Transformers library. `Trainer` provides methods for preparing the dataset, training the model, and evaluating its performance, and the module-level `split_dataset` function splits a dataset into training, evaluation, validation, and test subsets.

An example usage section demonstrates the workflow with a custom dataset: load a pre-trained model and tokenizer, split the data, build the data loaders, fine-tune the model, and evaluate it on the validation and test sets.

Finally, the `TestFineTuneSequenceClassificationModel` unit test class exercises `split_dataset` and the `prepare`, `train`, and evaluation steps of `Trainer`; it can be run with a testing framework such as `unittest` to verify that the class behaves as expected.

To improve the code, it may help to comment each method and variable, break `Trainer` into smaller, more focused classes or functions, and add more error handling and input validation so that bad inputs fail early. For example, the split ratios could be checked before splitting, as sketched below.
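
A minimal, hypothetical ratio check might look like this (the `validate_ratios` helper is an illustration, not part of the commit):

```python
def validate_ratios(train_ratio: float, eval_ratio: float, val_ratio: float, test_ratio: float) -> None:
    """Raise ValueError if any ratio is out of range or the ratios do not sum to 1."""
    ratios = {"train": train_ratio, "eval": eval_ratio, "val": val_ratio, "test": test_ratio}
    for name, value in ratios.items():
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name}_ratio must be strictly between 0 and 1, got {value}")
    total = sum(ratios.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"Split ratios must sum to 1.0, got {total}")
```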
246 changes: 246 additions & 0 deletions HuggingFace/Accelerate/fine_tune_sequence_classification_model.py
@@ -0,0 +1,246 @@
import os
import random
import torch
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, get_scheduler, AutoTokenizer
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
from torch.utils.data import DataLoader, Subset
from tqdm import tqdm
from dotenv import load_dotenv
import unittest

load_dotenv()

class Trainer:
"""
A class for training a sequence classification model using the Hugging Face Transformers library.
Args:
checkpoint (str): The path or identifier of the pre-trained checkpoint to use.
train_dataloader (DataLoader): The data loader for the training set.
eval_dataloader (DataLoader): The data loader for the evaluation set.
val_dataloader (DataLoader): The data loader for the validation set.
test_dataloader (DataLoader): The data loader for the test set.
num_epochs (int, optional): The number of epochs to train for. Defaults to 3.
lr (float, optional): The learning rate to use for the optimizer. Defaults to 3e-5.
"""
def __init__(self, checkpoint=None, train_dataloader=None, eval_dataloader=None, val_dataloader=None, test_dataloader=None, num_epochs=None, lr=None):
"""
Initializes a new instance of the Trainer class.
Args:
checkpoint (str): The path or identifier of the pre-trained checkpoint to use.
train_dataloader (DataLoader): The data loader for the training set.
eval_dataloader (DataLoader): The data loader for the evaluation set.
val_dataloader (DataLoader): The data loader for the validation set.
test_dataloader (DataLoader): The data loader for the test set.
num_epochs (int, optional): The number of epochs to train for. Defaults to 3.
lr (float, optional): The learning rate to use for the optimizer. Defaults to 3e-5.
"""
self.checkpoint = checkpoint or os.getenv("CHECKPOINT", "distilbert-base-uncased")
self.train_dataloader = train_dataloader
self.eval_dataloader = eval_dataloader
self.val_dataloader = val_dataloader
self.test_dataloader = test_dataloader
self.num_epochs = num_epochs or int(os.getenv("NUM_EPOCHS", 3))
self.lr = lr or float(os.getenv("LR", 3e-5))
self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
self.accelerator = Accelerator()
self.model = None
self.optimizer = None
self.lr_scheduler = None
self.progress_bar = None

def prepare(self):
"""
Initializes the model, optimizer, and learning rate scheduler.
"""
if self.train_dataloader is None or self.eval_dataloader is None or self.val_dataloader is None or self.test_dataloader is None:
raise ValueError("Data loaders not defined. Cannot prepare trainer.")
self.model = AutoModelForSequenceClassification.from_pretrained(self.checkpoint, num_labels=2)
self.optimizer = AdamW(self.model.parameters(), lr=self.lr)
self.model.to(self.device)
self.train_dataloader, self.eval_dataloader, self.val_dataloader, self.test_dataloader, self.model, self.optimizer = self.accelerator.prepare(
self.train_dataloader, self.eval_dataloader, self.val_dataloader, self.test_dataloader, self.model, self.optimizer
)
num_training_steps = self.num_epochs * len(self.train_dataloader)
self.lr_scheduler = get_scheduler(
"linear",
optimizer=self.optimizer,
num_warmup_steps=0,
num_training_steps=num_training_steps
)
self.progress_bar = tqdm(range(num_training_steps))

def train(self):
"""
Trains the model for the specified number of epochs.
Raises:
ValueError: If the model, optimizer, learning rate scheduler, or progress bar is not initialized.
"""
if self.model is None or self.optimizer is None or self.lr_scheduler is None or self.progress_bar is None:
raise ValueError("Trainer not prepared. Call prepare() method first.")
self.model.train()
for epoch in range(self.num_epochs):
for batch in self.train_dataloader:
batch = {k: v.to(self.device) for k, v in batch.items()}
outputs = self.model(**batch)
loss = outputs.loss
self.accelerator.backward(loss)  # let Accelerate handle the backward pass; also calling loss.backward() would backpropagate twice

self.optimizer.step()
self.lr_scheduler.step()
self.optimizer.zero_grad()
self.progress_bar.update(1)

def split_dataset(dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42):
"""
Splits a dataset into training, evaluation, validation, and test subsets.
Args:
dataset (Dataset): The dataset to split.
train_ratio (float, optional): The ratio of examples to use for training. Defaults to 0.8.
eval_ratio (float, optional): The ratio of examples to use for evaluation. Defaults to 0.1.
val_ratio (float, optional): The ratio of examples to use for validation. Defaults to 0.05.
test_ratio (float, optional): The ratio of examples to use for testing. Defaults to 0.05.
seed (int, optional): The random seed to use for shuffling the dataset. Defaults to 42.
Returns:
Tuple[Subset]: A tuple of four subsets for training, evaluation, validation, and test.
"""
num_examples = len(dataset)
indices = list(range(num_examples))
random.seed(seed)
random.shuffle(indices)
train_size = int(train_ratio * num_examples)
eval_size = int(eval_ratio * num_examples)
val_size = int(val_ratio * num_examples)
test_size = int(test_ratio * num_examples)
train_indices = indices[:train_size]
eval_indices = indices[train_size:train_size+eval_size]
val_indices = indices[train_size+eval_size:train_size+eval_size+val_size]
test_indices = indices[train_size+eval_size+val_size:train_size+eval_size+val_size+test_size]
train_subset = Subset(dataset, train_indices)
eval_subset = Subset(dataset, eval_indices)
val_subset = Subset(dataset, val_indices)
test_subset = Subset(dataset, test_indices)
return train_subset, eval_subset, val_subset, test_subset

# Example usage
if __name__ == "__main__":
from my_dataset import MyDataset

# Load dataset
data_path = os.getenv("DATA_PATH")
tokenizer = AutoTokenizer.from_pretrained(os.getenv("TOKENIZER", "distilbert-base-uncased"))
dataset = MyDataset(data_path, tokenizer)

# Split dataset
train_ratio = float(os.getenv("TRAIN_RATIO", 0.8))
eval_ratio = float(os.getenv("EVAL_RATIO", 0.1))
val_ratio = float(os.getenv("VAL_RATIO", 0.05))
test_ratio = float(os.getenv("TEST_RATIO", 0.05))
seed = int(os.getenv("SEED", 42))
train_subset, eval_subset, val_subset, test_subset = split_dataset(dataset, train_ratio, eval_ratio, val_ratio, test_ratio, seed)

# Create data loaders
batch_size = int(os.getenv("BATCH_SIZE", 16))
train_dataloader = DataLoader(train_subset, batch_size=batch_size, shuffle=True)
eval_dataloader = DataLoader(eval_subset, batch_size=batch_size, shuffle=False)
val_dataloader = DataLoader(val_subset, batch_size=batch_size, shuffle=False)
test_dataloader = DataLoader(test_subset, batch_size=batch_size, shuffle=False)

# Create trainer
trainer = Trainer(train_dataloader=train_dataloader, eval_dataloader=eval_dataloader, val_dataloader=val_dataloader, test_dataloader=test_dataloader)

# Prepare trainer
trainer.prepare()

# Train model
trainer.train()

# Evaluate model on validation set
trainer.model.eval()
with torch.no_grad():
total_correct = 0
total_samples = 0
for batch in val_dataloader:
batch = {k: v.to(trainer.device) for k, v in batch.items()}
outputs = trainer.model(**batch)
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)
labels = batch["labels"]
total_correct += (predictions == labels).sum().item()
total_samples += len(labels)
accuracy = total_correct / total_samples
print(f"Validation accuracy: {accuracy:.4f}")

# Evaluate model on test set
trainer.model.eval()
with torch.no_grad():
total_correct = 0
total_samples = 0
for batch in test_dataloader:
batch = {k: v.to(trainer.device) for k, v in batch.items()}
outputs = trainer.model(**batch)
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)
labels = batch["labels"]
total_correct += (predictions == labels).sum().item()
total_samples += len(labels)
accuracy = total_correct / total_samples
print(f"Test accuracy: {accuracy:.4f}")

class TestFineTuneSequenceClassificationModel(unittest.TestCase):
def setUp(self):
self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
self.dataset = MyDataset("data_path", self.tokenizer)
self.train_subset, self.eval_subset, self.val_subset, self.test_subset = split_dataset(self.dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42)
self.batch_size = 16
self.train_dataloader = DataLoader(self.train_subset, batch_size=self.batch_size, shuffle=True)
self.eval_dataloader = DataLoader(self.eval_subset, batch_size=self.batch_size, shuffle=False)
self.val_dataloader = DataLoader(self.val_subset, batch_size=self.batch_size, shuffle=False)
self.test_dataloader = DataLoader(self.test_subset, batch_size=self.batch_size, shuffle=False)
self.trainer = Trainer(train_dataloader=self.train_dataloader, eval_dataloader=self.eval_dataloader, val_dataloader=self.val_dataloader, test_dataloader=self.test_dataloader)

def test_split_dataset(self):
train_subset, eval_subset, val_subset, test_subset = split_dataset(self.dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42)
self.assertEqual(len(train_subset), 80)
self.assertEqual(len(eval_subset), 10)
self.assertEqual(len(val_subset), 5)
self.assertEqual(len(test_subset), 5)

def test_prepare(self):
self.trainer.prepare()
self.assertIsNotNone(self.trainer.model)
self.assertIsNotNone(self.trainer.optimizer)
self.assertIsNotNone(self.trainer.lr_scheduler)
self.assertIsNotNone(self.trainer.progress_bar)

def test_train(self):
self.trainer.prepare()
self.trainer.train()
self.assertIsNotNone(self.trainer.model)

def test_evaluate(self):
self.trainer.prepare()
self.trainer.train()
self.trainer.model.eval()
with torch.no_grad():
total_correct = 0
total_samples = 0
for batch in self.val_dataloader:
batch = {k: v.to(self.trainer.device) for k, v in batch.items()}
outputs = self.trainer.model(**batch)
logits = outputs.logits
predictions = torch.argmax(logits, dim=1)
labels = batch["labels"]
total_correct += (predictions == labels).sum().item()
total_samples += len(labels)
accuracy = total_correct / total_samples
self.assertGreaterEqual(accuracy, 0.0)
self.assertLessEqual(accuracy, 1.0)

if __name__ == '__main__':
unittest.main()
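
Note that the example usage imports `MyDataset` from a `my_dataset` module that is not included in this commit. Purely as an illustration, a minimal dataset compatible with the `Trainer` above might look like the sketch below, assuming a CSV file with `text` and `label` columns (the column names and `max_length` value are assumptions):

```python
import csv

import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    """Hypothetical dataset: one tokenized text/label pair per CSV row."""

    def __init__(self, data_path, tokenizer, max_length=128):
        with open(data_path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        self.texts = [row["text"] for row in rows]
        self.labels = [int(row["label"]) for row in rows]
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        item = {k: v.squeeze(0) for k, v in encoding.items()}  # drop the batch dimension
        item["labels"] = torch.tensor(self.labels[idx])  # key expected by the model's loss
        return item
```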
File renamed without changes.
45 changes: 45 additions & 0 deletions LangChain/Chatbots/chroma_memory.py
@@ -0,0 +1,45 @@
import logging
from typing import List, Any, Dict
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain.document_transformers import EmbeddingsRedundantFilter  # lives in document_transformers, not langchain.filters
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA
import chromadb
from langchain.vectorstores import Chroma

logging.basicConfig(level=logging.ERROR)

class ChromaMemory:
def __init__(self, model_name: str, cache_dir: str, max_history_len: int, vectorstore: Chroma):
"""
Initialize the ChromaMemory with a model name, cache directory, maximum history length, and a vectorstore.
Args:
model_name (str): The name of the LLM model to use.
cache_dir (str): The path to the directory to cache embeddings.
vectorstore (Chroma): The vectorstore to use for similarity matching.
chroma_memory = ChromaMemory(model_name, cache_dir, max_history_len, vectorstore)
max_history_len (int): The maximum length of the conversation history to remember.
"""
try:
# Cache embeddings on disk so repeated texts are not re-embedded.
underlying_embeddings = OpenAIEmbeddings()
self.embeddings = CacheBackedEmbeddings.from_bytes_store(
underlying_embeddings,
LocalFileStore(cache_dir),
namespace=underlying_embeddings.model
)
# Drop documents whose embeddings are redundant with one another.
self.filter = EmbeddingsRedundantFilter(embeddings=self.embeddings)
# Chat model used to answer questions over retrieved context.
self.chat_model = ChatOpenAI(model_name=model_name)
# Remember only the most recent `max_history_len` exchanges.
self.memory = ConversationBufferWindowMemory(k=max_history_len)
# Retrieval-augmented QA chain backed by the supplied vectorstore.
self.retrieval = RetrievalQA.from_chain_type(
llm=self.chat_model,
retriever=vectorstore.as_retriever(),
memory=self.memory
)
except Exception as e:
logging.error(f"Error initializing ChromaMemory: {e}")
raise ValueError(f"Error initializing ChromaMemory: {e}") from e
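
For context, a hypothetical way this class might be wired up with the LangChain APIs of that period (the import path, collection name, model name, and directories below are illustrative assumptions, not values from the repository):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

from chroma_memory import ChromaMemory  # the class defined above; module path assumed

# Build (or load) a Chroma vectorstore to hand to ChromaMemory.
vectorstore = Chroma(
    collection_name="conversation_docs",    # assumed collection name
    embedding_function=OpenAIEmbeddings(),  # requires OPENAI_API_KEY in the environment
    persist_directory="./chroma_db",        # assumed on-disk location
)

chroma_memory = ChromaMemory(
    model_name="gpt-3.5-turbo",    # assumed chat model
    cache_dir="./embedding_cache",
    max_history_len=5,
    vectorstore=vectorstore,
)
```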