Commit
- Rearranged directory structure
- Updated necessary documentation
- Enhanced programming logic
- Closes issues #28, #30, #31.
Showing 38 changed files with 717 additions and 473 deletions.
@@ -0,0 +1,26 @@
# Checkpoint to use for the model
CHECKPOINT=distilbert-base-uncased

# Number of epochs to train the model
NUM_EPOCHS=3

# Learning rate for the optimizer
LR=3e-5

# Path to the data directory
DATA_PATH=data_path

# Tokenizer to use for the model
TOKENIZER=distilbert-base-uncased

# Train, evaluation, validation, and test split ratios
TRAIN_RATIO=0.8
EVAL_RATIO=0.1
VAL_RATIO=0.05
TEST_RATIO=0.05

# Seed for reproducibility
SEED=42

# Batch size for training and evaluation
BATCH_SIZE=16
@@ -0,0 +1,49 @@
# Getting Started with Sequence Classification

Welcome to the Sequence Classification example! This guide will help you get started with training a sequence classification model using the Hugging Face Transformers library.

## Installation

To install the required packages, use pip:

`pip install torch transformers accelerate tqdm python-dotenv`

## Usage

To run the Sequence Classification example, execute the `sequence_classification.py` script:

`python sequence_classification.py`

This trains a sequence classification model on a dataset and evaluates its performance on the validation and test sets.

## Configuration

The behavior of the Sequence Classification example is configured through environment variables. The available variables and their default values are:

- `CHECKPOINT`: The path or identifier of the pre-trained checkpoint to use. Default: `distilbert-base-uncased`.
- `NUM_EPOCHS`: The number of epochs to train for. Default: `3`.
- `LR`: The learning rate for the optimizer. Default: `3e-5`.
- `DATA_PATH`: The path to the dataset. Required; no default.
- `TOKENIZER`: The path or identifier of the tokenizer to use. Default: `distilbert-base-uncased`.
- `TRAIN_RATIO`: The fraction of examples to use for training. Default: `0.8`.
- `EVAL_RATIO`: The fraction of examples to use for evaluation. Default: `0.1`.
- `VAL_RATIO`: The fraction of examples to use for validation. Default: `0.05`.
- `TEST_RATIO`: The fraction of examples to use for testing. Default: `0.05`.
- `SEED`: The random seed used when shuffling the dataset. Default: `42`.
- `BATCH_SIZE`: The batch size for training, evaluation, and validation. Default: `16`.

You can set these variables in a `.env` file placed in the same directory as the `sequence_classification.py` script. Here's an example `.env` file:

```
DATA_PATH=data.csv
TRAIN_RATIO=0.7
EVAL_RATIO=0.15
VAL_RATIO=0.05
TEST_RATIO=0.1
```
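For reference, the script consumes these variables via `python-dotenv` and `os.getenv`, falling back to the defaults listed above; this is a condensed sketch of the pattern used in the training script:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the working directory

checkpoint = os.getenv("CHECKPOINT", "distilbert-base-uncased")
num_epochs = int(os.getenv("NUM_EPOCHS", 3))
lr = float(os.getenv("LR", 3e-5))
data_path = os.getenv("DATA_PATH")  # required: no default is provided
batch_size = int(os.getenv("BATCH_SIZE", 16))
```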
---

# GPT Description

This Python script defines a Trainer class for fine-tuning a pre-trained sequence classification model with the Hugging Face Transformers library. The Trainer class provides methods for preparing the model, optimizer, and learning rate scheduler, training the model, and evaluating its performance. The script also defines a split_dataset function that splits a dataset into training, evaluation, validation, and test subsets.

An example usage section demonstrates how to use the Trainer class and split_dataset function with a custom dataset: it loads a pre-trained model, prepares the data loaders, fine-tunes the model, and evaluates it on the validation and test sets.

The script also includes a unit test class, TestFineTuneSequenceClassificationModel, that exercises the split_dataset function and the prepare, train, and evaluate steps of the Trainer class. The tests can be run with a framework such as unittest to verify that the Trainer implementation behaves as expected.

To improve readability, it may help to add comments explaining the purpose of each method and variable. Breaking the Trainer class into smaller, more focused classes or functions would improve modularity (one possibility is sketched below), and additional error handling and input validation would make the code more robust against unexpected errors.
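As a concrete illustration of that last suggestion, the accuracy loop that appears three times in the script (validation, test, and the `test_evaluate` unit test) could be factored into a single helper. A minimal sketch under the same assumptions the script already makes, namely that batches expose a `labels` key and the model returns logits (the `evaluate_accuracy` name is ours, not from the commit):

```python
import torch

def evaluate_accuracy(model, dataloader, device):
    """Compute classification accuracy of `model` over `dataloader`."""
    model.eval()
    total_correct, total_samples = 0, 0
    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            predictions = torch.argmax(outputs.logits, dim=1)
            labels = batch["labels"]
            total_correct += (predictions == labels).sum().item()
            total_samples += len(labels)
    return total_correct / total_samples
```

With such a helper, each evaluation block in the script reduces to one call, e.g. `print(f"Validation accuracy: {evaluate_accuracy(trainer.model, val_dataloader, trainer.device):.4f}")`.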
HuggingFace/Accelerate/fine_tune_sequence_classification_model.py
246 changes: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
import os
import random
import unittest

import torch
from accelerate import Accelerator
from dotenv import load_dotenv
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation
from torch.utils.data import DataLoader, Subset
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_scheduler

from my_dataset import MyDataset  # project-local dataset class, needed by the example and the unit tests

load_dotenv()
class Trainer:
    """
    A class for training a sequence classification model using the Hugging Face Transformers library.

    Args:
        checkpoint (str): The path or identifier of the pre-trained checkpoint to use.
        train_dataloader (DataLoader): The data loader for the training set.
        eval_dataloader (DataLoader): The data loader for the evaluation set.
        val_dataloader (DataLoader): The data loader for the validation set.
        test_dataloader (DataLoader): The data loader for the test set.
        num_epochs (int, optional): The number of epochs to train for. Defaults to 3.
        lr (float, optional): The learning rate to use for the optimizer. Defaults to 3e-5.

    Arguments left as None fall back to the corresponding environment variables.
    """
    def __init__(self, checkpoint=None, train_dataloader=None, eval_dataloader=None, val_dataloader=None, test_dataloader=None, num_epochs=None, lr=None):
        self.checkpoint = checkpoint or os.getenv("CHECKPOINT", "distilbert-base-uncased")
        self.train_dataloader = train_dataloader
        self.eval_dataloader = eval_dataloader
        self.val_dataloader = val_dataloader
        self.test_dataloader = test_dataloader
        self.num_epochs = num_epochs or int(os.getenv("NUM_EPOCHS", 3))
        self.lr = lr or float(os.getenv("LR", 3e-5))
        self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        self.accelerator = Accelerator()
        self.model = None
        self.optimizer = None
        self.lr_scheduler = None
        self.progress_bar = None
    def prepare(self):
        """
        Initializes the model, optimizer, and learning rate scheduler.
        """
        if self.train_dataloader is None or self.eval_dataloader is None or self.val_dataloader is None or self.test_dataloader is None:
            raise ValueError("Data loaders not defined. Cannot prepare trainer.")
        self.model = AutoModelForSequenceClassification.from_pretrained(self.checkpoint, num_labels=2)
        self.optimizer = AdamW(self.model.parameters(), lr=self.lr)
        self.model.to(self.device)
        self.train_dataloader, self.eval_dataloader, self.val_dataloader, self.test_dataloader, self.model, self.optimizer = self.accelerator.prepare(
            self.train_dataloader, self.eval_dataloader, self.val_dataloader, self.test_dataloader, self.model, self.optimizer
        )
        num_training_steps = self.num_epochs * len(self.train_dataloader)
        self.lr_scheduler = get_scheduler(
            "linear",
            optimizer=self.optimizer,
            num_warmup_steps=0,
            num_training_steps=num_training_steps
        )
        self.progress_bar = tqdm(range(num_training_steps))
    def train(self):
        """
        Trains the model for the specified number of epochs.

        Raises:
            ValueError: If the model, optimizer, learning rate scheduler, or progress bar is not initialized.
        """
        if self.model is None or self.optimizer is None or self.lr_scheduler is None or self.progress_bar is None:
            raise ValueError("Trainer not prepared. Call prepare() method first.")
        self.model.train()
        for epoch in range(self.num_epochs):
            for batch in self.train_dataloader:
                # accelerator.prepare() already placed the batch on the right device,
                # and accelerator.backward() replaces a plain loss.backward() call.
                outputs = self.model(**batch)
                loss = outputs.loss
                self.accelerator.backward(loss)

                self.optimizer.step()
                self.lr_scheduler.step()
                self.optimizer.zero_grad()
                self.progress_bar.update(1)
def split_dataset(dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42):
    """
    Splits a dataset into training, evaluation, validation, and test subsets.

    Args:
        dataset (Dataset): The dataset to split.
        train_ratio (float, optional): The ratio of examples to use for training. Defaults to 0.8.
        eval_ratio (float, optional): The ratio of examples to use for evaluation. Defaults to 0.1.
        val_ratio (float, optional): The ratio of examples to use for validation. Defaults to 0.05.
        test_ratio (float, optional): The ratio of examples to use for testing. Defaults to 0.05.
        seed (int, optional): The random seed to use for shuffling the dataset. Defaults to 42.

    Returns:
        Tuple[Subset]: A tuple of four subsets for training, evaluation, validation, and test.
    """
    num_examples = len(dataset)
    indices = list(range(num_examples))
    random.seed(seed)
    random.shuffle(indices)
    train_size = int(train_ratio * num_examples)
    eval_size = int(eval_ratio * num_examples)
    val_size = int(val_ratio * num_examples)
    test_size = int(test_ratio * num_examples)
    train_indices = indices[:train_size]
    eval_indices = indices[train_size:train_size + eval_size]
    val_indices = indices[train_size + eval_size:train_size + eval_size + val_size]
    test_indices = indices[train_size + eval_size + val_size:train_size + eval_size + val_size + test_size]
    train_subset = Subset(dataset, train_indices)
    eval_subset = Subset(dataset, eval_indices)
    val_subset = Subset(dataset, val_indices)
    test_subset = Subset(dataset, test_indices)
    return train_subset, eval_subset, val_subset, test_subset
# Example usage
if __name__ == "__main__":
    # Load dataset
    data_path = os.getenv("DATA_PATH")
    tokenizer = AutoTokenizer.from_pretrained(os.getenv("TOKENIZER", "distilbert-base-uncased"))
    dataset = MyDataset(data_path, tokenizer)

    # Split dataset
    train_ratio = float(os.getenv("TRAIN_RATIO", 0.8))
    eval_ratio = float(os.getenv("EVAL_RATIO", 0.1))
    val_ratio = float(os.getenv("VAL_RATIO", 0.05))
    test_ratio = float(os.getenv("TEST_RATIO", 0.05))
    seed = int(os.getenv("SEED", 42))
    train_subset, eval_subset, val_subset, test_subset = split_dataset(dataset, train_ratio, eval_ratio, val_ratio, test_ratio, seed)

    # Create data loaders
    batch_size = int(os.getenv("BATCH_SIZE", 16))
    train_dataloader = DataLoader(train_subset, batch_size=batch_size, shuffle=True)
    eval_dataloader = DataLoader(eval_subset, batch_size=batch_size, shuffle=False)
    val_dataloader = DataLoader(val_subset, batch_size=batch_size, shuffle=False)
    test_dataloader = DataLoader(test_subset, batch_size=batch_size, shuffle=False)

    # Create trainer
    trainer = Trainer(train_dataloader=train_dataloader, eval_dataloader=eval_dataloader, val_dataloader=val_dataloader, test_dataloader=test_dataloader)

    # Prepare trainer
    trainer.prepare()

    # Train model
    trainer.train()
    # Evaluate model on validation set
    trainer.model.eval()
    with torch.no_grad():
        total_correct = 0
        total_samples = 0
        for batch in val_dataloader:
            batch = {k: v.to(trainer.device) for k, v in batch.items()}
            outputs = trainer.model(**batch)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=1)
            labels = batch["labels"]
            total_correct += (predictions == labels).sum().item()
            total_samples += len(labels)
        accuracy = total_correct / total_samples
        print(f"Validation accuracy: {accuracy:.4f}")

    # Evaluate model on test set
    trainer.model.eval()
    with torch.no_grad():
        total_correct = 0
        total_samples = 0
        for batch in test_dataloader:
            batch = {k: v.to(trainer.device) for k, v in batch.items()}
            outputs = trainer.model(**batch)
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=1)
            labels = batch["labels"]
            total_correct += (predictions == labels).sum().item()
            total_samples += len(labels)
        accuracy = total_correct / total_samples
        print(f"Test accuracy: {accuracy:.4f}")
class TestFineTuneSequenceClassificationModel(unittest.TestCase):
    def setUp(self):
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
        self.dataset = MyDataset("data_path", self.tokenizer)
        self.train_subset, self.eval_subset, self.val_subset, self.test_subset = split_dataset(self.dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42)
        self.batch_size = 16
        self.train_dataloader = DataLoader(self.train_subset, batch_size=self.batch_size, shuffle=True)
        self.eval_dataloader = DataLoader(self.eval_subset, batch_size=self.batch_size, shuffle=False)
        self.val_dataloader = DataLoader(self.val_subset, batch_size=self.batch_size, shuffle=False)
        self.test_dataloader = DataLoader(self.test_subset, batch_size=self.batch_size, shuffle=False)
        self.trainer = Trainer(train_dataloader=self.train_dataloader, eval_dataloader=self.eval_dataloader, val_dataloader=self.val_dataloader, test_dataloader=self.test_dataloader)
    def test_split_dataset(self):
        train_subset, eval_subset, val_subset, test_subset = split_dataset(self.dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42)
        # These counts assume the test dataset contains exactly 100 examples.
        self.assertEqual(len(train_subset), 80)
        self.assertEqual(len(eval_subset), 10)
        self.assertEqual(len(val_subset), 5)
        self.assertEqual(len(test_subset), 5)

    def test_prepare(self):
        self.trainer.prepare()
        self.assertIsNotNone(self.trainer.model)
        self.assertIsNotNone(self.trainer.optimizer)
        self.assertIsNotNone(self.trainer.lr_scheduler)
        self.assertIsNotNone(self.trainer.progress_bar)

    def test_train(self):
        self.trainer.prepare()
        self.trainer.train()
        self.assertIsNotNone(self.trainer.model)

    def test_evaluate(self):
        self.trainer.prepare()
        self.trainer.train()
        self.trainer.model.eval()
        with torch.no_grad():
            total_correct = 0
            total_samples = 0
            for batch in self.val_dataloader:
                batch = {k: v.to(self.trainer.device) for k, v in batch.items()}
                outputs = self.trainer.model(**batch)
                logits = outputs.logits
                predictions = torch.argmax(logits, dim=1)
                labels = batch["labels"]
                total_correct += (predictions == labels).sum().item()
                total_samples += len(labels)
            accuracy = total_correct / total_samples
            self.assertGreaterEqual(accuracy, 0.0)
            self.assertLessEqual(accuracy, 1.0)
# Note: the example usage above already runs under `__main__`; a second
# `unittest.main()` call in the same script would re-run training first.
# Run the tests with: python -m unittest fine_tune_sequence_classification_model
@@ -0,0 +1,45 @@
import logging
from langchain.chains import RetrievalQA
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chat_models import ChatOpenAI
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain.vectorstores import Chroma

logging.basicConfig(level=logging.ERROR)


class ChromaMemory:
    def __init__(self, model_name: str, cache_dir: str, max_history_len: int, vectorstore: Chroma):
        """
        Initialize the ChromaMemory with a model name, cache directory, maximum history length, and a vectorstore.

        Args:
            model_name (str): The name of the LLM model to use.
            cache_dir (str): The path to the directory in which to cache embeddings.
            max_history_len (int): The maximum number of conversation turns to remember.
            vectorstore (Chroma): The vectorstore to use for similarity matching.

        Example:
            chroma_memory = ChromaMemory(model_name, cache_dir, max_history_len, vectorstore)
        """
        # NOTE: import paths and constructor signatures follow the legacy
        # (0.0.x-era) langchain layout and may differ in newer releases.
        try:
            # Cache embeddings on disk so repeated texts are not re-embedded.
            self.embeddings = CacheBackedEmbeddings.from_bytes_store(
                OpenAIEmbeddings(),
                LocalFileStore(cache_dir),
            )
            # Filter out near-duplicate documents based on embedding similarity.
            self.filter = EmbeddingsRedundantFilter(embeddings=self.embeddings)
            self.chat_model = ChatOpenAI(model_name=model_name)
            # Keep only the last `max_history_len` exchanges of conversation history.
            self.memory = ConversationBufferWindowMemory(k=max_history_len)
            self.retrieval = RetrievalQA.from_chain_type(
                llm=self.chat_model,
                retriever=vectorstore.as_retriever(),
            )
        except Exception as e:
            logging.error(f"Error initializing ChromaMemory: {e}")
            raise ValueError(f"Error initializing ChromaMemory: {e}") from e
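For context, a minimal usage sketch of this class (hypothetical: it assumes `OPENAI_API_KEY` is set in the environment and that a Chroma store has been built elsewhere; `docs_store`, the paths, and the model name are illustrative, not part of the commit):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Build or load a Chroma vectorstore to search against (illustrative names).
docs_store = Chroma(persist_directory="chroma_db", embedding_function=OpenAIEmbeddings())

chroma_memory = ChromaMemory(
    model_name="gpt-3.5-turbo",   # any chat model accepted by ChatOpenAI
    cache_dir="embedding_cache",  # where CacheBackedEmbeddings stores vectors
    max_history_len=5,            # remember the last five exchanges
    vectorstore=docs_store,
)

# Ask a question against the indexed documents.
answer = chroma_memory.retrieval.run("What does the indexed corpus say about X?")
print(answer)
```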