diff --git a/.github/bar_graph.jpg b/.archive/graphics/bar_graph.jpg similarity index 100% rename from .github/bar_graph.jpg rename to .archive/graphics/bar_graph.jpg diff --git a/.github/mindmap_2023-10-07.jpg b/.archive/graphics/mindmap_2023-10-07.jpg similarity index 100% rename from .github/mindmap_2023-10-07.jpg rename to .archive/graphics/mindmap_2023-10-07.jpg diff --git a/.github/pie_chart.jpg b/.archive/graphics/pie_chart.jpg similarity index 100% rename from .github/pie_chart.jpg rename to .archive/graphics/pie_chart.jpg diff --git a/.github/plugin_icons.jpg b/.archive/graphics/plugin_icons.jpg similarity index 100% rename from .github/plugin_icons.jpg rename to .archive/graphics/plugin_icons.jpg diff --git a/.github/2023-10-18_Mindmap.jpg b/.github/2023-10-18_Mindmap.jpg new file mode 100644 index 0000000..77b1ede Binary files /dev/null and b/.github/2023-10-18_Mindmap.jpg differ diff --git a/HuggingFace/Accelerate/.env.template b/HuggingFace/Accelerate/.env.template new file mode 100644 index 0000000..e13396e --- /dev/null +++ b/HuggingFace/Accelerate/.env.template @@ -0,0 +1,26 @@ +# Checkpoint to use for the model +CHECKPOINT=distilbert-base-uncased + +# Number of epochs to train the model +NUM_EPOCHS=3 + +# Learning rate for the optimizer +LR=3e-5 + +# Path to the data directory +DATA_PATH=data_path + +# Tokenizer to use for the model +TOKENIZER=distilbert-base-uncased + +# Train, evaluation, validation, and test split ratios +TRAIN_RATIO=0.8 +EVAL_RATIO=0.1 +VAL_RATIO=0.05 +TEST_RATIO=0.05 + +# Seed for reproducibility +SEED=42 + +# Batch size for training and evaluation +BATCH_SIZE=16 \ No newline at end of file diff --git a/HuggingFace/Accelerate/README.md b/HuggingFace/Accelerate/README.md new file mode 100644 index 0000000..ae8239c --- /dev/null +++ b/HuggingFace/Accelerate/README.md @@ -0,0 +1,49 @@ +# Getting Started with Sequence Classification + +Welcome to the Sequence Classification example! This guide will help you get started with training a sequence classification model using the Hugging Face Transformers library. + +## Installation + +To install the required packages, you can use pip: + +`pip install torch transformers accelerate tqdm python-dotenv` + +## Usage + +To use the Sequence Classification example, you can run the `fine_tune_sequence_classification_model.py` script: + +`python fine_tune_sequence_classification_model.py` + +This will train a sequence classification model on a dataset and evaluate its performance on the validation and test sets. + +## Configuration + +The behavior of the Sequence Classification example can be configured using environment variables. Here are the available environment variables and their default values: + +- `CHECKPOINT`: The path or identifier of the pre-trained checkpoint to use. Default is `distilbert-base-uncased`. +- `NUM_EPOCHS`: The number of epochs to train for. Default is `3`. +- `LR`: The learning rate to use for the optimizer. Default is `3e-5`. +- `DATA_PATH`: The path to the dataset. This is a required environment variable. +- `TOKENIZER`: The path or identifier of the tokenizer to use. Default is `distilbert-base-uncased`. +- `TRAIN_RATIO`: The ratio of examples to use for training. Default is `0.8`. +- `EVAL_RATIO`: The ratio of examples to use for evaluation. Default is `0.1`. +- `VAL_RATIO`: The ratio of examples to use for validation. Default is `0.05`. +- `TEST_RATIO`: The ratio of examples to use for testing. Default is `0.05`. +- `SEED`: The random seed to use for shuffling the dataset. Default is `42`.
+- `BATCH_SIZE`: The batch size to use for training, evaluation, and validation. Default is `16`. + +You can set these environment variables using a `.env` file in the same directory as the `fine_tune_sequence_classification_model.py` script. Here's an example `.env` file: + +```DATA_PATH=data.csv TRAIN_RATIO=0.7 EVAL_RATIO=0.15 VAL_RATIO=0.05 TEST_RATIO=0.1``` + +--- + +# GPT Description + +This Python script defines a Trainer class for fine-tuning a pre-trained sequence classification model with the Hugging Face Transformers library. The Trainer class provides methods for preparing the dataset, training the model, and evaluating its performance, and the script also defines a split_dataset function for splitting a dataset into training, evaluation, validation, and test subsets. + +An example usage section demonstrates how to use the Trainer class and split_dataset function with a custom dataset: it loads a pre-trained model, prepares the data loaders, fine-tunes the model, and evaluates it on the validation and test sets. + +Finally, the script includes a unit test class, TestFineTuneSequenceClassificationModel, that exercises the split_dataset, prepare, train, and evaluate steps of the Trainer class. These tests can be run with a framework such as unittest to verify that the implementation behaves as expected. + +To improve readability, it may help to add comments explaining the purpose of each method and variable, to break the Trainer class into smaller, more focused classes or functions, and to add more error handling and input validation so the code is more robust. \ No newline at end of file diff --git a/HuggingFace/Accelerate/fine_tune_sequence_classification_model.py b/HuggingFace/Accelerate/fine_tune_sequence_classification_model.py new file mode 100644 index 0000000..75d2422 --- /dev/null +++ b/HuggingFace/Accelerate/fine_tune_sequence_classification_model.py @@ -0,0 +1,246 @@ +import os +import random +import torch +from accelerate import Accelerator +from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler, AutoTokenizer +from torch.utils.data import DataLoader, Subset +from tqdm import tqdm +from dotenv import load_dotenv +import unittest + +load_dotenv() + +class Trainer: + """ + A class for training a sequence classification model using the Hugging Face Transformers library. + + Args: + checkpoint (str): The path or identifier of the pre-trained checkpoint to use. + train_dataloader (DataLoader): The data loader for the training set. + eval_dataloader (DataLoader): The data loader for the evaluation set. + val_dataloader (DataLoader): The data loader for the validation set. + test_dataloader (DataLoader): The data loader for the test set. + num_epochs (int, optional): The number of epochs to train for. Defaults to 3. + lr (float, optional): The learning rate to use for the optimizer. Defaults to 3e-5.
+ """ + def __init__(self, checkpoint=None, train_dataloader=None, eval_dataloader=None, val_dataloader=None, test_dataloader=None, num_epochs=None, lr=None): + """ + Initializes a new instance of the Trainer class. + + Args: + checkpoint (str): The path or identifier of the pre-trained checkpoint to use. + train_dataloader (DataLoader): The data loader for the training set. + eval_dataloader (DataLoader): The data loader for the evaluation set. + val_dataloader (DataLoader): The data loader for the validation set. + test_dataloader (DataLoader): The data loader for the test set. + num_epochs (int, optional): The number of epochs to train for. Defaults to 3. + lr (float, optional): The learning rate to use for the optimizer. Defaults to 3e-5. + """ + self.checkpoint = checkpoint or os.getenv("CHECKPOINT", "distilbert-base-uncased") + self.train_dataloader = train_dataloader + self.eval_dataloader = eval_dataloader + self.val_dataloader = val_dataloader + self.test_dataloader = test_dataloader + self.num_epochs = num_epochs or int(os.getenv("NUM_EPOCHS", 3)) + self.lr = lr or float(os.getenv("LR", 3e-5)) + self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") + self.accelerator = Accelerator() + self.model = None + self.optimizer = None + self.lr_scheduler = None + self.progress_bar = None + + def prepare(self): + """ + Initializes the model, optimizer, and learning rate scheduler. + """ + if self.train_dataloader is None or self.eval_dataloader is None or self.val_dataloader is None or self.test_dataloader is None: + raise ValueError("Data loaders not defined. Cannot prepare trainer.") + self.model = AutoModelForSequenceClassification.from_pretrained(self.checkpoint, num_labels=2) + self.optimizer = AdamW(self.model.parameters(), lr=self.lr) + self.model.to(self.device) + self.train_dataloader, self.eval_dataloader, self.val_dataloader, self.test_dataloader, self.model, self.optimizer = self.accelerator.prepare( + self.train_dataloader, self.eval_dataloader, self.val_dataloader, self.test_dataloader, self.model, self.optimizer + ) + num_training_steps = self.num_epochs * len(self.train_dataloader) + self.lr_scheduler = get_scheduler( + "linear", + optimizer=self.optimizer, + num_warmup_steps=0, + num_training_steps=num_training_steps + ) + self.progress_bar = tqdm(range(num_training_steps)) + + def train(self): + """ + Trains the model for the specified number of epochs. + + Raises: + ValueError: If the model, optimizer, learning rate scheduler, or progress bar is not initialized. + """ + if self.model is None or self.optimizer is None or self.lr_scheduler is None or self.progress_bar is None: + raise ValueError("Trainer not prepared. Call prepare() method first.") + self.model.train() + for epoch in range(self.num_epochs): + for batch in self.train_dataloader: + batch = {k: v.to(self.device) for k, v in batch.items()} + outputs = self.model(**batch) + loss = outputs.loss + loss.backward() + self.accelerator.backward(loss) + + self.optimizer.step() + self.lr_scheduler.step() + self.optimizer.zero_grad() + self.progress_bar.update(1) + +def split_dataset(dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42): + """ + Splits a dataset into training, evaluation, validation, and test subsets. + + Args: + dataset (Dataset): The dataset to split. + train_ratio (float, optional): The ratio of examples to use for training. Defaults to 0.8. + eval_ratio (float, optional): The ratio of examples to use for evaluation. Defaults to 0.1. 
+ val_ratio (float, optional): The ratio of examples to use for validation. Defaults to 0.05. + test_ratio (float, optional): The ratio of examples to use for testing. Defaults to 0.05. + seed (int, optional): The random seed to use for shuffling the dataset. Defaults to 42. + + Returns: + Tuple[Subset]: A tuple of four subsets for training, evaluation, validation, and test. + """ + num_examples = len(dataset) + indices = list(range(num_examples)) + random.seed(seed) + random.shuffle(indices) + train_size = int(train_ratio * num_examples) + eval_size = int(eval_ratio * num_examples) + val_size = int(val_ratio * num_examples) + test_size = int(test_ratio * num_examples) + train_indices = indices[:train_size] + eval_indices = indices[train_size:train_size+eval_size] + val_indices = indices[train_size+eval_size:train_size+eval_size+val_size] + test_indices = indices[train_size+eval_size+val_size:train_size+eval_size+val_size+test_size] + train_subset = Subset(dataset, train_indices) + eval_subset = Subset(dataset, eval_indices) + val_subset = Subset(dataset, val_indices) + test_subset = Subset(dataset, test_indices) + return train_subset, eval_subset, val_subset, test_subset + +# Example usage +if __name__ == "__main__": + from my_dataset import MyDataset + + # Load dataset + data_path = os.getenv("DATA_PATH") + tokenizer = AutoTokenizer.from_pretrained(os.getenv("TOKENIZER", "distilbert-base-uncased")) + dataset = MyDataset(data_path, tokenizer) + + # Split dataset + train_ratio = float(os.getenv("TRAIN_RATIO", 0.8)) + eval_ratio = float(os.getenv("EVAL_RATIO", 0.1)) + val_ratio = float(os.getenv("VAL_RATIO", 0.05)) + test_ratio = float(os.getenv("TEST_RATIO", 0.05)) + seed = int(os.getenv("SEED", 42)) + train_subset, eval_subset, val_subset, test_subset = split_dataset(dataset, train_ratio, eval_ratio, val_ratio, test_ratio, seed) + + # Create data loaders + batch_size = int(os.getenv("BATCH_SIZE", 16)) + train_dataloader = DataLoader(train_subset, batch_size=batch_size, shuffle=True) + eval_dataloader = DataLoader(eval_subset, batch_size=batch_size, shuffle=False) + val_dataloader = DataLoader(val_subset, batch_size=batch_size, shuffle=False) + test_dataloader = DataLoader(test_subset, batch_size=batch_size, shuffle=False) + + # Create trainer + trainer = Trainer(train_dataloader=train_dataloader, eval_dataloader=eval_dataloader, val_dataloader=val_dataloader, test_dataloader=test_dataloader) + + # Prepare trainer + trainer.prepare() + + # Train model + trainer.train() + + # Evaluate model on validation set + trainer.model.eval() + with torch.no_grad(): + total_correct = 0 + total_samples = 0 + for batch in val_dataloader: + batch = {k: v.to(trainer.device) for k, v in batch.items()} + outputs = trainer.model(**batch) + logits = outputs.logits + predictions = torch.argmax(logits, dim=1) + labels = batch["labels"] + total_correct += (predictions == labels).sum().item() + total_samples += len(labels) + accuracy = total_correct / total_samples + print(f"Validation accuracy: {accuracy:.4f}") + + # Evaluate model on test set + trainer.model.eval() + with torch.no_grad(): + total_correct = 0 + total_samples = 0 + for batch in test_dataloader: + batch = {k: v.to(trainer.device) for k, v in batch.items()} + outputs = trainer.model(**batch) + logits = outputs.logits + predictions = torch.argmax(logits, dim=1) + labels = batch["labels"] + total_correct += (predictions == labels).sum().item() + total_samples += len(labels) + accuracy = total_correct / total_samples + print(f"Test accuracy: 
{accuracy:.4f}") + +class TestFineTuneSequenceClassificationModel(unittest.TestCase): + def setUp(self): + self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") + self.dataset = MyDataset("data_path", self.tokenizer) + self.train_subset, self.eval_subset, self.val_subset, self.test_subset = split_dataset(self.dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42) + self.batch_size = 16 + self.train_dataloader = DataLoader(self.train_subset, batch_size=self.batch_size, shuffle=True) + self.eval_dataloader = DataLoader(self.eval_subset, batch_size=self.batch_size, shuffle=False) + self.val_dataloader = DataLoader(self.val_subset, batch_size=self.batch_size, shuffle=False) + self.test_dataloader = DataLoader(self.test_subset, batch_size=self.batch_size, shuffle=False) + self.trainer = Trainer(train_dataloader=self.train_dataloader, eval_dataloader=self.eval_dataloader, val_dataloader=self.val_dataloader, test_dataloader=self.test_dataloader) + + def test_split_dataset(self): + train_subset, eval_subset, val_subset, test_subset = split_dataset(self.dataset, train_ratio=0.8, eval_ratio=0.1, val_ratio=0.05, test_ratio=0.05, seed=42) + self.assertEqual(len(train_subset), 80) + self.assertEqual(len(eval_subset), 10) + self.assertEqual(len(val_subset), 5) + self.assertEqual(len(test_subset), 5) + + def test_prepare(self): + self.trainer.prepare() + self.assertIsNotNone(self.trainer.model) + self.assertIsNotNone(self.trainer.optimizer) + self.assertIsNotNone(self.trainer.lr_scheduler) + self.assertIsNotNone(self.trainer.progress_bar) + + def test_train(self): + self.trainer.prepare() + self.trainer.train() + self.assertIsNotNone(self.trainer.model) + + def test_evaluate(self): + self.trainer.prepare() + self.trainer.train() + self.trainer.model.eval() + with torch.no_grad(): + total_correct = 0 + total_samples = 0 + for batch in self.val_dataloader: + batch = {k: v.to(self.trainer.device) for k, v in batch.items()} + outputs = self.trainer.model(**batch) + logits = outputs.logits + predictions = torch.argmax(logits, dim=1) + labels = batch["labels"] + total_correct += (predictions == labels).sum().item() + total_samples += len(labels) + accuracy = total_correct / total_samples + self.assertGreaterEqual(accuracy, 0.0) + self.assertLessEqual(accuracy, 1.0) + +if __name__ == '__main__': + unittest.main() \ No newline at end of file diff --git a/LangChain/Retrieval-Agents/__init__.py b/LangChain/Chatbots/__init__.py similarity index 100% rename from LangChain/Retrieval-Agents/__init__.py rename to LangChain/Chatbots/__init__.py diff --git a/LangChain/Chatbots/chroma_memory.py b/LangChain/Chatbots/chroma_memory.py new file mode 100644 index 0000000..f459c24 --- /dev/null +++ b/LangChain/Chatbots/chroma_memory.py @@ -0,0 +1,45 @@ +import logging +from typing import List, Any, Dict +from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings, HuggingFaceEmbeddings +from langchain.filters import EmbeddingsRedundantFilter +from langchain.chat_models import ChatOpenAI +from langchain.chains.conversation.memory import ConversationBufferWindowMemory +from langchain.chains import RetrievalQA +import chromadb +from langchain.vectorstores import Chroma + +logging.basicConfig(level=logging.ERROR) + +class ChromaMemory: + def __init__(self, model_name: str, cache_dir: str, max_history_len: int, vectorstore: Chroma): + """ + Initialize the ChromaMemory with a model name, cache directory, maximum history length, and a vectorstore. 
+ Args: + model_name (str): The name of the LLM model to use. + cache_dir (str): The path to the directory to cache embeddings. + max_history_len (int): The maximum length of the conversation history to remember. + vectorstore (Chroma): The vectorstore to use for similarity matching. + """ + try: + self.embeddings = CacheBackedEmbeddings( + OpenAIEmbeddings(model_name), + cache_dir + ) + self.filter = EmbeddingsRedundantFilter() + self.chat_model = ChatOpenAI( + self.embeddings, + self.filter + ) + self.memory = ConversationBufferWindowMemory( + max_history_len, + self.chat_model + ) + self.retrieval = RetrievalQA( + self.memory, + vectorstore + ) + except Exception as e: + logging.error(f"Error initializing ChromaMemory: {e}") + raise ValueError(f"Error initializing ChromaMemory: {e}") from e \ No newline at end of file diff --git a/LangChain/Chatbots/how-to_chroma-memory.md b/LangChain/Chatbots/how-to_chroma-memory.md new file mode 100644 index 0000000..a866ee1 --- /dev/null +++ b/LangChain/Chatbots/how-to_chroma-memory.md @@ -0,0 +1,38 @@ +# A basic guide on how to use the ChromaMemory component to store chat history and retrieve answers to questions from the conversation history. + +### 1. Import the ChromaMemory class from the chroma_memory module: + +`from chroma_memory import ChromaMemory` + +### 2. Create an instance of the ChromaMemory class, passing in the required parameters: + +``` +model_name = "text-embedding-ada-002" +cache_dir = "/opt/llm/vectorstore/chroma" +max_history_len = 100 +vectorstore = Chroma("/opt/llm/vectorstore/chroma") +chroma_memory = ChromaMemory(model_name, cache_dir, max_history_len, vectorstore) +``` + +The model_name parameter specifies the name of the LLM model to use, the cache_dir parameter specifies the path to the directory to cache embeddings, the max_history_len parameter specifies the maximum length of the conversation history to remember, and the vectorstore parameter specifies the vectorstore to use for similarity matching. + +### 3. To store a new chat message in the conversation history, call the add_message method of the ConversationBufferWindowMemory object: + +``` +message = "Hello, how are you?" +chroma_memory.memory.add_message(message) +``` + +This will add the message to the conversation history. + +### 4. To retrieve an answer to a question from the conversation history, call the retrieve method of the RetrievalQA object: + +``` +question = "What's your favorite color?" +answer = chroma_memory.retrieval.retrieve(question) +print(answer) +``` + +This retrieves the answer associated with the stored question that is most similar to the input question. + +That's it! For more information, please see the official LangChain documentation.
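For convenience, here is the full sequence from the steps above in one place. This is a minimal sketch that assumes the `ChromaMemory` class from `chroma_memory.py` and the `add_message`/`retrieve` calls described in this guide; the model name, cache directory, and vectorstore path are placeholders to adapt to your environment.

```python
from langchain.vectorstores import Chroma
from chroma_memory import ChromaMemory

# Placeholder configuration values (see step 2 above)
model_name = "text-embedding-ada-002"
cache_dir = "/opt/llm/vectorstore/chroma"
max_history_len = 100  # defined before constructing ChromaMemory
vectorstore = Chroma("/opt/llm/vectorstore/chroma")

chroma_memory = ChromaMemory(model_name, cache_dir, max_history_len, vectorstore)

# Store a message, then ask a question against the conversation history
chroma_memory.memory.add_message("Hello, how are you?")
print(chroma_memory.retrieval.retrieve("What's your favorite color?"))
```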
\ No newline at end of file diff --git a/LangChain/Retrieval-Agents/qa_local_docs.py b/LangChain/Retrieval-Agents/qa_local_docs.py deleted file mode 100644 index 33729f9..0000000 --- a/LangChain/Retrieval-Agents/qa_local_docs.py +++ /dev/null @@ -1,173 +0,0 @@ -import os -import glob -from typing import Generator, List, Tuple -from dotenv import load_dotenv -from retrying import retry -from langchain.document_loaders import PyPDFLoader -from langchain.text_splitter import RecursiveCharacterTextSplitter -from langchain.embeddings.openai import OpenAIEmbeddings -from langchain.llms import OpenAI as OpenAILLM -from langchain.chains.question_answering import load_qa_chain -from langchain.vectorstores import cosine_similarity - -# Define the retrying decorator for specific functions -def retry_if_value_error(exception: Exception) -> bool: - """Return True if we should retry (in this case when it's a ValueError), False otherwise""" - return isinstance(exception, ValueError) - -def retry_if_file_not_found_error(exception: Exception) -> bool: - """Return True if we should retry (in this case when it's a FileNotFoundError), False otherwise""" - return isinstance(exception, FileNotFoundError) - -class PDFProcessor: - """ - A class to handle PDF document processing, similarity search, and question answering. - - Attributes - ---------- - OPENAI_API_KEY : str - OpenAI API Key for authentication. - embeddings : OpenAIEmbeddings - Object for OpenAI embeddings. - llm : OpenAILLM - Language model for generating embeddings. - - Methods - ------- - get_user_query(prompt: str = "Please enter your query: ") -> str: - Get query from the user. - load_pdfs_from_directory(directory_path: str = 'data/') -> List[List[str]]: - Load PDFs from a specified directory. - _load_and_split_document(file_path: str, chunk_size: int = 2000, chunk_overlap: int = 0) -> List[str]: - Load and split a single document. - perform_similarity_search(documents: List[List[str]], query: str, num_results: int = 10) -> List[Tuple[float, str]]: - Perform similarity search on documents. - """ - - def __init__(self): - """Initialize PDFProcessor with environment variables and reusable objects.""" - self._load_env_vars() - self._initialize_reusable_objects() - - @retry(retry_on_exception=retry_if_value_error, stop_max_attempt_number=3) - def _load_env_vars(self): - """Load environment variables.""" - try: - load_dotenv() - self.OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', 'sk-') - if not self.OPENAI_API_KEY: - raise ValueError("OPENAI_API_KEY is missing. Please set the environment variable.") - except ValueError as ve: - print(f"ValueError encountered: {ve}") - raise - - def _initialize_reusable_objects(self): - """Initialize reusable objects like embeddings and language models.""" - self.embeddings = OpenAIEmbeddings(openai_api_key=self.OPENAI_API_KEY) - self.llm = OpenAILLM(temperature=0, openai_api_key=self.OPENAI_API_KEY) - - @staticmethod - def get_user_query(prompt: str = "Please enter your query: ") -> str: - """ - Get user input for a query. - - Parameters: - prompt (str): The prompt to display for user input. - - Returns: - str: User's query input. - """ - return input(prompt) - - @retry(retry_on_exception=retry_if_file_not_found_error, stop_max_attempt_number=3) - def load_pdfs_from_directory(self, directory_path: str = 'data/') -> List[List[str]]: - """ - Load all PDF files from a given directory. - - Parameters: - directory_path (str): Directory path to load PDFs from. 
- - Returns: - List[List[str]]: List of text chunks from loaded PDFs. - """ - try: - if not os.path.exists(directory_path): - raise FileNotFoundError(f"The directory {directory_path} does not exist.") - - pdf_files = glob.glob(f"{directory_path}/*.pdf") - if not pdf_files: - raise FileNotFoundError(f"No PDF files found in the directory {directory_path}.") - - texts = [] - for pdf_file in pdf_files: - texts.extend(self._load_and_split_document(pdf_file)) - return texts - except FileNotFoundError as fe: - print(f"FileNotFoundError encountered: {fe}") - raise - - def _load_and_split_document(self, file_path: str, chunk_size: int = 2000, chunk_overlap: int = 0) -> List[str]: - """ - Load and split a PDF document into text chunks. - - Parameters: - file_path (str): Path to the PDF file. - chunk_size (int): Size of each text chunk. - chunk_overlap (int): Overlapping characters between chunks. - - Returns: - List[str]: List of text chunks. - """ - if not os.path.exists(file_path): - raise FileNotFoundError(f"The file {file_path} does not exist.") - loader = PyPDFLoader(file_path) - data = loader.load() - text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap) - return text_splitter.split_documents(data) - - def perform_similarity_search(self, documents: List[List[str]], query: str, num_results: int = 10) -> List[Tuple[float, str]]: - """ - Perform similarity search on documents based on a query. - - Parameters: - documents (List[List[str]]): List of documents to search. - query (str): User query for similarity search. - num_results (int): Number of results to return. - - Returns: - List[Tuple[float, str]]: List of tuples containing similarity score and document or chunk. - """ - try: - if not query: - raise ValueError("Query should not be empty.") - results = [] - for document in documents: - similarity_score = cosine_similarity(document, query) - results.append((similarity_score, document)) - results = sorted(results, key=lambda x: x[0], reverse=True)[:num_results] - return results - except Exception as e: - print(f"An error occurred: {e}") - raise - -if __name__ == "__main__": - try: - # Initialize PDFProcessor class - pdf_processor = PDFProcessor() - - # Load PDFs from directory and count the number of loaded documents - texts = pdf_processor.load_pdfs_from_directory() - num_docs = len(texts) - print(f'Loaded {num_docs} document(s).') - - # Get user query for similarity search - query = pdf_processor.get_user_query() - - # Perform similarity search based on the query - results = pdf_processor.perform_similarity_search(texts, query) - - # Print the results - for i, result in enumerate(results): - print(f"{i+1}. 
Similarity score: {result[0]}, Document: {result[1]}") - except Exception as e: - print(f"An error occurred: {e}") \ No newline at end of file diff --git a/LangChain/Retrieval-Agents/stateful_chatbot.py b/LangChain/Retrieval-Agents/stateful_chatbot.py deleted file mode 100644 index 8852499..0000000 --- a/LangChain/Retrieval-Agents/stateful_chatbot.py +++ /dev/null @@ -1,230 +0,0 @@ -import logging -from typing import List, Any, Dict -from langchain.document_loaders import PyPDFDirectoryLoader -from langchain.text_splitter import RecursiveCharacterTextSplitter -from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings, HuggingFaceEmbeddings -from langchain.filters import EmbeddingsRedundantFilter -from langchain.chat_models import ChatOpenAI -from langchain.chains.conversation.memory import ConversationBufferWindowMemory -from langchain.chains import RetrievalQA -import chromadb -from langchain.vectorstores import Chroma - -logging.basicConfig(level=logging.ERROR) - -# PDF Document Management -class PDFDocumentManager: - def __init__(self, directory: str): - """ - Initialize the PDFDocumentManager with a directory path. - Args: - directory (str): The path to the directory containing PDF files. - """ - try: - self.loader = PyPDFDirectoryLoader(directory) - except Exception as e: - logging.error(f"Error initializing PyPDFDirectoryLoader: {e}") - raise ValueError(f"Error initializing PyPDFDirectoryLoader: {e}") from e - - def load_documents(self) -> List[Any]: - """ - Load PDF documents from the specified directory. - Returns: - List[Any]: A list of loaded PDF documents. - """ - try: - return self.loader.load() - except Exception as e: - logging.error(f"Error loading documents: {e}") - raise ValueError(f"Error loading documents: {e}") from e - -# Text Splitting -class TextSplitManager: - def __init__(self, chunk_size: int, chunk_overlap: int, length_function=len, add_start_index=True): - """ - Initialize TextSplitManager with configuration for text splitting. - Args: - chunk_size (int): The maximum size for each chunk. - chunk_overlap (int): The overlap between adjacent chunks. - length_function (callable, optional): Function to compute the length of a chunk. Defaults to len. - add_start_index (bool, optional): Whether to include the start index of each chunk. Defaults to True. - """ - self.text_splitter = RecursiveCharacterTextSplitter( - chunk_size=chunk_size, - chunk_overlap=chunk_overlap, - length_function=length_function, - add_start_index=add_start_index - ) - - def create_documents(self, docs: List[Any]) -> List[Any]: - """ - Create document chunks based on the configuration. - Args: - docs (List[Any]): List of documents to be chunked. - Returns: - List[Any]: List of document chunks. - """ - try: - return self.text_splitter.create_documents(docs) - except Exception as e: - logging.error(f"Error in text splitting: {e}") - raise ValueError(f"Error in text splitting: {e}") from e - -# Embeddings and Filtering -class EmbeddingManager: - def __init__(self): - """ - Initialize EmbeddingManager for handling document embeddings. - """ - self.embedder = CacheBackedEmbeddings(OpenAIEmbeddings()) - - def embed_documents(self, docs: List[Any]) -> List[Any]: - """ - Embed the documents using the configured embedder. - Args: - docs (List[Any]): List of documents to be embedded. - Returns: - List[Any]: List of embedded documents. 
- """ - try: - return self.embedder.embed_documents(docs) - except Exception as e: - logging.error(f"Error in embedding documents: {e}") - raise ValueError(f"Error in embedding documents: {e}") from e - - def filter_redundant(self, embeddings: List[Any]) -> List[Any]: - """ - Filter redundant embeddings from the list. - Args: - embeddings (List[Any]): List of embeddings. - Returns: - List[Any]: List of non-redundant embeddings. - """ - try: - filter_instance = EmbeddingsRedundantFilter(embeddings) - return filter_instance() - except Exception as e: - logging.error(f"Error in filtering redundant embeddings: {e}") - raise ValueError(f"Error in filtering redundant embeddings: {e}") from e - -# Document Retrieval and Reordering -class DocumentRetriever: - def __init__(self, model_name: str, texts: List[str], search_kwargs: Dict[str, Any]): - """ - Initialize DocumentRetriever for document retrieval and reordering. - Args: - model_name (str): Name of the embedding model to use. - texts (List[str]): Texts for retriever training. - search_kwargs (Dict[str, Any]): Additional search parameters. - """ - self.embeddings = HuggingFaceEmbeddings(model_name=model_name) - self.retriever = Chroma.from_texts(texts, embedding=self.embeddings).as_retriever( - search_kwargs=search_kwargs - ) - - def get_relevant_documents(self, query: str) -> List[Any]: - """ - Retrieve relevant documents based on the query. - Args: - query (str): The query string. - Returns: - List[Any]: List of relevant documents. - """ - try: - return self.retriever.get_relevant_documents(query) - except Exception as e: - logging.error(f"Error retrieving relevant documents: {e}") - raise ValueError(f"Error retrieving relevant documents: {e}") from e - -# Chat and QA functionalities -class ChatQA: - def __init__(self, api_key: str, model_name: str, directory: str, chunk_size: int, chunk_overlap: int, search_k: int): - """ - Initialize ChatQA for chat and QA functionalities. - Args: - api_key (str): API key for OpenAI. - model_name (str): Name of the model for embeddings. - directory (str): The path to the directory containing PDF files. - chunk_size (int): The maximum size for each chunk. - chunk_overlap (int): The overlap between adjacent chunks. - search_k (int): Number of documents to retrieve. - """ - self.pdf_manager = PDFDocumentManager(directory) - self.text_split_manager = TextSplitManager(chunk_size, chunk_overlap) - self.embedding_manager = EmbeddingManager() - self.llm = ChatOpenAI( - openai_api_key=api_key, - model_name='gpt-3.5-turbo', - temperature=0.0 - ) - self.conversational_memory = ConversationBufferWindowMemory( - memory_key='chat_history', - k=5, - return_messages=True - ) - self.retriever = DocumentRetriever(model_name, [], {"k": search_k}) - self.qa = RetrievalQA.from_chain_type( - llm=self.llm, - chain_type="stuff", - retriever=self.retriever.retriever - ) - - def load_documents(self) -> List[Any]: - """ - Load PDF documents from the specified directory, split them into chunks, and embed them. - Returns: - List[Any]: List of embedded document chunks. 
- """ - try: - docs = self.pdf_manager.load_documents() - chunks = self.text_split_manager.create_documents(docs) - embeddings = self.embedding_manager.embed_documents(chunks) - return self.embedding_manager.filter_redundant(embeddings) - except Exception as e: - logging.error(f"Error loading and embedding documents: {e}") - raise ValueError(f"Error loading and embedding documents: {e}") from e - - def update_retriever(self, texts: List[str]): - """ - Update the retriever with new texts. - Args: - texts (List[str]): List of texts to update the retriever. - """ - try: - self.retriever = DocumentRetriever(self.retriever.embeddings.model_name, texts, self.retriever.search_kwargs) - self.qa = RetrievalQA.from_chain_type( - llm=self.llm, - chain_type="stuff", - retriever=self.retriever.retriever - ) - except Exception as e: - logging.error(f"Error updating retriever: {e}") - raise ValueError(f"Error updating retriever: {e}") from e - - def get_relevant_documents(self, query: str) -> List[Any]: - """ - Retrieve relevant documents based on the query. - Args: - query (str): The query string. - Returns: - List[Any]: List of relevant documents. - """ - try: - return self.retriever.get_relevant_documents(query) - except Exception as e: - logging.error(f"Error retrieving relevant documents: {e}") - raise ValueError(f"Error retrieving relevant documents: {e}") from e - - def ask_question(self, query: str) -> str: - """ - Ask a question based on the query. - Args: - query (str): The query string. - Returns: - str: The answer to the question. - """ - try: - return self.qa.ask_question(query) - except Exception as e: - logging.error(f"Error asking question: {e}") - raise ValueError(f"Error asking question: {e}") from e \ No newline at end of file diff --git a/LangChain/Retrieval-Augmented-Generation/.env.template b/LangChain/Retrieval-Augmented-Generation/.env.template new file mode 100644 index 0000000..b269983 --- /dev/null +++ b/LangChain/Retrieval-Augmented-Generation/.env.template @@ -0,0 +1,5 @@ +OPENAI_API_KEY= +SIMILARITY_THRESHOLD=0.7 +CHUNK_SIZE=500 +CHUNK_OVERLAP=0 +LLM_CHAIN_PROMPT_URL=https://smith.langchain.com/hub/rlm/rag-prompt \ No newline at end of file diff --git a/OpenAI/Auto-Embedder/__init__.py b/LangChain/Retrieval-Augmented-Generation/__init__.py similarity index 100% rename from OpenAI/Auto-Embedder/__init__.py rename to LangChain/Retrieval-Augmented-Generation/__init__.py diff --git a/LangChain/Retrieval-Augmented-Generation/main.py b/LangChain/Retrieval-Augmented-Generation/main.py new file mode 100644 index 0000000..4820cd9 --- /dev/null +++ b/LangChain/Retrieval-Augmented-Generation/main.py @@ -0,0 +1,49 @@ +import logging +from qa_local_docs import PDFProcessor + +def setup_logging(): + """Set up logging configuration.""" + logging.basicConfig( + level=logging.DEBUG, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' + ) + +if __name__ == "__main__": + # Set up logging + setup_logging() + + try: + # Initialize PDFProcessor class + pdf_processor = PDFProcessor() + + # Load PDFs from directory and count the number of loaded documents + texts = pdf_processor.load_pdfs_from_directory() + num_docs = len(texts) + logging.info(f'Loaded {num_docs} document(s) from directory.') + + # Perform similarity search based on the query + query = pdf_processor.get_user_query() + logging.debug(f'User query: {query}') + results = pdf_processor.perform_similarity_search(texts, query) + + # Log the results + if results: + logging.info(f'Found {len(results)} similar document(s) for 
query: {query}') + for i, result in enumerate(results): + logging.debug(f"{i+1}. Similarity score: {result['similarity_score']}, \nDocument: {result['document']}") + else: + logging.warning(f'No similar documents found for query: {query}') + + # Answer a question using the RAG model + question = pdf_processor.get_user_query("""Welcome! \ + \nYour document agent has been fully instantiated. \ + Please enter a clear and concise question: """) + logging.debug(f'User question: {question}') + answer = pdf_processor.answer_question(question) + logging.info(f"\nAnswer: {answer}") + except FileNotFoundError as fe: + logging.error(f"FileNotFoundError encountered: {fe}") + except ValueError as ve: + logging.error(f"ValueError encountered: {ve}") + except Exception as e: + logging.error(f"An error occurred: {e}") \ No newline at end of file diff --git a/LangChain/Retrieval-Augmented-Generation/qa_local_docs.py b/LangChain/Retrieval-Augmented-Generation/qa_local_docs.py new file mode 100644 index 0000000..12ecd13 --- /dev/null +++ b/LangChain/Retrieval-Augmented-Generation/qa_local_docs.py @@ -0,0 +1,157 @@ +import os +from typing import Dict, List, Union +from dotenv import load_dotenv +from retrying import retry +from langchain.text_splitter import RecursiveCharacterTextSplitter +from langchain.embeddings.tensorflow import UniversalSentenceEncoder +from langchain.vectorstores import Chroma +from langchain.embeddings import OpenAIEmbeddings +from langchain.chains import RetrievalQA +from langchain.document_loaders import DirectoryLoader +from langchain.chat_models import ChatOpenAI + +class PDFProcessor: + """ + A class to handle PDF document processing, similarity search, and question answering. + + Attributes + ---------- + OPENAI_API_KEY : str + OpenAI API Key for authentication. + embeddings : UniversalSentenceEncoder + Object for Universal Sentence Encoder embeddings. + Language model for generating embeddings. + vectorstore : Chroma + Vectorstore for storing document embeddings. + qa_chain : RetrievalQA + Question answering chain for answering questions. + + Methods + ------- + get_user_query(prompt: str = "Please enter your query: ") -> str: + Get query from the user. + load_pdfs_from_directory(directory_path: str = 'data/') -> List[List[str]]: + Load PDFs from a specified directory. + perform_similarity_search(documents: List[List[str]], query: str, threshold: float = 0.7) -> List[Dict[str, Union[float, str]]]]: + Perform similarity search on documents. Higher threshold means more similar results. + answer_question(question: str) -> str: + Answer a question using the Retrieval Augmented Generation (RAG) model. + """ + + def __init__(self, embeddings: UniversalSentenceEncoder, llm: ChatOpenAI, vectorstore: Chroma, qa_chain: RetrievalQA): + """Initialize PDFProcessor with environment variables and reusable objects.""" + self._load_env_vars() + self.embeddings = embeddings + self.llm = llm + self.vectorstore = vectorstore + self.qa_chain = qa_chain + + @retry(retry_on_exception=retry_if_value_error, stop_max_attempt_number=3) + def _load_env_vars(self): + """Load environment variables.""" + try: + load_dotenv() + self.OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', 'sk-') + if not self.OPENAI_API_KEY: + raise ValueError("OPENAI_API_KEY is missing. 
Please set the environment variable.") + self.LLM_CHAIN_PROMPT_URL = os.getenv('LLM_CHAIN_PROMPT_URL', 'https://smith.langchain.com/hub/rlm/rag-prompt') + except ValueError as ve: + print(f"ValueError encountered: {ve}") + raise + + @staticmethod + def get_user_query(prompt: str = "Please enter your query: ") -> str: + """ + Get user input for a query. + + Parameters: + prompt (str): The prompt to display for user input. + + Returns: + str: User's query input. + """ + return input(prompt) + + def load_pdfs_from_directory(self, directory_path: str = 'data/') -> List[List[str]]: + """ + Load all PDF files from a given directory. + + Parameters: + directory_path (str): Directory path to load PDFs from. + + Returns: + List[List[str]]: List of text chunks from loaded PDFs. + """ + try: + if not os.path.exists(directory_path): + return [] + + loader = DirectoryLoader(directory_path) + data = loader.load() + """ + Adjustable chunk size and overlap + - 500 characters is a safe starting point for chunk size + - We use 0 overlap to avoid duplicate chunks + """ + text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0) + all_splits = text_splitter.split_documents(data) + # Store document embeddings in a vectorstore + self.vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings()) + self.qa_chain = RetrievalQA.from_chain_type( + self.llm, + retriever=self.vectorstore.as_retriever(), + # Pull premade RAG prompt from + # https://smith.langchain.com/hub/rlm/rag-prompt + chain_type_kwargs={"prompt": hub.pull(self.LLM_CHAIN_PROMPT_URL)} + ) + # Return all text splits from PDFs + return all_splits + except FileNotFoundError as fe: + print(f"FileNotFoundError encountered: {fe}") + return [] + + def perform_similarity_search(self, documents: List[List[str]], query: str, threshold: float = 0.7) -> List[Dict[str, Union[float, str]]]: + """ + Perform similarity search on documents based on a query. + + Parameters: + documents (List[List[str]]): List of documents to search. + query (str): User query for similarity search. + threshold (float): Minimum similarity score to return. + + Returns: + List[Dict[str, Union[float, str]]]: List of dictionaries containing similarity score, document or chunk, and any other relevant metadata. + """ + try: + if not query: + query = self.get_user_query("Please enter a valid query: ") + results = [] + query_embedding = self.embeddings.embed(query) + for document in documents: + document_embedding = self.embeddings.embed(document) + similarity_score = cosine_similarity(document_embedding, query_embedding) + if similarity_score >= threshold: + result = { + "similarity_score": similarity_score, + "document": document, + "metadata": {} + } + results.append(result) + # Sort results by similarity score in reverse order because we want the highest similarity score first + return sorted(results, key=lambda k: k['similarity_score'], reverse=True) + except Exception as e: + print(f"An error occurred: {e}") + return [] + + def answer_question(self, question: str) -> str: + """ + Answer a question using the Retrieval Augmented Generation (RAG) model. + + Parameters: + question (str): The question to answer. + + Returns: + str: The answer to the question. 
+ """ + result = self.qa_chain({"query": question}) + return result["result"] \ No newline at end of file diff --git a/LangChain/Retrieval-Augmented-Generation/test.py b/LangChain/Retrieval-Augmented-Generation/test.py new file mode 100644 index 0000000..99ef6f5 --- /dev/null +++ b/LangChain/Retrieval-Augmented-Generation/test.py @@ -0,0 +1,38 @@ +import unittest +from unittest.mock import patch, MagicMock +from qa_local_docs import PDFProcessor, ChatOpenAI, Chroma, UniversalSentenceEncoder, RetrievalQA + +# Assumes that 'data/' directory contains PDFs +class TestPDFProcessor(unittest.TestCase): + # Set up reusable objects + def setUp(self): + embeddings = UniversalSentenceEncoder() + llm = ChatOpenAI() + vectorstore = Chroma() + qa_chain = RetrievalQA() + # Tie reusable objects together + self.pdf_processor = PDFProcessor(embeddings, llm, vectorstore, qa_chain) + + def test_load_pdfs_from_directory(self): + # Test that the method returns a non-empty list + result = self.pdf_processor.load_pdfs_from_directory() + self.assertTrue(isinstance(result, list)) + self.assertTrue(len(result) > 0) + + def test_perform_similarity_search(self): + # Test that the method returns a non-empty list + texts = self.pdf_processor.load_pdfs_from_directory() + result = self.pdf_processor.perform_similarity_search(texts, "test") + self.assertTrue(isinstance(result, list)) + self.assertTrue(len(result) > 0) + + @patch('qa_local_docs.ChatOpenAI') + @patch('qa_local_docs.Chroma') + @patch('qa_local_docs.UniversalSentenceEncoder') + def test_answer_question(self, mock_embeddings, mock_vectorstore, mock_llm): + # Test that the method returns a string + mock_result = MagicMock() + mock_result.__getitem__.return_value = {"result": "test answer"} + mock_llm.return_value = mock_result + result = self.pdf_processor.answer_question("test question") + self.assertTrue(isinstance(result, str)) \ No newline at end of file diff --git a/OpenAI/Auto-Embedder/.env.template b/OpenAI/Embedding-Upsertion/.env.template similarity index 100% rename from OpenAI/Auto-Embedder/.env.template rename to OpenAI/Embedding-Upsertion/.env.template diff --git a/OpenAI/Auto-Embedder/README.md b/OpenAI/Embedding-Upsertion/README.md similarity index 100% rename from OpenAI/Auto-Embedder/README.md rename to OpenAI/Embedding-Upsertion/README.md diff --git a/OpenAI/Embedding-Upsertion/__init__.py b/OpenAI/Embedding-Upsertion/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/OpenAI/Auto-Embedder/pinembed.py b/OpenAI/Embedding-Upsertion/pinembed.py similarity index 100% rename from OpenAI/Auto-Embedder/pinembed.py rename to OpenAI/Embedding-Upsertion/pinembed.py diff --git a/OpenAI/Auto-Embedder/requirements.txt b/OpenAI/Embedding-Upsertion/requirements.txt similarity index 100% rename from OpenAI/Auto-Embedder/requirements.txt rename to OpenAI/Embedding-Upsertion/requirements.txt diff --git a/OpenAI/Auto-Embedder/test.py b/OpenAI/Embedding-Upsertion/test.py similarity index 100% rename from OpenAI/Auto-Embedder/test.py rename to OpenAI/Embedding-Upsertion/test.py diff --git a/OpenAI/GPT-Prompt-Examples/MS-6_Daethyra_Custom-Instruction_GPT4.md b/Prompts/MS-6_Daethyra_Custom-Instruction_GPT4.md similarity index 100% rename from OpenAI/GPT-Prompt-Examples/MS-6_Daethyra_Custom-Instruction_GPT4.md rename to Prompts/MS-6_Daethyra_Custom-Instruction_GPT4.md diff --git a/OpenAI/GPT-Prompt-Examples/multi-shot/MS-1.MD b/Prompts/multi-shot/MS-1.MD similarity index 100% rename from OpenAI/GPT-Prompt-Examples/multi-shot/MS-1.MD rename to 
Prompts/multi-shot/MS-1.MD diff --git a/OpenAI/GPT-Prompt-Examples/multi-shot/MS-2_Large-Template.txt b/Prompts/multi-shot/MS-2_Large-Template.txt similarity index 100% rename from OpenAI/GPT-Prompt-Examples/multi-shot/MS-2_Large-Template.txt rename to Prompts/multi-shot/MS-2_Large-Template.txt diff --git a/OpenAI/GPT-Prompt-Examples/multi-shot/MS-5_No-Prose_Doc-Reader.txt b/Prompts/multi-shot/MS-5_No-Prose_Doc-Reader.txt similarity index 100% rename from OpenAI/GPT-Prompt-Examples/multi-shot/MS-5_No-Prose_Doc-Reader.txt rename to Prompts/multi-shot/MS-5_No-Prose_Doc-Reader.txt diff --git a/OpenAI/GPT-Prompt-Examples/OUT-prompt-cheatsheet.md b/Prompts/prompt-cheatsheet.md similarity index 100% rename from OpenAI/GPT-Prompt-Examples/OUT-prompt-cheatsheet.md rename to Prompts/prompt-cheatsheet.md diff --git a/OpenAI/GPT-Prompt-Examples/system-role/SR-1_List-o-Prompts.md b/Prompts/system-role/SR-1_List-o-Prompts.md similarity index 100% rename from OpenAI/GPT-Prompt-Examples/system-role/SR-1_List-o-Prompts.md rename to Prompts/system-role/SR-1_List-o-Prompts.md diff --git a/OpenAI/GPT-Prompt-Examples/system-role/SR-2_package-migration.md b/Prompts/system-role/SR-2_package-migration.md similarity index 100% rename from OpenAI/GPT-Prompt-Examples/system-role/SR-2_package-migration.md rename to Prompts/system-role/SR-2_package-migration.md diff --git a/OpenAI/GPT-Prompt-Examples/system-role/SR-3_thorough-programmer.md b/Prompts/system-role/SR-3_thorough-programmer.md similarity index 100% rename from OpenAI/GPT-Prompt-Examples/system-role/SR-3_thorough-programmer.md rename to Prompts/system-role/SR-3_thorough-programmer.md diff --git a/OpenAI/GPT-Prompt-Examples/system-role/SR-4_online-searches.md b/Prompts/system-role/SR-4_online-searches.md similarity index 100% rename from OpenAI/GPT-Prompt-Examples/system-role/SR-4_online-searches.md rename to Prompts/system-role/SR-4_online-searches.md diff --git a/OpenAI/GPT-Prompt-Examples/user-role/UR-1.MD b/Prompts/user-role/UR-1.MD similarity index 100% rename from OpenAI/GPT-Prompt-Examples/user-role/UR-1.MD rename to Prompts/user-role/UR-1.MD diff --git a/OpenAI/GPT-Prompt-Examples/user-role/UR-2.md b/Prompts/user-role/UR-2.md similarity index 100% rename from OpenAI/GPT-Prompt-Examples/user-role/UR-2.md rename to Prompts/user-role/UR-2.md diff --git a/Prompts/user-role/UR-3.md b/Prompts/user-role/UR-3.md new file mode 100644 index 0000000..ec8b8ca --- /dev/null +++ b/Prompts/user-role/UR-3.md @@ -0,0 +1,6 @@ +## Enforce idiomacy + +"What is the idiomatic way to {MASK} +in {ProgrammingLanguage}?" + +- Credit to [Sammi-Turner](https://github.com/sammi-turner) \ No newline at end of file diff --git a/README.md b/README.md index e3d5f24..8cd5b79 100644 --- a/README.md +++ b/README.md @@ -1,58 +1,50 @@ # LLM Utilikit -${INTRO} -${SupportedLibraries} -${Intention : Reasoning} -${BriefResummary} +Welcome to the Utilikit, your one-stop library of Python modules designed to supercharge your projects. Whether you're just getting started or looking to enhance an existing project, this library offers a rich set of pluggable components and a treasure trove of large language model prompts and templates. But that's not all — I envision the Utilikit as a communal canvas, inviting proompters from all industries and walks of life to enrich this toolkit with their own prompts, templates, and Python modules. Join us in crafting a toolkit that's greater than the sum of its parts. -#### 1. 
**[OpenAI: Utilikit](./OpenAI/)** +## Supported libraries: +- OpenAI +- LangChain +- HuggingFace +- Pinecone ---- - -A. **[Auto-Embedder](./OpenAI/Auto-Embedder)** +This project aims to solve two key challenges faced by developers and data scientists alike: the need for a quick start and the desire for modular, reusable components. This library addresses these challenges head-on by offering a curated set of Python modules that can either serve as a robust starting point for new projects or as plug-and-play components to elevate existing ones. -Provides an automated pipeline for retrieving embeddings from [OpenAIs `text-embedding-ada-002`](https://platform.openai.com/docs/guides/embeddings) and upserting them to a [Pinecone index](https://docs.pinecone.io/docs/indexes). +## 0. **[Prompts](./Prompts/)** -- **[`pinembed.py`](./OpenAI/Auto-Embedder/pinembed.py)**: A Python module to easily automate the retrieval of embeddings from OpenAI and storage in Pinecone. - - **[.env.template](./OpenAI/Auto-Embedder/.env.template)**: Template for environment variables. - ---- +There are three main prompt types, [multi-shot](./Prompts/multi-shot), [system-role](./Prompts/system-role), [user-role](./Prompts/user-role). -B. **[GPT-Prompt-Examples](./OpenAI/GPT-Prompt-Examples)** +Please also see the [prompt-cheatsheet](./Prompts/prompt-cheatsheet.md). -There are three main prompt types, [multi-shot](./OpenAI/GPT-Prompt-Examples/multi-shot), [system-role](./OpenAI/GPT-Prompt-Examples/system-role), [user-role](./OpenAI/GPT-Prompt-Examples/user-role). +- **[Cheatsheet](./Prompts/prompt-cheatsheet.md)**: @Daethyra's go-to prompts. -Please also see the [OUT-prompt-cheatsheet](./OpenAI/GPT-Prompt-Examples/OUT-prompt-cheatsheet.md). +- **[multi-shot](./Prompts/multi-shot)**: Prompts, with prompts inside them. +It's kind of like a bundle of Matryoshka prompts! -- **[Cheatsheet](./OpenAI/GPT-Prompt-Examples/OUT-prompt-cheatsheet.md)**: @Daethyra's go-to prompts. +- **[system-role](./Prompts/system-role)**: Steer your LLM by shifting the ground it stands on. -- **[multi-shot](./OpenAI/GPT-Prompt-Examples/multi-shot)**: Prompts, with prompts inside them. -It's kind of like a bundle of Matryoshka prompts! +- **[user-role](./Prompts/user-role)**: Markdown files for user-role prompts. -- **[system-role](./OpenAI/GPT-Prompt-Examples/system-role)**: Steer your LLM by shifting the ground it stands on. +## 1. **[OpenAI](./OpenAI/)** -- **[user-role](./OpenAI/GPT-Prompt-Examples/user-role)**: Markdown files for user-role prompts. +A. **[Auto-Embedder](./OpenAI/Auto-Embedder)** ---- +Provides an automated pipeline for retrieving embeddings from [OpenAIs `text-embedding-ada-002`](https://platform.openai.com/docs/guides/embeddings) and upserting them to a [Pinecone index](https://docs.pinecone.io/docs/indexes). -#### 2. **[LangChain: Pluggable Components](./LangChain/)** +- **[`pinembed.py`](./OpenAI/Auto-Embedder/pinembed.py)**: A Python module to easily automate the retrieval of embeddings from OpenAI and storage in Pinecone. ---- +## 2. **[LangChain](./LangChain/)** -A. **[`stateful_chatbot.py`](./LangChain/Retrieval-Agents/stateful_chatbot.py)** +A. **[`stateful_chatbot.py`](./LangChain/Retrieval-Augmented-Generation/qa_local_docs.py)** This module offers a set of functionalities for conversational agents in LangChain. 
Specifically, it provides: - Argument parsing for configuring the agent -- Document loading via `PyPDFDirectoryLoader` +- Document loading via `PDFProcessor` - Text splitting using `RecursiveCharacterTextSplitter` - Various embeddings options like `OpenAIEmbeddings`, `CacheBackedEmbeddings`, and `HuggingFaceEmbeddings` -**Potential Use Cases:** - -${MASK} - ---- +**Potential Use Cases:** For developing conversational agents with advanced features. B. **[`qa_local_docs.py`](./LangChain/Retrieval-Agents/qa_local_docs.py)** @@ -64,17 +56,9 @@ This module focuses on querying local documents and employs the following featur - Vector storage options like `Chroma` - Embedding options via `OpenAIEmbeddings` -**Potential Use Cases:** - -${MASK} - ---- +**Potential Use Cases:** For querying large sets of documents efficiently. -These modules are designed to be extensible and can be easily integrated into your LangChain projects. - ---- - -#### 3. **[HuggingFace: Pluggable Components](./HuggingFace/)** +## 3. **[HuggingFace](./HuggingFace/)** A. **[`integrable_captioner.py`](./HuggingFace\image_captioner\integrable_image_captioner.py)** @@ -86,18 +70,17 @@ This module focuses on generating captions for images using Hugging Face's trans - Caption caching for improved efficiency - Device selection (CPU or GPU) based on availability -**Potential Use Cases:** -${MASK} +**Potential Use Cases:** For generating accurate and context-appropriate image captions. + +## Installation + +Distribution as a package for easy installation and integration is planned; however, that is *not* currently in progress. + ---
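The image captioner module itself is not part of this changeset, so the snippet below is only a rough sketch of the underlying idea using the `transformers` image-to-text pipeline directly, not the module's own API; the checkpoint name and image path are placeholder assumptions.

```python
from transformers import pipeline

# Assumed checkpoint: any image-to-text model supported by the pipeline can be substituted.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Hypothetical local image; pass device=0 to run on a GPU instead of the CPU default.
result = captioner("example.jpg")
print(result[0]["generated_text"])
```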
- Creation Date: Oct 7th, 2023 -
-
- Creation Date: Oct 7th, 2023 - Creation Date: Oct 7th, 2023 + Creation Date: Oct 7th, 2023
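To make the LangChain section of the README more concrete, the sketch below strings together the same calls used by the new `qa_local_docs.py` module (`DirectoryLoader`, `RecursiveCharacterTextSplitter`, `Chroma`, and `RetrievalQA`). It assumes `OPENAI_API_KEY` is set in the environment and that PDFs live under a local `data/` directory.

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Load and chunk the local documents (500-character chunks, no overlap)
docs = DirectoryLoader("data/").load()
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0).split_documents(docs)

# Embed the chunks into a Chroma vectorstore and build a retrieval QA chain
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
qa_chain = RetrievalQA.from_chain_type(
    ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0),
    retriever=vectorstore.as_retriever(),
)

print(qa_chain({"query": "What are these documents about?"})["result"])
```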
diff --git a/todo.md b/todo.md index 316611d..2d0e249 100644 --- a/todo.md +++ b/todo.md @@ -1,14 +1,4 @@ -### Todo list - -[README] - -- Add intro - - Clearly define: [Utilikit, Pluggable/Components, multi-shot, zero-shot,] - - create summarization of prompt reusability, and component extendability - - Then, clearly state the intention of the repository. : Provide Reasoning, I want this to be a nexus of information to empower my LLMs moving forward. By continually updating this repository as a codebase and conglomeration of documentation, it may serve as a `git clone`able neuron for machine learning models. - - Finally, provide one to two brief statements to close out and resummarize - ---- +### Todo list [GitHub] @@ -24,27 +14,42 @@ [LangChain] -- langchain_conv_agent.py +- stateful_chatbot.py + - Lacks single execution runnability - Fix by removing argparsing and implement default settings, with a configuration file - Config file settings: - Embedding Engine: [OpenAI, HuggingFace, etc.] - ***Lacks .env var loading(API keys, model names[OpenAI, HuggingFace])*** - - Ambiguity regarding (EmbeddingManager and DocumentRetriever) - - Needs comments and to load via .env file - - Differentiate EmbeddingManager and DocumentRetriever by explaining how they're implemented into the pipeline stream created by the module. - - One generates embeddings - - `DocumentRetriever` queries them locally + - ~~Ambiguity regarding (EmbeddingManager and DocumentRetriever)~~ + - (**AVOID SUGGESTIONS BELOW**) + - ~~Needs comments and to load via .env file~~ + - ~~Differentiate EmbeddingManager and DocumentRetriever by explaining how they're implemented into the pipeline stream created by the module.~~ + - ~~One generates embeddings~~ + - ~~`DocumentRetriever` queries them locally (HF model is cached after first download. Therefore, all runs after the first, - are entirely local since we're using ChromaDB) + are entirely local since we're using ChromaDB)~~ +- qa_local_docs.py + + - ~~Doesn't automatically collect and generate embeddings for the data folder~~ + - ~~To ensure automation, create a first-run / boot-up process~~ + + 1. ~~Move the `PDFProcessor` class to a separate file to increase modularity and maintainability.~~ + 2. Use dependency injection to pass in the necessary objects to the `PDFProcessor` class instead of initializing them in the constructor. This will increase modularity and make the class more testable. + 3. ~~Use a logger instead of `print` statements to log errors and other messages. This will make the code more maintainable and scalable.~~ --- [OpenAI] -- Auto-Embedder - - Requires testing - - test.py requires updates -- [Task]:Update test.py and run +- ~~Auto-Embedder~~ + - ~~Requires testing~~ + - ~~test.py requires updates~~ +- ~~[Task]:Update test.py and run~~ --- + +[HuggingFace] + +- Test: `integrable_image_captioner.py` + - Deposit AI art images for batch tests \ No newline at end of file
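Regarding qa_local_docs.py item 2 in the todo list (dependency injection for `PDFProcessor`): a minimal sketch of what that wiring could look like, with the caller constructing the collaborators and passing them in. The concrete classes below are assumptions for illustration (e.g. `OpenAIEmbeddings` standing in for the module's encoder), not the final design.

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

from qa_local_docs import PDFProcessor

# The caller builds the collaborators, so tests can inject mocks or lighter substitutes.
embeddings = OpenAIEmbeddings()
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)
vectorstore = Chroma(embedding_function=embeddings)

# qa_chain is left as None here; load_pdfs_from_directory() assigns it from the loaded documents.
processor = PDFProcessor(embeddings=embeddings, llm=llm, vectorstore=vectorstore, qa_chain=None)
chunks = processor.load_pdfs_from_directory("data/")
```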