diff --git a/ build_and_train_models/sm-finetune_flan_t5_with_tensorboard/finetune_flan_t5_with_tensorboard.ipynb b/ build_and_train_models/sm-finetune_flan_t5_with_tensorboard/finetune_flan_t5_with_tensorboard.ipynb new file mode 100644 index 0000000000..31315c5d0a --- /dev/null +++ b/ build_and_train_models/sm-finetune_flan_t5_with_tensorboard/finetune_flan_t5_with_tensorboard.ipynb @@ -0,0 +1,560 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "48b9f283-12e1-4c30-924d-d6bac1f14d6a", + "metadata": {}, + "source": [ + "# Fine-tuning a HuggingFace FLAN-T5 Model on Amazon SageMaker with TensorBoard Integration\n", + "\n", + "**Author**: Hubert Gabryel\n", + "\n", + "**Date**: 2023-10-05\n", + "\n", + "## Table of Contents\n", + "\n", + "1.\t[Introduction](#1-introduction)\n", + "\n", + "\t1.1 [Background](#11-background)\n", + "\n", + "\t1.2 [Objective](#12-objective)\n", + "\n", + "2.\t[Setup](#2-setup)\n", + "\n", + "\t2.1 [Import Libraries](#21-import-libraries)\n", + "\n", + "\t2.2 [Initialize SageMaker Session and Role](#22-initialize-sagemaker-session-and-role)\n", + "\n", + "\t2.3 [Model Configuration](#23-model-configuration)\n", + "\n", + "3.\t[Data Preparation](#3-data-preparation)\n", + "\n", + "\t3.1 [Download and Prepare the Dataset](#31-download-and-prepare-the-dataset)\n", + "\n", + "\t3.2 [Load and Preprocess the Data](#32-load-and-preprocess-the-data)\n", + "\n", + "\t3.3 [Prepare the Data for Training](#33-prepare-the-data-for-training)\n", + "\n", + "\t3.4 [Visualize Sample Data](#34-visualize-sample-data)\n", + "\n", + "\t3.5 [Upload Data to S3](#35-upload-data-to-s3)\n", + "\n", + "4.\t[Training Script Modification](#4-training-script-modification)\n", + "\n", + "\t4.1 [Download the Training Script](#41-download-the-training-script)\n", + "\n", + "\t4.2 [Modify the Training Script for TensorBoard Integration](#42-modify-the-training-script-for-tensorboard-integration)\n", + "\n", + "5.\t[Model Training with TensorBoard 
Integration](#5-model-training-with-tensorboard-integration)\n", + "\n", + "\t5.1 [Set Up TensorBoard Output Configuration](#51-set-up-tensorboard-output-configuration)\n", + "\n", + "\t5.2 [Define Hyperparameters](#52-define-hyperparameters)\n", + "\n", + "\t5.3 [Create and Fit the Estimator](#53-create-and-fit-the-estimator)\n", + "\n", + "6.\t[TensorBoard Visualization](#6-tensorboard-visualization)\n", + "\n", + "\t6.1 [Start TensorBoard from the SageMaker Console](#61-start-tensorboard-from-the-sagemaker-console)\n", + "\t\n", + "7.\t[Conclusion](#7-conclusion)\n", + "\n", + "8.\t[References](#8-references)\n", + "\n", + "\n", + "## 1. Introduction\n", + "\n", + "### 1.1 Background\n", + "\n", + "In this notebook, we demonstrate how to fine-tune a HuggingFace FLAN-T5 model using Amazon SageMaker’s JumpStart models with TensorBoard integration. This integration allows us to monitor and visualize the training process in real-time, providing valuable insights into model performance.\n", + "\n", + "### 1.2 Objective\n", + "\n", + "Our goal is to fine-tune the FLAN-T5 small model on a subset of the Tiny Shakespeare dataset and visualize the training metrics using TensorBoard. We will:\n", + "\n", + "- Set up the SageMaker environment and import necessary libraries.\n", + "- Prepare the dataset for training.\n", + "- Modify the training script to include TensorBoard logging.\n", + "- Train the model with TensorBoard integration.\n", + "- Visualize the training metrics using TensorBoard.\n", + "\n", + "## 2. 
Setup\n", + "\n", + "### 2.1 Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c3644c9c-adb0-4eb9-9d95-89120ab22dde", + "metadata": {}, + "outputs": [], + "source": [ + "# Install or upgrade the SageMaker Python SDK\n", + "!pip install -U sagemaker --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d6bcbc8-6abf-4b62-80b3-3e58ece51b66", + "metadata": {}, + "outputs": [], + "source": [ + "# Import necessary libraries\n", + "import os\n", + "import random\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "import boto3\n", + "import sagemaker\n", + "from sagemaker import get_execution_role, script_uris\n", + "from sagemaker.s3 import S3Uploader, S3Downloader\n", + "from sagemaker.jumpstart.estimator import JumpStartEstimator\n", + "from sagemaker.debugger import TensorBoardOutputConfig\n", + "\n", + "# Set random seeds for reproducibility\n", + "RANDOM_SEED = 42\n", + "random.seed(RANDOM_SEED)\n", + "np.random.seed(RANDOM_SEED)" + ] + }, + { + "cell_type": "markdown", + "id": "2b572c67-a33c-4b2d-b1dc-aa71192a9682", + "metadata": {}, + "source": [ + "### 2.2 Initialize SageMaker Session and Role" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e462af99-d22a-4fb8-93e0-d03edb4351eb", + "metadata": {}, + "outputs": [], + "source": [ + "sagemaker_session = sagemaker.Session()\n", + "role = get_execution_role()\n", + "\n", + "# Verify S3 access\n", + "try:\n", + " s3_client = sagemaker_session.boto_session.client('s3')\n", + " s3_client.head_bucket(Bucket=sagemaker_session.default_bucket())\n", + " print(\"S3 access confirmed.\")\n", + "except Exception as e:\n", + " print(f\"Unable to access S3 bucket: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "a46f6523-ddb1-489f-9757-dc9543f475de", + "metadata": {}, + "source": [ + "### 2.3 Model Configuration" + ] + }, + { + "cell_type": "code", + 
"execution_count": 4, + "id": "ea3f3cda-803f-4bea-a678-f4284efbbb4b", + "metadata": {}, + "outputs": [], + "source": [ + "# Model configuration\n", + "MODEL_ID = 'huggingface-text2text-flan-t5-small' # Small model to keep training cost low\n", + "MODEL_VERSION = '2.1.2' # Latest model version at the time of writing" + ] + }, + { + "cell_type": "markdown", + "id": "ab6ad668-ac5a-4b88-99c6-17b8c962b69a", + "metadata": {}, + "source": [ + "## 3. Data Preparation\n", + "\n", + "### 3.1 Download and Prepare the Dataset\n", + "\n", + "We will use the Tiny Shakespeare dataset for this example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a444e79-dd2c-4f3f-9a64-33c6d6f713a6", + "metadata": {}, + "outputs": [], + "source": [ + "# Download the Tiny Shakespeare dataset\n", + "!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt --no-check-certificate" + ] + }, + { + "cell_type": "markdown", + "id": "40d302be-4997-42cd-b83f-5fe732ac0fa8", + "metadata": {}, + "source": [ + "### 3.2 Load and Preprocess the Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c61092-8b0c-4104-8943-bfe8098160d3", + "metadata": {}, + "outputs": [], + "source": [ + "# Read the data\n", + "with open('input.txt', 'r') as f:\n", + "    data = f.read()\n", + "\n", + "# Limit the data to the first MAX_DATA_LENGTH characters\n", + "MAX_DATA_LENGTH = 10000\n", + "data = data[:MAX_DATA_LENGTH]\n", + "\n", + "# Split the text contiguously so that each set keeps its original character order\n", + "TEST_SIZE = 0.2\n", + "split_index = int(len(data) * (1 - TEST_SIZE))\n", + "train_text, val_text = data[:split_index], data[split_index:]\n", + "print(f\"Training data length: {len(train_text)}\")\n", + "print(f\"Validation data length: {len(val_text)}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "2db6e736-4088-4518-8515-efd31bbef8ef", + "metadata": {}, + "source": [ + "### 3.3 Prepare the Data for Training\n", + "\n", + "We need to format the data into 
prompts and completions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21329da7-f9ed-4869-b8ab-88abdb6f5255", + "metadata": {}, + "outputs": [], + "source": [ + "def prepare_data(text, sequence_length=256, prompt_length=128):\n", + "    data = []\n", + "    max_index = len(text) - sequence_length + 1\n", + "    for i in range(0, max_index, prompt_length):\n", + "        prompt = text[i:i+prompt_length]\n", + "        completion = text[i+prompt_length:i+sequence_length]\n", + "        if len(completion) == (sequence_length - prompt_length):\n", + "            data.append({'prompt': prompt, 'completion': completion})\n", + "    return data\n", + "\n", + "# Prepare the training and validation data\n", + "train_data = prepare_data(train_text)\n", + "val_data = prepare_data(val_text)\n", + "\n", + "print(f\"Number of training samples: {len(train_data)}\")\n", + "print(f\"Number of validation samples: {len(val_data)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "854bc82d-a469-41b0-a470-da8f7d938101", + "metadata": {}, + "source": [ + "### 3.4 Visualize Sample Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ebdcd9c-b32a-45e6-9f9f-8d1d116318b3", + "metadata": {}, + "outputs": [], + "source": [ + "# Display a sample from the training data\n", + "pd.DataFrame(train_data).head()" + ] + }, + { + "cell_type": "markdown", + "id": "517c62b7-cc20-4c2a-b338-880d488a70db", + "metadata": {}, + "source": [ + "### 3.5 Upload Data to S3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a204e6f9-8ae8-4199-bec4-c7c8b443bbfe", + "metadata": {}, + "outputs": [], + "source": [ + "# Define S3 bucket and prefix\n", + "bucket = sagemaker_session.default_bucket()\n", + "data_prefix = 'jumpstart-example-data'\n", + "\n", + "# Save the data to local files\n", + "pd.DataFrame(train_data).to_json('train.jsonl', orient='records', lines=True)\n", + "pd.DataFrame(val_data).to_json('val.jsonl', orient='records', lines=True)\n", + "\n", + "# Upload training 
data\n", + "train_s3_uri = sagemaker_session.upload_data(\n", + "    path='train.jsonl',\n", + "    bucket=bucket,\n", + "    key_prefix=f\"{data_prefix}/train\"\n", + ")\n", + "\n", + "# Upload validation data\n", + "val_s3_uri = sagemaker_session.upload_data(\n", + "    path='val.jsonl',\n", + "    bucket=bucket,\n", + "    key_prefix=f\"{data_prefix}/validation\"\n", + ")\n", + "\n", + "print(f\"Training data uploaded to: {train_s3_uri}\")\n", + "print(f\"Validation data uploaded to: {val_s3_uri}\")" + ] + }, + { + "cell_type": "markdown", + "id": "2e75bf64-0948-4a74-bd67-b1441d00917d", + "metadata": {}, + "source": [ + "## 4. Training Script Modification\n", + "\n", + "### 4.1 Download the Training Script\n", + "\n", + "We need to obtain the default training script provided by the JumpStart model and modify it to integrate TensorBoard." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "a5191d10-8dac-4a3a-9ee1-597a3208bcb2", + "metadata": {}, + "outputs": [], + "source": [ + "import tarfile\n", + "\n", + "# Retrieve the training script URI\n", + "train_script_uri = script_uris.retrieve(\n", + "    model_id=MODEL_ID, model_version=MODEL_VERSION, script_scope=\"training\"\n", + ")\n", + "\n", + "# Download the training script archive into ./training_script\n", + "S3Downloader.download(train_script_uri, \"training_script\")\n", + "\n", + "# Unpack the archive next to itself so that ./training_script\n", + "# can later be passed to the estimator as source_dir\n", + "archive_path = os.path.join('training_script', 'sourcedir.tar.gz')\n", + "with tarfile.open(archive_path) as tar:\n", + "    tar.extractall('./training_script')\n", + "\n", + "# List the extracted files to locate the script to edit in section 4.2\n", + "for name in sorted(os.listdir('training_script')):\n", + "    print(name)" + ] + }, + { + "cell_type": "markdown", + "id": "87befce4-deb1-426f-acbf-803099438ac2", + "metadata": {}, + "source": [ + "### 4.2 Modify the Training Script for TensorBoard Integration\n", + "\n", + "We need to modify the train.py script to include TensorBoard logging.\n", + "\n", + "- Import the TensorBoardCallback:\n", + " In 
train.py, add:\n", + "\n", + "```python\n", + "from transformers.integrations import TensorBoardCallback\n", + "```\n", + "\n", + "- Modify the Seq2SeqTrainingArguments to include TensorBoard parameters:\n", + "\n", + "```python\n", + "training_args = Seq2SeqTrainingArguments(\n", + "    # ... other arguments ...\n", + "    logging_dir=\"/opt/ml/output/tensorboard\",\n", + "    report_to=['tensorboard'],\n", + "    # ... other arguments ...\n", + ")\n", + "```\n", + "\n", + "\n", + "- Add the TensorBoardCallback to the trainer:\n", + "\n", + "```python\n", + "if callbacks is None: # Added line\n", + "    callbacks = [] # Added line\n", + "callbacks.append(TensorBoardCallback()) # Added line\n", + "\n", + "# Create Trainer instance\n", + "trainer = Seq2SeqTrainer(\n", + "    model=model,\n", + "    args=training_args,\n", + "    train_dataset=dataset[constants.TRAIN],\n", + "    eval_dataset=dataset[constants.VALIDATION],\n", + "    data_collator=data_collator,\n", + "    callbacks=callbacks,\n", + ")\n", + "```\n", + "\n", + "Note: Ensure that the logging_dir value (\"/opt/ml/output/tensorboard\") in the training script matches the container_local_output_path in the TensorBoardOutputConfig.\n", + "\n", + "## 5. 
Model Training with TensorBoard Integration\n", + "\n", + "### 5.1 Set Up TensorBoard Output Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8fdbc2e0-9b81-420d-bbe2-01dba31aa103", + "metadata": {}, + "outputs": [], + "source": [ + "tensorboard_output_config = TensorBoardOutputConfig(\n", + "    s3_output_path=f's3://{bucket}/tensorboard-output',\n", + "    container_local_output_path='/opt/ml/output/tensorboard' # Must match logging_dir in the training script\n", + ")\n", + "\n", + "print(f\"TensorBoard logs will be saved to: s3://{bucket}/tensorboard-output\")" + ] + }, + { + "cell_type": "markdown", + "id": "e7e316f7-6a1e-4f68-932b-9a495d3a56e5", + "metadata": {}, + "source": [ + "### 5.2 Define Hyperparameters" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "d4c9fe44-cb9c-407b-a80c-5577105b18f6", + "metadata": {}, + "outputs": [], + "source": [ + "hyperparameters = {\n", + "    \"epochs\": \"5\",\n", + "    \"batch_size\": \"4\",\n", + "    \"learning_rate\": \"5e-5\",\n", + "    \"logging_strategy\": \"steps\",\n", + "    \"logging_steps\": \"5\",\n", + "    \"evaluation_strategy\": \"steps\",\n", + "    \"save_strategy\": \"steps\",\n", + "    \"eval_steps\": \"25\",\n", + "    \"save_steps\": \"25\",\n", + "    \"gradient_accumulation_steps\": \"1\",\n", + "    \"fp16\": \"true\",\n", + "    \"bf16\": \"false\"\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "137786e8-56f4-4acc-b4cb-3bf2e2f219fc", + "metadata": {}, + "source": [ + "### 5.3 Create and Fit the Estimator" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ebc3d8f5-a505-4754-a578-7de4c7221f28", + "metadata": {}, + "outputs": [], + "source": [ + "estimator = JumpStartEstimator(\n", + "    model_id=MODEL_ID,\n", + "    model_version=MODEL_VERSION,\n", + "    instance_type='ml.g5.xlarge',\n", + "    hyperparameters=hyperparameters,\n", + "    entry_point='transfer_learning.py', # Name of main script\n", + "    source_dir='training_script', # Directory containing your 
scripts\n", + "    tensorboard_output_config=tensorboard_output_config\n", + ")\n", + "\n", + "# Start the training job\n", + "estimator.fit(\n", + "    {\"train\": train_s3_uri, \"validation\": val_s3_uri}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "ac2186f6-8767-4773-ad21-c7e3458ae818", + "metadata": {}, + "source": [ + "## 6. TensorBoard Visualization\n", + "\n", + "### 6.1 Start TensorBoard from the SageMaker Console\n", + "\n", + "1. Navigate to the SageMaker Console:\n", + "   - Go to the Amazon SageMaker Console.\n", + "2. Access TensorBoard:\n", + "   - In the left-hand navigation pane, click on Applications and IDEs.\n", + "   - Select TensorBoard.\n", + "3. Open TensorBoard:\n", + "   - Click on Open TensorBoard to launch the TensorBoard landing page.\n", + "4. Add Your Training Job:\n", + "   - On the TensorBoard page, click on Add job.\n", + "   - Select your most recent completed training job from the list.\n", + "5. View Training Metrics:\n", + "   - After the data loads, navigate to the Scalars tab.\n", + "   - Here, you can see charts and graphs of your training metrics.\n" + ] + }, + { + "cell_type": "markdown", + "id": "731283bc-212b-4810-9e3c-d6f84368b960", + "metadata": {}, + "source": [ + "## 7. Conclusion\n", + "\n", + "In this notebook, we demonstrated how to fine-tune a HuggingFace FLAN-T5 model using Amazon SageMaker with TensorBoard integration. We prepared a subset of the Tiny Shakespeare dataset, modified the training script to include TensorBoard logging, and visualized the training metrics.\n", + "\n", + "Next Steps:\n", + "\n", + "- Experiment with Hyperparameters: Adjust learning rates, batch sizes, and other hyperparameters to improve model performance.\n", + "- Use a Larger Dataset: Try using a larger dataset for better results.\n", + "- Deploy the Model: After training, deploy the model using SageMaker’s deployment capabilities for inference.\n", + "\n", + "## 8. 
References\n", + "\n", + "- [Amazon SageMaker Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-htb-prepare-training-job.html)\n", + "- [TensorBoard Documentation](https://www.tensorflow.org/tensorboard/get_started)\n", + "- [Tiny Shakespeare Dataset](https://github.com/karpathy/char-rnn/tree/master/data/tinyshakespeare)\n", + "- [HuggingFace Transformers Documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments)\n", + "- [SageMaker JumpStart Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}