add cost tracking and exercises
Hui Kang Tong committed Mar 27, 2024
1 parent 8c0b150 commit 289f35b
Showing 13 changed files with 6,462 additions and 5,938 deletions.
98 changes: 87 additions & 11 deletions README.md
@@ -1,6 +1,6 @@
# Automatic Prompt Engineer

This repository contains a [notebook](https://nbviewer.org/github/tonghuikang/automatic-prompt-engineer/blob/master/classification.ipynb) that [generates and optimizes system and user prompts](https://tonghuikang.github.io/automatic-prompt-engineer/html_output/prompt-history-classification.html) for classification purposes.

This is how classification is intended to be done.
- (system prompt, user prompt prefix + text + user prompt suffix) -Haiku-> bot response -function-> label
@@ -21,33 +21,109 @@ If you want to change the classification task, you will need to

This is how prompt tuning is done
- Sample from the full dataset.
- Haiku takes in (system prompt, user prompt prefix + text + user prompt suffix) and produces bot_response (the final layer values).
- The function takes in bot_response and produces the label. The (text -> label) process is analogous to the forward pass.
- Sample from the mistakes.
- Opus takes in the mistakes and summarizes the mistakes (gradient calculation).
- Opus takes in the mistake summary (gradient) and the current prompts (model parameters) and updates the prompts.
- Repeat.
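
Schematically, one iteration of this loop looks like the sketch below. The helper names are illustrative, not the notebook's exact functions (only `update_model_parameters` is the notebook's own).

```python
# a minimal sketch of one tuning iteration; helper names are illustrative
for iteration in range(NUM_ITERATIONS):
    # forward pass: text -> bot_response -> label
    samples = sample_from_dataset(dataset)
    responses = [ask_haiku(model_parameters, text) for text, _ in samples]
    predictions = [extract_label(response) for response in responses]

    # collect the mistakes, i.e. samples where the predicted label is wrong
    mistakes = [
        (text, label, response)
        for (text, label), response, predicted
        in zip(samples, responses, predictions)
        if predicted != label
    ]

    # gradient calculation and parameter update
    mistake_summary = summarize_mistakes(sample_from_mistakes(mistakes))
    model_parameters = update_model_parameters(model_parameters, mistake_summary)
```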

This notebook will also produce
- The [classification](https://tonghuikang.github.io/automatic-prompt-engineer/html_output/iteration-classification-002.html) (or just the [mistakes](https://tonghuikang.github.io/automatic-prompt-engineer/html_output/iteration-classification-002-diff.html)) at each iteration of the prompt.
- The [history](https://tonghuikang.github.io/automatic-prompt-engineer/html_output/prompt-history-classification.html) of the prompt and relevant metrics.
- (These will be saved locally as HTML files)


# References

I took inspiration from these works.

- [DSPy](https://dspy-docs.vercel.app/docs/building-blocks/solving_your_task) for describing how tuning a prompt engineering pipeline mirrors tuning the parameters of a neural network.
- [Matt Shumer](https://twitter.com/mattshumer_/status/1770942240191373770) for showing that Opus is a very good prompt engineer, and Haiku is sufficiently good at following instructions.


# Design Decisions

- I require the LLM to produce the reasoning, and I have a separate function to extract the predicted label (see the sketch after this list).
Having the reasoning provides visibility into the thought process, which helps with improving the prompt.
- I minimized the number of required packages.
As of this commit, you only need to install the `scikit-learn`, `pandas`, and `anthropic` Python libraries.
If you want to use this in a restricted environment, there are fewer packages to audit.
- I chose visibility over abstraction.
There is only one Python notebook with no helper Python files.
You can easily find the individual functions and edit them to change how prompt tuning is done.
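
As an example of the first decision, label extraction can be a few lines of plain Python. This is a minimal sketch; the notebook's actual implementation is `predict_from_final_layer`, which may differ in detail.

```python
def extract_label(bot_response: str) -> str:
    # the reasoning stays visible in bot_response; only the label is extracted
    # check "insincere" first, since "sincere" is a substring of "insincere"
    if "insincere" in bot_response.lower():
        return "insincere"
    return "sincere"
```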


# Exercises

### Run this notebook

Just add your Anthropic API key and you should be able to run `classification.ipynb`.

You will need to clone the repository for a copy of `qiqc_truncated.csv`.

It will cost approximately 40 cents per iteration, and `NUM_ITERATIONS` is 5.

You may need to change `NUM_PARALLEL_FORWARD_PASS_API_CALLS` depending on the rate limit your API key can afford.
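
For reference, the relevant knobs look something like this. The parallelism value shown is a placeholder, not the notebook's default.

```python
NUM_ITERATIONS = 5                       # roughly 40 cents per iteration, so about $2 in total
NUM_PARALLEL_FORWARD_PASS_API_CALLS = 4  # placeholder - lower this if you hit rate limits
```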


### Change the dataset

As you can see, the labeling of whether a question is insincere is inconsistent.

There are some wrong labels in the dataset. You can either correct the labels or exclude those samples from the dataset.

I recommend initializing `dataset` in the notebook with a literal copy of around 50 samples, so that you can easily change a label or comment out a sample, as sketched below.
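
A minimal sketch of such a hand-curated `dataset` (the questions here are illustrative, not from the CSV):

```python
dataset = [
    ("How do I improve my English pronunciation?", "sincere"),
    ("Why are all politicians compulsive liars?", "insincere"),
    # ("Is drinking lemon water a cure for cancer?", "insincere"),  # questionable label, excluded
]
```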


### Change the sampling method

For the forward pass, we currently sample 100 positive and 100 negative samples.
For the mistake summary, we sample 10 false positives and false negatives, 5 true positives, and 5 true negatives.
Otherwise, the samples are chosen totally at random.

We can improve the sampling method so that the model learns better.

For example, you can
- retain the wrong samples for the next iteration (or more) to make sure they are now judged correctly
- inject canonical examples that you absolutely want classified correctly into every forward pass or mistake summary (see the sketch after this list)
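
A minimal sketch of such a sampling function, assuming `dataset` holds (text, label) pairs; `previous_mistakes` and `canonical_examples` are hypothetical lists that you would maintain yourself:

```python
import random

def sample_for_forward_pass(dataset, previous_mistakes, canonical_examples, k=200):
    # start with a random sample from the full dataset
    pool = random.sample(dataset, min(k, len(dataset)))
    # retain the previous iteration's mistakes so we can check they are now judged correctly
    pool.extend(previous_mistakes)
    # always inject the canonical examples that must be classified correctly
    pool.extend(canonical_examples)
    random.shuffle(pool)
    return pool
```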


### Add notes for the classifier

Instead of the dataset being just (sample, label), we can have (sample, label, note) instead.

In the note, you record the reason why the sample is classified in a certain way.
You do not need to annotate every sample with a note.

The note can be used as an input to gradient calculation and parameter updates, so that the sample is classified for the correct reasons.
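
A minimal sketch, with illustrative samples (`None` marks a sample without a note):

```python
dataset = [
    ("Why do you people keep denying the obvious truth?", "insincere",
     "rhetorical and accusatory rather than information-seeking"),
    ("What is the capital of Australia?", "sincere", None),
]
```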


### Get Opus to put words in Claude's mouth

Claude allows you to specify the prefix of the response (see [Putting words in Claude's mouth](https://docs.anthropic.com/claude/reference/messages-examples#putting-words-in-claudes-mouth)).

Besides `system_prompt`, `user_prompt_prefix`, and `user_prompt_suffix`, you can also add `bot_reply_prefix` as a field that Opus should produce.

You will need to update the `user_message` in `update_model_parameters`.
You also need to describe in `PROMPT_UPDATE_SYSTEM_PROMPT` what `bot_reply_prefix` is for.
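
A minimal sketch of how the forward pass would use the prefix, assuming `bot_reply_prefix` has been added to `model_parameters`, and that `client` and `text` are defined as in the notebook:

```python
message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    system=model_parameters["system_prompt"],
    messages=[
        {"role": "user", "content": (
            model_parameters["user_prompt_prefix"]
            + text
            + model_parameters["user_prompt_suffix"]
        )},
        # the final assistant turn is treated as a prefix; Claude continues from it
        {"role": "assistant", "content": model_parameters["bot_reply_prefix"]},
    ],
)
bot_response = model_parameters["bot_reply_prefix"] + message.content[0].text
```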


### Try a different classification task

You can try out a different dataset.

Another dataset I recommend is the [Quora Question Pairs dataset](https://www.kaggle.com/c/quora-question-pairs/data).

You will need to
- Change how the dataset is being loaded, as sketched after this list (since each sample is now a pair of questions, you need to change the delimiters)
- Change `predict_from_final_layer` to match a different string
- Change `PROMPT_UPDATE_SYSTEM_PROMPT` to describe what is being matched
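
A sketch of the loading step, assuming `train.csv` from the Kaggle competition is available locally; the XML-style delimiters are one possible choice:

```python
import pandas as pd

df = pd.read_csv("train.csv")
dataset = list(zip(
    # delimit the two questions so Haiku can tell them apart
    "<question_1>" + df["question1"].astype(str) + "</question_1>\n"
    + "<question_2>" + df["question2"].astype(str) + "</question_2>",
    df["is_duplicate"].map({0: "different", 1: "duplicate"}),
))
```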


### Try a non-classification task

On a classification task, you are either correct or wrong.
For non-classification tasks, it is more difficult to evaluate how good your output is.
You need to think of how to evaluate whether the model is making a mistake, and how to update the prompts.
I think it is still useful to keep most of the structure of the notebook.
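
One way to evaluate a non-classification task is to ask Opus to grade the output against a rubric. This is a sketch under that assumption; the rubric and scoring scheme are not part of the original notebook.

```python
import re

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate_output(task_input: str, bot_response: str) -> int:
    user_message = (
        f"<input>{task_input}</input>\n"
        f"<output>{bot_response}</output>\n"
        "Rate the output from 1 (poor) to 5 (excellent). "
        "Reply with the rating inside <rating></rating> tags."
    )
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        messages=[{"role": "user", "content": user_message}],
    )
    match = re.search(r"<rating>(\d)</rating>", message.content[0].text)
    return int(match.group(1)) if match else 0
```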
115 changes: 96 additions & 19 deletions classification.ipynb
@@ -18,23 +18,26 @@
"- the user prompt prefix\n",
"- the user prompt suffix\n",
"\n",
"To use this notebook, you will need\n",
"You can simply run this notebook with just\n",
"- an Anthropic API key\n",
"- a dataset (text -> label)\n",
"\n",
"If you want to change the classification task, you will need to\n",
"- provide a dataset (text -> label)\n",
"- define the function bot_response -> label\n",
"- description for Opus the expected bot_response that Haiku should produce\n",
"- description for Opus on what instructions Haiku should follow\n",
"\n",
"This is how prompt tuning is done\n",
"- Sample from the full dataset.\n",
"- Haiku takes in (system prompt, user prompt prefix + text + user prompt suffix) and produces bot_response.\n",
"- Haiku takes in (system prompt, user prompt prefix + text + user prompt suffix) and produces bot_response (the final layer values).\n",
"- The function takes in bot_response and produces the label. The (text -> label) process is analogous to the forward pass.\n",
"- Sample from the mistakes.\n",
"- Opus takes in the mistakes and summarizes the mistakes (gradient).\n",
"- Opus takes in the mistake summary (gradient) and the current prompts (model parameters) updates the prompts.\n",
"- Opus takes in the mistakes and summarizes the mistakes (gradient calculation).\n",
"- Opus takes in the mistake summary (gradient) and the current prompts (model parameters) update the prompts.\n",
"- Repeat.\n",
"\n",
"You will need to have these Python modules installed\n",
"- pandas\n",
"- scikit-learn\n",
"- anthropic"
]
},
@@ -82,7 +85,7 @@
{
"cell_type": "code",
"execution_count": 3,
"id": "b71cce29",
"id": "e7c9b939",
"metadata": {},
"outputs": [
{
@@ -153,9 +156,9 @@
"outputs": [],
"source": [
"df = pd.concat([\n",
" df[df[\"target\"] == 1].sample(100),\n",
" df[df[\"target\"] == 0].sample(100),\n",
"], ignore_index=True).sample(frac=1)\n",
" df[df[\"target\"] == 1].sample(100, random_state=42),\n",
" df[df[\"target\"] == 0].sample(100, random_state=42),\n",
"], ignore_index=True).sample(frac=1, random_state=0)\n",
"\n",
"# you can also just define the dataset with code\n",
"dataset = list(zip(df[\"question_text\"], df[\"target\"].map({0: \"sincere\", 1: \"insincere\"})))"
@@ -164,7 +167,7 @@
{
"cell_type": "code",
"execution_count": 7,
"id": "139792ef",
"id": "53e4e290",
"metadata": {},
"outputs": [
{
@@ -180,7 +183,7 @@
],
"source": [
"# make sure the number of types of labels is small\n",
"# prefer descriptive labels\n",
"# prefer descriptive labels to avoid giving the model mental gymnastics\n",
"collections.Counter(label for _, label in dataset)"
]
},
@@ -193,7 +196,7 @@
{
"data": {
"text/plain": [
"('Is it generally acceptable if someone grew up in the USA and considers Mediterranean Europeans like French or Spaniards not white, but rather black, because they are not Germanic like Dutch or Scandinavian people?',\n",
"('Can women cut off mens Penises and cook them as sauages to eat them?',\n",
" 'insincere')"
]
},
@@ -241,7 +244,7 @@
"\n",
"The LLM will take the following input\n",
"- system_prompt\n",
"- user_prompt_prefix + question + user\n",
"- user_prompt_prefix + question + user_prompt_suffix\n",
"\n",
"The LLM is expected to produce the following output\n",
"- reasoning on whether the question is insincere\n",
@@ -261,12 +264,21 @@
"metadata": {},
"outputs": [],
"source": [
"# usually Opus is good enough to produce working prompts\n",
"# usually Opus is good enough to produce working prompts from nothing\n",
"model_parameters = {\n",
" \"system_prompt\": \"\",\n",
" \"user_prompt_prefix\": \"\",\n",
" \"user_prompt_suffix\": \"\",\n",
"}"
"}\n",
"\n",
"token_counts = {\n",
" \"haiku_input\": 0,\n",
" \"sonnet_input\": 0,\n",
" \"opus_input\": 0,\n",
" \"haiku_output\": 0,\n",
" \"sonnet_output\": 0,\n",
" \"opus_output\": 0,\n",
"} # ideally this should have been tracked in anthropic.Anthropic"
]
},
{
@@ -319,6 +331,8 @@
" messages=[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": user_message}]}],\n",
" timeout=10\n",
" )\n",
" token_counts[\"haiku_input\"] += message.usage.input_tokens\n",
" token_counts[\"haiku_output\"] += message.usage.output_tokens\n",
"\n",
" return message.content[0].text"
]
@@ -398,7 +412,9 @@
" system=system_message,\n",
" messages=[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": user_message}]}]\n",
" )\n",
" \n",
" token_counts[\"opus_input\"] += message.usage.input_tokens\n",
" token_counts[\"opus_output\"] += message.usage.output_tokens\n",
"\n",
" return message.content[0].text"
]
},
@@ -456,7 +472,9 @@
" system=system_message,\n",
" messages=[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": user_message}]}],\n",
" )\n",
" \n",
" token_counts[\"opus_input\"] += message.usage.input_tokens\n",
" token_counts[\"opus_output\"] += message.usage.output_tokens\n",
"\n",
" bot_message = message.content[0].text\n",
"\n",
" match_system_prompt = re.search(r'<system_prompt>(.*?)</system_prompt>', bot_message, re.DOTALL)\n",
@@ -811,10 +829,69 @@
" save_and_display_current_iteration(iteration_idx, samples, final_layer_values, predicted_labels, actual_labels)"
]
},
{
"cell_type": "markdown",
"id": "a188d56b",
"metadata": {},
"source": [
"# Cost tracking"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "9154d3ab",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'haiku_input': 335600,\n",
" 'sonnet_input': 0,\n",
" 'opus_input': 49812,\n",
" 'haiku_output': 211791,\n",
" 'sonnet_output': 0,\n",
" 'opus_output': 5612}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"token_counts"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "e509034e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.51671875"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cost_in_dollar = (\n",
" token_counts[\"haiku_input\"] * 0.25 + token_counts[\"sonnet_input\"] * 3 + token_counts[\"opus_input\"] * 15\n",
" + token_counts[\"haiku_output\"] * 1.25 + token_counts[\"sonnet_output\"] * 15 + token_counts[\"opus_output\"] * 75\n",
") / 1_000_000\n",
"cost_in_dollar"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e526210a",
"id": "7f28fade",
"metadata": {},
"outputs": [],
"source": []