worse than random lol

tonghuikang · Sep 17, 2024 · db0c852
1 parent 63a350a
commit db0c852

Showing 12 changed files with 3,177 additions and 3,183 deletions.
diff --git a/classification.ipynb b/classification.ipynb
@@ -140,8 +140,8 @@
    "source": [
     "NUM_PARALLEL_FORWARD_PASS_API_CALLS = 100  # see https://docs.anthropic.com/claude/reference/rate-limits\n",
     "NUM_SAMPLES_FORWARD_PASS_FOR_EACH_LABEL = 100\n",
-    "NUM_SAMPLES_MISTAKE_GRADIENT_CALCULATION_FOR_EACH_LABEL = 10\n",
-    "NUM_SAMPLES_CORRECT_GRADIENT_CALCULATION_FOR_EACH_LABEL = 5\n",
+    "NUM_SAMPLES_MISTAKE_GRADIENT_CALCULATION_FOR_EACH_LABEL = 20\n",
+    "NUM_SAMPLES_CORRECT_GRADIENT_CALCULATION_FOR_EACH_LABEL = 10\n",
     "NUM_ITERATIONS = 5"
    ]
   },
@@ -164,7 +164,7 @@
    "source": [
     "tsv = \"\"\"\n",
     "What are some examples of sorting algorithms that require more conditional statements?\tTRUE\n",
-    "Can you elaborate on how radix sort and counting sort functions work?\tFALSE\n",
+    "Can you elaborate on how radix sort and counting sort functions work?\tTRUE\n",
     "What are some specialized sorting algorithms?\tTRUE\n",
     "Could you explain why sorting is equivalent to discovering the permutation of items?\tTRUE\n",
     "What are some real-world applications of sorting algorithms?\tTRUE\n",
@@ -178,14 +178,13 @@
     "How might this \"good kid\" syndrome affect Japanese society in the long term?\tFALSE\n",
     "Do you think DeSantis would have been a good president?\tTRUE\n",
     "Do you think the sympathy wave would have been enough to secure a win for DeSantis?\tFALSE\n",
-    "Do you think Biden would have run for reelection if Trump had still been alive?\tFALSE\n",
     "What are some other factors that you think would have impacted the election outcome?\tFALSE\n",
     "Why do you think it's useful to think of dying as a process rather than an event?\tTRUE\n",
     "What other human cell types have a low metabolic rate?\tFALSE\n",
     "How long does it take for muscle and skin cells to die after circulation stops?\tTRUE\n",
     "What are some other factors that forensic scientists use to estimate time of death?\tFALSE\n",
     "What is the most surprising thing about how the human body decomposes?\tTRUE\n",
-    "What are some examples of how apathy can affect a project?\tFALSE\n",
+    "What are some examples of how apathy can affect a project?\tTRUE\n",
     "What is the trade-off between shipping updates quickly and releasing quality software?\tTRUE\n",
     "Is this apathy issue specific to software development, or is it a broader problem?\tFALSE\n",
     "What are the average costs associated with burying someone in a standard casket compared to an extra-large casket?\tTRUE\n",
@@ -194,30 +193,28 @@
     "Do you think the price of clothing for larger people is unfairly inflated?\tTRUE\n",
     "How does this perspective hold up when looking at specific Fighting-type Pokémon that are not typically heroic?\tFALSE\n",
     "Does the Fairy type fill the role of a \"Light\" type?\tFALSE\n",
-    "What do your thoughts about the \"Light-type\" concept say about the development of the Pokémon world and its themes?\tFALSE\n",
     "What are some other positive qualities besides fighting and righteousness that the Fighting type embodies?\tFALSE\n",
     "Why do you think it's important to condemn violence in politics?\tTRUE\n",
     "What do you mean by \"Trumpism would turn from a political movement to a religion\"?\tFALSE\n",
     "Why do you hope the perpetrator was not a Democrat?\tFALSE\n",
-    "How does obtaining guns easily make it easier to take a shot at a president?\tFALSE\n",
+    "How does obtaining guns easily make it easier to take a shot at a president?\tTRUE\n",
     "What are the most important factors when judging the success of a prime minister?\tTRUE\n",
     "What was it like living through the time of Harold Wilson?\tTRUE\n",
     "Did your dad ever explain why he acted that way?\tFALSE\n",
     "Do you think this experience changed your relationship with your father?\tFALSE\n",
     "How do you feel about your father now?\tFALSE\n",
     "Is it common for fathers to act this way?\tFALSE\n",
     "What's the most thoughtful thing your dad ever did for you?\tTRUE\n",
-    "What are some benefits you received during your deployment?\tFALSE\n",
+    "What are some benefits you received during your deployment?\tTRUE\n",
     "What were some of the challenges of living on a remote combat outpost in Afghanistan?\tTRUE\n",
-    "How did you manage your money during your deployment?\tFALSE\n",
+    "How did you manage your money during your deployment?\tTRUE\n",
     "Do other cultures use stars and constellations to represent concepts like the afterlife?\tFALSE\n",
     "What are some names for Canis Minor in Sanskrit?\tTRUE\n",
     "Are there other examples of using celestial figures as a way to navigate or remember things?\tFALSE\n",
-    "Is there any evidence that ancient Hindus actually used these constellations for navigating the path of the departed souls?\tTRUE\n",
+    "Is there any evidence that ancient Hindus actually used these constellations for navigating the path of the departed souls?\tFALSE\n",
     "Does the concept of Pitriloka and the path of the departed souls play a significant role in modern Hinduism?\tTRUE\n",
     "What are the implications of the current situation between the US and China?\tFALSE\n",
     "Why does China not need nuclear-fueled aircraft carriers?\tTRUE\n",
-    "How is the U.S. Navy adapting?\tTRUE\n",
     "What is the strategy behind China's approach of building smaller ships?\tTRUE\n",
     "Did you think the employees at the first Lowe's thought you were going to spend a lot of money and just didn’t want to deal with it?\tFALSE\n",
     "Did you use financing for all $12,000?\tFALSE\n",
@@ -229,7 +226,6 @@
     "What could you do to help your son feel more comfortable expressing his emotions?\tTRUE\n",
     "What are the societal pressures that contribute to men being discouraged from expressing their emotions?\tTRUE\n",
     "What are your thoughts on the idea that men should \"toughen up\" and not cry?\tTRUE\n",
-    "How have expectations of masculinity evolved over time?\tFALSE\n",
     "What does this story say about the nature of desire?\tFALSE\n",
     "What are some other stories of divine play (leela) in Hinduism?\tFALSE\n",
     "What are some other examples of the \"divine play\" in Hindu mythology?\tFALSE\n",
@@ -239,8 +235,6 @@
     "Why don't Omaha Steaks show photos of their raw steaks?\tTRUE\n",
     "What other information would you need to feel comfortable purchasing from Omaha Steaks?\tFALSE\n",
     "What is the difference between welfare and food stamps?\tTRUE\n",
-    "Why do you believe that no able-bodied person, including Americans, receives food stamps?\tTRUE\n",
-    "Is there any evidence to support your claim that undocumented immigrants do not receive government assistance?\tTRUE\n",
     "What are some ways companies can create a more flexible work environment?\tTRUE\n",
     "What are some ways employees can communicate effectively with their teams about their schedule?\tTRUE\n",
     "How has the importance of being present in the office changed in recent years?\tTRUE\n",
@@ -249,7 +243,7 @@
     "What are some examples of children who exhibit similar behavior?\tFALSE\n",
     "Is there any connection between E.'s behavior and his early childhood experiences?\tFALSE\n",
     "What are the long-term effects of \"Defiance and Anger Management Disorder\"?\tFALSE\n",
-    "What are some of the challenges of dealing with children with \"Defiance and Anger Management Disorder\"?\tTRUE\n",
+    "What are some of the challenges of dealing with children with \"Defiance and Anger Management Disorder\"?\tFALSE\n",
     "\"\"\".strip()"
    ]
   },
@@ -261,7 +255,7 @@
    "outputs": [],
    "source": [
     "dataset = [row.split('\\t') for row in tsv.split(\"\\n\")]\n",
-    "dataset = [(text, \"good_question\" if label == \"TRUE\" else \"bad_question\") for text, label in dataset]"
+    "dataset = [(text, \"0\" if label == \"TRUE\" else \"1\") for text, label in dataset]"
    ]
   },
   {
@@ -286,7 +280,7 @@
     {
      "data": {
       "text/plain": [
-       "Counter({'good_question': 48, 'bad_question': 39})"
+       "Counter({'0': 48, '1': 33})"
       ]
      },
      "execution_count": 9,
@@ -309,7 +303,7 @@
      "data": {
       "text/plain": [
        "('What are some examples of sorting algorithms that require more conditional statements?',\n",
-       " 'good_question')"
+       " '0')"
       ]
      },
      "execution_count": 10,
@@ -486,12 +480,12 @@
     "    user_message = textwrap.dedent(\n",
     "        f\"\"\"\n",
     "        You are given\n",
-    "        - a set of (text, model response, extracted label, expected label)\n",
-    "            - extracted label may be None if it is not found in the model response\n",
+    "        - a set of (text, model response, predicted label, correct label)\n",
+    "            - predicted label may be None if it is not found in the model response\n",
     "        - the current set of prompts (which may be empty) for an LLM\n",
     "\n",
-    "        You will improve the prompts so that the LLM will predict the expected label.\n",
-    "        You might need to guess what each label means.\n",
+    "        You will improve the prompts so that the LLM will predict the correct label.\n",
+    "        Please spend some time to think what each label means, based on the examples.\n",
     "\n",
     "        The LLM input has the following parameters.\n",
     "        - {list(model_parameters.keys())}\n",
@@ -507,11 +501,11 @@
     "        - The LLM should only classify the text. The LLM should not respond to the text or decline classifying the text.\n",
     "        - The LLM should provide a concise reasoning. The reasoning should happen before the label.\n",
     "        - The response should always end with Label: <label>{{label}}</label>\n",
-    "            - Note that the label needs exactly the text in the expected label\n",
+    "            - Note that the label needs exactly the text in the correct label\n",
     "            - Note that you need the html tags\n",
     "        \n",
     "        The current metrics is {str(metrics)}.\n",
-    "        Fix the worst performing metric.\n",
+    "        If the metrics is especially bad (i.e. correctness is less than random), you likely need to rethink the interpretation of the labels.\n",
     "        \"\"\"\n",
     "    ) + \"\\n\\n\\n\"\n",
     "    \n",
@@ -526,7 +520,7 @@
     "                continue\n",
     "            correct_counts[correct_label] += 1\n",
     "        elif predicted_label not in correct_labels_set:\n",
-    "            correctness_verdict = \"This label could not be extracted or does not belong to one of the actual labels.\"\n",
+    "            correctness_verdict = \"This predicted label could not be extracted or does not belong to one of the actual labels.\"\n",
     "            if mistake_counts[None] > NUM_SAMPLES_MISTAKE_GRADIENT_CALCULATION_FOR_EACH_LABEL:\n",
     "                continue\n",
     "            mistake_counts[None] += 1\n",
@@ -565,14 +559,10 @@
     "    user_message += textwrap.dedent(f\"\"\"\n",
     "    Your reply (not the LLM you are tuning prompts for) should include the following\n",
     "    \n",
-    "    A summary of the current performance\n",
-    "    \n",
-    "    The key mistakes observed with some examples\n",
-    "    \n",
-    "    What are the proposed changes the prompt\n",
-    "    \n",
+    "    Your informed interpretation of the labels\n",
+    "        \n",
     "    The prompt parameters in the following format within the xml tags\n",
-    "    (please make sure each prompt parameter has some meaningful content)\n",
+    "    (please make sure each prompt parameter has some meaningful content)    \n",
     "    \"\"\") + \"\\n\\n\"\n",
     "\n",
     "    for model_parameter_key in model_parameters.keys():\n",
@@ -581,11 +571,13 @@
     "        the new {model_parameter_key} here\n",
     "        </{model_parameter_key}>\n",
     "        \"\"\") + \"\\n\\n\"        \n",
-    "    \n",
+    "        \n",
+    "    user_message += \"\\n\\nPlease spend time to check that your improved prompts will actually fix the mistakes.\"\n",
+    "        \n",
     "    conversation_history.append({\"role\": \"user\", \"content\": user_message})\n",
     "    \n",
     "    response = openai_client.chat.completions.create(\n",
-    "        model=\"o1-mini\",\n",
+    "        model=\"o1-preview\",\n",
     "        messages=conversation_history\n",
     "    )\n",
     "    \n",
@@ -625,11 +617,6 @@
     "    metrics = {}\n",
     "    correct_labels_set = set(correct_labels)\n",
     "    for label in sorted(correct_labels_set):\n",
-    "#         metrics[f\"{label}_precision\"] = precision_score(\n",
-    "#             [correct_label == label for correct_label in correct_labels],\n",
-    "#             [predicted_label == label for predicted_label in predicted_labels],\n",
-    "#             zero_division = 0,\n",
-    "#         )\n",
     "        metrics[f\"{label}_recall\"] = recall_score(\n",
     "            [correct_label == label for correct_label in correct_labels],\n",
     "            [predicted_label == label for predicted_label in predicted_labels],\n",
@@ -1019,10 +1006,10 @@
      "data": {
       "text/plain": [
        "defaultdict(int,\n",
-       "            {'haiku_input': 206369,\n",
-       "             'haiku_output': 33387,\n",
-       "             'o1_input': 30113,\n",
-       "             'o1_output': 7870})"
+       "            {'haiku_input': 240596,\n",
+       "             'haiku_output': 20808,\n",
+       "             'o1_input': 37424,\n",
+       "             'o1_output': 31341})"
       ]
      },
      "execution_count": 20,
@@ -1044,7 +1031,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Cost: $0.30\n"
+      "Cost: $0.67\n"
      ]
     }
    ],
@@ -1055,7 +1042,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "896e80eb",
+   "id": "0434c634",
    "metadata": {},
    "outputs": [],
    "source": []