
Adds a framework for running DeepEval unit tests #714

Merged · 17 commits merged into main from feature/eval-unit · Jul 4, 2024
Conversation

@wpfl-dbt (Collaborator) commented Jul 2, 2024

Context

We need to be able to run DeepEval test cases as pass/fail tests to catch regressions. This PR adds the unit testing framework to do this. See REDBOX-425.

This PR will produce this kind of output from make test-ai:

core_api/tests/test_ai.py::test_contextual_precision[level1_lang_qa] PASSED                                          [  3%]
core_api/tests/test_ai.py::test_contextual_precision[level2_lang_qa] FAILED                                          [  6%]
core_api/tests/test_ai.py::test_contextual_precision[level1_math_qa] PASSED                                          [ 10%]
core_api/tests/test_ai.py::test_contextual_precision[level2_math_qa] PASSED                                          [ 13%]
core_api/tests/test_ai.py::test_contextual_precision[level3_math_qa] PASSED                                          [ 16%]
core_api/tests/test_ai.py::test_contextual_recall[level1_lang_qa] PASSED                                             [ 20%]
core_api/tests/test_ai.py::test_contextual_recall[level2_lang_qa] FAILED                                             [ 23%]
core_api/tests/test_ai.py::test_contextual_recall[level1_math_qa] PASSED                                             [ 26%]
core_api/tests/test_ai.py::test_contextual_recall[level2_math_qa] FAILED                                             [ 30%]
FAILED core_api/tests/test_ai.py::test_contextual_precision[level2_lang_qa] - AssertionError: Metrics: Contextual Precision (score: 0.25, threshold: 0.5, strict: False, error: None) failed.
FAILED core_api/tests/test_ai.py::test_contextual_recall[level2_lang_qa] - AssertionError: Metrics: Contextual Recall (score: 0.0, threshold: 0.5, strict: False, error: None) failed.

It will also produce a results table in GitHub Actions.

Why this evaluation data?

This evaluation data represents a break from previous paradigms. The aim is to have a single test per "user story" that will test some specific capability of our AI system. Examples of things we might test are:

  • Can we find and regurgitate a fact semi-verbatim?
  • Can we find a fact using synonyms and repeat it?
  • Can we find a number?
  • Can we add and subtract numbers?
  • Can we find a date?
  • Can we add and subtract dates?

I hope that building this list can be a collaborative effort by UR and DS.

Currently this PR is more about the shape of this data than about testing anything useful -- yet.

⚠️ Important information

  • This puts data in the repo

Changes proposed in this pull request

  • Adds test_ai.py to the core API to run the DeepEval unit tests (an illustrative sketch of such a test follows this list)
  • Adds data to the repo to run these tests
  • Adds a pytest ai marker which is used to deselect these tests in the regular make command
  • Adds a new test-ai make command
  • Adds a workflow to run these tests
  • Session-scopes some core fixtures
  • Adds some parameterisation to the creation of objects in dependencies so they can be imported more easily for testing
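
For illustration, a minimal sketch of what one of these DeepEval unit tests could look like. The case content, names, and threshold below are invented for the example; the real test_ai.py parametrises its cases from the evaluation data added in this PR:

import pytest
from deepeval import assert_test
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase


# Hypothetical example case; the real tests are parametrised over cases
# loaded from the evaluation data in the repo.
@pytest.mark.ai  # selected by `make test-ai`; the regular run deselects it
def test_contextual_precision_example():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        expected_output="Paris",
        retrieval_context=["Paris is the capital and largest city of France."],
    )
    # The metric scores the case with an LLM judge and fails below its threshold.
    assert_test(test_case, [ContextualPrecisionMetric(threshold=0.5)])

The ai marker itself has to be registered in the pytest configuration (for example under [tool.pytest.ini_options] markers in pyproject.toml) so the regular test target can deselect it with -m "not ai"; that detail is an assumption here rather than something copied from the diff.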

Guidance to review

  • Are you happy that I've added data to the repo?
  • Are you happy I've added tests that will regularly fail?
  • make test-ai
  • Check the workflow -- failing but running

Relevant links

Things to check

  • I have added any new ENV vars in all deployed environments
  • I have tested any code added or changed
  • I have run integration tests -- failing, but not for anything I've touched

@wpfl-dbt marked this pull request as draft July 2, 2024 17:28
@wpfl-dbt marked this pull request as ready for review July 3, 2024 12:28
from core_api.src.dependencies import get_llm, get_parameterised_retriever, get_tokeniser
from redbox.models.chain import ChainInput

if TYPE_CHECKING:
Collaborator: do we need to test the tests?

Collaborator Author: I don't follow?

Collaborator: why bother running mypy on test code?

Collaborator Author: This was to appease ruff and my IDE

Collaborator: fair

from uuid import UUID

import jsonlines
import pandas as pd
Collaborator: I don't see pandas in any pyproject?

Collaborator Author: Added to dev

Collaborator Author: And removed again lol

.env.test Outdated
@@ -1,7 +1,7 @@
# === LLM ===

ANTHROPIC_API_KEY=
OPENAI_API_KEY=
# ANTHROPIC_API_KEY=
Collaborator: let's just delete these

@@ -50,7 +50,73 @@ jobs:

docker compose up -d --wait elasticsearch
poetry install --no-root --no-ansi --with dev,ai,api --without worker
poetry run download-model --embedding_model all-mpnet-base-v2
poetry run python download_embedder.py --embedding_model all-mpnet-base-v2
Collaborator: 👍

.PHONY: test-ai
test-ai: ## Test code with live LLM
	poetry install --no-root --no-ansi --with api,dev,ai --without worker,docs
	poetry run pytest core_api/tests -m "ai" -vv
Collaborator: 👍

if len(user_uuids) > 1:
    msg = "Embeddings have more than one creator_user_uuid"
    raise ValueError(msg)
else:
Collaborator:
Suggested change: drop the else: line

nitpicking

@@ -0,0 +1,7 @@
user_story,id,notes,input,context,expected_output
Collaborator @gecBurton commented Jul 4, 2024:
sorry to be a PITA but this is one of those things I get weird about, please could we encode this as just pure Python because:

  1. CSV is a never-ending nightmare; for example, I see you are encoding structured objects like lists, and this goes wrong all the time
  2. you only have seven records, so there is no significant readability benefit from using CSV
  3. pandas adds a lot of bloat, and I'm on a mission to cut this out
  4. we can drop your code that turns CSV into Python

Collaborator Author @wpfl-dbt commented Jul 4, 2024:
My challenge is: I'd like to do this in a format that URs can write for us, and that isn't going to be Python. Do you have any recommendations?

I can remove pandas for csvreader?

Collaborator @gecBurton commented Jul 4, 2024:
json? or csv-json, i.e.:

"name", "age", "address"
"Larry-The-Cat", 6, ["10 Downing Street", "London", "SW1A 2AA"]

this way you can parse it like:

import json


def parse_row(text):
    return json.loads(f"[{text}]")

def parse_file(file):
    rows = map(parse_row, file)
    header = next(rows)
    for row in rows:
        yield dict(zip(header, row))


with open("my-data.csv") as f:
    for item in parse_file(f):
        print(item)

Collaborator Author:
I've changed to JSON and removed pandas.
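
(For readers following along, a rough sketch of what a pandas-free loader for such a JSON case file might look like; the file name and exact field names here are assumptions based on the CSV header earlier in this thread, not copied from the final code:)

import json
from pathlib import Path


def load_cases(path: Path) -> list[dict]:
    # Assumes the case file is a plain JSON array of objects with keys like
    # user_story, id, notes, input, context, expected_output.
    with path.open() as f:
        return json.load(f)


for case in load_cases(Path("evaluation_cases.json")):  # illustrative path
    print(case["user_story"], case["input"])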

Collaborator @gecBurton left a comment:

Basically good. I like this; we should do more.

I like the way you have broken these out into their own kind of test.

Please could you placate my phobia of CSVs?

Are you happy that I've added data to the repo?

yes, you have clearly marked this as test data

Are you happy I've added tests that will regularly fail?

yes, but please disable this in GH for now because I don't want people to get in the habit of thinking that test failure alerts are something they can ignore

make test-ai

this runs but doesn't pass for me, I get:

core_api/tests/test_ai.py::test_rag[how-can-researchers-and-scienc] PASSED                                                                                                                          [ 10%]
core_api/tests/test_ai.py::test_rag[how-does-research-and-innovati] PASSED                                                                                                                          [ 20%]
core_api/tests/test_ai.py::test_rag[how-are-research-outputs-class] FAILED                                                                                                                          [ 30%]
core_api/tests/test_ai.py::test_rag[how-does-public-engagement-wit] PASSED                                                                                                                          [ 40%]
core_api/tests/test_ai.py::test_rag[how-can-policymakers-maintain] PASSED                                                                                                                           [ 50%]
core_api/tests/test_ai.py::test_rag[how-can-trust-in-science-for-p] PASSED                                                                                                                          [ 60%]
core_api/tests/test_ai.py::test_rag[how-do-policy-features-affect] PASSED                                                                                                                           [ 70%]
core_api/tests/test_ai.py::test_rag[how-does-ref-impact-research-q] PASSED                                                                                                                          [ 80%]
core_api/tests/test_ai.py::test_rag[how-challenging-is-it-to-creat] FAILED                                                                                                                          [ 90%]
core_api/tests/test_ai.py::test_rag[how-does-the-ref-evaluate-rese] FAILED                                                                                                                          [100%]Running teardown with pytest sessionfinish...

but I think that this is expected at this stage?

…oves to JSON from CSV for test data, turns off autorunning action while tests fail, minor edits from PR
embeddings: Path
test_cases: list[LLMTestCase] = []
Collaborator:
Suggested change
test_cases: list[LLMTestCase] = []
test_cases: list[LLMTestCase] = Field(default_factory=list)

Collaborator Author:
Done, but with a degree of shame as you've not only told me about this before but linked me to a helpful article on why you should never do it!
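
(Aside for readers: the pitfall the suggestion guards against is the classic shared mutable default in Python. A tiny illustration, assuming the model is a pydantic BaseModel as the Field suggestion implies:)

from pydantic import BaseModel, Field


class Example(BaseModel):
    # On a plain (non-pydantic) class, a bare `= []` class attribute is one
    # list shared by every instance; Field(default_factory=list) states the
    # intent explicitly and gives each instance its own fresh list.
    values: list[int] = Field(default_factory=list)


a, b = Example(), Example()
a.values.append(1)
print(b.values)  # [] -- b is unaffected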

Collaborator @gecBurton left a comment.

@wpfl-dbt merged commit a787195 into main on Jul 4, 2024
5 checks passed
@lmwilkigov deleted the feature/eval-unit branch July 16, 2024 15:55