Adds a framework for running DeepEval unit tests #714
Conversation
… data to repo as it's not big
from core_api.src.dependencies import get_llm, get_parameterised_retriever, get_tokeniser
from redbox.models.chain import ChainInput

if TYPE_CHECKING:
do we need to test the tests?
I don't follow?
why bother running mypy on test code?
This was to appease ruff and my IDE
fair
core_api/tests/test_ai.py
Outdated
from uuid import UUID

import jsonlines
import pandas as pd
I don't see pandas in any pyproject?
Added to dev
And removed again lol
.env.test
Outdated
@@ -1,7 +1,7 @@
# === LLM ===

ANTHROPIC_API_KEY=
OPENAI_API_KEY=
# ANTHROPIC_API_KEY=
let's just delete these
@@ -50,7 +50,73 @@ jobs:

docker compose up -d --wait elasticsearch
poetry install --no-root --no-ansi --with dev,ai,api --without worker
poetry run download-model --embedding_model all-mpnet-base-v2
poetry run python download_embedder.py --embedding_model all-mpnet-base-v2
👍
.PHONY: test-ai
test-ai: ## Test code with live LLM
	poetry install --no-root --no-ansi --with api,dev,ai --without worker,docs
	poetry run pytest core_api/tests -m "ai" -vv
👍
core_api/tests/test_ai.py
Outdated
if len(user_uuids) > 1:
    msg = "Embeddings have more than one creator_user_uuid"
    raise ValueError(msg)
else:
Suggested change:
-    else:

nitpicking
@@ -0,0 +1,7 @@
user_story,id,notes,input,context,expected_output
Sorry to be a PITA, but this is one of those things I get weird about. Please could we encode this as just pure Python, because:
- CSV is a never-ending nightmare; for example, I see you are encoding structured objects like lists, and this goes wrong all the time
- you only have seven records, so there is no significant readability benefit from using CSV
- pandas adds a lot of bloat, and I'm on a mission to cut this out
- we can drop your code that turns CSV into Python
My challenge is: I'd like to do this in a format that URs can write for us, and that isn't going to be Python. Do you have any recommendations?
I can swap pandas for csv.reader?
json? or csv-json, i.e.:
"name", "age", "address"
"Larry-The-Cat", 6, ["10 Downing Street", "London", "SW1A 2AA"]
this way you can parse it like:
import json

def parse_row(text):
    return json.loads(f"[{text}]")

def parse_file(file):
    rows = map(parse_row, file)
    header = next(rows)
    for row in rows:
        yield dict(zip(header, row))

with open("my-data.csv") as f:
    for item in parse_file(f):
        print(item)
I've changed to JSON and removed pandas.
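For illustration, a minimal sketch of how that JSON could be read with the stdlib alone; the filename and the assumption that each record keeps the old CSV fields (user_story, id, notes, input, context, expected_output) are mine, not necessarily what this PR ships:

import json
from pathlib import Path

def load_test_records(path: Path) -> list[dict]:
    # Plain json.load replaces the pandas CSV round-trip entirely
    with path.open() as f:
        return json.load(f)

# hypothetical filename used only for this example
for record in load_test_records(Path("rag_test_data.json")):
    print(record["input"], "->", record["expected_output"])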
basically good, I like this, we should do more.
I like the way you have broken these out into their own kind of test
please could you placate my phobia of CSVs.
Are you happy that I've added data to the repo?
yes, you have clearly marked this as test data
Are you happy I've added tests that will regularly fail?
yes, but please disable this in GH for now, because I don't want people to get in the habit of thinking that test failure alerts are something they can ignore
make test-ai
this runs but doesn't pass for me, I get:
core_api/tests/test_ai.py::test_rag[how-can-researchers-and-scienc] PASSED [ 10%]
core_api/tests/test_ai.py::test_rag[how-does-research-and-innovati] PASSED [ 20%]
core_api/tests/test_ai.py::test_rag[how-are-research-outputs-class] FAILED [ 30%]
core_api/tests/test_ai.py::test_rag[how-does-public-engagement-wit] PASSED [ 40%]
core_api/tests/test_ai.py::test_rag[how-can-policymakers-maintain] PASSED [ 50%]
core_api/tests/test_ai.py::test_rag[how-can-trust-in-science-for-p] PASSED [ 60%]
core_api/tests/test_ai.py::test_rag[how-do-policy-features-affect] PASSED [ 70%]
core_api/tests/test_ai.py::test_rag[how-does-ref-impact-research-q] PASSED [ 80%]
core_api/tests/test_ai.py::test_rag[how-challenging-is-it-to-creat] FAILED [ 90%]
core_api/tests/test_ai.py::test_rag[how-does-the-ref-evaluate-rese] FAILED [100%]
Running teardown with pytest sessionfinish...
but I think that this is expected at this stage?
…oves to JSON from CSV for test data, turns off autorunning action while tests fail, minor edits from PR
core_api/tests/test_ai.py
Outdated
embeddings: Path
test_cases: list[LLMTestCase] = []
Suggested change:
-    test_cases: list[LLMTestCase] = []
+    test_cases: list[LLMTestCase] = Field(default_factory=list)
Done, but with a degree of shame as you've not only told me about this before but linked me to a helpful article on why you should never do it!
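For anyone else reading this thread, a minimal sketch of the pitfall being avoided; the class names and the str element type are placeholders rather than the PR's real model, and the point carries over to the list[LLMTestCase] field above:

from pydantic import BaseModel, Field

class Shared:
    # Plain class-level mutable default: every instance reads the same list object,
    # so an append through one instance is visible through all of them.
    test_cases: list[str] = []

a, b = Shared(), Shared()
a.test_cases.append("case-1")
print(b.test_cases)  # ['case-1'], the list is shared

class PerInstance(BaseModel):
    # default_factory builds a fresh list for each instance instead.
    test_cases: list[str] = Field(default_factory=list)

c, d = PerInstance(), PerInstance()
c.test_cases.append("case-1")
print(d.test_cases)  # []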
Context
We need to be able to run DeepEval test cases as pass/fail tests to catch regressions. This PR adds the unit testing framework to do this. See REDBOX-425.
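To make the intent concrete, here is a rough sketch of what one of these pass/fail tests might look like; the question text, the stubbed get_rag_response helper, and the choice of AnswerRelevancyMetric with a 0.5 threshold are illustrative assumptions, not the PR's actual code:

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def get_rag_response(question: str) -> tuple[str, list[str]]:
    # Placeholder standing in for the real retrieval + LLM chain call
    return "stub answer", ["stub retrieved chunk"]

@pytest.mark.ai  # the new marker; make test-ai selects these with pytest -m "ai"
def test_rag_user_story():
    answer, sources = get_rag_response("How does the REF evaluate research?")
    test_case = LLMTestCase(
        input="How does the REF evaluate research?",
        actual_output=answer,
        retrieval_context=sources,
        expected_output="A short reference answer agreed with UR",
    )
    # assert_test fails the pytest test when the metric score falls below its threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.5)])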
This PR will produce this kind of output from make test-ai, and a table in Actions.
Why this evaluation data?
This evaluation data represents a break from previous paradigms. The aim is to have a single test per "user story" that will test some specific capability of our AI system. Examples of things we might test are:
I hope that building this list can be a collaborative effort by UR and DS.
Currently this PR is more about the shape of this data than about testing anything useful yet.
Changes proposed in this pull request
- test_ai.py to core API to run the DeepEval unit tests
- ai marker which deselects these tests in the make command
- test-ai make command
Guidance to review
make test-ai
Relevant links
Things to check