
Adds a framework for running DeepEval unit tests #714

Merged · 17 commits merged into main from feature/eval-unit · Jul 4, 2024
Conversation

@wpfl-dbt (Collaborator) commented Jul 2, 2024

Context

We need to be able to run DeepEval test cases as pass/fail tests to catch regressions. This PR adds the unit testing framework to do this. See REDBOX-425.

This PR will produce this kind of output from make test-ai:

core_api/tests/test_ai.py::test_contextual_precision[level1_lang_qa] PASSED                                          [  3%]
core_api/tests/test_ai.py::test_contextual_precision[level2_lang_qa] FAILED                                          [  6%]
core_api/tests/test_ai.py::test_contextual_precision[level1_math_qa] PASSED                                          [ 10%]
core_api/tests/test_ai.py::test_contextual_precision[level2_math_qa] PASSED                                          [ 13%]
core_api/tests/test_ai.py::test_contextual_precision[level3_math_qa] PASSED                                          [ 16%]
core_api/tests/test_ai.py::test_contextual_recall[level1_lang_qa] PASSED                                             [ 20%]
core_api/tests/test_ai.py::test_contextual_recall[level2_lang_qa] FAILED                                             [ 23%]
core_api/tests/test_ai.py::test_contextual_recall[level1_math_qa] PASSED                                             [ 26%]
core_api/tests/test_ai.py::test_contextual_recall[level2_math_qa] FAILED                                             [ 30%]
FAILED core_api/tests/test_ai.py::test_contextual_precision[level2_lang_qa] - AssertionError: Metrics: Contextual Precision (score: 0.25, threshold: 0.5, strict: False, error: None) failed.
FAILED core_api/tests/test_ai.py::test_contextual_recall[level2_lang_qa] - AssertionError: Metrics: Contextual Recall (score: 0.0, threshold: 0.5, strict: False, error: None) failed.

It will also produce a results table in GitHub Actions.

Why this evaluation data?

This evaluation data represents a break from previous paradigms. The aim is to have a single test per "user story" that will test some specific capability of our AI system. Examples of things we might test are:

  • Can we find and regurgitate a fact semi-verbatim?
  • Can we find a fact using synonyms and repeat it?
  • Can we find a number?
  • Can we add and subtract numbers?
  • Can we find a date?
  • Can we add and subtract dates?

I hope that building this list can be a collaborative effort by UR and DS.

Currently this PR is more about the shape of this data than about testing anything useful -- yet.

⚠️ Important information

  • This puts data in the repo

Changes proposed in this pull request

  • Adds test_ai.py to the core API to run the DeepEval unit tests (an illustrative sketch of such a test follows this list)
  • Adds data to the repo to run these tests
  • Adds a pytest ai marker which is used to deselect these tests in the regular make command
  • Adds a new test-ai make command
  • Adds a workflow to run these tests
  • Session-scopes some core fixtures
  • Adds some parameterisation to the creation of objects in dependencies so they can be imported more easily for testing
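
For illustration, a minimal sketch of what one of these DeepEval unit tests could look like. The case content, names, and threshold below are invented for the example; the real test_ai.py parametrises its cases from the evaluation data added in this PR:

import pytest
from deepeval import assert_test
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase


# Hypothetical example case; the real tests are parametrised over cases
# loaded from the evaluation data in the repo.
@pytest.mark.ai  # selected by `make test-ai`; the regular run deselects it
def test_contextual_precision_example():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        expected_output="Paris",
        retrieval_context=["Paris is the capital and largest city of France."],
    )
    # The metric scores the case with an LLM judge and fails below its threshold.
    assert_test(test_case, [ContextualPrecisionMetric(threshold=0.5)])

The ai marker itself has to be registered in the pytest configuration (for example under [tool.pytest.ini_options] markers in pyproject.toml) so the regular test target can deselect it with -m "not ai"; that detail is an assumption here rather than something copied from the diff.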

Guidance to review

  • Are you happy that I've added data to the repo?
  • Are you happy I've added tests that will regularly fail?
  • make test-ai
  • Check the workflow -- failing but running

Relevant links

Things to check

  • I have added any new ENV vars in all deployed environments
  • I have tested any code added or changed
  • I have run integration tests -- failing, but not for anything I've touched

@wpfl-dbt marked this pull request as draft July 2, 2024 17:28
@wpfl-dbt marked this pull request as ready for review July 3, 2024 12:28
from core_api.src.dependencies import get_llm, get_parameterised_retriever, get_tokeniser
from redbox.models.chain import ChainInput

if TYPE_CHECKING:
Collaborator: do we need to test the tests?

Collaborator Author: I don't follow?

Collaborator: why bother running mypy on test code?

Collaborator Author: This was to appease ruff and my IDE

Collaborator: fair

from uuid import UUID

import jsonlines
import pandas as pd
Collaborator: I don't see pandas in any pyproject?

Collaborator Author: Added to dev

Collaborator Author: And removed again lol

.env.test Outdated
@@ -1,7 +1,7 @@
# === LLM ===

ANTHROPIC_API_KEY=
OPENAI_API_KEY=
# ANTHROPIC_API_KEY=
Collaborator: let's just delete these

@@ -50,7 +50,73 @@ jobs:

docker compose up -d --wait elasticsearch
poetry install --no-root --no-ansi --with dev,ai,api --without worker
poetry run download-model --embedding_model all-mpnet-base-v2
poetry run python download_embedder.py --embedding_model all-mpnet-base-v2
Collaborator: 👍

.PHONY: test-ai
test-ai: ## Test code with live LLM
	poetry install --no-root --no-ansi --with api,dev,ai --without worker,docs
	poetry run pytest core_api/tests -m "ai" -vv
Collaborator: 👍

if len(user_uuids) > 1:
    msg = "Embeddings have more than one creator_user_uuid"
    raise ValueError(msg)
else:
Collaborator:
Suggested change: drop the else: line

nitpicking

@@ -0,0 +1,7 @@
user_story,id,notes,input,context,expected_output
Collaborator @gecBurton commented Jul 4, 2024:
sorry to be a PITA but this is one of those things I get weird about, please could we encode this as just pure Python because:

  1. CSV is a never-ending nightmare; for example, I see you are encoding structured objects like lists, and this goes wrong all the time
  2. you only have seven records, so there is no significant readability benefit from using CSV
  3. pandas adds a lot of bloat, and I'm on a mission to cut this out
  4. we can drop your code that turns CSV into Python

Collaborator Author @wpfl-dbt commented Jul 4, 2024:
My challenge is: I'd like to do this in a format that URs can write for us, and that isn't going to be Python. Do you have any recommendations?

I can remove pandas for csvreader?

Collaborator @gecBurton commented Jul 4, 2024:
json? or csv-json, i.e.:

"name", "age", "address"
"Larry-The-Cat", 6, ["10 Downing Street", "London", "SW1A 2AA"]

this way you can parse it like:

import json


def parse_row(text):
    return json.loads(f"[{text}]")

def parse_file(file):
    rows = map(parse_row, file)
    header = next(rows)
    for row in rows:
        yield dict(zip(header, row))


with open("my-data.csv") as f:
    for item in parse_file(f):
        print(item)

Collaborator Author:
I've changed to JSON and removed pandas.
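
(For readers following along, a rough sketch of what a pandas-free loader for such a JSON case file might look like; the file name and exact field names here are assumptions based on the CSV header earlier in this thread, not copied from the final code:)

import json
from pathlib import Path


def load_cases(path: Path) -> list[dict]:
    # Assumes the case file is a plain JSON array of objects with keys like
    # user_story, id, notes, input, context, expected_output.
    with path.open() as f:
        return json.load(f)


for case in load_cases(Path("evaluation_cases.json")):  # illustrative path
    print(case["user_story"], case["input"])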

Collaborator @gecBurton left a comment:

Basically good. I like this; we should do more.

I like the way you have broken these out into their own kind of test.

Please could you placate my phobia of CSVs?

Are you happy that I've added data to the repo?

yes, you have clearly marked this as test data

Are you happy I've added tests that will regularly fail?

yes, but please disable this in GH for now because I don't want people to get in the habit of thinking that test failure alerts are something they can ignore

make test-ai

this runs but doesn't pass for me, I get:

core_api/tests/test_ai.py::test_rag[how-can-researchers-and-scienc] PASSED                                                                                                                          [ 10%]
core_api/tests/test_ai.py::test_rag[how-does-research-and-innovati] PASSED                                                                                                                          [ 20%]
core_api/tests/test_ai.py::test_rag[how-are-research-outputs-class] FAILED                                                                                                                          [ 30%]
core_api/tests/test_ai.py::test_rag[how-does-public-engagement-wit] PASSED                                                                                                                          [ 40%]
core_api/tests/test_ai.py::test_rag[how-can-policymakers-maintain] PASSED                                                                                                                           [ 50%]
core_api/tests/test_ai.py::test_rag[how-can-trust-in-science-for-p] PASSED                                                                                                                          [ 60%]
core_api/tests/test_ai.py::test_rag[how-do-policy-features-affect] PASSED                                                                                                                           [ 70%]
core_api/tests/test_ai.py::test_rag[how-does-ref-impact-research-q] PASSED                                                                                                                          [ 80%]
core_api/tests/test_ai.py::test_rag[how-challenging-is-it-to-creat] FAILED                                                                                                                          [ 90%]
core_api/tests/test_ai.py::test_rag[how-does-the-ref-evaluate-rese] FAILED                                                                                                                          [100%]Running teardown with pytest sessionfinish...

but I think that this is expected at this stage?

…oves to JSON from CSV for test data, turns off autorunning action while tests fail, minor edits from PR
embeddings: Path
test_cases: list[LLMTestCase] = []
Collaborator:
Suggested change
test_cases: list[LLMTestCase] = []
test_cases: list[LLMTestCase] = Field(default_factory=list)

Collaborator Author:
Done, but with a degree of shame as you've not only told me about this before but linked me to a helpful article on why you should never do it!
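
(Aside for readers: the pitfall the suggestion guards against is the classic shared mutable default in Python. A tiny illustration, assuming the model is a pydantic BaseModel as the Field suggestion implies:)

from pydantic import BaseModel, Field


class Example(BaseModel):
    # On a plain (non-pydantic) class, a bare `= []` class attribute is one
    # list shared by every instance; Field(default_factory=list) states the
    # intent explicitly and gives each instance its own fresh list.
    values: list[int] = Field(default_factory=list)


a, b = Example(), Example()
a.values.append(1)
print(b.values)  # [] -- b is unaffected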

Collaborator @gecBurton left a comment.

@wpfl-dbt merged commit a787195 into main on Jul 4, 2024
5 checks passed
@lmwilkigov deleted the feature/eval-unit branch July 16, 2024 15:55