This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

Benchmark list page #506

Merged: 6 commits merged into main on Feb 6, 2024

Conversation

@jakethekoenig (Member) commented on Jan 30, 2024:

Benchmark Result Summary has been refactored into Benchmark Run, which contains a list of all the results and a summary. These two objects are stored separately because only the summary is necessary to generate the list. The summaries are used to generate a list and a Plotly graph, which the runner script syncs to index.html. Benchmark-specific templates are moved to the benchmarks module.
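
As a rough sketch of the shape this refactor implies (the class and field names here are assumptions inferred from the description, not the actual code):

```python
import attr


@attr.s(auto_attribs=True)
class BenchmarkRun:
    # Hypothetical sketch: the heavyweight piece, with full transcripts.
    results: list  # list[BenchmarkResult]
    # The lightweight piece; stored separately so the list page can be
    # generated without ever loading the transcripts.
    summary: dict
```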

There are two things you may want to test locally:

  • All the benchmarks now generate a summary. For instance, if you run:
./benchmarks/exercism_practice.py --max_iterations 1 --language python --max_benchmarks 1

You should see the familiar results.json, which contains full transcript information, as well as summary/index.json, which contains only summary information about cost, pass status, etc.

  • That you can build the list page. If you have a directory with some summaries in it, you can build the list page from it (a sketch of what the builder does follows this list). For instance, after running the previous command, you can run:
./benchmarks/benchmark_result_list.py benchmarks/benchmark_repos/exercism-python/summary index.html

to build a summary list with just one run.
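
Roughly, the list-page builder reads every summary JSON out of the given directory and renders them all into one HTML page. A minimal sketch under those assumptions (Jinja2 and the template name are guesses, and the real script also embeds the Plotly graph, which is omitted here):

```python
import json
import sys
from pathlib import Path

from jinja2 import Environment, FileSystemLoader


def build_list_page(summary_dir: str, out_path: str) -> None:
    # Load every per-run summary from the directory.
    summaries = [
        json.loads(p.read_text())
        for p in sorted(Path(summary_dir).glob("*.json"))
    ]
    # Render them into a single HTML list page; template name is hypothetical.
    env = Environment(loader=FileSystemLoader("benchmarks/templates"))
    html = env.get_template("benchmark_list.html").render(summaries=summaries)
    Path(out_path).write_text(html)


if __name__ == "__main__":
    build_list_page(sys.argv[1], sys.argv[2])
```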

You can see a page built with all of our runs at benchmarks.mentat.ai.

This PR leaves two pieces of tech debt:

  • Instead of syncing our results to JSON in an S3 bucket, we should set up a database. It shouldn't be too hard to migrate the JSON files later when we get around to this.
  • The run-and-upload benchmark script is getting more complicated and should probably be converted to a Python script.

Miscellaneous other things in the PR:

  • The date, git branch, and git commit of mentat are all added to a benchmark's metadata (a sketch of how this could be collected follows this list).
  • I added the script I used to backfill the metadata and split previously generated runs apart into run + summary.
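
A minimal sketch of how that metadata could be collected (the field names are assumptions; the real keys in the benchmark metadata may differ):

```python
import subprocess
from datetime import datetime, timezone


def collect_run_metadata() -> dict:
    """Record when a benchmark ran and which mentat revision produced it."""

    def git(*args: str) -> str:
        return subprocess.check_output(["git", *args], text=True).strip()

    return {
        "date": datetime.now(timezone.utc).isoformat(),
        "git_branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "git_commit": git("rev-parse", "HEAD"),
    }
```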

Pull Request Checklist

  • Documentation has been updated, or this change doesn't require that

@biobootloader (Member) commented:

default to passed instead of cost? also not sure what's going on with those symbols on the right

[screenshot attached]

@@ -14,6 +14,7 @@ class BenchmarkResult:
     family: Optional[str] = attr.ib(default=None)
     cost: Optional[float] = attr.ib(default=None, metadata={"aggregation": "sum"})
     tokens: Optional[int] = attr.ib(default=None, metadata={"aggregation": "average"})
+    count: int = attr.ib(default=1, metadata={"aggregation": "sum"})
@jakethekoenig (Member, Author) commented on this line:

Previously, when the summary was stored with the run, we could compute the length of results directly, but now the summary needs another way to be aware of it.
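
For context, the "aggregation" metadata on each field is what drives the fold from individual results into a summary; a simplified sketch of that fold (the real implementation may differ):

```python
import attr


def summarize(results: list) -> dict:
    # Fold BenchmarkResult objects into one summary row, using each
    # field's "aggregation" metadata ("sum" or "average").
    summary = {}
    for field in attr.fields(type(results[0])):
        agg = field.metadata.get("aggregation")
        values = [getattr(r, field.name) for r in results]
        values = [v for v in values if v is not None]
        if agg is None or not values:
            continue
        if agg == "sum":
            summary[field.name] = sum(values)
        elif agg == "average":
            summary[field.name] = sum(values) / len(values)
    return summary
```

Because count defaults to 1 per result and aggregates by sum, the summary recovers the number of results even when the results themselves aren't loaded, which is what the comment above is getting at.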

@jakethekoenig (Member, Author) replied:

> default to passed instead of cost? also not sure what's going on with those symbols on the right

Both fixed. Thanks!

@jakethekoenig merged commit e817caa into main on Feb 6, 2024
16 checks passed