Benchmark list page #506

jakethekoenig · 2024-01-30T18:57:00Z

Benchmark Result Summary is refactored into Benchmark Run, which contains a list of all the results plus a summary. The two objects are stored separately because only the summary is needed to generate the list. The summaries are used to generate a list and a plotly graph, which the runner script syncs to index.html. Benchmark-specific templates are moved to the benchmarks module.

There are two things you may want to test locally:

All the benchmarks now generate a summary. For instance, if you run:

./benchmarks/exercism_practice.py --max_iterations 1 --language python --max_benchmarks 1

You should see the familiar results.json, which contains full transcript information, as well as summary/index.json, which contains only summary information about cost, pass, etc.

You can also build the list page from any directory that contains summaries. After running the previous command, for instance, you can run:

./benchmarks/benchmark_result_list.py benchmarks/benchmark_repos/exercism-python/summary index.html

to build a summary list with just one run.
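As a rough illustration of the idea, here is a minimal sketch of turning a directory of summaries into a single HTML page. It is not the code in benchmark_result_list.py: the function name `build_list_page` and the summary keys `title`, `cost` and `passed` are assumptions made for the example, and the plotly graph is left out.

```python
import html
import json
import tempfile
from pathlib import Path


def build_list_page(summary_dir: str, output_path: str) -> int:
    """Write one table row per summary JSON file; return the number of rows."""
    rows = []
    for path in sorted(Path(summary_dir).glob("*.json")):
        summary = json.loads(path.read_text())
        # Hypothetical keys: the real summary schema may differ.
        title = html.escape(str(summary.get("title", path.stem)))
        cost = summary.get("cost")
        passed = summary.get("passed")
        rows.append(f"<tr><td>{title}</td><td>{cost}</td><td>{passed}</td></tr>")
    page = (
        "<html><body><table>"
        "<tr><th>Run</th><th>Cost</th><th>Passed</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )
    Path(output_path).write_text(page)
    return len(rows)


# Usage: build a page from a temporary directory holding a single summary.
with tempfile.TemporaryDirectory() as tmp:
    summary_dir = Path(tmp) / "summary"
    summary_dir.mkdir()
    (summary_dir / "index.json").write_text(
        json.dumps({"title": "exercism <python>", "cost": 0.12, "passed": 1})
    )
    out_file = Path(tmp) / "index.html"
    row_count = build_list_page(str(summary_dir), str(out_file))
    page_text = out_file.read_text()
```

Only the summaries are read here, which is why the full transcripts in results.json never need to be loaded to build the list.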

You can see a page built with all our runs at benchmarks.mentat.ai

With this PR there are two pieces of tech debt:

Instead of syncing our results to JSON in an S3 bucket, we should set up a database. It shouldn't be too hard to migrate the JSON files later when we get around to this.
The run-and-upload benchmark script is getting more complicated and should probably be converted to a Python script.

Miscellaneous other things in the PR:

Date, git branch and git commit of mentat are all added to a benchmark's metadata.
I added the script I used to backfill the metadata and split apart previously generated runs into run+summary.
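For readers curious how date, branch and commit can be captured, here is a small sketch. It assumes the values come from `git rev-parse`, which is one common way to get them; the helper names `collect_metadata` and `_git` are hypothetical and not taken from the PR.

```python
import subprocess
from datetime import datetime, timezone
from typing import Callable, Dict, List


def _git(args: List[str]) -> str:
    """Run a git command in the current directory and return its trimmed output."""
    return subprocess.check_output(["git", *args], text=True).strip()


def collect_metadata(git: Callable[[List[str]], str] = _git) -> Dict[str, str]:
    """Gather the date, branch and commit to attach to a benchmark run."""
    return {
        "date": datetime.now(timezone.utc).isoformat(),
        "branch": git(["rev-parse", "--abbrev-ref", "HEAD"]),
        "commit": git(["rev-parse", "HEAD"]),
    }


def fake_git(args: List[str]) -> str:
    """Stand-in for git so the example runs outside a repository."""
    return "main" if "--abbrev-ref" in args else "0123abc"


metadata = collect_metadata(fake_git)
```

Calling `collect_metadata()` with no argument inside a checkout would query the real repository instead of the stand-in.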

Pull Request Checklist

Documentation has been updated, or this change doesn't require that

biobootloader · 2024-02-01T21:29:23Z

default to passed instead of cost? also not sure what's going on with those symbols on the right

jakethekoenig · 2024-02-02T18:35:24Z

benchmarks/benchmark_result.py

@@ -14,6 +14,7 @@ class BenchmarkResult:
    family: Optional[str] = attr.ib(default=None)
    cost: Optional[float] = attr.ib(default=None, metadata={"aggregation": "sum"})
    tokens: Optional[int] = attr.ib(default=None, metadata={"aggregation": "average"})
+    count: int = attr.ib(default=1, metadata={"aggregation": "sum"})


Previously, when the summary was stored with the run, we could compute the number of results from the length of the results list. Now that they are stored separately, the summary needs its own way to track it.
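To show why a `count` field with a "sum" aggregation is enough, here is a self-contained sketch of metadata-driven aggregation. It uses the standard-library `dataclasses` module rather than attrs so that it runs without extra dependencies; the `Result` class and `summarize` function are hypothetical stand-ins, not the PR's implementation.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Result:
    # Hypothetical stand-in for BenchmarkResult (the real class uses attrs).
    name: str
    cost: Optional[float] = field(default=None, metadata={"aggregation": "sum"})
    tokens: Optional[int] = field(default=None, metadata={"aggregation": "average"})
    count: int = field(default=1, metadata={"aggregation": "sum"})


def summarize(results: List[Result]) -> dict:
    """Aggregate every field that declares an "aggregation" in its metadata."""
    summary = {}
    for f in fields(Result):
        how = f.metadata.get("aggregation")
        if how is None:
            continue  # fields such as name are not aggregated
        values = [getattr(r, f.name) for r in results if getattr(r, f.name) is not None]
        if not values:
            summary[f.name] = None
        elif how == "sum":
            summary[f.name] = sum(values)
        elif how == "average":
            summary[f.name] = sum(values) / len(values)
    return summary


results = [
    Result("a", cost=0.5, tokens=100),
    Result("b", cost=0.25, tokens=300),
    Result("c"),  # no cost or tokens recorded
]
summary = summarize(results)
```

Because each result contributes a `count` of 1, summing the field gives the number of results, so the summary carries that number even when the results list is stored elsewhere.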

jakethekoenig · 2024-02-02T18:36:34Z

default to passed instead of cost? also not sure what's going on with those symbols on the right

Both fixed. Thanks!

jakethekoenig added 4 commits February 2, 2024 10:10

Fix dropdown menu

5be0032

Clean up unused client side plotly

a4841d1

Add file to metadata

4b00c38

Remove todos

6789648

biobootloader approved these changes Feb 5, 2024


Merge branch 'main' into benchmark-main-page

1628878

jakethekoenig merged commit e817caa into main Feb 6, 2024
16 checks passed
