
Evaluator #47 (Draft)

wants to merge 102 commits into master
Conversation

stefstef00
Collaborator

Created a method to evaluate benchmarks. The currently implemented features include:

  • Evaluate an iterator on an entire benchmark or a list of benchmarks (a usage sketch follows the folder layout below)
  • A selection of problem IDs can be evaluated; the selection can be specified using a regular expression.
  • Metrics per problem include:
    1. Programs evaluated
    2. Execution time (s)
    3. Memory usage (bytes)
    4. Termination cause
    5. Any other metrics kept in the SolverStatistics struct of the iterator
  • Statistics per benchmark include:
    • Mean and std for metrics 1, 2 & 3 over all problems
    • Mean and std for metrics 1, 2 & 3 over all optimally solved problems
    • Per termination cause, the number of problems that terminated with that cause
  • A decomposed synth function exists that is easy to extend and reuse for different benchmarks.
  • Save to file (written while running, to avoid excessive memory usage):
    • Folder structure (for now):
<specified_path>/
	environment.txt            Environment variables
	statistics.txt             Statistics per benchmark
	benchmarks/                Problem results per benchmark
		<benchmark 1>.txt      Problem results for benchmark 1
		<benchmark 2>.txt      Problem results for benchmark 2
		...
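For a concrete picture of how the regex-based problem selection and the streamed-to-disk layout above could fit together, here is a minimal self-contained sketch. All names in it (`Problem`, `run_problem`, `stream_benchmark_results`) are hypothetical stand-ins for illustration, not the API added in this PR:

```julia
# Hypothetical sketch: filter problems by ID with a Regex and stream
# per-problem results to disk so memory usage stays bounded.
struct Problem
    id::String
end

# Dummy stand-in for the real synth call; returns (programs, time_s, mem_bytes, cause).
run_problem(p::Problem) = (rand(1:100), rand(), rand(1:10^6), :max_time)

function stream_benchmark_results(name::String, problems::Vector{Problem},
                                  id_filter::Regex, out_dir::String)
    mkpath(joinpath(out_dir, "benchmarks"))
    # Environment info is written once, up front.
    write(joinpath(out_dir, "environment.txt"), "julia $(VERSION)\n")
    open(joinpath(out_dir, "benchmarks", "$name.txt"), "w") do io
        for p in filter(p -> occursin(id_filter, p.id), problems)
            programs, time_s, mem, cause = run_problem(p)
            # Each result is written as soon as the problem finishes,
            # so nothing large accumulates in memory.
            println(io, join((p.id, programs, time_s, mem, cause), "\t"))
        end
    end
end

# Evaluate only the problems whose ID matches r"^p".
stream_benchmark_results("example_benchmark",
                         [Problem("p1"), Problem("p2"), Problem("q3")],
                         r"^p", mktempdir())
```

Writing each line as soon as a problem finishes is what keeps memory usage bounded even for large benchmarks.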

ReubenJ and others added 26 commits April 26, 2024 11:19
Containing some simple problems for easy and quick debugging
There are three types of results:
1. ProblemResult: for a single problem evaluation.
2. BenchmarkResult: for a single benchmark evaluation.
3. EvaluationResult: for multiple benchmark evaluations.

The ProblemResult contains metrics about the search (e.g. execution time, memory usage, ...). The BenchmarkResult contains aggregated statistics (e.g. average execution time, termination cause totals).
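As a rough illustration of how these three levels could nest, the sketch below uses assumed field names (the actual structs in this PR may differ); the outer constructor shows how per-problem metrics would roll up into per-benchmark statistics:

```julia
using Statistics: mean, std

# Hypothetical sketch of the three result levels; field names are assumed.
struct ProblemResult
    id::String
    programs_evaluated::Int
    execution_time::Float64      # seconds
    memory_usage::Int            # bytes
    termination_cause::Symbol    # e.g. :optimal, :max_time, :max_enumerations
end

struct BenchmarkResult
    name::String
    problem_results::Vector{ProblemResult}
    mean_time::Float64
    std_time::Float64
    termination_counts::Dict{Symbol,Int}
end

struct EvaluationResult
    benchmark_results::Vector{BenchmarkResult}
end

# Aggregating per-problem metrics into per-benchmark statistics.
function BenchmarkResult(name::String, results::Vector{ProblemResult})
    times  = [r.execution_time for r in results]
    counts = Dict{Symbol,Int}()
    for r in results
        counts[r.termination_cause] = get(counts, r.termination_cause, 0) + 1
    end
    return BenchmarkResult(name, results, mean(times), std(times), counts)
end

rs = [ProblemResult("p$i", 10 * i, 0.5 * i, 1024 * i, i == 1 ? :optimal : :max_time) for i in 1:3]
BenchmarkResult("example", rs)
```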
Created a custom synth function that keeps track of more metrics (enumeration count, termination cause, memory usage, etc.)
This contains two evaluation functions: one for a single benchmark and one for multiple benchmarks. It calls the synth function on each problem within the benchmark(s) and returns the results, including metrics and statistics, in the corresponding structures.
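A small sketch of that single/multi split, with assumed names (`Benchmark`, `solve_problem`, and both `evaluate_*` functions are illustrative only): the multi-benchmark variant simply maps the single-benchmark one over its inputs.

```julia
# Hypothetical sketch: the multi-benchmark function reuses the
# single-benchmark one, so per-problem metrics are collected in one place.
struct Benchmark
    name::String
    problems::Vector{String}   # problem IDs as stand-ins for real problem objects
end

# Stand-in for the synth call on one problem; returns a named tuple of metrics.
solve_problem(id) = (id = id, time_s = rand(), termination = :max_time)

evaluate_benchmark(b::Benchmark) =
    (name = b.name, results = [solve_problem(p) for p in b.problems])

evaluate_benchmarks(bs::Vector{Benchmark}) = map(evaluate_benchmark, bs)

evaluate_benchmarks([Benchmark("strings", ["p1", "p2"]),
                     Benchmark("robots",  ["r1"])])
```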
This decomposed synth calls several submethods, each of which can be overloaded. When a benchmark needs a custom synth function, it only has to overload the methods that differ, which prevents code duplication.
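One way to realize "overloadable submethods" in Julia is multiple dispatch on a benchmark type: default methods cover the common case, and a benchmark overrides only the step that differs. The sketch below is illustrative only (all type and function names are assumptions, and the enumeration loop is elided):

```julia
# Hypothetical sketch of a decomposed synth: each step is its own method
# with a default, so a benchmark overrides only the steps that differ.
abstract type AbstractBenchmark end
struct DefaultBenchmark <: AbstractBenchmark end
struct StringBenchmark  <: AbstractBenchmark end

# Default submethods, shared by most benchmarks.
setup_grammar(::AbstractBenchmark)                    = :default_grammar
interpret(::AbstractBenchmark, program, input)        = program            # trivial stand-in
check_solution(::AbstractBenchmark, output, expected) = output == expected

# A benchmark that only needs a different grammar overrides a single method.
setup_grammar(::StringBenchmark) = :string_grammar

function synth(b::AbstractBenchmark, expected)
    grammar = setup_grammar(b)
    # A real implementation would enumerate programs from `grammar` here and
    # track metrics such as enumeration count, termination cause, and memory usage.
    candidate = expected                      # pretend the first candidate solves it
    output    = interpret(b, candidate, nothing)
    return (grammar = grammar, solved = check_solution(b, output, expected))
end

synth(DefaultBenchmark(), 42)   # uses every default submethod
synth(StringBenchmark(), "ab")  # same pipeline, only the grammar differs
```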
stefstef00 self-assigned this Jul 2, 2024
@sebdumancic
Member

just a thought -- it might be useful to consider building (some?) of this functionality over an existing tool for running large scale experiments, like DrWatson.jl
