
Evaluator #47 (Draft)

wants to merge 102 commits into master
Conversation

stefstef00
Collaborator

Created a method to evaluate benchmarks. The currently implemented features include:

  • Evaluate an iterator on an entire benchmark or a list of benchmarks (a usage sketch follows the folder layout below)
  • A selection of problem IDs can be evaluated; the selection can be specified using a regular expression.
  • Metrics per problem include:
    1. Programs evaluated
    2. Execution time (s)
    3. Memory usage (bytes)
    4. Termination cause
    5. Any other metrics kept in the SolverStatistics struct of the iterator
  • Statistics per benchmark include:
    • Mean and std for metrics 1, 2 & 3 over all problems
    • Mean and std for metrics 1, 2 & 3 over all optimally solved problems
    • Per termination cause, the number of problems that terminated with that cause
  • A decomposed synth function exists that is easy to extend and reuse for different benchmarks.
  • Save to file (written while running, to avoid excessive memory usage):
    • Folder structure (for now):
<specified_path>/
	environment.txt            Environment variables
	statistics.txt             Statistics per benchmark
	benchmarks/                Problem results per benchmark
		<benchmark 1>.txt      Problem results for benchmark 1
		<benchmark 2>.txt      Problem results for benchmark 2
		...
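For a concrete picture of how the regex-based problem selection and the streamed-to-disk layout above could fit together, here is a minimal self-contained sketch. All names in it (`Problem`, `run_problem`, `stream_benchmark_results`) are hypothetical stand-ins for illustration, not the API added in this PR:

```julia
# Hypothetical sketch: filter problems by ID with a Regex and stream
# per-problem results to disk so memory usage stays bounded.
struct Problem
    id::String
end

# Dummy stand-in for the real synth call; returns (programs, time_s, mem_bytes, cause).
run_problem(p::Problem) = (rand(1:100), rand(), rand(1:10^6), :max_time)

function stream_benchmark_results(name::String, problems::Vector{Problem},
                                  id_filter::Regex, out_dir::String)
    mkpath(joinpath(out_dir, "benchmarks"))
    # Environment info is written once, up front.
    write(joinpath(out_dir, "environment.txt"), "julia $(VERSION)\n")
    open(joinpath(out_dir, "benchmarks", "$name.txt"), "w") do io
        for p in filter(p -> occursin(id_filter, p.id), problems)
            programs, time_s, mem, cause = run_problem(p)
            # Each result is written as soon as the problem finishes,
            # so nothing large accumulates in memory.
            println(io, join((p.id, programs, time_s, mem, cause), "\t"))
        end
    end
end

# Evaluate only the problems whose ID matches r"^p".
stream_benchmark_results("example_benchmark",
                         [Problem("p1"), Problem("p2"), Problem("q3")],
                         r"^p", mktempdir())
```

Writing each line as soon as a problem finishes is what keeps memory usage bounded even for large benchmarks.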

ReubenJ and others added 26 commits April 26, 2024 11:19
Containing some simple problems for easy and quick debugging
There are three types of results:
1. ProblemResult: for a single problem evaluation.
2. BenchmarkResult: for a single benchmark evaluation.
3. EvaluationResult: for multiple benchmark evaluations.

The ProblemResult contains metrics about the search (e.g. execution time, memory usage, ...). The BenchmarkResult contains aggregated statistics (e.g. average execution time, termination cause totals).
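As a rough illustration of how these three levels could nest, the sketch below uses assumed field names (the actual structs in this PR may differ); the outer constructor shows how per-problem metrics would roll up into per-benchmark statistics:

```julia
using Statistics: mean, std

# Hypothetical sketch of the three result levels; field names are assumed.
struct ProblemResult
    id::String
    programs_evaluated::Int
    execution_time::Float64      # seconds
    memory_usage::Int            # bytes
    termination_cause::Symbol    # e.g. :optimal, :max_time, :max_enumerations
end

struct BenchmarkResult
    name::String
    problem_results::Vector{ProblemResult}
    mean_time::Float64
    std_time::Float64
    termination_counts::Dict{Symbol,Int}
end

struct EvaluationResult
    benchmark_results::Vector{BenchmarkResult}
end

# Aggregating per-problem metrics into per-benchmark statistics.
function BenchmarkResult(name::String, results::Vector{ProblemResult})
    times  = [r.execution_time for r in results]
    counts = Dict{Symbol,Int}()
    for r in results
        counts[r.termination_cause] = get(counts, r.termination_cause, 0) + 1
    end
    return BenchmarkResult(name, results, mean(times), std(times), counts)
end

rs = [ProblemResult("p$i", 10 * i, 0.5 * i, 1024 * i, i == 1 ? :optimal : :max_time) for i in 1:3]
BenchmarkResult("example", rs)
```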
Created a custom synth function that keeps track of more metrics (enumeration count, termination cause, memory usage, etc.)
This contains two evaluation functions: one for a single benchmark and one for multiple benchmarks. It calls the synth function on each problem within the benchmark(s) and returns the results, including metrics and statistics, in the corresponding structures.
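A small sketch of that single/multi split, with assumed names (`Benchmark`, `solve_problem`, and both `evaluate_*` functions are illustrative only): the multi-benchmark variant simply maps the single-benchmark one over its inputs.

```julia
# Hypothetical sketch: the multi-benchmark function reuses the
# single-benchmark one, so per-problem metrics are collected in one place.
struct Benchmark
    name::String
    problems::Vector{String}   # problem IDs as stand-ins for real problem objects
end

# Stand-in for the synth call on one problem; returns a named tuple of metrics.
solve_problem(id) = (id = id, time_s = rand(), termination = :max_time)

evaluate_benchmark(b::Benchmark) =
    (name = b.name, results = [solve_problem(p) for p in b.problems])

evaluate_benchmarks(bs::Vector{Benchmark}) = map(evaluate_benchmark, bs)

evaluate_benchmarks([Benchmark("strings", ["p1", "p2"]),
                     Benchmark("robots",  ["r1"])])
```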
This decomposed synth calls several submethods, each of which can be overloaded. When a benchmark needs a custom synth function, it only has to overload the methods that differ, which prevents code duplication.
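One way to realize "overloadable submethods" in Julia is multiple dispatch on a benchmark type: default methods cover the common case, and a benchmark overrides only the step that differs. The sketch below is illustrative only (all type and function names are assumptions, and the enumeration loop is elided):

```julia
# Hypothetical sketch of a decomposed synth: each step is its own method
# with a default, so a benchmark overrides only the steps that differ.
abstract type AbstractBenchmark end
struct DefaultBenchmark <: AbstractBenchmark end
struct StringBenchmark  <: AbstractBenchmark end

# Default submethods, shared by most benchmarks.
setup_grammar(::AbstractBenchmark)                    = :default_grammar
interpret(::AbstractBenchmark, program, input)        = program            # trivial stand-in
check_solution(::AbstractBenchmark, output, expected) = output == expected

# A benchmark that only needs a different grammar overrides a single method.
setup_grammar(::StringBenchmark) = :string_grammar

function synth(b::AbstractBenchmark, expected)
    grammar = setup_grammar(b)
    # A real implementation would enumerate programs from `grammar` here and
    # track metrics such as enumeration count, termination cause, and memory usage.
    candidate = expected                      # pretend the first candidate solves it
    output    = interpret(b, candidate, nothing)
    return (grammar = grammar, solved = check_solution(b, output, expected))
end

synth(DefaultBenchmark(), 42)   # uses every default submethod
synth(StringBenchmark(), "ab")  # same pipeline, only the grammar differs
```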
stefstef00 self-assigned this Jul 2, 2024
@sebdumancic
Member

just a thought -- it might be useful to consider building (some?) of this functionality over an existing tool for running large scale experiments, like DrWatson.jl
