This is a place to discuss the process of re-validation for the core devs.
Some technical background
Validation is done using the tpcp validation concept. This means we have an algorithm wrapped by a pipeline and a scoring function wrapped by a Scorer. Both parts are then combined with a dataset by a function like tpcp.validate.validate (https://tpcp.readthedocs.io/en/latest/auto_examples/index.html#validation).
For the revalidation, we want the results included in the docs. For this, we store the results of the validation runs (aggregated performance metrics and certain raw results) in a separate git repo (https://github.com/mobilise-d/mobgap_validation).
We have a utility class (mobgap.data.validation_results.ValidationResultLoader) that can either use a local copy of this repo or download the result files using pooch from GitHub. When the documentation is built on ReadTheDocs, the latter is done, and the files required to display the results are pulled directly from the other GitHub repo.
Step by step guide to create a revalidation for a new algorithm block
Setup
1. Set up your local mobgap repo (if you have not done so yet): clone it, install poetry, and run poetry install --all-extras.
2. Download the official TVS dataset from Zenodo (https://zenodo.org/records/13987963) and unpack all subfolders into some local folder. We will refer to this location as "local_tvs_dataset_path".
3. Clone the validation result repo (https://github.com/mobilise-d/mobgap_validation) to a local folder as well. This local copy is used to load and store validation results later on.
4. Within your local mobgap repo folder (top-level, not the Python package), create a file called .env with the following content:
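A minimal sketch of what this file could look like. The variable names below are assumptions (they are not spelled out in this guide); double-check them against the existing revalidation scripts before relying on them:

```text
# NOTE: variable names are assumptions -- check the existing revalidation scripts for the exact names.
MOBGAP_TVS_DATASET_PATH="{local_tvs_dataset_path}"
MOBGAP_VALIDATION_DATA_PATH="{local_path_of_your_mobgap_validation_clone}"
# Set to 1 to use the local clone of the validation result repo instead of downloading.
MOBGAP_VALIDATION_USE_LOCAL_DATA=1
```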
The {...} should be replaced by the respective full path on your local computer.
5. Test that this setup works by opening revalidation/gait_sequences/_01_gsd_analysis.py and running it. Everything should run without error, and you should NOT see any download progress bars appear, indicating that your local copies of the validation files were used.
6. Test that examples/data/_04_tvs_data_no_exc.py can be executed without error.
7. When both pass, we can start writing code.
The required parts
Note that we will not cover including the results of the old algorithms in this section. This will be covered later.
The basic idea we will follow here is to get everything working with one algorithm, on one datapoint, using one performance metric, and then expand once we know everything is working.
Set up new git branches in mobgap and the validation results repo. Ideally use the same name in both. In both repos, run git switch -c {your branch name}.
Create the "boilerplate" folders and files:
Copy one of the existing folders in the revalidation folder and rename all references to your block name (within filenames and file content). Clean out the actual code files with the exception of the headlines.
Within the mobgap subfolder for your block (e.g. mobgap/cadence), create the two files evaluation.py and pipeline.py, in case they don't exist already.
Make the documentation aware of the new revalidation folder by updating docs/conf.py around line 200 with a reference to your new folder (see the sketch below).
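The revalidation scripts are rendered as part of the docs via sphinx-gallery (that is what the # %% blocks are for), so the change usually amounts to adding your folder to the gallery source/output lists. The snippet below is only a hypothetical sketch of what such an entry could look like; the actual option names and paths in mobgap's docs/conf.py may differ, so mirror the existing revalidation entries you find there:

```python
# Hypothetical sketch of a sphinx-gallery configuration in docs/conf.py.
# Look for the existing revalidation entries and add your block alongside them.
sphinx_gallery_conf = {
    "examples_dirs": [
        "../examples",
        "../revalidation/gait_sequences",
        "../revalidation/cadence",  # <- add the folder of your new block here
    ],
    "gallery_dirs": [
        "./auto_examples",
        "./auto_revalidation/gait_sequences",
        "./auto_revalidation/cadence",  # <- and the matching output folder here
    ],
}
```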
The pipeline
Create a new pipeline using the Emulation Pipeline of the stride length block as reference. Use reference information from the dataset to mock inputs. Have a look at the mobgap pipeline examples for more information about how to use the gait sequence iterator.
You can use your revalidation/{your_block_folder}/_02_{block}_result_generation_no_exc.py for testing. To test the pipeline, you need to load the TVS dataset and import the algorithm you want to test.
The stride length result generation script shows the relevant parts (loading the dataset, setting up the pipeline, and running it on a single datapoint) and can be used as a reference.
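To give a rough idea of the shape, here is a minimal sketch using the stride length block as an example. The class names (TVSFreeLivingDataset, SlZijlstra, SlEmulationPipeline) and parameters are assumptions and may not match the real code exactly -- copy the actual imports and setup from the stride length result generation script:

```python
import os
from pathlib import Path

from mobgap.data import TVSFreeLivingDataset  # assumed dataset class name
from mobgap.stride_length import SlZijlstra  # assumed algorithm class name
from mobgap.stride_length.pipeline import SlEmulationPipeline  # assumed pipeline class name

# Path to the local TVS dataset, as configured in your .env file (env var name is an assumption).
dataset = TVSFreeLivingDataset(
    Path(os.environ["MOBGAP_TVS_DATASET_PATH"]),
    reference_system="INDIP",  # the emulation pipeline mocks its inputs from the reference data
)

# Start with a single datapoint to keep the feedback loop fast.
datapoint = dataset[0]

pipeline = SlEmulationPipeline(SlZijlstra()).safe_run(datapoint)

# Inspect the result attributes (trailing underscore, as usual for tpcp pipelines)
# to check that the output looks reasonable; the exact attribute names depend on the block.
```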
If the pipeline works as expected, you can move on to the scorer.
Scorer
Scoring consists of a couple of steps. I would suggest starting with the evaluation example that already exists for all blocks (e.g. examples/stride_length/_02_sl_evaluation.py). These examples only show the evaluation "primitives". These are the building blocks to build the official scorer.
Next, you should check the per-block validation paper and identify the metrics used there. The scorer should at least output them. You can add further metrics that you think are useful.
The scoring itself will require two functions: the per-datapoint scorer and the final_agg function (see here for more info).
The per-datapoint scorer does most of the comparison. It first runs the pipeline and then extracts all relevant information from the pipeline and datapoint. Then you calculate all error metrics that are required. If there are error metrics that cannot be calculated individually on a per-datapoint level, make sure that you return all the relevant information for them wrapped in no_agg. This allows you to access that info in the final_agg function.
By convention, we also pass most of the raw information out of the scorer, in case people want to do further analysis based on that. Note that we prefix the raw results with raw__ to make it easier to filter them out of the results later (see the stride length scorer for examples).
This is a little bit complicated to explain. It is best to work through one of the existing scorers and play around with it.
Once you have a first iteration of the scoring function, you can create a scorer ({block}_score) like so: Scorer(per_datapoint_score, final_aggregation=final_agg).
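To make the moving parts a bit more concrete, below is a skeleton of what such a scorer could look like. This is not the real scorer of any block: the metric names and placeholder values are made up, the import location of no_agg may differ, and the exact final_agg signature is intentionally left out -- mirror the existing scorers (e.g. mobgap/stride_length/evaluation.py) for the real thing.

```python
from tpcp.validate import Scorer  # Scorer usage as described above
from tpcp.validate import no_agg  # assumed import location -- check the existing scorers


def per_datapoint_score(pipeline, datapoint):
    """Run the pipeline on one datapoint and calculate the per-datapoint metrics."""
    pipeline = pipeline.safe_run(datapoint)

    # Placeholder extraction -- replace with the actual pipeline/datapoint accessors.
    detected = 1.23  # e.g. a parameter value calculated by the pipeline
    reference = 1.20  # e.g. the matching value from the INDIP reference

    return {
        # Metrics that can simply be averaged across datapoints:
        "wb__error": detected - reference,
        # Raw values, prefixed with raw__ so they can be filtered out later:
        "raw__detected": no_agg(detected),
        # Information that can only be aggregated across all datapoints (e.g. for an ICC)
        # is also wrapped in no_agg and picked up again in final_agg:
        "for_final_agg": no_agg({"detected": detected, "reference": reference}),
    }


def final_agg(*args, **kwargs):
    """Aggregations that need the data of all datapoints at once.

    The exact signature is defined by tpcp -- copy it from one of the existing
    final_agg implementations instead of this placeholder.
    """
    raise NotImplementedError


# The official scorer for the block, created as described above:
my_block_score = Scorer(per_datapoint_score, final_aggregation=final_agg)
```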
Now you can test the scorer using mobgap.utils.evaluation.Evaluation. See examples/stride_length/_02_sl_evaluation.py for an example. I would suggest only testing on a couple of participants from the Free-Living dataset to get started (just add [:3] behind the dataset to run only the first 3 participants).
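A rough sketch of such a test run is shown below. The constructor arguments and method names of the evaluation class are assumptions -- copy the exact usage from examples/stride_length/_02_sl_evaluation.py:

```python
from mobgap.utils.evaluation import Evaluation  # name as used in the rest of this guide

# Keep the first test runs small: only use the first 3 participants of the Free-Living dataset.
small_dataset = dataset[:3]

# Assumed call signature -- mirror the stride length evaluation example for the real one.
evaluation = Evaluation(small_dataset, scoring=my_block_score)
evaluation.run(pipeline)
```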
Storing Results
As mentioned above, the validation results are stored in a second git repository. Now is a good time to check the setup again to make sure you have this secondary git repo cloned and the .env file set up correctly. Within the secondary git repo, make sure you have checked out the branch that you created there specifically for your block.
Like before, I would suggest testing the saving with just a couple of participants.
If you inspect the results you get from the participants (e.g. by running the Evaluation in the debugger), you will see that the results are a relatively complex data structure.
The docstrings in mobgap/gait_sequences/_evaluation_scorer.py and mobgap/stride_length/evaluation.py are a good starting point to better understand what kind of results to expect.
For more basic information the tpcp guide might also be helpful.
As we expect a similar scoring result structure and naming for all blocks in mobgap, we added some helper methods to simplify working with the validation results.
Specifically, you can call the following methods on the Evaluation instance: get_single_results_as_df, get_aggregated_results_as_df, get_raw_results (see docstrings for more information).
WARNING: When you inspect the aggregated results, you might see a lot of NaNs in the output. This is because the default aggregator across all recordings is just mean. If one of your scores is NaN for one of the participants, the overall value will be NaN. I am considering switching to nan-mean, but I am not sure if this is always the best option.
These should give you well-formatted results, assuming a certain structure of the scorer outputs.
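For example, assuming evaluation is the Evaluation instance after running (as in the sketch above):

```python
# Per-datapoint metrics (one row per recording/participant):
single_results_df = evaluation.get_single_results_as_df()

# Metrics aggregated across all recordings (watch out for the NaN behaviour described above):
aggregated_results_df = evaluation.get_aggregated_results_as_df()

# The raw (no_agg) outputs of the scorer:
raw_results = evaluation.get_raw_results()
```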
To simplify storing the results even further, we also provide mobgap.utils.evaluation.save_evaluation_results. This allows you to store a subset of the results in a proper subdirectory and file structure.
We use this function to store the results in the validation-result git repo:
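In the stride length validation, the call looks roughly like the sketch below. Apart from raw_result_filter (mentioned in the text), the argument names and values here are assumptions -- check the docstring of mobgap.utils.evaluation.save_evaluation_results and the stride length result generation script for the actual signature:

```python
import os
from pathlib import Path

from mobgap.utils.evaluation import save_evaluation_results

# Store the results inside your local clone of the validation result repo
# (path layout and env var name as set up in the .env file earlier -- both are assumptions).
base_result_path = Path(os.environ["MOBGAP_VALIDATION_DATA_PATH"]) / "results"

save_evaluation_results(
    evaluation,  # the Evaluation instance after .run(...)
    base_path=base_result_path,  # hypothetical argument name
    raw_result_filter=["wb__detected", "wb__reference"],  # illustrative: only keep the raw values you need
)
```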
Note that we use the raw_result_filter to only store the results that we are actually interested in. For example, in most cases the "raw-raw" results might not really be required for the evaluation. I would suggest starting by storing just the raw values you need. We can always add more. Rerunning the pipeline does not take that long.
Inspect the files that were stored on disk to check that everything is as expected.
For now, I would suggest not committing them yet and not creating the results for all recordings yet. Stick to your small test sample.
Presentation of the results
Presentation of the results happens in the _01_{block_name}_analysis.py file.
This file will be part of the documentation with the output of each block (blocks are defined using the # %% syntax) rendered.
This allows us to include tables (pandas DataFrames) and plots (generated with seaborn/matplotlib).
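For orientation, a block in such a script looks roughly like this (standard sphinx-gallery "percent" format; the comment block is rendered as prose, and the last expression of a code block is rendered as output):

```python
# %%
# Results
# -------
# Free text in this comment block is rendered as prose in the documentation.
import pandas as pd

example_table = pd.DataFrame({"metric": ["example"], "value": [0.0]})
example_table  # rendered as a table in the built docs
```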
We assume that technical people who are interested in the validation results will have a look at this. Hence, include sufficient information so that people can understand what they are seeing and what it means.
Ideally, it should read like an actual technical report with methods, results, discussion, etc. I would suggest a less formal tone, though. Write something you would like to read. Be transparent. There is nothing to hide!
To write this technical report, we need to load the results stored before.
For this we use the mobgap.data.validation_results.ValidationResultLoader.
This loader hides some magic. Specifically, it can either download the correct result files from the git repository, or use the local files you have available. This allows other people to run the file locally, without the hassle of cloning the second git repo.
With the structure shown below, we can dynamically switch between the local and the online version using env vars. Note that if you want to use online results from another branch than main, you need to change version. In the future, we will update the files to point to a fixed release and not just a branch.
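The referenced snippet is not reproduced here; the sketch below only shows the general pattern. The environment variable names and the constructor arguments of ValidationResultLoader are assumptions -- copy the exact code from one of the existing analysis scripts (e.g. revalidation/gait_sequences/_01_gsd_analysis.py):

```python
import os
from pathlib import Path

from mobgap.data.validation_results import ValidationResultLoader

# If the (assumed) env var is set, use the local clone of the validation result repo;
# otherwise the loader downloads the files from GitHub via pooch.
local_data_path = (
    Path(os.environ["MOBGAP_VALIDATION_DATA_PATH"]) / "results"
    if int(os.environ.get("MOBGAP_VALIDATION_USE_LOCAL_DATA", 0))
    else None
)

loader = ValidationResultLoader(
    "sl",  # hypothetical: identifier of the block's sub-folder in the result repo
    result_path=local_data_path,  # assumed argument name; None -> download instead of local files
    version="main",  # change this to your branch name to load results from a branch
)
```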
As the goal of this script is presentation, I would suggest renaming algorithms to proper names and ensuring that plot and table labels don't contain variable names, but properly formatted names.
(Yes, the existing validation scripts are not the best example of that yet)
To help the presentation of the results, we need to calculate a final level of aggregations (across all participants and per cohort).
For this we can use the existing apply_aggregations and apply_transformations helpers.
Writing them down "seems" a little wordy, but it makes it relatively easy to see what metrics are calculated and to expand the list.
Below, you can see all aggregations for SL:
from functools import partial

import pandas as pd

from mobgap.pipeline.evaluation import CustomErrorAggregations as A
from mobgap.utils.df_operations import (
    CustomOperation,
    apply_aggregations,
    apply_transformations,
)
from mobgap.utils.tables import FormatTransformer as F

custom_aggs = [
    CustomOperation(
        identifier=None,
        function=A.n_datapoints,
        column_name=[("n_datapoints", "all")],
    ),
    ("wb__detected", ["mean", A.conf_intervals]),
    ("wb__reference", ["mean", A.conf_intervals]),
    ("wb__error", ["mean", A.loa]),
    ("wb__abs_error", ["mean", A.conf_intervals]),
    ("wb__rel_error", ["mean", A.conf_intervals]),
    ("wb__abs_rel_error", ["mean", A.conf_intervals]),
    CustomOperation(
        identifier=None,
        function=partial(
            A.icc,
            reference_col_name="wb__reference",
            detected_col_name="wb__detected",
            icc_type="icc2",
        ),
        column_name=[("icc", "wb_level"), ("icc_ci", "wb_level")],
    ),
]

format_transforms = [
    CustomOperation(
        identifier=None,
        function=lambda df_: df_[("n_datapoints", "all")].astype(int),
        column_name="n_datapoints",
    ),
    *(
        CustomOperation(
            identifier=None,
            function=partial(
                F.value_with_range,
                value_col=("mean", c),
                range_col=("conf_intervals", c),
            ),
            column_name=c,
        )
        for c in [
            "wb__reference",
            "wb__detected",
            "wb__abs_error",
            "wb__rel_error",
            "wb__abs_rel_error",
        ]
    ),
    CustomOperation(
        identifier=None,
        function=partial(
            F.value_with_range,
            value_col=("mean", "wb__error"),
            range_col=("loa", "wb__error"),
        ),
        column_name="wb__error",
    ),
    CustomOperation(
        identifier=None,
        function=partial(
            F.value_with_range,
            value_col=("icc", "wb_level"),
            range_col=("icc_ci", "wb_level"),
        ),
        column_name="icc",
    ),
]

final_names = {
    "n_datapoints": "# participants",
    "wb__detected": "WD mean and CI [m]",
    "wb__reference": "INDIP mean and CI [m]",
    "wb__error": "Bias and LoA [m]",
    "wb__abs_error": "Abs. Error [m]",
    "wb__rel_error": "Rel. Error [%]",
    "wb__abs_rel_error": "Abs. Rel. Error [%]",
    "icc": "ICC",
}


def format_tables(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.pipe(apply_transformations, format_transforms)
        .rename(columns=final_names)
        .loc[:, list(final_names.values())]
    )
What we get at the end is a function that applies our calculations to any subset of our results. From there on, we can use basic pandas to group and apply methods to generate the tables we need. The main goal is to replicate the table presented in this paper.
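As a rough usage sketch (the grouping column and the name of the per-datapoint results DataFrame are placeholders; the data itself would come from the ValidationResultLoader):

```python
# Aggregate the per-datapoint results per cohort and format them into the final table.
per_cohort_table = (
    single_results.groupby("cohort")  # assumed column name -- adapt to your results
    .apply(apply_aggregations, custom_aggs)
    .pipe(format_tables)
)
```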
Add plots and further analysis at your discretion.
You can see what the rendered output looks like by running:
poetry run poe docs
poetry run poe docs_preview
And then navigate to the shown URL in the browser.
Final runs and saving results
Once you are happy with everything, change your script to use the entire dataset and run it.
Now you should have all results generated locally.
Then run:
poetry run poe update_validation_results
This will update the index of available result files.
The index is used by the Result-Downloader to check which files are available and to verify that they were downloaded correctly.
Commit your results and the new index and push. Create a PR in the validation result repo. In case you haven't done that yet, also create a PR for your code branch in the main repo.
You will see that the build documentation CI task fails. To fix this for now, update the version in your ValidationResultLoader creation to the name of your branch in the result repo.
After the typical process of code review, we can merge both PRs (just remember to change the version back).
In case you have to update results in the process, follow the same steps as above:
Rerun the script
Run poetry run poe update_validation_results
Commit and push the updated results
For the results repo, we might consider using a squash merge to reduce the final size of the repo.
TODO: old algorithm results, dummy algorithms