This is a place to discuss the process of re-validation for the core devs.
Some technical background
Validation is done using the tpcp validation concept. This means we have an algorithm wrapped by a pipeline and a scoring function wrapped by a Scorer. Both parts are then combined with a dataset by a function like tpcp.validate.validate (https://tpcp.readthedocs.io/en/latest/auto_examples/index.html#validation).
For the revalidation, we want the results included in the docs. For this, we store the results of the validation runs (aggregated performance metrics and certain raw results) in a separate git repo (https://github.com/mobilise-d/mobgap_validation).
We have a utility class (mobgap.data.validation_results.ValidationResultLoader) that can either use a local copy of this repo or download the result files using pooch from GitHub. When the documentation is built on ReadTheDocs, the latter is done, and the files required to display the results are pulled directly from the other GitHub repo.
Step by step guide to create a revalidation for a new algorithm block
Setup
1. Set up your local mobgap repo (if you have not done so yet): clone it, install poetry, and run poetry install --all-extras.
2. Download the official TVS dataset from Zenodo (https://zenodo.org/records/13987963) and unpack all subfolders into some local folder. We will refer to this location as "local_tvs_dataset_path".
3. Clone the validation result repo (https://github.com/mobilise-d/mobgap_validation) to a local folder as well. This local copy is used to load and store validation results later on.
4. Within your local mobgap repo folder (top-level, not the Python package), create a file called .env with the following content:
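A minimal sketch of what this file could look like. The variable names below are assumptions (they are not spelled out in this guide); double-check them against the existing revalidation scripts before relying on them:

```text
# NOTE: variable names are assumptions -- check the existing revalidation scripts for the exact names.
MOBGAP_TVS_DATASET_PATH="{local_tvs_dataset_path}"
MOBGAP_VALIDATION_DATA_PATH="{local_path_of_your_mobgap_validation_clone}"
# Set to 1 to use the local clone of the validation result repo instead of downloading.
MOBGAP_VALIDATION_USE_LOCAL_DATA=1
```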
The {...} should be replaced by the respective full path on your local computer.
5. Test that this setup works by opening revalidation/gait_sequences/_01_gsd_analysis.py and running it. Everything should run without error, and you should NOT see any download progress bars appear, indicating that your local copies of the validation files were used.
6. Test that examples/data/_04_tvs_data_no_exc.py can be executed without error.
7. When both pass, we can start writing code.
The required parts
Note that we will not cover including the results of the old algorithms in this section. This will be covered later.
The basic idea we will follow here is to get everything working with one algorithm, on one datapoint, using one performance metric, and then expand once we know everything is working.
Set up new git branches in mobgap and the validation results repo. Ideally use the same name in both. In both repos, run git switch -c {your branch name}.
Create the "boilerplate" folders and files:
Copy one of the existing folders in the revalidation folder and rename all references to your block name (within filenames and file content). Clean out the actual code files with the exception of the headlines.
Within the mobgap subfolder for your block (e.g. mobgap/cadence), create the two files evaluation.py and pipeline.py, in case they don't exist already.
Make the documentation aware of the new revalidation folder by updating docs/conf.py around line 200 with a reference to your new folder (see the sketch below).
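The revalidation scripts are rendered as part of the docs via sphinx-gallery (that is what the # %% blocks are for), so the change usually amounts to adding your folder to the gallery source/output lists. The snippet below is only a hypothetical sketch of what such an entry could look like; the actual option names and paths in mobgap's docs/conf.py may differ, so mirror the existing revalidation entries you find there:

```python
# Hypothetical sketch of a sphinx-gallery configuration in docs/conf.py.
# Look for the existing revalidation entries and add your block alongside them.
sphinx_gallery_conf = {
    "examples_dirs": [
        "../examples",
        "../revalidation/gait_sequences",
        "../revalidation/cadence",  # <- add the folder of your new block here
    ],
    "gallery_dirs": [
        "./auto_examples",
        "./auto_revalidation/gait_sequences",
        "./auto_revalidation/cadence",  # <- and the matching output folder here
    ],
}
```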
The pipeline
Create a new pipeline using the Emulation Pipeline of the stride length block as reference. Use reference information from the dataset to mock inputs. Have a look at the mobgap pipeline examples for more information about how to use the gait sequence iterator.
You can use your revalidation/{your_block_folder}/_02_{block}_result_generation_no_exc.py for testing. To test the pipeline, you need to load the TVS dataset and import the algorithm you want to test.
The stride length result generation script shows the relevant parts (loading the dataset, setting up the pipeline, and running it on a single datapoint) and can be used as a reference.
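To give a rough idea of the shape, here is a minimal sketch using the stride length block as an example. The class names (TVSFreeLivingDataset, SlZijlstra, SlEmulationPipeline) and parameters are assumptions and may not match the real code exactly -- copy the actual imports and setup from the stride length result generation script:

```python
import os
from pathlib import Path

from mobgap.data import TVSFreeLivingDataset  # assumed dataset class name
from mobgap.stride_length import SlZijlstra  # assumed algorithm class name
from mobgap.stride_length.pipeline import SlEmulationPipeline  # assumed pipeline class name

# Path to the local TVS dataset, as configured in your .env file (env var name is an assumption).
dataset = TVSFreeLivingDataset(
    Path(os.environ["MOBGAP_TVS_DATASET_PATH"]),
    reference_system="INDIP",  # the emulation pipeline mocks its inputs from the reference data
)

# Start with a single datapoint to keep the feedback loop fast.
datapoint = dataset[0]

pipeline = SlEmulationPipeline(SlZijlstra()).safe_run(datapoint)

# Inspect the result attributes (trailing underscore, as usual for tpcp pipelines)
# to check that the output looks reasonable; the exact attribute names depend on the block.
```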
If the pipeline works as expected, you can move on to the scorer.
Scorer
Scoring consists of a couple of steps. I would suggest starting with the evaluation example that already exists for all blocks (e.g. examples/stride_length/_02_sl_evaluation.py). These examples only show the evaluation "primitives". These are the building blocks to build the official scorer.
Next, you should check the per-block validation paper and identify the metrics used there. The scorer should at least output them. You can add further metrics that you think are useful.
The scoring itself will require two functions: the per-datapoint scorer and the final_agg function (see here for more info).
The per-datapoint scorer does most of the comparison. It first runs the pipeline and then extracts all relevant information from the pipeline and datapoint. Then you calculate all error metrics that are required. If there are error metrics that cannot be calculated individually on a per-datapoint level, make sure that you return all the relevant information for them wrapped in no_agg. This allows you to access that info in the final_agg function.
By convention, we also pass most of the raw information out of the scorer, in case people want to do further analysis based on that. Note that we prefix the raw results with raw__ to make it easier to filter them out of the results later (see the stride length scorer for examples).
This is a little bit complicated to explain. It is best to work through one of the existing scorers and play around with it.
Once you have a first iteration of the scoring function, you can create a scorer ({block}_score) like so: Scorer(per_datapoint_score, final_aggregation=final_agg).
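To make the moving parts a bit more concrete, below is a skeleton of what such a scorer could look like. This is not the real scorer of any block: the metric names and placeholder values are made up, the import location of no_agg may differ, and the exact final_agg signature is intentionally left out -- mirror the existing scorers (e.g. mobgap/stride_length/evaluation.py) for the real thing.

```python
from tpcp.validate import Scorer  # Scorer usage as described above
from tpcp.validate import no_agg  # assumed import location -- check the existing scorers


def per_datapoint_score(pipeline, datapoint):
    """Run the pipeline on one datapoint and calculate the per-datapoint metrics."""
    pipeline = pipeline.safe_run(datapoint)

    # Placeholder extraction -- replace with the actual pipeline/datapoint accessors.
    detected = 1.23  # e.g. a parameter value calculated by the pipeline
    reference = 1.20  # e.g. the matching value from the INDIP reference

    return {
        # Metrics that can simply be averaged across datapoints:
        "wb__error": detected - reference,
        # Raw values, prefixed with raw__ so they can be filtered out later:
        "raw__detected": no_agg(detected),
        # Information that can only be aggregated across all datapoints (e.g. for an ICC)
        # is also wrapped in no_agg and picked up again in final_agg:
        "for_final_agg": no_agg({"detected": detected, "reference": reference}),
    }


def final_agg(*args, **kwargs):
    """Aggregations that need the data of all datapoints at once.

    The exact signature is defined by tpcp -- copy it from one of the existing
    final_agg implementations instead of this placeholder.
    """
    raise NotImplementedError


# The official scorer for the block, created as described above:
my_block_score = Scorer(per_datapoint_score, final_aggregation=final_agg)
```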
Now you can test the scorer using mobgap.utils.evaluation.Evaluation. See examples/stride_length/_02_sl_evaluation.py for an example. I would suggest only testing on a couple of participants from the Free-Living dataset to get started (just add [:3] behind the dataset to run only the first 3 participants).
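A rough sketch of such a test run is shown below. The constructor arguments and method names of the evaluation class are assumptions -- copy the exact usage from examples/stride_length/_02_sl_evaluation.py:

```python
from mobgap.utils.evaluation import Evaluation  # name as used in the rest of this guide

# Keep the first test runs small: only use the first 3 participants of the Free-Living dataset.
small_dataset = dataset[:3]

# Assumed call signature -- mirror the stride length evaluation example for the real one.
evaluation = Evaluation(small_dataset, scoring=my_block_score)
evaluation.run(pipeline)
```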
Storing Results
As mentioned above, the validation results are stored in a second git repository. Now is a good time to check the setup again to make sure you have this secondary git repo cloned and the .env file set up correctly. Within the secondary git repo, make sure you have checked out the branch that you created there specifically for your block.
Like before, I would suggest testing the saving with just a couple of participants.
If you inspect the results you get from the participants (e.g. by running the Evaluation in the debugger), you will see that the results are a relatively complex data structure.
The docstrings in mobgap/gait_sequences/_evaluation_scorer.py and mobgap/stride_length/evaluation.py are a good starting point to better understand what kind of results to expect.
For more basic information the tpcp guide might also be helpful.
As we expect a similar scoring result structure and naming for all blocks in mobgap, we added some helper methods to simplify working with the validation results.
Specifically, you can call the following methods on the Evaluation instance: get_single_results_as_df, get_aggregated_results_as_df, get_raw_results (see docstrings for more information).
WARNING: When you inspect the aggregated results, you might see a lot of NaNs in the output. This is because the default aggregator across all recordings is just mean. If one of your scores is NaN for one of the participants, the overall value will be NaN. I am considering switching to nan-mean, but I am not sure if this is always the best option.
These should give you well-formatted results, assuming a certain structure of the scorer outputs.
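For example, assuming evaluation is the Evaluation instance after running (as in the sketch above):

```python
# Per-datapoint metrics (one row per recording/participant):
single_results_df = evaluation.get_single_results_as_df()

# Metrics aggregated across all recordings (watch out for the NaN behaviour described above):
aggregated_results_df = evaluation.get_aggregated_results_as_df()

# The raw (no_agg) outputs of the scorer:
raw_results = evaluation.get_raw_results()
```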
To simplify storing the results even further, we also provide mobgap.utils.evaluation.save_evaluation_results. This allows you to store a subset of the results in a proper subdirectory and file structure.
We use this function to store the results in the validation-result git repo:
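In the stride length validation, the call looks roughly like the sketch below. Apart from raw_result_filter (mentioned in the text), the argument names and values here are assumptions -- check the docstring of mobgap.utils.evaluation.save_evaluation_results and the stride length result generation script for the actual signature:

```python
import os
from pathlib import Path

from mobgap.utils.evaluation import save_evaluation_results

# Store the results inside your local clone of the validation result repo
# (path layout and env var name as set up in the .env file earlier -- both are assumptions).
base_result_path = Path(os.environ["MOBGAP_VALIDATION_DATA_PATH"]) / "results"

save_evaluation_results(
    evaluation,  # the Evaluation instance after .run(...)
    base_path=base_result_path,  # hypothetical argument name
    raw_result_filter=["wb__detected", "wb__reference"],  # illustrative: only keep the raw values you need
)
```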
Note that we use the raw_result_filter to only store the results that we are actually interested in. For example, in most cases the "raw-raw" results might not really be required for the evaluation. I would suggest starting by storing just the raw values you need. We can always add more. Rerunning the pipeline does not take that long.
Inspect the files that were stored on disk to check that everything is as expected.
For now, I would suggest not committing them yet and not creating the results for all recordings yet. Stick to your small test sample.
Presentation of the results
Presentation of the results happens in the _01_{block_name}_analysis.py file.
This file will be part of the documentation with the output of each block (blocks are defined using the # %% syntax) rendered.
This allows us to include tables (pandas DataFrames) and plots (generated with seaborn/matplotlib).
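For orientation, a block in such a script looks roughly like this (standard sphinx-gallery "percent" format; the comment block is rendered as prose, and the last expression of a code block is rendered as output):

```python
# %%
# Results
# -------
# Free text in this comment block is rendered as prose in the documentation.
import pandas as pd

example_table = pd.DataFrame({"metric": ["example"], "value": [0.0]})
example_table  # rendered as a table in the built docs
```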
We assume that technical people who are interested in the validation results will have a look at this. Hence, include sufficient information so that people can understand what they are seeing and what it means.
Ideally, it should read like an actual technical report with methods, results, discussion, etc. I would suggest a less formal tone, though. Write something you would like to read. Be transparent. There is nothing to hide!
To write this technical report, we need to load the results stored before.
For this we use the mobgap.data.validation_results.ValidationResultLoader.
This loader hides some magic. Specifically, it can either download the correct result files from the git repository, or use the local files you have available. This allows other people to run the file locally, without the hassle of cloning the second git repo.
With the structure shown below, we can dynamically switch between the local and the online version using env vars. Note that if you want to use online results from another branch than main, you need to change version. In the future, we will update the files to point to a fixed release and not just a branch.
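The referenced snippet is not reproduced here; the sketch below only shows the general pattern. The environment variable names and the constructor arguments of ValidationResultLoader are assumptions -- copy the exact code from one of the existing analysis scripts (e.g. revalidation/gait_sequences/_01_gsd_analysis.py):

```python
import os
from pathlib import Path

from mobgap.data.validation_results import ValidationResultLoader

# If the (assumed) env var is set, use the local clone of the validation result repo;
# otherwise the loader downloads the files from GitHub via pooch.
local_data_path = (
    Path(os.environ["MOBGAP_VALIDATION_DATA_PATH"]) / "results"
    if int(os.environ.get("MOBGAP_VALIDATION_USE_LOCAL_DATA", 0))
    else None
)

loader = ValidationResultLoader(
    "sl",  # hypothetical: identifier of the block's sub-folder in the result repo
    result_path=local_data_path,  # assumed argument name; None -> download instead of local files
    version="main",  # change this to your branch name to load results from a branch
)
```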
As the goal of this script is presentation, I would suggest renaming algorithms to proper names and ensuring that plot and table labels don't contain variable names, but properly formatted names.
(Yes, the existing validation scripts are not the best example of that yet)
To help the presentation of the results, we need to calculate a final level of aggregations (across all participants and per cohort).
For this we can use the existing apply_aggregations and apply_transformations helpers.
Writing them down "seems" a little wordy, but it makes it relatively easy to see what metrics are calculated and to expand the list.
Below, you can see all aggregations for SL:
from functools import partial

import pandas as pd

from mobgap.pipeline.evaluation import CustomErrorAggregations as A
from mobgap.utils.df_operations import (
    CustomOperation,
    apply_aggregations,
    apply_transformations,
)
from mobgap.utils.tables import FormatTransformer as F

custom_aggs = [
    CustomOperation(
        identifier=None,
        function=A.n_datapoints,
        column_name=[("n_datapoints", "all")],
    ),
    ("wb__detected", ["mean", A.conf_intervals]),
    ("wb__reference", ["mean", A.conf_intervals]),
    ("wb__error", ["mean", A.loa]),
    ("wb__abs_error", ["mean", A.conf_intervals]),
    ("wb__rel_error", ["mean", A.conf_intervals]),
    ("wb__abs_rel_error", ["mean", A.conf_intervals]),
    CustomOperation(
        identifier=None,
        function=partial(
            A.icc,
            reference_col_name="wb__reference",
            detected_col_name="wb__detected",
            icc_type="icc2",
        ),
        column_name=[("icc", "wb_level"), ("icc_ci", "wb_level")],
    ),
]

format_transforms = [
    CustomOperation(
        identifier=None,
        function=lambda df_: df_[("n_datapoints", "all")].astype(int),
        column_name="n_datapoints",
    ),
    *(
        CustomOperation(
            identifier=None,
            function=partial(
                F.value_with_range,
                value_col=("mean", c),
                range_col=("conf_intervals", c),
            ),
            column_name=c,
        )
        for c in [
            "wb__reference",
            "wb__detected",
            "wb__abs_error",
            "wb__rel_error",
            "wb__abs_rel_error",
        ]
    ),
    CustomOperation(
        identifier=None,
        function=partial(
            F.value_with_range,
            value_col=("mean", "wb__error"),
            range_col=("loa", "wb__error"),
        ),
        column_name="wb__error",
    ),
    CustomOperation(
        identifier=None,
        function=partial(
            F.value_with_range,
            value_col=("icc", "wb_level"),
            range_col=("icc_ci", "wb_level"),
        ),
        column_name="icc",
    ),
]

final_names = {
    "n_datapoints": "# participants",
    "wb__detected": "WD mean and CI [m]",
    "wb__reference": "INDIP mean and CI [m]",
    "wb__error": "Bias and LoA [m]",
    "wb__abs_error": "Abs. Error [m]",
    "wb__rel_error": "Rel. Error [%]",
    "wb__abs_rel_error": "Abs. Rel. Error [%]",
    "icc": "ICC",
}


def format_tables(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.pipe(apply_transformations, format_transforms)
        .rename(columns=final_names)
        .loc[:, list(final_names.values())]
    )
What we get at the end is a function that applies our calculations to any subset of our results. From there on, we can use basic pandas to group and apply methods to generate the tables we need. The main goal is to replicate the table presented in this paper.
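As a rough usage sketch (the grouping column and the name of the per-datapoint results DataFrame are placeholders; the data itself would come from the ValidationResultLoader):

```python
# Aggregate the per-datapoint results per cohort and format them into the final table.
per_cohort_table = (
    single_results.groupby("cohort")  # assumed column name -- adapt to your results
    .apply(apply_aggregations, custom_aggs)
    .pipe(format_tables)
)
```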
Add plots and further analysis at your discretion.
You can see what the rendered output looks like by running:
poetry run poe docs
poetry run poe docs_preview
And then navigate to the shown URL in the browser.
Final runs and saving results
Once you are happy with everything, change your script to use the entire dataset and run it.
Now you should have all results generated locally.
Then run:
poetry run poe update_validation_results
This will update the index of available result files.
The index is used by the Result-Downloader to check which files are available and to verify that they were downloaded correctly.
Commit your results and the new index and push. Create a PR in the validation result repo. In case you haven't done that yet, also create a PR for your code branch in the main repo.
You will see that the build documentation CI task fails. To fix this for now, update the version in your ValidationResultLoader creation to the name of your branch in the result repo.
After the typical process of code review, we can merge both PRs (just remember to change the version back).
In case you have to update results in the process, follow the same steps as above:
Rerun the script
Run poetry run poe update_validation_results
Commit and push the updated results
For the results repo, we might consider using a squash merge to reduce the final size of the repo.
TODO: old algorithm results, dummy algorithms