
K-fold v2 #315

Merged 26 commits into main on Nov 7, 2021
Conversation

@jorshi (Contributor) commented Nov 4, 2021

Updates the hear eval pipeline to support k-fold datasets. This depends on the preprocessed data updates proposed in hearbenchmark/hear-preprocess#100 -- review those first. The small datasets used for tests here will need to be updated in order for the tests to run properly.

This builds off of @khumairraj's PR #310

@@ -700,7 +705,7 @@ def label_vocab_nlabels(embedding_path: Path) -> Tuple[pd.DataFrame, int]:


def dataloader_from_split_name(
Contributor:
Great :)

Contributor Author:

All @khumairraj :)

fold as the train split.
Folds will be sorted before applying the above strategy
Total data splits will be equal to n, n being the total number of folds.
Each fold will be tested by training on the remaining folds.
Contributor:

I'd just write an explicit example of what is returned, key and values

Contributor:

or say:

"""
With 5-fold, for example, we would have:
test=fold1, val=fold2, train=fold3..5,
test=fold2, val=fold3, train=fold4..5,1,
...
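The 5-fold example above pins down the rotation: sorted folds, each fold takes a turn as test, the next fold (cyclically) is validation, and the rest are training. A minimal sketch of generating those splits; the helper name `kfold_splits` and the train/valid/test keys are illustrative, not the PR's actual code:

```python
from typing import Dict, List


def kfold_splits(folds: List[str]) -> List[Dict[str, List[str]]]:
    """Rotate sorted folds into test/valid/train assignments.

    For each of the n folds: that fold is test, the next fold
    (cyclically) is validation, and the remaining n - 2 folds train.
    """
    folds = sorted(folds)
    n = len(folds)
    splits = []
    for i in range(n):
        splits.append(
            {
                "test": [folds[i]],
                "valid": [folds[(i + 1) % n]],
                "train": [folds[(i + j) % n] for j in range(2, n)],
            }
        )
    return splits
```

With five folds this reproduces the example above: the second split is test=fold2, valid=fold3, train=[fold4, fold5, fold1].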

@@ -925,6 +1028,8 @@ def task_predictions(
seed_everything(42, workers=False)

metadata = json.load(embedding_path.joinpath("task_metadata.json").open())
metadata["mode"] = "folds" # remove me, only for testing
Contributor:

What is the convention that was defined? Now I'm forgetting

Contributor Author:

I don't think we have a convention defined yet -- need to check this, because metadata["mode"] might not actually be defined for the existing open tasks.

Contributor Author:

Updated this so that if "fold" is in the metadata (as a list of fold string names) it will perform k-fold. Otherwise it works as before with pre-defined splits.
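A sketch of that dispatch, assuming the metadata key is literally "fold" as the comment says; the function name and the fixed-split fallback shape are illustrative assumptions, not the PR's actual code:

```python
from typing import Any, Dict, List


def data_splits_from_metadata(metadata: Dict[str, Any]) -> List[Dict[str, List[str]]]:
    """Return one split assignment per evaluation run.

    If the task metadata carries a "fold" list of fold names, rotate
    them k-fold style; otherwise fall back to the usual fixed
    train/valid/test partition.
    """
    folds = metadata.get("fold")
    if isinstance(folds, list) and all(isinstance(f, str) for f in folds):
        folds = sorted(folds)
        n = len(folds)
        return [
            {
                "test": [folds[i]],
                "valid": [folds[(i + 1) % n]],
                "train": [folds[(i + j) % n] for j in range(2, n)],
            }
            for i in range(n)
        ]
    # Pre-defined splits: a single run over the canonical partition
    return [{"train": ["train"], "valid": ["valid"], "test": ["test"]}]
```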

"embedding_path": str(embedding_path),
}
)

Contributor:

The function is getting a bit long, can we break it down?

Can we save the scores of the different folds?
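On saving the scores of the different folds: one simple shape for this, not necessarily what the PR ended up doing, is to keep per-fold test metrics and summarize them as mean and standard deviation across folds:

```python
import statistics
from typing import Dict, List


def aggregate_fold_scores(
    fold_scores: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Summarize per-fold metrics: {fold: {metric: value}} -> {metric: {mean, stddev}}."""
    by_metric: Dict[str, List[float]] = {}
    for scores in fold_scores.values():
        for name, value in scores.items():
            by_metric.setdefault(name, []).append(value)
    return {
        name: {
            "mean": statistics.mean(values),
            "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        for name, values in by_metric.items()
    }
```

Keeping the raw per-fold dict alongside the summary makes it easy to dump both to the results JSON.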

Contributor Author:

Did my best to pull some things out without going crazy.

Contributor:

I'm still kinda concerned tho, it's really long. Any way we can go further?

@jorshi jorshi changed the title [WIP] K-fold v2 K-fold v2 Nov 5, 2021
@khumairraj (Contributor) left a comment:

Thanks for completing this @jorshi. It looks much cleaner.
Also, there was some code duplication before, mostly around finding the best grid point and then retraining on all the data (train + val, without validation steps like early stopping) using the characteristics of that best grid point (including its early-stopping epoch). I see that this PR removes that behavior. It was introducing lots of conditions, and it is good that we removed it.

!= json.load(open(task_path.joinpath("test.embedding-dimensions.json")))[1]
):
# Ensure all embedding sizes are the same across splits/folds
embedding_size = embedding_sizes[0]
Contributor:

Do we even need embedding_size anymore, then, if we have embedding_sizes?

metadata: Dict[str, Any],
data_splits: Dict[str, List[str]],
Contributor:

What is data_splits? Maybe it's worth having a docstring here? Or is this called by another function with the same set of parameters?

I'm also wondering whether it even makes sense to have a class take such a long list of parameters, or whether it could be divided up in a sensible way.


json.load(embedding_path.joinpath(f"{split_name}.json").open())
)
test_target_events = {}
for split_name in data_splits["test"]:
Contributor:

Oh man this seems really confusing, particularly if you understand the other pattern from hear-preprocess that splits is a list, not a dict of lists. How do we fix this and make it clearer?
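One way to contain that confusion, as a sketch: normalize both conventions into a single shape at one entry point, so the rest of the file only ever sees a dict of split name to fold names. The helper name is hypothetical:

```python
from typing import Dict, List, Union


def normalize_data_splits(
    splits: Union[List[str], Dict[str, List[str]]]
) -> Dict[str, List[str]]:
    """Map both split conventions onto one shape: {split: [fold names]}.

    The hear-preprocess convention passes a flat list such as
    ["train", "valid", "test"]; the k-fold path already passes a dict
    like {"test": ["fold1"], ...}. After this call, everything is a dict.
    """
    if isinstance(splits, dict):
        return splits
    # A plain split name doubles as its own single "fold"
    return {name: [name] for name in splits}
```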

@@ -845,7 +885,7 @@ def task_predictions_train(
logger=logger,
)
train_dataloader = dataloader_from_split_name(
"train",
data_splits["train"],
Contributor:

Again this stuff is so confusing. Can we put all this weird logic in one place with a clear explanation, so this weird pattern doesn't occur throughout this file? The fewer patterns you have to remember, the better
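A sketch of putting that logic in one place: a single helper that accepts either the old single split name or a list of fold names, so the str-vs-list handling lives in exactly one spot. Names and data shapes here are illustrative, not the actual refactor:

```python
from typing import Dict, List, Sequence, Union


def gather_split(
    split: Union[str, Sequence[str]], data_by_fold: Dict[str, List]
) -> List:
    """Concatenate the data for a logical split.

    Accepts either a single split name ("train") or a list of fold
    names (data_splits["train"]), so callers never branch on the type.
    """
    names = [split] if isinstance(split, str) else list(split)
    out: List = []
    for name in names:
        out.extend(data_by_fold[name])
    return out
```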

@turian turian merged commit 9547ddb into main Nov 7, 2021
@turian turian deleted the add_kfold_3 branch November 7, 2021 21:46