Populate make_classification_df date functionality with random dates #851

BhuvanashreeM · 2021-08-29T13:43:48Z

This fixes the date functionality of make_classification_df, as reported in #845.

Overview of the problem:

Running dask_ml.datasets.make_classification_df with a date range fills the date column with a single repeated date.
In dask.dataframe.from_array, setting chunksize equal to chunks (passed in through make_classification_df) populates the date column with NaN values.

Findings:

The helper function random_date works correctly: it generates a random date between the given start and end.
This line populates the list with the same date value repeated, rather than calling random_date len(X_df) times. Calling it once per row is the required fix.
Run Existing Tests
Code Formatting (black, flake8, isort)
Custom Tests: Added
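
The list-multiplication behaviour described in the findings can be reproduced without Dask. The following is a minimal sketch using only the standard library; the local `random_date` helper and its `rng` parameter are illustrative assumptions, not the dask_ml implementation.

```python
import random
from datetime import date, timedelta


def random_date(start, end, rng):
    """Illustrative helper: a date drawn uniformly from [start, end)."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randrange(span_days))


start, end = date(2014, 1, 1), date(2015, 1, 1)
n_rows = 100

# List multiplication: random_date runs once and its result is repeated.
rng = random.Random(0)
repeated = [random_date(start, end, rng)] * n_rows

# Comprehension: random_date runs once per row.
rng = random.Random(0)
varied = [random_date(start, end, rng) for _ in range(n_rows)]

print(len(set(repeated)))      # 1
print(len(set(varied)) > 1)    # True
```

The expression inside `[...] * n` is evaluated a single time, so multiplying the list only copies references to the same value.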

Requesting review and feedback on the proposed changes from the maintainers and @ScottMGustafson.

stsievert · 2021-08-29T23:10:57Z

dask_ml/datasets.py

@@ -451,8 +451,8 @@ def make_classification_df(
            [
                X_df,
                dd.from_array(
-                    np.array([random_date(*dates)] * len(X_df)),
-                    chunksize=chunks,
+                    np.array([random_date(*dates) for i in range(len(X_df))]),


Looks like there are some test errors because random_date doesn't take a random seed. Maybe this signature would help?

def random_date(start, end, random_state=None):
    rng = check_random_state(random_state)
    ...

Yes, that is a likely cause; I'll work on it. Also, the existing test for make_classification_df (test_make_classification_df) doesn't check for randomness. I'm thinking of adding a new test or modifying the existing one to cover it. Will that suffice?

BhuvanashreeM · 2021-08-31T11:25:20Z

Here's what I've done in the latest commit:

Added a seed (random_state) as an additional argument to random_date.
Added a simple test in test_datasets.py that checks randomness for a given seed. I could add more complex tests that check for randomness in the date column. Is that required? @stsievert

Another observation from earlier today: when I cloned the main repository and ran the existing tests without my changes, I got the same number of errors as I did after making my modification in the earlier commit. Could the maintainers look into this?

stsievert · 2021-08-31T12:59:42Z

tests/test_datasets.py

+        chunks=100,
+        dates=(date(2014, 1, 1), date(2015, 1, 1)),
+    )
+    check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute())


Good, this checks the random state.

Shouldn't this also check that there's more than one unique value? That's what #845 is focused on.

Yes, the code I've written checks for repeatability, on account of the seed. Since NumPy's randint is a deterministic pseudorandom number generator, we can be sure that it will produce a random set of numbers. I've ensured in the code that random_date is called multiple times, each time with a different seed. Hence I feel the line np.unique(X["date"]).size >= threshold would be redundant. I'm open to any thoughts you might have here, @stsievert.

The same random_seed=123 is passed to both calls to make_classification_df; I would expect every value in X_df["date"] and X_df1["date"] to be the same.

I think X["date"].nunique() >= threshold would be a lot simpler.

Okay, that sounds good. Now I have to figure out what a good threshold would be. Would n_samples/2 work?

Oh yeah, threshold=n_samples/2 is more than good. I think threshold=2 would suffice; that'd make sure #845 is resolved.
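
The two checks discussed in this thread (same seed gives the same column, and the column holds at least two distinct dates) could be combined in one test. This is a rough sketch only: `make_dates` is a hypothetical NumPy stand-in for the date column, not make_classification_df, and `len(set(...))` stands in for `X["date"].nunique()`.

```python
from datetime import date, timedelta

import numpy as np


def make_dates(n_samples, start, end, random_state=None):
    """Hypothetical stand-in for the date column: one draw per sample."""
    rng = np.random.RandomState(random_state)
    span_days = (end - start).days
    offsets = rng.randint(span_days, size=n_samples)
    return [start + timedelta(days=int(k)) for k in offsets]


def test_dates_are_reproducible_and_varied():
    kwargs = dict(n_samples=100, start=date(2014, 1, 1), end=date(2015, 1, 1))
    first = make_dates(random_state=123, **kwargs)
    second = make_dates(random_state=123, **kwargs)

    # Same seed, same column: the repeatability check.
    assert first == second
    # More than one distinct date: the regression check for #845.
    assert len(set(first)) >= 2


test_dates_are_reproducible_and_varied()
```

A threshold of 2 is deliberately loose: it fails only when every row carries the same date, which is the symptom reported in #845.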

stsievert · 2021-08-31T13:02:51Z

dask_ml/datasets.py

-                    chunksize=chunks,
+                    np.array(
+                        [
+                            random_date(*dates, random_state + i)


What happens when random_state isn't an integer? Scikit-learn allows for random_state to be an integer, None or an instance of np.random.RandomState (source).

Okay, I'll need to raise a ValueError there. I'm on it.

Why not use this code?

rng = check_random_state(random_state)
dates = [random_date(*dates, rng) for i in range(len(X_df))]
...

The code above will produce the same random number, since the seed (rng) remains the same in subsequent calls.

My main point: I think np.random.RandomState and None should be acceptable types for random_state. I'm fine expanding the for-loop, though I don't think that needs to happen:

[ins] In [193]: def random_dates(random_state):
           ...:     return random_state.randint(100)
           ...:

[ins] In [194]: rng = np.random.RandomState(42)

[ins] In [196]: [random_dates(rng) for _ in range(20)]
Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37]
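
The distinction at issue here, between re-seeding on every call and sharing one generator, can be shown side by side. The sketch below uses only np.random.RandomState; the variable names are illustrative.

```python
import numpy as np

# Rebuilding the generator from the same integer seed on every call
# restarts the sequence, so every draw is identical.
reseeded = [np.random.RandomState(42).randint(100) for _ in range(20)]

# Reusing one RandomState instance advances its internal state on each call,
# so successive draws move through the sequence.
rng = np.random.RandomState(42)
shared = [rng.randint(100) for _ in range(20)]

print(len(set(reseeded)))        # 1
print(reseeded[0] == shared[0])  # True: both start from the same state
print(len(set(shared)) > 1)      # True
```

This is why passing a single RandomState instance into a loop does not repeat values, whereas passing the same integer seed each time would.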

Maybe I can just check whether random_state is one of the accepted values, like this, and proceed accordingly:

if (
    random_state is not None
    or not isinstance(random_state, np.random.RandomState)
    or not isinstance(random_state, int)
):
    print("random_state is not to be accepted")

That runs counter to the use of random_state in Scikit-learn. random_date is public, so it should accept all types of random_state that Scikit-learn accepts.

If random_date were a private function, I wouldn't really care.

random_state : int, RandomState instance or None, optional (default=None). These are the values accepted by Scikit-learn's random_state. I think I can check whether random_state is one of the accepted types (Scikit-learn and NumPy) and, if it is not, set it to the default, None.

Scikit-learn's check_random_state function will likely be useful: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html

It takes those values and produces the correct random seed generator.
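
To make the normalisation concrete, here is a minimal sketch of a helper that accepts None, an int, or a RandomState. `check_random_state_sketch` is a simplified stand-in written for this example (the real sklearn.utils.check_random_state returns NumPy's global RandomState for None, which this stand-in does not replicate), and `random_date_sketch` is hypothetical rather than the code in this PR.

```python
import numbers
from datetime import date, timedelta

import numpy as np


def check_random_state_sketch(seed):
    """Simplified stand-in: turn None, an int, or a RandomState into a RandomState."""
    if seed is None:
        # Assumption: a fresh, OS-seeded generator instead of NumPy's global one.
        return np.random.RandomState()
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState")


def random_date_sketch(start, end, random_state=None):
    """Hypothetical helper: a date drawn uniformly from [start, end)."""
    rng = check_random_state_sketch(random_state)
    span_days = (end - start).days
    return start + timedelta(days=int(rng.randint(span_days)))


start, end = date(2014, 1, 1), date(2015, 1, 1)

# Normalise once, then share the generator across all calls.
rng = check_random_state_sketch(123)
dates = [random_date_sketch(start, end, rng) for _ in range(100)]
```

Because a RandomState instance is passed through unchanged, the helper needs no type check of its own and the caller controls whether state is shared between calls.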

@stsievert the accepted types of random_state in Scikit-learn and NumPy appear to be the same; Scikit-learn's version also accepts the values NumPy accepts.
See https://scikit-learn.org/dev/glossary.html#term-random_state

stsievert · 2021-09-03T14:29:04Z

dask_ml/datasets.py

@@ -381,10 +381,11 @@ def make_classification(
    return X, y


-def random_date(start, end):
+def random_date(start, end, random_state=None):
+    rng_random_date = dask_ml.utils.check_random_state(random_state)


Nit:

Suggested change

-    rng_random_date = dask_ml.utils.check_random_state(random_state)
+    rng_random_date = sklearn.utils.check_random_state(random_state)

That way the .compute() can be avoided (especially relevant on repeated calls).

stsievert · 2021-09-03T14:29:13Z

dask_ml/datasets.py

+        or not isinstance(random_state, np.random.RandomState)
+        or not isinstance(random_state, int)
+    ):
+        random_state = None


Why is the block necessary? None is already the default value for random_state.

This was to address the issue you mentioned earlier: "what if random_state is not an integer or any of the accepted values?"
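
One property of the type check in the hunk above is worth spelling out: when negated isinstance tests are joined with `or`, the condition holds for every input, because no value is both a RandomState and an int. The sketch below illustrates this with two hypothetical helper functions; it is not code from the PR.

```python
import numpy as np


def is_rejected_or_chain(random_state):
    # Mirrors the shape of the check in the hunk: negated tests joined by `or`.
    return (
        random_state is not None
        or not isinstance(random_state, np.random.RandomState)
        or not isinstance(random_state, int)
    )


def is_rejected_and_chain(random_state):
    # Joining the negated tests with `and` rejects only unsupported types.
    return (
        random_state is not None
        and not isinstance(random_state, np.random.RandomState)
        and not isinstance(random_state, int)
    )


candidates = [None, 123, np.random.RandomState(0), "not a seed"]
print([is_rejected_or_chain(c) for c in candidates])   # [True, True, True, True]
print([is_rejected_and_chain(c) for c in candidates])  # [False, False, False, True]
```

With the `or` form, a block that resets random_state to None would discard every seed passed in. Delegating to check_random_state, as suggested earlier in the thread, avoids the manual check altogether.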

fix populates make_classification_df with random dates

4b55f9a

BhuvanashreeM changed the title ~~Populate make_classification_df date functionality with random dates #845~~ Populate make_classification_df date functionality with random dates Aug 29, 2021

stsievert reviewed Aug 29, 2021


added-seed-to-random_date-and-modified-test_datasets

424fca4

stsievert reviewed Aug 31, 2021


BhuvanashreeM added 2 commits September 3, 2021 08:49

check-for-unique-values

b740b27

checks-for-random_state-type

24461fb

stsievert reviewed Sep 3, 2021


removed-redundant-compute-calls-in-random_date

3a7c9d1
