Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Populate make_classification_df date functionality with random dates #851

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

BhuvanashreeM
Copy link

@BhuvanashreeM BhuvanashreeM commented Aug 29, 2021

This fixes the date functionality of make_classification_df as mentioned in #845

Overview of the problem :

  • running dask_ml.datasets.make_classification_df with a date range provided, fills the date column with just one unique date
  • in dask.dataframe.from_array, specifying the chunksize to be equal to chunks (passed in through make_classification_df), populates the date column with NaN values.

Findings :

  • The helper function random_date works perfectly fine, generating a random date given the start and end

  • this line populates the list with the same date value, rather than calling the random_date function len(X_df) time, which is the required fix.

  • Run Existing Tests

  • Code Formatting (black, flake8, isort)

  • Custom Tests: Added

Seeking the maintainers and @ScottMGustafson to review/provide feedback on the proposed changes.

@BhuvanashreeM BhuvanashreeM changed the title Populate make_classification_df date functionality with random dates #845 Populate make_classification_df date functionality with random dates Aug 29, 2021
@@ -451,8 +451,8 @@ def make_classification_df(
[
X_df,
dd.from_array(
np.array([random_date(*dates)] * len(X_df)),
chunksize=chunks,
np.array([random_date(*dates) for i in range(len(X_df))]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's some test errors because random_dates doesn't have a random seed. Maybe this signature would help?

def random_dates(start, end, random_state=None):
    rng = check_random_state(random_state)
    ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that could be a probable reason, I'll work on it. And also, the test written for the function(make_classification_df) doesn't check for true randomness. I'm thinking of adding a new test/modifying the existing one for the same, will that suffice?
link to test_make_classification_df

@BhuvanashreeM
Copy link
Author

Here's what I've done in the latest commit :

  • included a seed(random_state) as an additional argument to the function random_date
  • added a simple test in (test_datasets.py) for checking randomness given the seed, I could add more complex tests for checking true randomness in the date column. Will that be requisite? @stsievert

Another observation I made earlier today was that: the main repository when cloned and existing tests run, I observed the same number of errors that I previously got when running the tests after making my modification in the earlier commit. Could the maintainers of the repository look into the same?

chunks=100,
dates=(date(2014, 1, 1), date(2015, 1, 1)),
)
check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, this checks the random state.

Shouldn't this also check that there's more than one unique value? That's what #845 is focused on.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the code I've written checks for repeatability, on account of the seed. Since the the numpy's randint function is a deterministic pseudo random number generator, we can be sure that it will produce a random set of numbers. I've ensured in the code that the function random_date is called multiple times, each with a different seed. Hence I feel the line np.unique(X["date"]).size >= threshold would be redundant. Open to any thoughts you might have here @stsievert

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same random_seed=123 is passed to both calls to make_classification_df; I would expect every value in X_df["date"] and X_df1["date"] to be the same.

I think X["date"].nunique() >= threshold would be a lot simpler.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that sounds good. Now I've to figure out, what would be a good threshold. Will n_samples/2 be good?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, threshold=n_samples/2 is more than good. I think threshold=2 would suffice; that'd make sure #845 is resolved.

chunksize=chunks,
np.array(
[
random_date(*dates, random_state + i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when random_state isn't an integer? Scikit-learn allows for random_state to be an integer, None or an instance of np.random.RandomState (source).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will have to raise a ValueError exception there, on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use this code?

rng = check_random_state(random_state)
dates = [random_date(*dates, rng) for i in range(len(X_df))]
...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code above will produce the same random number since the seed(rng) remains the same in subsequent calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main point: I think np.random.RandomState and None should be acceptable types for random_state. I'm fine expanding the for-loop, though I don't think that needs to happen:

[ins] In [193]: def random_dates(random_state):
           ...:     return random_state.randint(100)
           ...:

[ins] In [194]: rng = np.random.RandomState(42)

[ins] In [196]: [random_dates(rng) for _ in range(20)]
Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can maybe just check if random_state is one of the accepted values like this and accordingly proceed -
if random_state is not None or not isinstance(random_state, np.random.RandomState) or not isinstance(random_state,int): print("random_state is not to be accepted")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That runs counter to the use of random_state in Scikit-learn. random_date is public, so it should accept all types of random_state that Scikit-learn accepts.

If random_date were a private function, I wouldn't really care

Copy link
Author

@BhuvanashreeM BhuvanashreeM Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random_state : int, RandomState instance or None, optional (default=None), these are values accepted by Scikit-Learn's random_state. I think I can check if the random_state is in neither of the accepted values(Scitkit and Numpy) and set is as the default None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scikit-learn's check_random_state function will likely be useful: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html

It takes those values and produces the correct random seed generator.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stsievert the accepted types of random_state in Scikit and Numpy appear to be the same.
Refer this: Scikit's version also accepts Numpy's accepted values. Refer this: https://scikit-learn.org/dev/glossary.html#term-random_state

@@ -381,10 +381,11 @@ def make_classification(
return X, y


def random_date(start, end):
def random_date(start, end, random_state=None):
rng_random_date = dask_ml.utils.check_random_state(random_state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
rng_random_date = dask_ml.utils.check_random_state(random_state)
rng_random_date = sklearn.utils.check_random_state(random_state)

That way the .compute() can be avoided (especially relevant on repeated calls.).

or not isinstance(random_state, np.random.RandomState)
or not isinstance(random_state, int)
):
random_state = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the block necessary? None is already the default value for random_state.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to address the issue you mentioned earlier, "what if random_state is not an integer or any of the accepted values"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants