-
-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Populate make_classification_df date functionality with random dates #851
base: main
Are you sure you want to change the base?
Populate make_classification_df date functionality with random dates #851
Conversation
dask_ml/datasets.py
Outdated
@@ -451,8 +451,8 @@ def make_classification_df( | |||
[ | |||
X_df, | |||
dd.from_array( | |||
np.array([random_date(*dates)] * len(X_df)), | |||
chunksize=chunks, | |||
np.array([random_date(*dates) for i in range(len(X_df))]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's some test errors because random_dates
doesn't have a random seed. Maybe this signature would help?
def random_dates(start, end, random_state=None):
rng = check_random_state(random_state)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that could be a probable reason, I'll work on it. And also, the test written for the function(make_classification_df
) doesn't check for true randomness. I'm thinking of adding a new test/modifying the existing one for the same, will that suffice?
link to test_make_classification_df
Here's what I've done in the latest commit :
Another observation I made earlier today was that: the main repository when cloned and existing tests run, I observed the same number of errors that I previously got when running the tests after making my modification in the earlier commit. Could the maintainers of the repository look into the same? |
chunks=100, | ||
dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
) | ||
check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good, this checks the random state.
Shouldn't this also check that there's more than one unique value? That's what #845 is focused on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the code I've written checks for repeatability, on account of the seed. Since the the numpy's randint
function is a deterministic pseudo random number generator, we can be sure that it will produce a random set of numbers. I've ensured in the code that the function random_date
is called multiple times, each with a different seed. Hence I feel the line np.unique(X["date"]).size >= threshold
would be redundant. Open to any thoughts you might have here @stsievert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same random_seed=123
is passed to both calls to make_classification_df
; I would expect every value in X_df["date"]
and X_df1["date"]
to be the same.
I think X["date"].nunique() >= threshold
would be a lot simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that sounds good. Now I've to figure out, what would be a good threshold. Will n_samples/2
be good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, threshold=n_samples/2
is more than good. I think threshold=2
would suffice; that'd make sure #845 is resolved.
chunksize=chunks, | ||
np.array( | ||
[ | ||
random_date(*dates, random_state + i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when random_state
isn't an integer? Scikit-learn allows for random_state
to be an integer, None
or an instance of np.random.RandomState
(source).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I will have to raise a ValueError exception there, on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use this code?
rng = check_random_state(random_state)
dates = [random_date(*dates, rng) for i in range(len(X_df))]
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code above will produce the same random number since the seed(rng
) remains the same in subsequent calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main point: I think np.random.RandomState
and None
should be acceptable types for random_state
. I'm fine expanding the for-loop, though I don't think that needs to happen:
[ins] In [193]: def random_dates(random_state):
...: return random_state.randint(100)
...:
[ins] In [194]: rng = np.random.RandomState(42)
[ins] In [196]: [random_dates(rng) for _ in range(20)]
Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can maybe just check if random_state is one of the accepted values like this and accordingly proceed -
if random_state is not None or not isinstance(random_state, np.random.RandomState) or not isinstance(random_state,int): print("random_state is not to be accepted")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That runs counter to the use of random_state
in Scikit-learn. random_date
is public, so it should accept all types of random_state
that Scikit-learn accepts.
If random_date
were a private function, I wouldn't really care
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
random_state : int, RandomState instance or None, optional (default=None), these are values accepted by Scikit-Learn's random_state
. I think I can check if the random_state is in neither of the accepted values(Scitkit and Numpy) and set is as the default None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scikit-learn's check_random_state
function will likely be useful: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html
It takes those values and produces the correct random seed generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stsievert the accepted types of random_state
in Scikit and Numpy appear to be the same.
Refer this: Scikit's version also accepts Numpy's accepted values. Refer this: https://scikit-learn.org/dev/glossary.html#term-random_state
dask_ml/datasets.py
Outdated
@@ -381,10 +381,11 @@ def make_classification( | |||
return X, y | |||
|
|||
|
|||
def random_date(start, end): | |||
def random_date(start, end, random_state=None): | |||
rng_random_date = dask_ml.utils.check_random_state(random_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
rng_random_date = dask_ml.utils.check_random_state(random_state) | |
rng_random_date = sklearn.utils.check_random_state(random_state) |
That way the .compute()
can be avoided (especially relevant on repeated calls.).
dask_ml/datasets.py
Outdated
or not isinstance(random_state, np.random.RandomState) | ||
or not isinstance(random_state, int) | ||
): | ||
random_state = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the block necessary? None
is already the default value for random_state
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to address the issue you mentioned earlier, "what if random_state is not an integer or any of the accepted values"
This fixes the date functionality of make_classification_df as mentioned in #845
Overview of the problem :
dask_ml.datasets.make_classification_df
with a date range provided, fills the date column with just one unique datechunksize
to be equal tochunks
(passed in through make_classification_df), populates the date column with NaN values.Findings :
The helper function
random_date
works perfectly fine, generating a random date given the start and endthis line populates the list with the same date value, rather than calling the
random_date
functionlen(X_df)
time, which is the required fix.Run Existing Tests
Code Formatting (black, flake8, isort)
Custom Tests: Added
Seeking the maintainers and @ScottMGustafson to review/provide feedback on the proposed changes.