Unexpected behavior in train_test_split with shuffle=False #992

Open
divir94 opened this issue May 11, 2024 · 0 comments

When using train_test_split with shuffle=False on a Dask DataFrame, I see two issues: 1) the rows are shuffled anyway (the index order is not preserved), and 2) the train/test sizes are incorrect. The behavior matches neither sklearn nor dask_ml's own result when a pandas DataFrame is passed.

Minimal Complete Verifiable Example:
Setup

import pandas as pd
import numpy as np
import dask.dataframe as dd

from sklearn.model_selection import train_test_split as sk_train_test_split
from dask_ml.model_selection import train_test_split as dd_train_test_split

df = pd.DataFrame(np.random.rand(10, 3), columns=["y", "x1", "x2"])
ddf = dd.from_pandas(df, npartitions=5)

With sklearn.model_selection, order is maintained (i.e. no shuffle)

y = df["y"]
X = df[["x1", "x2"]]

X_train, X_test, y_train, y_test = sk_train_test_split(X, y, test_size=0.5, shuffle=False)
y_train, y_test
Output:
(0    0.166713
 1    0.961016
 2    0.483907
 3    0.979503
 4    0.553724
 Name: y, dtype: float64,
 5    0.158432
 6    0.078795
 7    0.440427
 8    0.673160
 9    0.657797
 Name: y, dtype: float64)

With dask_ml.model_selection using a pandas DataFrame, order is maintained (i.e. no shuffle)

y = df["y"]
X = df[["x1", "x2"]]

X_train, X_test, y_train, y_test = dd_train_test_split(X, y, test_size=0.5, shuffle=False)
y_train, y_test
Output:
(0    0.166713
 1    0.961016
 2    0.483907
 3    0.979503
 4    0.553724
 Name: y, dtype: float64,
 5    0.158432
 6    0.078795
 7    0.440427
 8    0.673160
 9    0.657797
 Name: y, dtype: float64)

With dask_ml.model_selection using a Dask DataFrame, order is NOT maintained and the train/test sizes are incorrect (6 and 4 rows instead of 5 and 5)

y = ddf["y"]
X = ddf[["x1", "x2"]]

X_train, X_test, y_train, y_test = dd_train_test_split(X, y, test_size=0.5, shuffle=False)
y_train.compute(), y_test.compute()
Output:
(0    0.166713
 1    0.961016
 2    0.483907
 3    0.979503
 8    0.673160
 9    0.657797
 Name: y, dtype: float64,
 4    0.553724
 5    0.158432
 6    0.078795
 7    0.440427
 Name: y, dtype: float64)
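
The two symptoms can also be checked programmatically. Below is a minimal sketch, not part of dask_ml or sklearn: check_unshuffled_split is a hypothetical helper that takes the index labels of the train and test parts. It assumes the original index is sorted (as with the default RangeIndex here) and that a float test_size is rounded up to a whole number of test rows, which is how sklearn sizes the test set.

```python
import math


def check_unshuffled_split(train_index, test_index, test_size):
    """Return a list of problems found in a split that should be unshuffled.

    Hypothetical helper for illustration; assumes the original index was sorted.
    """
    train = list(train_index)
    test = list(test_index)
    n_total = len(train) + len(test)
    problems = []

    # Assumption: the test set gets ceil(test_size * n_total) rows.
    expected_test = math.ceil(test_size * n_total)
    if len(test) != expected_test:
        problems.append(f"test set has {len(test)} rows, expected {expected_test}")

    # Without shuffling, train followed by test should reproduce the original order.
    combined = train + test
    if combined != sorted(combined):
        problems.append("train followed by test is not in the original order")

    return problems


# Index labels copied from the pandas and Dask outputs shown above.
print(check_unshuffled_split(range(0, 5), range(5, 10), test_size=0.5))
# → []
print(check_unshuffled_split([0, 1, 2, 3, 8, 9], [4, 5, 6, 7], test_size=0.5))
# reports both a size mismatch and an ordering problem
```

With real results, the helper could be called as check_unshuffled_split(y_train.compute().index, y_test.compute().index, test_size=0.5).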

Environment:

  • Dask version: 2023.11.0
  • Python version: 3.11.8
  • Operating System: macOS
  • Install method (conda, pip, source): micromamba