Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make automatic centering in PCA methods optional #808

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hristog
Copy link
Contributor

@hristog hristog commented Mar 20, 2021

  • Closes: Flag to skip centering in PCA #734.
  • Tests/black/pyflakes passing: Yes.
  • Tests added to the following files:**
    • tests/test_pca.py
    • tests/test_incremental_pca.py
  • tl;dr of design choices made:
    • IncrementalPCA doesn't get affected by the new parameter and IncrementalPCA(..., center=False ,...) is unsupported.
    • whiten=True does not imply, nor enforce center=True. Instead it warns about possible unexpected behavior, if the underlying data hasn't been centered already.

@hristog
Copy link
Contributor Author

hristog commented Mar 21, 2021

Once the maintainers confirm approval of the proposed changes, I'll be happy to update the .rst docs accordingly, to document the updated API, in a separate commit (as part of this PR).

dask_ml/decomposition/incremental_pca.py Outdated Show resolved Hide resolved
warnings.warn(
"Whitening requires centering. Please, ensure that your data "
"is already centered, in order to avoid unexpected behavior.",
RuntimeWarning,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users should have some way to silence this warning. I'm inclined to just document this requirement, and trust that the user has done it (no warning).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll remove the warning (the new param has been already added to the doc-string). Please, let me know if any other documentation-related changes are necessary. I've checked api.rst and compose.rst but it doesn't seem that we need any updates to either.

dask_ml/decomposition/pca.py Show resolved Hide resolved
hristog added a commit to hristog/dask-ml that referenced this pull request Mar 24, 2021
@hristog
Copy link
Contributor Author

hristog commented Mar 24, 2021

From the latest CI runs (e.g., this job).

black, version 19.10b0
would reformat /home/vsts/work/1/s/tests/test_pca.py
Oh no! 💥 💔 💥
1 file would be reformatted, 99 files would be left unchanged.

Oops - pretty sure that I did this check locally. Will re-run black to confirm.

Update: This is strange. I'll check if my local branch has actually propagated, and is the same as what the Azure pipeline has picked up.

Result of ./ci/code_checks.sh run locally from the root dir of `dask-ml`:
Checking flake8...
Checking flake8... DONE
Checking black...
black, version 20.8b1
would reformat /git/dask-ml/dask_ml/model_selection/utils_test.py
would reformat /git/dask-ml/tests/ensemble/test_blockwise.py
would reformat /git/dask-ml/tests/model_selection/test_keras.py
would reformat /git/dask-ml/tests/model_selection/test_hyperband.py
would reformat /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection.py
would reformat /git/dask-ml/tests/model_selection/test_incremental.py
Oh no! 💥 💔 💥
6 files would be reformatted, 94 files would be left unchanged.
Checking black... DONE
Checking isort...
5.6.4
ERROR: /git/dask-ml/tests/test_spectral_clustering.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/test_incremental_pca.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/linear_model/test_glm.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection_sklearn.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/docs/dimensions.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/dask_ml/_compat.py Imports are incorrectly sorted and/or formatted.
...

Update (2): Seems that it's the same code version, both locally and remotely.

Locally:

$ git show
commit 01dd9d4f71ee380e5b3e1a969f3bd46bebda6cd9 (HEAD -> 734-pca-skip-centering, origin/734-pca-skip-centering)
...
    Add param validation to PCA classes (#808)

Remotely: The most recent commit (and only one since Tom's review): 01dd9d4.

I'll try reproducing black's results via the same version as per the CI pipeline (black, version 19.10b0).

Update (3): Victory! It was indeed due to the different local version of black==black-20.8b1 vs remote (black==19.10b0, as per .pre-commit-config.yaml and ci/*.yaml).

@hristog hristog force-pushed the 734-pca-skip-centering branch 2 times, most recently from 266ae3f to 4c82add Compare March 31, 2021 09:23
@hristog
Copy link
Contributor Author

hristog commented Apr 8, 2021

Hi @TomAugspurger - I'm writing to check if you've had a chance to look into the further changes that I pushed since your review, and whether you'd like any additional refinements to be done on top of that.
Thanks!

@TomAugspurger
Copy link
Member

It'll be a week or so before I can take a look. Thanks for your patience!

In the meantime, perhaps @jameslamb has a chance to glance through this (no worries if you don't though)?

@hristog
Copy link
Contributor Author

hristog commented Apr 12, 2021

It'll be a week or so before I can take a look. Thanks for your patience!

It's all good - just wanted to check if this is still on your radar. I'm happy to wait as long as necessary, and incorporate potential further feedback.
Thanks for your confirmation, @TomAugspurger!

@hristog hristog requested a review from jameslamb April 12, 2021 21:15
Copy link
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some time to go through and review today. Overall, I'm in favor of the change and I can understand how it could be a nice optimization for experienced users.

Please consider some of the review comments I left.

I also think it would improve understanding for people who find this pull request from search or see it in a changelog if you change the PR title to something like "Make automatic centering in PCA methods optional". By itself, the fact that there is now a parameter called center doesn't tell you much about the functional benefit of the changes in this pull request.

@@ -127,6 +127,7 @@ def __init__(
self,
n_components=None,
whiten=False,
center=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing new arguments in the middle of a signature is a breaking change for anyone using these classes with positional arguments. Can you please put this new argument at the end of the signature instead of in the middle?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hristog you marked this as resolved, but I don't think it is yet. Could you put copy back between whiten and batch_size and make center the last keyword argument?

Comment on lines 63 to 77
center : bool, optional (default True)
When False (True by default), the underlying data gets centered at zero
by subtracting the mean of the data from the data itself.

PCA is performed on centered data due to its being a regression model,
without an intercept. As such, its pricipal components originate at the
origin of the transformed space.

`center` set to False may be employed when performing PCA on already
centered data.

Since centering is a required step as part of whitening, `center` set
to False and `whiten` set to True is a combination which may result in
unexpected behavior, if performed on not previously centered data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
center : bool, optional (default True)
When False (True by default), the underlying data gets centered at zero
by subtracting the mean of the data from the data itself.
PCA is performed on centered data due to its being a regression model,
without an intercept. As such, its pricipal components originate at the
origin of the transformed space.
`center` set to False may be employed when performing PCA on already
centered data.
Since centering is a required step as part of whitening, `center` set
to False and `whiten` set to True is a combination which may result in
unexpected behavior, if performed on not previously centered data.
center : bool, optional (default True)
When True (the default), the underlying data gets centered at zero
by subtracting the mean of the data from the data itself.
PCA is performed on centered data due to its being a regression model,
without an intercept. As such, its principal components originate at the
origin of the transformed space.
``center=False`` may be employed when performing PCA on already
centered data.
Since centering is a required step as part of whitening, ``center`` set
to False and ``whiten`` set to True is a combination which may result in
unexpected behavior, if performed on not previously centered data.

I believe there are a few mistakes in this documentation as it's currently written.

  • two backticks should be used for inline code formatting, since these docstrings are interpreted as reStructuredText by Sphinx
  • the docstring should say that the default behavior (center=True) is to automatically center the data (it currently says the opposite)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docstring should say that the default behavior (center=True) is to automatically center the data (it currently says the opposite)

Well spotted, thanks!
Updating as per the double backtick suggestion. I'll update the rest of the affected docstrings, to conform to the same guideline (unless we'd like to do that in a separate, dedicated PR?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I was a maintainer here, I'd prefer that a docstring-only change be made in a separate pull request, and that this comment only apply to the lines you are already adding / changing as of this pull request

@@ -149,21 +164,28 @@ class PCA(sklearn.decomposition.PCA):
>>> dX = da.from_array(X, chunks=X.shape)
>>> pca = PCA(n_components=2)
>>> pca.fit(dX)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
PCA(n_components=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why were these changes necessary? They seem unrelated to the current pull request. Could you please revert them or explain why they're necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really depends whether those are considered to be demonstrative examples (in doctest style) or actual doctests which are meant to pass.

If the case is the latter, then the proposed changes should probably still hold.

For example, in their original state, these doctests fail when run with:

py.test --verbose --doctest-modules dask_ml/

If, on the other hand, they're just meant to serve for demonstration only (section 15 of the numpydoc standard), then I'll be definitely happy to revert back these changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is adding one new parameter with a default value set to the current behavior. I wouldn't expect any tests to fail, and if they are I'd like to know more about what failed (a link to logs, if you have it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the doctests get run as part of the CI steps, and I imagine the doctests haven't been updated for quite some time (haven't checked with git blame) or the intention has never been that they get run as actual first-class-citizen test (but, instead, serve for demonstration).

E.g., when running against the latest main:

$ py.test --verbose --doctest-modules dask_ml/
...

dask-ml/dask_ml/wrappers.py:92: DocTestFailure
___________________ [doctest] dask_ml.decomposition.pca.PCA ____________________
142 
143     Examples
144     --------
145     >>> import numpy as np
146     >>> import dask.array as da
147     >>> from dask_ml.decomposition import PCA
148     >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
149     >>> dX = da.from_array(X, chunks=X.shape)
150     >>> pca = PCA(n_components=2)
151     >>> pca.fit(dX)
Expected:
    PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
      svd_solver='auto', tol=0.0, whiten=False)
Got:
    PCA(n_components=2)
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, ok I see. I'm not too familiar with that type of test, apologies. But it does seem unrelated to the change in this pull request, so I think it would be better to revert those changes here and open a separate issue / PR about the general topic of "doctest tests are failing / not run"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall if we run doctests here or not. I wouldn't worry about that failing one. Scikit-Learn changed the repr of their objects to only show the parameters that are not the default, so the docstring changed.

dask_ml/decomposition/pca.py Outdated Show resolved Hide resolved
dask_ml/decomposition/pca.py Outdated Show resolved Hide resolved
self.copy = copy
self.batch_size = batch_size
self.svd_solver = svd_solver
self.iterated_power = iterated_power
self.random_state = random_state

def _check_params(self):
super()._check_params()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of defining an empty _check_params() on the parent class and then calling it here? In my opinion, that is adding a bit of additional indirection for no functional benefit. Could you please remove this line and the parent class's empty _check_params()?

Copy link
Contributor Author

@hristog hristog Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose is to future-proof this by ensuring any checks that get eventually introduced in the superclass, get considered at this level as well.
Happy to remove the perceived as redundant check, at the current stage. Just my opinion is that it introduces a bit more exposure to bugs, if the class hierarchy doesn't get utilized in cases like this. But it may as well be done at a further stage.

@hristog
Copy link
Contributor Author

hristog commented Apr 12, 2021

I had some time to go through and review today. Overall, I'm in favor of the change and I can understand how it could be a nice optimization for experienced users.

Please consider some of the review comments I left.

I also think it would improve understanding for people who find this pull request from search or see it in a changelog if you change the PR title to something like "Make automatic centering in PCA methods optional". By itself, the fact that there is now a parameter called center doesn't tell you much about the functional benefit of the changes in this pull request.

Thanks for your review, @jameslamb! I'll try to address your comments and make changes, wherever applicable, ASAP.

@hristog hristog changed the title Introduce a center param to PCA (#734) Make automatic centering in PCA methods optional Apr 12, 2021
hristog added a commit to hristog/dask-ml that referenced this pull request Apr 15, 2021
Copy link
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that https://github.com/dask/dask-ml/pull/808/files#r615299186 is the only outstanding issue. Good to go once that's fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flag to skip centering in PCA
3 participants