Feature/rank features group #546

Merged

Conversation

VladimirShitov
Collaborator

@VladimirShitov VladimirShitov commented Jun 14, 2023

PR Checklist

Description of changes
Previously, ep.tl.rank_features_groups ran its standard statistical test (e.g. the Wilcoxon rank-sum test) even on non-numerical features. This PR adds statistical tests developed specifically for categorical features (e.g. the Chi-square test).

Technical details

  • The same approach as in scanpy.tl.rank_genes_groups is used. For example, when the reference is set to "rest", the composition of a categorical variable in each subgroup of groupby is compared to its composition in all other groups pooled together. I would say this is not a common approach, but it is consistent with scanpy, which is used for numerical features.
  • The default test is the G-test, which is similar to the Chi-square test but should work better for groups with a small expected number of observations.
  • P-values should be treated with care. I would only use them for ranking marker features, and would re-run the statistical analysis in a conventional way to test actual hypotheses.
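
To illustrate the one-vs-rest comparison described above, here is a minimal sketch in Python. It is not the code from this PR: the helper name `one_vs_rest_categorical_test` and the toy data are hypothetical, and it assumes the comparison can be expressed as a contingency table passed to `scipy.stats.chi2_contingency`, where `lambda_="log-likelihood"` selects the G-test and `lambda_="pearson"` the Chi-square test.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def one_vs_rest_categorical_test(groups, feature, group, lambda_="log-likelihood"):
    """Compare the composition of `feature` in `group` with all other groups pooled.

    Hypothetical helper for illustration only; not part of ehrapy or scanpy.
    """
    # Relabel every observation outside `group` as "rest".
    labels = np.where(np.asarray(groups) == group, group, "rest")
    # Rows: group vs. rest; columns: categories of the feature.
    table = pd.crosstab(labels, np.asarray(feature))
    # lambda_="log-likelihood" gives the G-test, "pearson" the Chi-square test.
    # Note: SciPy applies Yates' continuity correction to 2x2 tables by default.
    stat, pvalue, dof, expected = chi2_contingency(table, lambda_=lambda_)
    return stat, pvalue


# Toy data (made up): three groups, one binary categorical feature.
groups = ["A"] * 30 + ["B"] * 30 + ["C"] * 30
feature = (
    ["yes"] * 25 + ["no"] * 5      # group A: mostly "yes"
    + ["yes"] * 5 + ["no"] * 25    # group B: mostly "no"
    + ["yes"] * 6 + ["no"] * 24    # group C: mostly "no"
)

g_stat, g_p = one_vs_rest_categorical_test(groups, feature, "A")
chi_stat, chi_p = one_vs_rest_categorical_test(groups, feature, "A", lambda_="pearson")
print(f"G-test:     statistic={g_stat:.2f}, p={g_p:.2e}")
print(f"Chi-square: statistic={chi_stat:.2f}, p={chi_p:.2e}")
```

In line with the caveat above, the p-values from such a pooled comparison are best read as a ranking score for candidate marker features rather than as a formal hypothesis test.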

@VladimirShitov
Collaborator Author

Note: function parameters are not validated as extensively as in scanpy. I might add this in upcoming commits.

@VladimirShitov VladimirShitov linked an issue Jun 14, 2023 that may be closed by this pull request
Member

@Zethson Zethson left a comment

You're a legend @VladimirShitov !

  1. Left lots of minor comments
  2. A file named _utils is usually a slight code smell, because ideally every piece of code should have a clear purpose. I would actually move ALL of the feature_ranks code, including this, into a _feature_ranks_groups.py. What do you think? IMO we have lots of customization now, so this would make sense.

Thank you so much.

ehrapy/tools/_datatypes.py (1 review comment, resolved)
ehrapy/tools/_scanpy_tl_api.py (4 review comments, resolved)
tests/tools/test_features_ranking.py (5 review comments, resolved)
@VladimirShitov
Collaborator Author

Thank you for your comments, Lukas! Please check the discussion above on renaming datatypes.py. Everything else is fixed.

Signed-off-by: zethson <[email protected]>
@Zethson Zethson merged commit 1574e6c into theislab:development Jun 20, 2023
@VladimirShitov VladimirShitov deleted the feature/rank-features-group branch June 20, 2023 13:52
Successfully merging this pull request may close these issues.

Add more statistical tests for comparing features
2 participants