Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds on-the-fly enabling of function argument casting #14

Conversation

AartGoossens
Copy link
Contributor

@AartGoossens AartGoossens commented Mar 14, 2018

Unittests should make clear how it can be used.
I don't necessarily like the implementation (not very transparent to the end user what's happening and it's not very Pythonic) but I realize it serves a purpose for some use-cases. From that perspective I like this approach way more than if isinstance(..) then logic in every algorithm.

@AartGoossens AartGoossens changed the title Adds on-the-fly enabling of function argument casting POC DO NOT MERGE - Adds on-the-fly enabling of function argument casting Mar 14, 2018
@AartGoossens AartGoossens requested a review from sladkovm March 14, 2018 15:54
@AartGoossens AartGoossens force-pushed the feature/on_the_fly_input_argument_casting branch from 3a456da to 24565c8 Compare March 14, 2018 19:15
@AartGoossens
Copy link
Contributor Author

The namedtuple in sweat.algorithms.metrics.core doesn't like to play with my approach. I think it's fixable so ignore it for the sake of the discussion. ;)

@sladkovm
Copy link
Contributor

Mind twisting, but cool implementation. We can try to keep it and see if it sticks. The main added value, of course, is a support of list-2-list functionality, which is only relevant, if you write quick API gateways.

What I'm still puzzled about is how NumPy behaves:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: type(np.log(np.arange(1,5)))
Out[3]: numpy.ndarray

In [4]: type(np.log(pd.Series(np.arange(1,5))))
Out[4]: pandas.core.series.Series

It returns the same type as an input!

Pandas, on the other hand, does not do it. In fact, to make use of most of the functions that pandas provide you first need to convert your array-like into Stream/DataFrame and I don't know what is the best strategy to tackle this. My original approach was to add typecasting at the end of each function.

@AartGoossens
Copy link
Contributor Author

I tried to figure out how NumPy does this but it looks like they do that in some c library... I don't want to go that way (if only because I'm not comfortable with c...).
If I can make sure it cannot cause any race conditions during testing I think I'm fine with merging the approach in this PR.

About the pandas.Series methods that we're currently using (like pandas.Series.rolling): I think most of these methods on the Series class use NumPy methods in the background so I think it should be fairly straightforward to move away from depending on these and use only NumPy methods for this. The result of that will be that by design we will accept only numpy.ndarrays as input( but allow type casting via the code in this PR). Oh and that will actually fix type casting for e.g. multiple_best_intervals because now it will cast to ndarray which will break that method.

@sladkovm
Copy link
Contributor

There are some methods in pandas that are not available in NumPy. One notable example is the emwa rolling mean that was deprecated in NumPy. I remember I've read once on the stack that it was a decision of NumPy to focus on a low level handling of manipulation arrays in memory while leaving all statistical computations for other libraries.

Another example, that would be a shame not to implement in pandas is anything that has to do with grouping values. Here is how time in zones is calculated in core.py:

z = pd.Series(compute_zones(power, **kwargs))
tiz = z.groupby(z).count()

@AartGoossens
Copy link
Contributor Author

That's interesting. It makes sense for Numpy to stay more low-level indeed.
So basically for some algorithms we have to also rely on these methods on pandas.Series/DataFrame. For these algorithms we cannot avoid converting the input argument(s) to pd.Series, since only accepting pd.Series as input arguments for these algorithms would make the interface really confusing.

Is this then the solution?

  • All algorithms are designed to work with numpy.ndarray
  • ...although probably most of them will accept pd.Series or list as input arguments (a side effect of how numpy ufuncs work) but output will still be numpy.ndarray
  • There is a utility method enable_type_casting() available that enables on-the-fly-type-casting of input and output arguments to/from pd.Series and list.
  • Algorithms that require pd.Series specific methods need to convert numpy.ndarray to pd.Series themselves. Output should still be numpy.ndarray.

@sladkovm
Copy link
Contributor

This indeed sounds like a workable solution to me. And what lean startup teaches us - let the users do the talking ;)

@AartGoossens AartGoossens force-pushed the feature/on_the_fly_input_argument_casting branch from 24565c8 to 845bff9 Compare March 30, 2018 10:46
@AartGoossens
Copy link
Contributor Author

AartGoossens commented Mar 30, 2018

@sladkovm I finally found time to finish this PR with the input of our discussions. Can you take a look to see if I missed anything?
I also took the liberty to refactor some of the method signatures in metrics/core.py. Please let me know if you don't agree with the changes.

@AartGoossens AartGoossens changed the title POC DO NOT MERGE - Adds on-the-fly enabling of function argument casting Adds on-the-fly enabling of function argument casting Mar 30, 2018
@AartGoossens AartGoossens force-pushed the feature/on_the_fly_input_argument_casting branch from 845bff9 to 5cb2065 Compare March 30, 2018 11:11
@AartGoossens AartGoossens force-pushed the feature/on_the_fly_input_argument_casting branch from 5cb2065 to cc7a80f Compare March 30, 2018 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants