Allow get_df on all data_types #887
Conversation
To be honest, I am wondering whether this PR is necessary, but I will let others also comment on it.
Hey, I think it's as necessary as the [...]. The changes in this PR are convenient for anyone who, like myself, prefers to work with pandas DataFrames, regardless of data type.
@lorenzomag Thanks a lot for initiating this. But I have the same opinion as @dachengx. In my understanding, the pandas DataFrame is designed on purpose for relational data, like SQL, which makes it different from a structured array. If we implement a trick like this, many functions intended for relational data frames may fail in an implicit way, which is not ideal. Therefore, I think we probably need a stronger argument to implement this: do you have any examples of how it can solve certain types of practical problems?
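For illustration (a minimal sketch of the kind of implicit failure described above, not taken from the thread; the column names are made up), operations that rely on hashable column values typically break once a column holds `ndarray` objects:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame where each element of "data" is an ndarray
df = pd.DataFrame({"time": [1, 2, 2],
                   "data": [np.zeros(3), np.ones(3), np.ones(3)]})

try:
    df.drop_duplicates()  # row de-duplication needs hashable values
except Exception as e:
    # On current pandas versions this typically fails because ndarrays are unhashable
    print(f"{type(e).__name__}: {e}")
```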
Hi @yuema137, thanks for your comment. I do understand the issue now. I don't have a past use case to point to; I am just more comfortable working with pandas and would love to have my data returned directly as a DataFrame regardless of data type, but I understand that you don't want to encourage this use of pandas.
@lorenzomag Hi Lorenzo, sorry for taking so long to get back to you...
@yuema137, don't worry, this PR is not urgent at all! I'll add that today.
Thanks @lorenzomag!
What is the problem / what does the code in this PR do
Pandas data structures can contain any type of Python object, so there should be no issue in having NumPy arrays as elements of a pandas DataFrame.
Currently, however, pandas' own conversion from structured arrays to DataFrames is not optimal (it simply does not work for multi-dimensional fields). While we wait for this to be fixed upstream, I propose a fix just for our code.
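For illustration, a minimal sketch of the limitation (not taken from the PR; the field names are made up and the exact exception depends on the pandas version):

```python
import numpy as np
import pandas as pd

# A structured array with a multi-dimensional field, similar to peak-type data
arr = np.zeros(3, dtype=[("time", np.int64), ("data", np.float32, (200,))])

try:
    pd.DataFrame(arr)  # pandas cannot build a column from the 2-D "data" field
except Exception as e:
    print(f"{type(e).__name__}: {e}")
```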
This PR adds a function to properly convert between NumPy structured arrays and pandas DataFrames, allowing the user to call the `get_df()` method on any data type. The PR also adds a simple test in `test_core.py`; further tests can be implemented if deemed necessary.
Can you briefly describe how it works?
For each column in the structured array, the function `convert_structured_array_to_dataframe()` in `strax/utils.py` checks the dimensionality of the column data (`col.ndim`). If `col.ndim > 1`, the column data is converted so that each row contains an `np.ndarray` object that can be held by the pandas DataFrame; otherwise, the column is left as is. For event-type data nothing changes, and the overhead is essentially zero. For loading peak-type n-dimensional data there may be a very slight increase in computation time to convert the structured array to a pandas DataFrame; however, this was not possible at all before this PR, so it is an improvement nonetheless. A rough sketch of this logic is shown below.
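A minimal sketch of the conversion logic described above (simplified, not the exact implementation from the PR):

```python
import numpy as np
import pandas as pd


def convert_structured_array_to_dataframe(arr: np.ndarray) -> pd.DataFrame:
    """Convert a structured array to a DataFrame, wrapping multi-dimensional
    fields as one np.ndarray object per row (sketch of the idea in this PR)."""
    columns = {}
    for name in arr.dtype.names:
        col = arr[name]
        if col.ndim > 1:
            # Store each row's sub-array as a single object so pandas can hold it
            columns[name] = list(col)
        else:
            # 1-D fields map directly onto regular DataFrame columns
            columns[name] = col
    return pd.DataFrame(columns)
```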
Can you give a minimal working example (or illustrate with a figure)?
Setup in `montecarlo-development` using strax as in the PR: both `arr` and `df` are populated with the same data, and the type of each element of the multi-dimensional columns equals `numpy.ndarray`.
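A self-contained version of this check might look as follows (a sketch using synthetic peak-like data instead of the original `montecarlo-development` setup; the field names are made up, and the import path assumes the branch from this PR is installed):

```python
import numpy as np
from strax.utils import convert_structured_array_to_dataframe  # added in this PR

# Synthetic "peak-like" data: a scalar field plus a multi-dimensional field
arr = np.zeros(3, dtype=[("time", np.int64), ("data", np.float32, (200,))])
arr["time"] = [10, 20, 30]
arr["data"][:, 0] = [1.0, 2.0, 3.0]

df = convert_structured_array_to_dataframe(arr)

# Both containers hold the same data ...
assert (df["time"].to_numpy() == arr["time"]).all()
assert np.array_equal(df["data"].iloc[0], arr["data"][0])

# ... and every element of the multi-dimensional column is a numpy.ndarray
print({type(x) for x in df["data"]})  # {<class 'numpy.ndarray'>}
```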