Remove seemingly equivalent ways of producing subsets of DataFrames #323
Comments
+1 on both proposals.
I like the idea. I don't really like the name On Sun, Jul 14, 2013 at 9:14 AM, Viral B. Shah [email protected]:
One possibility is renaming
I do like the name
+1 for
Here's the semantics I'm thinking to implement. Indexing behaves as:
Select always returns a SubDataFrame because it is equivalent to df[RowIndex, All Columns].
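To make the copy-versus-view distinction concrete, here is a minimal sketch in plain Python rather than Julia. `Table`, `RowView`, `copy_rows`, and `view_rows` are hypothetical names invented for this illustration; they are not the DataFrames API, and the sketch only assumes that a SubDataFrame-like object keeps a reference to its parent plus a list of row indices.

```python
class Table:
    """Toy column-oriented table: maps column name -> list of values.

    Hypothetical stand-in for a DataFrame, for illustration only.
    """

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def copy_rows(self, rows):
        # Copying subset: builds new lists, independent of the parent.
        return Table({name: [values[i] for i in rows]
                      for name, values in self.columns.items()})

    def view_rows(self, rows):
        # View subset: stores only the parent and the selected row indices.
        return RowView(self, rows)


class RowView:
    """Toy analogue of a SubDataFrame: all columns, a subset of rows."""

    def __init__(self, parent, rows):
        self.parent = parent
        self.rows = list(rows)

    def get(self, name, i):
        return self.parent.columns[name][self.rows[i]]

    def set(self, name, i, value):
        # Writes go straight through to the parent's storage.
        self.parent.columns[name][self.rows[i]] = value


df = Table({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})
copied = df.copy_rows([1, 3])
viewed = df.view_rows([1, 3])

copied.columns["a"][0] = 20   # changes the copy only
viewed.set("a", 1, 40)        # changes row 3 of the parent
```

Both subsets look the same when first created, but only the view shares storage with `df`, which is why the two approaches are not strict aliases of one another.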
I like your approach. It'll certainly stress test SubDataFrame support with things like If I want a copy, is the best way with
I think (hope)
I think it's alright if we generalize
Ah, I forgot about that. I'm not super comfortable with mixing those meanings. It's not clear to me that they're really the same operation or that they won't have signatures that clash.
I personally think it would make more sense to rename the k-max function to
kmax doesn't describe what it does – it finds a contiguous range of elements at the indices in a collection if the collection were fully sorted without fully sorting it. That's far more general and useful and it's commonly known as select, unfortunately.
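For readers unfamiliar with the algorithmic meaning of "select", the following is a minimal Python sketch of the idea using quickselect with a three-way partition. It is not Julia's implementation; the function name and signature are assumptions made for this example, and it handles a single index `k` rather than a contiguous range.

```python
import random


def select(values, k):
    """Return the element that would sit at index k if values were sorted.

    Illustrative quickselect; works on a copy, so the input is not reordered.
    """
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    lo, hi = 0, len(items) - 1
    while True:
        if lo == hi:
            return items[lo]
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi] into <pivot, ==pivot, >pivot.
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if items[i] < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        # Only the block that contains index k needs further work.
        if k < lt:
            hi = lt - 1
        elif k > gt:
            lo = gt + 1
        else:
            return pivot


data = [9, 1, 8, 2, 7, 3]
third_smallest = select(data, 2)
```

Because each pass discards the part of the collection that cannot contain index `k`, the expected cost is linear rather than the `O(n log n)` of a full sort.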
I would also prefer renaming
Let's keep debating this. For me, the gains from adopting standard SQL terminology are very large: we just switched to
I'm sympathetic to that. Let's see if we can come up with a better name for the algorithmic select operation.
@johnmyleswhite I meant to say that
That's what I understood, Viral.
I still think the two uses are similar:
Ok. I missed that
This is going to make deprecation a nightmare since if we rename
@johnmyleswhite If the decision is moving towards SQL naming conventions, is there really a need for a My confusion/idea stems from your idea above where you are saying 'some are more equal than others'...I feel like a beginning user (especially an R user) might feel uncomfortable and keep trying to cast a
I'm sure that some R users will be confused. There's really no way to avoid confusion as people learn about pass-by-reference semantics. But I hope that we can design semantics that are simple, if unfamiliar. My "some are more equal than others" comment refers to the fact that operations that produce equivalent results in R often have very different performance characteristics: the use of
Yeah, that makes sense now. At first I was thinking that it was just the output that made a At the risk of being annoying, does
I have to think a bit more about how views are implemented in SQL to decide whether they're the same. I'm inclined to think they're not: SQL views update when the underlying database changes, whereas a SubDataFrame will continue to reuse the same indices (which may be wrong) if the underlying database changes.
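The difference described here can be sketched in a few lines of Python. `IndexView` and `QueryView` are hypothetical classes written for this illustration: the first assumes, as the comment suggests, that a SubDataFrame-like object computes its row indices once and keeps them, while the second re-runs a stored predicate on every access, loosely mimicking a SQL view.

```python
rows = [
    {"name": "a", "score": 5},
    {"name": "b", "score": 9},
    {"name": "c", "score": 7},
]


def is_high(row):
    return row["score"] > 6


class IndexView:
    """Indices are computed once at construction and then reused."""

    def __init__(self, parent, predicate):
        self.parent = parent
        self.indices = [i for i, row in enumerate(parent) if predicate(row)]

    def rows(self):
        return [self.parent[i] for i in self.indices]


class QueryView:
    """The predicate is stored and re-evaluated on every access."""

    def __init__(self, parent, predicate):
        self.parent = parent
        self.predicate = predicate

    def rows(self):
        return [row for row in self.parent if self.predicate(row)]


index_view = IndexView(rows, is_high)
query_view = QueryView(rows, is_high)

# Change the underlying data after both views exist.
rows[1]["score"] = 1    # "b" no longer satisfies the predicate
rows[0]["score"] = 10   # "a" now satisfies it

index_names = [row["name"] for row in index_view.rows()]
query_names = [row["name"] for row in query_view.rows()]
```

After the update, the index-based view still points at rows 1 and 2 even though row 1 no longer matches, whereas the query-based view reflects the new data.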
Reading about views in SQL more, I believe I was right: a view is just a stored SQL query, whereas a @StefanKarpinski: changing to
I don't think this is an issue anymore; DataFrames doesn't have
Funny timing, as we just deprecated
We have too many ways of producing things that seem like subsets of a DataFrame. After I added filter, we now have four ways that seem superficially equivalent, but reflect two different algorithms:
I think we should:
SubDataFrame objects everywhere when accessing all columns of a DataFrame, which is what sub does, but direct indexing does not do.
sub to filter after flipping the argument order. We will then remove sub and subset completely.
Having multiple ways of doing the same thing troubles me in general, but seems particularly problematic when the different ways are not strictly aliases of one another. Over time, end-users discover that, even though all subset operations are equal in outcome, some are more equal than others.