by() gives different number of columns for input with 0 rows #1273

Closed
JockLawrie opened this issue Nov 9, 2017 · 2 comments

@JockLawrie

Hi there,

I just ran into a situation in which the by() function is fed a DataFrame with 0 rows. The result is a DataFrame with only 1 column, which seems inconsistent with the case in which the input has more than 0 rows. In the MWE below, I think the result should have 2 columns in both cases, just a different number of rows. Thoughts?

using DataFrames

# Input has 2 rows. Result has 2 columns. OK.
tbl = DataFrame(x1 = [11, 22], x2 = ["aa", "bb"])
result = by(tbl, :x1, df -> DataFrame(n = size(df, 1)))

# Input has 0 rows. Result has only 1 column. Inconsistent?
tbl = DataFrame(x1 = DataArray(Int, 0), x2 = DataArray(String, 0))
result = by(tbl, :x1, df -> DataFrame(n = size(df, 1)))

@nalimilan
Member

You're right that it's inconsistent, but unfortunately I'm not sure we can really fix this with the current system. The problem is that we can't find out what the anonymous function will return until we call it, and we can't call it if the data frame is empty. This is somewhat similar to reductions on empty arrays, except that we cannot use inference to figure out the type of the result (since DataFrame does not include any information about column types).
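To make the mechanism concrete, here is a minimal sketch in plain Julia. It is not how DataFrames implements by(); `naive_by` is a hypothetical stand-in that groups a vector of NamedTuples and learns the output columns only from what the function returns. With zero rows there are zero groups, so the function is never called and only the key column exists.

```julia
# Hypothetical, simplified stand-in for by(). NOT the DataFrames implementation.
# Rows are NamedTuples; the result maps column name => vector of values.
function naive_by(rows, key::Symbol, f)
    # Split: collect the rows belonging to each distinct key value.
    groups = Dict{Any,Vector{Any}}()
    for row in rows
        push!(get!(groups, row[key], Any[]), row)
    end

    # Only the key column is known up front.
    out = Dict{Symbol,Vector{Any}}(key => Any[])

    # Apply and combine: every other column is discovered from f's return value.
    for (k, grp) in groups
        res = f(grp)                      # runs once per group, never for zero groups
        push!(out[key], k)
        for (name, val) in pairs(res)
            push!(get!(out, name, Any[]), val)
        end
    end
    return out
end

rows = [(x1 = 11, x2 = "aa"), (x1 = 22, x2 = "bb")]
count_rows = grp -> (n = length(grp),)

full_result  = naive_by(rows, :x1, count_rows)         # has columns :x1 and :n
empty_result = naive_by(empty(rows), :x1, count_rows)  # has column :x1 only
```

The empty case loses `:n` for the same reason as in the report: nothing ever told the grouping code that a column named `n` would exist.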

I think this illustrates a wider issue with the by API, which is that it doesn't provide enough information. It would make more sense to have something similar to dplyr's summarise, which takes one argument for each column to create: that way we would know at least the names and number of columns. With type-stable data frames we could rely on inference to find out the type of columns in most cases, and fall back to Any or Union{} when that doesn't work. But one of the strengths of summarise is that you can refer to columns without repeating df[...], which is only possible in Julia with macros. At this point, it's probably better to leave this to DataFramesMeta or Query.
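A rough sketch of that idea in plain Julia, under the assumption that each output column is passed as a keyword argument. `naive_summarise` is a hypothetical name, not a DataFrames, DataFramesMeta or Query function, and it simply uses Any as the element type rather than attempting inference.

```julia
# Hypothetical summarise-style helper: one keyword argument per output column.
# Rows are NamedTuples; the result maps column name => vector of values.
function naive_summarise(rows, key::Symbol; columns...)
    # Split: collect the rows belonging to each distinct key value.
    groups = Dict{Any,Vector{Any}}()
    for row in rows
        push!(get!(groups, row[key], Any[]), row)
    end

    # Column names are known before any function is called,
    # so they exist even when there are no groups.
    out = Dict{Symbol,Vector{Any}}(key => Any[])
    for name in keys(columns)
        out[name] = Any[]                 # element type falls back to Any
    end

    # Apply each column's function to each group.
    for (k, grp) in groups
        push!(out[key], k)
        for (name, f) in pairs(columns)
            push!(out[name], f(grp))
        end
    end
    return out
end

rows = [(x1 = 11, x2 = "aa"), (x1 = 22, x2 = "bb")]

full_summary  = naive_summarise(rows, :x1; n = length)
empty_summary = naive_summarise(empty(rows), :x1; n = length)
```

Here both results have the columns `:x1` and `:n`; only the number of rows differs. The sketch still requires writing a function of the whole group for each column, which is the verbosity that macro-based packages avoid.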

See the more general description at #1256.

@JockLawrie
Author

Ah I see.
Not a big deal in this case. Let's leave it for now.
Thanks.
