I just ran into a situation in which the by() function is fed a DataFrame with 0 rows. The result is a DataFrame with only 1 column, which seems inconsistent with the case in which the input has more than 0 rows. In the MWE below, I think the result should have 2 columns in both cases, just a different number of rows. Thoughts?
using DataFrames
# Input has 2 rows. Result has 2 columns. OK.
tbl = DataFrame(x1 = [11, 22], x2 = ["aa", "bb"])
result = by(tbl, :x1, df -> DataFrame(n = size(df, 1)))
# Input has 0 rows. Result has only 1 column. Inconsistent?
tbl = DataFrame(x1 = DataArray(Int, 0), x2 = DataArray(String, 0))
result = by(tbl, :x1, df -> DataFrame(n = size(df, 1)))
You're right that it's inconsistent, but unfortunately I'm not sure we can really fix this with the current system. The problem is that we can't find out what the anonymous function will return until we call it, and we can't call it if the data frame is empty. This is somewhat similar to reductions on empty arrays, except that we cannot use inference to figure out the type of the result (since DataFrame does not include any information about column types).
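To make the comparison with reductions on empty arrays concrete, here is a small sketch in plain Julia (no DataFrames involved). It only illustrates the general point that a typed empty container still carries enough information to produce a sensible result; the variable names are made up for the example.

```julia
# A typed empty array still knows its element type.
empty_ints = Int[]

# sum can return a neutral element because zero(Int) is known from the type.
s = sum(empty_ints)

# map never calls the function on an empty input, so the element type of the
# result has to come from inference rather than from an actual return value.
m = map(x -> x / 2, empty_ints)

println(s)          # the additive identity for Int
println(isempty(m)) # true: no elements, only an (inferred) element type
```

With `by`, the function receives a `DataFrame`, whose type says nothing about its columns, so there is no equivalent source of information to fall back on.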
I think this illustrates a wider issue with the by API, which is that it doesn't provide enough information. It would make more sense to have something similar to dplyr's summarise, which takes one argument for each column to create: that way we would know at least the names and number of columns. With type-stable data frames we could rely on inference to find out the type of columns in most cases, and fall back to Any or Union{} when that doesn't work. But one of the strengths of summarise is that you can refer to columns without repeating df[...], which is only possible in Julia with macros. At this point, it is probably better to leave this to DataFramesMeta or Query.
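As a rough sketch of that idea, assuming nothing about the actual DataFrames API: the hypothetical helper `summarise_by` below takes one `name => function` pair per output column, so the column names are known even when there are no groups to call the function on. It uses `Base.promote_op` to ask inference for the return type, which is an internal helper and not something to rely on in real code; the fallback to `Any` is likewise just one possible choice.

```julia
# Hypothetical helper, not part of DataFrames: group `values` by `group_keys`
# and build one named output column per `name => f` pair.
function summarise_by(group_keys::Vector, values::Vector{V}, specs::Pair{Symbol}...) where {V}
    groups = unique(group_keys)
    cols = Pair{Symbol,Any}[:key => groups]
    for (name, f) in specs
        # Ask inference what `f` returns for a Vector{V}; fall back to Any
        # when the answer is not a concrete type.
        T = Base.promote_op(f, Vector{V})
        T = isconcretetype(T) ? T : Any
        # With zero groups this is an empty Vector{T}; `f` is never called.
        col = T[f(values[group_keys .== g]) for g in groups]
        push!(cols, name => col)
    end
    return (; cols...)  # NamedTuple of columns, standing in for a data frame
end

full_result  = summarise_by([11, 22, 11], ["aa", "bb", "cc"], :n => length)
empty_result = summarise_by(Int[], String[], :n => length)

println(keys(full_result))   # (:key, :n)
println(keys(empty_result))  # (:key, :n), same columns with zero rows
```

Both calls return the same set of columns; only the number of rows differs, which is the behaviour asked for in the original report.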