by() gives different number of columns for input with 0 rows #1273

JockLawrie · 2017-11-09T01:42:08Z

Hi there,

I just ran into a situation in which the by() function is fed a DataFrame with 0 rows. The result is a DataFrame with only 1 column, which seems inconsistent with the case in which the input has more than 0 rows. In the MWE below, I think the result should have 2 columns in both cases, just a different number of rows. Thoughts?

using DataFrames

# Input has 2 rows. Result has 2 columns. OK.
tbl = DataFrame(x1 = [11, 22], x2 = ["aa", "bb"])
result = by(tbl, :x1, df -> DataFrame(n = size(df, 1)))

# Input has 0 rows. Result has only 1 column. Inconsistent?
tbl = DataFrame(x1 = DataArray(Int, 0), x2 = DataArray(String, 0))
result = by(tbl, :x1, df -> DataFrame(n = size(df, 1)))

nalimilan · 2017-11-09T09:12:50Z

You're right that it's inconsistent, but unfortunately I'm not sure we can really fix this with the current system. The problem is that we can't find out what the anonymous function will return until we call it, and we can't call it if the data frame is empty. This is somewhat similar to reductions on empty arrays, except that we cannot use inference to figure out the type of the result (since DataFrame does not include any information about column types).
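To make the comparison with reductions on empty arrays concrete, here is a minimal sketch in plain Julia (no DataFrames involved). It assumes a recent Julia 1.x release; the element type that `map` picks for an empty input comes from inference, so treat that part as an illustration of current behaviour rather than a guarantee.

```julia
# Reduction with a known operator: Julia knows the neutral element of + for
# Int, so the result is defined even though + is never called on data.
total = sum(Int[])

# Reduction with an anonymous function: there is no known neutral element,
# so this throws unless the caller supplies one through `init`.
failed = try
    reduce((a, b) -> a + b, Int[])
    false
catch
    true
end

# map never calls the function on an empty input; the element type of the
# result is obtained from inference instead.
mapped = map(x -> x * 1.5, Int[])

println(total)            # value of the empty sum
println(failed)           # whether the anonymous reduction threw
println(eltype(mapped))   # element type chosen without calling the function
```

The function passed to by() is closer to the second case: nothing is known about its result until it runs, and per the comment above the returned DataFrame carries no column type information that inference could use.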

I think this illustrates a wider issue with the by API, which is that it doesn't provide enough information. It would make more sense to have something similar to dplyr's summarise, which takes one argument for each column to create: that way we would know at least the names and number of columns. With type-stable data frames we could rely on inference to find out the type of columns in most cases, and fall back to Any or Union{} when that doesn't work. But one of the strengths of summarise is that you can refer to columns without repeating df[...], which is only possible in Julia with macros. At this point, it's probably better to leave this to DataFramesMeta or Query.
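As a rough illustration of the "one argument per output column" idea, the sketch below uses plain vectors instead of a DataFrame. `summarise_sketch` is a hypothetical helper written for this example; it is not part of DataFrames, DataFramesMeta or Query, and it only shows that the column names are known before any function is called.

```julia
# Hypothetical helper (not a DataFrames function): each keyword argument
# names one output column and supplies the function that computes it from
# the rows of a group.
function summarise_sketch(groupcol::Vector, datacol::Vector; cols...)
    groups = unique(groupcol)
    # The grouping column always comes first.
    result = Pair{Symbol, Any}[:group => groups]
    for (name, f) in cols
        # For an empty input this comprehension never calls f; the element
        # type of the empty column is whatever inference can work out.
        column = [f(datacol[groupcol .== g]) for g in groups]
        push!(result, name => column)
    end
    return result
end

# Same data as the MWE: two rows, then zero rows.
full_result  = summarise_sketch([11, 22], ["aa", "bb"]; n = length)
empty_result = summarise_sketch(Int[], String[]; n = length)

println(first.(full_result))    # column names for the 2-row input
println(first.(empty_result))   # column names for the 0-row input
```

Both calls produce the same column names because the names come from the keywords, not from the return value of the function. The element type of the empty `n` column is still left to inference, which corresponds to the Any or Union{} fallback mentioned above.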

See the more general description at #1256.

JockLawrie · 2017-11-10T02:59:49Z

Ah I see.
Not a big deal in this case. Let's leave it for now.
Thanks.

nalimilan closed this as completed Sep 20, 2018
