Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow functions in breaks argument to stat_bin() #4561

Closed
aijordan opened this issue Jul 22, 2021 · 9 comments
Closed

Allow functions in breaks argument to stat_bin() #4561

aijordan opened this issue Jul 22, 2021 · 9 comments
Labels
feature a feature request or enhancement

Comments

@aijordan
Copy link
Contributor

Allowing breaks in stat_bin() to accept functions would be useful in the context of grouping structures.

The changes would also enable facet histograms for estimators with variable bin width, e.g, equal area of all blocks, or the "essential histogram" (doi:10.1093/biomet/asz081). The same effect (with facet_*) is of course already possible using geom_rect().

The required code additions would be minimal and follow the same idea as #1890

A simple illustration how this could work is given below:

df <- data.frame(
    x = c(rexp(1000), rexp(1000, 5)),
    study = c(rep("A", 1000), rep("B", 1000))
)
ggplot(df, aes(x)) +
    geom_histogram(aes(y = ..density..), breaks = function(x) qexp(c(0, .25, .5, .75, .95, .99), 1/mean(x))) +
    facet_wrap(vars(study))

image

@clauswilke
Copy link
Member

A proposed implementation was submitted here: #4560

@aijordan Since you already made a PR there's no need to close it unless you feel it has a problem or discussion/review arrives at the conclusion that it's not workable. But feature requests should always be discussed in an issue first, as the request may be reasonable even if the particular implementation doesn't pass review.

@clauswilke clauswilke added the feature a feature request or enhancement label Jul 22, 2021
@teunbrand
Copy link
Collaborator

This feature request might also be a nice opportunity to allow rlang lambda notation for not only the breaks argument, but allow it for the binwidth argument too.

@clauswilke
Copy link
Member

@teunbrand Could you provide an example of what this would do and how it would work?

@teunbrand
Copy link
Collaborator

Sure; it won't work for breaks because that is the proposed feature I don't have on my machine, but here are some examples of how I imagine what it would do for the binwidth argument. It's really just a syntactic nicety.

This already allowed (the function itself isn't very exciting):

library(ggplot2)

df <- data.frame(
  x = c(rexp(1000), rexp(1000, 5)),
  study = c(rep("A", 1000), rep("B", 1000))
)

ggplot(df, aes(x)) +
  geom_histogram(binwidth = function(x) {diff(range(x)/20)})

It is currently not allowed to write the function to the binwidth argument as a rlang lambda function. In other places in ggplot2, you can use rlang lambda functions, such as scales or the data argument in layers.

ggplot(df, aes(x)) +
  geom_histogram(binwidth = ~ diff(range(.x))/20)
#> Warning: Computation failed in `stat_bin()`:
#> `width` must be a numeric scalar

ggplot2 already has an internal function to convert formulas to functions when appropriate, introduced in v3.3.4.

ggplot(df, aes(x)) +
  geom_histogram(binwidth = ggplot2:::allow_lambda(~ diff(range(.x))/20))

Created on 2021-07-22 by the reprex package (v1.0.0)

I'd propose to apply the allow_lambda() function somewhere in the stat instead of it being user supplied, which needn't be very complicated. For example, in the constructor of stat_bin() or in a ggproto method, you might add lines like the following :

binwidth <- allow_lambda(binwidth)
breaks <- allow_lambda(breaks)

@aijordan
Copy link
Contributor Author

Thanks, if there is interest in this, I could reopen PR #4560 with included rlang lambda function notation support for both breaks and binwidth.

I feel I should point out that function input for breaks (and bindwidth already) does not play nice with the default position = "stack", but since a warning is issued that is probably fine!?

library(ggplot2)

df <- data.frame(
  x = c(rexp(1000), rexp(1000, 5)),
  study = c(rep("A", 1000), rep("B", 1000))
)
ggplot(df, aes(x)) +
  geom_histogram(
    mapping = aes(fill = study),
    binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)),
    boundary = 0)
#> Warning: position_stack requires non-overlapping x intervals

Another issue with unequal bin widths arises when using geom_freqpoly, where a wide bin at the tail can add excessive padding. This already happens currently when specifying breaks, and therefore probably belongs in its own issue if at all.

ggplot(df, aes(x)) + geom_freqpoly(aes(y = ..density..), bins = 20)

ggplot(df, aes(x)) +
  geom_freqpoly(
    mapping = aes(y = ..density..),
    breaks = c(0, 0.1, 0.3, 0.6, 1, 1.5, 3, max(df$x)))

It seems that padding is not optional for geom_freqpoly and users can work around this case by using stat_bin(geom = GeomPath) with custom padding when specifying breaks.

ggplot(df, aes(x)) +
  stat_bin(
    mapping = aes(y = ..density..),
    geom = GeomPath,
    breaks = c(-0.01, 0, 0.1, 0.3, 0.6, 1, 1.5, 3, max(df$x), max(df$x) + 0.01))

Created on 2021-07-23 by the reprex package (v2.0.0)

@hadley
Copy link
Member

hadley commented Apr 19, 2022

@aijordan a PR with those changes would be great thanks! Do you also want to look into the freqpoly padding issue?

@aijordan
Copy link
Contributor Author

Seems to me that a quick "solution" would be to have geom_freqpoly respect a pad argument in ... (the documentation states that these get passed on to the respective layer)

That is, change the following if-condition in geom_freqpoly, where padding is enforced, to identical(stat, "bin") && !exists("pad", params)

params <- list2(na.rm = na.rm, ...)
if (identical(stat, "bin")) {
    params$pad <- TRUE
}

Then, custom padding via the breaks argument would be possible, as above.

I guess anything more than this quick fix probably requires changes to bin_vector (in /R/bin.R).

@aijordan
Copy link
Contributor Author

A small change would reduce the perceived padding in general: The function bin_out() in /R/bin.R returns a data frame with variables xmin and xmax. As far as I can tell, these are never used explicitly (GeomBar calculates these from the width variable).

However, they seem to be used implicitly somewhere to determine the range of x values that are (invisibly) displayed. GeomPath and GeomPoint only use the x variable (the bin midpoints) that is present in the data frame to display data, so we get unexpectedly large padding to the left and right. For a small width of the left- and rightmost bins, this issue is almost imperceivable. Is it the intended behavior that the entire value range of every bin should always be displayed, even when the geom does not use the full range for display?

Here are two of the previous examples, when xmin and xmax are removed from the output of bin_out() (Unfortunately, I did not set a seed before, so the actual data is different):

library(ggplot2)

set.seed(20220617)
df <- data.frame(
  x = c(rexp(1000), rexp(1000, 5)),
  study = c(rep("A", 1000), rep("B", 1000))
)

ggplot(df, aes(x)) + geom_freqpoly(aes(y = ..density..), bins = 20)

ggplot(df, aes(x)) +
  geom_freqpoly(
    mapping = aes(y = ..density..),
    breaks = c(0, 0.1, 0.3, 0.6, 1, 1.5, 3, max(df$x)))

Created on 2022-06-17 by the reprex package (v2.0.1)

@clauswilke Do you prefer a single or two separate pull requests, or should the padding be discussed in a new issue anyway? There is the additional question, whether geom_freqpoly() and geom_histogram() should pass on the pad argument to stat_bin() instead of overwriting/ignoring it. Also, it seems that I wasn't quick enough and cannot revive the old PR since its target was the "master" branch, which has since been deprecated.

@teunbrand
Copy link
Collaborator

Considering this fixed by #5963. Padding can be discussed in a new issue if so desired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants