Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sbc_hist #193

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

Add sbc_hist #193

wants to merge 8 commits into from

Conversation

jpritikin
Copy link

Compared to the previous version, I added:

  • the pretty uncertainty interval butterfly
  • a bit more documentation
  • an example

@jgabry
Copy link
Member

jgabry commented May 17, 2019

Thanks a lot! Will review soon.

@jgabry
Copy link
Member

jgabry commented May 17, 2019

@jpritikin How do you feel about enabling this https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork so I can push directly to your PR branch?

@jpritikin
Copy link
Author

@jpritikin How do you feel about enabling this https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork so I can push directly to your PR branch?

Looks like it's already enabled. Go ahead with your changes.

@jgabry
Copy link
Member

jgabry commented May 17, 2019

Ok great thanks.

@jgabry
Copy link
Member

jgabry commented May 17, 2019

I made a bunch of edits so that it conforms to the conventions we've been using in bayesplot. I still need to add a few tests and I'd like to get @avehtari (and/or @dpsimpson) to take a look and comment on a few things:

  • The first argument is called ranks even though it’s really indicators that can be made into ranks. This lines up with what @bgoodri did in rstan::sbc() so that’s nice. But should we call this argument something else?

  • The worst argument is used to specify how many parameters to plot (e.g., worst=5 to plot the 5 worst parameters). I like this idea, but do we want any mechanism for selecting parameters by name? It seems less important to have that for this plot than the other plots in bayesplot, but for consistency should we have it?

  • The way the "worst" parameters are determined is using KL divergence. Seems reasonable, but is this the best option for this scenario?

  • Should there be changes to the default settings for the arguments?

Example

# create some fake inputs to use for sbc_hist()
set.seed(19)
pars <- paste0("beta[", 1:4, "]")
samples_per_prior <- 511
n_replications <- 500
ranks <- list()
for (n in 1:n_replications) {
  r1 <- matrix(0, nrow=samples_per_prior, ncol=length(pars),
               dimnames=list(NULL, pars))
  for (p1 in 1:length(pars)) {
    r1[sample.int(samples_per_prior, floor(runif(1, 0, samples_per_prior))), p1] <- 1
  }
  ranks[[n]] <- r1
}

color_scheme_set("purple")
sbc_hist(ranks)

sbc_hist

@jgabry jgabry removed their request for review May 17, 2019 22:42
@codecov-io
Copy link

codecov-io commented May 17, 2019

Codecov Report

Merging #193 into master will decrease coverage by 0.03%.
The diff coverage is 97.7%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #193      +/-   ##
==========================================
- Coverage   99.31%   99.28%   -0.04%     
==========================================
  Files          30       31       +1     
  Lines        4088     4175      +87     
==========================================
+ Hits         4060     4145      +85     
- Misses         28       30       +2
Impacted Files Coverage Δ
R/sbc.R 97.7% <97.7%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b5ad462...633bd0b. Read the comment docs.

@jgabry
Copy link
Member

jgabry commented May 17, 2019

And one other consideration: do we want the only allowed input type to be a list of matrices? This type of input could also be stored in a 3-D array or a (long) data frame. This doesn’t necessarily need to be decided now since can always add that ability later without breaking anything.

@jpritikin
Copy link
Author

jpritikin commented May 17, 2019

Generally looks good, but why did you get rid of the horz dotted lines at the top and bottom on the butterfly? Especially for the lower side, the histogram bars can cover most of the butterfly and it's hard to see the boundary. If you don't like the top dotted line, at least keep the bottom one.

@jgabry
Copy link
Member

jgabry commented May 17, 2019

Oops. I meant to put that back in. I had removed it while trying to figure something else out. I’ll add it back.

@jpritikin
Copy link
Author

I'm curious what you kept the horz line in the middle of the butterfly. I mean, what matters is whether the histogram bars are within the butterfly or not. It doesn't really matter whether they are close to the mean, correct?

@jgabry
Copy link
Member

jgabry commented May 17, 2019

I just did it to match the paper, and I think it looks nice. But now that you mention it I think it could draw the eye to that part of the plot which isn’t the most relevant. Thanks for the feedback.

* remove segment in center of CI
* add segments at extremes of CI (lower one is in front of histogram, upper is behind, I think it looks nice)
* add x-axis ticks and labels at 0, samples_per_prior/2,samples_per_prior (this does convey some information)
@jgabry
Copy link
Member

jgabry commented May 18, 2019

Here's the result of the example from above after the following changes:

  • remove segment in center of CI
  • add faint line segments (not full lines) at extremes of CI (lower one is in front of histogram, upper is behind, I think it looks nice)
  • add x-axis ticks and labels at 0, samples_per_prior/2, samples_per_prior (this does actually convey some information)

sbc_hist

also rename samples_per_prior to thinned_sample_size in internal code
@jgabry
Copy link
Member

jgabry commented May 18, 2019

sbc_hist

@jpritikin
Copy link
Author

Beautiful 👍

@avehtari
Copy link
Contributor

Thanks for working on this. Some comments

Butterfly histogram is useful also for PITs and MCMC rank plots (see #179), so this should be modularized so that there would be only one butterfly histogram plotting function.

The first argument is called ranks even though it’s really indicators that can be made into ranks. This lines up with what @bgoodri did in rstan::sbc() so that’s nice. But should we call this argument something else?

As in other pull request @jpritikin noticed that it's waste of memory to save indicators, it would be better to provide just ranks for this function.

The worst argument is used to specify how many parameters to plot (e.g., worst=5 to plot the 5 worst parameters). I like this idea, but do we want any mechanism for selecting parameters by name?

Yes, we want to be able to select parameters by name, too. There are often specific parameters we care more.

The way the "worst" parameters are determined is using KL divergence. Seems reasonable, but is this the best option for this scenario?

In the paper and dicussions outside the paper, we don't recommend any quantity. KL is known also to miss some things other diagnostics (there's a 20+ uniformity tests paper) or human can see.

#' @param thin An integer indicating the thinning interval to use when plotting
#' so that the histograms consist of (close to) independent realizations. Set
#' the thin argument such that the resulting number of draws approximately
#' matches the effective sample size.

The current arxiv version is wrong about this and the updated version is not complete from other parts. The difference is not big for thetic chains, but antithetic chains may have effective sample size larger than the number of iterations. In this case the chain should be thinned, too. One way is to thin first by 3 (thinning by odd lag removes the antithetic behavior) and then check again the effective sample size.

IMO there is no need to warn about uneven samples per bin, if the variation is small. For example, if we thin we we may often have uneven division but, e.g.. if we have some bins with 50 and some 51, the variation is 2% which is visually very small. Alternatively it would be useful to round to even.

@jpritikin
Copy link
Author

As in other pull request @jpritikin noticed that it's waste of memory to save indicators

@avehtari Actually I realized that you can't change the thinning if you don't have the raw ranks data. It's not actually as much data as I originally thought. I don't think it's a problem to pass around the raw ranks.

we don't recommend any quantity

If there are 1000s of parameters, are you suppose to inspect every plot? There's got to be a way to prioritize, even if it's not perfect.

there's a 20+ uniformity tests paper

Might be nice to cite. DOI 10.1080/02331880500178562 ?

One way is to thin first by 3 (thinning by odd lag removes the antithetic behavior) and then check again the effective sample size.

Can this be (partially) automated?

@avehtari
Copy link
Contributor

If there are 1000s of parameters, are you suppose to inspect every plot?

I would not recommend that.

There's got to be a way to prioritize, even if it's not perfect.

I just stated that we don't recommend anything as Jonah asked my opinion. The reason is that we don't know what to recommend. Now when thinking this, it might be useful to have different options, for example showing most smiley ones, most frowny ones, most biased ones, specifically edge problems, etc. If I would need to choose just one, I think I would continue testing something computed from ecdf vs. envelope for uniform. I'm sorry that I don't have better answer for this.

I also think that it's better to show "worst" ones instead of just trying to make hypothesis test that all is fine or not. Visual inspection helps to learn more.

Might be nice to cite. DOI 10.1080/02331880500178562 ?

Yes. 34 different tests and they recommend any one of them as the best for all. Good to cite as it shows that there is no one good solution.

Can this be (partially) automated?

If ESS>S, thin by 3. Then proceed as before.

@jpritikin
Copy link
Author

I also think that it's better to show "worst" ones instead of just trying to make hypothesis test that all is fine or not.

Agreed. How about if we add some other measures of uniformity in addition to KL and require the user to specify which measure to use in conjunction with worst?

@tjmahr
Copy link
Collaborator

tjmahr commented May 20, 2019

What's the point of the butterfly shape, as opposed to a ribbon?

I don't think the bin boundaries are working correctly. First one straddles 0 and last one do not appear to evenly line up with 128, at least on my device. Histograms work on continuous data, so they don't respect the discrete boundaries of the rankings by default.

The use of the ribbon expands the x-axis so that the nonsensical values are part of the binning, so I think we are getting 128 ranks pushed into 29 bins which is causing uneven binning.

sbc_hist(ranks) + 
  geom_text(aes(label = stat(count)), stat = "bin", bins = 32)

image

This is probably a better sense of what 32 bins should be...

sbc_hist(ranks) + 
  stat_count(aes(x = plyr::round_any(u, 4, floor)))

image

@tjmahr
Copy link
Collaborator

tjmahr commented May 20, 2019

Aah shoot, that last figure has 33 bins because of how 129 was being rounded.

General problem still stands.

@jgabry
Copy link
Member

jgabry commented May 21, 2019

Maybe using geom_bar() instead of geom_histogram() let us get around that problem?

@avehtari
Copy link
Contributor

Another modularization suggestion. We need independent draws also for MCMC rank ecdf difference plots, and maybe for other things. Provide a common helper function to do thinning with couple options:

  • user set value
  • thinning by percentage of autocorrelation time: effective sample size calculation computes autocorrelation series and truncates that by Geyer's initial (pair) positive sequence rule. Often the last autocorrelations are small and for practical purposes useful thinning could be obtained by thinning at value where sum of autocorrelations is between 0.9 and 1.1. times of the sum of autocorrelations until the stopping rule. This rule would work also for antithetic chains.

sbc_hist could use this thinning function, too, with possibility of favoring thinning values which make the histogram division more even (ecdf difference plots are not sensitive to this)

@jpritikin
Copy link
Author

Consider using uniftest to rank uniformity.

@avehtari
Copy link
Contributor

@jpritikin thanks for pointing out the package, although it doesn't solve the problem which of the many tests to use.

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 97.70115% with 2 lines in your changes missing coverage. Please review.

Project coverage is 99.28%. Comparing base (b5ad462) to head (633bd0b).
Report is 603 commits behind head on master.

Files with missing lines Patch % Lines
R/sbc.R 97.70% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #193      +/-   ##
==========================================
- Coverage   99.31%   99.28%   -0.04%     
==========================================
  Files          30       31       +1     
  Lines        4088     4175      +87     
==========================================
+ Hits         4060     4145      +85     
- Misses         28       30       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants