Add sample_prior function #2876
Conversation
What's this for? Generating sample data?
Yep! That would be one use case. Or faster prototyping (for example, seeing if the generated data looks reasonable). We wanted to use something like this last week for generating a toy data set for a gerrymandering project.
This paper shows a good example of where you might usefully use prior sampling.
Ha, this is great!!! I was thinking exactly the same in another issue the other day: #2856 (comment)
pymc3/sampling.py
Outdated
    for _ in indices:
        point = {}
        for var_name, var in model.named_vars.items():
            val = var.distribution.random(point=point, size=size)
It is better to use the part from smc where it samples from the prior:
https://github.com/pymc-devs/pymc3/blob/801accb5f236ab9daa89a8fcd9d09a3ba4ed0a39/pymc3/step_methods/smc.py#L186-L193
Otherwise you will get an error with bounded RVs:
AttributeError: 'TransformedDistribution' object has no attribute 'random'
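For anyone following along, that smc block does roughly the following (a paraphrased sketch from memory, not the exact linked lines; the helper name `sample_prior_smc_style` is hypothetical, though `transform_used`, `forward_val`, and `dist` are real attributes of pymc3's TransformedDistribution):

```python
from pymc3.util import is_transformed_name

def sample_prior_smc_style(model, draws):
    """Sketch: draw from each free RV's prior, handling auto-transforms."""
    point = model.test_point
    samples = {}
    for v in model.free_RVs:
        if is_transformed_name(v.name):
            # e.g. 'x_interval__': sample the underlying (untransformed)
            # distribution, then map the draws into the transformed space,
            # since TransformedDistribution itself has no .random()
            trans = v.distribution.transform_used.forward_val
            samples[v.name] = trans(
                v.distribution.dist.random(point=point, size=draws))
        else:
            samples[v.name] = v.distribution.random(point=point, size=draws)
    return samples
```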
This is great then. I often wonder how to do the Stan "data generation process" part.
@junpenglao That's a good sign that we even named the functions the same! You seemed to sketch out a pretty complete method in the comment on the other issue (along with some good edge cases for the test) - I'll hopefully update later today.
I was hoping someone would pick it up ;-)
Also need to add this to the release notes.
Force-pushed from fc62dcc to efb9b28
@junpenglao I updated to sample correctly from transformed variables. I decided against (for now) using the trick from smc. I am a little confused, because sampling from a transformed distribution is super slow.
I have tried a few things without much luck to fix the speed problem. I might give it a try tomorrow to do something similar to what the jitter-based initialization does.
Agree - that is more for the initialization. After this PR we can replace the jitter function currently used with sample_prior (with jitter etc. to handle corner cases).
[Edit]: using forward_val doesn't speed things up currently, but potentially could if we rewrite it into numpy functions.
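(For context, a transform's `forward_val` is the value-space counterpart of its symbolic `forward`. A sketch of what such a numpy rewrite could look like for the interval transform; the standalone function names here are illustrative, not pymc3's API:)

```python
import numpy as np
import theano.tensor as tt

# Symbolic forward map of the interval transform (a, b) -> R,
# as in pymc3.distributions.transforms.Interval:
def forward(x, a, b):
    return tt.log(x - a) - tt.log(b - x)

# A hypothetical pure-numpy version of the same map -- the kind of
# rewrite the edit above suggests could speed up forward sampling:
def forward_val(x, a, b):
    return np.log(x - a) - np.log(b - x)
```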
pymc3/sampling.py
Outdated
    prior = {var: [] for var in vars}
    for _ in indices:
        point = {}
        for var_name, var in model.named_vars.items():
`named_vars` also contains Deterministic and Potential, which don't have `distribution` and `random`.
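That is, the loop above probably needs a guard along these lines (a sketch continuing the quoted snippet, with `point` and `size` as in the surrounding code):

```python
for var_name, var in model.named_vars.items():
    # Deterministic and Potential terms are plain tensors with no
    # `distribution` attribute, so they cannot be sampled directly
    if not hasattr(var, 'distribution'):
        continue
    point[var_name] = var.distribution.random(point=point, size=size)
```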
RELEASE-NOTES.md
Outdated
- Plots of discrete distributions in the docstrings
- Add logitnormal distribution
- New function `pm.sample_prior` which generates test data from a model in the absence of data.
I would move this further up as it's a major feature. Also, I think we should add the names of the authors who contributed each feature / bugfix.
pymc3/sampling.py
Outdated
@@ -1207,6 +1207,66 @@ def sample_ppc_w(traces, samples=None, models=None, weights=None,
    return {k: np.asarray(v) for k, v in ppc.items()}


def sample_prior(samples=500, model=None, vars=None, size=None,
                 random_seed=None, progressbar=True):
    """Generate samples from the prior of a model.
I would add a bit more description of why this is useful and when you would use it. It's really the prior predictive we're sampling from here.
If it's helpful -- I think one use case for this function is to generate a unique starting point for each chain, when multiple are required, like in this case: #2856
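For example, a hypothetical sketch of that use case (assuming `sample_prior` returns a dict of name -> array, as the implementation above suggests; the model here is made up):

```python
import pymc3 as pm

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0., sd=1.)
    x = pm.Normal('x', mu=mu, sd=1., observed=[0.1, -0.3, 0.2])

    # one prior draw per chain gives dispersed starting points
    prior = pm.sample_prior(samples=4)
    start = [{'mu': prior['mu'][i]} for i in range(4)]
    trace = pm.sample(draws=1000, njobs=4, start=start)
```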
Would be great to add a NB with some motivation and example usage we can add to the docs.
Some of this seems trickier than I thought. I've tried a few methods that are almost clever and don't work. My current favorite approach tries to clone the whole model, but I am not able to clone an ObservedRV.
Here is the current implementation of sample_prior.
Why doesn't forward-pass random sampling (like what you did before) work, besides the slowness? I would like to contribute a bit more to this issue, as efficient forward random sampling is quite important for the likelihood-free methods that I would like to address. Could you share your experiments?
Force-pushed from cdad527 to a0d638a
Gosh, it is easy to forget how useful outside input can be sometimes. I am going to focus on that instead of the many hours I spent trying to get something else to work :D It looks like the forward pass continues to work, and I actually fixed the speed problem in a ninja edit last week. @twiecki would you rather have an example NB along with this PR, or merge this to master to start working more bugs out?
@ColCarroll rather with this one :). The API shouldn't change all that much.
@ColCarroll did you push the new changes?
Yes - the major change is this line for deterministic variables (it is complicated because passing unused variables throws an error).
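(Presumably this refers to compiling the deterministic's graph while tolerating unused inputs; a hedged sketch of that theano pattern, with a hypothetical helper name:)

```python
import theano

def compile_deterministic(inputs, node):
    # Without on_unused_input='ignore', theano raises an error when one
    # of `inputs` does not actually appear in `node`'s graph -- the
    # "passing unused variables throws an error" problem mentioned above.
    return theano.function(inputs, node, on_unused_input='ignore')
```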
Nice!!! LGTM |
Failure looks like something related to the dtype. Working on a tiny case study notebook to use as well.
LGTM. I think it's fine to mark that with xfail, since we often have errors like that. Maybe add sample_prior to one of the other notebooks; that might be easier than doing your own notes.
Awesome update. The fact that we can now check the prior samples directly is great.
Maybe the test failure could be fixed by specifying the dtype, similar to #2891 (comment)?
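Both suggestions could look roughly like this (a sketch; the condition, test name, and reason string are hypothetical):

```python
import numpy as np
import pytest
import theano

# Option 1: mark the flaky test as an expected failure on float32 runs
@pytest.mark.xfail(theano.config.floatX == 'float32',
                   reason='precision-sensitive comparison')
def test_sample_prior_dtypes():
    assert True  # placeholder body for the sketch

# Option 2: pin the dtype of the test data explicitly
data = np.random.randn(100).astype(theano.config.floatX)
```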
pymc3/tests/test_sampling.py
Outdated
|
assert (prior['mu'] < 0).all()
assert (prior['positive_mu'] > 0).all()
assert (prior['x_obs'] < 0).all()
New line?
Force-pushed from 51451e3 to 9a80d0d
Updated this to use #2902. Huge thanks to @lucianopaz, as that code cleans this up a lot, and it looks quite tricky! You might take a look to make sure I did not mess anything up: the first one I think is good, the second one might not be wanted elsewhere.
Note that now we just sample all the points we want from each node as we scan through, so it is quite fast, and it no longer uses a progressbar since it is not iterative. I have confirmed that I can sample from the Efron-Morris baseball generative model, and am going to work on turning that into an actual example notebook.
|
names = get_default_varnames(model.named_vars, include_transformed=False)
# draw_values fails with auto-transformed variables. transform them later!
values = draw_values([model[name] for name in names], size=samples)
Wow, this is really efficient! However, are we sure that the values drawn for the children of the graph depend on the samples from their parents? In the previous implementation, we always sampled by evaluating a point which contains samples from higher in the hierarchy. For example, if b ~ p(a), we sample a_tilde first, then sample b ~ p(a_tilde). Is that the case here also?
A simple example:

import numpy as np
import theano
import pymc3 as pm

X = theano.shared(np.arange(3))
with pm.Model() as m:
    ind = pm.Categorical('i', np.ones(3)/3)
    x = pm.Deterministic('X', X[ind])
    prior = pm.sample_generative(10)

prior
{'X': array([0, 0, 2, 1, 2, 2, 1, 0, 0, 1]),
 'i': array([1, 0, 0, 2, 2, 0, 2, 1, 0, 0])}

`i` and `X` should be identical.
This is a super helpful example! Let me take a look at it -- there's some work already to avoid some edge cases, and I would have thought this got caught.
Caught the bug (will add tests for all this, too). Your example runs as desired now!
I make sure I evaluate the params by making a dictionary of index integers to nodes (this avoids the non-hashability of `ndarray`). After evaluating the nodes, I was accidentally using the index integer to check if it was a child of another node. This was never true, so I never supplied that value to the rest of the graph.
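A self-contained toy illustrating that bug and the fix (all names here are hypothetical stand-ins for the real graph machinery):

```python
import numpy as np

class FakeNode:
    """Stand-in for a named theano node."""
    def __init__(self, name):
        self.name = name

parent, child = FakeNode('mu'), FakeNode('x')
named_nodes_children = {parent: {child}}

# ndarray params are unhashable, so key the param dict by position
params = dict(enumerate([child, np.zeros(3)]))

givens = {}
for idx, param in params.items():
    if not isinstance(param, FakeNode):
        continue  # plain arrays need no graph bookkeeping
    # The bug: membership was checked with the integer key, which is
    # never inside any children set, so the draw was never passed on.
    assert not any(idx in kids for kids in named_nodes_children.values())
    # The fix: check membership with the node itself.
    if any(param in kids for kids in named_nodes_children.values()):
        givens[param.name] = param

print(sorted(givens))  # ['x'] -- the child's value now reaches the graph
```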
if size is None:
    return func(*values)
else:
    return np.array([func(*value) for value in zip(*values)])
The `size` seems to only be imposed on `param`s with a `random` method, and we hope the content of `values` to be the right size in the end. Shouldn't there be some enforcement of the `size` for the `numbers.Number`, `np.ndarray`, `tt.TensorConstant`, `tt.sharedvar.SharedVariable` and `tt.TensorVariable` in `point` cases, for us to be sure that `values` will in fact have the desired output size?
I am relying here on `theano` catching those sorts of errors, and giving more informative errors than I could. I am running this on a few different models to make sure it gives reasonable results, but so far those sorts of inputs get broadcast in a sensible manner.
pymc3/distributions/distribution.py
Outdated
to_eval = set(missing_inputs)
missing_inputs = set()
for param in to_eval:
    try:  # might evaluate in a bad order,
        evaluated[param] = _draw_value(params[param], point=point, givens=givens.values(), size=size)
-       if any(param in j for j in named_nodes_children.values()):
+       if any(params[param] in j for j in named_nodes_children.values()):
            givens[param.name] = (params[param], evaluated[param])
Oops, I actually commented on an older commit so it shows as outdated, sorry for the mess. First off, it looks very nice; however, I think this line is confusing. You're trying to see if the node `params[param]` is a child of some other named node. If `params[param]` were a named node, that information should be available in the dictionary `named_nodes_parents`. If `params[param]` were not a named node, then it would not be registered in either the `named_nodes_parents` or the `named_nodes_children` dictionaries.
If `params[param]` is a named node, you should be able to replace this line by:
if named_nodes_parents[params[param]]:
If `params[param]` is not a named node, then I think it shouldn't be added to `givens`, but I may be overlooking something.
This is much nicer, thank you!
(also just checked out travis, and your suggestion will also fix failing tests)
LGTM. Are you adding more tests or is this ready?
Not yet - I am now looking at one more model that fails. In particular, I would guess there is still something funny going on with passing nodes appropriately.
This is a difficult model to generate from. But yeah, there seems to be some problem with the last RV.
Since this is currently blocked by #2909, I suggest we roll back to the original implementation with (slower) forward passing. I have a version that works fairly OK and could serve as a baseline implementation: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/Miscellaneous/Test_sample_prior.ipynb
I'm confused: is this working or not? :)
Closing this based on the newer PR |
This allows samples from a model, ignoring all observed variables. See the screenshot for an example in a simple model: https://user-images.githubusercontent.com/2295568/36700677-87c7b582-1b1e-11e8-8dc0-cfd0efb5db09.png
Right now it relies on unofficial python3.6 behavior, and official python3.7 behavior (https://mail.python.org/pipermail/python-dev/2017-December/151283.html): namely, dictionaries keeping insertion order. I would love a suggestion to avoid that requirement, but I can also take a swing at having `tree_dict` subclass from `OrderedDict` instead.
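For readers arriving at this closed PR: a minimal usage sketch of what was being proposed, based on the signature shown above (`sample_prior(samples=500, model=None, vars=None, size=None, random_seed=None, progressbar=True)`). The dict-of-arrays return format and the printed shapes are assumptions, mirroring `sample_ppc`:

```python
import numpy as np
import pymc3 as pm

observed = np.random.randn(100)

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0., sd=1.)
    sd = pm.HalfNormal('sd', sd=1.)
    x_obs = pm.Normal('x_obs', mu=mu, sd=sd, observed=observed)

    # Observed values are ignored: x_obs is re-drawn from its
    # distribution given fresh prior draws of mu and sd.
    prior = pm.sample_prior(samples=500, random_seed=42)

print(prior['mu'].shape)     # e.g. (500,)
print(prior['x_obs'].shape)  # e.g. (500, 100)
```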