Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add contrib.rand_zipfian #9747

Merged
merged 18 commits into from
Feb 23, 2018
Merged

Conversation

eric-haibin-lin
Copy link
Member

@eric-haibin-lin eric-haibin-lin commented Feb 9, 2018

Description

Add log-uniform distribution sampler similar to
https://www.tensorflow.org/api_docs/python/tf/nn/log_uniform_candidate_sampler
Note that tf implementation supports sampling w/o replacement, which is not available in this PR.
@sxjscience

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@eric-haibin-lin eric-haibin-lin requested a review from szha as a code owner February 9, 2018 00:29
@@ -18,9 +18,82 @@
# coding: utf-8
# pylint: disable=wildcard-import, unused-wildcard-import
"""Contrib NDArray API of MXNet."""
import math
from ..context import current_context
from ..random import uniform
try:
from .gen_contrib import *
except ImportError:
pass

__all__ = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to include it in all

true_classes = true_classes.as_in_context(ctx).astype('float64')
expected_count_true = ((true_classes + 2.0) / (true_classes + 1.0)).log() / log_range
# cast sampled classes to fp64 to avoid interget division
sampled_cls_fp64 = sampled_classes.astype('float64')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the output always float64?

Copy link
Member Author

@eric-haibin-lin eric-haibin-lin Feb 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64-bit is adopted because this sampler is usually used for extremely large number of classes. Returned samples are always actually always in int64. The fp64 here is used to calculate the probability of a particular classes. (Limited precision of fp32 treat 50M - 1 and 50M - 2 as the same number, yielding nan when taking the log)

<NDArray 3x2 @cpu(0)>
"""
assert(isinstance(true_classes, Symbol)), "unexpected type %s" % type(true_classes)
if ctx is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

symbol doesn't need ctx

@eric-haibin-lin eric-haibin-lin changed the title Add contrib.rand_log_uniform [WIP] Add contrib.rand_log_uniform Feb 9, 2018
@eric-haibin-lin eric-haibin-lin changed the title [WIP] Add contrib.rand_log_uniform Add contrib.rand_log_uniform Feb 11, 2018
list of NDArrays
A 1-D `int64` `NDArray` for sampled candidate classes, a 1-D `float64` `NDArray` for \
the expected count for true classes, and a 1-D `float64` `NDArray` for the \
expected count for sampled classes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to write the docstring as:

Returns
--------
samples : NDArray
    A 1-D `int64` `NDArray` for sampled candidate classes
exp_count_true : NDArray
   ...
exp_count_sample : NDArray
   ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

# cast sampled classes to fp64 to avoid interget division
sampled_cls_fp64 = sampled_classes.astype('float64')
expected_count_sampled = ((sampled_cls_fp64 + 2.0) / (sampled_cls_fp64 + 1.0)).log() / log_range
return [sampled_classes, expected_count_true, expected_count_sampled]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to return a list here.

__all__ = []
__all__ = ["rand_log_uniform"]

def rand_log_uniform(true_classes, num_sampled, range_max, ctx=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should not be called as rand_log_uniform because LogUniform has a specific meaning. Should be called something like rand_zipfian, or log_uniform_candidate_sampler like in TF.

sampled_classes = (rand.exp() - 1).astype('int64') % range_max

true_classes = true_classes.as_in_context(ctx).astype('float64')
expected_count_true = ((true_classes + 2.0) / (true_classes + 1.0)).log() / log_range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be expected_count_true = ((true_classes + 2.0) / (true_classes + 1.0)).log() / log_range * num_sampled. Otherwise it should be called something like prob_true_class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I should either multiply it by num_sampled or change the name. Will do an update.

[ 0.12453879]
<NDArray 1 @cpu(0)>
>>> exp_count_sample
[ 0.22629439 0.12453879 0.12453879 0.12453879]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example output looks suspicious as it does not sum up to 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I've misunderstood the term. It should be correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it's suspicious at first glance because the exp_count of 1 is larger than the exp_count of 3. However, the sampling result show that 3 is much more often then 1. We need to sample multiple times and test if the empirical expectation matches the true expectation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a coincident for the first 5 samples. If I sample 50 times, it returns:


1
3
3
3
2
0
0
0
0
1
3
1
1
3
0
2
0
4
0
3
1
3
1
2
2
1
1
2
0
1
0
2
0
0
0
0
0
0
4
1
1
4
0
4
2
0
0
2
1
0

0's = 19

1's = 12

2's = 8

3's = 7

4's = 4

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks good

__all__ = ["log_uniform_candidate_sampler"]

# pylint: disable=line-too-long
def log_uniform_candidate_sampler(true_classes, num_sampled, range_max, ctx=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it go under contrib.random? since other sampling methods in random just have the distribution as name, should we follow the same convention?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name changed to rand_zipfian to follow the convention. Extra namespaces such as contrib.random might over-complicate APIs since there are just a few operators in nd.contrib.

@eric-haibin-lin eric-haibin-lin changed the title Add contrib.rand_log_uniform Add contrib.rand_zipfian Feb 19, 2018
sampled_cls_fp64 = sampled_classes.astype('float64')
expected_prob_sampled = ((sampled_cls_fp64 + 2.0) / (sampled_cls_fp64 + 1.0)).log() / log_range
expected_count_sampled = expected_prob_sampled * num_sampled
return [sampled_classes, expected_count_true, expected_count_sampled]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I forgot to update this

@eric-haibin-lin eric-haibin-lin merged commit 9158352 into apache:master Feb 23, 2018
dabraude pushed a commit to dabraude/incubator-mxnet that referenced this pull request Feb 24, 2018
* draft

* move to contrib

* rename op

* CR comments

* Update contrib.py

* Update contrib.py

* Update random.py

* update example in the doc

* update example in symbol doc

* CR comments

* update op name

* update op name

* update op name in test

* update test

* Update contrib.py
dabraude pushed a commit to dabraude/incubator-mxnet that referenced this pull request Feb 24, 2018
* draft

* move to contrib

* rename op

* CR comments

* Update contrib.py

* Update contrib.py

* Update random.py

* update example in the doc

* update example in symbol doc

* CR comments

* update op name

* update op name

* update op name in test

* update test

* Update contrib.py
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* draft

* move to contrib

* rename op

* CR comments

* Update contrib.py

* Update contrib.py

* Update random.py

* update example in the doc

* update example in symbol doc

* CR comments

* update op name

* update op name

* update op name in test

* update test

* Update contrib.py
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* draft

* move to contrib

* rename op

* CR comments

* Update contrib.py

* Update contrib.py

* Update random.py

* update example in the doc

* update example in symbol doc

* CR comments

* update op name

* update op name

* update op name in test

* update test

* Update contrib.py
@eric-haibin-lin eric-haibin-lin deleted the log-uniform branch September 18, 2018 23:33
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants