The purpose of bucket when training #16

Open
zealscott opened this issue Aug 13, 2020 · 4 comments
Comments

@zealscott

zealscott commented Aug 13, 2020

Hi,

Thanks for your insightful paper and code!

I am wondering why the training data are generated by drawing from different buckets, with each bucket picked according to its probability.

And how is the size of each bucket determined?

(See the `## select bucket` comment in the data loading code.)

Thanks a lot!

@boathit
Owner

boathit commented Aug 13, 2020

First, we should make it clear that the bucket trick is not necessary; it is only used to batch more efficiently. We want the sequences in each batch to have similar lengths, otherwise we waste computation (note that the number of training steps for each batch is determined by the longest sequence in that batch).
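As a rough illustration (a minimal sketch, not the repository's actual preprocessing; the boundary values here are made up), bucketing by length looks like this:

```python
from collections import defaultdict

def bucketize(sequences, boundaries=(20, 40, 60, 80, 100)):
    """Group sequences by length so batch-mates need little padding."""
    buckets = defaultdict(list)
    for seq in sequences:
        for bound in boundaries:
            if len(seq) <= bound:
                buckets[bound].append(seq)
                break
        else:  # longer than every boundary: overflow bucket
            buckets[float("inf")].append(seq)
    return buckets
```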

Next, we should guarantee that each sequence is picked with equal chance during training. Once we use the buckets, we actually split the sequence picking into two steps: selecting a bucket and then selecting a sequence from that bucket. Thus, the probability of a sequence being selected is the product of the probability of its bucket being selected and the probability of it being selected within that bucket. That is why the probability of each bucket being selected is proportional to its size: if bucket i holds n_i of the N sequences and is chosen with probability n_i/N, a uniform draw inside it gives each sequence an overall probability of (n_i/N) × (1/n_i) = 1/N.
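A minimal sketch of that two-step sampling (again, not the repo's exact code):

```python
import random

def sample_batch(buckets, batch_size):
    """Draw a bucket with probability proportional to its size,
    then draw sequences uniformly from inside it, so every
    sequence is equally likely to appear in a batch."""
    bucket_list = list(buckets.values())
    sizes = [len(b) for b in bucket_list]            # n_i for each bucket
    chosen = random.choices(bucket_list, weights=sizes, k=1)[0]
    return random.sample(chosen, min(batch_size, len(chosen)))
```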

@zealscott
Author

Thanks for your reply! It's a clever design.

Another question: I am confused about the number of iterations.

createTrainVal(region, "$datapath/$cityname.h5", datapath, downsamplingDistort, 1_000_000, 10_000)

With the default settings, 1,000,000 trajectories (×20 for noise and distortion) are used for training, so we have 20,000,000 training samples. But num_iteration × args.batch = 67000 × 128 = 8,576,000, which is fewer than the number of training samples.

t2vec/train.py (line 251 at f518c3e):

    num_iteration = 67000*128 // args.batch

So I am curious how the number of iterations was determined. Is training for less than one epoch enough?
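For reference, the arithmetic behind this comparison (assuming args.batch is the default 128):

```python
batch = 128                            # assuming args.batch == 128
num_iteration = 67000 * 128 // batch   # = 67,000 updates
samples_seen = num_iteration * batch   # = 8,576,000 sequences
dataset_size = 1_000_000 * 20          # = 20,000,000 after the x20 distortion
print(samples_seen / dataset_size)     # ~0.43, i.e. under half an epoch
```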

Looking forward to your reply, and thanks again.

@boathit
Owner

boathit commented Aug 23, 2020

It is okay to use fewer than one epoch in cases where the dataset contains redundant samples. You can check convergence on the validation dataset.
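A minimal sketch of such a convergence check (the step_fn/eval_fn callables are hypothetical stand-ins, not t2vec's API):

```python
def train_with_early_stopping(step_fn, eval_fn, num_iteration,
                              eval_every=1000, patience=3, min_delta=1e-4):
    """step_fn(): one batch update; eval_fn(): mean loss on the validation
    set. Both are hypothetical stand-ins for the repo's own routines."""
    best_val, bad_checks = float("inf"), 0
    for it in range(num_iteration):
        step_fn()
        if (it + 1) % eval_every == 0:
            val_loss = eval_fn()
            if val_loss < best_val - min_delta:
                best_val, bad_checks = val_loss, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break  # validation loss has plateaued; stop early
    return best_val
```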

@zealscott
Author

OK, I got your point. One last question: in training we use the gen_loss of each word, while on the validation data we use the pre_loss of each trajectory. Is there any significance in doing these differently?
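To make the question concrete, here is a generic sketch of per-word versus per-trajectory loss averaging (assuming the losses are token-level negative log-likelihoods; this is not the repo's actual gen_loss/pre_loss implementation):

```python
def per_word_loss(token_nlls):
    """Mean NLL over all words in the batch (how the question
    describes the training-time gen_loss)."""
    total = sum(sum(seq) for seq in token_nlls)
    num_words = sum(len(seq) for seq in token_nlls)
    return total / num_words

def per_trajectory_loss(token_nlls):
    """Sum the NLL inside each trajectory, then average over trajectories
    (how the question describes the validation-time pre_loss); long
    trajectories weigh more per sequence under this scheme."""
    return sum(sum(seq) for seq in token_nlls) / len(token_nlls)
```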
