
The doc_stride Parameter in chunk_into_passages Can Cause Errors or Unexpected Behaviour #536

Closed
EhsanM4t1qbit opened this issue Sep 13, 2020 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@EhsanM4t1qbit

Describe the bug
I suspect that there is a bug in the function chunk_into_passages in samples.py, used for breaking down a long paragraph into multiple passages for QA tasks.
There is a moving window for selecting a chunk of the paragraph. The window start, passage_start_t, advances by doc_stride tokens each iteration, while the window end, passage_end_t, sits passage_len_t tokens after the start. I see a few potentially problematic scenarios here.

passage_id = 0
doc_len_t = len(doc_offsets)    
while True:
    passage_start_t = passage_id * doc_stride
    passage_end_t = passage_start_t + passage_len_t
    passage_start_c = doc_offsets[passage_start_t]  # line 228
    # ...
    if passage_end_t >= doc_len_t:
        break

  • doc_stride > doc_len_t > passage_len_t: This will cause an error on line 228.
E.g. doc_stride = 200, doc_len_t = 150, passage_len_t = 100
First passage: tokens 0-100
Second passage: tokens 200-300


09/11/2020 19:23:15 - ERROR - farm.data_handler.processor -   Error message: list index out of range
  • doc_len_t > doc_stride > passage_len_t: This will silently skip a number of tokens.
E.g. doc_len_t = 200, doc_stride = 150, passage_len_t = 100
First passage: tokens 0-100
Second passage: tokens 150-250
  • doc_stride < passage_len_t: There will be an overlap between the two chunks.
E.g. doc_len_t = 200, doc_stride = 100, passage_len_t = 150
First passage: tokens 0-150
Second passage: tokens 100-250
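
The three scenarios above can be reproduced with a simplified, hypothetical sketch of the windowing loop (a stand-in for chunk_into_passages, not the actual FARM code):

```python
# Simplified sketch of the windowing loop, returning the (start, end)
# token windows it would produce for the given parameters.
def chunk_windows(doc_len_t, doc_stride, passage_len_t):
    windows = []
    passage_id = 0
    while True:
        passage_start_t = passage_id * doc_stride
        if passage_start_t >= doc_len_t:
            # Corresponds to the IndexError on doc_offsets[passage_start_t]
            raise IndexError("passage start beyond document length")
        passage_end_t = passage_start_t + passage_len_t
        windows.append((passage_start_t, passage_end_t))
        if passage_end_t >= doc_len_t:
            break
        passage_id += 1
    return windows

print(chunk_windows(200, 150, 100))  # gap:     [(0, 100), (150, 250)]
print(chunk_windows(200, 100, 150))  # overlap: [(0, 150), (100, 250)]
# chunk_windows(150, 200, 100) raises IndexError (doc_stride > doc_len_t)
```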

Note that it's not straightforward to set passage_len_t, since it depends on a number of other parameters:

passage_len_t = max_seq_len - question_len_t - n_special_tokens
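
To make the dependency concrete, here are hypothetical numbers (the question length and special-token count are illustrative assumptions, not values from the source):

```python
# The passage budget is what remains of max_seq_len after the question
# tokens and special tokens are accounted for.
max_seq_len = 384
question_len_t = 20   # assumed question length in tokens
n_special_tokens = 3  # e.g. [CLS] question [SEP] passage [SEP]
passage_len_t = max_seq_len - question_len_t - n_special_tokens
print(passage_len_t)  # 361
```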

The simple solution is to get rid of doc_stride and set passage_start_t to passage_end_t+1 at the end of the while loop.

@brandenchan
Contributor

Hey @EhsanM4t1qbit, thanks for the issue. This seems closely related to #510, and I totally agree we need some kind of warning to avoid situations where doc_stride > passage_len_t. One has been introduced with #538.

However, we don't recommend setting passage_start_t to passage_end_t + 1. It is actually desirable for there to be some overlap between chunks: without it, the QA system cannot select answer spans that cross two passages. To give a sense of the defaults we use, we often set doc_stride=128 and max_seq_len=384.
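
As a rough sketch of why the overlap matters (not FARM's actual implementation; window and span values below are illustrative): with doc_stride < passage_len_t, any answer span of at most passage_len_t - doc_stride tokens is guaranteed to fall entirely inside at least one window.

```python
# Overlapping vs. non-overlapping windows over a tokenized document.
def windows(doc_len_t, doc_stride, passage_len_t):
    return [(s, s + passage_len_t) for s in range(0, doc_len_t, doc_stride)]

def fully_contained(span, wins):
    a, b = span
    return any(s <= a and b <= e for s, e in wins)

answer = (370, 400)  # a span crossing the first window boundary at token 384
print(fully_contained(answer, windows(1000, 128, 384)))  # True: an overlapping window holds it whole
print(fully_contained(answer, windows(1000, 384, 384)))  # False: every window cuts it
```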

Does this address your issue?

@EhsanM4t1qbit
Author

EhsanM4t1qbit commented Sep 14, 2020

Thanks for your response. I didn't realize that the overlap helps with going across two passages. In that case, yes, it does address the issue.
Just one more question to pick your brain: why not set max_seq_len to 512?

@brandenchan
Contributor

So the self-attention layers in transformers are a key part of what makes them so powerful. But they are computationally very expensive and scale quadratically with seq_len. That means that doubling seq_len can quadruple the number of operations that need to be completed.
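
A back-of-envelope check of the quadratic term (this covers attention cost only; end-to-end gains are smaller because the other layers scale roughly linearly):

```python
# Attention cost scales with seq_len**2, so the relative cost of two
# sequence lengths is their squared ratio.
attn_cost_ratio = 512**2 / 384**2
print(round(attn_cost_ratio, 2))       # 1.78
print(round(1 - (384 / 512) ** 2, 4))  # 0.4375 -> ~44% fewer attention ops
```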

In our end-to-end experiments, we've found that setting it to 384 instead of 512 gives about a 33% speedup with little change in accuracy, so we've stuck with it since!

@brandenchan
Contributor

Closing the issue now. Feel free to reopen for follow-ups regarding the original chunking issue, or open a new issue if you have any more questions about the models!
