
The doc_stride Parameter in chunk_into_passages Can Cause Errors or Unexpected Behaviour #536

Closed
EhsanM4t1qbit opened this issue Sep 13, 2020 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@EhsanM4t1qbit

Describe the bug
I suspect that there is a bug in the function chunk_into_passages in samples.py, used for breaking down a long paragraph into multiple passages for QA tasks.
There is a moving window for selecting a chunk of the paragraph. The window start, passage_start_t, advances by doc_stride tokens each iteration, while the window end, passage_end_t, sits passage_len_t tokens after the start. I see a few potentially problematic scenarios here.

passage_id = 0
doc_len_t = len(doc_offsets)    
while True:
    passage_start_t = passage_id * doc_stride
    passage_end_t = passage_start_t + passage_len_t
    passage_start_c = doc_offsets[passage_start_t]  # line 228
    # ...
    if passage_end_t >= doc_len_t:
        break

  • doc_stride > doc_len_t > passage_len_t: This will cause an error on line 228.
E.g. doc_stride = 200, doc_len_t = 150, passage_len_t = 100
First passage: tokens 0-100
Second passage: tokens 200-300


09/11/2020 19:23:15 - ERROR - farm.data_handler.processor -   Error message: list index out of range
  • doc_len_t > doc_stride > passage_len_t: This will silently skip a number of tokens.
E.g. doc_len_t = 200, doc_stride = 150, passage_len_t = 100
First passage: tokens 0-100
Second passage: tokens 150-250
  • doc_stride < passage_len_t: There will be an overlap between the two chunks.
E.g. doc_len_t = 200, doc_stride = 100, passage_len_t = 150
First passage: tokens 0-150
Second passage: tokens 100-250
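
The three scenarios above can be reproduced with a simplified, hypothetical sketch of the windowing loop (a stand-in for chunk_into_passages, not the actual FARM code):

```python
# Simplified sketch of the windowing loop, returning the (start, end)
# token windows it would produce for the given parameters.
def chunk_windows(doc_len_t, doc_stride, passage_len_t):
    windows = []
    passage_id = 0
    while True:
        passage_start_t = passage_id * doc_stride
        if passage_start_t >= doc_len_t:
            # Corresponds to the IndexError on doc_offsets[passage_start_t]
            raise IndexError("passage start beyond document length")
        passage_end_t = passage_start_t + passage_len_t
        windows.append((passage_start_t, passage_end_t))
        if passage_end_t >= doc_len_t:
            break
        passage_id += 1
    return windows

print(chunk_windows(200, 150, 100))  # gap:     [(0, 100), (150, 250)]
print(chunk_windows(200, 100, 150))  # overlap: [(0, 150), (100, 250)]
# chunk_windows(150, 200, 100) raises IndexError (doc_stride > doc_len_t)
```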

Note that it's not straightforward to set passage_len_t, since it depends on a number of other parameters:

passage_len_t = max_seq_len - question_len_t - n_special_tokens
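
To make the dependency concrete, here are hypothetical numbers (the question length and special-token count are illustrative assumptions, not values from the source):

```python
# The passage budget is what remains of max_seq_len after the question
# tokens and special tokens are accounted for.
max_seq_len = 384
question_len_t = 20   # assumed question length in tokens
n_special_tokens = 3  # e.g. [CLS] question [SEP] passage [SEP]
passage_len_t = max_seq_len - question_len_t - n_special_tokens
print(passage_len_t)  # 361
```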

The simple solution is to get rid of doc_stride and set passage_start_t to passage_end_t+1 at the end of the while loop.

@brandenchan
Contributor

Hey @EhsanM4t1qbit, thanks for the issue. This seems closely related to #510, and I totally agree we need some kind of warning to avoid situations where doc_stride > passage_len_t. One has been introduced with #538.

However, we don't recommend setting passage_start_t to passage_end_t + 1. It is actually desirable for there to be some overlap between chunks: without it, the QA system cannot select answer spans that cross two passages. To give a sense of the defaults we use, we often set doc_stride=128 and max_seq_len=384.
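
As a rough sketch of why the overlap matters (not FARM's actual implementation; window and span values below are illustrative): with doc_stride < passage_len_t, any answer span of at most passage_len_t - doc_stride tokens is guaranteed to fall entirely inside at least one window.

```python
# Overlapping vs. non-overlapping windows over a tokenized document.
def windows(doc_len_t, doc_stride, passage_len_t):
    return [(s, s + passage_len_t) for s in range(0, doc_len_t, doc_stride)]

def fully_contained(span, wins):
    a, b = span
    return any(s <= a and b <= e for s, e in wins)

answer = (370, 400)  # a span crossing the first window boundary at token 384
print(fully_contained(answer, windows(1000, 128, 384)))  # True: an overlapping window holds it whole
print(fully_contained(answer, windows(1000, 384, 384)))  # False: every window cuts it
```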

Does this address your issue?

@EhsanM4t1qbit
Author

EhsanM4t1qbit commented Sep 14, 2020

Thanks for your response. I didn't realize that the overlap helps with going across two passages. In that case, yes, it does address the issue.
Just one more question to pick your brain: why not set max_seq_len to 512?

@brandenchan
Contributor

So the self-attention layers in transformers are a key part of what makes them so powerful. But they are computationally very expensive and scale quadratically with seq_len. That means that doubling seq_len can quadruple the number of operations that need to be completed.
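
A back-of-envelope check of the quadratic term (this covers attention cost only; end-to-end gains are smaller because the other layers scale roughly linearly):

```python
# Attention cost scales with seq_len**2, so the relative cost of two
# sequence lengths is their squared ratio.
attn_cost_ratio = 512**2 / 384**2
print(round(attn_cost_ratio, 2))       # 1.78
print(round(1 - (384 / 512) ** 2, 4))  # 0.4375 -> ~44% fewer attention ops
```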

In our end-to-end experiments, we've found that setting it to 384 instead of 512 gives about a 33% speedup with little change in accuracy, so we've stuck with it since!

@brandenchan
Contributor

Closing the issue now. Feel free to reopen for follow-ups regarding the original chunking issue, or open a new issue if you have any more questions about the models!
