The `doc_stride` Parameter in `chunk_into_passages` Can Cause Errors or Unexpected Behaviour
#536
Labels: bug
Describe the bug
I suspect that there is a bug in the function `chunk_into_passages` in `samples.py`, used for breaking down a long paragraph into multiple passages for QA tasks. There is a moving window for selecting a chunk of the paragraph. The window's starting point is `passage_start_t`, which moves by `doc_stride` tokens, while the window's end token, `passage_end_t`, moves by `passage_len_t` tokens. I see a few potentially problematic scenarios here:

- `doc_stride > doc_len_t > passage_len_t`: This will cause an error on line 228.
- `doc_len_t > doc_stride > passage_len_t`: This will silently skip a number of tokens.
- `doc_stride < passage_len_t`: There will be an overlap between consecutive chunks.

Note that it's not straightforward to set `passage_len_t`, since it is dependent on a number of other parameters.

The simple solution is to get rid of `doc_stride` and set `passage_start_t` to `passage_end_t + 1` at the end of the while loop.
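
To make the three scenarios concrete, here is a minimal sketch of the windowing logic as described above. It is not the actual code from `samples.py`: the function names `sliding_windows`, `contiguous_windows`, and `coverage` are hypothetical, the loop works on token indices only, and it assumes that `passage_end_t` is the inclusive index of the last token in a window. The `IndexError` stands in for the out-of-range lookup suspected on line 228.

```python
from collections import Counter


def sliding_windows(doc_len_t, passage_len_t, doc_stride):
    """Simplified model of the described loop (an assumption, not the library code).

    Returns a list of (passage_start_t, passage_end_t) pairs with an inclusive end.
    """
    if min(doc_len_t, passage_len_t, doc_stride) <= 0:
        raise ValueError("all lengths must be positive")
    spans = []
    passage_id = 0
    while True:
        passage_start_t = passage_id * doc_stride
        if passage_start_t >= doc_len_t:
            # Models indexing past the end of the per-token offsets.
            raise IndexError("passage_start_t is past the end of the document")
        passage_end_t = min(passage_start_t + passage_len_t, doc_len_t) - 1
        spans.append((passage_start_t, passage_end_t))
        if passage_end_t >= doc_len_t - 1:
            break
        passage_id += 1
    return spans


def contiguous_windows(doc_len_t, passage_len_t):
    """Proposed fix: drop doc_stride and start each window right after the last."""
    if min(doc_len_t, passage_len_t) <= 0:
        raise ValueError("all lengths must be positive")
    spans = []
    passage_start_t = 0
    while passage_start_t < doc_len_t:
        passage_end_t = min(passage_start_t + passage_len_t, doc_len_t) - 1
        spans.append((passage_start_t, passage_end_t))
        passage_start_t = passage_end_t + 1
    return spans


def coverage(spans, doc_len_t):
    """Return (skipped_tokens, repeated_tokens) for a list of inclusive spans."""
    counts = Counter(t for start, end in spans for t in range(start, end + 1))
    skipped = [t for t in range(doc_len_t) if counts[t] == 0]
    repeated = [t for t in range(doc_len_t) if counts[t] > 1]
    return skipped, repeated


if __name__ == "__main__":
    # doc_len_t > doc_stride > passage_len_t: token 5 is never covered.
    print(coverage(sliding_windows(10, 5, 6), 10))
    # doc_stride < passage_len_t: neighbouring windows share tokens.
    print(coverage(sliding_windows(10, 5, 3), 10))
    # Proposed fix: no gaps and no overlap.
    print(coverage(contiguous_windows(10, 4), 10))
    # doc_stride > doc_len_t > passage_len_t: the second window starts out of range.
    try:
        sliding_windows(10, 4, 12)
    except IndexError as err:
        print("IndexError:", err)
```

Under these assumptions, the gap in the second scenario and the crash in the first both come from `doc_stride` exceeding `passage_len_t`, so a check on that relation might be a smaller change than removing `doc_stride` entirely.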