
Question about Wikipedia Corpus Preprocessing #222

Open
manveertamber opened this issue Jul 18, 2022 · 0 comments

Comments

@manveertamber

Hi,

In thread #42, it was mentioned that the pages were split into 100-word passages using the spaCy en-web tokenizer. I tried to reproduce this myself, counting a token as a word when spaCy's is_alpha is True, but my passages were on average slightly longer than the Wikipedia DPR 100-word passages. Could you elaborate on how the tokenizer was used to count 100 words, please?
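
For reference, this is roughly what I tried. It is only a sketch of my own attempt, not necessarily what was done for the DPR passages: the pipeline (a blank English tokenizer) and the rule of counting only is_alpha tokens are my assumptions.

```python
import spacy

# Thread #42 only mentions the "spaCy en-web tokenizer", so whether a blank
# English pipeline or a full model (e.g. en_core_web_sm) was used is a guess.
nlp = spacy.blank("en")

def split_into_passages(text, words_per_passage=100):
    """Greedily split text into passages of ~100 words,
    counting a token as a word only when token.is_alpha is True."""
    doc = nlp(text)
    passages = []
    current_tokens = []
    word_count = 0
    for token in doc:
        current_tokens.append(token.text_with_ws)
        if token.is_alpha:
            word_count += 1
        if word_count == words_per_passage:
            passages.append("".join(current_tokens).strip())
            current_tokens = []
            word_count = 0
    if current_tokens:
        passages.append("".join(current_tokens).strip())
    return passages
```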
