
Question about Wikipedia Corpus Preprocessing #222

Open
manveertamber opened this issue Jul 18, 2022 · 0 comments

Comments

@manveertamber

Hi,

In thread #42, it was mentioned that the pages were split into 100-word passages using the spaCy en-web tokenizer. I tried to reproduce this myself, counting a token as a word when spaCy's is_alpha is True, but my passages were on average slightly longer than the Wikipedia DPR 100-word passages. Could you elaborate on how the tokenizer was used to count 100 words, please?
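
For reference, this is roughly what I tried. It is only a sketch of my own attempt, not necessarily what was done for the DPR passages: the pipeline (a blank English tokenizer) and the rule of counting only is_alpha tokens are my assumptions.

```python
import spacy

# Thread #42 only mentions the "spaCy en-web tokenizer", so whether a blank
# English pipeline or a full model (e.g. en_core_web_sm) was used is a guess.
nlp = spacy.blank("en")

def split_into_passages(text, words_per_passage=100):
    """Greedily split text into passages of ~100 words,
    counting a token as a word only when token.is_alpha is True."""
    doc = nlp(text)
    passages = []
    current_tokens = []
    word_count = 0
    for token in doc:
        current_tokens.append(token.text_with_ws)
        if token.is_alpha:
            word_count += 1
        if word_count == words_per_passage:
            passages.append("".join(current_tokens).strip())
            current_tokens = []
            word_count = 0
    if current_tokens:
        passages.append("".join(current_tokens).strip())
    return passages
```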
