load_with_spacy() in wiki_ann.py returns a total of 10 sentences #13
Hi @Lasmali, I tried to replicate this problem by removing the downloaded dataset folder. Have you tried clearing the wikiann dataset folder?
I believe the error is caused by an update to spaCy's conll_ner2json() function, specifically explosion/spaCy#4186. It is quite recent, which explains why we are getting different results. The change makes the default behaviour of conll_ner2json() group sentences ten at a time via the n_sents parameter. The newer spaCy conll_ner2json version returns a list of length 9593, each element holding 10 sentences, so indexing into file_as_json effectively gives you 10 sentences. I suspect your file_as_json is a singleton list whose only entry holds all 95,930 sentences, hence the file_as_json[0] line. I think a good fix would be not to rely on sklearn for splitting the dataset and to just return the spaCy GoldCorpus as-is. That gives the user a reliable experience following whatever standard spaCy sets, while also removing the dependency on sklearn. It would be quite trivial for users to split their data into training and validation sets themselves. What do you think?

def load_with_spacy(self):
    ...
    ...
    file_as_string = file.read()
    file_as_json = conll_ner2json(file_as_string)  # has n_sents=10 as default after the update
    # file_as_json is of length 9593, with each entry holding 10 sentences
    all_sents = file_as_json[0]['paragraphs'][0]['sentences']  # only 10 sentences are extracted
    train_sents, dev_sents = train_test_split(all_sents, test_size=0.3, random_state=42)
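For reference, here is a minimal sketch of how all sentences could be recovered from the converter output instead of only the first entry. It assumes the spaCy v2.1 import path for conll_ner2json and the post-#4186 behaviour of returning one JSON document per group of sentences; load_all_sentences is just an illustrative helper name, not danlp's actual code.

from spacy.cli.converters import conll_ner2json  # assumed spaCy v2.1.x location

def load_all_sentences(conll_path):
    # Read the WikiANN CoNLL file and flatten every converted document,
    # rather than indexing only file_as_json[0]
    with open(conll_path, 'r', encoding='utf-8') as file:
        file_as_string = file.read()
    file_as_json = conll_ner2json(file_as_string)  # list of docs, 10 sentences each by default
    return [sent
            for doc in file_as_json
            for para in doc['paragraphs']
            for sent in para['sentences']]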
I tried it out with spaCy v2.1.4, and after upgrading my spaCy version I also get the 10 sentences you mention. I agree that we should not rely on sklearn for this. However, I believe we still have to perform the dev/train split, because otherwise we would not be able to instantiate the GoldCorpus. One solution would be to let the user pass a splitting function and provide a simple default that does not depend on any other framework. What do you think of this solution? See the sketch below.
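A rough sketch of that idea, where split_func and simple_split are hypothetical names rather than part of the current danlp API, and the default split uses only the standard library:

import random

def simple_split(sentences, test_size=0.3, seed=42):
    # Dependency-free train/dev split, replacing sklearn's train_test_split
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_size))
    return shuffled[:cut], shuffled[cut:]

def load_with_spacy(self, split_func=simple_split):
    ...
    train_sents, dev_sents = split_func(all_sents)
    ...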
Is there a reason why most of the data from WikiAnn is excluded here?

danlp/danlp/datasets/wiki_ann.py, line 49 (commit 2877bd0)

Instead of having 95,930 sentences it returns a GoldCorpus with 10 sentences.