Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Imdb sentiment dataset reader #962

Merged
merged 8 commits into from
Apr 16, 2020

Conversation

sgrechanik-h
Copy link
Contributor

This PR implements a dataset reader for the IMDb sentiment classification dataset. It also includes a json configuration for BERT (en, cased) which is mostly the same as the configuration for rusentiment except for the max seq length and batch size (which I set to values such that I don't get out-of-memory on my hardware).

This PR also includes a fix for the sets_accuracy metric which should now correctly work for string labels (i.e. wrap them into sets instead converting them to sets). Also I added reporting of cached files in download_decompress.

@dilyararimovna
Copy link
Contributor

You should definitely try conversational English BERT to check whether that classifier will be better or not.

@sgrechanik-h
Copy link
Contributor Author

@dilyararimovna Conversational BERT seems to be slightly better: ~93.5% accuracy vs ~92.5% accuracy on the original cased english BERT with the same hyperparameters. However, I trained them for about 3 epochs, so maybe I'll try training them for a longer time.

@dilyararimovna
Copy link
Contributor

dilyararimovna commented Aug 13, 2019

It's time to get rid out of necessity to cover one class to list of classes.
Please, do the following steps:
Replace https://github.com/Huawei-MRC-OSI/DeepPavlov/blob/pr-imdb-clean/deeppavlov/dataset_iterators/snips_intents_iterator.py#L31 with

result.append((text, intent))

Replace https://github.com/Huawei-MRC-OSI/DeepPavlov/blob/pr-imdb-clean/deeppavlov/dataset_readers/basic_classification_reader.py#L37 with

format: str = "csv", class_sep: str = None,

Replace https://github.com/Huawei-MRC-OSI/DeepPavlov/blob/pr-imdb-clean/deeppavlov/dataset_readers/basic_classification_reader.py#L50 with

sep (str): delimeter for ``"csv"`` files. Default: None -> only one class per sample

Replace https://github.com/Huawei-MRC-OSI/DeepPavlov/blob/pr-imdb-clean/deeppavlov/dataset_readers/basic_classification_reader.py#L90-L93 with

                if isinstance(x, list):
                    if class_sep is None:
                        # each sample is a tuple ("text", "label")
                        data[data_type] = [([row[x_] for x_ in x], str(row[y]))
                                           for _, row in df.iterrows()]
                    else:
                        # each sample is a tuple ("text", ["label", "label", ...])
                        data[data_type] = [([row[x_] for x_ in x], str(row[y]).split(class_sep))
                                           for _, row in df.iterrows()]
                else:
                    if class_sep is None:
                        # each sample is a tuple ("text", "label")
                        data[data_type] = [(row[x], str(row[y])) for _, row in df.iterrows()]
                    else:
                        # each sample is a tuple ("text", ["label", "label", ...])
                        data[data_type] = [(row[x], str(row[y]).split(class_sep)) for _, row in df.iterrows()]

Thank you!

@sergei-mironov
Copy link
Contributor

sergei-mironov commented Aug 14, 2019

It's time to get rid out of necessity to cover one class to list of classes.

My opinion here is that we should focus on sentiment reader in this ticket. basic_classification_reader.py is a common component, many configurations may be affected by changes in it, so I think it is better to do this kind of work by a separate PR.

Regarding the labels problem I would suggest to make ImdbReader to return singleton lists for now to stay compatible with other components.

@dilyararimovna
Copy link
Contributor

@grwlf Then you should revert proba2labels component to its previous version. Because the current version returns non-consistent labels.

deeppavlov/dataset_readers/imdb_reader.py Outdated Show resolved Hide resolved
deeppavlov/dataset_readers/imdb_reader.py Outdated Show resolved Hide resolved
sgrechanik-h and others added 2 commits August 14, 2019 15:44
@dilyararimovna
Copy link
Contributor

@grwlf Are you planning to release pre-trained classification models?

@sergei-mironov
Copy link
Contributor

@grwlf Are you planning to release pre-trained classification models?

Do you mean models for this Sentiment task? Not sure about that. I think we could either release the model or provide an instruction on how to fine-tune existing bert. What do you think would be better?
@yoptar ?

@yoptar
Copy link
Contributor

yoptar commented Aug 19, 2019

What do you think would be better?

Most of our configs come with pretrained models, so it would be preferable in this case as well.

@sergei-mironov
Copy link
Contributor

What do you think would be better?

Most of our configs come with pretrained models, so it would be preferable in this case as well.

OK, we will train the model.

@sgrechanik-h
Copy link
Contributor Author

So, how do we upload the weights to files.deeppavlov.ai?

@yoptar
Copy link
Contributor

yoptar commented Sep 4, 2019

So, how do we upload the weights to files.deeppavlov.ai?

@sgrechanik-h,
You can send me a link to the model on [email protected] and I'll reply with its url on files.deeppavlov.ai
Or you can publish a link on some other site in the config.

@sergei-mironov
Copy link
Contributor

So, how do we upload the weights to files.deeppavlov.ai?

@sgrechanik-h,
You can send me a link to the model on [email protected] and I'll reply with its url on files.deeppavlov.ai
Or you can publish a link on some other site in the config.

Could you please check this Google.Drive folder? It should contain sentiment_imdb_conv_bert_v0.tar.gz and sentiment_imdb_bert_v0.tar.gz checkpoints.

https://drive.google.com/drive/folders/1fXAR02g-drJKOHhSGND6TLNkhzErz9PE?usp=sharing

data_path = Path(data_path)

if url is None:
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
Copy link

@avmhawk avmhawk Sep 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hidden logic, why should the file be available at this link? remove, pls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please suggest how to fix.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the path must be specified directly in the config. Add "url" to "dataset_reader": {..}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced, many dataset readers of DeepPavlov have their URLs hardcoded, and forcing the user to provide the URL in the config file will harm usability. (However almost all of these harcoded URLs lead to files.deeppavlov.ai/*, so the imdb dataset should probably be uploaded to files.deeppavlov.ai as well).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yoptar what do you think about this? Maybe, i was wrong

@@ -192,6 +192,11 @@ def download_decompress(url: str, download_path: [Path, str], extract_paths=None
extracted = extracted_path.exists()
if not extracted and not arch_file_path.exists():
simple_download(url, arch_file_path)
else:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that the following logic is implemented here(maybe i'm wrong):

        if extracted_path.exists():
            extracted = True
            log.info(f'Found cached and extracted {url} in {extracted_path}')
        elif arch_file_path.exists():
            log.info(f'Found cached {url} in {arch_file_path}')
        else:
            simple_download(url, arch_file_path)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is much more readable, thanks.

download_decompress(url, data_path)
mark_done(data_path)

alternative_data_path = data_path / "aclImdb"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, you check data_path, download file and after try "alternative_data_path"? and if it exists you don't use original data_path anywhere or am I misunderstood?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's naming problem and you should use "extracted_data_path" for this var.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem I'm trying to solve here is that the archive contains the aclImdb folder where all the data files are located, however data_path itself probably ends with aclImdb, so the data files are probably somewhere in ~/.deeppavlov/downloads/aclImdb/aclImdb/. This may create some confusion, and if data_path is provided manually, it may be set to ~/.deeppavlov/downloads/aclImdb/aclImdb/ instead of the correct ~/.deeppavlov/downloads/aclImdb by the user. This logic forgives this mistake.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, it's not explicit solution, but it works....

for Computational Linguistics (ACL 2011).
"""

@overrides
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary decorator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need it to, I don't know, protect us from misspelling the word read or from the future changes of the function's name in the base class?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, is a redundant solution. we all write code and monitor its consistency, using a third-party little-known library and operations with metaclasses can shoot us in the future.

data_path = Path(data_path)

if url is None:
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the path must be specified directly in the config. Add "url" to "dataset_reader": {..}

@yoptar yoptar merged commit db31b59 into deeppavlov:dev Apr 16, 2020
surkovv pushed a commit to surkovv/DeepPavlov that referenced this pull request Aug 24, 2022
* feat: Imdb sentiment dataset reader

* fix: download_decompress: report found cached files

* Use accuracy instead of set_accuracy

* fix: Proba2Labels: return single label for max_proba=True

* Config for conversational BERT for imdb

* Imdb: produce singleton lists of labels instead of labels

* Don't convert data_path to Path twice

Co-Authored-By: Aleksei Lymar <[email protected]>

* imdb: always use utf-8

Co-Authored-By: Aleksei Lymar <[email protected]>

Co-authored-by: Aleksei Lymar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants