Chinese_Skewed_TxtClf

Chinese text classification datasets and the machine-learning-based classifiers applied to them, as described in the paper:

Yuen-Hsien Tseng, "The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification," Journal of Educational Media & Library Sciences, Vol. 57, No. 1 (March 2020).

Datasets (details can be found in the article cited above):

WebDes
News
CTC
CnonC

Classifiers:

Naive Bayes (NB)
Support Vector Machine (SVM)
Random Forest (RF)
Single hidden-layer neural network (NN)
Convolutional Neural Networks (CNN)
Recurrent Convolutional Neural Networks (RCNN)
Facebook's fastText
Bidirectional Encoder Representations from Transformers (BERT)
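
When class sizes are skewed, a classifier that favors the large classes can look strong on micro-averaged scores while doing poorly on small classes, so it is common to compare classifiers on both micro- and macro-averaged F1. The following is a minimal sketch of that comparison. It is not taken from this repository, and whether the paper or ft_metrics.py reports exactly these measures is an assumption; the function name `micro_macro_f1` and the labels "A" and "B" are hypothetical.

```python
from collections import Counter


def micro_macro_f1(gold, predicted):
    """Return (micro_f1, macro_f1) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed
    labels = set(gold) | set(predicted)
    if not labels:
        return 0.0, 0.0

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical skewed data: 8 documents of class "A", 2 of class "B",
# and a classifier that always predicts "A".
gold = ["A"] * 8 + ["B"] * 2
predicted = ["A"] * 10
micro, macro = micro_macro_f1(gold, predicted)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the micro-averaged score stays high because the large class dominates, while the macro-averaged score drops because the small class is never predicted.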

1. Description of Files:

Datasets: the datasets listed above.
BERT_txtclf: a folder for running the BERT classifier.
BERT_txtclf_HowTo.docx: a document describing how to run the BERT classifier on the datasets.
TxtClfer.ipynb: a self-explanatory Jupyter Notebook for NB, SVM, NN, CNN, and RCNN. You can save it as TxtClfer.py to run it from the command line.
fastText_run_log.txt: a combined document and log file describing how to run the fastText classifier on the datasets.
ft_metrics.sh: a batch script for running fastText.
ft_metrics.py: code required by the above batch script.
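
Saving the notebook as a script can be done from Jupyter itself (for example with `jupyter nbconvert --to script TxtClfer.ipynb`). As an alternative, the sketch below shows one way to do it with only the Python standard library. It assumes the notebook uses the nbformat v4 JSON layout (a top-level "cells" list with "cell_type" and "source" fields); the function name `notebook_to_script` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def notebook_to_script(ipynb_path, py_path):
    """Write the code cells of a Jupyter notebook to a plain .py file.

    Assumes the nbformat v4 layout. Returns the number of code cells written.
    """
    notebook = json.loads(Path(ipynb_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        # Comment out IPython magics and shell escapes, which are not
        # valid in a plain Python script.
        lines = [
            "# " + line if line.lstrip().startswith(("%", "!")) else line
            for line in text.splitlines()
        ]
        chunks.append("\n".join(lines))
    Path(py_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return len(chunks)
```

With the notebook in the current directory, calling `notebook_to_script("TxtClfer.ipynb", "TxtClfer.py")` would produce a script that can then be run with `python TxtClfer.py`.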

Note: To run the BERT classifier under BERT_txtclf, you must first download the files it imports (or simply download all files) from https://github.com/google-research/bert into the BERT_txtclf folder.

2. To cite these datasets, source code, or experimental results:

Yuen-Hsien Tseng, "The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification," Journal of Educational Media & Library Sciences, Vol. 57, No. 1 (March 2020).

