Skip to content

Latest commit

 

History

History
97 lines (87 loc) · 8.33 KB

README.md

File metadata and controls

97 lines (87 loc) · 8.33 KB

WYWEB

An evaluation bentchmark for classical Chinese. This work has been accepted by Findings of ACL 2023.

Classical Chinese is a treasure of the entire human cultural history. We contribute this work with the hope of helping the entire community to be more prosperous. Our work will be an open, community-driven project which improves with the advancement of technology.

We hope more people join in to make this benchmark better and more useful.

Leader-board

Online leader-board

See WYWEB on CADAL for the official leader-board.

Main

Models Avg. PUNC GLNER GJC FSPC TLC Xuci WYWRC IRC
Human 88.0 92.4 94.3 90.3 80.0 89.0 85.3 80.0 92.3
DeBERTa-base 75.9 83.3 86.7 85.2 61.1 86.7 72.4 45.1 86.7
GuwenBERT-base 72.9 82.5 82.8 84.8 61.3 85.1 71.7 28.0 86.8
GuwenBERT-large 75.6 83.1 86.1 84.9 58.5 87.6 73.4 44.4 87.8
GuwenBERT-base-fs 74.6 82.9 84.8 84.2 61.0 86.7 70.0 42.1 85.3
RoBERTa-CCBC 74.5 82.5 84.7 84.5 59.5 85.0 73.2 40.7 86.1
RoBERTa-CCLC 75.3 82.8 86.1 84.7 58.6 87.1 74.9 41.0 86.9
SikuBERT 73.7 80.8 82.8 82.2 60.9 82.4 70.4 44.0 85.8
SikuRoBERTa 73.5 81.4 82.8 82.5 62.2 83.8 68.5 41.0 85.8
RoBERTa-wwm-ext 72.1 78.8 79.8 81.3 59.2 78.3 71.0 42.1 86.2

WYWMT

Model BLEU chrF2 TER ROUGE-1 ROUGE-2 ROUGE-L
Human 45.6 44.2 34.4 77.4 50.7 76.2
guwenbert-base 40.1 38.1 37.5 72.5 46.0 70.3
guwenbert-large 38.8 37.2 38.1 70.1 43.7 67.7
guwenbert-base-fs 36.3 35.2 39.2 68.3 41.2 65.7
roberta-CCBC 39.1 37.1 36.8 71.4 44.9 69.3
roberta-CCLC 39.8 38.0 36.4 71.6 45.3 69.3
SikuBERT 38.8 36.2 37.9 72.0 45.5 69.8
SikuRoBERTa 39.1 36.5 37.7 72.2 45.7 70.0
DeBERTa-base 39.5 37.8 35.9 71.9 44.2 68.7
Roberta-wwm-ext 38.0 35.8 39.1 69.9 43.2 66.7

How to test new models?

This is an evaluation benchmark for classical Chinese NLP providing several tasks. Researchers could quickly evaluate pre-trained language models with a few lines of code using the evaluation toolkit.

Quick Run The Base Line

python run.py  \
                --tag wywweb \
                --do_train \
                --max_seq_len 512 \
                --dump 1000 \
                --task_name GJCTask \
                --data_dir data/tasks/gjc \
                --output_dir output/deberta/GJCTask \
                --num_train_epochs 6 \
                --model_dir_or_name bozhou/DeBERTa-base \
                --learning_rate 2e-5 \
                --train_batch_size 48 \
                --fp16 True \
                --workers 4 \
                --warmup 1000 

Test your model and contact us to update the leader board.

  • test your model on every task.
  • get the best dev set score, use this model to evaluate test set.
  • send result of the test set to us.
  • maintainers validate the result and then update the leader board.

Task Description

Task Train Dev Test Description Metric Source
PUNC 90k 20k 20k Sequence labeling F1 Authoritative Texts
TLC 28k 6k 6k Sentence classification Accuracy Ancient prose
GJC 100k 20k 20k Sentence classification Accuracy Daizhige
XuCi 800 200 200 Token similarity Accuracy Exam papers
WYWRC 3k 500 500 Reading comprehension Accuracy Exam papers
IRC 3k 1k 1k Reading comprehension Accuracy Exam papers
WYWMT 20k 3k 3k Machine Translation BLEU online
GLNER 80k 18k 18k Sequence labeling F1 \citet{GULIAN2020}
FSPC 3000 1000 1000 Sentence classification Accuracy THU-FSPC

Cite us

@inproceedings{zhou-etal-2023-wyweb,
    title = "{WYWEB}: A {NLP} Evaluation Benchmark For Classical {C}hinese",
    author = "Zhou, Bo  and
      Chen, Qianglong  and
      Wang, Tianyu  and
      Zhong, Xiaomi  and
      Zhang, Yin",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.204",
    doi = "10.18653/v1/2023.findings-acl.204",
    pages = "3294--3319"
}