- Publication: https://ieeexplore.ieee.org/document/9865479
- Demo: https://youtu.be/ggmAvpA4oHQ
Enter your sentence: Lễ tân thân thiện, có thang máy, vị trí ks thuận tiện, view thành phố rất đẹp. Phòng sạch nhưng hơi nhỏ & thiếu bình đun siêu tốc. Sẽ quay lại & giới thiệu bạn bè
=> FACILITIES#DESIGN&FEATURES,positive
=> LOCATION#GENERAL,positive
=> ROOM_AMENITIES#DESIGN&FEATURES,negative
=> ROOMS#CLEANLINESS,positive
=> ROOMS#DESIGN&FEATURES,negative
=> SERVICE#GENERAL,positive
- I. Introduction
- II. The VLSP 2018 Aspect-based Sentiment Analysis Dataset
- III. Vietnamese Preprocessing
- IV. Model Development
- V. Experimental Results
This work aimed to solve the Aspect-based Sentiment Analysis (ABSA) problem for Vietnamese. Specifically, we focus on 2 sub-tasks of Aspect Category Sentiment Analysis (ACSA):
- Aspect Category Detection (ACD): Detect `Aspect#Category` pairs in each review (e.g., `HOTEL#CLEANLINESS`, `RESTAURANT#PRICES`, `SERVICE#GENERAL`, etc.).
- Sentiment Polarity Classification (SPC): Classify the Sentiment Polarity (`Positive`, `Negative`, `Neutral`) of each `Aspect#Category` pair.
Here, we proposed 2 End-to-End solutions (ACSA-v1 and ACSA-v2), both of which use PhoBERT, a Pre-trained language model for Vietnamese, to handle the above tasks simultaneously on the 2 domains of the VLSP 2018 ABSA Dataset: Hotel and Restaurant.
Domain | Dataset | No. Reviews | No. Aspect#Category,Polarity | Avg. Length | Vocab Size | No. words in Test/Dev not in Training set |
---|---|---|---|---|---|---|
Hotel | Training | 3,000 | 13,948 | 47 | 3,908 | - |
Hotel | Dev | 2,000 | 7,111 | 23 | 2,745 | 1,059 |
Hotel | Test | 600 | 2,584 | 30 | 1,631 | 346 |
Restaurant | Training | 2,961 | 9,034 | 54 | 5,168 | - |
Restaurant | Dev | 1,290 | 3,408 | 50 | 3,398 | 1,702 |
Restaurant | Test | 500 | 2,419 | 163 | 3,375 | 1,729 |
- The Hotel domain consists of the following 34 `Aspect#Category` pairs:
['FACILITIES#CLEANLINESS', 'FACILITIES#COMFORT', 'FACILITIES#DESIGN&FEATURES', 'FACILITIES#GENERAL', 'FACILITIES#MISCELLANEOUS', 'FACILITIES#PRICES', 'FACILITIES#QUALITY', 'FOOD&DRINKS#MISCELLANEOUS', 'FOOD&DRINKS#PRICES', 'FOOD&DRINKS#QUALITY', 'FOOD&DRINKS#STYLE&OPTIONS', 'HOTEL#CLEANLINESS', 'HOTEL#COMFORT', 'HOTEL#DESIGN&FEATURES', 'HOTEL#GENERAL', 'HOTEL#MISCELLANEOUS', 'HOTEL#PRICES', 'HOTEL#QUALITY', 'LOCATION#GENERAL', 'ROOMS#CLEANLINESS', 'ROOMS#COMFORT', 'ROOMS#DESIGN&FEATURES', 'ROOMS#GENERAL', 'ROOMS#MISCELLANEOUS', 'ROOMS#PRICES', 'ROOMS#QUALITY', 'ROOM_AMENITIES#CLEANLINESS', 'ROOM_AMENITIES#COMFORT', 'ROOM_AMENITIES#DESIGN&FEATURES', 'ROOM_AMENITIES#GENERAL', 'ROOM_AMENITIES#MISCELLANEOUS', 'ROOM_AMENITIES#PRICES', 'ROOM_AMENITIES#QUALITY', 'SERVICE#GENERAL']
- The Restaurant domain consists of the following 12 `Aspect#Category` pairs:
['AMBIENCE#GENERAL', 'DRINKS#PRICES', 'DRINKS#QUALITY', 'DRINKS#STYLE&OPTIONS', 'FOOD#PRICES', 'FOOD#QUALITY', 'FOOD#STYLE&OPTIONS', 'LOCATION#GENERAL', 'RESTAURANT#GENERAL', 'RESTAURANT#MISCELLANEOUS', 'RESTAURANT#PRICES', 'SERVICE#GENERAL']
2. Constructing `*.csv` Files for Model Development
For models to easily process the dataset, I transformed the original `*.txt` files into `*.csv` form using the VLSP2018Parser class in vlsp2018_processor.py.
I already provided these `*.csv` files for both domains in the datasets folder. However, if you want to re-generate them, you can run the following command:
python processors/vlsp2018_processor.py
Each row in the `*.csv` contains a review and its corresponding `Aspect#Category,Polarity` labels. The value `1` indicates that the `Aspect#Category` exists in the review with a `Positive` label; likewise, `2` and `3` indicate `Negative` and `Neutral` labels, respectively. Finally, the value `0` indicates that the `Aspect#Category` does not exist in the review.
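As a rough illustration of this encoding (not the actual VLSP2018Parser code), the sketch below turns one review's labels into a row of integers. The helper name `encode_row`, the `Review` column name, the `POLARITY_TO_ID` mapping, and the 4-column subset are assumptions made for the example; only the meaning of the values 0 to 3 comes from the description above.

```python
# Hypothetical sketch of the 0/1/2/3 label encoding described above.
# Only the value meanings come from the README; all names are illustrative.
POLARITY_TO_ID = {"positive": 1, "negative": 2, "neutral": 3}

def encode_row(review, labels, aspect_categories):
    """Build one CSV-like row: the review plus one integer per Aspect#Category."""
    row = {"Review": review}
    for aspect_category in aspect_categories:
        polarity = labels.get(aspect_category)  # None if the pair is not mentioned
        row[aspect_category] = POLARITY_TO_ID[polarity] if polarity else 0
    return row

# A small subset of the Hotel columns, for brevity
columns = ["ROOMS#CLEANLINESS", "ROOMS#DESIGN&FEATURES", "SERVICE#GENERAL", "HOTEL#PRICES"]
row = encode_row(
    "Phòng sạch nhưng hơi nhỏ",
    {"ROOMS#CLEANLINESS": "positive", "ROOMS#DESIGN&FEATURES": "negative"},
    columns,
)
print(row)
```

Pairs that the review does not mention stay at 0, so every row has the same number of columns regardless of how many aspects were annotated.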
👉 I already provided the preprocessed data for this project in the datasets folder.
1. Vietnamese Preprocessing Steps for the VLSP 2018 ABSA Dataset
flowchart LR
style A fill:#ffccff,stroke:#660066,stroke-width:2px;
style B fill:#cceeff,stroke:#0066cc,stroke-width:2px;
style C fill:#ccffcc,stroke:#009933,stroke-width:2px;
style F fill:#ffcc99,stroke:#ff6600,stroke-width:2px;
style G fill:#ccccff,stroke:#6600cc,stroke-width:2px;
style H fill:#ccff99,stroke:#66cc00,stroke-width:2px;
style I fill:#ffcccc,stroke:#cc0000,stroke-width:2px;
A[/📄 Input Text/]
B([🔠 Lowercase])
subgraph C [VietnameseToneNormalizer]
direction TB
C1([🌀 Normalize\nUnicode])
C2([🖋️ Normalize\nSentence Typing])
C1 --> C2
end
subgraph E [VietnameseTextCleaner]
E1{{"<i class='fas fa-code'></i> Remove HTML"}}
E2{{"<i class='far fa-smile'></i> Remove Emoji"}}
E3{{"<i class='fas fa-link'></i> Remove URL"}}
E4{{"<i class='far fa-envelope'></i> Remove Email"}}
E5{{"<i class='fas fa-phone'></i> Remove Phone Number"}}
E6{{"<i class='fas fa-hashtag'></i> Remove Hashtags"}}
E7{{"<i class='fas fa-ban'></i> Remove Unnecessary Characters"}}
E1 --> E2 --> E3 --> E4 --> E5 --> E6 --> E7
end
F([💬 Normalize\nTeencode])
G([🛠️ Correct\nVietnamese Errors])
H([🔪 Word\nSegmentation])
I[/📄 Preprocessed Text/]
click G "https://huggingface.co/bmd1905/vietnamese-correction-v2"
click H "https://github.com/vncorenlp/VnCoreNLP"
A --> B --> C --> E --> E1
E --> F --> G --> E
F --> H --> I
2. The vietnamese_processor.py
I implemented 3 classes in vietnamese_processor.py to preprocess raw Vietnamese text data. This is my improved version of the work by behitek:
(a) VietnameseTextCleaner: Simple regex-based text cleaning to remove HTML, Emoji, URL, Email, Phone Number, Hashtags, and other unnecessary characters.
(b) VietnameseToneNormalizer: Normalize Unicode (eg., 'ờ' != 'ờ'
) and sentence typing (e.g., lựơng => lượng, thỏai mái => thoải mái).
(c) VietnameseTextPreprocessor: Combine the above classes and add the following steps to the pipeline:
- normalize_teencodes(text: str):
  - Convert teencodes to their original form.
  - I also provided the `extra_teencodes` parameter to add your own teencode definitions based on the dataset used. The `extra_teencodes` must be a dict with keys as the original form and values as a list of teencodes.
  - You should be careful when using single-word replacements for teencodes, because they can cause misinterpretation. For example, `'giá': ['price', 'gia']` can replace the word `'gia'` in `'gia đình'`, turning it into `'giá đình'`.
- correct_vietnamese_errors(texts: List):
  - Use the pre-trained model by bmd1905 to correct Vietnamese errors.
  - The inference time for this model is quite slow, so I implemented this method to process the texts in batches. That's why you should pass a list of texts as input.
- word_segment(text: str):
  - Use VnCoreNLP to segment Vietnamese words.
  - This tool is chosen because: "PhoBERT employed the RDRSegmenter from VnCoreNLP to pre-process the pre-training data".
  - I already implemented a script to automatically download the necessary components of this tool into the VnCoreNLP folder, so you don't need to do anything.
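Two of the points above can be reproduced with the standard library alone. The sketch below is not the code in vietnamese_processor.py: `replace_teencodes` is a hypothetical helper that assumes whole-word regex matching, and the Unicode part only shows why normalization is needed, using Python's `unicodedata` module.

```python
import re
import unicodedata

# 1) Unicode normalization: the same visible character can be stored as one
#    precomposed code point or as a base letter plus combining marks.
composed = "\u1edd"                                   # 'ờ' as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # 'o' + combining marks
print(composed == decomposed, len(composed), len(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)

# 2) Teencode pitfall: mapping the single word 'gia' to 'giá' also rewrites
#    'gia đình' (family), even when only whole words are matched.
def replace_teencodes(text, teencodes):
    """Replace whole-word teencodes; `teencodes` maps original form -> variants."""
    for original, variants in teencodes.items():
        for variant in variants:
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            text = re.sub(pattern, original, text)
    return text

risky = replace_teencodes("gia đình tôi", {"giá": ["price", "gia"]})
safe = replace_teencodes("ks gần biển", {"khách sạn": ["ks"]})
print(risky)
print(safe)
```

The second mapping is safe because `ks` is not a Vietnamese word on its own, whereas `gia` is, which is why single-word entries in `extra_teencodes` deserve a second look.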
Example Usage
from processors.vietnamese_processor import VietnameseTextPreprocessor
extra_teencodes = {
'khách sạn': ['ks'], 'nhà hàng': ['nhahang'], 'nhân viên': ['nv'],
'cửa hàng': ['store', 'sop', 'shopE', 'shop'],
'sản phẩm': ['sp', 'product'], 'hàng': ['hàg'],
'giao hàng': ['ship', 'delivery', 'síp'], 'đặt hàng': ['order'],
'chuẩn chính hãng': ['authentic', 'aut', 'auth'], 'hạn sử dụng': ['date', 'hsd'],
'điện thoại': ['dt'], 'facebook': ['fb', 'face'],
'nhắn tin': ['nt', 'ib'], 'trả lời': ['tl', 'trl', 'rep'],
'feedback': ['fback', 'fedback'], 'sử dụng': ['sd'], 'xài': ['sài'],
}
preprocessor = VietnameseTextPreprocessor(vncorenlp_dir='./VnCoreNLP', extra_teencodes=extra_teencodes, max_correction_length=512)
sample_texts = [
'Ga giường không sạch, nhân viên quên dọn phòng một ngày. Chất lựơng "ko" đc thỏai mái 😔',
'Cám ơn Chudu24 rất nhiềuGia đình tôi có 1 kỳ nghỉ vui vẻ.Resort Bình Minh nằm ở vị trí rất đẹp, theo đúng tiêu chuẩn, còn về ăn sáng thì wa dở, chỉ có 2,3 món để chọn',
'Giá cả hợp líĂn uống thoả thíchGiữ xe miễn phíKhông gian bờ kè thoáng mát Có phòng máy lạnhMỗi tội lúc quán đông thì đợi hơi lâu',
'May lần trước ăn mì k hà, hôm nay ăn thử bún bắp bò. Có chả tôm viên ăn lạ lạ. Tôm thì k nhiều, nhưng vẫn có tôm thật ở nhân bên trong. ',
'Ngồi ăn Cơm nhà *tiền thân là quán Bão* Phần vậy là 59k nha. Trưa từ 10h-14h, chiều từ 16h-19h. À,có sữa hạt sen ngon lắmm. #food #foodpic #foodporn #foodholic #yummy #deliciuous'
]
preprocessed_texts = preprocessor.process_batch(sample_texts, correct_errors=True)
preprocessor.close_vncorenlp()
print(preprocessed_texts)
According to the original BERT paper, the feature-based approach achieved its best results when concatenating the last 4 hidden layers of BERT. So we applied that method to the PhoBERT layer in our model architectures and combined it with the 2 output construction approaches below, ACSA-v1 and ACSA-v2, to form the final solutions.
👉 Notebook Solutions: Hotel-v1.ipynb, Restaurant-v1.ipynb
We transformed each `Aspect#Category` pair and its corresponding `Polarity` label in each dataset's review into a list of `C` one-hot vectors, where `C` is the number of `Aspect#Category` pairs:
- Each vector has 3 polarity labels, `Positive`, `Negative`, `Neutral`, and 1 `None` label to indicate whether or not the input has this `Aspect#Category` so that it can have a polarity. The label that applies is set to `1`, and the others to `0`.
- Therefore, we need to create `C` Dense layers with 4 neurons each to predict the polarity of the corresponding `Aspect#Category` pair.
- The Softmax function will be applied here to get the probability distribution over the 4 polarity classes.
However, we will not simply feedforward the learned feature to each Dense layer one-by-one. Instead, we will concatenate them into a single Dense layer consisting of:
- 34 `Aspect#Categories` × 4 `Polarities` = 136 neurons for the Hotel domain.
- 12 `Aspect#Categories` × 4 `Polarities` = 48 neurons for the Restaurant domain.

Finally, the `binary_crossentropy` loss function will be applied to treat each Dense layer in the final Concatenated Dense layer as a binary classification problem.
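To make the target layout concrete, here is a minimal sketch of how per-aspect label ids could be flattened into the `C` × 4 vector that the concatenated layer is trained against. The function name and the slot order inside each block (`None`, `Positive`, `Negative`, `Neutral`, chosen to match the 0/1/2/3 values of the `*.csv` files) are assumptions, not taken from the notebooks.

```python
# Hypothetical sketch: flatten per-aspect label ids into a C x 4 one-hot target.
# Assumed slot order inside each block: [None, Positive, Negative, Neutral],
# i.e. the same ids as the 0/1/2/3 values in the *.csv files.
NUM_POLARITIES = 4

def to_flat_one_hot(label_ids):
    """label_ids: one integer in {0, 1, 2, 3} per Aspect#Category."""
    target = []
    for label_id in label_ids:
        block = [0] * NUM_POLARITIES
        block[label_id] = 1          # exactly one active slot per block
        target.extend(block)
    return target

# 3 aspect categories for brevity: Positive, absent, Negative
target = to_flat_one_hot([1, 0, 2])
print(target)

# Output sizes for the two domains (34 and 12 Aspect#Category pairs)
hotel_size = len(to_flat_one_hot([0] * 34))
restaurant_size = len(to_flat_one_hot([0] * 12))
print(hotel_size, restaurant_size)
```

Each block of 4 always contains exactly one 1, which is what later lets a per-block Softmax be compared against it.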
In this ACSA problem, each `Aspect#Category,Polarity` can represent an independent binary classification task (Is this `Aspect#Category` `Positive` or not? Is this `Aspect#Category` `Negative` or not? etc.).
So you might wonder why, instead of treating each `Aspect#Category,Polarity` as a separate output neuron with Sigmoid, we one-hot encoded them within a single 4-neuron block for each `Aspect#Category` and used Softmax. The key issue here is that the polarities within an `Aspect#Category` are not entirely independent. For example:
- If the `Aspect#Category` is strongly `Positive`, it's less likely to be `Negative` or `Neutral`.
- If the `Aspect#Category` is very `Negative`, it's less likely to be `Positive` or `Neutral`.

Using separate Sigmoids doesn't inherently capture this relationship. You could end up with outputs like: `Positive`=0.9, `Negative`=0.8, `Neutral`=0.7. This doesn't make sense because the polarities should be mutually exclusive and their probabilities should sum to 1, which is what Softmax enforces.
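The difference is easy to see numerically. The sketch below applies both activations to the same made-up logits for one `Aspect#Category` block; the logit values and the slot order (`None`, `Positive`, `Negative`, `Neutral`) are illustrative assumptions.

```python
import math

def sigmoid(x):
    """Squash one logit into (0, 1) independently of the others."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one block: [None, Positive, Negative, Neutral]
logits = [-1.0, 2.2, 1.4, 0.85]

independent = [sigmoid(x) for x in logits]     # each neuron scored on its own
exclusive = softmax(logits)                    # neurons compete within the block

print([round(p, 2) for p in independent], round(sum(independent), 2))
print([round(p, 2) for p in exclusive], round(sum(exclusive), 2))
```

With Sigmoid, several polarities can be confidently "on" at once and the scores do not sum to 1; with Softmax, raising one polarity necessarily lowers the others.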
The Concatenation mixes the independent `Aspect#Category,Polarity` information and allows the network to learn complex/shared relationships between them. For example, if the model sees that `HOTEL#CLEANLINESS` is `Positive`, it might be more likely to predict `HOTEL#QUALITY` as `Positive` as well.
When using this Concatenation, the `binary_crossentropy` will be applied to each output independently and the Softmax constraint is maintained during forward and backward passes for each `Aspect#Category`. This approach not only allows the model to learn to predict multiple `Aspect#Category,Polarity` simultaneously as binary classification problems but also maintains the mutual exclusivity of the 4 polarities within each `Aspect#Category`.
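A plain-Python sketch of this combination is given below: Softmax is applied per 4-neuron block, binary cross-entropy is averaged over all output neurons, and predictions are read back by taking the most probable slot of each block. It is not the Keras code from the notebooks; the function names, the slot order, the argmax-based decoding, and the logits are assumptions for illustration.

```python
import math

POLARITIES = ["none", "positive", "negative", "neutral"]  # assumed slot order

def blockwise_softmax(logits, block_size=4):
    """Apply Softmax separately to each consecutive block of logits."""
    probs = []
    for start in range(0, len(logits), block_size):
        block = logits[start:start + block_size]
        peak = max(block)
        exps = [math.exp(x - peak) for x in block]
        total = sum(exps)
        probs.extend(e / total for e in exps)
    return probs

def binary_crossentropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over all output neurons."""
    losses = []
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)        # avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

def decode(probs, aspect_categories, block_size=4):
    """Pick the most probable slot per block and drop 'none' predictions."""
    results = []
    for i, name in enumerate(aspect_categories):
        block = probs[i * block_size:(i + 1) * block_size]
        best = block.index(max(block))
        if POLARITIES[best] != "none":
            results.append(f"{name},{POLARITIES[best]}")
    return results

aspects = ["ROOMS#CLEANLINESS", "HOTEL#PRICES", "ROOMS#DESIGN&FEATURES"]
logits = [-2.0, 3.0, 0.1, 0.2, 4.0, -1.0, -1.0, -1.0, -1.5, 0.0, 2.5, 0.3]
targets = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]

probs = blockwise_softmax(logits)
predictions = decode(probs, aspects)
loss = binary_crossentropy(targets, probs)
print(predictions)
print(round(loss, 4))
```

Because every block is normalized on its own, each `Aspect#Category` yields at most one polarity, while the loss still sees all `C` × 4 neurons at once.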
Reference (Vietnamese): https://phamdinhkhanh.github.io/2020/04/22/MultitaskLearning.html
👉 Notebook Solutions: Hotel-v2.ipynb, Restaurant-v2.ipynb
The only difference between this approach and the one above is that it branches into many sub-models by using `C` Dense layers (34 for Hotel and 12 for Restaurant) without concatenating them into a single one. Each sub-model predicts its own task independently, not sharing parameters with the others.
The Softmax function is applied here to get the probability distribution over the 4 polarity classes directly without converting them into one-hot vectors. Therefore, the `categorical_crossentropy` loss function will be used to treat each Dense layer as a multi-class classification problem.
Reference (Vietnamese): https://phamdinhkhanh.github.io/2020/05/05/MultitaskLearning_MultiBranch.html
1. Evaluation on the VLSP 2018 ABSA Dataset
VLSP has its own Java evaluation script for its ACSA tasks. You have to prepare 2 files:
- The ground-truth file: 3-VLSP2018-SA-Hotel-test.txt for the Hotel domain and 3-VLSP2018-SA-Restaurant-test.txt for the Restaurant domain.
- The predicted file that has the same format as the ground-truth file. You can find the example predictions of the models in the experiments/predictions folder.
I already provided a script to run the evaluation for each domain and approach. You can run the following command to get the evaluation results:
source ./evaluators/vlsp_evaluate.sh
Task | Method | Hotel Precision | Hotel Recall | Hotel F1-score | Restaurant Precision | Restaurant Recall | Restaurant F1-score |
---|---|---|---|---|---|---|---|
Aspect#Category | VLSP best submission | 76.00 | 66.00 | 70.00 | 79.00 | 76.00 | 77.00 |
Aspect#Category | Bi-LSTM+CNN | 84.03 | 72.52 | 77.85 | 82.02 | 77.51 | 79.70 |
Aspect#Category | BERT-based Hierarchical | - | - | 82.06 | - | - | 84.23 |
Aspect#Category | Multi-task | 87.45 | 78.17 | 82.55 | 81.09 | 85.61 | 83.29 |
Aspect#Category | Multi-task Multi-branch | 63.21 | 57.86 | 60.42 | 80.81 | 87.39 | 83.97 |
Aspect#Category,Polarity | VLSP best submission | 66.00 | 57.00 | 61.00 | 62.00 | 60.00 | 61.00 |
Aspect#Category,Polarity | Bi-LSTM+CNN | 76.53 | 66.04 | 70.90 | 66.66 | 63.00 | 64.78 |
Aspect#Category,Polarity | BERT-based Hierarchical | - | - | 74.69 | - | - | 71.30 |
Aspect#Category,Polarity | Multi-task | 81.90 | 73.22 | 77.32 | 69.66 | 73.54 | 71.55 |
Aspect#Category,Polarity | Multi-task Multi-branch | 57.55 | 52.67 | 55.00 | 68.69 | 74.29 | 71.38 |
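The F1-scores in the table are consistent with the usual harmonic mean of precision and recall. The sketch below is not the VLSP Java script: it assumes that predictions and ground truth are compared as sets of labels per review and micro-averaged over all reviews, and the helper names and the toy labels are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_prf(gold, predicted):
    """gold / predicted: one set of labels per review, in the same order."""
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    num_predicted = sum(len(p) for p in predicted)
    num_gold = sum(len(g) for g in gold)
    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_gold if num_gold else 0.0
    return precision, recall, f1_score(precision, recall)

# Multi-task method, Hotel domain, Aspect#Category task from the table above
hotel_f1 = f1_score(87.45, 78.17)
print(round(hotel_f1, 2))

# Toy example with two reviews
gold = [{"ROOMS#CLEANLINESS,positive", "SERVICE#GENERAL,positive"}, {"HOTEL#PRICES,negative"}]
pred = [{"ROOMS#CLEANLINESS,positive"}, {"HOTEL#PRICES,negative", "HOTEL#GENERAL,positive"}]
precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)
```

For the `Aspect#Category` task alone, the same functions could be used after stripping the polarity from each label.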
The predictions in the experiments/predictions folder and the evaluation results in the evaluators folder were obtained from older models I built a couple of years ago.
I finished the paper on this project in 2021, so the above results come from the experiments I conducted at that time, which can be found at this e8439bc commit. Some things to note if you want to re-run the notebooks in that commit to reproduce the above results:
- You can download the weights for each model here.
- As the notebooks in this commit are deprecated, you may face some issues when running them. For example, when calling the `create_model` function, you will face the following error when initializing the input layer:
<class 'keras.src.backend.common.keras_tensor.KerasTensor'> is not allowed only (<class 'tensorflow.python.framework.tensor.Tensor'> ...)
This error occurs because the PhoBERT model in the current huggingface version does not support `KerasTensor` input in the notebook's version of TensorFlow/Keras. There are 2 ways to fix this:
- Downgrade the version of TensorFlow to nearly the same as when I did this project, around `2.10`.
- Use TensorFlow's Subclassing by creating your own model class, which inherits from `keras.Model`. This is how I fixed that issue in this latest update.