Aspect-based Sentiment Analysis for Vietnamese

Publication: https://ieeexplore.ieee.org/document/9865479
Demo: https://youtu.be/ggmAvpA4oHQ

Enter your sentence: Lễ tân thân thiện, có thang máy, vị trí ks thuận tiện, view thành phố rất đẹp. Phòng sạch nhưng hơi nhỏ & thiếu bình đun siêu tốc. Sẽ quay lại & giới thiệu bạn bè
=> FACILITIES#DESIGN&FEATURES,positive
=> LOCATION#GENERAL,positive
=> ROOM_AMENITIES#DESIGN&FEATURES,negative
=> ROOMS#CLEANLINESS,positive
=> ROOMS#DESIGN&FEATURES,negative
=> SERVICE#GENERAL,positive

I. Introduction
II. The VLSP 2018 Aspect-based Sentiment Analysis Dataset
- 1. Dataset Overview
- 2. Constructing *.csv Files for Model Development
III. Vietnamese Preprocessing
- 1. Vietnamese Preprocessing Steps for the VLSP 2018 ABSA Dataset
- 2. The vietnamese_processor.py
IV. Model Development
- ACSA-v1. Multi-task Approach
- ACSA-v2. Multi-task with Multi-branch Approach
V. Experimental Results
- 1. Evaluation on the VLSP 2018 ABSA Dataset
- 2. Some Notes about the Results

I. Introduction

This work addresses the Aspect-based Sentiment Analysis (ABSA) problem for Vietnamese. Specifically, we focus on 2 sub-tasks of Aspect Category Sentiment Analysis (ACSA):

Aspect Category Detection (ACD): Detect Aspect#Category pairs in each review (e.g., HOTEL#CLEANLINESS, RESTAURANT#PRICES, SERVICE#GENERAL, etc.)
Sentiment Polarity Classification (SPC): Classify the Sentiment Polarity (Positive, Negative, Neutral) of each Aspect#Category pair.

We propose 2 end-to-end solutions (ACSA-v1 and ACSA-v2) that use PhoBERT, a pre-trained language model for Vietnamese, to handle both tasks simultaneously on the 2 domains of the VLSP 2018 ABSA Dataset: Hotel and Restaurant.

II. The VLSP 2018 Aspect-based Sentiment Analysis Dataset

1. Dataset Overview

Domain	Dataset	No. Reviews	No. Aspect#Category,Polarity	Avg. Length	Vocab Size	No. words in Test/Dev not in Training set
Hotel	Training	3,000	13,948	47	3,908	-
Hotel	Dev	2,000	7,111	23	2,745	1,059
Hotel	Test	600	2,584	30	1,631	346
Restaurant	Training	2,961	9,034	54	5,168	-
Restaurant	Dev	1,290	3,408	50	3,398	1,702
Restaurant	Test	500	2,419	163	3,375	1,729

The Hotel domain consists of the following 34 Aspect#Category pairs:

['FACILITIES#CLEANLINESS', 'FACILITIES#COMFORT', 'FACILITIES#DESIGN&FEATURES', 'FACILITIES#GENERAL', 'FACILITIES#MISCELLANEOUS', 'FACILITIES#PRICES', 'FACILITIES#QUALITY', 'FOOD&DRINKS#MISCELLANEOUS', 'FOOD&DRINKS#PRICES', 'FOOD&DRINKS#QUALITY', 'FOOD&DRINKS#STYLE&OPTIONS', 'HOTEL#CLEANLINESS', 'HOTEL#COMFORT', 'HOTEL#DESIGN&FEATURES', 'HOTEL#GENERAL', 'HOTEL#MISCELLANEOUS', 'HOTEL#PRICES', 'HOTEL#QUALITY', 'LOCATION#GENERAL', 'ROOMS#CLEANLINESS', 'ROOMS#COMFORT', 'ROOMS#DESIGN&FEATURES', 'ROOMS#GENERAL', 'ROOMS#MISCELLANEOUS', 'ROOMS#PRICES', 'ROOMS#QUALITY', 'ROOM_AMENITIES#CLEANLINESS', 'ROOM_AMENITIES#COMFORT', 'ROOM_AMENITIES#DESIGN&FEATURES', 'ROOM_AMENITIES#GENERAL', 'ROOM_AMENITIES#MISCELLANEOUS', 'ROOM_AMENITIES#PRICES', 'ROOM_AMENITIES#QUALITY', 'SERVICE#GENERAL']

The Restaurant domain consists of the following 12 Aspect#Category pairs:

['AMBIENCE#GENERAL', 'DRINKS#PRICES', 'DRINKS#QUALITY', 'DRINKS#STYLE&OPTIONS', 'FOOD#PRICES', 'FOOD#QUALITY', 'FOOD#STYLE&OPTIONS', 'LOCATION#GENERAL', 'RESTAURANT#GENERAL', 'RESTAURANT#MISCELLANEOUS', 'RESTAURANT#PRICES', 'SERVICE#GENERAL']

2. Constructing `*.csv` Files for Model Development

To make the dataset easier for models to process, I transformed the original *.txt files into *.csv form using the VLSP2018Parser class in vlsp2018_processor.py. These *.csv files are already provided for both domains in the datasets folder. To re-generate them, run the following command:

python processors/vlsp2018_processor.py

Each row in the *.csv files contains a review and its corresponding Aspect#Category,Polarity labels. The value 1 means the Aspect#Category appears in the review with a Positive label; 2 and 3 mean it appears with a Negative or Neutral label, respectively. The value 0 means the Aspect#Category does not appear in the review.
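
The label encoding above can be read back into Aspect#Category,Polarity pairs with a few lines of Python. This is a minimal sketch on a toy CSV: the column name `Review` and the three-column layout are assumptions for illustration, not the exact header of the files in the datasets folder.

```python
import csv
import io

# Label encoding described above: 0 = absent, 1 = positive, 2 = negative, 3 = neutral.
POLARITY_BY_VALUE = {1: "positive", 2: "negative", 3: "neutral"}


def decode_row(row):
    """Return the (Aspect#Category, polarity) pairs present in one CSV row."""
    pairs = []
    for column, value in row.items():
        if column == "Review":  # hypothetical name of the text column
            continue
        label = int(value)
        if label != 0:
            pairs.append((column, POLARITY_BY_VALUE[label]))
    return pairs


# Toy CSV with three Hotel columns; the real files have one column per pair.
toy_csv = io.StringIO(
    "Review,LOCATION#GENERAL,ROOMS#CLEANLINESS,ROOMS#DESIGN&FEATURES\n"
    "Phòng sạch nhưng hơi nhỏ,0,1,2\n"
)
rows = list(csv.DictReader(toy_csv))
decoded = decode_row(rows[0])
print(decoded)
```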

III. Vietnamese Preprocessing

👉 I already provided the preprocessed data for this project in the datasets folder.

1. Vietnamese Preprocessing Steps for the VLSP 2018 ABSA Dataset

flowchart LR
    style A fill:#ffccff,stroke:#660066,stroke-width:2px;
    style B fill:#cceeff,stroke:#0066cc,stroke-width:2px;
    style C fill:#ccffcc,stroke:#009933,stroke-width:2px;
    style F fill:#ffcc99,stroke:#ff6600,stroke-width:2px;
    style G fill:#ccccff,stroke:#6600cc,stroke-width:2px;
    style H fill:#ccff99,stroke:#66cc00,stroke-width:2px;
    style I fill:#ffcccc,stroke:#cc0000,stroke-width:2px;

    A[/📄 Input Text/]
    B([🔠 Lowercase])

    subgraph C [VietnameseToneNormalizer]
        direction TB
        C1([🌀 Normalize\nUnicode])
        C2([🖋️ Normalize\nSentence Typing])
        C1 --> C2
    end  

    subgraph E [VietnameseTextCleaner]
        E1{{"<i class='fas fa-code'></i> Remove HTML"}}
        E2{{"<i class='far fa-smile'></i> Remove Emoji"}}
        E3{{"<i class='fas fa-link'></i> Remove URL"}}
        E4{{"<i class='far fa-envelope'></i> Remove Email"}}
        E5{{"<i class='fas fa-phone'></i> Remove Phone Number"}}
        E6{{"<i class='fas fa-hashtag'></i> Remove Hashtags"}}
        E7{{"<i class='fas fa-ban'></i> Remove Unnecessary Characters"}}
        E1 --> E2 --> E3 --> E4 --> E5 --> E6 --> E7 
    end

    F([💬 Normalize\nTeencode])
    G([🛠️ Correct\nVietnamese Errors])
    H([🔪 Word\nSegmentation])
    I[/📄 Preprocessed Text/]

    click G "https://huggingface.co/bmd1905/vietnamese-correction-v2"
    click H "https://github.com/vncorenlp/VnCoreNLP"
    
    A --> B --> C --> E --> E1
    E --> F --> G --> E 
    F --> H --> I

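
The cleaning stage of the flowchart can be sketched as a chain of regex substitutions applied in the listed order (HTML, emoji, URL, email, phone number, hashtags, unnecessary characters). Every pattern below is a simplified assumption made for illustration; the actual expressions in VietnameseTextCleaner may differ.

```python
import re

# Hypothetical, simplified patterns applied in the order shown in the flowchart.
CLEANING_STEPS = [
    ("html", re.compile(r"<[^>]+>")),
    ("emoji", re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")),
    ("url", re.compile(r"https?://\S+|www\.\S+")),
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("phone", re.compile(r"(?<!\d)(?:\+84|0)\d{9}(?!\d)")),
    ("hashtag", re.compile(r"#\w+")),
]
# Anything that is not a letter, digit, whitespace, or basic punctuation.
UNNECESSARY = re.compile(r"[^\w\s.,!?%&/-]")


def clean_text(text):
    """Run each cleaning step in order, then collapse repeated whitespace."""
    for _name, pattern in CLEANING_STEPS:
        text = pattern.sub(" ", text)
    text = UNNECESSARY.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Great <b>view</b> 😔 see https://example.com or mail a@b.com or call 0912345678 #food"
print(clean_text(sample))
```

URLs are removed before hashtags so that a `#fragment` inside a link is dropped together with the link rather than leaving half of it behind.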

2. The vietnamese_processor.py

I implemented 3 classes in vietnamese_processor.py to preprocess raw Vietnamese text data. They are an improved version of the work by behitek:

(a) VietnameseTextCleaner: Simple regex-based text cleaning to remove HTML, Emoji, URL, Email, Phone Number, Hashtags, and other unnecessary characters.

(b) VietnameseToneNormalizer: Normalize Unicode (eg., 'ờ' != 'ờ') and sentence typing (eg., lựơng => lượng, thỏai mái => thoải mái).

(c) VietnameseTextPreprocessor:

Combines the above classes and adds the following steps to the pipeline:

normalize_teencodes(text: str):
- Convert teencodes to their original forms.
- I also provided the extra_teencodes parameter to add your own teencode definitions based on the dataset used. The extra_teencodes must be a dict with keys as the original form and values as a list of teencodes.
- You should be careful when using single word replacement for teencodes, because it can cause misinterpretation. For example, 'giá': ['price', 'gia'] can replace the word 'gia' in 'gia đình', making it become 'giá đình'.
correct_vietnamese_errors(texts: List):
- Use the pre-trained model by bmd1905 to correct Vietnamese errors.
- Inference with this model is quite slow, so I implemented this method to process texts in batches. That's why you should pass a list of texts as input.
word_segment(text: str):
- Use VnCoreNLP to segment Vietnamese words.
- This tool is chosen because: "PhoBERT employed the RDRSegmenter from VnCoreNLP to pre-process the pre-training data".
- I already implemented a script that automatically downloads the necessary components of this tool into the VnCoreNLP folder, so you don't need to do anything.
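
Two of the steps above are easy to illustrate in isolation: Unicode normalization and teencode replacement. The sketch below is not the code from vietnamese_processor.py. It assumes NFC as the target form and whole-word regex matching, and the function names are hypothetical.

```python
import re
import unicodedata


def normalize_unicode(text):
    """Compose decomposed characters so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)


def normalize_teencodes(text, teencodes):
    """Replace whole-word teencodes; `teencodes` maps original form -> list of codes."""
    lookup = {code: original for original, codes in teencodes.items() for code in codes}
    if not lookup:
        return text
    # Longest codes first so a longer code is tried before any shorter one it contains.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(map(re.escape, ordered)) + r")(?!\w)")
    return pattern.sub(lambda match: lookup[match.group(0)], text)


# 'o' + combining horn + combining grave vs. the single precomposed code point.
decomposed, composed = "o\u031b\u0300", "\u1edd"
print(decomposed == composed, normalize_unicode(decomposed) == composed)

print(normalize_teencodes("ks gần biển", {"khách sạn": ["ks"]}))
# Whole-word matching does not prevent the pitfall described above:
print(normalize_teencodes("gia đình", {"giá": ["gia"]}))
```

Whole-word matching keeps `ks` inside a longer word such as `thanks` untouched, but a single-word code like `gia` is still replaced wherever it stands alone, which is why such entries need care.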

Example Usage

from processors.vietnamese_processor import VietnameseTextPreprocessor
extra_teencodes = { 
    'khách sạn': ['ks'], 'nhà hàng': ['nhahang'], 'nhân viên': ['nv'],
    'cửa hàng': ['store', 'sop', 'shopE', 'shop'], 
    'sản phẩm': ['sp', 'product'], 'hàng': ['hàg'],
    'giao hàng': ['ship', 'delivery', 'síp'], 'đặt hàng': ['order'], 
    'chuẩn chính hãng': ['authentic', 'aut', 'auth'], 'hạn sử dụng': ['date', 'hsd'],
    'điện thoại': ['dt'],  'facebook': ['fb', 'face'],  
    'nhắn tin': ['nt', 'ib'], 'trả lời': ['tl', 'trl', 'rep'], 
    'feedback': ['fback', 'fedback'], 'sử dụng': ['sd'], 'xài': ['sài'], 
}

preprocessor = VietnameseTextPreprocessor(vncorenlp_dir='./VnCoreNLP', extra_teencodes=extra_teencodes, max_correction_length=512)
sample_texts = [
    'Ga giường không sạch, nhân viên quên dọn phòng một ngày. Chất lựơng "ko" đc thỏai mái 😔',
    'Cám ơn Chudu24 rất nhiềuGia đình tôi có 1 kỳ nghỉ vui vẻ.Resort Bình Minh nằm ở vị trí rất đẹp, theo đúng tiêu chuẩn, còn về ăn sáng thì wa dở, chỉ có 2,3 món để chọn',
    'Giá cả hợp líĂn uống thoả thíchGiữ xe miễn phíKhông gian bờ kè thoáng mát Có phòng máy lạnhMỗi tội lúc quán đông thì đợi hơi lâu',
    'May lần trước ăn mì k hà, hôm nay ăn thử bún bắp bò. Có chả tôm viên ăn lạ lạ. Tôm thì k nhiều, nhưng vẫn có tôm thật ở nhân bên trong. ',
    'Ngồi ăn Cơm nhà *tiền thân là quán Bão* Phần vậy là 59k nha. Trưa từ 10h-14h, chiều từ 16h-19h. À,có sữa hạt sen ngon lắmm. #food #foodpic #foodporn #foodholic #yummy #deliciuous'
]
preprocessed_texts = preprocessor.process_batch(sample_texts, correct_errors=True)
preprocessor.close_vncorenlp()
print(preprocessed_texts)

IV. Model Development

According to the original BERT paper, concatenating the last 4 hidden layers of BERT gave the best results in its feature-based experiments. We therefore applied that method to the PhoBERT layer in our model architectures and combined it with the 2 output construction methods below, ACSA-v1 and ACSA-v2, to form the final solutions.
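
The concatenation of the last 4 layers can be sketched with plain Python lists standing in for the encoder's hidden states. Taking the first token's vector from each layer and the concatenation order are assumptions made for this illustration; the notebooks may pool differently.

```python
def concat_last_layers(hidden_states, num_layers=4, token_index=0):
    """Concatenate one token's vector from the last `num_layers` encoder layers.

    hidden_states: list over layers, each a [seq_len][hidden_size] nested list
    for a single input sequence.
    """
    if len(hidden_states) < num_layers:
        raise ValueError("not enough layers to concatenate")
    features = []
    for layer in hidden_states[-num_layers:]:
        features.extend(layer[token_index])
    return features


# Toy stand-in: 13 layers (embeddings + 12 blocks), 3 tokens, hidden size 2.
# Each vector is [layer index, token index] so the result is easy to read.
toy_hidden_states = [
    [[float(layer), float(token)] for token in range(3)] for layer in range(13)
]
features = concat_last_layers(toy_hidden_states)
print(len(features), features)
```

The feature size is `num_layers` times the hidden size, so it is 8 in this toy setup and would be 4 × 768 = 3072 for an encoder with hidden size 768.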

ACSA-v1. Multi-task Approach:

👉 Notebook Solutions: Hotel-v1.ipynb, Restaurant-v1.ipynb

1. Output Construction

We transformed each Aspect#Category pair and their corresponding Polarity labels in each dataset's review into a list of C one-hot vectors, where C is the number of Aspect#Category pairs:

Each vector has 3 polarity labels (Positive, Negative, Neutral) and 1 None label, which indicates that the input does not contain this Aspect#Category and therefore has no polarity for it. The label that applies is 1 and the others are 0.
Therefore, we need to create C Dense layers with 4 neurons each to predict the polarity of the corresponding Aspect#Category pair.
The Softmax function is applied here to get the probability distribution over the 4 polarity classes.

However, we will not simply feed the learned features to each Dense layer one by one. Instead, we will concatenate them into a single Dense layer consisting of:

34 Aspect#Categories × 4 Polarities = 136 neurons for the Hotel domain.
12 Aspect#Categories × 4 Polarities = 48 neurons for the Restaurant domain.

Finally, the binary_crossentropy loss function is applied, treating each output in the final concatenated Dense layer as a binary classification problem.
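
The target construction described above can be sketched as follows for the Restaurant domain. The order of the 4 slots inside each block (Positive, Negative, Neutral, None) is an assumption for illustration; the notebooks may order them differently.

```python
RESTAURANT_PAIRS = [
    "AMBIENCE#GENERAL", "DRINKS#PRICES", "DRINKS#QUALITY", "DRINKS#STYLE&OPTIONS",
    "FOOD#PRICES", "FOOD#QUALITY", "FOOD#STYLE&OPTIONS", "LOCATION#GENERAL",
    "RESTAURANT#GENERAL", "RESTAURANT#MISCELLANEOUS", "RESTAURANT#PRICES",
    "SERVICE#GENERAL",
]
# Assumed slot order inside each 4-neuron block.
POLARITY_INDEX = {"positive": 0, "negative": 1, "neutral": 2, "none": 3}


def encode_labels(aspect_categories, review_labels):
    """Build the flattened target: one 4-slot one-hot block per Aspect#Category."""
    target = []
    for pair in aspect_categories:
        block = [0] * len(POLARITY_INDEX)
        # Pairs that the review does not mention fall into the None slot.
        block[POLARITY_INDEX[review_labels.get(pair, "none")]] = 1
        target.extend(block)
    return target


target = encode_labels(
    RESTAURANT_PAIRS,
    {"FOOD#QUALITY": "positive", "SERVICE#GENERAL": "negative"},
)
print(len(target), sum(target))
```

The vector has 12 × 4 = 48 entries and exactly one 1 per block, so its sum always equals the number of Aspect#Category pairs.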

2. Why use one-hot encoding and Softmax?

In this ACSA problem, each Aspect#Category,Polarity can represent an independent binary classification task (Is this Aspect#Category Positive or not?, Is this Aspect#Category Negative or not?, etc.).

You might wonder why, instead of treating each Aspect#Category,Polarity as a separate output neuron with Sigmoid, we one-hot encoded them within a single 4-neuron block per Aspect#Category and used Softmax. The key issue is that the polarities within an Aspect#Category are not entirely independent. For example:

If the Aspect#Category is strongly Positive, it's less likely to be Negative or Neutral.
If the Aspect#Category is very Negative, it's less likely to be Positive or Neutral.

Using separate Sigmoids doesn't inherently capture this relationship. You could end up with outputs like: Positive=0.9, Negative=0.8, Neutral=0.7. This doesn't make sense because the polarities should be mutually exclusive and the sum of the probabilities should be 1, which is what Softmax does.
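
The difference is easy to check numerically. In this sketch the logits are made-up values chosen so that the Sigmoid outputs land near the 0.9, 0.8, 0.7 example above; the 4 slots stand for Positive, Negative, Neutral, and None.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def softmax(logits):
    # Subtract the maximum for numerical stability.
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for one Aspect#Category block: Positive, Negative, Neutral, None.
logits = [2.2, 1.4, 0.85, -3.0]

independent = [sigmoid(value) for value in logits]
exclusive = softmax(logits)

print([round(p, 2) for p in independent], round(sum(independent), 2))
print([round(p, 2) for p in exclusive], round(sum(exclusive), 2))
```

With Sigmoid, the first three scores are all high at once and their sum exceeds 1. With Softmax, the same logits are forced into a single distribution in which Positive remains the largest entry.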

3. Why concat each `Aspect#Category` into 1 Dense layer and apply `binary_crossentropy`?

The Concatenation mixes the independent Aspect#Category,Polarity information and allows the network to learn complex/shared relationships between them. For example, if the model sees that HOTEL#CLEANLINESS is Positive, it might be more likely to predict HOTEL#QUALITY as Positive as well.

When using this Concatenation, the binary_crossentropy will be applied to each output independently and the Softmax constraint is maintained during forward and backward passes for each Aspect#Category. This approach not only allows the model to learn to predict multiple Aspect#Category,Polarity simultaneously as binary classification problems but also maintains the mutual exclusivity of 4 polarities within each Aspect#Category.
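
One way to read this combination is a Softmax applied to each 4-neuron block, followed by binary cross-entropy averaged over all concatenated outputs. The pure-Python sketch below works under that reading for 2 Aspect#Category blocks. The clipping constant and the averaging are assumptions that mirror common framework defaults, not values taken from the notebooks.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def blockwise_softmax(logits, block_size=4):
    """Apply Softmax separately to each consecutive block of `block_size` logits."""
    probs = []
    for start in range(0, len(logits), block_size):
        probs.extend(softmax(logits[start:start + block_size]))
    return probs


def binary_crossentropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over every output of the concatenated layer."""
    total = 0.0
    for target, prob in zip(targets, probs):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        total -= target * math.log(prob) + (1 - target) * math.log(1.0 - prob)
    return total / len(targets)


# Two blocks: the first leans Positive (slot 0), the second leans None (slot 3).
logits = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]
probs = blockwise_softmax(logits)

matching = [1, 0, 0, 0, 0, 0, 0, 1]
mismatched = [0, 1, 0, 0, 0, 0, 1, 0]
print(binary_crossentropy(matching, probs), binary_crossentropy(mismatched, probs))
```

Each block still sums to 1, so mutual exclusivity is preserved, while the loss scores every output as its own yes/no decision.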

Reference (Vietnamese): https://phamdinhkhanh.github.io/2020/04/22/MultitaskLearning.html

ACSA-v2. Multi-task with Multi-branch Approach:

👉 Notebook Solutions: Hotel-v2.ipynb, Restaurant-v2.ipynb

The only difference from the approach above is that this one branches into multiple sub-models, using C Dense layers (34 for Hotel and 12 for Restaurant) without concatenating them into a single layer. Each branch predicts its own task independently, and the branches do not share parameters with one another.

The Softmax function is applied here to get the probability distribution over the 4 polarity classes directly without converting them into one-hot vectors. Therefore, the categorical_crossentropy loss function will be used to treat each Dense layer as a multi-class classification problem.
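
The per-branch loss can be sketched as one cross-entropy term per Aspect#Category branch, using the index of the true class. Summing the branch losses with equal weights is an assumption for illustration; the notebooks may weight the branches differently.

```python
import math


def categorical_crossentropy(class_index, probs, eps=1e-7):
    """Cross-entropy for one branch: negative log-probability of the true class."""
    return -math.log(max(probs[class_index], eps))


def multi_branch_loss(branch_targets, branch_probs):
    """Sum the losses of all branches (equal weights assumed)."""
    return sum(
        categorical_crossentropy(target, probs)
        for target, probs in zip(branch_targets, branch_probs)
    )


# Two branches with Softmax outputs over 4 classes (assumed order:
# Positive, Negative, Neutral, None).
branch_probs = [
    [0.7, 0.1, 0.1, 0.1],      # branch 1 is fairly sure of class 0
    [0.25, 0.25, 0.25, 0.25],  # branch 2 is undecided
]
branch_targets = [0, 3]

loss = multi_branch_loss(branch_targets, branch_probs)
print(loss)
```

Because each branch contributes its own term, a confident correct branch adds little to the total while an undecided branch adds more.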

Reference (Vietnamese): https://phamdinhkhanh.github.io/2020/05/05/MultitaskLearning_MultiBranch.html

V. Experimental Results

1. Evaluation on the VLSP 2018 ABSA Dataset

VLSP has its own Java evaluation script for its ACSA tasks. You have to prepare 2 files:

The ground-truth file: 3-VLSP2018-SA-Hotel-test.txt for the Hotel domain and 3-VLSP2018-SA-Restaurant-test.txt for the Restaurant domain.
The predicted file that has the same format as the ground-truth file. You can find the example predictions of the models in the experiments/predictions folder.

I already provided a script to run the evaluation for each domain and approach. You can run the following command to get the evaluation results:

source ./evaluators/vlsp_evaluate.sh

Task	Method	Hotel Precision	Hotel Recall	Hotel F1-score	Restaurant Precision	Restaurant Recall	Restaurant F1-score
Aspect#Category	VLSP best submission	76.00	66.00	70.00	79.00	76.00	77.00
Aspect#Category	Bi-LSTM+CNN	84.03	72.52	77.85	82.02	77.51	79.70
Aspect#Category	BERT-based Hierarchical	-	-	82.06	-	-	84.23
Aspect#Category	Multi-task	87.45	78.17	82.55	81.09	85.61	83.29
Aspect#Category	Multi-task Multi-branch	63.21	57.86	60.42	80.81	87.39	83.97
Aspect#Category,Polarity	VLSP best submission	66.00	57.00	61.00	62.00	60.00	61.00
Aspect#Category,Polarity	Bi-LSTM+CNN	76.53	66.04	70.90	66.66	63.00	64.78
Aspect#Category,Polarity	BERT-based Hierarchical	-	-	74.69	-	-	71.30
Aspect#Category,Polarity	Multi-task	81.90	73.22	77.32	69.66	73.54	71.55
Aspect#Category,Polarity	Multi-task Multi-branch	57.55	52.67	55.00	68.69	74.29	71.38
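
The F1-scores in the Multi-task and Multi-task Multi-branch rows are consistent with the harmonic mean of the reported Precision and Recall. The short check below recomputes them from the table values. It only verifies that arithmetic and does not reproduce the VLSP evaluation script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Aspect#Category / Multi-task / Hotel": (87.45, 78.17, 82.55),
    "Aspect#Category / Multi-task / Restaurant": (81.09, 85.61, 83.29),
    "Aspect#Category / Multi-branch / Hotel": (63.21, 57.86, 60.42),
    "Aspect#Category / Multi-branch / Restaurant": (80.81, 87.39, 83.97),
    "Aspect#Category,Polarity / Multi-task / Hotel": (81.90, 73.22, 77.32),
    "Aspect#Category,Polarity / Multi-task / Restaurant": (69.66, 73.54, 71.55),
    "Aspect#Category,Polarity / Multi-branch / Hotel": (57.55, 52.67, 55.00),
    "Aspect#Category,Polarity / Multi-branch / Restaurant": (68.69, 74.29, 71.38),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = round(f1_score(precision, recall), 2)
    print(f"{name}: reported {f1:.2f}, recomputed {recomputed:.2f}")
```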

2. Some Notes about the Results

The predictions in the experiments/predictions folder and the evaluation results in the evaluators folder come from older models I built a couple of years ago.

I finished the paper on this project in 2021, so the results above come from the experiments I conducted at that time, which can be found at commit e8439bc. Some things to note if you want to re-run the notebooks in that commit to reproduce these results:

You can download the weights for each model here.
As the notebooks in this commit are deprecated, you may run into issues when running them. For example, calling the create_model function raises the following error when initializing the input layer.

<class 'keras.src.backend.common.keras_tensor.KerasTensor'> is not allowed only (<class 'tensorflow.python.framework.tensor.Tensor'> ...)

This error occurs because the PhoBERT model in the current huggingface version does not support KerasTensor input in the notebook version of TensorFlow/Keras. There are 2 ways to fix this:

Downgrade the version of TensorFlow to nearly the same as when I did this project, around 2.10.
Use TensorFlow's Subclassing API by creating your own model class that inherits from keras.Model. This is how I fixed the issue in the latest update.
