This repository is the official implementation of unimodal aggregation (UMA) for automatic speech recognition (ASR).
It consists of two works:
- for non-autoregressive offline ASR: "Unimodal Aggregation for CTC-based Speech Recognition" (ICASSP 2024)
- for streaming ASR: "Mamba for Streaming ASR Combined with Unimodal Aggregation" (submitted to ICASSP 2025)
Unimodal aggregation (UMA) segments and integrates the feature frames that belong to the same text token, and thereby learns better feature representations for text tokens. Both the frame-wise features and the frame-wise weights are derived from an encoder. The feature frames with unimodal weights are then integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training. Moreover, integrating self-conditioned CTC into the proposed framework noticeably improves performance further.
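The segment-and-integrate step can be sketched in a few lines of plain Python. This is only an illustration, not the repository's implementation: the boundary rule (a frame is a segment boundary when its weight is a local minimum), the half-open segment convention, and the names `find_valleys`, `weighted_mean`, and `aggregate` are all assumptions made for this example.

```python
from typing import List, Sequence


def find_valleys(weights: Sequence[float]) -> List[int]:
    """Indices of interior frames whose weight is a local minimum.

    Assumed boundary rule (not stated in this README): frame t is a valley
    when weights[t] <= weights[t - 1] and weights[t] <= weights[t + 1].
    """
    return [
        t
        for t in range(1, len(weights) - 1)
        if weights[t] <= weights[t - 1] and weights[t] <= weights[t + 1]
    ]


def weighted_mean(frames: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Weighted average of feature frames, normalized by the weight sum."""
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: fall back to a plain (unweighted) mean.
        weights = [1.0] * len(frames)
        total = float(len(frames))
    dim = len(frames[0])
    return [
        sum(w * f[d] for f, w in zip(frames, weights)) / total
        for d in range(dim)
    ]


def aggregate(frames: Sequence[Sequence[float]],
              weights: Sequence[float]) -> List[List[float]]:
    """Split the frames at weight valleys and integrate each segment.

    Each segment starts at a valley and ends just before the next one
    (an assumption of this sketch), so the weights inside a segment
    rise and then fall, i.e. they are unimodal.
    """
    if len(frames) != len(weights):
        raise ValueError("frames and weights must have the same length")
    if not frames:
        return []
    bounds = [0] + find_valleys(weights) + [len(frames)]
    return [
        weighted_mean(frames[s:e], weights[s:e])
        for s, e in zip(bounds, bounds[1:])
    ]


if __name__ == "__main__":
    # Five 1-dimensional frames; the weights have a single valley at index 2.
    frames = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    weights = [0.1, 0.8, 0.2, 0.9, 0.3]
    print(aggregate(frames, weights))
```

In this toy input the five frames collapse into two token-level vectors (roughly 1.889 and 4.071), one per unimodal bump in the weights. In the actual model the frames and weights come from the encoder, and the aggregated vectors are passed on to the decoder.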
Mamba, a recently proposed state space model, has been shown to match or surpass Transformers in various tasks while benefiting from linear complexity. We explore the efficiency of a Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and triggers token output in a streaming fashion, while aggregating feature frames to learn better token representations. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency.
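To make the idea of triggering token output while frames arrive more concrete, here is a minimal sketch of a frame-by-frame aggregator. It is not the method from the paper or the code in this repository: the class name `StreamingAggregator`, the valley-based trigger, and the one-frame delay needed to confirm a valley are assumptions chosen for illustration, and the early termination (ET) method is not modeled.

```python
from typing import List, Optional, Sequence


def _weighted_mean(frames: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Weighted average of feature frames, normalized by the weight sum."""
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: fall back to a plain (unweighted) mean.
        weights = [1.0] * len(frames)
        total = float(len(frames))
    dim = len(frames[0])
    return [
        sum(w * f[d] for f, w in zip(frames, weights)) / total
        for d in range(dim)
    ]


class StreamingAggregator:
    """Hypothetical streaming aggregator (illustration only).

    Frames are pushed one at a time. When the newest frame confirms that
    the previous frame was a weight valley (a local minimum), every frame
    buffered before that valley is integrated and emitted as one token.
    """

    def __init__(self) -> None:
        self._frames: List[Sequence[float]] = []
        self._weights: List[float] = []

    def push(self, frame: Sequence[float],
             weight: float) -> Optional[List[float]]:
        """Consume one frame; return an aggregated token if one is triggered."""
        token = None
        if len(self._weights) >= 2:
            prev_w, last_w = self._weights[-2], self._weights[-1]
            if last_w <= prev_w and last_w <= weight:
                # The last buffered frame is a valley: emit what precedes it
                # and let the valley frame open the next segment.
                token = _weighted_mean(self._frames[:-1], self._weights[:-1])
                self._frames = self._frames[-1:]
                self._weights = self._weights[-1:]
        self._frames.append(frame)
        self._weights.append(weight)
        return token

    def flush(self) -> Optional[List[float]]:
        """Emit the final, still-open segment at the end of the utterance."""
        if not self._frames:
            return None
        token = _weighted_mean(self._frames, self._weights)
        self._frames, self._weights = [], []
        return token


if __name__ == "__main__":
    agg = StreamingAggregator()
    stream = zip([[1.0], [2.0], [3.0], [4.0], [5.0]],
                 [0.1, 0.8, 0.2, 0.9, 0.3])
    for frame, weight in stream:
        token = agg.push(frame, weight)
        if token is not None:
            print("token:", token)
    print("final token:", agg.flush())
```

With this toy input, the first token is emitted as soon as the fourth frame arrives, because that frame confirms the valley at the third frame; the remaining frames are emitted by `flush()`. The sketch therefore waits one frame past each valley before emitting a token.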
- The proposed method is implemented using ESPnet2, so first make sure that ESPnet is installed successfully.
- Roll back ESPnet to the specified version:
git checkout v.202304
- Clone the UMA-ASR code:
git clone https://github.com/Audio-WestlakeU/UMA-ASR
- Copy the recipe configurations in the egs2 folder to the corresponding directories under "espnet/egs2/". So far, experiments have only been conducted on the AISHELL-1, AISHELL-2, and HKUST datasets. If you want to experiment on other Chinese datasets, you can use these configurations as a reference.
- Copy the files in the espnet2 folder to the corresponding folder in "espnet/espnet2", and check that the comment path in the file header matches your path.
- To run an experiment, follow the usual ESPnet recipe steps. To use the UMA method, simply run our run_unimodal.sh from the command line instead of run.sh. For example:
./run_unimodal.sh --stage 10 --stop_stage 13
Make sure the bash scripts are executable before running them:
chmod +x asr_unimodal.sh
chmod +x run_unimodal.sh
If you use this work, please cite the corresponding papers:
@inproceedings{fang2024unimodal,
title={Unimodal aggregation for CTC-based speech recognition},
author={Fang, Ying and Li, Xiaofei},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={10591--10595},
year={2024},
organization={IEEE}
}
@article{fang2024mambauma,
title={Mamba for Streaming ASR Combined with Unimodal Aggregation},
author={Fang, Ying and Li, Xiaofei},
journal={arXiv preprint arXiv:2410.00070},
year={2024}
}