I then tried tokenizing the example sentence from the docs:
> time echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i system.dic.zst
Running `target/release/tokenize -i system.dic.zst`
Loading the dictionary...
Ready to tokenize
本 名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と 助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー 名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街 名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町 名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ 助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう 形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ 助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。 補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS
________________________________________________________
Executed in 13.96 secs fish external
usr time 13.09 secs 0.00 micros 13.09 secs
sys time 0.86 secs 0.00 micros 0.86 secs
But it takes around 14 seconds to load the dictionary.
In comparison, MeCab is near-instant:
> time echo "本とカレーの街神保町へようこそ。" | mecab --dicdir="unidic-cwj-202302_full"
本 名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と 助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー 名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の 助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街 名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町 名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ 助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう 形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ 助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。 補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS
________________________________________________________
Executed in 28.32 millis fish external
usr time 0.00 millis 0.00 micros 0.00 millis
sys time 31.25 millis 0.00 micros 31.25 millis
I looked at the code, and it seems all the time is spent deserializing bincode into the DictionaryInner struct, specifically in the read_common function:
```rust
fn read_common<R>(mut rdr: R) -> Result<DictionaryInner>
where
    R: Read,
{
    let mut magic = [0; MODEL_MAGIC.len()];
    rdr.read_exact(&mut magic)?;
    if magic != MODEL_MAGIC {
        return Err(VibratoError::invalid_argument(
            "rdr",
            "The magic number of the input model mismatches.",
        ));
    }
    let config = common::bincode_config();
    let data = bincode::decode_from_std_read(&mut rdr, config)?;
    Ok(data)
}
```
The call `let data = bincode::decode_from_std_read(&mut rdr, config)?;` is what takes so long, so bincode deserialization seems to be the bottleneck.
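To illustrate why the decode step dominates, here is a hedged, std-only sketch using a toy format (not Vibrato's actual one): an owned decode, which is what bincode performs, copies every entry into a fresh allocation, while a zero-copy view of the kind mmap-friendly formats enable merely borrows slices from the buffer.

```rust
// Toy length-prefixed format: [u32 count][u32 len, bytes]...
// This is NOT Vibrato's format; it only mimics the cost difference
// between owned decoding and a zero-copy view.

fn encode(entries: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(entries.len() as u32).to_le_bytes());
    for e in entries {
        buf.extend_from_slice(&(e.len() as u32).to_le_bytes());
        buf.extend_from_slice(e.as_bytes());
    }
    buf
}

/// Owned decode (what bincode effectively does): one allocation
/// and one copy per entry, plus UTF-8 validation.
fn decode_owned(buf: &[u8]) -> Vec<String> {
    let n = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let mut pos = 4;
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let len = u32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap()) as usize;
        pos += 4;
        out.push(String::from_utf8(buf[pos..pos + len].to_vec()).unwrap());
        pos += len;
    }
    out
}

/// Zero-copy view (mmap/rkyv style): borrow slices, no copies.
fn view_borrowed(buf: &[u8]) -> Vec<&str> {
    let n = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let mut pos = 4;
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let len = u32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap()) as usize;
        pos += 4;
        out.push(std::str::from_utf8(&buf[pos..pos + len]).unwrap());
        pos += len;
    }
    out
}

fn main() {
    let buf = encode(&["本", "カレー", "神保町"]);
    assert_eq!(decode_owned(&buf), vec!["本", "カレー", "神保町"]);
    assert_eq!(view_borrowed(&buf), vec!["本", "カレー", "神保町"]);
    println!("ok");
}
```

With millions of dictionary entries, the difference between the two strategies is millions of allocations and copies versus essentially none.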
How is MeCab able to return results so quickly despite not loading everything into memory like Vibrato does? MeCab seems to use almost no memory, whereas Vibrato uses about 1 GB to cache everything before it can tokenize.
Describe the solution you'd like
Could we use a faster serialization framework like rkyv? According to its benchmarks, it's a lot faster than bincode. The rkyv docs say:
It’s similar to other zero-copy deserialization frameworks such as Cap’n Proto and FlatBuffers. However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot. Additionally, rkyv is designed to have little to no overhead, and in most cases will perform exactly the same as native types.
I'm not sure if there's any other way to speed it up. Could we somehow parallelize deserialization?
Thank you very much for the report and suggestions!
How is mecab able to return results so quickly despite not loading everything in memory like vibrato? It seems like it doesn't take any memory when I use mecab, whereas vibrato takes 1 GB memory to cache everything before being able to tokenize
MeCab uses mmap to load the dictionary, enabling such fast and memory-efficient deserialization.
If Vibrato used rkyv as you suggest, equivalent performance could be achieved, because rkyv supports zero-copy access to an mmapped buffer. I agree it is worth attempting rkyv. I will try it when I have time.
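As a hedged, std-only illustration of the effect (a toy layout, not MeCab's or Vibrato's real format): with an on-disk offset table, startup reads only a small fixed-size header, and entries are fetched lazily on demand, which is essentially what mmap plus a zero-copy archive provides.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

// Toy layout: [u64 count][u64 offset, u64 len]...[payload bytes].
// Startup cost is independent of dictionary size.

fn write_dict(path: &std::path::Path, entries: &[&str]) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(&(entries.len() as u64).to_le_bytes())?;
    // Offset table: start of each payload, relative to the payload base.
    let mut off = 0u64;
    for e in entries {
        f.write_all(&off.to_le_bytes())?;
        f.write_all(&(e.len() as u64).to_le_bytes())?;
        off += e.len() as u64;
    }
    for e in entries {
        f.write_all(e.as_bytes())?;
    }
    Ok(())
}

fn read_entry(path: &std::path::Path, idx: u64) -> std::io::Result<String> {
    let mut f = File::open(path)?;
    let mut hdr = [0u8; 8];
    f.read_exact(&mut hdr)?; // header only: O(1), regardless of dict size
    let count = u64::from_le_bytes(hdr);
    // Seek straight to the idx-th table record, then to its payload.
    f.seek(SeekFrom::Start(8 + idx * 16))?;
    let mut rec = [0u8; 16];
    f.read_exact(&mut rec)?;
    let off = u64::from_le_bytes(rec[0..8].try_into().unwrap());
    let len = u64::from_le_bytes(rec[8..16].try_into().unwrap());
    let payload_base = 8 + count * 16;
    f.seek(SeekFrom::Start(payload_base + off))?;
    let mut buf = vec![0u8; len as usize];
    f.read_exact(&mut buf)?;
    Ok(String::from_utf8(buf).unwrap())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("toy.dic");
    write_dict(&path, &["本", "カレー", "神保町"])?;
    assert_eq!(read_entry(&path, 2)?, "神保町");
    println!("ok");
    Ok(())
}
```

mmap gives this lazy behavior transparently: the OS faults pages in on first access and shares them with the page cache, which is why MeCab's startup looks instant and costs little resident memory.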
Ah! mmap! No wonder mecab can load it so quick! And thanks for the quick response! Excited to see where this one goes 🙏
Is your feature request related to a problem? Please describe.
Loading a large dictionary such as UniDic 2023-02 is slow in comparison to MeCab.
I downloaded the latest UniDic cwj 2023-02 from https://clrd.ninjal.ac.jp/unidic/download.html#unidic_bccwj and built my own compiled Vibrato dictionary from it.
Describe alternatives you've considered
Apparently bincode is slow for structs that use Vec and byte slices, and the recommendation is to use serde_bytes. The features are stored as strings; maybe they could be stored as Vec<u8> instead?

Additional context
I'm using vibrato version 0.5.1.
And here are the compiled dictionary sizes: