Any work around to retain original form of words ? #5
Comments
Hello @PSanni. For your first problem, retaining the original form of words, I do not know how to address it. However, for your second question, I was able to use another dataset of my own (currently being trained). Here is the solution I came up with; I hope it can be applied to your use case. This project uses the datasets from https://github.com/ku21fan/STR-Fewer-Labels, as mentioned in Datasets.md, with a few workarounds. I thoroughly followed the instructions and was able to start training parseq on my own dataset.

Edit: the training terminates, but the test shows really inconsistent results. Maybe the .mdb file is still problematic. I am exploring this issue.
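For anyone else converting their own dataset, a minimal sketch of the record key layout commonly used by STR LMDB datasets may help. This is an assumption based on the deep-text-recognition-benchmark-style tooling that prior work in this area shares; `image_key` and `label_key` are hypothetical helper names, not functions from this repo:

```python
# Sketch of the per-sample key scheme assumed for STR-style LMDB files:
# each sample stores an encoded image under image-%09d and its text
# label under label-%09d, plus a global num-samples entry.

def image_key(index: int) -> bytes:
    # zero-padded, 1-based sample index, e.g. b"image-000000001"
    return b"image-%09d" % index

def label_key(index: int) -> bytes:
    # matching label key for the same sample index
    return b"label-%09d" % index
```

With the `lmdb` package, each sample would then be written as `txn.put(image_key(i), image_bytes)` and `txn.put(label_key(i), label.encode())` inside a write transaction.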
@PSanni for now, you can just directly edit and comment out Line 85 in 98959c9
Note that some preprocessed datasets have had the spaces within labels removed. For the datasets which I preprocessed (COCO, OpenVINO, TextOCR), the spaces within the labels should be intact. For fine-tuning on other datasets, you have two options:
- Expose the `normalize_unicode` parameter of `LmdbDataset`
- Add a `remove_whitespace` flag for disabling whitespace removal in labels
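To make the two options above concrete, here is a minimal, hypothetical sketch of what such label preprocessing could look like. The function name `preprocess_label` and both flag names are illustrative assumptions, not the repo's actual API:

```python
import unicodedata

def preprocess_label(label: str, remove_whitespace: bool = True,
                     normalize_unicode: bool = True) -> str:
    """Hypothetical sketch of LmdbDataset-style label cleaning."""
    if normalize_unicode:
        # fold Unicode variants down to ASCII where possible
        label = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode()
    if remove_whitespace:
        # default behavior: strip all whitespace, "sunflower oil" -> "sunfloweroil"
        label = "".join(label.split())
    return label
```

Passing `remove_whitespace=False` would keep the space intact, which is exactly what fine-tuning on space-separated labels needs.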
I think it's a good idea to include annotation samples and the required input format for the model.
The LMDB format used is unchanged from prior work. The conversion from text labels to token IDs is handled by
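As a rough illustration of what a charset-based label-to-token-ID conversion involves, here is a minimal sketch. `SimpleTokenizer` is a hypothetical stand-in; the project's real tokenizer differs in details such as special tokens:

```python
class SimpleTokenizer:
    """Hypothetical charset-based tokenizer, for illustration only."""
    def __init__(self, charset: str):
        # reserve id 0 for a special token such as [EOS]
        self.char_to_id = {c: i + 1 for i, c in enumerate(charset)}
        self.id_to_char = {i: c for c, i in self.char_to_id.items()}

    def encode(self, label: str) -> list:
        # raises KeyError for any character missing from the charset (e.g. ' ')
        return [self.char_to_id[c] for c in label]

    def decode(self, ids) -> str:
        return "".join(self.id_to_char[i] for i in ids)
```

Note that a label containing a character outside the charset simply cannot be encoded, which is why the space character has to be part of the charset if spaces should survive.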
In addition to disabling whitespace (space, tab, newline, etc.) removal, make sure you add the space character ' ' to

Closing this now since all issues have been addressed. Feel free to reopen if I missed anything.
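The point about adding the space character can be checked mechanically: a label is only learnable if every one of its characters is in the training charset. The helper and the 36-character charset below are illustrative assumptions:

```python
def is_representable(label: str, charset: str) -> bool:
    # a label can only be learned if every character appears in the charset
    return all(c in charset for c in label)

# assumed lowercase-alphanumeric default charset; real configs may differ
CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"
```

Without `' '` in the charset, `"sunflower oil"` is unrepresentable; appending it (`CHARSET + " "`) fixes that.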
The model currently cannot retain the original form of words. For example, if the words in an image are "sunflower oil", it returns "sunfloweroil" without the space. Is there any workaround to address this?
Also, is it possible to fine-tune this model on other datasets such as XFUND (https://github.com/doc-analysis/XFUND)?