tatoeba_parser

INSTALL

tatoeba_parser uses autotools to build on Linux. Here is the four-step build procedure.

1.  autoreconf -i
2.  ./configure
3.  make
4.  make install

DESCRIPTION

tatoeba_parser is a program that parses the Tatoeba database. It is useful for retrieving all the sentences that match a given set of criteria. For it to work fully, three files are necessary: sentences.csv, links.csv and tags.csv. All three can be freely retrieved from http://www.tatoeba.org.
I first wrote this program because I needed example sentences in Chinese. I wanted to translate as many as I could, but I only knew a few characters, so I had to filter out the sentences that contained unknown characters. I then added more options and filters to gather the sentences that had a translation into a language I knew. Later I became so proud of my level of Chinese that I decided to train my ear as well, which meant getting all the sentences tagged as "has audio", so I developed some more code for that.
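
Because full functionality depends on sentences.csv, links.csv and tags.csv, it can help to confirm that all three are present before launching the parser. The sketch below is a small shell helper written for this README, not part of tatoeba_parser. The function name check_tatoeba_files and the idea of keeping the three files in a single directory are assumptions, since where the program looks for them is not stated here.

```shell
# Hypothetical helper: report which of the three Tatoeba files are missing.
# The directory argument is an assumption; it defaults to the current directory.
check_tatoeba_files() {
    dir="${1:-.}"
    missing=""
    for f in sentences.csv links.csv tags.csv; do
        # Append the file name to the list when the file is not there.
        [ -f "$dir/$f" ] || missing="$missing $f"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all three files found in $dir"
}

# Check the directory given as first argument, or the current one.
check_tatoeba_files "${1:-.}" || echo "retrieve the missing files from tatoeba.org first"
```

The function returns a non-zero status when a file is missing, so it can also guard a later command in a script.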

USAGE

Launching the program with --help should print the list of options. Here are some examples.

1. I want to retrieve all the sentences that are written in French and that have a Spanish translation

    parser_r --lang fra --translatable-in spa
    
2. I want all the Chinese sentences that are made up only of the characters 你好吗
    
    parser_r --lang cmn --regex '^[你好吗]*$'

3. I want to get all the sentences whose translations contain the word "foo"

    parser_r --translation-regex '^.*foo.*$'
    
4. I want to get all the Spanish translations tagged as "OK"

    parser_r --lang spa --has-tag "OK"
    
5. I want a list of the French sentences that have no space before their ? character

    parser_r --lang fra --regex "^.*[a-zA-Z]\\?.*$"
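
A pattern such as the one in example 5 can be tried on a few sample lines before running it against the whole database. The sketch below uses grep -E as a stand-in, which is an assumption: the regex dialect used by parser_r itself may differ. The sample sentences are made up for illustration. Inside the shell's double quotes, `\\?` reaches the program as `\?`, which is what the single-quoted pattern below spells out.

```shell
# Stand-in check with grep -E; parser_r's own regex engine may behave differently.
pattern='^.*[a-zA-Z]\?.*$'

# Hypothetical sample sentences, one per line.
samples='Comment vas-tu?
Comment vas-tu ?
Il pleut.'

# Keep only the lines where a letter sits directly before the question mark.
matches=$(printf '%s\n' "$samples" | grep -E "$pattern")
printf '%s\n' "$matches"
```

Only the first sample is kept, because it is the only one with a letter immediately before the ? character. Since [a-zA-Z] lists only unaccented letters, a sentence ending in an accented letter followed by ? may be missed.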
    
Some switches change how the sentences are output: -i writes each sentence's id, and -n writes the line number.

AUTHOR & LICENSING

The author is Victor Lavaud <[email protected]>, and the program and its source code are released under the GPL.
