Champollion Tool Kit V1.2
About CTK
----------
Champollion Tool Kit (CTK) is a tool kit aiming to provide
ready-to-use parallel text sentence alignment tools for as many
language pairs as possible.
Built around the LDC champollion sentence aligner kernel, the tool kit
provides the essential components required for accurate sentence
alignment, including sentence breakers, stemmers, pre-processing
scripts, dictionaries (where possible), and post-processing scripts.
Currently, CTK includes tools to align English text with Arabic, Chinese,
and Hindi translations. It can be easily expanded to other language pairs.
CTK welcomes contributions from other researchers.
CTK is written in Perl.
Installation
------------
After unpacking the CTK distribution, you need to set the environment
variable CTK to the directory where the package is unpacked, which is
this directory if you haven't done anything funny. And that's it.
To test the installation, try running the following command:
./test_installation
It will tell you whether the installation is good or bad; if it is
bad, a minimal diagnosis will be given.
Please note, the first time you run champollion (or test_installation
which runs champollion internally), the program needs to build certain
databases, which can take up to five minutes.
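Before running the test script, it can help to confirm that the CTK
variable is actually set. The following Python sketch is not part of
CTK; the function name check_ctk_env is hypothetical, and the check for
a bin/ subdirectory is an assumption based on the command lines shown
in the Command Line section of this README.

```python
import os
from pathlib import Path


def check_ctk_env(environ=None):
    """Return a list of problems with the CTK variable (empty if none found)."""
    environ = os.environ if environ is None else environ
    ctk = environ.get("CTK", "")
    if not ctk:
        return ["CTK is not set"]
    root = Path(ctk)
    if not root.is_dir():
        return [f"CTK does not point to a directory: {ctk}"]
    # Assumption: the aligner scripts live in $CTK/bin.
    if not (root / "bin").is_dir():
        return [f"no bin/ directory under {ctk}"]
    return []


if __name__ == "__main__":
    for problem in check_ctk_env():
        print("problem:", problem)
```

An empty result only means the variable looks plausible; the test
script remains the real check of the installation.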
Input and Output
----------------
The input files for both sides should contain one segment (sentence) per
line.
The output (alignment file) looks like the following:
omitted <=> 1
omitted <=> 2
omitted <=> 3
1 <=> 4
2 <=> 5
3 <=> 6
4,5 <=> 7
6 <=> 8
7 <=> 9
8 <=> 10
9 <=> omitted
Every alignment is in the format of:
language1 sentence ids <=> language2 sentence ids
where each side (language1 or language2) may contain up to four sentence
ids delimited by commas, or the word "omitted", indicating that no
translation was found. The sentence ids start at 1.
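The format above is simple to read back into a program. The following
Python sketch is not part of CTK, and the function names parse_side and
parse_alignment are hypothetical. It represents each side as a tuple of
integer sentence ids, with an empty tuple standing for "omitted".

```python
def parse_side(text):
    """Turn one side of an alignment ('4,5' or 'omitted') into a tuple of ints."""
    text = text.strip()
    if text == "omitted":
        return ()
    return tuple(int(token) for token in text.split(","))


def parse_alignment(lines):
    """Parse alignment lines into a list of (language1_ids, language2_ids) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        left, separator, right = line.partition("<=>")
        if not separator:
            raise ValueError(f"not an alignment line: {line!r}")
        pairs.append((parse_side(left), parse_side(right)))
    return pairs


example = ["omitted <=> 1", "4,5 <=> 7", "9 <=> omitted"]
print(parse_alignment(example))
# prints [((), (1,)), ((4, 5), (7,)), ((9,), ())]
```

Because the ids start at 1 and the input files hold one sentence per
line, id n refers to line n of the corresponding input file.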
Languages
---------
CTK v1.2 supports three language pairs (English Chinese in two encodings):
English   Chinese (GB)
English   Chinese (UTF8)
English   Arabic (UTF8)
English   Hindi (UTF8)
IMPORTANT: Because we don't have the intellectual property rights (IPR)
to distribute the dictionaries we're using internally, the dictionaries
included in this package are rather small: an English Chinese dictionary
(about 5K headwords) and an English Arabic dictionary (about 4K
headwords). Our experiments show that a bigger dictionary usually leads
to better performance, which means that you may want to use your own
dictionary if it has better coverage than the one we provide.
Command Line
------------
Command line to run the English Chinese sentence aligner:
your_CTK_path/bin/champollion.EC_GB <english sentence file> <chinese sentence file> <alignment file>
or
your_CTK_path/bin/champollion.EC_utf8 <english sentence file> <chinese sentence file> <alignment file>
Command line to run the English Arabic sentence aligner:
your_CTK_path/bin/champollion.EA <english sentence file> <arabic sentence file> <alignment file>
Command line to run the English Hindi sentence aligner:
your_CTK_path/bin/champollion.EH <english sentence file> <hindi sentence file> <alignment file>
In addition, there is champollion.generic which can align unknown
language pairs. To run it:
your_CTK_path/bin/champollion.generic <LX sentence file> <LY sentence file> <alignment file> <dictionary>
For languages not included in the package, it's strongly recommended
that you write a tokenizer following existing examples, and a stemmer
if possible. Even an imperfect stemmer will make a big difference.
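When the aligners are driven from another program, the command lines
above can be assembled programmatically. The following Python sketch is
not part of CTK; the function name build_command and the short pair
keys are hypothetical, while the script names and argument order are
taken from the command lines in this section.

```python
import os

# Aligner script names as listed in this README, keyed by a short label.
ALIGNERS = {
    "EC_GB": "champollion.EC_GB",
    "EC_utf8": "champollion.EC_utf8",
    "EA": "champollion.EA",
    "EH": "champollion.EH",
    "generic": "champollion.generic",
}


def build_command(ctk_path, pair, source_file, target_file, alignment_file,
                  dictionary=None):
    """Return the argument list for one aligner run."""
    if pair not in ALIGNERS:
        raise ValueError(f"unknown language pair: {pair}")
    command = [
        os.path.join(ctk_path, "bin", ALIGNERS[pair]),
        source_file,
        target_file,
        alignment_file,
    ]
    if pair == "generic":
        # Only champollion.generic takes a dictionary as a fourth argument.
        if dictionary is None:
            raise ValueError("champollion.generic needs a dictionary")
        command.append(dictionary)
    elif dictionary is not None:
        raise ValueError(f"{ALIGNERS[pair]} takes no dictionary argument")
    return command


if __name__ == "__main__":
    # File names here are placeholders.
    print(build_command(os.environ.get("CTK", "."), "EA",
                        "en.txt", "ar.txt", "out.align"))
```

The returned list can be passed to subprocess.run(command, check=True),
with the CTK environment variable set as described under Installation.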
Evaluation Corpus
-----------------
To facilitate the development of better sentence aligners, this
package also includes manually aligned English Chinese data as an
evaluation corpus. The evaluation corpus is in the 'eval' directory.
The data were selected from three sources: UN, Sinorama, and Hong Kong Hansards.
The Chinese files are:
198706005.c.txt   921008fc.txt           UN19990209_010.c.txt
200110006.c.txt   930422fc.txt
890621fc.txt      UN19930101_020.c.txt
The English files are:
198706005.e.txt   921008fe.txt           UN19990209_010.e.txt
200110006.e.txt   930422fe.txt
890621fe.txt      UN19930101_020.e.txt
The gold alignment files are:
198706005.gold.align   930422f.gold.align
200110006.gold.align   UN19930101_020.gold.align
890621f.gold.align     UN19990209_010.gold.align
921008f.gold.align
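One way to use the gold files is to score an aligner's output by
exact-match precision and recall. The following Python sketch is not
part of CTK and is not necessarily the scoring the authors used. It
assumes the gold files follow the same "<=>" format as the aligner
output, and the function names read_links and score are hypothetical.

```python
def read_links(lines):
    """Collect alignments as a set of (language1_ids, language2_ids) tuples."""
    def ids(side):
        return () if side == "omitted" else tuple(int(t) for t in side.split(","))

    links = set()
    for line in lines:
        if "<=>" not in line:
            continue  # skip blank or unrelated lines
        left, right = (side.strip() for side in line.split("<=>", 1))
        links.add((ids(left), ids(right)))
    return links


def score(hypothesis_lines, gold_lines, skip_omitted=True):
    """Return (precision, recall) over exactly matching alignments."""
    hypothesis = read_links(hypothesis_lines)
    gold = read_links(gold_lines)
    if skip_omitted:
        # Assumption: only alignments with sentences on both sides are scored.
        hypothesis = {link for link in hypothesis if link[0] and link[1]}
        gold = {link for link in gold if link[0] and link[1]}
    correct = len(hypothesis & gold)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


hypothesis = ["1 <=> 1", "2,3 <=> 2", "4 <=> omitted"]
gold = ["1 <=> 1", "2 <=> 2", "3 <=> omitted", "4 <=> 3"]
precision, recall = score(hypothesis, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints precision=0.50 recall=0.33
```

To score real output, pass the lines of an alignment file and of the
matching gold file, for example open(path).readlines() for each.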
COPYRIGHT
---------
This software is protected by the Common Public License; see LICENSE for
details.
Contributions, Questions, Bug reports, etc.
-------------------------------------------
Please contact Xiaoyi Ma at [email protected].
Xiaoyi Ma 6/20/2011
Linguistic Data Consortium