Tuesday, May 8, 2012

Sentence-level alignment tools for statistical machine translation

I recently found the following sentence-level alignment tools for statistical machine translation (SMT). Given a pair of parallel documents, these tools pair up the sentences that have the same meaning in the two languages. Sentence alignment is also the first step in building an SMT system.

(1) CTK: Champollion Tool Kit
http://champollion.sourceforge.net/
Note: this tool (from the LDC) uses translation lexicons to align sentences. One disadvantage is that it does not work well when the two documents differ greatly in their number of sentences.
CTK v1.2 supports three language pairs (Chinese in two encodings):
    English Chinese(GB)
    English Chinese(UTF8)
    English Arabic (UTF8)
    English Hindi (UTF8)

(2) Gale-Church Aligner
This is a classic sentence-level alignment algorithm from the early 1990s that relies on sentence lengths, and fortunately Chris Crowner has implemented it for NLTK.
http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/align/align.py?r=8552&spec=svn8552
Note that the Python code is in nltk_contrib, not in the main release of NLTK.
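
To give an idea of how a length-based aligner works, here is a minimal sketch in the spirit of Gale-Church. It is not the nltk_contrib code: the function names are my own, and the constants (priors, C, S2) are assumed values close to those usually quoted for the algorithm, so treat them as tunable. The input is just the character length of each sentence on both sides.

```python
import math

# Assumed prior probabilities for each bead type (source count, target count).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0   # assumed expected number of target characters per source character
S2 = 6.8  # assumed variance of that ratio


def length_cost(src_len, tgt_len):
    """Negative log of the two-sided normal tail for the length mismatch."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len / C) / 2.0
    delta = (tgt_len - src_len * C) / math.sqrt(mean * S2)
    tail = math.erfc(abs(delta) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|delta|))
    return -math.log(max(tail, 1e-300))


def align(src_lens, tgt_lens):
    """Return a list of beads (source indices, target indices) with minimal cost."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try to extend the partial alignment with every bead type.
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = length_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                total = cost[i][j] + step - math.log(prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

For example, align([len(s) for s in src_sentences], [len(t) for t in tgt_sentences]) returns beads such as ([0, 1], [0]), meaning that source sentences 0 and 1 together correspond to target sentence 0. Because only lengths are used, no dictionary is needed, but the result degrades when the two documents are not close translations.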

(3) MTTK: Machine Translation Toolkit
http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1/
Note: this tool is supposed to be able to do sentence-level alignment, but I still have not figured out how to do it with the tool.

(4) Align
http://www.cse.unt.edu/~rada/wa/tools/aberger/align.html
Note: this tool was developed by Adam Berger, and can be downloaded from:
http://www.cse.unt.edu/~rada/wa/tools/aberger/align.tar
It supports sentence-level alignment using some anchor labels.

(5) Bleualign
https://github.com/rsennrich/Bleualign
This tool requires automatic translations of one side of the unaligned corpus and then uses a modified BLEU evaluation to find the sentence-level alignments. Of course, you need a seed SMT system to generate the automatic translations. The tool is written in Python.
One problem I found with this aligner: it can use the same target-side sentence multiple times in the output alignments.
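
The idea of matching automatic translations against the target side can be sketched as follows. This is only an illustration, not Bleualign's actual code: overlap_score is a crude stand-in for its modified BLEU, the function names and the threshold are my own assumptions, and sentence order is ignored even though a real aligner would exploit it. The last step keeps each target sentence at most once, which is one simple way to post-filter repeated targets.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_score(hyp, ref, max_n=2):
    """Mean clipped n-gram precision of hyp against ref (crude BLEU stand-in)."""
    hyp_toks, ref_toks = hyp.lower().split(), ref.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp_toks, n), ngrams(ref_toks, n)
        total = sum(h.values())
        if total == 0:
            continue
        matched = sum(min(count, r[gram]) for gram, count in h.items())
        precisions.append(matched / total)
    return sum(precisions) / len(precisions) if precisions else 0.0


def best_pairs(translated_src, tgt, threshold=0.3):
    """Pair each translated source sentence with its best-scoring target,
    then keep every target index at most once (highest score wins)."""
    if not tgt:
        return []
    candidates = []
    for i, hyp in enumerate(translated_src):
        score, j = max((overlap_score(hyp, ref), j) for j, ref in enumerate(tgt))
        if score >= threshold:
            candidates.append((score, i, j))
    used, pairs = set(), []
    for score, i, j in sorted(candidates, reverse=True):
        if j not in used:
            used.add(j)
            pairs.append((i, j))
    return sorted(pairs)
```

Here translated_src holds the machine translations of the source sentences (produced by the seed SMT system) and tgt holds the real target sentences; the returned pairs are (source index, target index).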

(6) Microsoft Bilingual Sentence Aligner
https://www.microsoft.com/en-us/download/details.aspx?id=52608
This is a sentence aligner written in Perl. It uses sentence length.


6 comments:

Anonymous said...

Can any one help me with python code for translation from Arabic language to English pleeeeeeeeeeese??

noha said...

Can any one help me with python code for translation from Arabic language to English pleeeeeeeeeeese??

Wang Pidong said...

To the best of my knowledge, there is no Python statistical machine translation decoder so far, so you had better use Moses to build your translation system. Of course, before building the system, you need to prepare some Arabic-English parallel training data. One free way of getting training data is to take it from open-source parallel corpora, e.g. OPUS.

Anonymous said...

How to get access_token for ios and android devices. Any translation code snippet for ios will help me a lot

Mohammad Nasiruddin said...

Hello Noha,

Please check the Kriya decoder, which is an implementation of a hierarchical phrase-based (hiero) SMT system. It is entirely implemented in Python and includes both grammar extractor and decoder modules.
Please see the PBML paper for technical details specific to this implementation:
Baskaran Sankaran, Majid Razmara and Anoop Sarkar. 2012. Kriya – An end-to-end Hierarchical Phrase-based MT System. The Prague Bulletin of Mathematical Linguistics (PBML), (97), 83-98.

cidermole said...

And there's also Bob Moore's excellent "Bilingual Sentence Aligner".

Currently residing at

http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/

though Microsoft seems to change download links.