Wednesday, December 19, 2012

How to display Chinese in Graphviz

Graphviz is a very handy tool for drawing plots and figures.

However, it is not straightforward to display Chinese characters in the generated plots.

Here is an example in DOT:
node [shape=box,style=dashed,height=0.3,fontname="C:\Windows\Fonts\NSimSun Regular.ttf",fontsize=12]; "你好"; "whr you are &#8704;";

You can write UTF-8 encoded Chinese characters directly in the source file, or write them as XML-style numeric character references such as "&#8704;" (i.e. ∀). More importantly, you need to specify a Chinese font file explicitly so that Graphviz can render the Chinese characters, since by default Graphviz often fails to find a suitable font for them.
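
If the DOT source is generated by a script, the numeric-reference form can be produced automatically. The following is a minimal Python sketch; the helper name to_numeric_refs is hypothetical and not part of Graphviz. It replaces every non-ASCII character by its decimal reference, so the DOT file itself stays plain ASCII.

```python
def to_numeric_refs(text):
    """Replace each non-ASCII character with a decimal reference such as &#8704;."""
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)


# Build one DOT statement whose label contains only ASCII characters.
label = to_numeric_refs("whr you are ∀")
dot_line = 'node [shape=box]; "%s";' % label
print(dot_line)
```

The font still has to be specified as described above; the references only change how the characters are written in the source file.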

Tuesday, December 18, 2012

How to use JDB on Linux

JDB is quite a powerful debugging tool for Java programs, especially multithreaded ones.

JDB can be found in the JDK package.

You can also learn how to use JDB from its manual page (run "man jdb").

Here I only show the parts that I found useful:
(1) run your Java program as usual, with the additional option "-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n";
(2) start JDB with the command "jdb -attach 8000";
(3) in JDB, first use "suspend" to suspend your Java program, and then use "threads" to see its thread list; to see what code a thread is running, use "where 0x22" (0x22 is a thread id taken from the thread list); when you have finished debugging, use "resume" to resume your Java program;
(4) to exit JDB and let your Java program keep running, simply press Control-C.

Friday, November 2, 2012

How to set the priority order of jar files in Eclipse

If you have multiple jar files in your Eclipse project, you may run into problems when some of the jar files contain classes with the same name. In this case, the priority order of the jar files matters, because you want the right class to be imported.

How to set the priority order of your jar files in Eclipse?

Right click your project, and then click the Properties menu item.
Select "Java Build Path" on the left of the popup window.
Then open the "Order and Export" tab, which shows the order of the jar files and source folders; entries higher in the list take priority, and you can move them with the Up and Down buttons.
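
Before reordering, it can help to know which jars actually clash. The following Python sketch is not part of Eclipse; the function name find_duplicate_classes and the example paths are hypothetical. It relies only on the fact that a jar file is a zip archive, and lists every .class entry that appears in more than one jar.

```python
import zipfile
from collections import defaultdict


def find_duplicate_classes(jar_paths):
    """Map each class file found in more than one jar to the jars containing it,
    in the order in which the jars were given."""
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as archive:
            for name in archive.namelist():
                if name.endswith(".class") and jar not in owners[name]:
                    owners[name].append(jar)
    return {name: jars for name, jars in owners.items() if len(jars) > 1}
```

For example, find_duplicate_classes(["lib/a.jar", "lib/b.jar"]) returns a dictionary whose keys are the clashing class files; the first jar in each value is the one that should come first in the "Order and Export" tab if you want its version of the class.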

Thursday, September 20, 2012

How To Open a Command Prompt in Windows 8

Windows 8 was released recently, and some people complain about it because it changes the way users are used to working with Windows.
However, I really like it, for the simple reason that it integrates nearly all the Microsoft products, e.g. Windows on PC, Windows on Phone, and also Xbox.

One new feature that I find useful is that in the Windows 8 file explorer you can open a command prompt in the current folder by simply clicking the menu File -> Open command prompt, which is something I have been expecting for a long time.

Tuesday, September 18, 2012

Useful add-ons of Firefox

FireBug 1.9
A useful tool for inspecting the HTML structure of web pages.

A tool for viewing the TCP/UDP packets sent from and to Firefox.

Tuesday, July 24, 2012

How to install Git on Linux

1. The easiest way is to install Git with a package manager; you can refer to the following page:

2. If you really cannot use any package manager, the last resort is to install Git from its source code, which can be found at:
Then you can build and install it (following the INSTALL file in the root directory of the Git source tarball):
    $ make configure ;# as yourself
    $ ./configure --prefix=/usr ;# as yourself
    $ make all doc ;# as yourself
    # make install install-doc install-html;# as root

Wednesday, June 27, 2012

how to change the language of Windows 7 Business

Microsoft provides free language solutions for Windows 7 Ultimate/Enterprise, but not for Windows 7 Business.
Of course, you can pay some money to upgrade your Windows 7 Business to Ultimate/Enterprise for the privilege of changing languages.

There is also a free solution, Vistalizator:

This website also provides the language files (MUI language packs) for different Windows systems.

Wednesday, June 13, 2012

Berkeley language model and Google Web 1T language model

The Berkeley language model project provides a library for estimating and storing large n-gram language models in memory and accessing them efficiently. Its most notable feature is that it can be used with the Google Web 1T corpus, and the project also provides binary Web 1T language models for many languages:

The homepage of the Berkeley language model project is here, and you can find the binary language models of the Google Web 1T here.

Tuesday, June 12, 2012

Static variables in Python

While Python brings many new features to the programming community, the different kinds of variables in Python are not always clearly defined. In this post, I will discuss static variables in Python.

(1) How to use static variables in Python classes
class Foo(object):
  counter = 0
  def __call__(self):
    Foo.counter += 1
    print(Foo.counter)
foo = Foo()
foo() # prints 1
foo() # prints 2
foo() # prints 3

(2) How to use static variables in Python functions (Python does not really have static variables in functions, so here we use an attribute of the function instead)
def myfunc():
  if not hasattr(myfunc, "counter"):
     myfunc.counter = 0  # it doesn't exist yet, so initialize it
  myfunc.counter += 1
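
One detail of approach (1) is easy to get wrong: the counter is only shared if it is updated through the class name. The short sketch below contrasts the two spellings; the class names Shared and PerInstance are only illustrative.

```python
class Shared(object):
    counter = 0

    def bump(self):
        Shared.counter += 1      # updates the single class-level attribute
        return Shared.counter


class PerInstance(object):
    counter = 0

    def bump(self):
        self.counter += 1        # reads the class value, then stores the result on the instance
        return self.counter


a, b = Shared(), Shared()
a.bump()
shared_result = b.bump()         # b sees the increment made through a

c, d = PerInstance(), PerInstance()
c.bump()
per_instance_result = d.bump()   # d starts again from the class value

print(shared_result, per_instance_result)
```

With self.counter += 1 each object silently gets its own counter, so the class attribute no longer behaves like a static variable.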

Tuesday, June 5, 2012

How to download Jazzy

Jazzy is a Java spell checker, similar to Aspell.
However, the sourceforge download page of Jazzy only offers the source code, without the necessary dictionaries, which makes Jazzy hard to use.
One possible solution that I just found is to download Jazzy from its CVS repository: on the repository page, click the link Download GNU tarball to download a tarball of the complete Jazzy.

Sunday, May 27, 2012

How to use bitBucket with EGit in Eclipse

Git is becoming more and more popular these days, and when we use version control systems like Git or SVN, we usually want to share our code with other developers. Thus, we need a Git server to host the code, such as bitBucket, which is a Git hosting service offering limited use for free.

To set up a project in Eclipse, and push the project to bitBucket, you need to do the following steps:
(1) install EGit in Eclipse;
(2) create an Eclipse project, e.g. HelloWorld; right click the project, and select Team->Share project... to add the project under Git control; right click the project again, and select Team->Add to index to add all the files of the project under version control; right click the project again, and select Team->Commit... to commit all the files;
(3) open an account on bitBucket, e.g. your account name is myaccount;
(4) configure the SSH in Eclipse:
click your project HelloWorld;
open menu Window->Preference->General->Network Connections->SSH2;
since now you have no SSH keys (bitBucket needs SSH keys for SSH authorization), select Key Management tab and click the button Generate RSA Key... (You can also use DSA keys);
then you can see the public key in the text area, and you need to copy the public key and save it in your account on bitBucket (Account->SSH keys); you also need to click the button Save Private Key... to save the private key to your local directory;
click the General tab, and click the Add Private Key... button to choose the private key that you just saved;
click the OK button to apply all the changes;
(5) on bitBucket, create a repository named HelloWorld, and then you can get the SSH address of the repository as:
(6) right click the project in Eclipse, and select Team->Remote->Push...;
then enter the SSH address and choose SSH as the protocol; Click the Next> button;
(7) click Add all branches spec button only, and then click the Next> button;
(8) click OK

Now other developers can clone the project hosted on bitBucket, and they can also push changes to the repository.
However, although you can push changes to the remote repository, you cannot yet pull changes from it, since the pull operation is not configured to work with the remote repository.
To solve this problem, add the following lines to the Git configuration file (.git/config in your Eclipse project folder):
[remote "origin"]
url = ssh://
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
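
The same settings can also be written with the git config command instead of editing the file by hand. The Python sketch below only builds the four command lines; the function name pull_config_commands and the example address are placeholders, not values from this post.

```python
def pull_config_commands(url, remote="origin", branch="master"):
    """Return the `git config` invocations that write the settings shown above."""
    return [
        ["git", "config", "remote.%s.url" % remote, url],
        ["git", "config", "remote.%s.fetch" % remote,
         "+refs/heads/*:refs/remotes/%s/*" % remote],
        ["git", "config", "branch.%s.remote" % branch, remote],
        ["git", "config", "branch.%s.merge" % branch, "refs/heads/%s" % branch],
    ]


# Placeholder address; substitute the SSH address of your own repository.
for argv in pull_config_commands("ssh://example.invalid/HelloWorld.git"):
    print(" ".join(argv))
```

To apply them, each list can be passed to subprocess.check_call(argv, cwd=project_dir), where project_dir is the Eclipse project folder that contains .git.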

Sunday, May 20, 2012

How to add an open file dialog in a NetBeans project

Adding the File Chooser

  1. Choose Window > Navigating > Inspector to open the Inspector window, if it is not open yet.
  2. In the Inspector, right-click the JFrame node. Choose Add From Palette > Swing Windows > File Chooser from the context menu
    GUI Builder Tip: As an alternative to the 'Add From Palette' context menu, you can also drag and drop a JFileChooser component from the Swing Window category of the Palette to the white area of the GUI builder. It will have the same result, but it is a bit harder, because the preview of the JFileChooser is rather big and you might accidentally insert the window into one of the panels, which is not what you want.
  3. A look in the Inspector confirms that a JFileChooser was added to the form.
  4. Right-click the JFileChooser node and rename the variable to fileChooser.
You have added a File Chooser. Next you tune the File Chooser to display the title that you want, add a custom file filter, and integrate the File Chooser into your application.

Configuring the File Chooser

Implementing the Open Action

  1. Click to select the JFileChooser in the Inspector window, and then edit its properties in the Properties dialog box. Change the 'dialogTitle' property to This is my open dialog, press Enter and close the Properties dialog box.
  2. Click the Source button in the GUI Builder to switch to the Source mode. To integrate the File Chooser into your application, paste the following code snippet into the existing OpenActionPerformed() method.
    private void OpenActionPerformed(java.awt.event.ActionEvent evt) {
        int returnVal = fileChooser.showOpenDialog(this);
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            File file = fileChooser.getSelectedFile();
            try {
              // What to do with the file, e.g. display it in a TextArea
              // (textarea is assumed to be the name of the JTextArea on the form)
              textarea.read( new FileReader( file.getAbsolutePath() ), null );
            } catch (IOException ex) {
              System.out.println("problem accessing file"+file.getAbsolutePath());
            }
        } else {
            System.out.println("File access cancelled by user.");
        }
    }
  3. Note: Remove the first and last lines of the code snippet that duplicate the existing ones in the source file.
  4. If the editor reports errors in your code, right-click anywhere in the code and select Fix Imports or press Ctrl+Shift+I. In the Fix All Imports dialog box accept the defaults to update the import statements and click OK.
As you can see, you call the FileChooser's getSelectedFile() method to determine which file the user clicked, so you can work with it. This example reads the file contents and displays them in the TextArea.

Implementing a File Filter

Now you add a custom file filter that makes the File Chooser display only *.txt files.
  1. Switch to the Design mode and select the FileChooser in the Inspector window.
  2. In the Properties window, click the ellipsis ("...") button next to the File Filter property.
  3. In the File Filter dialog box, select Custom Code from the combobox.
  4. Type new MyCustomFilter() in the text field. Click OK.
  5. To make the custom code work, you write an inner (or outer) class MyCustomFilter that extends the FileFilter class. Copy and paste the following code snippet into the source of your class below the import statements to create an inner class implementing the filter.
        class MyCustomFilter extends javax.swing.filechooser.FileFilter {
            @Override
            public boolean accept(File file) {
                // Allow only directories, or files with ".txt" extension
                return file.isDirectory() || file.getAbsolutePath().endsWith(".txt");
            }
            @Override
            public String getDescription() {
                // This description will be displayed in the dialog,
                // hard-coded = ugly, should be done via I18N
                return "Text documents (*.txt)";
            }
        }
    Forwarded from: 

Tuesday, May 8, 2012

sentence-level alignment tools for statistical machine translation

Recently, I have found the following sentence-level alignment tools for statistical machine translation (SMT). Given parallel documents, these tools pair up sentences that have the same meaning in the two languages. This is also the first step of building an SMT system.

(1) CTK: Champollion Tool Kit
Note: this tool (from LDC) uses translation lexicons to align sentences; one disadvantage is that it does not work well when the two documents differ greatly in their number of sentences.
CTK v1.2 supports three language pairs (English-Chinese in two encodings):
    English Chinese (GB)
    English Chinese (UTF8)
    English Arabic (UTF8)
    English Hindi (UTF8)

(2) Gale-Church Aligner
This is a very old sentence-level alignment algorithm, and fortunately Chris Crowner has implemented it in the NLTK.
Note that the python code is in the nltk_contrib, not in the main release of NLTK.

(3) MTTK: Machine Translation Toolkit
Note: this tool is supposed to support sentence-level alignment, but I still cannot figure out how to do it with the tool.

(4) Align
Note: this tool was developed by Adam Berger, and can be downloaded from:
It supports sentence-level alignment using some anchor labels.

(5) Bleualign
This tool requires automatic translations of one side of the unaligned corpus and then uses a modified BLEU evaluation to find the sentence-level alignments. Of course, you need a seed SMT system to generate the automatic translations. The tool is written in Python.
I found one problem with this aligner: it may use the same target-side sentence multiple times in the output alignments.

(6) Microsoft Bilingual Sentence Aligner
This is a sentence aligner written in Perl. It uses sentence length.
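
To give a feeling for how length-based alignment (as in tools (2) and (6)) works, here is a deliberately simplified Python sketch. It is not the Gale-Church algorithm itself: the probabilistic cost of the original method is replaced by a toy cost (absolute length difference plus a fixed, arbitrarily chosen penalty per bead type), and the function name align_by_length is hypothetical. Only the dynamic-programming structure is the same.

```python
def align_by_length(src_lens, tgt_lens):
    """Align two lists of sentence lengths with dynamic programming.

    Returns a list of (source_indices, target_indices) beads.
    """
    # Allowed bead shapes (sentences taken from source, target) and toy penalties.
    beads = {(1, 1): 0, (2, 1): 2, (1, 2): 2, (1, 0): 4, (0, 1): 4}
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in beads.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                diff = abs(sum(src_lens[i:ni]) - sum(tgt_lens[j:nj]))
                total = cost[i][j] + diff + penalty
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads.
    result, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return result[::-1]


print(align_by_length([10, 20, 5], [10, 25]))
```

In the example, the second and third source sentences together have the same length as the second target sentence, so they end up in one 2-1 bead.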

Thursday, April 26, 2012

How to use the new Bing translator API with access tokens

Bing translator has changed its API recently, and it forces developers to use a more complicated procedure than the previous one based on an AppID.

The new API involves a temporary token, called the access token, which expires 10 minutes after you get it.

The detailed steps of using the new API with access tokens are as follows (all the parameters of the curl commands must be URL-encoded):

(1) sign up for API access at Azure Data Market, from which you can get your Client Secret (a string, in My Account -> Account Keys);

(2) register the client application (that is, the app using the API), where you can create your own Client ID and Name; you also have to enter a redirect URI, which should be a valid URL (like "http://www.....");

(3)  to get an access token, make a POST request to:
For example,
curl -d 'grant_type=client_credentials&client_id=Client ID&client_secret=Client  Secret&scope=' | python -m json.tool
The returned data is in JSON format, in which you can find your temporary access token (a string, which is the value of the key "access_token").

(4) make POST requests to the Bing translator API with the access token to ask Bing to translate texts;
For example,
curl -H 'Authorization:bearer Your-Access-Token'  ''

(5) the access token expires 10 minutes after it is issued, so go back to (3) to renew it whenever it has expired.
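
The renewal in step (5) is easy to automate with a small cache. The Python sketch below is independent of the actual HTTP calls: fetch_token is a caller-supplied function that performs the POST request of step (3) and returns the token string. The class name TokenCache and the 30-second safety margin are assumptions made for illustration.

```python
import time


class TokenCache(object):
    """Cache an access token and fetch a new one shortly before it expires."""

    def __init__(self, fetch_token, lifetime=600, margin=30, clock=time.time):
        self._fetch_token = fetch_token   # callable performing the POST of step (3)
        self._lifetime = lifetime         # 10 minutes, as stated above
        self._margin = margin             # renew a little early (assumed value)
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, renewing it only when necessary."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime
        return self._token
```

Each translation request in step (4) then calls cache.get() for the value of the Authorization header, and a new token is requested only about once every 10 minutes.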

Wednesday, April 18, 2012

Linux shell: stop ctrl+s

In a Linux shell, pressing ctrl+s pauses the terminal output (flow control), so the shell appears to be frozen until you press ctrl+q.
This may not be what you want.
To disable this feature, you can add one line to the .bashrc file in your home directory:

stty -ixon

Friday, March 2, 2012

Train huge language models

(1) using SRILM
(1.1) counting ngrams
Don't use ngram-count directly to count N-grams. Instead, use the make-batch-counts and merge-batch-counts scripts described in training-scripts(1). That way you can create N-gram counts limited only by the maximum file size on your system.
(1.2) training language models from ngram counts
You are likely to run out of memory either because of the size of ngram counts, or of the LM being built. The following are strategies for reducing the memory requirements for training LMs.
Assuming you are using Good-Turing or Kneser-Ney discounting, don't use ngram-count in "raw" form. Instead, use the make-big-lm wrapper script described in the training-scripts(1) man page.
Switch to using the "_c" or "_s" versions of the SRI binaries. For instructions on how to build them, see the INSTALL file. Once built, set your executable search path accordingly, and try make-big-lm again.
Raise the minimum counts for N-grams included in the LM, i.e., the values of the options -gt2min, -gt3min, -gt4min, etc. The higher order N-grams typically get higher minimum counts.
Get a machine with more memory. If you are hitting the limitations of a 32-bit machine architecture, get a 64-bit machine and recompile SRILM to take advantage of the expanded address space. (The MACHINE_TYPE=i686-m64 setting is for systems based on 64-bit AMD processors, as well as recent compatibles from Intel.) Note that 64-bit pointers will require a memory overhead in themselves, so you will need a machine with significantly, not just a little, more memory than 4GB.
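
The idea behind batch counting in (1.1) can be illustrated in a few lines of Python. This is only a sketch of the principle, not of how the SRILM scripts are implemented: the real scripts write the per-batch counts to files and merge them on disk, while this toy version merges in memory. All function names are hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, order):
    """Count all n-grams of the given order in an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


def batch_counts(sentences, order, batch_size):
    """Yield one Counter per batch, so that no single step sees the whole corpus."""
    batch = []
    for tokens in sentences:
        batch.append(tokens)
        if len(batch) == batch_size:
            yield count_ngrams(batch, order)
            batch = []
    if batch:
        yield count_ngrams(batch, order)


def merge_counts(partial_counts):
    """Sum the per-batch counts into the final count table."""
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)
    return merged


corpus = [["a", "b", "a", "b"], ["b", "a", "b"], ["a", "b"]]
merged = merge_counts(batch_counts(corpus, order=2, batch_size=2))
print(merged)
```

Because counting is additive, the merged result is identical to counting the whole corpus in one pass, which is why the work can be split into batches.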

(2) using IRSTLM

Training a language model from huge amounts of data can be very expensive in both memory and time. The IRSTLM toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. IRSTLM is open source and can be downloaded from here.

Typically, LM estimation starts with the collection of n-grams and their frequency counters. Then, smoothing parameters are estimated for each n-gram level; infrequent n-grams are possibly pruned and, finally, a LM file is created containing n-grams with probabilities and back-off weights. This procedure can be very demanding in terms of memory and time if applied to huge corpora. IRSTLM provides a simple way to split LM training into smaller and independent steps, which can be distributed among independent processes.

The procedure relies on a training script that makes little use of computer memory and implements the Witten-Bell smoothing method. (An approximation of the modified Kneser-Ney smoothing method is also available.) First, create a special directory stat under your working directory, where the script will save lots of temporary files; then, simply run the script as in the example: -i "gunzip -c corpus.gz" -n 3 -o train.irstlm.gz -k 10 

The script builds a 3-gram LM (option -n) from the specified input command (-i), by splitting the training procedure into 10 steps (-k). The LM will be saved in the output (-o) file train.irstlm.gz with an intermediate ARPA format. This format can be properly managed through the compile-lm command in order to produce a compiled version or a standard ARPA version of the LM.

For a detailed description of the procedure and of other commands available under IRSTLM please refer to the user manual supplied with the package.

Thursday, February 23, 2012

Moses: recaser issues

Recently, I have been trying to set up a Moses-based MT demo.
I found that moses/scripts/recaser/recase.perl actually does several things besides using Moses to translate uncased text into cased text:
(1) by default, the moses.ini configuration file of the MT system used for recasing sets distortion-limit to 6, which allows reordering; the recase.perl script changes the distortion limit to 1 by passing the option "-dl 1" to the Moses decoder;
(2) the recase.perl script also applies some rules, e.g., for English, it always keeps some specific words ("a","after","against","al-.+","and","any","as","at","be","because","between","by","during","el-.+","for","from","his","in","is","its","last","not","of","off","on","than","the","their","this","to","was","were","which","will","with") in lower case;
(3) the script also uppercases the initial word of a sentence.

Monday, February 20, 2012

Moses: pruning phrase tables

According to the page:

(1) I first downloaded the source code of SALM from:
then I went to the directory:
and ran the commands:
make allO32
make allO64
(There was an error: make: *** No rule to make target `../../Bin/Linux/Search/SampleNGramIns.O32', needed by `allO32'. )
Note that I compiled SALM with g++-4.1; I had tried g++-4.4, but it failed.

(2) I found that the latest Moses obtained with git has no sub-directory named sigtest-filter, so I copied sigtest-filter from an older version of Moses obtained with svn.
I went to the directory sigtest-filter, and ran the command:
make SALMDIR=/path/to/SALM
(using g++-4.4)

Friday, February 10, 2012

Python: multi threading problem

The canonical implementation of the Python programming language is written in C. The term “CPython” is used when necessary to distinguish this implementation from others such as Jython or IronPython.

In CPython/Python, there is an important lock named global interpreter lock (GIL), which is the mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines. Past efforts to create a “free-threaded” interpreter (one which locks shared data at a much finer granularity) have not been successful because performance suffered in the common single-processor case. It is believed that overcoming this performance issue would make the implementation much more complicated and therefore costlier to maintain.

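The GIL effectively prevents threads from running Python code in parallel. It is mainly a problem for CPU-bound work in Python; it is not a big problem for I/O-bound threads, which release the GIL while they wait.

Some ways to work around the GIL:
(1) Jython has no GIL;
(2) Cython, which can release the GIL around code that does not touch Python objects;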
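
The claim about I/O-bound threads can be checked with a small experiment. In the Python sketch below, time.sleep stands in for a blocking I/O call (an assumption made to keep the example self-contained); like real blocking I/O, it releases the GIL while waiting, so four waits of 0.2 s overlap instead of adding up to 0.8 s.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(seconds):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL while waiting."""
    time.sleep(seconds)
    return seconds


def run_in_threads(n_tasks, seconds):
    """Run n_tasks blocking calls in n_tasks threads; return (elapsed time, results)."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        results = list(pool.map(fake_io, [seconds] * n_tasks))
    return time.time() - start, results


elapsed, results = run_in_threads(4, 0.2)
print("4 x 0.2 s of waiting took %.2f s in threads" % elapsed)
```

Replacing fake_io with a pure-Python computation would not show the same overlap in CPython, because only one thread can execute Python bytecode at a time.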

Thursday, February 9, 2012

Compiling latest Moses from git

When I used the command:
./bjam --with-srilm=srilm-1.6.0 --with-irstlm=irstlm-5.70.04 --with-giza=giza-pp --with-boost=boost_1_48_0 -j1
to compile the latest Moses checked out with git,
I got the following errors:
gcc.compile.c++ moses/src/LM/bin/gcc-4.4.1/release/debug-symbols-on/link-static/threading-multi/Factory.o
In file included from moses/src/LM/ORLM.h:8,
from moses/src/LM/Factory.cpp:41:
moses/src/DynSAInclude/onlineRLM.h:22: error: reference to ‘Vocab’ is ambiguous
moses/src/LM/SRI.h:33: error: candidates are: struct Vocab
moses/src/DynSAInclude/vocab.h:17: error: class Moses::Vocab

My solution is to replace all the "Vocab" with "Moses::Vocab" in moses/src/DynSAInclude.

Sunday, February 5, 2012

Python: buffering problem when using 'for line in sys.stdin'

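Recently, I found a buffering problem when using the following Python code:
import sys
for line in sys.stdin:
    print(line)

With this code, after I type a sentence into the terminal, I get no output, because in Python 2 the file iterator reads ahead into an internal buffer before yielding any line.

After investigating for a while, I came to the following solution (using readline() instead):
import sys
line = ' '
while len(line) != 0:
    line = sys.stdin.readline()
    print(line)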
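
The same readline()-based loop can be written more compactly with the two-argument form of iter(), which keeps calling a function until it returns a sentinel value. The sketch below wraps it in a function (the name echo_lines is hypothetical) that takes the streams as parameters so that it can be tried without a terminal.

```python
import sys


def echo_lines(stream=None, out=None):
    """Echo each line as soon as it is read, stopping at end of input.

    iter(stream.readline, "") calls readline() repeatedly until it returns
    the empty string, which is how readline() reports end of file.
    """
    stream = sys.stdin if stream is None else stream
    out = sys.stdout if out is None else out
    for line in iter(stream.readline, ""):
        out.write(line)
        out.flush()   # push the output out immediately as well
```

Calling echo_lines() with no arguments reads from standard input and echoes every line right after it is entered.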

Saturday, February 4, 2012

Perl bug: splitting UTF-8 encoded Chinese string

I found a bug in Perl when I used the regular expression /\s+/ to split the UTF-8 encoded Chinese string "我想去你家,可以吗?我还想去月球,你想去吗?".
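
A common cause of this kind of symptom, assumed here rather than confirmed for the Perl case above, is that the UTF-8 bytes are matched as if each byte were a separate character. The UTF-8 encoding of "你" is E4 BD A0, and byte A0 read as a single character is a non-breaking space, which whitespace patterns may match. The Python sketch below reproduces the effect by deliberately mis-decoding the bytes as Latin-1.

```python
import re

text = "我想去你家,可以吗?我还想去月球,你想去吗?"

# Correctly decoded text contains no whitespace, so nothing is split.
clean_parts = re.split(r"\s+", text)

# Simulate reading the UTF-8 bytes as if each byte were one Latin-1 character.
misread = text.encode("utf-8").decode("latin-1")
broken_parts = re.split(r"\s+", misread)

print(len(clean_parts), len(broken_parts))
```

If this is indeed the cause, the remedy is to decode the input as UTF-8 before applying the regular expression, so that the pattern sees characters rather than bytes.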

Thursday, January 19, 2012

Moses phrase-based decoder analysis

(1). start from moses-cmd/src/Main.cpp (int main(int argc, char* argv[]))

(2). Main.cpp first calls parameter->LoadParam(argc, argv) to load and check the parameters given in the moses.ini configuration file and on the command line; the model files are not loaded at this point

(3). Main.cpp then calls StaticData::LoadDataStatic(parameter) to load weights and models according to the parameters of (2)
(3.1) StaticData::LoadDataStatic(parameter) calls StaticData::LoadData(Parameter *parameter)
(3.1.1) in StaticData::LoadData(Parameter *parameter), we load the weights and models by calling, e.g., StaticData::LoadLanguageModels(), LoadPhraseTables()
(3.1.1.1) StaticData::LoadLanguageModels() calls LanguageModel* CreateLanguageModel(LMImplementation lmImplementation, const std::vector &factorTypes, size_t nGramOrder, const std::string &languageModelFile, float weight, ScoreIndexManager &scoreIndexManager , int dub) to create LM instances, where the highest level LM class is class LanguageModel : public StatefulFeatureFunction; LanguageModel is the parent class of LanguageModelSingleFactor and LanguageModelMultiFactor; LanguageModelInternal is a subclass of LanguageModelSingleFactor;
In Moses, the major specific interfaces of LM classes like LanguageModelInternal are: bool load(...) and float GetValue(const std::vector &contextFactor, State* finalState = 0, unsigned int* len = 0) const, where the former is used to load an LM file while the latter calculates the probability of an n-gram stored in contextFactor; the class LanguageModel implements the general interface of a feature function, e.g., Evaluate(..)

(4). Main.cpp uses IOWrapper *ioWrapper = GetIODevice(staticData) to setup the input device (an input file or standard input)

(5). Main.cpp uses vector weights = staticData.GetAllWeights() to check on weights

(6). Main.cpp starts the main loop of translating input instances (text, confusion network, or lattice):
(6.1). use ReadInput(*ioWrapper,staticData.GetInputType(),source) to load an input, which is saved in source
(6.2). set up the translation manager by calling Manager manager(*source, staticData.GetSearchAlgorithm()), where calling staticData.InitializeBeforeSentenceProcessing(source) initializes the translation/language models for this sentence; the language model list is StaticData.m_languageModel; the default search algorithm is SearchNormal;
(6.3). expand translation hypotheses stack by stack until the end of the input sentence using manager.ProcessSentence()
(6.3.1). ProcessSentence() first resets the statistics using staticData.ResetSentenceStats(m_source)
(6.3.2). ProcessSentence() then collects translation options for the input sentence
(6.3.3). ProcessSentence() calls the search algorithm to process the input using m_search->ProcessSentence()
(6.4). pick the best translation (maximum a posteriori decoding)

Sunday, January 15, 2012

How to install Ruby in your local directory from source code

tar -xzvf ruby-1.9.3-p0.tar.gz
cd ruby-1.9.3-p0
./configure --prefix=$HOME
make install

Saturday, January 14, 2012

Drawing figures with GNUplot

Recently, I needed to draw a curve for a paper written in LaTeX. One of my friends suggested gnuplot, which is really a nice tool for drawing curves and other figures.
There are a lot of helpful examples on wikimedia:

How to burn CN image onto a DVD disc using NERO


2. in the popup Window, click on the tab whose title is ISO.

3. in the ISO tab, click the button OPEN to select the image that you want to burn to the disc.

4. after your selection, you will return to the original window;
the top left corner of the window says CD;
this is a drop-down menu, and you need to click it and select DVD instead.

5. finish the burning process as usual.