You must have an account with the Fulton Supercomputing Lab (FSL). If you do not have an account, you can apply here. Take care to put the whole project in the correct directory: as of now (11/10/07), the project must be in your compute directory to use the batch queue, and in your home directory to use the hdx queue.

Getting Started

  • When you log in, create a separate folder for your alfa stuff:
    mkdir alfa
  • Move into the directory.
    cd alfa
  • Do an svn checkout inside of your new alfa directory and put it in an ALFA directory.
    svn checkout ALFA
  • Move into the ALFA directory.
    cd ALFA
  • Make a data folder:
    mkdir data
  • Place any data you need into this folder, such as PTB, BNC, or others. The datasets can be found on entropy.
  • Send the script to the supercomputer to run.
    python scripts/ -t 1 -P 100 -a 1 -cBaseline -m 1 -n1 -v -dPTB -xActiveLearner.xml
  • This command performs a single run (-n1) of the baseline (-cBaseline) on the full Penn Treebank (-dPTB -P 100).

Script Parameters

Although the code is largely self-documenting, this format allows for lengthier explanations.

-v    --verbose     Prints more messages to the screen than normal
-d    --dataset     The dataset you are using for the experiment. We currently have PTB, Syriac, and BNC
-x    --xml         The xml file used to start the launch. Usually it's either ActiveLearner.xml or MultiTagActiveLearner.xml
-m    --models      Only used for QBC experiments. If not running QBC, set the number of models to 1 (-m 1).
-P    --trainper    The percentage of the file allTraining.txt to be used as training data.
                      allTraining.txt is found in ALFA/data/dataset/, where dataset is PTB, BNC, or Syriac.
                      The data that makes up this percentage is chosen randomly.
-T    --traintype   The type used to split this percent (either words or sentences). With multiple runs and sufficient data,
                     the split should be about equal, so we usually use sentences.
-a    --amount      The amount of data that starts out as annotated. This amount is either a percentage or a hard number of
                     words or sentences (see -p and -i).
-p    --use_percent Whether or not the --amount parameter is using a percentage of data, or a number of words or sentences.
-i    --inittype    The type of unit (word or sentence) that --amount counts. For example, -iword -a50 starts with at least
                     50 words of annotated data (we don't cut any sentences in half, so we take the fewest sentences that
                     contain at least 50 words). -isentence -a1 -p means start with 1 percent of the sentences as annotated data.
                     We typically start with one sentence (-a1 -isentence).
-s    --batchsize   The size of the batch query. This is how many sentences we give to the oracle each iteration. 
-b    --batchtype   The type we give to the oracle. This is either word or sentence
-c    --comp        The main algorithm used to find uncertainty. For example: QBU, LS, QBC, Baseline, etc.
-n    --numtests    The number of runs of each experiment. Since we typically average 5 runs, -n is usually set to 5.
-t    --time        The estimated time the experiment will take. It's generally good to overshoot, since the supercomputer will
                     terminate any process that goes over the specified time (in hours?).
-f    --filename    Use this if you want to change the filename of the experiments.
-C    --candidates  The number of candidates from which the batch will be chosen. The default is -1, which makes all
                     possible sentences candidates. To run an experiment similar to the Engelson and Dagan paper,
                     you'd set -C1000 -s100 -bsentence (I believe).
-O    --switchover  The number of iterations after which you will switch to the random baseline.
-G    --stopping    The number of iterations after which you will stop the program.
-o    --outdir      The main output directory. The default is "out/". This should (if it matters) end with a slash.
-B    --switchbase  Whether the switchover point switches to the baseline or keeps the last model without training.
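
The rule described under --inittype (never cut a sentence in half, and take the fewest sentences that reach the requested word count) can be illustrated with a short sketch. This is not the ALFA implementation: the function name initial_annotated_set and the toy data are hypothetical, sentences are taken in the order given (the real code may shuffle or sample first), and the -p percentage mode is not handled.

```python
def initial_annotated_set(sentences, amount, init_type="word"):
    """Return the shortest prefix of whole sentences that meets `amount`.

    Hypothetical helper for illustration only, not part of ALFA.
    sentences: list of sentences, each a list of word tokens.
    init_type: "word" counts tokens, "sentence" counts sentences.
    """
    if init_type == "sentence":
        return sentences[:amount]
    chosen, words = [], 0
    for sentence in sentences:
        if words >= amount:
            break  # enough words already, so stop before adding more
        chosen.append(sentence)  # always add the whole sentence
        words += len(sentence)
    return chosen


# Made-up toy pool of three tokenized sentences.
pool = [["The", "dog", "barked", "."], ["It", "ran", "."], ["We", "left", "."]]
seed = initial_annotated_set(pool, 5, "word")
print(len(seed), sum(len(s) for s in seed))  # 2 7
```

Asking for 5 words yields two sentences with 7 words in total, because the first sentence alone has only 4 words and the second is added whole.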

Additional Parameters for Multi-tag options

-S    --subtags     The subtag indices used for a particular run. So, if I want to run a POS Tagger just considering the first subtag,
                     I'd add -S0 to the command line.
-D    --delimeter   The character that separates the subtags. For Syriac, the delimiter is #.
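
As a rough illustration of how -S and -D might interact, the sketch below splits a tag on the delimiter and keeps only the requested subtag indices. The function select_subtags and the example tag are made up for this page; they are not taken from the ALFA code or from the Syriac data.

```python
def select_subtags(tag, indices, delimiter="#"):
    """Keep only the subtags at `indices`, rejoined with the delimiter.

    Hypothetical helper for illustration only, not part of ALFA.
    """
    parts = tag.split(delimiter)
    return delimiter.join(parts[i] for i in indices)


full_tag = "NOUN#MASC#SG"  # made-up tag, not real Syriac data
print(select_subtags(full_tag, [0]))     # NOUN  (like -S0)
print(select_subtags(full_tag, [0, 2]))  # NOUN#SG
```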
