Natural Language Processing (NLP) Software

This site contains a collection of NLP software originally written for CMSC 25020 - Computational Linguistics at the University of Chicago. The software is provided free of charge, with no guarantees. Feel free to modify and/or distribute it. If you have found any of this useful, drop me (James BonTempo) a line: bontempo_AT_finitestate.net

Perl Scripts

CONVECS - find the 15 most similar words for each of the 500 most frequent words in a corpus, based on the cosine between their context vectors (7.93 kb)
HMM - find the probability of a string using a randomly generated arc-emission hidden Markov model and the forward and backward algorithms (alpha and beta) (10.83 kb)
LANGCE - predict the language of a test corpus based on cross-entropy values calculated using language symbol distributions generated from training corpora (7.10 kb)
LANGDIST - generate language symbol distributions from training corpora (3.98 kb)
STRALIGN - find the best alignment between two strings (14.01 kb)
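
For readers unfamiliar with the idea behind CONVECS, the following is a minimal sketch of cosine similarity between context vectors. It is written in Python for illustration and is not the Perl script itself. The function names, the whitespace-tokenized input, and the one-word context window are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it.

    The window size is an assumption; the original script may define context differently.
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, top_n=15):
    """Return up to `top_n` (other_word, cosine) pairs, most similar first."""
    target = vectors.get(word, Counter())
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Sort by descending similarity, breaking ties alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]
```

In a toy token sequence such as "the cat sat the dog sat", the words "cat" and "dog" share the same neighbors, so most_similar("cat", ...) ranks "dog" first.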

Support Files

convecs_few_types.txt (154.44 kb)
convecs_many_types.txt (216.85 kb)
corpus.txt (5809.42 kb)
english_test_corpus.txt (7.89 kb)
english_train_corpus.txt (12.55 kb)
english_train_distribution.txt (0.22 kb)
gene_1.txt (0.39 kb)
gene_2.txt (0.39 kb)
indonesian_test_corpus.txt (4.17 kb)
indonesian_train_corpus.txt (34.78 kb)
indonesian_train_distribution.txt (0.26 kb)
perl_test_corpus.txt (7.10 kb)
perl_train_corpus.txt (18.62 kb)
perl_train_distribution.txt (0.35 kb)
portuguese_test_corpus.txt (2.88 kb)
portuguese_train_corpus.txt (31.68 kb)
portuguese_train_distribution.txt (0.36 kb)
somali_test_corpus.txt (2.31 kb)
somali_train_corpus.txt (32.51 kb)
somali_train_distribution.txt (0.25 kb)
text_1.txt (0.29 kb)
text_2.txt (0.21 kb)
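
The training and test corpora above are the inputs to LANGDIST and LANGCE. As a rough illustration of the approach those descriptions name, here is a minimal Python sketch of language prediction by cross entropy. It is not the Perl code, and it assumes that a "symbol" is a single character, that unseen symbols receive a small fixed floor probability, and that the function names are hypothetical. The format of the *_train_distribution.txt files is not assumed; distributions are built directly from strings.

```python
import math
from collections import Counter


def symbol_distribution(text):
    """Relative frequency of each character in `text` (assumes symbol = character)."""
    counts = Counter(text)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


def cross_entropy(test_text, model, floor=1e-6):
    """Average number of bits per symbol needed to encode `test_text` under `model`.

    Symbols absent from the model get probability `floor`, an assumed
    smoothing choice that avoids taking the log of zero.
    """
    if not test_text:
        raise ValueError("test text is empty")
    total_bits = 0.0
    for symbol in test_text:
        total_bits -= math.log2(model.get(symbol, floor))
    return total_bits / len(test_text)


def predict_language(test_text, models):
    """Return the language whose distribution gives the lowest cross entropy."""
    return min(models, key=lambda lang: cross_entropy(test_text, models[lang]))
```

A possible usage, with file names taken from the list above: build one model per language with symbol_distribution(open("english_train_corpus.txt").read()), collect the models in a dictionary keyed by language name, and pass the contents of a test corpus to predict_language.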