Perl Scripts
CONVECS - find the 15 most similar words for each of the 500 most frequent words in a corpus, based on the cosine between their context vectors (7.93 kb)
HMM - find the probability of a string using a randomly generated arc-emission hidden Markov model and the forward and backward algorithms (alpha and beta) (10.83 kb)
LANGCE - predict the language of a test corpus based on cross entropy values calculated using language symbol distributions generated from training corpora (7.10 kb)
LANGDIST - generate language symbol distributions from training corpora (3.98 kb)
STRALIGN - find the best alignment between two strings (14.01 kb)
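As a rough illustration of the idea behind LANGDIST and LANGCE, the sketch below builds a symbol distribution per language from training text and then picks the language whose distribution gives the test text the lowest cross entropy. It is written in Python rather than Perl and is not the code of the listed scripts. The function names, the choice of single characters as symbols, the toy training strings, and the floor probability for unseen symbols are all assumptions made for the example.

```python
import math
from collections import Counter


def symbol_distribution(text):
    """Relative frequency of each character in a training text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}


def cross_entropy(test_text, dist, floor=1e-6):
    """Average negative log2 probability per symbol of test_text under dist.

    Symbols missing from dist get the probability `floor`; this is an
    assumed smoothing choice, not something the listed scripts document.
    """
    if not test_text:
        raise ValueError("test text is empty")
    total_bits = sum(-math.log2(dist.get(sym, floor)) for sym in test_text)
    return total_bits / len(test_text)


def predict_language(test_text, distributions):
    """Return the language with the lowest cross entropy, plus all scores."""
    scores = {
        lang: cross_entropy(test_text, dist)
        for lang, dist in distributions.items()
    }
    return min(scores, key=scores.get), scores


# Toy training data (hypothetical, far smaller than a real training corpus).
training = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "indonesian": "saya akan pergi ke pasar besok pagi dan membeli makanan untuk keluarga",
}
distributions = {lang: symbol_distribution(t) for lang, t in training.items()}

best, scores = predict_language("the dog and the fox", distributions)
print(best)
for lang, bits in sorted(scores.items()):
    print(f"{lang}: {bits:.2f} bits per symbol")
```

A lower cross entropy means the test text is less surprising under that language's distribution, so the minimum is taken as the prediction. With real corpora the distributions would be built once from the training files and reused for each test corpus.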
Support Files
convecs_few_types.txt (154.44 kb)
convecs_many_types.txt (216.85 kb)
corpus.txt (5809.42 kb)
english_test_corpus.txt (7.89 kb)
english_train_corpus.txt (12.55 kb)
english_train_distribution.txt (0.22 kb)
gene_1.txt (0.39 kb)
gene_2.txt (0.39 kb)
indonesian_test_corpus.txt (4.17 kb)
indonesian_train_corpus.txt (34.78 kb)
indonesian_train_distribution.txt (0.26 kb)
perl_test_corpus.txt (7.10 kb)
perl_train_corpus.txt (18.62 kb)
perl_train_distribution.txt (0.35 kb)
portuguese_test_corpus.txt (2.88 kb)
portuguese_train_corpus.txt (31.68 kb)
portuguese_train_distribution.txt (0.36 kb)
somali_test_corpus.txt (2.31 kb)
somali_train_corpus.txt (32.51 kb)
somali_train_distribution.txt (0.25 kb)
text_1.txt (0.29 kb)
text_2.txt (0.21 kb)