#########################################################################
############## Computational Linguistics Toolset v1.1.2 #################
#########################################################################
######## Copyright (C) 2005 Wybo Wiersma <s.wiersma01@chello.nl> ########
#########################################################################
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#########################################################################
# It is kindly requested that you acknowledge the use of these tools in
# publications reporting results produced with their help.
### What it is ###
The Computational Linguistics Toolset is a set of tools for computational linguistics. It contains re-usable code for cleaning, splitting, refining, and taking samples from corpora (ICE, Penn, and a native one), for tagging them using the TnT-tagger, and for doing permutation statistics on N-grams (useful for finding statistically significant syntactic differences between any two sets of tagged texts), as well as various examination tools.
The individual tools are well documented and versioned separately. Each time I make a significant new release of the entire package (this includes bugfixes), I will increase the version number above.
### What is required ###
Perl (tested here on 5.8.4)
The following modules are needed but come standard with Perl:
FileHandle
FindBin
List::Util
File::Basename
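To confirm that the modules listed above can be loaded by your Perl, a quick check along the following lines may help. This is only a sketch and not part of the toolset; the loop and the variable name `missing` are illustrative assumptions.

```shell
# Try to load each required module and report any that cannot be found.
# "missing" is a hypothetical helper variable, not something the toolset uses.
missing=0
for module in FileHandle FindBin List::Util File::Basename; do
    if perl -M"$module" -e 1 2>/dev/null; then
        echo "ok       $module"
    else
        echo "MISSING  $module"
        missing=1
    fi
done
```

If any line reports MISSING, the Perl installation is probably a stripped-down one and the full Perl distribution needs to be installed.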
At a minimum, you need to set the $configbasedir variable in the central config file (named config).
### What to find where ###
The dir-structure of the package is as follows:
tools/corpus/ the tools for preparing corpora
tools/examine/ tools for examining
tools/sensing/ tools for doing WordNet-related research
tools/tagging/ tools for tagging
tools/permstat/ tools for doing permutation-statistics
Two special dirs are:
tools/mess/ quick-hack scripts that have little general use
tools/export/ tools for exporting (tarring & publishing these tools)
This structure is not guaranteed to remain the same forever...
### How to use it (in the easiest way) ###
To get more info on what a script does, run it with the -? option.
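As an illustration, asking one of the tools for its help text might look like the sketch below. The script path is taken from the dir-structure and changelog in this file, and it is assumed that the command is run from the dir that contains tools/. The option is quoted because `?` is a wildcard character in most shells.

```shell
# Assumption: the current dir is the one that contains tools/.
script=tools/examine/rowstatter.pl
if [ -f "$script" ]; then
    # Quote the option so the shell does not treat "?" as a wildcard.
    perl "$script" '-?'
else
    echo "not found: $script (run this from the dir that contains tools/)"
fi
```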
The *goall scripts chain the tools together, which makes it possible to add corpora or to run many tasks in sequence.
To use the tools in sequence without changing the configuration files, follow these instructions:
1 Unpack the tools dir inside another dir, for example: research/
2 Store the raw corpus in the following dir within this base dir: (research/)corpusData/<corpusname>/raw
3 Some tools (like corpuslexiconreducer.pl) need lists of some sort. Store those in (research/)taskData/lists
4 Other dirs, like corpusData and the dirs within corpusData/<corpusname> (for example the cleaned/ dir), are created automatically by the tools when needed.
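The layout described in the steps above could be prepared roughly as follows. This is a minimal sketch: the names `research` and `mycorpus` are placeholders for your own base dir and corpus name, and unpacking the package itself is left as a comment.

```shell
# Placeholder names; adjust both to your own setup.
base=research
corpus=mycorpus

# Step 1: create the base dir; unpack the package here so that $base/tools exists.
mkdir -p "$base"

# Step 2: the raw corpus files go into this dir.
mkdir -p "$base/corpusData/$corpus/raw"

# Step 3: lists needed by some tools (e.g. corpuslexiconreducer.pl) go into this dir.
mkdir -p "$base/taskData/lists"
```

Using `mkdir -p` means the commands can be re-run safely, and it does no harm if the tools have already created some of the dirs themselves.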
You can always modify the *goall scripts to suit the needs of your particular research. However, things may change between versions of this tool-package, so have a look at the changelog before overwriting your current install.
Better still, drop me a note if you are using this toolset, so I can keep an eye on possible update problems (although of course I cannot accept any formal liability)...
### Changelog ###
1.1.1 -> 1.1.2
Added PermStatResultSelector as a proper tool
Added multiple ipnorm normalization rounds for extra precision
Fixed a few minor bugs
1.1.0 -> 1.1.1
Fixed and updated (sentence-length counting):
examine/rowstatter.pl
Also added some library functions.
1.0.5 -> 1.1.0
Added the following tools:
corpus/corpus2tagrow.pl
corpus/corpusrewritetagrow.pl
sensing/sensinggoall.pl
sensing/sentencesenser.pl
sensing/semanticgravitor.pl
Updated the following tools:
sensing/wordcombinationfinder.pl
- fixed a bug that caused some word-combinations not to be found
- changed the default window-size to 5
sensing/listsenser.pl
- implemented the option for using an existing database
- changed the database-format to cdb
