This package (a Unix tar file, gzipped) contains the source code for the software that was used to run the experiments for the article
F. Farnstrom, J. Lewis and C. Elkan, "Scalability for Clustering Algorithms Revisited", In ACM SIGKDD Explorations, vol. 2, no. 1, pp. 51-57, August 2000.
The source code has been released to the public so that others can verify our results and try different algorithms for clustering large databases. Since this software from the beginning was written only for our own use, its usability may be lacking, and the source code is currently in general not adequately commented. If you find the software useful and would like to see a more userfriendly version of it, let us know and we will consider putting more work into it.
If you publish scientific work based upon this software, please cite the above article.
Questions about compiling and using the software can be directed to Fredrik Farnstrom through E-mail to fredrikf@mail.com or by writing to
Fredrik Farnstrom
Skolvagen 44
373 00 Jamjo
SWEDEN
More general questions or comments about scalable clustering algorithms should be directed to Dr. Charles Elkan at elkan@cs.ucsd.edu
This software has been released under the GNU General Public License.
For more information about this, please read the file 'LICENSE'.
Typing 'make clean' will remove executables and object files created during compiling. As this software is of an experimental nature, there's no 'make install' or 'configure' available.
Standard C libraries and a math library are needed for compilation. In addition, the software uses the functions srand48() and drand48() for random number generation. If these are not available, they can be replaced by other functions, satisfying:
// Return a non-negative double-precision floating-point value uniformly
// distributed between [0.0, 1.0).
double drand48(void);
// Set seed value for random number generation.
void srand48(long int seedval);
For measuring the running time of the programs, the clock() function is used. This conforms to ANSI C and should thus not cause any problems, if it does anyway, it's only used by the main functions, so just replace it with some other function to measure running time.
The software has been successfully compiled on the Intel/Linux and SPARC/Solaris
platforms.
Below is a table summarizing the classes used by the main software.
Class name Declared in Implemented
in Description of purpose
---------- ----------- --------------
----------------------
Singleton singleton.h singleton.cpp
Stores a single point and some
extra information about it.
Subcluster subcluster.h subcluster.cpp
Stores sufficient statistics
about a cluster. Has functions
to compute distances and to
perturb cluster means.
MainCluster subcluster.h subcluster.cpp
Inherits Subcluster
Database database.h
-
Abstract base class for
interfaces to databases.
SyntheticData syntheticdata.h syntheticdata.cpp Inherits Database.
Creates
synthetic data from Gaussian
distributions.
KddCupData kddcup98data.h kddcup98data.cpp A subclass
to Database,
reading data from the KDD
Cup '98 dataset.
RAMBuffer rambuffer.h rambuffer.cpp
A buffer storing points and
clusters, making sure that the
memory they occupy is limited.
KMeans kmeans.h
kmeans.cpp An implementation of the
regular K-means algorithm.
BufferedKMeans kmeans.h kmeans.cpp
Regular K-means operating on
a fixed size buffer.
ScalableKMeans scalablekmeans.h sc..ns.cpp The implementation
of the
scalable algorithms, incl.
Bradley et al and our
"simple algorithm".
Reads the data file from KDD Cup '98. To use, download the dataset 'cup98lrn.txt', as of October 2000 available in compressed form from http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html. Load it in a text editor, remove the first line that contains the labels, and save it as 'cup98lrnB.txt'. Now, run 'kdd'. This extracts some of the features from the dataset and writes them to a new file 'cup98.dat'. 'cup98lrn.txt' and 'cup98lrnB.txt' can now be deleted if desired.
kdd2
Reads the file 'cup98.dat' generated by 'kdd' as above. Computes statistics for the features and performs normalization, saving the new dataset as 'cup98data.txt'. I don't think this is used for anything, but the program also saves the statistics to the file 'cup98msd.txt', and this is needed by 'kddcup98data.cpp'. Observe the constants at the beginning of 'kdd2.c', these may need to be changed if other features are used, if running on a subset of the cup98lrn dataset, or if running on the validation set (cup98val).
scramble / scramble2
'scramble' reads the file 'cup98.dat' created by 'kdd'. A random (seed
from clock) permutation is applied to the records of the dataset, and the
new dataset is written to 'cup98b.dat', with floating point numbers stored
in ASCII format. 'scramble2' is identical, except that it writes to a file
called 'cup98data.dat' and saves the floating point numbers in a binary
format instead.
Written by Fredrik Farnstrom, October 8, 2000