Scalable k-means software and test datasets

 
October 2000


This package (a Unix tar file, gzipped) contains the source code for the software that was used to run the experiments for the article

F. Farnstrom, J. Lewis and C. Elkan, "Scalability for Clustering Algorithms Revisited", In ACM SIGKDD Explorations, vol. 2, no. 1, pp. 51-57, August 2000.

The source code has been released to the public so that others can verify our results and try different algorithms for clustering large databases. Since this software from the beginning was written only for our own use, its usability may be lacking, and the source code is currently in general not adequately commented. If you find the software useful and would like to see a more userfriendly version of it, let us know and we will consider putting more work into it.

If you publish scientific work based upon this software, please cite the above article.

Questions about compiling and using the software can be directed to Fredrik Farnstrom through E-mail to fredrikf@mail.com or by writing to

Fredrik Farnstrom
Skolvagen 44
373 00 Jamjo
SWEDEN

More general questions or comments about scalable clustering algorithms should be directed to Dr. Charles Elkan at elkan@cs.ucsd.edu

This software has been released under the GNU General Public License.  For more information about this, please read the file 'LICENSE'.
 

COMPILING

On a UNIX-like platform with the Gnu C++ Compiler (g++), everything that needs to be done is to type 'make'. To use another commandline-based compiler or to change compiler options, edit the file 'Makefile'. If there's a need to compile the program under some other platform or compiler environment, see 'Makefile' for information about how the object files should be linked together into executables.

Typing 'make clean' will remove executables and object files created during compiling. As this software is of an experimental nature, there's no 'make install' or 'configure' available.

Standard C libraries and a math library are needed for compilation. In addition, the software uses the functions srand48() and drand48() for random number generation. If these are not available, they can be replaced by other functions, satisfying:

// Return a non-negative double-precision floating-point value uniformly
// distributed between [0.0, 1.0).
double drand48(void);

// Set seed value for random number generation.
void srand48(long int seedval);

For measuring the running time of the programs, the clock() function is used. This conforms to ANSI C and should thus not cause any problems, if it does anyway, it's only used by the main functions, so just replace it with some other function to measure running time.

The software has been successfully compiled on the Intel/Linux and SPARC/Solaris platforms.
 
 

CLASSES

The software was mainly designed using an object-oriented approach, and was implemented in C/C++.

Below is a table summarizing the classes used by the main software.

Class name  Declared in     Implemented in  Description of purpose
----------  -----------     --------------  ----------------------
Singleton   singleton.h     singleton.cpp   Stores a single point and some
                                            extra information about it.
Subcluster  subcluster.h    subcluster.cpp  Stores sufficient statistics
                                            about a cluster. Has functions
                                            to compute distances and to
                                            perturb cluster means.
MainCluster subcluster.h    subcluster.cpp  Inherits Subcluster
Database    database.h      -               Abstract base class for
                                            interfaces to databases.
SyntheticData syntheticdata.h syntheticdata.cpp Inherits Database. Creates
                                            synthetic data from Gaussian
                                            distributions.
KddCupData  kddcup98data.h  kddcup98data.cpp A subclass to Database,
                                            reading data from the KDD
                                            Cup '98 dataset.
RAMBuffer   rambuffer.h     rambuffer.cpp   A buffer storing points and
                                            clusters, making sure that the
                                            memory they occupy is limited.
KMeans      kmeans.h        kmeans.cpp      An implementation of the
                                            regular K-means algorithm.
BufferedKMeans kmeans.h     kmeans.cpp      Regular K-means operating on
                                            a fixed size buffer.
ScalableKMeans scalablekmeans.h sc..ns.cpp  The implementation of the
                                            scalable algorithms, incl.
                                            Bradley et al and our
                                            "simple algorithm".
 
 

RUNNING EXPERIMENTS ON KDD CUP '98 DATA

To run experiments on data from KDD Cup '98, do the following:

RUNNING EXPERIMENTS ON SYNTHETIC DATA

RUNNING EXPERIMENTS ON OTHER DATASETS

UTILITIES

kdd

Reads the data file from KDD Cup '98. To use, download the dataset 'cup98lrn.txt', as of October 2000 available in compressed form from http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html.  Load it in a text editor, remove the first line that contains the labels, and save it as 'cup98lrnB.txt'. Now, run 'kdd'. This extracts some of the features from the dataset and writes them to a new file 'cup98.dat'. 'cup98lrn.txt' and 'cup98lrnB.txt' can now be deleted if desired.

kdd2

Reads the file 'cup98.dat' generated by 'kdd' as above. Computes statistics for the features and performs normalization, saving the new dataset as 'cup98data.txt'. I don't think this is used for anything, but the program also saves the statistics to the file 'cup98msd.txt', and this is needed by 'kddcup98data.cpp'.  Observe the constants at the beginning of 'kdd2.c', these may need to be changed if other features are used, if running on a subset of the cup98lrn dataset, or if running on the validation set (cup98val).

scramble / scramble2

'scramble' reads the file 'cup98.dat' created by 'kdd'. A random (seed from clock) permutation is applied to the records of the dataset, and the new dataset is written to 'cup98b.dat', with floating point numbers stored in ASCII format. 'scramble2' is identical, except that it writes to a file called 'cup98data.dat' and saves the floating point numbers in a binary format instead.
 
 

Written by Fredrik Farnstrom, October 8, 2000
 

elkan@cs.ucsd.edu