Classification/Clustering Datasets

The current way to get the datasets, including the data, the split indices, and the associated scripts, is through a dedicated Subversion repository.

svn checkout .

This checks out the indices and scripts from the repository, but not the actual data. Each dataset is a directory organized as follows:

dataset_root +
             |- README
             |- indices +
             |          |-
             |          |-...
             |- scripts +
The scripts/getData script is set up to copy the directories containing the corresponding data files to the dataset_root directory. After running the script, the dataset should be ready to use.
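As a rough illustration of the layout described above, the following Python sketch checks whether a checked-out dataset root contains the expected entries. The helper names (missing_entries, has_get_data_script) are hypothetical and are not part of the repository's scripts; only the entry names README, indices, scripts, and scripts/getData come from the description above.

```python
from pathlib import Path

# Entry names taken from the directory diagram above.
# The helper functions themselves are hypothetical illustrations.
EXPECTED_ENTRIES = ("README", "indices", "scripts")


def missing_entries(dataset_root):
    """Return the expected entries that are absent from dataset_root."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def has_get_data_script(dataset_root):
    """Report whether scripts/getData exists as a file under dataset_root."""
    return (Path(dataset_root) / "scripts" / "getData").is_file()
```

An empty list from missing_entries suggests the checkout has the expected layout. Whether the data itself is in place still depends on having run scripts/getData.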


This is a descendant of the dataset harvested from Enron's mail server after it was used as evidence against the company and subsequently put into the public domain. Originally, no topic labels were available for the Enron dataset, so it was appropriate only for unsupervised methods. The LDC has since released a set of annotations for the dataset, so the annotated subset can now be used for classification, and external metrics can be used to evaluate clusterings of it.

Movie Review

This data set is derived from the data set provided by Bo Pang and Lillian Lee at Cornell. Only one split is provided, which supports both clustering and classification. The data set creators divided the reviews into 3 classes, based on normalized scores given by the authors in their reviews. In the indices provided here, these three classes have been labeled Positive, Negative and Neutral. Details of the data gathering process and class determination procedure can be found here, particularly in this README.

The data consists of movie reviews written by four authors:

Author - number of documents
Dennis Schwartz - 1027
James Berardinelli - 1307
Scott Renshaw - 902
Steve Rhodes - 1770

For a total of 5006 documents.

The data has been divided into training, dev-test, and blind-test sets as follows:

Training:
Negative - 969
Neutral - 1532
Positive - 1504
Total - 4005

Dev Test:
Negative - 121
Neutral - 179
Positive - 201
Total - 501

Blind Test:
Negative - 107
Neutral - 204
Positive - 189
Total - 500
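
The counts above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the lists above, and the names AUTHOR_COUNTS, SPLIT_COUNTS, and split_totals are illustrative rather than anything shipped with the dataset.

```python
# Document counts copied from the lists above.
AUTHOR_COUNTS = [1027, 1307, 902, 1770]
SPLIT_COUNTS = {
    "training": {"Negative": 969, "Neutral": 1532, "Positive": 1504},
    "dev test": {"Negative": 121, "Neutral": 179, "Positive": 201},
    "blind test": {"Negative": 107, "Neutral": 204, "Positive": 189},
}


def split_totals(split_counts):
    """Sum the per-class counts within each split."""
    return {split: sum(classes.values()) for split, classes in split_counts.items()}


totals = split_totals(SPLIT_COUNTS)
print(totals)
# The three splits together should account for every author's documents.
print(sum(totals.values()) == sum(AUTHOR_COUNTS))
```

The per-split totals come out to 4005, 501, and 500, which together match the 5006 documents listed by author.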



Social Bookmarking

There are two related but distinct datasets crawled from the bookmarking site. To use the old dataset, point to the directory as the index to use. To use the new dataset, point to

Old Social Bookmarking

This data set was crawled from the popular social bookmarking site by Michael Goulding. The site uses a tag-based system, where each bookmark can be assigned user-defined tags for organizational and sharing purposes. Dr. Ringger and Dan each chose 25 keywords, or topics, and the site's search facilities were used to find documents that had these topic labels as one of their tags.

There were quite a few spam postings in the resulting data, and so a few heuristics were applied to cull the “real” pages from the spam.

The intention was to gather 50 documents each from the 50 topics, but, after spam filtering, several topics ended up containing significantly fewer documents.

Two splits are provided for the social bookmarking data set: the full set (full_set), which contains 2307 documents, and a reduced set (tiny_set), which contains 396 documents.

We would like to re-crawl this data set, because:

  • Documents were filtered to exclude any that don't contain the topic name as part of the text (to match the needs of Michael's work at the time of keyword extraction)
  • Some topics have as few as 4 members, and we would like to have at least 50 documents for each
  • While most of the spam problems were solved by heuristic filtering, the data still includes documents that aren't “content” pages, such as pages that are forms or that consist mostly of JavaScript code.

Topic labels in the Old Social Bookmarking dataset
ajax 50
algorithm 34
applescript 50
biblical archaeology 39
byu 41
clustering 50
copyright 50
dell computers 50
diebold 50
discriminative training 4
final fantasy 50
fuel cell 50
games 41
gardening 50
google 50
gtd 50
hezbollah 50
home theater 50
howto 50
ipod 50
java 39
kohler 50
language identification 44
lawncare 50
machine learning 44
mac 50
mitt romney 50
mosaic tile 50
nanotechnology 50
natural language processing 37
news aggregator 42
osx 34
patents 50
pedometer 32
photography 50
podcasts 50
power supplies 50
productivity 50
programming 39
psp 50
riaa 50
ruby 41
sco 50
security 50
sprinklers 50
text mining 50
thai recipes 50
translation 46
wii 50
youtube 50
Total 2307

New Social Bookmarking

This dataset was also crawled from the same bookmarking site, and it corrects errors made when collecting the first. For example, the old social bookmarking data was collected using the search engine. This yielded pertinent results, but missed the point of leveraging the manual tagging by users of the bookmarked websites. Instead of retrieving pages that were tagged with the topic labels, the old data set consisted of any pages that mentioned the topic words in the title or page description. This is most likely at least part of the reason that the original dataset needed so much filtering to remove spam.

The new dataset was crawled using the Ruby script found in the scripts directory of the dataset root directory. The keyphrases used are found in the file keyphrases.txt.

Some of the topic labels consist of multiple words. This was handled either by

  1. Concatenating the words into a single word - this means that all of the documents in this category must have been tagged by at least one user with the concatenated string.
  2. Delimiting the words with the '+' character - this means that all of the documents in this category must have been tagged with each of the individual words in the keyphrase by at least one user. This has the effect of taking the intersection of the documents tagged with each word in the keyphrase.

Each of the above methods was attempted manually first to determine which was more appropriate for a particular keyphrase. Sometimes one method or the other would not produce enough documents to reach the quota of 100 documents per topic.
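
The difference between the two approaches can be sketched in Python as follows. This is not the Ruby crawler from the scripts directory. It is a toy model that assumes each document's tags (pooled over all users) are available as a set, and that tags are matched exactly. The function names and the sample data are made up for illustration.

```python
def docs_with_tag(tag, doc_tags):
    """Documents that at least one user tagged with exactly this tag."""
    return {doc for doc, tags in doc_tags.items() if tag in tags}


def docs_for_concatenated(keyphrase, doc_tags):
    """Method 1: join the words into a single tag, e.g. 'final fantasy' -> 'finalfantasy'."""
    return docs_with_tag("".join(keyphrase.split()), doc_tags)


def docs_for_plus_delimited(keyphrase, doc_tags):
    """Method 2: intersect the documents tagged with each word of 'word1+word2'."""
    per_word = [docs_with_tag(word, doc_tags) for word in keyphrase.split("+")]
    return set.intersection(*per_word)


# Hypothetical tag data: document id -> set of tags applied by any user.
doc_tags = {
    "doc1": {"finalfantasy", "games"},
    "doc2": {"final", "fantasy"},
    "doc3": {"mosaic", "tile", "diy"},
    "doc4": {"mosaic"},
}

print(docs_for_concatenated("final fantasy", doc_tags))
print(docs_for_plus_delimited("mosaic+tile", doc_tags))
```

In this toy data, the concatenated form selects only doc1, while mosaic+tile selects only doc3, because doc4 lacks the tile tag.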

The file contains an index that maps from each document file to the URL where the document was collected from.

NOTE: One of the files in the dataset ( contains snippets of code from the ILOVEYOU VBS virus, and might be identified as malware by some signature-based anti-virus engines. The file does not contain enough of the code to run, and is quite safe, however.

Topic labels in the New Social Bookmarking dataset
Topic Label Document Count
ajax 500
algorithms 499
applescript 500
BYU 398
clustering 500
copyright 500
Dell+Computers 380
diebold 500
finalfantasy 500
fuelcell 500
games 500
gardening 500
google 499
gtd 500
Hezbollah 500
hometheater 500
howto 500
ipod 500
java 500
Kohler 84
lawncare 238
linux 500
mac 500
machinelearning 500
MittRomney 273
mosaic+tile 69
nanotechnology 500
natural+language+processing 167
news+aggregator 500
osx 500
patents 500
pedometer 267
photography 346
podcasts 390
power+supply 498
productivity 500
programming 500
riaa 500
ruby 451
SCO 500
security 500
sony 403
sprinkler 189
textmining 500
thai+recipes 500
translation 500
wii 500
windows 500
youtube 478
Total 21627

20 Newsgroups

This is the venerable 20 Newsgroups dataset, as used by Thorsten Joachims in his 1997 ICML paper. It has been split according to the description of Joachims' split in that paper. There are currently four splits of this dataset:

A summary of the characteristics of the various splits of the 20 Newsgroups dataset.
Split                         Class Count  Document Count  Clustering  Classification
full_set                      20           19997           Yes         Yes
reduced_set                   10           6000            Yes         Yes
tiny_set                      4            400             Yes         No
indices_broad_reduction_5000  20           4999            Yes         Yes

A more complete description of each of these splits follows.


The composition of the full_set split of the 20 Newsgroups dataset
Component Document Count Percentage of Split
training 13398 67.00
dev test 3300 16.50
blind test 3299 16.50
all 19997 100


The composition of the reduced_set split of the 20 Newsgroups dataset
Component Document Count Percentage of Split
training 4000 66.667
dev test 1000 16.667
blind test 1000 16.667
all 6000 100


The tiny_set was made exclusively for testing clustering algorithms on a very small dataset. This split is not suitable for classification purposes.

The composition of the tiny_set split of the 20 Newsgroups dataset
Component Document Count Percentage of Split
all 400 100


The indices_broad_reduction_5000 split was created as a reduced set with representation from a larger number of classes than reduced_set (20 rather than 10).

The composition of the indices_broad_reduction_5000 split of the 20 Newsgroups dataset
Component Document Count Percentage of Split
training 3359 67.19
dev test 819 16.38
blind test 821 16.42
all 4999 100
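
The percentage columns in the composition tables follow directly from the document counts. As a small worked example, the Python sketch below recomputes them for the indices_broad_reduction_5000 split. The function name composition is illustrative only.

```python
def composition(counts):
    """Return each component's share of the split, as a percentage rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts copied from the indices_broad_reduction_5000 table above.
broad_reduction = {"training": 3359, "dev test": 819, "blind test": 821}

for name, share in composition(broad_reduction).items():
    print(f"{name}: {share}%")
```

The same function applied to the other splits' counts gives their percentage columns.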

Book of Mormon

This data set was created by Dan Walker and contains a single split suitable only for clustering, as no natural labels have been applied. The documents in this data set are individual verses extracted from the version of the Book of Mormon available from Project Gutenberg.

The documents have been pre-processed, so that individual tokens are separated by whitespace. Tokens include words and punctuation characters.
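
Because tokens are already separated by whitespace, reading a verse needs no further tokenization. A minimal Python sketch, assuming one verse per string. The function names are hypothetical, and the sample line is made up, not taken from the dataset.

```python
from collections import Counter


def tokenize(verse):
    """Split a pre-processed verse on whitespace; punctuation is already its own token."""
    return verse.split()


def word_counts(verses):
    """Count tokens containing at least one letter or digit, skipping pure punctuation."""
    counts = Counter()
    for verse in verses:
        counts.update(t for t in tokenize(verse) if any(c.isalnum() for c in t))
    return counts


# Made-up line in the described format, not an actual verse from the dataset.
sample = "this is a sample line , with punctuation split off ."
print(tokenize(sample))
print(word_counts([sample]))
```

Filtering out punctuation-only tokens is a design choice for the example. A clustering pipeline may prefer to keep or drop them depending on its features.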

nlp/data.txt · Last modified: 2015/04/21 22:33 by ryancha