See the bottom of this page for latest updates!


This data set is a collection of newswire articles from 1987.

It is located on entropy in /home/data/Reuters/lewis.

The data is in 20 SGML files. In short, SGML (Standard Generalized Markup Language) is the predecessor to XML. XML is a restricted subset of SGML, but not every SGML document is XML, so in practice an XML parser may not be able to parse an SGML document. In the case of this data set (Reuters-21578), the files appear to be valid XML.
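Since a strict XML parser may not accept every SGML file, a tolerant regex-based reader is one way to pull the articles out. The following is only a minimal sketch, not the loader used here: the tag names (REUTERS, TITLE, BODY) and the NEWID attribute are assumptions that should be checked against the DTD, and parse_articles is a hypothetical helper name.

```python
import re

# Assumed element and attribute names; verify them against the DTD.
ARTICLE_RE = re.compile(r"<REUTERS\b([^>]*)>(.*?)</REUTERS>", re.DOTALL)
NEWID_RE = re.compile(r'NEWID="(\d+)"')


def _first(tag, text):
    """Return the stripped content of the first <tag>...</tag>, or ''."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_articles(sgml_text):
    """Extract id, title, and body from each article in one file's text."""
    articles = []
    for attrs, inner in ARTICLE_RE.findall(sgml_text):
        newid = NEWID_RE.search(attrs)
        articles.append({
            "id": int(newid.group(1)) if newid else None,
            "title": _first("TITLE", inner),
            "body": _first("BODY", inner),
        })
    return articles
```

Articles with no body come back with an empty string instead of raising an error, so the caller can decide whether to drop them.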

There is a DTD describing the format of the files.

Custom Split

I've created a custom split of this data that differs from any other split. I simply took all of the articles and randomly selected a set from which to generate test, dev, and blind subsets.
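The procedure can be sketched roughly as follows. This is an illustration under assumptions, not the script that produced the actual split: the held-out fraction, the equal three-way division, the seed, treating the remaining articles as training data, and the name custom_split are all hypothetical.

```python
import random


def custom_split(article_ids, held_out_fraction=0.3, seed=0):
    """Randomly hold out a set of articles and divide it three ways.

    The fraction, seed, and equal-sized subsets are illustrative
    assumptions; the real split may have used different values.
    """
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    ids = list(article_ids)
    rng.shuffle(ids)

    n_held = int(len(ids) * held_out_fraction)
    held, rest = ids[:n_held], ids[n_held:]
    third = n_held // 3

    return {
        "train": rest,                   # assumption: remainder is for training
        "test": held[:third],
        "dev": held[third:2 * third],
        "blind": held[2 * third:],       # takes any leftover from the division
    }
```

Passing the same seed and the same ordered list of ids reproduces the same subsets, which helps when the split has to be regenerated later.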

nlp/reuters21578.txt · Last modified: 2015/04/23 15:46 by ryancha