Text Clustering

The idea of text clustering long preceded the computer age: “Clustering is one of the most primitive mental activities of humans, used to handle the huge amount of information they receive every day” (Theodoridis and Koutroumbas, 2003: 398). The indexing long practised in libraries is an obvious example. Before computers, manual clustering was the only kind of document clustering possible. This circumstance may explain why much clustering work relied on immediate, intuitive knowledge of the world rather than on quantitative methods. In other words, text clustering was usually performed subjectively, relying heavily on the perception, knowledge, and judgment of the researcher. With electronic data now more easily accessible across disciplines, and with the growth of computing power on the one hand and the need to maintain standards of objectivity on the other, such procedures increasingly involve automated computational methods (Arabie et al., 1996), in which human intuition and traditional methods of organization are replaced by mathematical and computational techniques (Golub, 2006; Golub, 2005). Recent years have accordingly seen a flourishing of automated statistical clustering and classification systems that put the inherently subjective practice of traditional text classification on a systematic footing. It is this need for an automated, objective methodology that motivates our clustering of Hardy’s novels and short stories.

Clustering vs. classification

The two terms clustering and classification are used extensively throughout this thesis. The question that arises at this point is: are they synonymous or is there a distinction... ... middle of paper ... ...ion is that clustering is an “unsupervised” activity while classification is a supervised one.
In clustering, no one assigns documents to classes; the distribution and makeup of the data alone determine cluster membership (Manning et al., 2008). To illustrate, consider a set of 1000 documents on the history of English literature, which can be either clustered or classified. In a clustering task, the documents are simply divided into distinct groups so that similar or related documents end up together. In classification, on the other hand, a set of predefined categories is given first, for example Old English literature, Shakespearean literature, Augustan literature, Romantic literature, and Victorian literature. The documents are then assigned to these predefined categories.
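
The contrast between the two activities can be made concrete with a small sketch. It is illustrative only and is not the method used in this thesis: the toy documents, the function names (`vectorize`, `cosine`, `cluster`, `classify`), the greedy grouping rule, and the similarity threshold of 0.3 are all assumptions chosen for brevity. The point is simply that `cluster` receives no labels and lets the data decide the groups, whereas `classify` needs predefined categories with labelled examples.

```python
from collections import Counter
import math


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Unsupervised: no categories are given in advance.

    Greedy grouping (an assumed, simplistic rule): each document joins
    the first group whose seed document is similar enough; otherwise it
    starts a new group. Returns lists of document indices.
    """
    groups = []
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        for group in groups:
            seed = vectorize(docs[group[0]])
            if cosine(vec, seed) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def classify(doc, labelled):
    """Supervised: categories and labelled examples are given first.

    Assigns the category whose examples contain the most similar document.
    """
    vec = vectorize(doc)
    return max(
        labelled,
        key=lambda label: max(cosine(vec, vectorize(ex)) for ex in labelled[label]),
    )


# Hypothetical toy corpus standing in for the 1000 documents.
docs = [
    "beowulf heroic epic monster",
    "heroic epic battle monster",
    "wordsworth nature lyric poem",
    "nature lyric poem daffodils",
]

# Predefined categories with one labelled example each (also hypothetical).
labelled = {
    "Old English literature": ["beowulf heroic epic"],
    "Romantic literature": ["wordsworth nature lyric"],
}

print(cluster(docs))                              # [[0, 1], [2, 3]]
print(classify("epic monster battle", labelled))  # Old English literature
```

Note that the clustering result is only a set of unnamed groups; interpreting them (for instance, as "epic" versus "lyric" texts) is left to the researcher, whereas classification returns one of the category names supplied in advance.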
