Text Classification Systems

1054 Words3 Pages

Currently, there are many classification systems. Broadly speaking, these systems fall into two main categories. These are binary and multiclass systems. Binary classification systems are only concerned with classifying documents into two main categories or groups. Classification systems of this kind are used to distinguish between just two classes of objects. As Maranis and Bebenko (2009) explain, these systems provide Yes/No answer to the question: Does this document belong to class X? In this, such systems can be useful in classifying emails where they are classified whether spam or not, or commercial transactions where they are determined to be fraudulent or not. In such applications, it is more likely and easier to use binary classification systems as we have only two classes or groups. Multiclass systems, in turn, divide documents into two classes or more. As the name indicates, these classifiers assign each document or data point to one of many classes where each has a distinct subject area. Newspaper accounts, for instance, can be classified under different categories such as news, sport, culture, business & money, politics, science, etc. This thesis is only concerned with text clustering. That is, it makes no priori assumptions about the interrelationships of Hardy’s prose works. Computational methods of text clustering fall into two main categories. These are linguistic and statistical mathematical methods (Srivastava and Sahami, 2009; Justo and Torres, 2005). Linguistic methods are based on natural language processing techniques. Methods of this kind usually involve morphological and syntactic processes for extracting meaning and identifying relationships within documents. Mathematical and statistical classificatio... ... middle of paper ... ...sks including SenseClusters (Purandare and Pedersen, 2004). This and others are programs that allow users to cluster similar contexts such as emails and web pages (Pedersen, 2008). The working principle of such programs is that data documents can be grouped on the basis of their mutual contextual similarities (Purandare and Pedersen, 2004). Programs of this kind have indeed proven a successful clustering method when applied to web pages and its merits are more tangible with multimedia material. Nevertheless, an approach of this kind carries with it some limitations. One of them- perhaps the most important- is that it is not concerned with the analysis of the content of documents. One more drawback is that in almost all context classification applications “identical replications of controlled experiments result in different conclusions” (Martin et al., 2005: 470).

Open Document