Stemming Algorithms


Stemming algorithms have been used in information retrieval (IR) for decades; however, there is no consensus that stemming improves the effectiveness of IR systems. Many studies have evaluated stemming on test collections, and the results are mixed. Harman (1991) tested three stemming algorithms on large English corpora and concluded that none of them significantly improved retrieval performance. Later studies (Abu-Salem et al., 1999; Jinxi and Croft, 1998; Hull, 1996; Krovetz, 1993) found that stemming is useful and improves the effectiveness of IR systems. These studies indicate that stemming is one of the most important factors in retrieval effectiveness, and stemming algorithms are consequently now widely used for this purpose. Abu-Salem et al. (1999) explain that, in IR systems, grouping words that share the same base or root increases the success rate when matching documents to a query. For the present study, I agree with Savoy (1999) and others that stemming is useful, especially when long lists of retrieved documents are analyzed.

Many stemmers have been developed for a wide range of languages, including English, French, German, Dutch, Swedish, Latin, Malay, Indonesian, Slovene, Turkish, Arabic and Hebrew. Leah, Lisa, et al. (2002: 275) point out that “stemmers are generally tailored for each specific language”. Building a stemmer therefore requires some linguistic knowledge of the language and an understanding of the needs of information retrieval. The idea behind all stemmers is to reduce the size of the corpora so that Info...

... middle of paper ...

... true that stemming is useful in merging words that differ in form but are semantically equivalent; however, it can equally merge words that differ in form and are also semantically distinct. Moreover, stemmers offer no solution for homographs, which means that they can conflate word forms that are completely different in meaning. In IR applications, stemmers make two kinds of error: over-stemming and under-stemming. Strong stemmers tend to form larger stem classes in which unrelated forms are wrongly conflated; this error is called over-stemming. Weak stemmers, in turn, fail to conflate variant forms of the same stem, leaving them ungrouped; this error is called under-stemming. The present section introduces the main stemming algorithms for English corpora and illustrates how they carry out stemming tasks.
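The two kinds of error can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the suffix lists, the word list, and the helper names strip_suffix and stem_classes are hypothetical and are not taken from any of the stemmers discussed in this study. A "weak" stemmer that strips only a final "s" is compared with a "strong" one that strips a longer list of suffixes.

```python
from collections import defaultdict

# Hypothetical suffix lists, chosen for illustration only.
# They do not reproduce any published stemming algorithm.
WEAK_SUFFIXES = ("s",)
STRONG_SUFFIXES = ("ations", "ation", "ities", "ity", "ions", "ion",
                   "ing", "ed", "es", "e", "s")

# Hypothetical sample vocabulary: two unrelated word families plus "universe".
WORDS = ["connect", "connected", "connection", "connections",
         "universe", "university", "universities"]


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first listed suffix that matches, keeping at least min_stem letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_classes(words, suffixes):
    """Group words into stem classes: all words reduced to the same stem."""
    classes = defaultdict(set)
    for word in words:
        classes[strip_suffix(word, suffixes)].add(word)
    return dict(classes)


weak = stem_classes(WORDS, WEAK_SUFFIXES)
strong = stem_classes(WORDS, STRONG_SUFFIXES)

for name, classes in (("weak", weak), ("strong", strong)):
    print(name)
    for stem, members in classes.items():
        print("  ", stem, "->", sorted(members))
```

With this word list, the weak stemmer leaves "connect", "connected" and "connection" in separate classes, which is under-stemming, while the strong stemmer places "universe" in the same class as "university" and "universities", which is over-stemming. The strong stemmer does, however, group the four forms of "connect" correctly, so the choice between the two involves a trade-off rather than a clear winner.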
