Several studies and procedures for classifying Arabic-language texts have been carried out in different environments, without a unified standard or a unified dataset, which makes it difficult to determine precisely which technique classifies most accurately; Arabic language processing is also not as mature as that of other languages. Root extraction and stemming for Arabic are important phases on the way to effective Arabic NLP applications, so we are interested in applying algorithms to these phases. The Arabic language has a complex structure, which makes NLP research on it difficult.
This thesis therefore studies and analyses the classification algorithms in a unified environment, on a single dataset, together with the challenges these algorithms face, in order to demonstrate their effectiveness and accuracy, and it does so on a large dataset, reflecting the continuous growth of data on the internet.
Several algorithms are used for text classification, grouping texts so that they can be retrieved more quickly and searched more accurately; for Arabic texts these include k-NN, decision trees, Naive Bayes, random forest, and others (a minimal comparison is sketched below, after the dataset description).
We used the Diab dataset; its structure is as follows:
The first dataset has nine categories, each of which contains 300 documents. Each category has its own directory that includes all files belonging to that particular category. We also built two further collections: the second collection has nine categories, each of which contains 600 documents, again with each category in its own directory that includes all files belonging to that particular category, and the third dataset has n...
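The following is a minimal sketch, assuming scikit-learn, of how such a directory-per-category corpus could be loaded and the algorithms named above compared in one unified environment; the path diab_dataset/ and all parameter choices are illustrative assumptions, not the actual experimental setup.

# A minimal sketch: load a corpus laid out as one directory per category
# (each file one document) and cross-validate the four classifiers on it.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Hypothetical path: each subdirectory of diab_dataset/ is one category.
corpus = load_files("diab_dataset/", encoding="utf-8", decode_error="ignore")

# TF-IDF term weighting; an Arabic-specific tokenizer or stemmer could be
# plugged in via the `tokenizer` argument, but the default suffices here.
X = TfidfVectorizer().fit_transform(corpus.data)
y = corpus.target

classifiers = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(),
    "Naive Bayes": MultinomialNB(),
    "Random forest": RandomForestClassifier(n_estimators=100),
}

# 5-fold cross-validated accuracy on the same dataset for every algorithm,
# so the comparison uses one unified environment, as argued above.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")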
... middle of paper ...
...haracters in the field of medicine may also appear in another area, such as sports, but with a different meaning.
In the world of computers and the internet there must be solutions to these problems; otherwise the process of searching for and retrieving information on the Internet is useless and may take a long time to satisfy the user's request.
The process of information retrieval must be more precise, strongly related to the topic the user wants, and confined to the area the user needs. Large topics, multiple sources, and large vocabularies increase the complexity of information retrieval, so it is necessary to determine the paths that must be followed when searching or retrieving, rather than proceeding at random, to avoid wasting time searching paths that are not related; herein lies the importance of text classification.
Nicholas, D., Huntington, P., Jamali, H. R., & Tenopir, C. (2006). Finding information in (very large) digital libraries: A deep log approach to determining differences in use according to method of access. The Journal of Academic Librarianship, 32(2), 119-126.
Abstract: This paper presents a brief overview of data mining, data mining technology, and big data. Applications of data mining are also discussed briefly. The main purpose of data mining here is to develop ideas about how to access big data with different tools.
In conclusion, the paper made a good contribution to the field by describing the history of information retrieval systems from 1945 to 1996, with abundant information on the various technologies developed, the information retrieval systems built, and how they affected research in information retrieval. I think artificial intelligence will start to play a leading role in information retrieval in the coming years, and one day we will have true question-answering information retrieval at the fingertips of every Internet user.
Support Vector Machine (SVM): Over the past several years there has been a significant amount of research on support vector machines, and today SVM applications are becoming more common in text classification. In essence, support vector machines define hyperplanes that try to separate the values of a given target field. The hyperplanes are defined using kernel functions; the most popular kernel types (linear, polynomial, radial basis function, and sigmoid) are supported. Support vector machines can be used for both classification and regression. Several characteristics have been observed in vector-space-based methods for text classification [15,16], including the high dimensionality of the input space, the sparsity of document vectors, linear separability in most text classification problems, and the belief that few features are relevant.
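As a brief illustration, here is a minimal sketch of an SVM text classifier, assuming scikit-learn; the tiny training set and its labels are invented for illustration, and the loop simply exercises the four kernel types listed above.

# A minimal SVM text-classification sketch over sparse TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

train_docs = ["stock markets fell sharply", "the striker scored a late goal",
              "central bank raises interest rates", "the team won the final match"]
train_labels = ["economy", "sport", "economy", "sport"]

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # For text, the linear kernel is usually sufficient because, as noted
    # above, most text classification problems are linearly separable.
    model = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
    model.fit(train_docs, train_labels)
    print(kernel, model.predict(["goal scored in the second half"]))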
Sathya, A. S., & Simon, B. P. (2010). A document retrieval system with combination terms using genetic algorithm. International Journal of Computer and Electrical Engineering, 2(1), 1-6.
Anusuya, M. A., & Katti, S. K. (2011). Classification techniques used in speech recognition applications: A review. International Journal of Computer Applications, 2, 910-954.
Information Retrieval (IR) represents, retrieves from storage, and organises information. The information should be easy to access, since users are more interested in easily accessible information. Information retrieval is the practice of searching for documents, for information within documents, and for metadata about documents, as well as searching relational databases and the World Wide Web. According to Shing Ping Tucker (2008), e-commerce is a rapidly growing segment of the internet.
Information Retrieval is, simply, a field concerned with organizing information. In other words, IR emphasizes the range of different materials that need to be searched. Other researchers contrast the strong structure and typing of a database system with the lack of structure in the objects typically searched in IR. In practice, information retrieval systems have to deal with incomplete or under-specified information in the form of the queries issued by users. IR uses techniques for storing, recovering, and often disseminating recorded data, especially through the use of a computerized system.
We aim to use word sense disambiguation for text classification: all the non-stop-words will be tagged with their senses relative to the text in the document, and these senses are later selected as classes for the given document. Accordingly, there will be a common class for all related text, which can therefore be searched better. We shall use WordNet to extract the correct senses of the ambiguous non-stop-words. The classification results thus obtained shall later be evaluated by comparison with the classification results of manual disambiguation.
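A minimal sketch of the sense-tagging step follows, assuming Python with NLTK and using NLTK's built-in Lesk algorithm as a stand-in for the disambiguation method; the sample sentence is illustrative only.

# A minimal sketch: tag every non-stop-word with a WordNet sense chosen
# relative to its sentence, so the senses can serve as document classes.
import nltk
from nltk.corpus import stopwords
from nltk.wsd import lesk

# One-time downloads: nltk.download("punkt"), nltk.download("wordnet"),
# nltk.download("stopwords")

text = "The bank approved the loan after reviewing the account"
tokens = nltk.word_tokenize(text.lower())
stop = set(stopwords.words("english"))

# Lesk picks the WordNet synset whose gloss best overlaps the context.
senses = {w: lesk(tokens, w) for w in tokens if w.isalpha() and w not in stop}
for word, synset in senses.items():
    print(word, "->", synset)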
Arabizi is a writing system used in modern Arabic-speaking countries: Arabic written with English characters, so it is considered a combination of the Arabic and English languages. It is mostly used in text messaging over the internet and on cellular phones, because most cellular phones did not support the Arabic language or Arabic characters. Arabic was also thought of as more difficult to type, so this new system was invented. Arabizi replaces Arabic letters, and this raises concerns about preserving the quality and purity of the Arabic language. In this essay, I will discuss why it appeared, its effect on our Arabic identity, and whether or not the purity of Arabic needs to be defended.
...ation retrieval. The author tries to make it more understandable by relating it to Shakespeare's theory of the seven ages of man. He uses different terms that nevertheless carry the same meaning. This article has a positive impact on readers who have no knowledge of information retrieval, because it discusses the field from the beginning, starting with the visions of Bush and Weaver. At the end of the article, the author also states what went wrong and right in Bush's vision of the development of information retrieval.
The idea of text clustering long preceded the computer age: “Clustering is one of the most primitive mental activities of humans, used to handle the huge amount of information they receive every day” (Theodoridis and Koutroubas, 2003: 398). The act of indexing long used in libraries is an obvious example. Manual clustering was the only type of document clustering possible prior to the computer age. This circumstance may have influenced much clustering work, which relied only on immediate intuitive knowledge of the world without making use of quantitative numerical methods. In other words, text clustering was usually performed in subjective ways that relied heavily on the perception, knowledge, and judgment of the researcher. With easier access to electronic digital data in different disciplines and the power of computational data processing on one hand, and the need to maintain standards of objectivity on the other, it has become ever more likely that such procedures must involve computational automated methods (Arabie et al., 1996), where human intuition and traditional organization methods are replaced by mathematical and computational techniques (Golub, 2006; Golub, 2005). Accordingly, recent years have witnessed a flourishing of automated statistical clustering and classification systems for systematizing the inherent subjectivity of traditional text classification applications. It is this need for automated, objective methodology that motivates our clustering of Hardy's novels and short stories.
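For illustration, here is a minimal sketch of such an automated clustering pipeline, assuming scikit-learn; the four snippets stand in for passages from the novels, and the number of clusters is an arbitrary assumption.

# A minimal sketch: TF-IDF vectors clustered with k-means, so grouping is
# decided by a distance computation rather than the researcher's judgment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["the heath stretched dark and silent",
         "she walked the heath at dusk",
         "the market town bustled with farmers",
         "traders filled the town square at noon"]

X = TfidfVectorizer().fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: heath passages vs. town passages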
On first reading the article entitled "The Seven Ages of Information Retrieval" by Michael Lesk, one sees that the development of information retrieval is discussed using the concept of a life span drawn from the most popular of literature, Shakespeare. The author highlights the major stages used by Shakespeare, from childhood until retirement, and maps them onto the expectations of an article he had read before: the article written by Vannevar Bush in 1945. Several expectations about the development of information retrieval come from that article. Some of them have been fulfilled over time, some have advanced beyond what Bush expected in terms of how information is obtained, and some are still in progress. In addition, some points are supported by a graph to give the reader a clearer picture.
NLP researchers aim to gather knowledge on how human beings understand and use language so that appropriate tools and techniques can be developed to make computer systems understand and manipulate natural languages to perform the desired tasks. The foundations of NLP lie in a number of disciplines, viz. computer and information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, psychology, etc. Applications of NLP include a number of fields of study, such as machine translation, natural language text processing and summarization, user interfaces, multilingual and cross-language information retrieval (CLIR), speech recognition, artificial intelligence and expert systems, and so on. One important area of application of NLP, which is relatively new and has not been covered in the previous ARIST chapters on NLP, has become quite prominent due to the proliferation of the World Wide Web and digital libraries. Several researchers have pointed out the need for appropriate research to facilitate multi- or cross-lingual information retrieval, including multilingual text processing and multilingual user interfaces.
The Internet has made access to information easier. Information is stored efficiently and organized on the Internet. For example, instead of going to our local library, we can use Internet search engines. Simply by doing a search, we get thousands of results. The search engines use a ranking system to help us retrieve the most pertinent results at the top. Just a simple click and we have our information. Therefore, we can learn about almost anything immediately. In a matter of moments, we can become well informed on a topic.