Information gain analysis



ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work by Claude Shannon on information theory, which studied the value, or "information content," of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness, or "impurity," in those partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

The expected information needed to classify a tuple in D is given by

Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)

where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. A log function to the base 2 is used, because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.
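As a concrete check of this formula, Info(D) can be computed directly from a partition's class labels. The sketch below is illustrative only; the two-class split of 9 "yes" and 5 "no" tuples is a hypothetical example, not data from the text:

```python
import math
from collections import Counter

def info(labels):
    """Entropy Info(D) = -sum over classes of p_i * log2(p_i),
    where p_i is estimated as |Ci,D| / |D|."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical partition: 9 tuples of class "yes", 5 of class "no"
labels = ["yes"] * 9 + ["no"] * 5
print(round(info(labels), 3))  # 0.940
```

A pure partition (all tuples in one class) yields entropy 0, the minimum; a 50/50 two-class split yields 1 bit, the maximum for two classes.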

Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1, a2, . . . , av}, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, {D1, D2, . . . , Dv}, where Dj contains those tuples in D that have outcome aj of A. These partitions would correspond to the branches grown from node N.
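Putting the pieces together, the standard definition gives Gain(A) = Info(D) - InfoA(D), where InfoA(D) = Σ_j (|Dj|/|D|) × Info(Dj) is the expected information still needed after splitting on A. A minimal sketch, assuming a discrete-valued attribute; the attribute values and labels below are an illustrative example, not data from the text:

```python
import math
from collections import Counter, defaultdict

def info(labels):
    """Entropy Info(D) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj),
    splitting D on the attribute at position attr_index."""
    partitions = defaultdict(list)
    for row, label in zip(rows, labels):
        partitions[row[attr_index]].append(label)
    n = len(labels)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# Hypothetical attribute "age" with values youth / middle_aged / senior
ages = ["youth"] * 5 + ["middle_aged"] * 4 + ["senior"] * 5
labels = (["yes", "yes", "no", "no", "no"] + ["yes"] * 4
          + ["yes", "yes", "yes", "no", "no"])
rows = [[a] for a in ages]
print(round(info_gain(rows, 0, labels), 3))  # 0.247
```

An attribute whose partitions each contain the same class mix as D gives a gain of 0; ID3 picks the attribute with the highest gain as the split for node N.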


A branch is grown from node N for each outcome of the criterion, and the tuples are partitioned accordingly. This section describes three popular attribute selection measures: information gain, gain ratio, and Gini index.
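Of the three measures, the Gini index has the simplest form: Gini(D) = 1 - Σ_i p_i², with p_i estimated as |Ci,D|/|D| just as for entropy. A minimal sketch; the 9/5 class split is an illustrative example, not data from the text:

```python
from collections import Counter

def gini(labels):
    """Gini index Gini(D) = 1 - sum over classes of p_i**2,
    where p_i is estimated as |Ci,D| / |D|."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical partition: 9 tuples of class "yes", 5 of class "no"
labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))  # 0.459
```

Like entropy, the Gini index is 0 for a pure partition and largest when classes are evenly mixed; gain ratio, by contrast, normalizes information gain by the split information of the attribute to penalize many-valued splits.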

The notation used herein is as follows. Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, . . . , m). Let Ci,D be the set of tuples of class Ci in D. Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.

