CS3481-Lecture-3

Decision Tree

Classification

Basic Concept

  • Classification is the task of assigning objects to one of several predefined categories

  • It is an important problem in many applications

    • Detecting spam email messages based on the message header and content
    • Categorizing cells as malignant or benign based on the results of MRI scans
    • Classifying galaxies based on their shapes
  • The input data for a classification task is a collection of records

  • Each record, also known as an instance or example, is characterized by a tuple (x, y)

    • x is the attribute set
    • y is the class label, also known as category or target attribute.
  • The class label is a discrete attribute.

  • Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

  • The target function is also known as a classification model.

  • A classification model is useful for the following purposes

    • Descriptive modeling
    • Predictive modeling
  • A classification technique is a systematic approach to perform classification on an input data set

  • Examples include

    • Decision tree classifiers
    • Neural networks
    • Support vector machines

General Framework for Classification

  • A classification technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data.

  • The model generated by a learning algorithm should

    • Fit the input data well and
    • Correctly predict the class labels of records it has never seen before
  • A key objective of the learning algorithm is to build models with good generalization capability.

  • First, a training set consisting of records whose class labels are known must be provided.

  • The training set is used to build a classification model.

  • This model is subsequently applied to the test set, which consists of records different from those in the training set.
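
  • As a rough illustration of this framework, the following sketch uses scikit-learn (assumed to be available); the synthetic records and all parameter values are purely illustrative.

    # Build a model on a training set and apply it to an unseen test set
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Attribute sets x and class labels y for 100 illustrative records
    X, y = make_classification(n_samples=100, n_features=4, random_state=0)

    # Records with known class labels are split into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # The learning algorithm fits a classification model to the training set
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # The model is then applied to records it has never seen before
    predictions = model.predict(X_test)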

Confusion matrix

  • Evaluation of the performance of the model is based on the counts of correctly and incorrectly predicted test records.

  • These counts are tabulated in a table known as a confusion matrix

  • Each entry fij in this table denotes the number of records from class i predicted to be of class j.

  • The total number of correct predictions made by the model is f11 + f00

  • The total number of incorrect predictions is f10+f01

  • The information in a confusion matrix can be summarized with the following two measures

    • Accuracy = (number of correct predictions) / (total number of predictions), which for the binary case is (f11 + f00) / (f11 + f10 + f01 + f00)
    • Error rate = (number of incorrect predictions) / (total number of predictions), which for the binary case is (f10 + f01) / (f11 + f10 + f01 + f00)
  • Most classification algorithms aim at attaining the highest accuracy, or equivalently, the lowest error rate when applied to the test set.
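
  • As a small numerical sketch of these two measures, the binary counts f11, f00, f10 and f01 below are hypothetical.

    # Summarize a binary confusion matrix with accuracy and error rate
    f11, f00, f10, f01 = 40, 45, 8, 7   # illustrative counts only

    total = f11 + f00 + f10 + f01
    accuracy = (f11 + f00) / total      # fraction of correct predictions
    error_rate = (f10 + f01) / total    # fraction of incorrect predictions

    print(accuracy, error_rate)         # 0.85 0.15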

Decision tree construction

Concept

  • We can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record.

  • Each time we receive an answer, a follow-up question is asked.

  • This process is continued until we reach a conclusion about the class label of the record.

  • The series of questions and answers can be organized in the form of a decision tree.

  • It is a hierarchical structure consisting of nodes and directed edges.

  • The tree has three types of nodes

    • A root node that has no incoming edges.
    • Internal nodes, each of which has exactly one incoming edge and a number of outgoing edges.
    • Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
  • In a decision tree, each leaf node is assigned a class label.

  • The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics.

  • Classifying a test record is straightforward once a decision tree has been constructed.

  • Starting from the root node, we apply the test condition

  • We then follow the appropriate branch based on the outcome of the test

  • This will lead us either to

    • Another internal node, at which a new test condition is applied, or
    • A leaf node
  • The class label associated with the leaf node is then assigned to the record.
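
  • The traversal described above can be sketched as follows; the dictionary-based node layout and the hypothetical attributes A1 and A2 are assumptions made only for illustration.

    # Classify a record with an already-constructed decision tree
    def classify(record, node):
        # Leaf nodes carry a class label directly
        if "label" in node:
            return node["label"]
        # Internal nodes apply their attribute test and follow the matching branch
        outcome = record[node["attribute"]]
        return classify(record, node["children"][outcome])

    # Illustrative two-level tree: test A1 at the root, then A2 on one branch
    tree = {"attribute": "A1",
            "children": {"yes": {"label": "c1"},
                         "no": {"attribute": "A2",
                                "children": {"yes": {"label": "c2"},
                                             "no": {"label": "c1"}}}}}

    print(classify({"A1": "no", "A2": "yes"}, tree))  # c2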

  • Efficient algorithms have been developed to induce a reasonably accurate, although suboptimal, decision tree in a reasonable amount of time.

  • These algorithms usually employ a greedy strategy that makes a series of locally optimal decisions about which attribute to use for partitioning the data.

Hunt’s Algorithm

  • A decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets

  • We suppose

    • Us is the set of training records that are associated with node s.
    • C = {c1, c2, ..., ck} is the set of class labels.
  • If all the records in Us belong to the same class ck, then s is a leaf node labeled as ck.

  • If Us contains records that belong to more than one class,

    • An attribute test condition is selected to partition the records into smaller subsets.
    • A child node is created for each outcome of the test condition.
    • The records in Us are distributed to the children based on the outcome.
  • The algorithm is then recursively applied to each child node.
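
  • A simplified, runnable sketch of this recursion is given below; choosing the first unused attribute stands in for a proper test-condition selection measure, and the records are illustrative.

    # A minimal version of Hunt's algorithm for categorical attributes
    from collections import Counter

    def hunt(records, attributes):
        labels = [y for _, y in records]
        # If all records belong to the same class, create a leaf node with that label
        if len(set(labels)) == 1:
            return {"label": labels[0]}
        # If no attribute test is left, label the leaf with the majority class
        if not attributes:
            return {"label": Counter(labels).most_common(1)[0][0]}
        # Select an attribute test condition and partition the records
        attr = attributes[0]
        children = {}
        for value in set(x[attr] for x, _ in records):
            subset = [(x, y) for x, y in records if x[attr] == value]
            # Recursively apply the algorithm to each child node
            children[value] = hunt(subset, attributes[1:])
        return {"attribute": attr, "children": children}

    # Illustrative (attribute set, class label) records
    data = [({"A1": "yes", "A2": "no"}, "c1"),
            ({"A1": "no", "A2": "yes"}, "c2"),
            ({"A1": "no", "A2": "no"}, "c1")]
    print(hunt(data, ["A1", "A2"]))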

  • For each node, let p(ck) denote the fraction of training records from class ck

  • In most cases, the leaf node is assigned to the class that has the largest number of training records.

  • The fraction p(ck) for a node can also be used to estimate the probability that a record assigned to that node belongs to class ck.

  • Decision trees that are too large are susceptible to a phenomenon known as overfitting.

  • A tree pruning step can be performed to reduce the size of the decision tree.

  • Pruning helps by trimming the tree branches in a way that improves the generalization error.

Attribute Test

  • Each recursive step of the tree-growing process must select an attribute test condition to divide the records into smaller subsets.

  • To implement this step, the algorithm must provide

    • A method for specifying the test condition for different attribute types and
    • An objective measure for evaluating the goodness of each test condition.

Binary attribute

  • The test condition for a binary attribute generates two possible outcomes

Nominal attributes

  • A nominal attribute can produce binary or multi-way splits.
  • There are 2^(S-1) - 1 ways of creating a binary partition of S attribute values (see the short sketch after this list).
  • For a multi-way split, the number of outcomes depends on the number of distinct values for the corresponding attribute.
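
  • The count 2^(S-1) - 1 can be checked with a short sketch; the attribute values below are hypothetical and S = 3.

    # Enumerate the distinct binary partitions of a nominal attribute
    from itertools import combinations

    values = ["low", "medium", "high"]          # S = 3 distinct values
    splits = set()
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            # Store each unordered {left, right} partition only once
            splits.add(frozenset([left, right]))

    print(len(splits))  # 3, which matches 2^(3-1) - 1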

Ordinal attributes

  • Ordinal attributes can also produce binary or multi-way splits.
  • Ordinal attributes can be grouped as long as the grouping does not violate the order property of the attribute values.

Continuous attributes

  • The test condition can be expressed as a comparison test x <= T or x > T
  • For the binary case
    • The decision tree algorithm must consider all possible split positions T, and
    • Select the one that produces the best partition
  • For the multi-way split
    • The algorithm must consider multiple split positions.

Construction example

  • We consider the problem of using a decision tree to predict the number of participants in a marathon race requiring medical attention.

  • This number depends on attributes such as

    • Temperature forecast (TEMP)
    • Humidity forecast (HUMID)
    • Air pollution forecast (AIR)

  • In a decision tree, each internal node represents a particular attribute, e.g. TEMP or AIR

  • Each possible value of that attribute corresponds to a branch of the tree.

  • Leaf nodes represent classifications, such as Large or Small number of participants requiring medical attention.

  • Suppose AIR is selected as the first attribute.

  • This partitions the examples as follows.

  • Since the entries of the set {2,5,7,8,10} all correspond to the case of a large number of participants requiring medical attention, a leaf node is formed.

  • On the other hand, for the set {1,3,4,6,9}

    • TEMP is selected as the next attribute to be tested.
    • This further divides this partition into {4}, {3,6} and {1, 9}

Information theory

Concept

  • Each attribute reduces a certain amount of uncertainty in the classification process.

  • We calculate the amount of uncertainty reduced by the selection of each attribute.

  • We then select the attribute that provides the greatest uncertainty reduction.

  • Information theory provides a mathematical formulation for measuring how much information a message contains.

  • We consider the case where a message is selected among a set of possible messages and transmitted.

  • The information content of a message depends on

    • The size of this message set, and
    • The frequency with which each possible message occurs.
  • The amount of information in a message with occurrence probability p is defined as -log2p.

  • Suppose we are given

    • a set of messages, C={c1,c2,...,ck}
    • the occurrence probability p(ck) of each ck.
  • We define the entropy I as the expected information content of a message in C:

    I = - Σ p(ck) log2 p(ck), where the sum is taken over all ck in C

  • The entropy is measured in bits.
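
  • A short sketch of this definition, assuming the occurrence probabilities are given directly:

    # Entropy as the expected information content of a message
    from math import log2

    def entropy(probabilities):
        # Terms with p = 0 contribute nothing to the expected information content
        return -sum(p * log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0 bit for two equally likely messages
    print(entropy([0.2, 0.8]))   # about 0.722 bits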

Attribute selection

  • We can calculate the entropy of a set of training examples from the occurrence probabilities of the different classes.

  • In our example

    • p(Small) = 2/10
    • p(Large) = 8/10
  • The set of training instances is denoted as U

  • We can calculate the entropy as follows:

    I(U) = -(2/10) log2(2/10) - (8/10) log2(8/10) ≈ 0.722 bits

  • The information gain provided by an attribute is the difference between

    1. The degree of uncertainty before including the attribute
    2. The degree of uncertainty after including the attribute.
  • Item 2 above is defined as the weighted average of the entropy values of the child nodes of the attribute.

  • If we select attribute P, with S values, this will partition U into the subsets {U1, U2, ..., US}.

  • The average degree of uncertainty after selecting P is

    I(P) = Σ (|Ui| / |U|) I(Ui), where the sum is taken over i = 1, ..., S

  • The information gain associated with attribute P is computed as follows:

    gain(P) = I(U) - I(P)

  • If the attribute AIR is chosen, the examples are partitioned as follows:

    • U1={1,3,4,6,9}
    • U2={2,5,7,8,10}
  • The resulting entropy value is

    I(AIR) = (5/10) I(U1) + (5/10) I(U2) ≈ (0.5)(0.971) + (0.5)(0) ≈ 0.485 bits

  • The information gain can be computed as follows:

    gain(AIR) = I(U) - I(AIR) ≈ 0.236 bits

  • For the attribute TEMP which partitions the examples into U1 = {4,5}, U2 = {3,6,7,8} and U3 = {1,2,9,10}:

  • For the attribute HUMID which partitions the examples into U1={4,5,6,7,9,10} and U2={1,2,3,8}

  • The attribute AIR corresponds to the highest information gain.

  • As a result, this attribute will be selected.
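
  • The selection of AIR can be reproduced with a short sketch that reuses the counts from the example above (2 Small and 8 Large records overall, with AIR producing one subset of 2 Small and 3 Large and one pure subset of 5 Large).

    # Information gain of the AIR split in the marathon example
    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    before = entropy([2, 8])                       # I(U), about 0.722 bits
    # Weighted average entropy of the child nodes produced by AIR
    after = (5 / 10) * entropy([2, 3]) + (5 / 10) * entropy([5])
    gain_air = before - after                      # about 0.236 bits

    print(round(before, 3), round(after, 3), round(gain_air, 3))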

Continuous attributes

  • If attribute P is continuous with value x, we can apply a binary test.

  • The outcome of the test depends on a threshold value T.

  • There are two possible outcomes:

    • x <=T
    • x > T
  • The training set is then partitioned into two subsets U1 and U2.

  • We apply sorting to the values of attribute P to obtain the sequence {x1, x2, ..., xm}

  • Any threshold between x(r) and x(r+1) will divide the set into two subsets

    • {x1,x2,...,xr}
    • {x(r+1),x(r+2),...,xm}
  • There are at most m-1 possible splits.

  • For r = 1,…,m-1 such that x(r) != x(r+1), the corresponding threshold is chosen as Tr=(x(r)+x(r+1))/2

  • We can then calculate the information gain for each Tr:

    gain(P, Tr) = I(U) - I(P, Tr)

    • where I(P, Tr) is the weighted average entropy of the two subsets produced by the threshold Tr
  • The threshold Tr which maximizes gain(P,Tr) is then chosen.
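
  • The threshold search can be sketched as follows; the (value, label) pairs are illustrative.

    # Choose the split threshold of a continuous attribute by information gain
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                    for c in set(labels))

    def best_threshold(pairs):
        # Sort the values of attribute P to obtain the sequence x1, ..., xm
        pairs = sorted(pairs)
        values = [x for x, _ in pairs]
        labels = [y for _, y in pairs]
        base = entropy(labels)
        best = None
        for r in range(len(pairs) - 1):
            if values[r] == values[r + 1]:
                continue
            # Candidate threshold Tr is the midpoint between adjacent distinct values
            t = (values[r] + values[r + 1]) / 2
            left, right = labels[:r + 1], labels[r + 1:]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if best is None or gain > best[1]:
                best = (t, gain)
        return best

    print(best_threshold([(1.0, "c1"), (2.0, "c1"), (3.0, "c2"), (4.0, "c2")]))  # (2.5, 1.0)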

Impurity measures

  • The measures developed for selecting the best split are often based on the degree of impurity of the child nodes.

  • Besides entropy, other examples of impurity measures include

    • Gini index
    • Classification error rate
  • In the following figure, we compare the values of the impurity measures for binary classification problems.

  • p refers to the fraction of records that belong to one of the two classes

  • All three measures attain their maximum value when p = 0.5

  • The minimum values of the measures are attained when p equals 0 or 1
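
  • The comparison can be reproduced with a short sketch; the values of p below are illustrative.

    # Entropy, Gini index and classification error for a binary problem
    from math import log2

    def entropy(p):
        return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

    def gini(p):
        return 1 - (p ** 2 + (1 - p) ** 2)

    def classification_error(p):
        return 1 - max(p, 1 - p)

    # All three measures peak at p = 0.5 and fall to 0 as p approaches 0 or 1
    for p in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(p, round(entropy(p), 3), round(gini(p), 3), round(classification_error(p), 3))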

Gain ratio

  • Impurity measures such as entropy and Gini index tend to favor attributes that have a large number of possible values.

  • In many cases, a test condition that results in a large number of outcomes may not be desirable

  • This is because the number of records associated with each partition is too small to enable us to make any reliable predictions.

  • To solve this problem, we can modify the splitting criterion to take into account the number of possible attribute values.

  • In the case of information gain, we can use the gain ratio, which is defined as follows:

    gain ratio(P) = gain(P) / split info(P)

    where split info(P) = - Σ (|Ui| / |U|) log2(|Ui| / |U|), summed over the S subsets produced by the split
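
  • A short sketch of the gain ratio, assuming the child-node sizes and the plain information gain of a split are already known; the numbers are illustrative.

    # Gain ratio penalizes splits with many small outcomes
    from math import log2

    def split_info(subset_sizes):
        total = sum(subset_sizes)
        # Entropy of the partition itself, which grows with the number of outcomes
        return -sum((n / total) * log2(n / total) for n in subset_sizes if n > 0)

    def gain_ratio(gain, subset_sizes):
        return gain / split_info(subset_sizes)

    # The same gain is penalized more heavily for a 5-way split than for a 2-way split
    print(round(gain_ratio(0.4, [5, 5]), 3))           # 0.4
    print(round(gain_ratio(0.4, [2, 2, 2, 2, 2]), 3))  # 0.172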

Oblique decision tree

  • The test conditions described so far involve using only a single attribute at a time

  • The tree-growing procedure can be viewed as the process of partitioning the attribute space into disjoint regions.

  • The border between two neighboring regions of different classes is known as a decision boundary.

  • Since the test condition involves only a single attribute, the decision boundaries are parallel to the coordinate axes.

  • This limits the expressiveness of the decision tree representation for modeling complex relationships among continuous attributes.

  • An oblique decision tree allows test conditions that involve more than one attribute

  • The following figure illustrates a data set that cannot be classified effectively by a conventional decision tree

  • This data set can be easily represented by a single node of an oblique decision tree with the test condition x+y<1

  • However, finding the optimal test condition for a given node can be computationally expensive.
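
  • The contrast can be sketched as follows; the points and the axis-parallel threshold are illustrative.

    # A single oblique test x + y < 1 versus a single-attribute test
    def oblique_test(x, y):
        # One linear condition over two attributes separates the classes directly
        return "c1" if x + y < 1 else "c2"

    def axis_parallel_test(x, y):
        # A single-attribute test can only produce a boundary parallel to an axis
        return "c1" if x < 0.5 else "c2"

    for point in [(0.1, 0.2), (0.4, 0.9), (0.9, 0.4)]:
        print(point, oblique_test(*point), axis_parallel_test(*point))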