CS3481-Lecture-4-Classifier Evaluation

Classification error

  • The errors committed by a classification model are generally divided into two types

    • Training errors
    • Generalization errors
  • Training error is the number of misclassification errors committed on training records

  • Training error is also known as resubstitution error or apparent error.

  • Generalization error is the expected error of the model on previously unseen records.

  • A good classification model should

    • Fit the training data well.
    • Accurately classify records it has never seen before.
  • A model that fits the training data too well can have a poor generalization error.

  • This is known as model overfitting

  • We consider the 2-D data set in the following figure.

  • The data set contains data points that belong to two different classes.

  • 10% of the points are chosen for training, while the remaining 90% are used for testing.

  • Decision tree classifiers with different numbers of leaf nodes are built using the training set.

  • The following figure shows the training and test error rates of the decision tree.

  • Both error rates are large when the size of the tree is very small.

  • This situation is known as model underfitting.

  • Underfitting occurs because the model cannot learn the true structure of the data.

  • It performs poorly on both the training and test sets.

  • When the tree becomes too large

    • The training error rate continues to decrease
    • However, the test error rate begins to increase.
  • This phenomenon is known as model overfitting.
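
  • A minimal sketch of this experiment is shown below (it assumes scikit-learn and substitutes a synthetic 2-D two-class data set for the one in the figure; all parameter values are illustrative):

```python
# Sketch: vary the tree size and compare training vs. test error rates.
# Assumptions: scikit-learn is installed; the lecture's data set is replaced
# by a synthetic two-class 2-D data set (make_moons).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=3000, noise=0.3, random_state=0)
# 10% of the points for training, the remaining 90% for testing (as in the lecture).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, random_state=0)

for n_leaves in (2, 5, 10, 50, 100, 150):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)  # training (resubstitution) error
    test_err = 1 - tree.score(X_test, y_test)     # estimate of generalization error
    print(f"{n_leaves:3d} leaves: train error {train_err:.3f}, test error {test_err:.3f}")
```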

Overfitting

  • The training error can be reduced by increasing the model complexity

  • However, the test error can be large because the model may accidentally fit some of the noise points in the training data.

  • In other words, the performance of the model on the training set does not generalize well to the test examples.

  • We consider two decision trees, one with 5 leaf nodes and one with 50 leaf nodes.

  • The two decision trees and their corresponding decision boundaries are shown in the following figures.

  • The decision boundaries constructed by the tree with 5 leaf nodes represent a reasonable approximation of the optimal solution.

  • It is expected that this set of decision boundaries will result in good classification performance on the test set.

  • On the other hand, the tree with 50 leaf nodes generates a set of complicated decision boundaries.

  • Some of the shaded regions corresponding to the ‘+’ class contain only a few ‘+’ training instances among a large number of surrounding ‘o’ instances.

  • This excessive fine-tuning to specific patterns in the data will lead to unsatisfactory classification performance on the test set.
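
  • To visualize the contrast, here is a hedged continuation of the earlier sketch (it assumes scikit-learn >= 1.1 for DecisionBoundaryDisplay plus matplotlib, and reuses X_train, y_train from the previous sketch):

```python
# Sketch: compare the decision boundaries of a 5-leaf and a 50-leaf tree.
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, n_leaves in zip(axes, (5, 50)):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    # Shade each region with the class the tree predicts there.
    DecisionBoundaryDisplay.from_estimator(tree, X_train, ax=ax, alpha=0.4)
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=10, edgecolor="k")
    ax.set_title(f"{n_leaves} leaf nodes")
plt.show()
```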

Generalization error estimation

  • The ideal classification model is the one that produces the lowest generalization error.

  • The problem is that the model has no knowledge of the test set.

  • It has access only to the training set.

  • We consider the following approaches to estimate the generalization error

    • Resubstitution estimate
    • Estimates incorporating model complexity
    • Using a validation set

Resubstitution estimate

  • The resubstitution estimate approach assumes that the training set is a good representation of the overall data

  • However, the training error is usually an optimistic estimate of the generalization error.

  • We consider the two decision trees shown in the following figure.

  • The left tree TL is more complex than the right tree TR.

  • The training error rate for TL is e(TL) = 4/24 = 0.167

  • The training error rate for TR is e(TR) = 6/24 = 0.25

  • Based on the resubstitution estimate, TL is considered better than TR.
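
  • In code terms, the resubstitution estimate is simply the error rate measured on the data the model was fitted on; continuing the earlier sketch (tree, X_train, etc. as defined there):

```python
# Resubstitution (training) error vs. test error for an already fitted tree.
resub_error = 1 - tree.score(X_train, y_train)  # usually optimistic
test_error = 1 - tree.score(X_test, y_test)
print(f"resubstitution error {resub_error:.3f}, test error {test_error:.3f}")
```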

Estimates incorporating model complexity

  • The chance for model overfitting increases as the model becomes more complex.

  • As a result, we should prefer simpler models

  • Based on this principle, we can estimate the generalization error as the sum of

    • Training error and
    • A penalty term for model complexity
  • We consider the previous two decision trees TL and TR, where TL has 7 leaf nodes and TR has 4 leaf nodes.

  • We assume that the penalty term is equal to 0.5 for each leaf node.

  • The error rate estimate for TL is: e'(TL) = (4 + 7 × 0.5)/24 = 7.5/24 = 0.3125

  • The error rate estimate for TR is: e'(TR) = (6 + 4 × 0.5)/24 = 8/24 = 0.3333

  • Based on this penalty term, TL is better than TR

  • For a binary tree, a penalty term of 0.5 means that a node should always be expanded into its two child nodes if it improves the classification of at least one training record.

  • This is because expanding a node, which is the same as adding 0.5 to the overall error, is less costly than committing one training error.

  • Suppose the penalty term is equal to 1 for all the leaf nodes

  • The error rate estimate for TL becomes 0.458

  • The error rate estimate for TR becomes 0.417

  • So TR is better than TL

  • A penalty term of 1 means that, for a binary tree, a node should not be expanded unless it reduces the classification error by more than one training record.
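
  • The arithmetic above can be reproduced with a tiny helper (a sketch, not lecture code; the error counts and leaf counts are those of TL and TR, and omega = 0 would recover the plain resubstitution estimate):

```python
def penalized_error(n_errors, n_leaves, n_records, omega):
    """Generalization error estimate: (training errors + omega per leaf node) / N."""
    return (n_errors + omega * n_leaves) / n_records

for omega in (0.5, 1.0):
    e_TL = penalized_error(n_errors=4, n_leaves=7, n_records=24, omega=omega)
    e_TR = penalized_error(n_errors=6, n_leaves=4, n_records=24, omega=omega)
    print(f"omega={omega}: e'(TL)={e_TL:.4f}, e'(TR)={e_TR:.4f}")
# omega=0.5 -> e'(TL)=0.3125, e'(TR)=0.3333  (TL preferred)
# omega=1.0 -> e'(TL)=0.4583, e'(TR)=0.4167  (TR preferred)
```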

Using a validation set

  • In this approach, the original training data is divided into two smaller subsets.
  • One of the subsets is used for training
  • The other, known as the validation set, is used for estimating the generalization error.
  • This approach can be used in the case where the complexity of the model is determined by a parameter.
  • We can adjust the parameter until the resulting model attains the lowest error on the validation set.
  • This approach provides a better way for estimating how well the model performs on previously unseen records.
  • However, less data are available for training
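
  • A minimal sketch of this procedure (assuming scikit-learn; X, y denote the original training data, and the split fraction and candidate leaf counts are illustrative):

```python
# Sketch: choose the complexity parameter (max_leaf_nodes) on a validation set.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out one third of the original training data as the validation set.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

best_leaves, best_err = None, float("inf")
for n_leaves in (2, 5, 10, 20, 50, 100):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_fit, y_fit)
    val_err = 1 - tree.score(X_val, y_val)  # error on the validation set
    if val_err < best_err:
        best_leaves, best_err = n_leaves, val_err
print(f"chosen max_leaf_nodes = {best_leaves} (validation error {best_err:.3f})")
```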

Handling overfitting in decision trees

  • There are two approaches for avoiding model overfitting in decision trees
    • Pre-pruning
    • Post-pruning

Pre-pruning

  • In this approach, the tree growing algorithm is halted before generating a fully grown tree that perfectly fits the training data.
  • To do this, an alternative stopping condition could be used
  • For example, we can stop expanding a node when the observed change in impurity measure falls below a certain threshold.
  • Advantage:
    • avoids generating overly complex sub-trees that overfit the training data
  • However, it is difficult to choose the right threshold for the change in impurity measure.
  • A threshold which is:
    • too high will result in underfitted models
    • too low may not be sufficient to overcome the model overfitting problem
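
  • As a concrete (hedged) example, scikit-learn's min_impurity_decrease parameter implements this kind of stopping condition; X_train, y_train, X_test, y_test are assumed as in the earlier sketches, and the thresholds are illustrative:

```python
# Sketch: pre-pruning by thresholding the impurity decrease of each split.
# A split is only made if it reduces the (weighted) impurity by at least
# the threshold, so larger thresholds halt tree growth earlier.
from sklearn.tree import DecisionTreeClassifier

for threshold in (0.0, 0.001, 0.01, 0.1):
    tree = DecisionTreeClassifier(min_impurity_decrease=threshold, random_state=0)
    tree.fit(X_train, y_train)
    print(f"threshold={threshold}: {tree.get_n_leaves()} leaves, "
          f"test error {1 - tree.score(X_test, y_test):.3f}")
```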

Post-pruning

  • In this approach, the decision tree is initially grown to its maximum size
  • This is followed by a tree pruning step, which trims the fully grown tree.
  • Trimming can be done by replacing a sub-tree with a new leaf node whose class label is determined from the majority class of records associated with the node
  • The tree pruning step terminates when no further improvement is observed
  • Post-pruning tends to give better results than pre-pruning because it makes pruning decisions based on a fully grown tree.
  • On the other hand, pre-pruning can suffer from premature termination of the tree growing process.
  • However, for post-pruning, the additional computations for growing the full tree may be wasted when some of the sub-trees are pruned.
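
  • scikit-learn does not implement the subtree-replacement scheme described above, but it ships a related post-pruning method, minimal cost-complexity pruning; the sketch below uses it as a stand-in (same assumed data as before):

```python
# Sketch: post-pruning via minimal cost-complexity pruning.
# The tree is first grown to full size; increasing ccp_alpha then prunes away
# subtrees whose accuracy gain does not justify their added complexity.
from sklearn.tree import DecisionTreeClassifier

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::5]:  # sample a few pruning levels
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}: {pruned.get_n_leaves()} leaves, "
          f"test error {1 - pruned.score(X_test, y_test):.3f}")
```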

Classifier evaluation

  • There are a number of methods to evaluate the performance of a classifier
    • Hold-out method
    • Cross validation

Hold-out method

  • In this method, the original data set is partitioned into two disjoint sets.
  • These are called the training set and test set respectively.
  • The classification model is constructed from the training set.
  • The performance of the model is evaluated using the test set.
  • The hold-out method has a number of well known limitations.
  • First, fewer examples are available for training.
  • Second, the model may be highly dependent on the composition of the training and test sets.
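
  • A minimal hold-out sketch (assuming scikit-learn; X, y are the original labelled data and the 1/3 split fraction is illustrative):

```python
# Sketch: hold-out evaluation with a single train/test partition.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"hold-out error estimate: {1 - model.score(X_test, y_test):.3f}")
```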

Cross validation

  • In this approach, each record is used the same number of times for training, and exactly once for testing.
  • To illustrate this method, suppose we partition the data into two equal-sized subsets.
  • First, we choose one of the subsets for training and the other for testing.
  • We then swap the roles of the subsets so that the previous training set becomes the test set, and vice versa.
  • The estimated error is obtained by averaging the errors on the test sets for both runs.
  • In this example, each record is used exactly once for training and once for testing.
  • This approach is called two-fold cross-validation
  • The k-fold cross validation method generalizes this approach by segmenting the data into k equal-sized partitions.
  • During each run
    • One of the partitions is chosen for testing
    • The rest of them are used for training
  • This procedure is repeated k times so that each partition is used for testing exactly once.
  • The estimated error is obtained by averaging the errors on the test sets for all k runs.
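
  • A minimal k-fold cross-validation sketch (assuming scikit-learn and numpy arrays X, y; k = 5 is illustrative):

```python
# Sketch: k-fold cross-validation, written out with KFold.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

k = 5
errors = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    tree = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - tree.score(X[test_idx], y[test_idx]))
print(f"k-fold error estimate: {np.mean(errors):.3f}")

# A similar estimate via the cross_val_score convenience wrapper
# (note: for classifiers it uses stratified folds by default).
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=k)
print(f"k-fold error estimate: {1 - acc.mean():.3f}")
```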