malibu implements a number of algorithms and measures to evaluate the performance of a two-class supervised learning tool. This is necessary both to select the best model and to evaluate the performance of the learner.
To facilitate understanding of each validation algorithm, there is a graphical representation illustrating the proportion of the dataset used for training, testing or neither. The circles below represent some proportion of the dataset. For each example, they are divided into 12 slices. The coloring scheme is as follows: red slices represent the proportion of the dataset used for testing, blue for training and yellow for unused (or not yet used) examples. The one exception is the bootstrap method, where dark blue represents repeated examples such that the training set size equals the size of the dataset.
The best way to determine the true generalization ability of a learning algorithm is to evaluate its performance on unseen examples. The holdout method divides the dataset into two portions, usually 2/3 for training and 1/3 for testing. For small datasets, this method is often repeated over random partitions of the dataset.
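As a minimal sketch (not malibu's actual implementation), a holdout split with the usual 2/3 : 1/3 proportions might look like the following; the function name and seed are illustrative assumptions:

```python
import random

def holdout_split(examples, train_frac=2/3, seed=0):
    """Shuffle the dataset and split it into a training and a test portion."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)   # usually 2/3 for training
    return shuffled[:cut], shuffled[cut:]

# Repeating this with different seeds gives the repeated random
# partitions recommended for small datasets.
train, test = holdout_split(list(range(12)))
```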
The resubstitution or self-consistency test evaluates the classifier over the same set of examples used for training (the training set). This test often gives an overly optimistic estimate of the true generalization error, so it is rarely used to evaluate the performance of a classifier.
When the dataset is small, holdout fails to give an accurate estimate of the classifier's performance. Cross-validation has been shown to outperform holdout in this case. n-fold cross-validation divides the dataset into n parts, where n-1 parts are used for training and the left-out part is used for testing. This process is iterated so that every part is used for testing exactly once. An extreme case of cross-validation is leave-one-out cross-validation, where every instance except one is used for training and the left-out instance for testing; this procedure is repeated for every instance in the dataset. A single run of 10-fold cross-validation is often used in practice (to obtain a valid confidence interval).
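A sketch of generating the n folds (an assumed helper, shown for illustration); leave-one-out is simply the case where the number of folds equals the number of examples:

```python
def cross_validation_folds(n_examples, n_folds):
    """Yield (train_indices, test_indices) for each of the n folds."""
    indices = list(range(n_examples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]                 # every n-th example
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# 12 examples, 3 folds: each example is tested exactly once.
folds = list(cross_validation_folds(12, 3))
```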
Another method used to improve the holdout estimate is progressive validation. The generalization error estimate is as good as a single holdout; however, progressive validation uses only half the examples on average for training. Progressive validation is performed by splitting the dataset into two parts, training the classifier on the first part and testing a single instance from the second part. That instance is then added to the training set, and the procedure continues for each instance in the second part.
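The loop above can be sketched as follows; `train_fn` and `predict_fn` are assumed stand-ins for whatever learner is being validated:

```python
def progressive_validation(first_half, second_half, train_fn, predict_fn):
    """Test each example of the second half once, then add it to the training set."""
    train = list(first_half)
    errors = 0
    for x, y in second_half:
        model = train_fn(train)
        if predict_fn(model, x) != y:
            errors += 1
        train.append((x, y))        # the tested instance joins the training set
    return errors / len(second_half)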
The bootstrap method samples the data with replacement, generating a training set the size of the original dataset. Any examples not drawn into the training set are used for testing; this procedure is repeated. Approximately 1/3 of the examples are left out for testing on any given round of bootstrap. In many cases, bootstrap is superior to cross-validation; however, there seem to be just as many cases where cross-validation is superior.
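A sketch of a single bootstrap round; the expected left-out fraction is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n, consistent with the roughly 1/3 quoted above:

```python
import random

def bootstrap_sample(n_examples, seed=0):
    """Sample indices with replacement; the unsampled indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n_examples) for _ in range(n_examples)]
    drawn = set(train)
    test = [i for i in range(n_examples) if i not in drawn]
    return train, test

# Roughly 36.8% of the examples end up in the test set.
train, test = bootstrap_sample(1000)
```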
A scalar metric can easily be tabulated in terms of successes, failures and draws. Keep in mind, however, the circumstances that a single scalar represents.
A threshold metric reflects the performance of a classifier at a particular threshold, usually 0. These metrics are derived from a confusion matrix (or contingency table), which counts the number of correct/incorrect predictions for a particular class. For two classes:
- True positive (TP): a prediction that is positive and correct
- True negative (TN): a prediction that is negative and correct
- False positive (FP): a prediction that is positive and incorrect
- False negative (FN): a prediction that is negative and incorrect
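The four cells above can be tallied directly from paired target and prediction lists; this helper is an illustrative sketch, not malibu's own code:

```python
def confusion_counts(targets, predictions, positive=1):
    """Tally the four cells of the two-class confusion matrix."""
    tp = fp = tn = fn = 0
    for y, p in zip(targets, predictions):
        if p == positive:
            if y == positive:
                tp += 1          # predicted positive, actually positive
            else:
                fp += 1          # predicted positive, actually negative
        else:
            if y == positive:
                fn += 1          # predicted negative, actually positive
            else:
                tn += 1          # predicted negative, actually negative
    return tp, fp, tn, fn
```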
The accuracy measures the ratio of correct predictions to the total number of predictions: P(ŷ = y).
The sensitivity measures the proportion of positive cases correctly predicted as positive. In terms of conditional probability, sensitivity is the probability a case will be predicted positive given the case is positive. In information retrieval, sensitivity is known as recall and measures the fraction of the relevant material that a search returns. It is also known as the true positive rate (TPR).
The specificity measures the proportion of negative cases correctly predicted as negative. In terms of conditional probability, specificity is the probability a case will be predicted negative given the case is negative. A high specificity means a low Type I error.
|Positive Predictive Value||PPV|
The positive predictive value measures the proportion of cases predicted positive which are correctly predicted. In terms of conditional probability, it is the probability that a case is truly positive given it is predicted positive. In terms of information retrieval, positive predictive value is known as precision and measures the fraction of documents returned that are relevant.
|Negative Predictive Value||NPV|
The negative predictive value measures the proportion of cases predicted negative which are correctly predicted. In terms of conditional probability, it is the probability that a case is truly negative given it is predicted negative.
|False Positive Rate||FPR|
The false positive rate measures the proportion of negative cases that are incorrectly predicted positive. It corresponds to the Type I error, i.e. the error of rejecting a hypothesis that should have been accepted.
The lift measures the accuracy of prediction for the top p% of predictions; in this case, it measures the accuracy of the top 25%. This is a typical measure in database marketing.
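A common way to compute lift (shown here as an assumed helper, using the usual definition of lift as the positive rate among the top-scored fraction relative to the base rate):

```python
def lift_at(scores, labels, fraction=0.25):
    """Ratio of the positive rate in the top fraction of predictions to the base rate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(len(scores) * fraction))   # size of the top p% slice
    top_rate = sum(labels[i] for i in order[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate
```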
The f-score measures the weighted harmonic mean of precision and recall. This is a common metric in information retrieval.
|Matthews Correlation Coefficient||MCC|
The Matthews correlation coefficient measures the accuracy of prediction, where a predictor that is always right has the value 1 and a predictor that is always wrong has the value -1. A random guess will have a value near 0.
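The scalar threshold metrics above all follow from the four confusion-matrix counts; a sketch (illustrative names, standard textbook formulas):

```python
import math

def threshold_metrics(tp, fp, tn, fn):
    """Compute the threshold metrics described above from confusion-matrix counts."""
    total = tp + fp + tn + fn
    metrics = {
        "accuracy":    (tp + tn) / total,
        "sensitivity": tp / (tp + fn),       # recall / true positive rate
        "specificity": tn / (tn + fp),
        "ppv":         tp / (tp + fp),       # precision
        "npv":         tn / (tn + fn),
        "fpr":         fp / (fp + tn),
    }
    p, r = metrics["ppv"], metrics["sensitivity"]
    metrics["f1"] = 2 * p * r / (p + r)      # equally weighted harmonic mean
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    metrics["mcc"] = (tp * tn - fp * fn) / denom if denom else 0.0
    return metrics
```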
Ranking (ordering) metrics measure the classifier's ability to correctly order predictions. These metrics depend only on the relative order of the predictions, not on their absolute values or on the prediction threshold.
|Area Under the ROC Curve||AUR|
A ROC plot compares the true positive rate vs. the false positive rate as the threshold is swept from 0 to 1. An area under the ROC of 1 means a perfect prediction and 0.5 a random guess. Another interpretation of the area under the ROC is that it measures how badly sorted a set of predictions is with respect to the class value: it counts the number of swaps needed to repair the sort, normalized by the number of positive cases times the number of negative cases.
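The sorting interpretation is equivalent to the probability that a randomly chosen positive case is scored above a randomly chosen negative case (the Mann-Whitney statistic); a sketch with an assumed helper name:

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of positive/negative pairs ordered correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))
```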
|Area Under the Precision Recall Curve||AUP|
The precision recall curve poses an alternative for tasks with a large skew in the class distribution and is often used in information retrieval. A classifier that is near optimal in ROC space may not be optimal in precision/recall space.
|Break Even Point||BEP|
The breakeven point is defined as the point where the precision and recall are equal. This is a common metric in information retrieval.
|Area Under the Cost Curve||AUC|
The area under the cost curve measures the expected cost of a classifier assuming all possible probability cost values are equally likely. A lower value is better.
The probability (regression) metrics measure how closely the predicted values match the target values. The target values T are chosen to be 0 or 1 depending on the class of the case.
|Root Mean Squared Error||RMSE|
The root mean squared error measures how closely predicted values match target values. It is the square root of the sum of squared errors divided by the number of cases.
|Mean Cross Entropy||CXE|
The cross entropy (log loss) measures how closely predicted values match target values; it assumes each predicted value is a probability in [0,1] that the class of the example is 1. The cross entropy is normalized by the size of the dataset, giving the mean cross entropy. Note that the cross entropy becomes infinite when a prediction of exactly 0 or 1 is wrong.
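Both probability metrics can be sketched directly from their definitions; the clipping constant `eps` is an assumption to guard against the infinite loss noted above:

```python
import math

def rmse(targets, predictions):
    """Root mean squared error between 0/1 targets and predicted probabilities."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, predictions))
                     / len(targets))

def mean_cross_entropy(targets, predictions, eps=1e-12):
    """Mean log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1 - eps)   # guard against infinite loss at 0 and 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(targets)
```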
A single scalar cannot account for both types of error that a binary classifier may make. Moreover, a scalar cannot account for other circumstances under which one classifier is superior to another. There are several methods to generate performance curves; the one used below sweeps the threshold from 0 to 1 for a classifier with real-valued outputs.
The receiver operating characteristic (ROC) curve depicts the performance of a classifier in terms of the true positive rate (sensitivity) and the false positive rate (1 - specificity). The closer the curve is to the top-left corner, the better the classifier.
The precision recall curve depicts the performance of a classifier in terms of precision (y-axis) and recall (x-axis). It poses an alternative for tasks with a large skew in the class distribution and is often used in information retrieval. A classifier that is near optimal in ROC space may not be optimal in precision/recall space. An optimal precision recall curve will be in the upper right-hand corner.
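The threshold sweep that traces both curves can be sketched as follows (an assumed helper; each distinct score is used as a threshold):

```python
def roc_and_pr_points(scores, labels):
    """Sweep the threshold over every distinct score to trace the ROC and PR curves."""
    p = sum(labels)               # number of positive cases
    n = len(labels) - p           # number of negative cases
    roc, pr = [], []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        roc.append((fp / n, tp / p))             # (FPR, TPR)
        pr.append((tp / p, tp / (tp + fp)))      # (recall, precision)
    return roc, pr
```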
Sometimes you need to see the precision/recall curve with the classes reversed. The curve shown below has the negative predictive value on the y-axis and the specificity on the x-axis.
The cost curve depicts the performance of a classifier in terms of expected cost (or error rate) over the full range of class distributions and misclassification costs. The x-axis measures the probability cost with respect to the class distribution and the y-axis measures the normalized expected cost (or error). A single cost line corresponds to a point in ROC space: its left y-intercept is the false positive rate and its right y-intercept is one minus the true positive rate. The combination of the cost lines forms the lower envelope. One classifier is superior to another if its lower envelope is lower than the other classifier's.
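A sketch of the correspondence: a cost line interpolates linearly between the false positive rate (at x = 0) and the false negative rate (at x = 1), and the lower envelope takes the minimum over all lines at each probability-cost value. The helper names are assumptions for illustration:

```python
def cost_line(fpr, tpr):
    """Return the cost line y(x) for one (FPR, TPR) point in ROC space."""
    fnr = 1.0 - tpr
    return lambda pc: (1.0 - pc) * fpr + pc * fnr

def lower_envelope(points, pc):
    """Expected cost of the best available classifier at probability-cost value pc."""
    return min(cost_line(f, t)(pc) for f, t in points)
```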
Using the above cost curve, it would be hard to compare classifiers. Note the white space formed below the lines; its boundary forms a curve called the lower envelope. The lower envelope captures much of the information of the cost curve while allowing intuitive comparisons like a ROC curve.
Another problem with the above cost curve is that it is hard to see individual cost lines. The plot below shows the lower envelope with some important cost lines. Likewise, it is easier to see over what range of the lower envelope this classifier outperforms the trivial classifiers (0,0), (1,1) and (1,0),(0,1).
The lift curve depicts the performance of a classifier's probability estimates in terms of lift (y-axis) and the probability of predicting positive (x-axis).