This tutorial illustrates the use of the functions
vl_roc,
vl_det, and
vl_pr to generate ROC, DET, and precision-recall
curves.
VLFeat includes support for plotting standard information retrieval curves such as the Receiver Operating Characteristic (ROC) and the Precision-Recall (PR) curves.
Consider a set of samples with labels labels and scores
scores.
scores is typically the output
of a classifier, with higher scores corresponding to positive
labels. Ideally, sorting the data by decreasing scores should leave
all the positive samples first and the negative samples last. In
practice, a classifier is not perfect and the ranking is not ideal.
The tools discussed in this tutorial make it possible to evaluate and
visualize the quality of the ranking.
For the sake of illustration, generate some data randomly as follows:
numPos = 20 ;
numNeg = 100 ;
labels = [ones(1, numPos) -ones(1,numNeg)] ;
scores = randn(size(labels)) + labels ;
In this case, there are five times more negative samples than positive ones. The scores are correlated with the labels as expected, but do not allow for a perfect separation of the two classes.
ROC and DET curves
To visualize the quality of the ranking, one can plot the ROC curve
by using the vl_roc
function:
vl_roc(labels, scores) ;
This produces a plot of the ROC curve.
The ROC curve is the parametric curve given by the true positive
rate (TPR) against the true negative rate (TNR). These two quantities
can be obtained from vl_roc as follows:
[tpr, tnr] = vl_roc(labels, scores) ;
The TPR value tpr(k) is the percentage of positive
samples that have rank less than or equal to k (where
ranks are assigned by decreasing scores).
tnr(k) is
instead the percentage of negative samples that have rank larger
than k. Therefore, if one classifies the samples with
rank less than or equal to k as positive and the rest
as negative, tpr(k) and tnr(k) are
respectively the probability that a positive/negative sample is
classified correctly.
Moving from rank k to rank
k+1, if the
sample of rank
k+1 is positive then
tpr
increases; otherwise
tnr decreases. An ideal classifier
has all the positive samples first, and the corresponding ROC curve is
one that describes two sides of the unit square.
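To make these definitions concrete, the two rates can be computed directly from the definitions above by sorting the samples, as in the following simplified sketch (this is not the actual vl_roc implementation, which also handles details such as zero labels and -inf scores):
[~, perm] = sort(scores, 'descend') ;
sortedLabels = labels(perm) ;
tprSketch = cumsum(sortedLabels > 0) / sum(labels > 0) ;     % fraction of positives with rank <= k
tnrSketch = 1 - cumsum(sortedLabels < 0) / sum(labels < 0) ; % fraction of negatives with rank > k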
The Area Under the Curve (AUC) is an indicator of the
overall quality of a ROC curve. For example, the ROC of the ideal
classifier has AUC equal to 1. Another indicator is the Equal
Error Rate (EER), the point on the ROC curve corresponding to
an equal probability of misclassifying a positive or a negative
sample. This point is obtained by intersecting the ROC curve with a
diagonal of the unit square. Both AUC and EER can be computed
by vl_roc:
[tpr, tnr, info] = vl_roc(labels, scores) ;
disp(info.auc) ;
disp(info.eer) ;
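As a rough illustration of these two statistics, they can also be estimated directly from the tpr and tnr vectors returned above. The sketch below assumes the common definition of the ROC AUC as the area under the TPR plotted against the false positive rate 1 - TNR, and may differ in small details from the values stored in info:
fprSketch = 1 - tnr ;                      % false positive rate at each rank
aucSketch = abs(trapz(fprSketch, tpr)) ;   % area under the TPR-vs-FPR curve
[~, k] = min(abs(tpr - tnr)) ;             % operating point closest to the diagonal tpr = tnr
eerSketch = 1 - tpr(k) ;                   % equal chance of misclassifying a positive or a negative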
vl_roc has a few useful features:
- Any sample with label equal to zero is effectively ignored in the evaluation.
- Samples with scores equal to
-inf are assumed to be never retrieved by the classifier. For these, the TNR is conventionally set to be equal to zero.
- Additional negative and positive samples with
-inf score can be added to the evaluation by means of the
numNegatives and
numPositives options. For example,
vl_roc(labels,scores,'numNegatives',1e4) sets the number of negative samples to 10,000. This can be useful when evaluating large retrieval systems, for which one may want to record in
labels and
scores only the top ranked results from a classifier.
- Different variants of the ROC plot can be produced. For example,
vl_roc(labels,scores,'plot','tptn') swaps the two axes, plotting the TNR against the TPR (see the snippet after this list). Since the TPR is also the recall (i.e., the percentage of positive samples retrieved up to a certain rank), this makes the plot more directly comparable to a precision-recall plot.
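For concreteness, the two option examples mentioned in the list above can be collected into a short snippet (option names and values exactly as given in the text):
vl_roc(labels, scores, 'numNegatives', 1e4) ;   % evaluate as if there were 10,000 negative samples
vl_roc(labels, scores, 'plot', 'tptn') ;        % plot the TNR against the TPR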
A limitation of the ROC curves in evaluating a typical retrieval system is that they put equal emphasis on false positive and false negative errors. In a typical retrieval application, however, the vast majority of the samples are negative, so the false positive rate is typically very small for any operating point of interest. Therefore the emphasis is usually on the very first portion of the ranking, where the few positive samples should concentrate. This can be emphasized by using either precision-recall plots or a variant of the ROC curves called Detection Error Tradeoff (DET) curves.
A DET curve plots the FNR (also called the miss rate)
against the FPR (also called the false alarm rate) in logarithmic
coordinates. It can be generated by a call to the vl_det function,
which has the same calling convention as vl_roc:
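vl_det(labels, scores) ;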
Precision-recall curves
Both ROC and DET curves normalize out the relative proportions of
positive and negative samples. By contrast, a Precision-Recall
(PR) curve reflects this directly. One can plot the PR curve by
using the vl_pr
function:
vl_pr(labels, scores) ;
This produces a plot of the PR curve.
The PR curve is the parametric curve given by precision and
recall. These two quantities can be obtained from vl_pr
as follows:
[recall, precision] = vl_pr(labels, scores) ;
The precision value precision(k) is the proportion of
samples with rank less than or equal to k-1 that are
positive (where ranks are assigned by decreasing
scores).
recall(k) is instead the percentage of positive
samples that have rank less than or equal to k-1. For
example, if the first two samples are one positive and one
negative,
precision(3) is 1/2. If there are in total 5
positive samples, then
recall(3) is 1/5.
Moving from rank k to rank
k+1, if the
sample of rank
k+1 is positive then
both
precision and
recall increase;
otherwise
precision decreases and
recall
stays constant. This gives the PR curve a characteristic
saw-tooth shape. For an ideal classifier that ranks all the positive
samples first, the PR curve is one that describes two sides of the unit
square.
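As with the ROC case, the following simplified sketch computes precision and recall from the definitions above (it is not the actual vl_pr implementation; the initial point, corresponding to rank zero, is conventionally set to recall 0 and precision 1):
[~, perm] = sort(scores, 'descend') ;
sortedLabels = labels(perm) ;
tp = cumsum(sortedLabels > 0) ;                % true positives among the top k samples
precisionSketch = [1, tp ./ (1:numel(tp))] ;   % precision(k+1) over the top k samples
recallSketch = [0, tp / sum(labels > 0)] ;     % recall(k+1) over the top k samples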
As for the ROC curves, the Area Under the Curve (AUC)
can be used to summarize the quality of a ranking in terms of precision
and recall. This can be obtained as info.auc by
[rc, pr, info] = vl_pr(labels, scores) ;
disp(info.auc) ;
disp(info.ap) ;
disp(info.ap_interp_11) ;
The AUC is obtained by trapezoidal interpolation of the precision.
An alternative and usually almost equivalent metric is the Average
Precision (AP), returned as info.ap. This is the
average of the precision obtained every time a new positive sample is
recalled. It is the same as the AUC if the precision is interpolated by
constant segments, and it is the definition most often used by
TREC. Finally, the 11-point interpolated average precision is
returned as
info.ap_interp_11. This is an older TREC
definition, obtained by averaging eleven precision
values, each computed as the maximum precision for recall larger than or
equal to 0.0, 0.1, ..., 1.0. This particular metric was used, for example, in
the PASCAL VOC challenge until the 2008 edition.
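To illustrate these two definitions, the following sketch recomputes them from the rc and pr vectors returned above (a simplified version of the definitions; the values stored in info are the reference):
apSketch = sum(pr(2:end) .* diff(rc)) ;   % precision averaged over each recall increment
ap11Sketch = 0 ;
for r = 0:0.1:1
  sel = rc >= r ;                         % recall levels at or above r
  p = 0 ;                                 % zero if no recall level reaches r
  if any(sel), p = max(pr(sel)) ; end     % maximum precision at recall >= r
  ap11Sketch = ap11Sketch + p / 11 ;      % average of the eleven interpolated precisions
end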