This page lists a number of sample VLFeat applications. Their code can be found in the VLROOT/apps/ subdirectory.

Basic recognition

This sample application uses VLFeat to train an test an image classifier on the Caltech-101 data. The classifier achieves 65% average accuracy by using a single feature and 15 training images per class. It uses:

PHOW features (dense multi-scale SIFT descriptors)
Elkan k-means for fast visual word dictionary construction
Spatial histograms as image descriptors
A homogeneous kernel map to transform a Chi2 support vector machine (SVM) into a linear one
SVM classifiers

The program is fully contained in a single MATLAB M-file, and can also be simply adapted to use your own data (change conf.calDir).

Advanced encodings for recognition

This example application extends the Caltech-101 demo above in many ways: it supports multiple encoding methods, including BoVW, VLAD, and Fisher Vectors, tweaked image featuers, and multiple benchmark datasets. The code is located int apps/recognition. Start from the main file.

The following tables report results on a few standard benchmark datasets (PASCAL VOC 2007 classification challenge, Caltech 101 30 training images, MIT Scene 67, and Flickr Material Dataset) for a number of different encodings:

method	VOC07	Caltech 101	Scene 67	FMD
FV	59.12% mAP	73.02% Acc	58.25% Acc	59.60% Acc
FV + aug.	60.25% mAP	75.61% Acc	57.57% Acc	60.80% Acc
FV + s.p.	62.23% mAP	77.63% Acc	61.83% Acc	60.80% Acc
VLAD + aug.	54.66% mAP	78.68% Acc	53.29% Acc	49.40% Acc
BOVW + aug.	49.87% mAP	75.98% Acc	50.22% Acc	46.00% Acc

The baseline feature is SIFT (vl_dsift) computed at seven scales with a factor $\sqrt{2}$ between successive scales, bins 8 pixel wide, and comptued with a step of 4 pixels. All experiments but the Caltech-101 ones start by doubling the resolution of the input image. The details of the encodings are as follows:

Bag-of-visual-words uses 4096 vector quantized visual words histogram square rooting, followed by $L^2$ normalization (Hellinger's kernel).
VLAD uses 256 vector quantized visual words, signed square-rooting, component wise $L^2$ normalization, and global $L^2$ normalization (see vl_vlad).
Fisher vectors uses a 256 visual words GMM and the improved formulation (signed square-rooting followed by $L^2$ normlization, see vl_fisher).
Learning uses a linear SVM (see vl_svmtrain). The parameter $C$ is set to 10 for all dataset except PASCAL VOC, for which it is set to 1.
Experiments labelled with “aug.” encode spatial information by appending the feature coordinates to the descriptor; the ones labelled with “s.p.” use a spatial pyramid with 1x1 and 3x1 subdivisions.

SIFT mosaic

This sample application uses VLFeat to extract SIFT features form a pair of images and match them. It then filters the matches based on RANSAC and produces a mosaic. Read the code.

VLFeat.org

Basic recognition

Advanced encodings for recognition

SIFT mosaic