Table of Contents
This page describes the Vector of Locally Aggregated Descriptors (VLAD) image encoding of [9] . See Vector of Locally Aggregated Descriptors (VLAD) encoding for an overview of the C API.
VLAD is a feature encoding and pooling method, similar to Fisher vectors. VLAD encodes a set of local feature descriptors \(I=(\bx_1,\dots,\bx_n)\) extracted from an image using a dictionary built using a clustering method such as Gaussian Mixture Models (GMM) or K-means clustering. Let \(q_{ik}\) be the strength of the association of data vector \(\bx_i\) to cluster \(\mu_k\), such that \(q_{ik} \geq 0\) and \(\sum_{k=1}^K q_{ik} = 1\). The association may be either soft (e.g. obtained as the posterior probabilities of the GMM clusters) or hard (e.g. obtained by vector quantization with K-means).
\(\mu_k\) are the cluster means, vectors of the same dimension as the data \(\bx_i\). VLAD encodes feature \(\bx\) by considering the residuals
\[ \bv_k = \sum_{i=1}^{N} q_{ik} (\bx_{i} - \mu_k). \]
The residulas are stacked together to obtain the vector
\[ \hat\Phi(I) = \begin{bmatrix} \vdots \\ \bv_k \\ \vdots \end{bmatrix} \]
Before the VLAD encoding is used it is usually normalized, as explained VLAD normalization next.
VLAD normalization
VLFeat VLAD implementation supports a number of different normalization strategies. These are optionally applied in this order:
- Component-wise mass normalization. Each vector \(\bv_k\) is divided by the total mass of features associated to it \(\sum_{i=1}^N q_{ik}\).
- Square-rooting. The function \(\sign(z)\sqrt{|z|}\) is applied to all scalar components of the VLAD descriptor.
- Component-wise \(l^2\) normalization. The vectors \(\bv_k\) are divided by their norm \(\|\bv_k\|_2\).
- Global \(l^2\) normalization. The VLAD descriptor \(\hat\Phi(I)\) is divided by its norm \(\|\hat\Phi(I)\|_2\).