IVF-HNSW
 All Classes Functions Variables Typedefs Pages
Public Types | Public Member Functions | Public Attributes | Protected Member Functions | Protected Attributes | List of all members
ivfhnsw::IndexIVF_HNSW Struct Reference

#include <IndexIVF_HNSW.h>

Inheritance diagram for ivfhnsw::IndexIVF_HNSW:
ivfhnsw::IndexIVF_HNSW_Grouping

Public Types

typedef uint32_t idx_t
 all indices are this type
 

Public Member Functions

 IndexIVF_HNSW (size_t dim, size_t ncentroids, size_t bytes_per_code, size_t nbits_per_idx)
 
void build_quantizer (const char *path_data, const char *path_info, const char *path_edges, size_t M=16, size_t efConstruction=500)
 
void assign (size_t n, const float *x, idx_t *labels, size_t k=1)
 
virtual void search (size_t k, const float *x, float *distances, long *labels)
 
virtual void add_batch (size_t n, const float *x, const idx_t *xids, const idx_t *precomputed_idx=nullptr)
 
virtual void train_pq (size_t n, const float *x)
 
virtual void write (const char *path)
 Write index to the path.
 
virtual void read (const char *path)
 Read index from the path.
 
void compute_centroid_norms ()
 Compute norms of the HNSW vertices.
 
void rotate_quantizer ()
 For correct search using OPQ encoding rotate points in the coarse quantizer.
 

Public Attributes

size_t d
 Vector dimension.
 
size_t nc
 Number of centroids.
 
size_t code_size
 Code size per vector in bytes.
 
hnswlib::HierarchicalNSW * quantizer
 Quantizer that maps vectors to inverted lists (HNSW [Y.Malkov])
 
faiss::ProductQuantizer * pq
 Produces the residual codes.
 
faiss::ProductQuantizer * norm_pq
 Produces the norm codes of reconstructed base vectors.
 
faiss::LinearTransform * opq_matrix
 Rotation matrix for OPQ encoding.
 
bool do_opq
 Turn on/off OPQ encoding.
 
size_t nprobe
 Number of probes at search time.
 
size_t max_codes
 Max number of codes to visit to do a query.
 
std::vector< std::vector< idx_t > > ids
 Inverted lists for indexes.
 
std::vector< std::vector
< uint8_t > > 
codes
 PQ codes of residuals.
 
std::vector< std::vector
< uint8_t > > 
norm_codes
 PQ codes of norms of reconstructed base vectors.
 

Protected Member Functions

float pq_L2sqr (const uint8_t *code)
 L2 sqr distance function for PQ codes.
 

Protected Attributes

std::vector< float > norms
 L2 square norms of reconstructed base vectors.
 
std::vector< float > centroid_norms
 L2 square norms of coarse centroids.
 
std::vector< float > precomputed_table
 Size pq.M * pq.ksub.
 

Detailed Description

Index based on a inverted file (IVF) with Product Quantizer encoding.

In the inverted file, the quantizer (an HNSW instance) provides a quantization index for each vector to be added. The quantization index maps to a list (aka inverted list or posting list), where the id of the vector is then stored.

At search time, the vector to be searched is also quantized, and only the list corresponding to the quantization index is searched. This speeds up the search by making it non-exhaustive. This can be relaxed using multi-probe search: a few (nprobe) quantization indices are selected and several inverted lists are visited.

Supports HNSW quantizer construction, PQ training, adding vertices, serialization and searching.

Each residual vector is encoded as a product quantizer code.

Currently only asymmetric queries are supported: database-to-database queries are not implemented.

Member Function Documentation

void ivfhnsw::IndexIVF_HNSW::add_batch ( size_t  n,
const float *  x,
const idx_t xids,
const idx_t precomputed_idx = nullptr 
)
virtual

Add n vectors of dimension d to the index.

Parameters
nnumber of base vectors in a batch
xbase vectors to add, size n * d
xidsids to store for the vectors (size n)
precomputed_idxif non-null, assigned idxs to store for the vectors (size n)
void ivfhnsw::IndexIVF_HNSW::assign ( size_t  n,
const float *  x,
idx_t labels,
size_t  k = 1 
)

Return the indices of the k HNSW vertices closest to the query x.

Parameters
nnumber of input vectors
xquery vectors, size n * d
labelsoutput labels of the nearest neighbours, size n * k
knumber of the closest HNSW vertices to the query x
void ivfhnsw::IndexIVF_HNSW::build_quantizer ( const char *  path_data,
const char *  path_info,
const char *  path_edges,
size_t  M = 16,
size_t  efConstruction = 500 
)

Construct from stretch or load the existing quantizer (HNSW) instance

if all files exist, quantizer will be loaded, else HNSW will be constructed

Parameters
path_datapath to input vectors
path_infopath to parameters for HNSW
path_edgespath to edges for HNSW
Mmin number of edges per point, default: 16
efConstructionmax number of candidate vertices in queue to observe, default: 500

There has been removed parallel HNSW construction in order to make internal centroid ids equal to external ones. Construction time is still acceptable: ~5 minutes for 1 million 96-d vectors on Intel Xeon E5-2650 V2 2.60GHz.

void ivfhnsw::IndexIVF_HNSW::search ( size_t  k,
const float *  x,
float *  distances,
long *  labels 
)
virtual

Query n vectors of dimension d to the index.

Return at most k vectors. If there are not enough results for a query, the result array is padded with -1s.

Parameters
knumber of the closest vertices to search
xquery vectors, size n * d
distancesoutput pairwise distances, size n * k
labelsoutput labels of the nearest neighbours, size n * k

Search procedure

During IVF-HNSW-PQ search we compute

d = || x - y_C - y_R ||^2

where x is the query vector, y_C the coarse centroid, y_R the refined PQ centroid. The expression can be decomposed as:

d = || x - y_C ||^2 - || y_C ||^2 + || y_C + y_R ||^2 - 2 * (x|y_R)


term 1 term 2 term 3

We use the following decomposition:

  • term 1 is the distance to the coarse centroid, that is computed during the 1st stage search in the HNSW graph, minus the norm of the coarse centroid
  • term 2 is the L2 norm of the reconstructed base point, that is computed at construction time, quantized using separately trained product quantizer for such norms and stored along with the residual PQ codes.
  • term 3 is the classical non-residual distance table.

Norms of centroids are precomputed and saved without compression, as their memory consumption is negligible. If it is necessary, the norms can be added to the term 3 and compressed to byte together. We do not think that it will lead to considerable decrease in accuracy.

Since y_R defined by a product quantizer, it is split across sub-vectors and stored separately for each subvector.

Reimplemented in ivfhnsw::IndexIVF_HNSW_Grouping.

void ivfhnsw::IndexIVF_HNSW::train_pq ( size_t  n,
const float *  x 
)
virtual

Train product quantizers

Parameters
nnumber of training vectors of dimension d
xlearn vectors, size n * d

Reimplemented in ivfhnsw::IndexIVF_HNSW_Grouping.


The documentation for this struct was generated from the following files: