topo.base

Submodules

Package Contents

Functions

kNN(X[, Y, n_neighbors, metric, n_jobs, backend, ...])

General function for computing k-nearest-neighbors graphs using NMSlib, HNSWlib, PyNNDescent, ANNOY, FAISS or scikit-learn.

Attributes

_have_numba

topo.base.kNN(X, Y=None, n_neighbors=5, metric='euclidean', n_jobs=-1, backend='hnswlib', low_memory=True, M=15, p=11 / 16, efC=50, efS=50, n_trees=50, return_instance=False, verbose=False, **kwargs)

General function for computing k-nearest-neighbors graphs using NMSlib, HNSWlib, PyNNDescent, ANNOY, FAISS or scikit-learn.

Parameters:
  • X (np.ndarray or scipy.sparse.csr_matrix) – Input data.

  • n_neighbors (int (optional, default 5)) – Number of nearest neighbors to look for. In practice, treat this as the average neighborhood size; a suitable value depends on the number of features, the number of samples and the intrinsic dimensionality of the data. Reasonable values range from 5 to 100. Smaller values tend to increase the resolution of the graph structure, but a value that is too low may produce fragmented, poorly defined neighborhoods that are an artifact of downsampling. Larger values can slightly increase computation time.

  • backend (str (optional, default 'hnswlib')) – Which backend to use for neighborhood search. Options are 'nmslib', 'hnswlib', 'pynndescent', 'annoy', 'faiss' and 'sklearn'.

  • metric (str (optional, default 'euclidean')) – Distance metric to use. Accepted metrics include:
      - 'sqeuclidean'
      - 'euclidean'
      - 'l1'
      - 'lp' (requires setting the parameter p; equivalent to Minkowski distance)
      - 'cosine'
      - 'angular'
      - 'negdotprod'
      - 'levenshtein'
      - 'hamming'
      - 'jaccard'
      - 'jansen-shan'

  • n_jobs (int (optional, default -1)) – Number of threads to use in the computation. The default of -1 uses all available CPUs. Most backends scale well with multithreading.

  • M (int (optional, default 15)) – Maximum number of neighbors in the zero and above-zero layers of the HNSW (Hierarchical Navigable Small World) graph. Note that the actual default maximum number of neighbors for the zero layer is 2*M. A reasonable range for this parameter is 5-100. For more information on HNSW, see https://arxiv.org/abs/1603.09320. HNSW is implemented in Python via NMSlib; see https://github.com/nmslib/nmslib.

  • efC (int (optional, default 50)) – An HNSW parameter. Increasing this value improves the quality of the constructed graph and the accuracy of the search, at the cost of longer indexing times. A reasonable range for this parameter is 50-2000.

  • efS (int (optional, default 50)) – An HNSW parameter. As with efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range for this parameter is 100-2000.

  • symmetrize (bool (optional, default True)) – Whether to symmetrize the output of the approximate nearest-neighbors search. The default is True and uses additive symmetrization, i.e. knn = (knn + knn.T) / 2.

  • **kwargs (dict (optional, default {})) – Additional parameters to be passed to the backend approximate nearest-neighbors library. Use only parameters known to the desired backend library.
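
The 'lp' metric above is described as equivalent to Minkowski distance with exponent p. As a rough illustration only, the sketch below computes that distance in pure Python. The helper name lp_distance is hypothetical and not part of topo, and how each backend evaluates 'lp' internally is not shown here. Note that for p < 1, such as the signature default p=11/16, the quantity no longer satisfies the triangle inequality and so is not a metric in the strict sense.

```python
def lp_distance(x, y, p):
    """Minkowski (Lp) distance: (sum_i |x_i - y_i|**p) ** (1/p).

    Hypothetical helper for illustration; not part of topo.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y = (0.0, 0.0), (3.0, 4.0)

d1 = lp_distance(x, y, 1)            # p = 1 is the Manhattan ('l1') distance
d2 = lp_distance(x, y, 2)            # p = 2 is the Euclidean distance
d_frac = lp_distance(x, y, 11 / 16)  # fractional p, as in the signature default
```

For a fixed pair of points the value grows as p shrinks, so the fractional-p distance here is larger than the 'l1' distance.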

Returns:

A scipy.sparse.csr_matrix containing k-nearest-neighbor distances.
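
To make the returned distance graph and the additive symmetrization knn = (knn + knn.T) / 2 more concrete, here is a minimal brute-force sketch in pure Python. It does not call topo or any of the backends; it uses dense nested lists instead of a scipy.sparse.csr_matrix, and the helper names knn_graph and symmetrize are hypothetical.

```python
import math


def knn_graph(points, n_neighbors):
    """Brute-force kNN distance matrix; 0.0 means 'no edge'.

    Hypothetical helper for illustration; not part of topo.
    """
    n = len(points)
    graph = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        # Keep only the n_neighbors closest points as outgoing edges.
        for d, j in sorted(dists)[:n_neighbors]:
            graph[i][j] = d
    return graph


def symmetrize(graph):
    """Additive symmetrization: (G + G.T) / 2."""
    n = len(graph)
    return [[(graph[i][j] + graph[j][i]) / 2 for j in range(n)] for i in range(n)]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
G = knn_graph(points, n_neighbors=1)  # directed: row i holds i's neighbors
S = symmetrize(G)                     # undirected: S[i][j] == S[j][i]
```

In this toy example the third point lists the second as its nearest neighbor but not the other way around, so G is asymmetric. After additive symmetrization that one-directional edge appears in both directions with half its original weight.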

topo.base._have_numba = True