topo.base.ann

Module Contents

Classes

NMSlibTransformer

Wrapper for using nmslib as sklearn's KNeighborsTransformer. This implements

HNSWlibTransformer

Wrapper for using HNSWlib as sklearn's KNeighborsTransformer. This implements

AnnoyTransformer

Wrapper for using annoy.AnnoyIndex as sklearn's KNeighborsTransformer

FAISSTransformer

Mixin class for all transformers in scikit-learn.

Functions

kNN(X[, Y, n_neighbors, metric, n_jobs, backend, ...])

General function for computing k-nearest-neighbors graphs using NMSlib, HNSWlib, PyNNDescent, ANNOY, FAISS or scikit-learn.

get_sparse_row(mat, idx)

or_else_csrs(csr1, csr2)

postprocess_knn_csr(knns[, include_fwd, include_rev])

topo.base.ann.kNN(X, Y=None, n_neighbors=5, metric='euclidean', n_jobs=-1, backend='hnswlib', low_memory=True, M=15, p=11 / 16, efC=50, efS=50, n_trees=50, return_instance=False, verbose=False, **kwargs)

General function for computing k-nearest-neighbors graphs using NMSlib, HNSWlib, PyNNDescent, ANNOY, FAISS or scikit-learn.

Parameters:
  • X (np.ndarray or scipy.sparse.csr_matrix.) – Input data.

  • n_neighbors (int (optional, default 30)) – number of nearest-neighbors to look for. In practice, this should be considered the average neighborhood size and thus vary depending on your number of features, samples and data intrinsic dimensionality. Reasonable values range from 5 to 100. Smaller values tend to lead to increased graph structure resolution, but users should beware that a too low value may render granulated and vaguely defined neighborhoods that arise as an artifact of downsampling. Defaults to 30. Larger values can slightly increase computational time.

  • backend (str (optional, default 'nmslib').) – Which backend to use for neighborhood search. Options are ‘nmslib’, ‘hnswlib’, ‘pynndescent’,’annoy’, ‘faiss’ and ‘sklearn’.

  • metric (str (optional, default 'cosine').) – Accepted metrics. Defaults to ‘cosine’. Accepted metrics include: -‘sqeuclidean’ -‘euclidean’ -‘l1’ -‘lp’ - requires setting the parameter p - equivalent to minkowski distance -‘cosine’ -‘angular’ -‘negdotprod’ -‘levenshtein’ -‘hamming’ -‘jaccard’ -‘jansen-shan’

  • n_jobs (int (optional, default 1).) – Number of threads to be used in computation. Defaults to 1. Set to -1 to use all available CPUs. Most algorithms are highly scalable to multithreading.

  • M (int (optional, default 30).) – defines the maximum number of neighbors in the zero and above-zero layers during HSNW (Hierarchical Navigable Small World Graph). However, the actual default maximum number of neighbors for the zero layer is 2*M. A reasonable range for this parameter is 5-100. For more information on HSNW, please check https://arxiv.org/abs/1603.09320. HSNW is implemented in python via NMSlib. Please check more about NMSlib at https://github.com/nmslib/nmslib.

  • efC (int (optional, default 100).) – A ‘hnsw’ parameter. Increasing this value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. A reasonable range for this parameter is 50-2000.

  • efS (int (optional, default 100).) – A ‘hnsw’ parameter. Similarly to efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range for this parameter is 100-2000.

  • symmetrize (bool (optional, default True).) – Whether to symmetrize the output of approximate nearest neighbors search. The default is True and uses additive symmetrization, i.e. knn = ( knn + knn.T ) / 2 .

  • **kwargs (dict (optional, default {}).) – Additional parameters to be passed to the backend approximate nearest-neighbors library. Use only parameters known to the desired backend library.

Returns:

A scipy.sparse.csr_matrix containing k-nearest-neighbor distances.

class topo.base.ann.NMSlibTransformer(n_neighbors=15, metric='cosine', method='hnsw', n_jobs=-1, p=None, M=30, efC=100, efS=100, dense=False, verbose=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Wrapper for using nmslib as sklearn’s KNeighborsTransformer. This implements an escalable approximate k-nearest-neighbors graph on spaces defined by nmslib. Read more about nmslib and its various available metrics at https://github.com/nmslib/nmslib. Calling ‘nn <- NMSlibTransformer()’ initializes the class with default

neighbour search parameters.

Parameters:
  • n_neighbors (int (optional, default 30)) – number of nearest-neighbors to look for. In practice, this should be considered the average neighborhood size and thus vary depending on your number of features, samples and data intrinsic dimensionality. Reasonable values range from 5 to 100. Smaller values tend to lead to increased graph structure resolution, but users should beware that a too low value may render granulated and vaguely defined neighborhoods that arise as an artifact of downsampling. Defaults to 30. Larger values can slightly increase computational time.

  • metric (str (optional, default 'cosine').) – Accepted NMSLIB metrics. Defaults to ‘cosine’. Accepted metrics include: * ‘sqeuclidean’ * ‘euclidean’ * ‘l1’ * ‘lp’ - requires setting the parameter p - equivalent to minkowski distance * ‘cosine’ * ‘angular’ * ‘negdotprod’ * ‘levenshtein’ * ‘hamming’ * ‘jaccard’ * ‘jansen-shan’

  • method (str (optional, default 'hsnw').) –

    approximate-neighbor search method. Available methods include:

    -‘hnsw’ : a Hierarchical Navigable Small World Graph. -‘sw-graph’ : a Small World Graph. -‘vp-tree’ : a Vantage-Point tree with a pruning rule adaptable to non-metric distances. -‘napp’ : a Neighborhood APProximation index. -‘simple_invindx’ : a vanilla, uncompressed, inverted index, which has no parameters. -‘brute_force’ : a brute-force search, which has no parameters.

    ’hnsw’ is usually the fastest method, followed by ‘sw-graph’ and ‘vp-tree’.

  • n_jobs (int (optional, default -1).) – number of threads to be used in computation. Defaults to -1 (all but one). The algorithm is highly scalable to multi-threading.

  • M (int (optional, default 30).) – defines the maximum number of neighbors in the zero and above-zero layers during HSNW (Hierarchical Navigable Small World Graph). However, the actual default maximum number of neighbors for the zero layer is 2*M. A reasonable range for this parameter is 5-100. For more information on HSNW, please check https://arxiv.org/abs/1603.09320. HSNW is implemented in python via NMSlib. Please check more about NMSlib at https://github.com/nmslib/nmslib.

  • efC (int (optional, default 100).) – A ‘hnsw’ parameter. Increasing this value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. A reasonable range for this parameter is 50-2000.

  • efS (int (optional, default 100).) – A ‘hnsw’ parameter. Similarly to efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range for this parameter is 100-2000.

  • dense (bool (optional, default False).) – Whether to force the algorithm to use dense data, such as np.ndarrays and pandas DataFrames.

Returns:

Class for really fast approximate-nearest-neighbors search.

Example

import numpy as np from sklearn.datasets import load_digits from scipy.sparse import csr_matrix from topo.base.ann import NMSlibTransformer # # Load the MNIST digits data, convert to sparse for speed digits = load_digits() data = csr_matrix(digits) # # Start class with parameters nn = NMSlibTransformer() nn = nn.fit(data) # # Obtain kNN graph knn = nn.transform(data) # # Obtain kNN indices, distances and distance gradient ind, dist, grad = nn.ind_dist_grad(data) # # Test for recall efficiency during approximate nearest neighbors search test = nn.test_efficiency(data)

fit(data)
transform(data)
ind_dist_grad(data, return_grad=True, return_graph=True)
test_efficiency(data, data_use=0.1)

Test if NMSlibTransformer and KNeighborsTransformer give same results

Updates number of neighbors for kNN distance computation. :param n_neighbors: :type n_neighbors: New number of neighbors to look for.

fit_transform(X)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

class topo.base.ann.HNSWlibTransformer(n_neighbors=30, metric='cosine', n_jobs=-1, M=30, efC=100, efS=100, verbose=False)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Wrapper for using HNSWlib as sklearn’s KNeighborsTransformer. This implements an escalable approximate k-nearest-neighbors graph on spaces defined by hnwslib. Read more about hnwslib at https://github.com/nmslib/hnswlib Calling ‘nn <- HNSWlibTransformer()’ initializes the class with

neighbour search parameters.

Parameters:
  • n_neighbors (int (optional, default 30)) – number of nearest-neighbors to look for. In practice, this should be considered the average neighborhood size and thus vary depending on your number of features, samples and data intrinsic dimensionality. Reasonable values range from 5 to 100. Smaller values tend to lead to increased graph structure resolution, but users should beware that a too low value may render granulated and vaguely defined neighborhoods that arise as an artifact of downsampling. Defaults to 30. Larger values can slightly increase computational time.

  • metric (str (optional, default 'cosine')) – accepted NMSLIB metrics. Defaults to ‘cosine’. Accepted metrics include: * ‘sqeuclidean’ and ‘euclidean’ * ‘inner_product’ * ‘cosine’ For additional metrics, use the NMSLib backend.

  • n_jobs (int (optional, default -1)) – number of threads to be used in computation. Defaults to -1 (all but one). The algorithm is highly scalable to multi-threading.

  • M (int (optional, default 30)) – defines the maximum number of neighbors in the zero and above-zero layers during HSNW (Hierarchical Navigable Small World Graph). However, the actual default maximum number of neighbors for the zero layer is 2*M. A reasonable range for this parameter is 5-100. For more information on HSNW, please check https://arxiv.org/abs/1603.09320.

  • efC (int (optional, default 100)) – A ‘hnsw’ parameter. Increasing this value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. A reasonable range for this parameter is 50-2000.

  • efS (int (optional, default 100)) – A ‘hnsw’ parameter. Similarly to efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range for this parameter is 100-2000.

Returns:

Class for really fast approximate-nearest-neighbors search.

fit(data)
transform(data)
ind_dist_grad(data, return_grad=True, return_graph=True)
test_efficiency(data, percent_use=0.1)

Test if HNSWlibTransformer and KNeighborsTransformer give same results

Updates number of neighbors for kNN distance computation. :param n_neighbors: :type n_neighbors: New number of neighbors to look for.

fit_transform(X)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

class topo.base.ann.AnnoyTransformer(n_neighbors=5, metric='euclidean', n_trees=10, n_jobs=-1)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Wrapper for using annoy.AnnoyIndex as sklearn’s KNeighborsTransformer Read more about ANNOY at https://github.com/spotify/annoy

Calling ‘nn <- AnnoyTransformer()’ initializes the class with default

neighbour search parameters.

Parameters:
  • metric (str (optional, default 'euclidean').) – Accepted ANNOY metrics. Defaults to ‘euclidean’. Accepted metrics include: * ‘euclidean’ * ‘angular’ * ‘dot’ * ‘manhattan’ * ‘hamming’

  • n_jobs (int (optional, default -1).) – Number of threads to be used in computation. Defaults to -1 (all but one).

  • n_trees (int (optional, default 10).) – ANNOYS builds a forest of n_trees trees. More trees gives higher precision when querying, at a higher computational cost.

fit(X)
transform(X)
fit_transform(X, y=None)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

_transform(X)

As transform, but handles X is None for faster fit_transform.

class topo.base.ann.FAISSTransformer(n_neighbors=5, *, metric='euclidean', index_key='', n_probe=128, n_jobs=-1, include_fwd=True, include_rev=False)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Mixin class for all transformers in scikit-learn.

If get_feature_names_out is defined, then BaseEstimator will automatically wrap transform and fit_transform to follow the set_output API. See the developer_api_set_output for details.

base.OneToOneFeatureMixin and base.ClassNamePrefixFeaturesOutMixin are helpful mixins for defining get_feature_names_out.

property _metric_info
mk_faiss_index(feats, inner_metric, index_key='', nprobe=128)
fit(X, y=None)
transform(X)
_transform(X)
fit_transform(X, y=None)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

_more_tags()
topo.base.ann.get_sparse_row(mat, idx)
topo.base.ann.or_else_csrs(csr1, csr2)
topo.base.ann.postprocess_knn_csr(knns, include_fwd=True, include_rev=False)