topo.tpgraph.fuzzy

Module Contents

Functions

fuzzy_simplicial_set(X[, n_neighbors, metric, ...])

Given a set of data X, a neighborhood size, and a measure of distance

compute_membership_strengths(knn_indices, knn_dists, ...)

Construct the membership strength data for the 1-skeleton of each local

smooth_knn_dist(distances, k[, n_iter, ...])

Compute a continuous version of the distance to the kth nearest

Attributes

SMOOTH_K_TOLERANCE

MIN_K_DIST_SCALE

NPY_INFINITY

INT32_MIN

INT32_MAX

topo.tpgraph.fuzzy.SMOOTH_K_TOLERANCE = 1e-06
topo.tpgraph.fuzzy.MIN_K_DIST_SCALE = 0.0001
topo.tpgraph.fuzzy.NPY_INFINITY
topo.tpgraph.fuzzy.INT32_MIN
topo.tpgraph.fuzzy.INT32_MAX
topo.tpgraph.fuzzy.fuzzy_simplicial_set(X, n_neighbors=15, metric='cosine', backend='nmslib', n_jobs=1, set_op_mix_ratio=1.0, local_connectivity=1.0, apply_set_operations=True, return_dists=False, verbose=False, **kwargs)

Given a set of data X, a neighborhood size, and a measure of distance compute the fuzzy simplicial set (here represented as a fuzzy graph in the form of a sparse matrix) associated to the data. This is done by locally approximating geodesic distance at each point, creating a fuzzy simplicial set for each such point, and then combining all the local fuzzy simplicial sets into a global one via a fuzzy union.

Originally implemented by Leland McInnes at https://github.com/lmcinnes/umap under the BSD 3-Clause License.

Parameters:
  • X (array of shape (n_samples, n_features).) – The data to be modelled as a fuzzy simplicial set.

  • n_neighbors (int.) – The number of neighbors to use to approximate geodesic distance. Larger numbers induce more global estimates of the manifold that can miss finer detail, while smaller values will focus on fine manifold structure to the detriment of the larger picture.

  • backend (str (optional, default 'nmslib').) – Which backend to use for neighborhood search. Options are ‘nmslib’, ‘hnswlib’, ‘pynndescent’,’annoy’, ‘faiss’ and ‘sklearn’.

  • metric (str (optional, default 'cosine').) – Accepted metrics. Defaults to ‘cosine’. Accepted metrics include: -‘sqeuclidean’ -‘euclidean’ -‘l1’ -‘lp’ - requires setting the parameter p - equivalent to minkowski distance -‘cosine’ -‘angular’ -‘negdotprod’ -‘levenshtein’ -‘hamming’ -‘jaccard’ -‘jansen-shan’

  • n_jobs (int (optional, default 1).) – Number of threads to be used in computation of nearest neighbors. Set to -1 to use all available CPUs.

  • knn_indices (array of shape (n_samples, n_neighbors) (optional).) – If the k-nearest neighbors of each point has already been calculated you can pass them in here to save computation time. This should be an array with the indices of the k-nearest neighbors as a row for each data point. Ignored if metric is ‘precomputed’.

  • knn_dists (array of shape (n_samples, n_neighbors) (optional).) – If the k-nearest neighbors of each point has already been calculated you can pass them in here to save computation time. This should be an array with the distances of the k-nearest neighbors as a row for each data point. Ignored if metric is ‘precomputed’.

  • set_op_mix_ratio (float (optional, default 1.0).) – Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

  • local_connectivity (int (optional, default 1)) – The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

  • verbose (bool (optional, default False)) – Whether to report information on the current progress of the algorithm.

  • return_dists (bool or None (optional, default none)) – Whether to return the pairwise distance associated with each edge.

  • **kwargs (dict (optional, default {}).) – Additional parameters to be passed to the backend approximate nearest-neighbors library. Use only parameters known to the desired backend library.

Returns:

  • fuzzy_ss (coo_matrix) – A fuzzy simplicial set represented as a sparse matrix. The (i, j) entry of the matrix represents the membership strength of the 1-simplex between the ith and jth sample points.

  • sigmas (array of shape (n_samples,)) – The normalization factor derived from the metric tensor approximation. Equal to the distance

  • rhos (array of shape (n_samples,)) – The distance to the 1st nearest neighbor for each point.

topo.tpgraph.fuzzy.compute_membership_strengths(knn_indices, knn_dists, sigmas, rhos)

Construct the membership strength data for the 1-skeleton of each local fuzzy simplicial set – this is formed as a sparse matrix where each row is a local fuzzy simplicial set, with a membership strength for the 1-simplex to each other data point.

Originally implemented by Leland McInnes at https://github.com/lmcinnes/umap under the BSD 3-Clause License.

Parameters:
  • knn_indices (array of shape (n_samples, n_neighbors)) – The indices on the n_neighbors closest points in the dataset.

  • knn_dists (array of shape (n_samples, n_neighbors)) – The distances to the n_neighbors closest points in the dataset.

  • sigmas (array of shape(n_samples)) – The normalization factor derived from the metric tensor approximation.

  • rhos (array of shape(n_samples)) – The local connectivity adjustment.

Returns:

  • rows (array of shape (n_samples * n_neighbors)) – Row data for the resulting sparse matrix (coo format)

  • cols (array of shape (n_samples * n_neighbors)) – Column data for the resulting sparse matrix (coo format)

  • vals (array of shape (n_samples * n_neighbors)) – Entries for the resulting sparse matrix (coo format)

topo.tpgraph.fuzzy.smooth_knn_dist(distances, k, n_iter=64, local_connectivity=1.0, bandwidth=1.0)

Compute a continuous version of the distance to the kth nearest neighbor. That is, this is similar to knn-distance but allows continuous k values rather than requiring an integral k. In essence we are simply computing the distance such that the cardinality of fuzzy set we generate is k.

Originally implemented by Leland McInnes at https://github.com/lmcinnes/umap under the BSD 3-Clause License.

Parameters:
  • distances (array of shape (n_samples, n_neighbors)) – Distances to nearest neighbors for each samples. Each row should be a sorted list of distances to a given samples nearest neighbors.

  • k (float) – The number of nearest neighbors to approximate for.

  • n_iter (int (optional, default 64)) – We need to binary search for the correct distance value. This is the max number of iterations to use in such a search.

  • local_connectivity (int (optional, default 1)) – The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

  • bandwidth (float (optional, default 1)) – The target bandwidth of the kernel, larger values will produce larger return values.

Returns:

  • knn_dist (array of shape (n_samples,)) – The distance to kth nearest neighbor, as suitably approximated.

  • nn_dist (array of shape (n_samples,)) – The distance to the 1st nearest neighbor for each point.