topo.tpgraph

Submodules

Package Contents

Classes

Kernel

Scikit-learn flavored class for computing a kernel matrix from a set of points. Includes functions for computing the kernel matrix with a variety of methods and for operating on the resulting graph.

GeneralizedProcrustes

Generalized Procrustes Analysis in a scikit-learn flavor.

IntrinsicDim

Scikit-learn flavored class for estimating the intrinsic dimensionalities of high-dimensional data.

Functions

compute_kernel(X[, metric, n_neighbors, fuzzy, cknn, ...])

Compute a kernel matrix from a set of points.

cknn_graph(X[, n_neighbors, delta, metric, weighted, ...])

Function-oriented implementation of Continuous k-Nearest-Neighbors (CkNN).

fuzzy_simplicial_set(X[, n_neighbors, metric, ...])

Given a set of data X, a neighborhood size, and a measure of distance, compute the fuzzy simplicial set associated to the data.

fit_transform_procrustes(x, fit_transform_call[, ...])

Fit model and transform data for larger datasets. This is from GRAE (https://github.com/KevinMoonLab/GRAE).

topo.tpgraph.compute_kernel(X, metric='cosine', n_neighbors=10, fuzzy=False, cknn=False, delta=1.0, pairwise=False, sigma=None, adaptive_bw=True, expand_nbr_search=False, alpha_decaying=False, return_densities=False, symmetrize=True, backend='nmslib', n_jobs=-1, verbose=False, **kwargs)

Compute a kernel matrix from a set of points.

Parameters:
  • X (array-like, shape (n_samples, n_features).) – The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix.

  • metric (string (optional, default 'cosine').) – The metric to use when computing the kernel matrix. Possible values are: ‘cosine’, ‘euclidean’ and others. Accepts precomputed distances.

  • n_neighbors (int (optional, default 10).) – The number of neighbors to use when computing the kernel matrix. Ignored if pairwise is set to True.

  • fuzzy (bool (optional, default False).) – Whether to build the kernel matrix using fuzzy simplicial sets, similarly to UMAP. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored. If set to True at the same time that cknn is set to True, the cknn parameter is ignored.

  • cknn (bool (optional, default False).) – Whether to build the adjacency and affinity matrices using continuous k-nearest-neighbors. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored.

  • delta (float (optional, default 1.0).) – The scaling factor for the CkNN kernel. Ignored if cknn is set to False.

  • pairwise (bool (optional, default False).) – Whether to compute the kernel using dense pairwise distances. If set to True, the n_neighbors and backend parameters are ignored. Uses numba for computations if available. If not, uses sklearn.

  • sigma (float (optional, default None).) – Scaling factor if using fixed bandwidth kernel (only used if adaptive_bw is set to False).

  • adaptive_bw (bool (optional, default True).) – Whether to use an adaptive bandwidth based on the distance to median k-nearest-neighbor.

  • expand_nbr_search (bool (optional, default False).) – Whether to expand the neighborhood search (mitigates a choice of too small a number of k-neighbors).

  • alpha_decaying (bool (optional, default False).) – Whether to use an adaptively decaying kernel.

  • return_densities (bool (optional, default False).) – Whether to return the bandwidth metrics as a dictionary. If set to True, the function returns a tuple containing the kernel matrix and a dictionary containing the bandwidth metrics.

  • symmetrize (bool (optional, default True).) – Whether to symmetrize the kernel matrix after normalizations.

  • backend (str (optional, default 'nmslib').) – Which backend to use for neighborhood computations. Defaults to ‘nmslib’. Options are ‘nmslib’, ‘hnswlib’, ‘faiss’, ‘annoy’ and ‘sklearn’.

  • n_jobs (int (optional, default -1).) – The number of jobs to use for parallel computations. If set to -1, all available cores are used.

  • verbose (bool (optional, default False).) – Whether to print progress messages.

  • **kwargs (dict, optional) – Additional arguments to be passed to the nearest-neighbors backend.

Returns:

  • K (array-like, shape (n_samples, n_samples)) – The kernel matrix.

  • densities (dict, optional (if return_densities is set to True)) – If fuzzy and cknn are False, is a dictionary containing the bandwidth metrics. If fuzzy is set to True, the dictionary contains sigma and rho estimates. If cknn is set to True, the dictionary contains the bandwidth metric.
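
A minimal usage sketch (not part of the original docstring), assuming TopOMetry is importable as `topo` and `compute_kernel` behaves as documented above; the toy data and parameter choices are arbitrary:

```python
import numpy as np
from topo.tpgraph import compute_kernel

# Arbitrary toy data: 500 points in 50 dimensions
X = np.random.default_rng(0).normal(size=(500, 50))

# Adaptive-bandwidth kernel from a k-nearest-neighbors graph;
# backend='sklearn' uses exact neighbors and avoids the approximate-NN dependencies
K = compute_kernel(X, metric='euclidean', n_neighbors=10, backend='sklearn')

# With return_densities=True, a (kernel, densities) tuple is returned
K, densities = compute_kernel(X, metric='euclidean', n_neighbors=10,
                              backend='sklearn', return_densities=True)
```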

class topo.tpgraph.Kernel(metric='cosine', n_neighbors=10, fuzzy=False, cknn=False, pairwise=False, sigma=None, adaptive_bw=True, expand_nbr_search=False, alpha_decaying=False, symmetrize=True, backend='nmslib', n_jobs=1, laplacian_type='normalized', anisotropy=1, semi_aniso=False, n_landmarks=None, cache_input=False, verbose=False, random_state=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Scikit-learn flavored class for computing a kernel matrix from a set of points. Includes functions for computing the kernel matrix with a variety of methods (adaptive bandwidth, fuzzy simplicial sets, continuous k-nearest-neighbors, etc) and performing operations on the resulting graph, such as obtaining its Laplacian, sparsifying it, filtering and interpolating signals, obtaining diffusion operators, imputing missing values and computing shortest paths.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix.

  • metric (string (optional, default 'cosine')) – The metric to use when computing the kernel matrix. Possible values are: ‘cosine’, ‘euclidean’ and others, depending on the chosen nearest-neighbors backend. Accepts precomputed distances as ‘precomputed’.

  • n_neighbors (int (optional, default 10).) – The number of neighbors to use when computing the kernel matrix. Ignored if pairwise is set to True.

  • fuzzy (bool, optional (default False).) – Whether to build the kernel matrix using fuzzy simplicial sets, similarly to UMAP. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored. If set to True at the same time that cknn is set to True, the cknn parameter is ignored.

  • cknn (bool, optional (default False).) – Whether to build the adjacency and affinity matrices using continuous k-nearest-neighbors. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored. If set to True, laplacian_type is automatically set to ‘unnormalized’.

  • pairwise (bool (optional, default False).) – Whether to compute the kernel using dense pairwise distances. If set to True, the n_neighbors and backend parameters are ignored. Uses numba for computations if available. If not, uses sklearn.

  • sigma (float (optional, default None)) – Scaling factor if using fixed bandwidth kernel (only used if adaptive_bw is set to False).

  • adaptive_bw (bool (optional, default True).) – Whether to use an adaptive bandwidth based on the distance to median k-nearest-neighbor.

  • expand_nbr_search (bool (optional, default False).) – Whether to expand the neighborhood search (mitigates a choice of too small a number of k-neighbors).

  • alpha_decaying (bool (optional, default False).) – Whether to use an adaptively decaying kernel.

  • laplacian_type (str (optional, default 'normalized').) – The type of Laplacian to compute. Possible values are: ‘normalized’, ‘unnormalized’, ‘random_walk’ and ‘geometric’.

  • anisotropy (float (optional, default 1).) – The anisotropy (alpha) parameter in the diffusion maps literature for kernel reweighting.

  • semi_aniso (bool (optional, default False).) – Whether to use semi-anisotropic diffusion. This reweights the original kernel (not the renormalized kernel) by the renormalized degree.

  • symmetrize (bool (optional, default True).) – Whether to symmetrize the kernel matrix after normalizations.

  • backend (str (optional, default 'nmslib').) – Which backend to use for k-nearest-neighbor computations. Defaults to ‘nmslib’. Options are ‘nmslib’, ‘hnswlib’, ‘faiss’, ‘annoy’ and ‘sklearn’.

  • n_jobs (int (optional, default 1).) – The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallelization (multiprocessing) is *highly* recommended whenever possible.

Properties

knn : scipy.sparse.csr_matrix, shape (n_samples, n_samples)

The computed k-nearest-neighbors graph.

A : scipy.sparse.csr_matrix, shape (n_samples, n_samples)

The adjacency matrix of the kernel graph.

K : scipy.sparse.csr_matrix, shape (n_samples, n_samples)

The kernel matrix.

dens_dict : dict

Dictionary containing the density information for each point in the dataset for the computed kernel.

L : scipy.sparse.csr_matrix, shape (n_samples, n_samples)

The laplacian matrix of the kernel graph.

P : scipy.sparse.csr_matrix, shape (n_samples, n_samples)

The diffusion operator of the kernel graph.

SP : scipy.sparse.csr_matrix, shape (n_samples, n_samples)

The shortest-path matrix.

degree : np.ndarray, shape (n_samples,)

The degree of each point in the adjacency graph.

weighted_degree : np.ndarray, shape (n_samples,)

The weighted degree of each point in the kernel graph.

property knn

Returns the k-nearest-neighbors graph.

property K

Kernel matrix.

property A

Graph adjacency matrix.

property degree

Returns the degree of the adjacency matrix.

property weighted_degree

Returns the degree of the weighted affinity matrix.

property L

Object synonym for the laplacian() function.

property P

Object synonym for the diff_op() function.

property SP

Object synonym for the shortest_paths() function.

__repr__()

Return repr(self).

_parse_backend()
fit(X, recompute=False, **kwargs)

Fits the kernel matrix to the data.

Parameters:
  • X (array-like, shape (n_samples, n_features).) – Input data. Takes in numpy arrays and scipy csr sparse matrices. Use with sparse data for top performance. You can adjust a series of parameters that can make the process faster and more informational depending on your dataset.

  • recompute (bool (optional, default False).) – Whether to recompute the kernel if it has already been computed.

  • **kwargs (dict, optional) – Additional arguments to be passed to the k-nearest-neighbor backend.

Returns:

The kernel matrix.

transform(X=None)

Returns the kernel matrix. Here for compatibility with scikit-learn only.

fit_transform(X, y=None)

Fits the kernel matrix to the data and returns the kernel matrix. Here for compatibility with scikit-learn only.
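
A minimal sketch of the class-based workflow (not from the original docs), assuming the constructor, fit() and the properties behave as documented above; the toy data is arbitrary:

```python
import numpy as np
from topo.tpgraph import Kernel

X = np.random.default_rng(0).normal(size=(500, 50))  # arbitrary toy data

# Build the graph; backend='sklearn' uses exact neighbors and avoids extra dependencies
kernel = Kernel(metric='euclidean', n_neighbors=10, backend='sklearn')
kernel.fit(X)

K = kernel.K   # kernel matrix (sparse)
A = kernel.A   # adjacency matrix
L = kernel.L   # graph Laplacian (type set by laplacian_type)
P = kernel.P   # diffusion operator
```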

adjacency()

Graph adjacency matrix (the binary version of W).

The adjacency matrix defines which edges exist on the graph. It is represented as an N-by-N matrix of booleans. \(A_{i,j}\) is True if \(W_{i,j} > 0\).

laplacian(laplacian_type=None)

Compute the graph Laplacian, given an adjacency or affinity graph W. For a friendly reference, see this material from James Melville: https://jlmelville.github.io/smallvis/spectral.html

Parameters:

laplacian_type (str (optional, default None).) – The type of laplacian to use. Can be ‘unnormalized’, ‘normalized’, or ‘random_walk’. If not provided, will use the default laplacian type specified in the constructor.

Returns:

L (scipy.sparse.csr_matrix) – The graph Laplacian.

diff_op(anisotropy=1.0, symmetric=False)

Computes the [diffusion operator](https://doi.org/10.1016/j.acha.2006.04.006).

Parameters:
  • anisotropy (float (optional, default 1.0).) – Anisotropy to apply. ‘Alpha’ in the diffusion maps literature. If None, defaults to the anisotropy defined in the constructor.

  • symmetric (bool (optional, default False).) – Whether to use a symmetric version of the diffusion operator. This is particularly useful to yield a symmetric operator when using anisotropy (alpha > 1), as the diffusion operator P would otherwise be asymmetric, which can be problematic during matrix decomposition. The eigenvalues are the same as those of the asymmetric version, and the eigenvectors of the original asymmetric operator can be obtained by left-multiplying by Kernel.D_inv_sqrt_.

Returns:

  • P (scipy.sparse.csr_matrix) – The graph diffusion operator.

  • Populates the Kernel.D_inv_sqrt_ slot.
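
For illustration, a hedged sketch of calling these two methods directly on a fitted Kernel object (parameter values are arbitrary):

```python
# Assuming `kernel` has already been fit as in the sketch above
L_rw = kernel.laplacian(laplacian_type='random_walk')  # overrides the constructor default

# Symmetric, anisotropic diffusion operator; per the docs, the eigenvectors of the
# asymmetric operator can be recovered by left-multiplying with Kernel.D_inv_sqrt_
P_sym = kernel.diff_op(anisotropy=1.0, symmetric=True)
```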

shortest_paths(landmark=False, indices=None)

Compute the shortest paths between all pairs of nodes. If landmark is True, the shortest paths are computed between all pairs of landmarks, not all sample nodes.

Parameters:
  • landmark (bool (optional, default False).) – If True, the shortest paths are computed between all pairs of landmarks, not all sample nodes.

  • indices (list of int (optional, default None).) – If None, the shortest paths are computed between all pairs of nodes. Else, the shortest paths are computed between all pairs of nodes and nodes with specified indices.

Returns:

D (scipy.sparse.csr_matrix) – The shortest paths matrix.
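
A short, hedged example of restricting the computation with the indices parameter described above (the indices are arbitrary):

```python
# Shortest paths involving nodes 0, 1 and 2 of the fitted graph
D_sub = kernel.shortest_paths(indices=[0, 1, 2])

# Full shortest-path matrix (also available through the SP property)
D_full = kernel.shortest_paths()
```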

get_indices_distances(n_neighbors=None, kernel=True)
_calculate_imputation_error(data, data_prev=None)

Calculates the difference in the data before and after imputation by diffusion. This is from [MAGIC](https://github.com/KrishnaswamyLab/MAGIC).

Parameters:
  • data (array-like) – current data matrix

  • data_prev (array-like, optional (default: None)) – previous data matrix. If None, data is simply prepared for comparison and no error is returned

Returns:

  • error (float) – Procrustes disparity value

  • data_curr (array-like) – transformed data to use for the next comparison

impute(Y=None, t=None, threshold=0.01, tmax=10)

Uses the diffusion operator computed from the graph built on X to impute the input data Y. Although the idea behind this is far older, it was first reported in single-cell genomics by the Krishnaswamy lab in the MAGIC (Markov Affinity-based Graph Imputation of Cells) [manuscript](https://www.cell.com/cell/abstract/S0092-8674(18)30724-4).

Parameters:
  • Y (np.ndarray (optional, default None).) – The input data to impute. If None, the input data X is imputed (if it was cached by setting the cache_input parameter to True). Otherwise, you’ll have to specify it as Y.

  • t (int (optional, default None).) – The number of steps to perform during diffusion. The default None iterates until the Procrustes disparity value is below the threshold parameter.

  • threshold (float (optional, default 0.01).) – The threshold value for the Procrustes disparity when finding an optimal t.

  • tmax (int (optional, default 10).) – The maximum number of steps to perform during diffusion when estimating an optimal t.

Returns:

Y_imp (np.ndarray) – The imputed data.
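
A hedged sketch of MAGIC-style imputation with this API, assuming `kernel` was fit on data X and `Y` is a signal matrix with one row per sample:

```python
# Impute Y by diffusing it over the graph built on X;
# t=None iterates until the Procrustes disparity falls below `threshold`
Y_imp = kernel.impute(Y, t=None, threshold=0.01, tmax=10)

# Alternatively, impute the cached input itself
# (requires cache_input=True when constructing the Kernel object):
# X_imp = kernel.impute()
```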

_get_landmarks(X, n_landmarks=None)
filter(signal, replicates=None, beta=50, target=None, filterfunc=None, offset=0, order=1, solver='chebyshev', chebyshev_order=100)

This is inspired by [MELD](https://github.com/KrishnaswamyLab/MELD) to estimate sample-associated densities. However, you can naturally use this for any signal in your data, not just samples of specific conditions. In practice, this is just a simple [PyGSP](https://pygsp.readthedocs.io/en/stable/reference/filters.html#module-pygsp.filters) filter on a graph. Indeed, it calls PyGSP, so you’ll need it installed to use this function. A usage sketch follows the parameter list below.

Parameters:
  • signal (array-like) – Signal(s) to filter - usually sample labels.

  • replicates (array-like, optional (default None)) – Replicate labels for each sample. If None, no replicates are assumed.

  • beta (int (optional, default 50).) – Amount of smoothing to apply. Vary this parameter to get good estimates - this can vary widely from dataset to dataset.

  • target (array-like (optional, default None).) – Similarity matrix to use for filtering. If None, uses the kernel matrix.

  • filterfunc (function (optional, default None).) – Function to use for filtering. If None, the default is to use a Laplacian filter.

  • offset (float (optional, default 0)) – Amount to shift the filter in the eigenvalue spectrum. Recommended to use an eigenvalue from the graph based on the spectral distribution. Should be in interval [0,1]

  • order (int (optional, default 1).) – Falloff and smoothness of the filter. High order leads to square-like filters.

  • solver (string (optional, default 'chebyshev').) – Method to solve convex problem. If ‘chebyshev’, uses a chebyshev polynomial approximation of the corresponding filter. Else, if ‘exact’, uses the eigenvalue solution to the problem

  • chebyshev_order (int (optional, default 100).) – Order of chebyshev approximation to use.
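
As referenced above, a hedged sketch of MELD-like sample-density estimation with this filter; PyGSP must be installed, and `sample_labels` is an assumed array of per-sample condition labels (not part of the original docs):

```python
# `kernel` is a fitted Kernel object; `sample_labels` is an assumed array
# of condition labels (e.g. 'treated' / 'control'), one per sample.
signal = (sample_labels == 'treated').astype(float)  # indicator signal for one condition

# Smooth the indicator over the graph; beta controls the amount of smoothing
# and typically needs tuning per dataset
density = kernel.filter(signal, beta=50, solver='chebyshev', chebyshev_order=100)
```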

_replicate_normalize_densities(replicates)
is_connected()

Check if the graph is connected (cached). A graph is connected if and only if there exists a (directed) path between any two vertices.

Returns:

connected (bool) – True if the graph is connected, False otherwise.

_connected_components(target=None)

Finds the connected components of the kernel matrix by default. Other matrices can be specified for use with the target parameter.

Parameters:

target (array-like (optional, default None).) – The target matrix to find the connected components of. If None, uses the kernel matrix.

Returns:

  • components (list of np.ndarray) – The connected components of the target matrix.

  • labels (list of int) – The labels of the connected components.

resistance_distance()

Uses the cached Laplacian matrix to compute the resistance distances. See Klein and Randic [manuscript](https://doi.org/10.1007%2FBF01164627) for details.

Returns:

rd (sparse matrix) – Resistance distance matrix

sparsify(epsilon=0.1, maxiter=10, random_state=None)

Sparsify a graph (with the Spielman-Srivastava method). This originally only called PyGSP but now also has some adaptations.

Parameters:
  • epsilon (float (optional, default 0.1).) – Sparsification parameter, which must be between 1/sqrt(N) and 1.

  • maxiter (int (optional, default 10).) – Maximum number of iterations.

  • random_state ({None, int, RandomState, Generator} (optional, default None)) – Seed for the random number generator (for reproducible sparsification).

Returns:

sparse_graph (sparse matrix) – Sparsified graph affinity matrix.

interpolate(f_subsampled, keep_inds, target=None, order=100, reg_eps=0.005)

Interpolate a graph signal.

Parameters:
  • f_subsampled (ndarray) – A graph signal on the graph G.

  • keep_inds (ndarray) – List of indices on which the signal is sampled.

  • target (array-like (optional, default None).) – Similarity matrix to use for interpolation. If None, uses the kernel matrix.

  • order (int) – Degree of the Chebyshev approximation (default = 100).

  • reg_eps (float) – The regularized graph Laplacian is \(\bar{L} = L + \epsilon I\). A smaller epsilon may lead to better regularization, but will also require a higher order Chebyshev approximation.

Returns:

signal_interpolated (ndarray) – Interpolated graph signal on the full vertex set of G.

write_pkl(wd=None, filename='mykernel.pkl')
topo.tpgraph.cknn_graph(X, n_neighbors=10, delta=1.0, metric='euclidean', weighted=False, include_self=False, return_densities=False, backend='nmslib', n_jobs=1, verbose=False, **kwargs)

Function-oriented implementation of Continuous k-Nearest-Neighbors (CkNN). An efficient implementation of [CkNN](https://arxiv.org/pdf/1606.02353.pdf). CkNN is the only unweighted graph construction that can be used to approximate the Laplace-Beltrami Operator via the unnormalized graph Laplacian. It can also be used to yield a weighted affinity matrix with locality-sensitive weights. A usage sketch follows the parameter list below.

Parameters:
  • n_neighbors (int (optional, default=10).) – Number of neighbors to compute. The actual number of k-nearest neighbors to be used in the CkNN normalization is half this value.

  • delta (float (optional, default=1.0).) – A parameter to decide the radius for each point. The combination radius increases in proportion to this parameter. This should be tuned.

  • metric (str (optional, default='euclidean').) – The distance metric to use. Accepts the same values as the metric parameter of scipy.spatial.distance.pdist.

  • weighted (bool (optional, default=False).) – If True, the CkNN graph is weighted (i.e. an affinity matrix). If False, the CkNN graph is unweighted (i.e. the proper adjacency matrix). If None, will return a tuple of the adjacency matrix (unweighted) and the affinity matrix (weighted).

  • return_densities (bool (optional, default=False).) – If True, will return the distance to the k-nearest-neighbor of each point.

  • include_self (bool (optional, default=False).) – All diagonal elements are 1.0 if this parameter is True.

  • backend (str, 'hnswlib', 'nmslib' or 'sklearn' (optional, default 'nmslib').) – Which backend to use to compute nearest-neighbors. Options for fast, approximate nearest-neighbors are ‘hnswlib’ and ‘nmslib’ (default). For exact nearest-neighbors, use ‘sklearn’. If using ‘nmslib’, a sparse csr_matrix input is expected; if using ‘hnswlib’ or ‘sklearn’, a dense array is expected. I strongly recommend using ‘hnswlib’ if handling somewhat dense, array-shaped data. If the data is relatively sparse, use ‘nmslib’, which operates on sparse matrices by default in TopOMetry and will automatically convert the input array to csr_matrix for performance.

  • n_jobs (int (optional, default 1).) – The number of jobs to use in the k-nearest-neighbors computation. Defaults to one (I highly recommend you use all available).

  • verbose (bool (optional, default False).) – If True, print progress messages.

  • kwargs (dict (optional, default {}).) – Additional parameters to pass to the k-nearest-neighbors backend.
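
As referenced above, a minimal usage sketch (not from the original docs), assuming the function returns the unweighted adjacency matrix when weighted=False and the affinity matrix when weighted=True, as described:

```python
import numpy as np
from topo.tpgraph import cknn_graph

X = np.random.default_rng(0).normal(size=(300, 20))  # arbitrary toy data

# Unweighted CkNN adjacency matrix (exact neighbors via scikit-learn)
A = cknn_graph(X, n_neighbors=10, delta=1.0, metric='euclidean',
               weighted=False, backend='sklearn')

# Weighted (affinity) version with locality-sensitive weights
W = cknn_graph(X, n_neighbors=10, delta=1.0, metric='euclidean',
               weighted=True, backend='sklearn')
```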

topo.tpgraph.fuzzy_simplicial_set(X, n_neighbors=15, metric='cosine', backend='nmslib', n_jobs=1, set_op_mix_ratio=1.0, local_connectivity=1.0, apply_set_operations=True, return_dists=False, verbose=False, **kwargs)

Given a set of data X, a neighborhood size, and a measure of distance compute the fuzzy simplicial set (here represented as a fuzzy graph in the form of a sparse matrix) associated to the data. This is done by locally approximating geodesic distance at each point, creating a fuzzy simplicial set for each such point, and then combining all the local fuzzy simplicial sets into a global one via a fuzzy union.

Originally implemented by Leland McInnes at https://github.com/lmcinnes/umap under the BSD 3-Clause License.

Parameters:
  • X (array of shape (n_samples, n_features).) – The data to be modelled as a fuzzy simplicial set.

  • n_neighbors (int.) – The number of neighbors to use to approximate geodesic distance. Larger numbers induce more global estimates of the manifold that can miss finer detail, while smaller values will focus on fine manifold structure to the detriment of the larger picture.

  • backend (str (optional, default 'nmslib').) – Which backend to use for neighborhood search. Options are ‘nmslib’, ‘hnswlib’, ‘pynndescent’,’annoy’, ‘faiss’ and ‘sklearn’.

  • metric (str (optional, default 'cosine').) – Distance metric to use. Defaults to ‘cosine’. Accepted metrics include: ‘sqeuclidean’, ‘euclidean’, ‘l1’, ‘lp’ (requires setting the parameter p; equivalent to the Minkowski distance), ‘cosine’, ‘angular’, ‘negdotprod’, ‘levenshtein’, ‘hamming’, ‘jaccard’ and ‘jansen-shan’.

  • n_jobs (int (optional, default 1).) – Number of threads to be used in computation of nearest neighbors. Set to -1 to use all available CPUs.

  • knn_indices (array of shape (n_samples, n_neighbors) (optional).) – If the k-nearest neighbors of each point has already been calculated you can pass them in here to save computation time. This should be an array with the indices of the k-nearest neighbors as a row for each data point. Ignored if metric is ‘precomputed’.

  • knn_dists (array of shape (n_samples, n_neighbors) (optional).) – If the k-nearest neighbors of each point has already been calculated you can pass them in here to save computation time. This should be an array with the distances of the k-nearest neighbors as a row for each data point. Ignored if metric is ‘precomputed’.

  • set_op_mix_ratio (float (optional, default 1.0).) – Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

  • local_connectivity (int (optional, default 1)) – The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

  • verbose (bool (optional, default False)) – Whether to report information on the current progress of the algorithm.

  • return_dists (bool or None (optional, default None)) – Whether to return the pairwise distance associated with each edge.

  • **kwargs (dict (optional, default {}).) – Additional parameters to be passed to the backend approximate nearest-neighbors library. Use only parameters known to the desired backend library.

Returns:

  • fuzzy_ss (coo_matrix) – A fuzzy simplicial set represented as a sparse matrix. The (i, j) entry of the matrix represents the membership strength of the 1-simplex between the ith and jth sample points.

  • sigmas (array of shape (n_samples,)) – The normalization factor derived from the metric tensor approximation.

  • rhos (array of shape (n_samples,)) – The distance to the 1st nearest neighbor for each point.
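
A hedged sketch of building a UMAP-style fuzzy graph with this function; per the Returns section, the result includes the fuzzy simplicial set as a sparse matrix together with the sigmas and rhos estimates, so the exact return structure below is an assumption:

```python
import numpy as np
from topo.tpgraph import fuzzy_simplicial_set

X = np.random.default_rng(0).normal(size=(300, 20))  # arbitrary toy data

result = fuzzy_simplicial_set(X, n_neighbors=15, metric='cosine',
                              backend='sklearn', return_dists=False)
# `result` is expected to contain the fuzzy graph (a sparse coo_matrix) and,
# depending on the flags, the per-sample sigmas and rhos described above.
```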

class topo.tpgraph.GeneralizedProcrustes(ref=None, tol=1e-07, n_iter=200, check_finite=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Generalized Procrustes Analysis in a scikit-learn flavor. Automatically tries to align the provided matrices by finding transformations that make them as similar as possible to each other. This is from the Procrustes [package](https://github.com/theochem/procrustes), available under the GPL v3.

Parameters:
  • ref (ndarray, optional) – The reference array to initialize the first iteration. If None, the first array in array_list will be used.

  • tol (float, optional) – Tolerance value to stop the iterations.

  • n_iter (int, optional) – Number of total iterations.

  • check_finite (bool, optional) – If True, convert the input to an array, checking for NaNs or Infs.

fit(array_list)

Fit the model with the given array_list.

Parameters:

array_list (List) – The list of 2D arrays to be transformed.

Returns:

self (object) – Returns the instance itself.

transform(array_list=None)

Returns a tuple of the aligned concatenated array and the error (distance) for the matching. Here only for scikit-learn consistency.

Parameters:

array_list (List) – The list of 2D arrays to be transformed.

Returns:

  • array_aligned (List) – A list of transformed arrays with generalized Procrustes analysis.

  • new_distance_gpa (float) – The distance for matching all the transformed arrays with generalized Procrustes analysis.

fit_transform(array_list)

Fit the model with the given array_list and returns a tuple of the aligned concatenated array and the error (distance) for the matching.

Returns:

  • array_aligned (List) – A list of transformed arrays with generalized Procrustes analysis.

  • new_distance_gpa (float) – The distance for matching all the transformed arrays with generalized Procrustes analysis.
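
A minimal sketch of aligning several embeddings of the same samples (the input matrices are arbitrary; the tuple unpacking follows the documented return values):

```python
import numpy as np
from topo.tpgraph import GeneralizedProcrustes

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Three noisy, randomly rotated copies of the same 2-D embedding
embeddings = [base @ np.linalg.qr(rng.normal(size=(2, 2)))[0]
              + 0.01 * rng.normal(size=(200, 2))
              for _ in range(3)]

gpa = GeneralizedProcrustes(tol=1e-7, n_iter=200)
aligned, disparity = gpa.fit_transform(embeddings)
```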

topo.tpgraph.fit_transform_procrustes(x, fit_transform_call, procrustes_batch_size=5000, procrustes_lm=1000)

Fit model and transform data for larger datasets. This is from GRAE (https://github.com/KevinMoonLab/GRAE). If the dataset has more than procrustes_batch_size samples, the eigendecomposition or projection is computed over mini-batches. To each batch, procrustes_lm samples (the same for all batches) are added and used to compute a Procrustes transform that roughly aligns all batches in a coherent manner.

Parameters:
  • x (np.array) – Data to be transformed

  • fit_transform_call (function) – Function to be called to fit and transform the data (scikit-learn style estimator).

  • procrustes_batch_size (int) – Number of samples in each Procrustes batch.

  • procrustes_lm (int) – Number of anchor points present in all batches. Used as a reference for the procrustes transform.

Returns:

x_transformed (np.array) – Embedding of x, which is the union of all batches aligned with Procrustes.
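
A hedged sketch using a scikit-learn estimator's fit_transform as the callable; PCA here is only an example of a scikit-learn style estimator, as described in the parameter list above:

```python
import numpy as np
from sklearn.decomposition import PCA
from topo.tpgraph import fit_transform_procrustes

x = np.random.default_rng(0).normal(size=(20000, 50))  # arbitrary large-ish dataset

pca = PCA(n_components=2)
x_transformed = fit_transform_procrustes(x, pca.fit_transform,
                                         procrustes_batch_size=5000,
                                         procrustes_lm=1000)
```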

class topo.tpgraph.IntrinsicDim(methods=['fsa', 'mle'], k=[10, 20, 50, 75, 100], backend='hnswlib', metric='euclidean', n_jobs=-1, plot=True, random_state=None, **kwargs)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Scikit-learn flavored class for estimating the intrinsic dimensionalities of high-dimensional data. This class iterates over a range of possible values of k-nearest-neighbors to consider in calculations using two different methods: the Farahmand-Szepesvári-Audibert (FSA) dimension estimator and the Maximum Likelihood Estimator (MLE).

Parameters:
  • methods (list of str, (default ['fsa', 'mle'])) – The dimensionality estimation methods to use. Current options are ‘fsa’ (Farahmand-Szepesvári-Audibert) and ‘mle’ (Maximum Likelihood Estimation).

  • k (int, range or list of ints, (default [10, 20, 50, 75, 100])) – The number of nearest neighbors to use for the dimensionality estimation methods. If a single value of k is provided, the result dictionary will have keys corresponding to the methods, and values corresponding to the dimensionality estimates. If multiple values of k are provided, the result dictionary will have keys corresponding to the values of k, and values corresponding to other dictionaries, which have keys corresponding to the methods, and values corresponding to the dimensionality estimates.

  • metric (str (default 'euclidean')) – The metric to use when calculating distance between instances in a feature array.

  • backend (str (optional, default 'hnswlib').) – Which backend to use for k-nearest-neighbor computations. Defaults to ‘hnswlib’. Options are ‘nmslib’, ‘hnswlib’, ‘faiss’, ‘annoy’ and ‘sklearn’.

  • n_jobs (int (optional, default -1).) – The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallelization (multiprocessing) is *highly* recommended whenever possible.

  • plot (bool (optional, default True).) – Whether to plot the results when using the fit() method.

  • random_state (int or numpy.random.RandomState() (optional, default None).) – A pseudo random number generator. Used for generating colors for plotting.

  • **kwargs (keyword arguments) – Additional keyword arguments to pass to the backend kNN estimator.

Properties

local_id, global_id : dictionaries containing local and global dimensionality estimates, respectively.

Their structure depends on the value of the k parameter:

  • If a single value of k is provided, the dictionaries will have keys corresponding to the methods, and values corresponding to the dimensionality estimates.

  • If multiple values of k are provided, the dictionaries will have keys corresponding to the values of k, and values corresponding to other dictionaries, which have keys corresponding to the methods, and values corresponding to the dimensionality estimates.

__repr__()

Return repr(self).

_parse_random_state()
_compute_id(X)
plot_id(bins=30, figsize=(6, 8), titlesize=22, labelsize=16, legendsize=10)
fit(X, **kwargs)

Estimates the intrinsic dimensionalities of the data.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix of k-nearest-neighbors, with its k higher or equal the highest value of k used to estimate the intrinsic dimensionalities.

  • **kwargs (keyword arguments) – Additional keyword arguments to pass to the plotting function IntrinsicDim.plot_id().

Returns:

  • Populates the local and global properties of the class.

  • Shows a plot of the results if plot=True.
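
A minimal usage sketch (not from the original docs); attribute names follow the Properties section above, and plotting is disabled to keep the example non-interactive:

```python
import numpy as np
from topo.tpgraph import IntrinsicDim

X = np.random.default_rng(0).normal(size=(1000, 30))  # arbitrary toy data

id_est = IntrinsicDim(methods=['fsa', 'mle'], k=[10, 20, 50], plot=False)
id_est.fit(X)

# Dictionaries of local and global estimates, keyed by k and then by method
local_estimates = id_est.local_id
global_estimates = id_est.global_id
```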

transform(X=None)

Does nothing. Here for compatibility with scikit-learn only.