topo

Subpackages

Submodules

Package Contents

Classes

TopOGraph

Main TopOMetry class for learning topological similarities, bases, graphs, and layouts from high-dimensional data.

Functions

read_pkl([wd, filename])

Attributes

_HAVE_SCANPY

__version__

topo._HAVE_SCANPY = True
class topo.TopOGraph(base_knn=30, graph_knn=30, n_eigs=100, base_kernel=None, base_kernel_version='bw_adaptive', eigenmap_method='DM', laplacian_type='normalized', projection_method='MAP', graph_kernel_version='bw_adaptive', base_metric='cosine', graph_metric='euclidean', alpha=1.0, diff_t=None, semi_aniso=False, delta=1.0, sigma=0.1, n_jobs=1, low_memory=False, eigen_tol=0.0001, eigensolver='arpack', backend='hnswlib', cache=True, verbosity=1, random_state=42)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Main TopOMetry class for learning topological similarities, bases, graphs, and layouts from high-dimensional data.

From data, learns topological similarity metrics, from these build orthonormal eigenbases and from these eigenbases learns new topological graphs. Users can choose different adaptive kernels to achieve these topological representations, which can approximate the Laplace-Beltrami Operator via their Laplacian or Diffusion operators.

The eigenbasis or their topological graphs can then be visualized with multiple existing layout optimization tools.

Parameters:
  • base_knn (int (optional, default 30).) – Number of k-nearest-neighbors to use when learning topological similarities.

  • graph_knn (int (optional, default 30).) – Similar to base_knn, but used to learning topological graphs from the orthogonal bases.

  • n_eigs (int (optional, default 100).) – Number of eigenpairs to compute.

  • base_kernel (topo.tpgraph.Kernel (optional, default None).) – An optional Kernel object already fitted with the data. If available, the original input data X is not required.

  • eigenmap_method (str (optional, default 'DM').) –

    Which eigenmap method to use. Defaults to ‘DM’, which is the diffusion maps method and will use the diffusion operator learned from the used kernel. Options include: * ‘msDM’ - uses the diffusion operator learned from the used kernel (TopOGraph.base_kernel.P) using the multiscale version of Diffusion Maps version,

    which accounts for all possible diffusion timescales. Finds top eigenpairs (with highest eigenvalues).

    • ’DM’ - uses the diffusion operator learned from the used kernel (TopOGraph.base_kernel.P). Finds top eigenpairs (with highest eigenvalues).

    • ’LE’ - uses the graph Laplacian learned from the used kernel (TopOGraph.kernel.L). Finds bottom eigenpairs (with lowest eigenvalues).

    • ’top’ - uses the kernel matrix (TopOGraph.kernel.K) as the affinity matrix. Finds top eigenpairs (with highest eigenvalues).

    • ’bottom’ - uses the kernel matrix (TopOGraph.kernel.K) as the affinity matrix. Finds bottom eigenpairs (with lowest eigenvalues).

  • alpha (int or float (optional, default 1).) – Anisotropy. Alpha in the diffusion maps literature. Controls how much the results are biased by data distribution. Defaults to 1, which unbiases results from data underlying sampling distribution.

  • laplacian_type (str (optional, default 'random_walk').) – Which Laplacian to use by default with the kernel. Defaults to ‘random_walk’, but ‘normalized’ is the most common. Options include ‘unnormalized’ (also referred to as the combinatorial Laplacian), ‘normalized’, ‘random_walk’ and ‘geometric’.

  • base_kernel_version (str (optional, default 'bw_adaptive')) – Which method to use for learning affinities to use in the eigenmap strategy. Defaults to ‘bw_adaptive’, which employs an adaptive bandwidth. There are several other options available to learn affinities, including: * ‘fuzzy’ - uses fuzzy simplicial sets as per [UMAP](). It is grounded on solid theory. * ‘cknn’ - uses continuous k-nearest-neighbors as per [Berry et al.](). As ‘fuzzy’, it is grounded on solid theory, and is guaranteed to approaximate the Laplace-Beltrami Operator on the underlying manifold. * ‘bw_adaptive’ - uses an adaptive bandwidth without adaptive exponentiation. It is a locally-adaptive kernel similar to that proposed by [Nadler et al.](https://doi.org/10.1016/j.acha.2005.07.004) and implemented in [Setty et al.](https://doi.org/10.1038/s41587-019-0068-4). * ‘bw_adaptive_alpha_decaying’ - uses an adaptive bandwidth and an adaptive decaying exponentiation (alpha). This is the default. * ‘bw_adaptive_nbr_expansion’ - first uses a simple adaptive bandwidth and then uses it to learn an adaptive number of neighbors to expand neighborhood search to learn a second adaptive bandwidth kernel. The neighborhood expansion can impact runtime, although this is not usually expressive for datasets under 10e6 samples. The neighborhood expansion was proposed in the TopOMetry paper. * ‘bw_adaptive_alpha_decaying_nbr_expansion’ - first uses a simple adaptive bandwidth and then uses it to learn an adaptive number of neighbors to expand neighborhood search to learn a second adaptive bandwidth kernel. The neighborhood expansion was proposed in the TopOMetry paper. * ‘gaussian’ - uses a fixed Gaussian kernel with a fixed bandwidth. This is the most naive approach.

  • graph_kernel_version (str (optional, default 'bw_adaptive_alpha_decaying')) – Which method to use for learning affinities from the eigenbasis. Same as base_kernel_version, but for the graph construction after learning an eigenbasis.

  • backend (str (optional, default 'nmslib').) – Which backend to use for neighborhood search. Options are ‘nmslib’, ‘hnswlib’, ‘pynndescent’,’annoy’, ‘faiss’ and ‘sklearn’. By default it will check what you have available, in this order.

  • base_metric (str (optional, default 'cosine')) –

    Distance metric for building an approximate kNN graph during topological basis construction. Defaults to ‘cosine’. When using scaled data (zero mean and unit variance) the cosine similarity metric is highly recommended. The ‘hamming’ and ‘jaccard’ distances are also available for string vectors.

    NOTE: not all k-nearest-neighbors backends have the same metrics available! The algorithm will expect you to input a metric compatible with your backend. Example of accepted metrics in NMSLib(*), HNSWlib(**) and sklearn(***):

    • ’sqeuclidean’ (, *)

    • ’euclidean’ (, *)

    • ’l1’ (*)

    • ’lp’ - requires setting the parameter p (*) - similar to Minkowski

    • ’cosine’ (, *)

    • ’inner_product’ (**)

    • ’angular’ (*)

    • ’negdotprod’ (*)

    • ’levenshtein’ (*)

    • ’hamming’ (*)

    • ’jaccard’ (*)

    • ’jansen-shan’ (*)

  • graph_metric (str (optional, default 'euclidean').) – Similar to base_metric, but used for building a new topological graph on top of the learned orthonormal eigenbasis.

  • sigma (float (optional, default None).) – Scaling factor if using fixed bandwidth kernel (only used if base_kernel_version or graph_kernel_version is set to gaussian).

  • delta (float (optional, default 1.0).) – A parameter of the ‘cknn’ kernel version. It is ignored if graph_kernel_version or base_kernel_version are not set to ‘cknn’. It is used to decide the local radius in the continuous kNN algorithm. The combination radius increases in proportion to this parameter.

  • low_memory (bool (optional, default False).) –

    Whether to use a low-memory implementation of the algorithm. Everything will quite much run the same way, but the TopOGraph object will not store the full kernel matrices obtained at each point of the algorithm. A few take-away points:

    • If you are running a single model, you should always keep this set this to False unless you have very little memory available.

    • This parameter is particularly useful when analysing very large datasets, and for cross-validation of algorithmic combinations

    (e.g. different base_kernel_version and graph_kernel_version) when using the TopOGraph.run_models_layouts() function. This is because the kernel matrices are stored in memory, and can be quite large. After defining the best algorithmic combination, recomputing only it costs a fraction of the time and memory.

n_jobsint (optional, default 1).

Number of threads to use in neighborhood calculations. Set this to as much as possible for speed. Setting to -1 uses all available threads.

eigensolverstring (optional, default ‘arpack’).

Method for computing the eigendecomposition. Can be either ‘arpack’, ‘lobpcg’, ‘amg’ or ‘dense’. * ‘dense’ :

use standard dense matrix operations for the eigenvalue decomposition. For this method, M must be an array or matrix type. This method should be avoided for large problems.

  • ‘arpack’ :

    use arnoldi iteration in shift-invert mode. For this method, M may be a dense matrix, sparse matrix, or general linear operator.

  • ‘lobpcg’ :

    Locally Optimal Block Preconditioned Conjugate Gradient Method. A preconditioned eigensolver for large symmetric positive definite (SPD) generalized eigenproblems.

  • ‘amg’ :

    Algebraic Multigrid solver (requires pyamg to be installed) It can be faster on very large, sparse problems, but requires setting a random seed for better reproducibility.

eigen_tolfloat (optional, default 0.0).

Error tolerance for the eigenvalue solver. If 0, machine precision is used.

diff_tint (optional, default None).

Time parameter for the diffusion operator, if eigenmap_method is ‘DM’. Also works with ‘eigenmap_method’ being ‘LE’. The diffusion operator or the graph Laplacian will be powered by t steps (a diffusion process). Ignored for other methods. If None, the number of k-nearest neighbors is used.

semi_anisobool (optional, default False).

Whether to use semi-anisotropic diffusion, if eigenmap_method is ‘DM’. This reweights the original kernel (not the renormalized kernel) by the renormalized degree.

projection_methodstr (optional, default ‘MAP’).

Which projection method to use. Only ‘Isomap’, ‘t-SNE’ and ‘MAP’ are implemented out of the box without need to import packages. ‘t-SNE’ uses scikit-learn if the user does not have multicore-tsne and ‘MAP’ relies on code that is adapted from UMAP for efficiency. Current options are:

These are frankly quite direct to add, so feel free to make a feature request if your favorite method is not listed here.

verbosityint (optional, default 1).

Controls verbosity. 0 for no verbosity, 1 for minimal (prints warnings and runtimes of major steps), 2 for medium (also prints layout optimization messages) and 3 for full (down to neighborhood search, useful for debugging).

cachebool (optional, default True).

Whether to cache kernel and eigendecomposition classes in the results dictionaries. Set to False to avoid unnecessary duplications if you’re using a single model with large data.

random_stateint or numpy.random.RandomState() (optional, default None).

A pseudo random number generator. Used in eigendecomposition when eigensolver is ‘amg’,

Variables:
  • BaseKernelDict (dict) – A dictionary of base kernel classes, with keys being the kernel version and values being the kernel class.

  • EigenbasisDict (dict) – A dictionary of eigendecomposition classes, with keys referring to the kernel version and eigenmap method and values being the eigendecomposition class.

  • GraphKernelDict (dict) – A dictionary of graph kernel classes, with keys referring to the base kernel, eigendecomposition method, and graph kernel version used, and and values being the kernel class.

  • self.SpecLayout (np.ndarray (n_samples, n_components)) – The spectral layout of the data.

  • self.ProjectionDict (dict) – A dictionary of graph projection coordinates, with keys referring to the projection method and values being the projection class.

  • ClustersDict (dict) – A dictionary of cluster labels, with keys referring to the eigenbasis used and values being the cluster labels.

  • current_basekernel (topo.tpgraph.Kernel object) – The current base kernel class.

  • current_eigenbasis (topo.eigen.EigenDecomposition object) – The current eigendecomposition class.

  • current_graphkernel (topo.tpgraph.Kernel object) – The current graph kernel class (Kernel class fitted on the eigendecomposition results).

  • global_dimensionality (float or None) – The global dimensionality of the data, if it has been estimated.

  • local_dimensionality (np.ndarray (n_samples,)) – The local pairwise dimensionality of the data, if it has been estimated.

__repr__()

Return repr(self).

_parse_backend()
_parse_random_state()
fit(X=None, **kwargs)

Learn distances, computes similarities with various kernels and applies eigendecompositions to them or their Laplacian or diffusion operators. The learned operators approximate the Laplace-Beltrami Operator and learn the topology of the underlying manifold. The eigenbasis learned from such eigendecomposition represents the data in a lower-dimensional space (with up to some hundreds dimensions), where the euclidean distances between points are approximated by the geodesic distances on the manifold. Because the eigenbasis is equivalent to a Fourier basis, one needs at least n+1 to eigenvectors to represent a dataset that can be optimically divided into n clusters.

Parameters:
  • X (High-dimensional data matrix.) – Currently, supports only data from similar type (i.e. all bool, all float). Not required if the base_kernel parameter is specified and corresponds to a fitted topo.tpgraph.Kernel() object.

  • **kwargs (Additional parameters to be passed to the topo.base.ann.kNN() function.) –

Returns:

  • TopoGraph object with a populated TopOGraph.EigenbasisDict dictionary. The keys of the dictionary are the names of the eigendecompositions

  • that were performed. The values are the corresponding topo.base.eigen.Eigenbasis() objects. The latest set eigenbasis is

  • also accessible through the TopOGraph.eigenbasis attribute.

fit_transform(X=None)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

eigenspectrum(eigenbasis_key=None, **kwargs)

Visualize the eigenspectrum decay. Corresponds to a scree plot of information entropy. Useful to indirectly estimate the intrinsic dimensionality of a dataset.

Parameters:

eigenbasis_key (str (optional, default None).) – If None, will use the default eigenbasis at TopOGraph.eigenbasis. Otherwise, uses the specified eigenbasis.

Returns:

A nice eigenspectrum decay plot (‘scree plot’).

plot_eigenspectrum(eigenbasis_key=None, **kwargs)

An anlias for TopOGraph.eigenspectrum. Visualize the eigenspectrum decay. Corresponds to a scree plot of information entropy. Useful to indirectly estimate the intrinsic dimensionality of a dataset.

Parameters:

eigenbasis_key (str (optional, default None).) – If None, will use the default eigenbasis at TopOGraph.eigenbasis. Otherwise, uses the specified eigenbasis.

Returns:

A nice eigenspectrum decay plot (‘scree plot’).

list_eigenbases()

List the eigenbases in the TopOGraph.EigenbasisDict.

Returns:

A list of eigenbasis keys in the TopOGraph.EigenbasisDict.

transform(X=None, **kwargs)

Learns new affinity, topological operators from chosen eigenbasis. This does not strictly follow scikit-learn’s API, no new data can be transformed with this method. Instead, this method is used to learn new topological operators on top of the learned eigenbasis, analagous to spectral clustering methods.

Parameters:

X (array-like, shape (n_samples, n_features), optional) – New data to transform. This is not used in this method, but is included for compatibility with scikit-learn’s API.

Returns:

scipy.sparse.csr.csr_matrix, containing the similarity matrix that encodes the topological graph.

spectral_layout(graph=None, n_components=2)

Performs a multicomponent spectral layout of the data and the target similarity matrix.

Parameters:
  • graph (str indicating a TopOGraph.EigenbasisDict key or or array-like, optional.) – Graph kernel to use for the spectral layout. Defaults to the active graph kernel (TopOGraph.graph_kernel).

  • n_components (int (optional, default 2).) – Number of dimensions to embed into.

Returns:

np.ndarray containing the resulting embedding.

project(n_components=2, init=None, projection_method=None, landmarks=None, landmark_method='kmeans', n_neighbors=None, num_iters=500, **kwargs)

Projects the data into a lower dimensional space using the specified projection method. Calls topo.layout.Projector().

Parameters:
  • n_components (int (optional, default 2).) – Number of dimensions to optimize the layout to. Usually 2 or 3 if you’re into visualizing data.

  • init (str or np.ndarray (optional, default None).) – If passed as str, will use the specified layout as initialization. If passed as np.ndarray, will use the array as initialization.

  • projection_method (str (optional, default 'MAP').) – Which projection method to use. Only ‘Isomap’, ‘t-SNE’ and ‘MAP’ are implemented out of the box. ‘t-SNE’ uses scikit-learn implementation and ‘MAP’ relies on code that is adapted from UMAP. Current options are: * ‘Isomap’ * ‘t-SNE’ * ‘MAP’ * ‘UMAP’ * ‘PaCMAP’ * ‘TriMAP’ * ‘IsomorphicMDE’ - MDE with preservation of nearest neighbors * ‘IsometricMDE’ - MDE with preservation of pairwise distances * ‘NCVis’ These are frankly quite direct to add, so feel free to make a feature request if your favorite method is not listed here.

  • landmarks (int or np.ndarray (optional, default None).) – If passed as int, will obtain the number of landmarks. If passed as np.ndarray, will use the specified indexes in the array. Any value other than None will result in only the specified landmarks being used in the layout optimization, and will populate the Projector.landmarks_ slot.

  • landmark_method (str (optional, default 'kmeans').) – The method to use for selecting landmarks. If landmarks is passed as an int, this will be used to select the landmarks. Can be either ‘kmeans’ or ‘random’.

  • num_iters (int (optional, default 1000).) – Most (if not all) methods optimize the layout up to a limit number of iterations. Use this parameter to set this number.

  • **kwargs (dict (optional).) – Additional keyword arguments to pass to the projection method during fit.

Returns:

np.ndarray containing the resulting embedding. Also stores it in the TopOGraph.ProjectionDict slot.

run_models(X, kernels=['fuzzy', 'cknn', 'bw_adaptive'], eigenmap_methods=['DM', 'LE', 'top'], projections=['Isomap', 'MAP'])

Power function that runs all models in TopOMetry. It iterates through the specified kernel versions, eigenmap methods and projection methods. As expected, it can take a while to run, depending on how many kernels and methods you specify. At least one kernel, one eigenmap method and one projection must be specified.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – The input data. Use a sparse matrix for efficiency.

  • kernels (list of str (optional, default ['fuzzy', 'cknn', 'bw_adaptive']).) – List of kernel versions to run. These will be used to learn an eigenbasis and to learn a new graph kernel from it. Options are: * ‘fuzzy’ * ‘cknn’ * ‘bw_adaptive’ * ‘bw_adaptive_alpha_decaying’ * ‘bw_adaptive_nbr_expansion’ * ‘bw_adaptive_alpha_decaying_nbr_expansion’ * ‘gaussian’ Will not run all by default to avoid long waiting times in reckless calls.

  • eigenmap_methods (list of str (optional, default ['DM', 'LE', 'top']).) – List of eigenmap methods to run. Options are: * ‘DM’ * ‘LE’ * ‘top’ * ‘bottom’

  • projections (list of str (optional, default ['Isomap', 'MAP']).) – List of projection methods to run. Options are the same of the topo.layouts.Projector() object: * [‘(L)Isomap’]() - one of the first manifold learning methods * [‘t-SNE’](https://github.com/DmitryUlyanov/Multicore-TSNE) - a classic manifold learning method * ‘MAP’- a lighter [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html) with looser assumptions * [‘UMAP’](https://umap-learn.readthedocs.io/en/latest/index.html) * [‘PaCMAP’](http://jmlr.org/papers/v22/20-1061.html) (Pairwise-controlled Manifold Approximation and Projection) - for balanced visualizations * [‘TriMAP’](https://github.com/eamid/trimap) - dimensionality reduction using triplets * ‘IsomorphicMDE’ - [MDE](https://github.com/cvxgrp/pymde) with preservation of nearest neighbors * ‘IsometricMDE’ - [MDE](https://github.com/cvxgrp/pymde) with preservation of pairwise distances * [‘NCVis’](https://github.com/stat-ml/ncvis) (Noise Contrastive Visualization) - a UMAP-like method with blazing fast performance

Returns:

A TopOGraph object with results stored at different slots. –

  • Base kernel results are stored at TopOGraph.BaseKernelDict.

  • Eigenbases results are stored at TopOGraph.EigenbasisDict.

  • Graph kernel results learned from eigenbases are stored at TopOGraph.GraphKernelDict.

  • Projection results are stored at TopOGraph.ProjectionDict.

write_pkl(filename='topograph.pkl', remove_base_class=True)
_compute_kernel_from_version_knn(knn, n_neighbors, kernel_version, results_dict, prefix='', suffix='', low_memory=False, base=True, data_for_expansion=None)
eval_models_layouts(X, landmarks=None, kernels=['cknn', 'bw_adaptive'], eigenmap_methods=['msDM', 'DM', 'LE'], projections=['MAP'], additional_eigenbases=None, additional_projections=None, landmark_method='random', n_neighbors=5, n_jobs=-1, cor_method='spearman', **kwargs)

Evaluates all orthogonal bases, topological graphs and layouts in the TopOGraph object. Compares results with PCA and PCA-derived layouts (i.e. t-SNE, UMAP etc).

Parameters:
  • X (data matrix. Expects either numpy.ndarray or scipy.sparse.csr_matrix.) –

  • landmarks (optional (int, default None).) – If specified, subsamples the TopOGraph object and/or data matrix X to a number of landmark samples before computing results and scores. Useful if dealing with large datasets (>30,000 samples).

  • kernels (list of str (optional, default ['fuzzy', 'cknn', 'bw_adaptive_alpha_decaying']).) – List of kernel versions to run and evaluate. These will be used to learn an eigenbasis and to learn a new graph kernel from it. Options are: * ‘fuzzy’ * ‘cknn’ * ‘bw_adaptive’ * ‘bw_adaptive_alpha_decaying’ * ‘bw_adaptive_nbr_expansion’ * ‘bw_adaptive_alpha_decaying_nbr_expansion’ * ‘gaussian’ Will not run all by default to avoid long waiting times in reckless calls.

  • eigenmap_methods (list of str (optional, default ['msDM', 'DM', 'LE']).) – List of eigenmap methods to run and evaluate. Options are: * ‘msDM’ - multiscale diffusion maps * ‘DM’ - diffusion maps * ‘LE’ - Laplacian eigenmaps * ‘top’ - top eingenfunctions (with largest eigenvalues) * ‘bottom’ - bottom eigenfunctions (with smallest eigenvalues)

  • projections (list of str (optional, default ['Isomap', 'MAP']).) – List of projection methods to run and evaluate. Options are the same of the topo.layouts.Projector() object: * ‘(L)Isomap’ * ‘t-SNE’ * ‘MAP’ * ‘UMAP’ * ‘PaCMAP’ * ‘TriMAP’ * ‘IsomorphicMDE’ - MDE with preservation of nearest neighbors * ‘IsometricMDE’ - MDE with preservation of pairwise distances * ‘NCVis’

  • additional_eigenbases (dict (optional, default None).) – Dictionary containing named additional eigenbases (e.g. factor analysis, VAEs, ICA, etc) to be evaluated.

  • additional_projections (dict (optional, default None).) – Dictionary containing named additional projections (e.g. t-SNE, UMAP, etc) to be evaluated.

  • n_neighbors (int (optional, default 5).) – Number of nearest neighbors to use for the kNN graph.

  • n_jobs (int (optional, default -1).) – Number of jobs to use for parallelization. If -1, uses all available cores.

  • cor_method (str (optional, default 'spearman').) – Correlation method to use for local scores. Options are ‘spearman’ and ‘kendall’.

  • landmark_method (str (optional, default 'random').) – Method to use for landmark selection. Options are ‘random’ and ‘kmeans’.

  • kwargs (dict (optional, default {}).) – Additional keyword arguments to pass to the topo.base.ann.kNN() function.

Returns:

Populates the TopOGraph object and returns a dictionary of dictionaries with the results

topo.read_pkl(wd=None, filename='topograph.pkl')
topo.__version__ = '0.2.0.9.1'