topo.tpgraph.kernels

Module Contents

Classes

Kernel

Scikit-learn flavored class for computing a kernel matrix from a set of points. Includes functions

Functions

_adap_bw(K, n_neighbors)

compute_kernel(X[, metric, n_neighbors, fuzzy, cknn, ...])

Compute a kernel matrix from a set of points.

topo.tpgraph.kernels._adap_bw(K, n_neighbors)
topo.tpgraph.kernels.compute_kernel(X, metric='cosine', n_neighbors=10, fuzzy=False, cknn=False, delta=1.0, pairwise=False, sigma=None, adaptive_bw=True, expand_nbr_search=False, alpha_decaying=False, return_densities=False, symmetrize=True, backend='nmslib', n_jobs=-1, verbose=False, **kwargs)

Compute a kernel matrix from a set of points.

Parameters:
  • X (array-like, shape (n_samples, n_features).) – The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix.

  • metric (string (optional, default 'cosine').) – The metric to use when computing the kernel matrix. Possible values are: ‘cosine’, ‘euclidean’ and others. Accepts precomputed distances.

  • n_neighbors (int (optional, default 10).) – The number of neighbors to use when computing the kernel matrix. Ignored if pairwise set to True.

  • fuzzy (bool (optional, default False).) – Whether to build the kernel matrix using fuzzy simplicial sets, similarly to UMAP. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored. If set to True at the same time that cknn is set to True, the cknn parameter is ignored.

  • cknn (bool (optional, default False).) – Whether to build the adjacency and affinity matrices using continuous k-nearest-neighbors. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored.

  • delta (float (optional, default 1.0).) – The scaling factor for the CkNN kernel. Ignored if cknn set to False.

  • pairwise (bool (optional, default False).) – Whether to compute the kernel using dense pairwise distances. If set to True, the n_neighbors and backend parameters are ignored. Uses numba for computations if available. If not, uses sklearn.

  • sigma (float (optional, default None).) – Scaling factor if using fixed bandwidth kernel (only used if adaptive_bw is set to False).

  • adaptive_bw (bool (optional, default True).) – Whether to use an adaptive bandwidth based on the distance to median k-nearest-neighbor.

  • expand_nbr_search (bool (optional, default False).) – Whether to expand the neighborhood search (mitigates a choice of too small a number of k-neighbors).

  • alpha_decaying (bool (optional, default False).) – Whether to use an adaptively decaying kernel.

  • return_densities (bool (optional, default False).) – Whether to return the bandwidth metrics as a dictinary. If set to True, the function returns a tuple containing the kernel matrix and a dictionary containing the bandwidth metric.

  • symmetrize (bool (optional, default True).) – Whether to symmetrize the kernel matrix after normalizations.

  • backend (str (optional, default 'nmslib').) – Which backend to use for neighborhood computations. Defaults to ‘nmslib’. Options are ‘nmslib’, ‘hnswlib’, ‘faiss’, ‘annoy’ and ‘sklearn’.

  • n_jobs (int (optional, default 1).) – The number of jobs to use for parallel computations. If set to -1, all available cores are used.

  • verbose (bool (optional, default False).) – Whether to print progress messages.

  • **kwargs (dict, optional) – Additional arguments to be passed to the nearest-neighbors backend.

Returns:

  • K (array-like, shape (n_samples, n_samples)) – The kernel matrix.

  • densities (dict, optional (if return_densities is set to True)) – If fuzzy and cknn are False, is a dictionary containing the bandwidth metrics. If fuzzy is set to True, the dictionary contains sigma and rho estimates. If cknn is set to True, the dictionary contains the bandwidth metric.

class topo.tpgraph.kernels.Kernel(metric='cosine', n_neighbors=10, fuzzy=False, cknn=False, pairwise=False, sigma=None, adaptive_bw=True, expand_nbr_search=False, alpha_decaying=False, symmetrize=True, backend='nmslib', n_jobs=1, laplacian_type='normalized', anisotropy=1, semi_aniso=False, n_landmarks=None, cache_input=False, verbose=False, random_state=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Scikit-learn flavored class for computing a kernel matrix from a set of points. Includes functions for computing the kernel matrix with a variety of methods (adaptive bandwidth, fuzzy simplicial sets, continuous k-nearest-neighbors, etc) and performing operations on the resulting graph, such as obtaining its Laplacian, sparsifying it, filtering and interpolating signals, obtaining diffusion operators, imputing missing values and computing shortest paths.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix.

  • metric (string (optional, default 'cosine')) – The metric to use when computing the kernel matrix. Possible values are: ‘cosine’, ‘euclidean’ and others, depending on the chosen nearest-neighbors backend. Accepts precomputed distances as ‘precomputed’.

  • n_neighbors (int (optional, default 10).) – The number of neighbors to use when computing the kernel matrix. Ignored if pairwise set to True.

  • fuzzy (bool, optional (default False).) – Whether to build the kernel matrix using fuzzy simplicial sets, similarly to UMAP. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored. If set to True at the same time that cknn is set to True, the cknn parameter is ignored.

  • cknn (bool, optional (default False).) – Whether to build the adjacency and affinity matrices using continuous k-nearest-neighbors. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored. If set to True, laplacian_type is automatically set to ‘unnormalized’.

  • pairwise (bool (optional, default False).) – Whether to compute the kernel using dense pairwise distances. If set to True, the n_neighbors and backend parameters are ignored. Uses numba for computations if available. If not, uses sklearn.

  • sigma (float (optional, default None)) – Scaling factor if using fixed bandwidth kernel (only used if adaptive_bw is set to False).

  • adaptive_bw (bool (optional, default True).) – Whether to use an adaptive bandwidth based on the distance to median k-nearest-neighbor.

  • expand_nbr_search (bool (optional, default False).) – Whether to expand the neighborhood search (mitigates a choice of too small a number of k-neighbors).

  • alpha_decaying (bool (optional, default False).) – Whether to use an adaptively decaying kernel.

  • laplacian_type (str (optional, default 'normalized').) – The type of Laplacian to compute. Possible values are: ‘normalized’, ‘unnormalized’, ‘random_walk’ and ‘geometric’.

  • anisotropy (float (optional, default 0).) – The anisotropy (alpha) parameter in the diffusion maps literature for kernel reweighting.

  • semi_aniso (bool (optional, default False).) – Whether to use semi-anisotropic diffusion. This reweights the original kernel (not the renormalized kernel) by the renormalized degree.

  • symmetrize (bool (optional, default True).) – Whether to symmetrize the kernel matrix after normalizations.

  • backend (str (optional, default 'nmslib').) – Which backend to use for k-nearest-neighbor computations. Defaults to ‘nmslib’. Options are ‘nmslib’, ‘hnswlib’, ‘faiss’, ‘annoy’ and ‘sklearn’.

  • n_jobs (int (optional, default 1).) – The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallellization (multiprocessing) is *highly* recommended whenever possible.

  • laplacian_type – The type of laplacian to use. Can be ‘unnormalized’, ‘normalized’, or ‘random_walk’.

Properties

knnscipy.sparse.csr_matrix, shape (n_samples, n_samples)

The computed k-nearest-neighbors graph.

Ascipy.sparse.csr_matrix, shape (n_samples, n_samples)

The adjacency matrix of the kernel graph.

Kscipy.sparse.csr_matrix, shape (n_samples, n_samples)

The kernel matrix.

dens_dictdict

Dictionary containing the density information for each point in the dataset for the computed kernel.

Lscipy.sparse.csr_matrix, shape (n_samples, n_samples)

The laplacian matrix of the kernel graph.

Pscipy.sparse.csr_matrix, shape (n_samples, n_samples)

The diffusion operator of the kernel graph.

SPscipy.sparse.csr_matrix, shape (n_samples, n_samples)

The shortest-path matrix.

degreenp.ndarray, shape (n_samples,)

The degree of each point in the adjacency graph.

weighted_degreenp.ndarray, shape (n_samples,)

The weighted degree of each point in the kernel graph.

property knn

Returns the k-nearest-neighbors graph.

property K

Kernel matrix.

property A

Graph adjacency matrix.

property degree

Returns the degree of the adjacency matrix.

property weighted_degree

Returns the degree of the weighted affinity matrix.

property L

Object synonym for the laplacian() function.

property P

Object synonym for the diff_op() function.

property SP

Object synonym for the shortest_paths() function.

__repr__()

Return repr(self).

_parse_backend()
fit(X, recompute=False, **kwargs)

Fits the kernel matrix to the data.

Parameters:
  • X (array-like, shape (n_samples, n_features).) – Input data. Takes in numpy arrays and scipy csr sparse matrices. Use with sparse data for top performance. You can adjust a series of parameters that can make the process faster and more informational depending on your dataset.

  • recompute (bool (optional, default False).) – Whether to recompute the kernel if it has already been computed.

  • **kwargs (dict, optional) – Additional arguments to be passed to the k-nearest-neighbor backend.

Returns:

The kernel matrix.

transform(X=None)

Returns the kernel matrix. Here for compability with scikit-learn only.

fit_transform(X, y=None)

Fits the kernel matrix to the data and returns the kernel matrix. Here for compability with scikit-learn only.

adjacency()

Graph adjacency matrix (the binary version of W).

The adjacency matrix defines which edges exist on the graph. It is represented as an N-by-N matrix of booleans. \(A_{i,j}\) is True if \(W_{i,j} > 0\).

laplacian(laplacian_type=None)

Compute the graph Laplacian, given a adjacency or affinity graph W. For a friendly reference, see this material from James Melville: https://jlmelville.github.io/smallvis/spectral.html

Parameters:

laplacian_type (str (optional, default None).) – The type of laplacian to use. Can be ‘unnormalized’, ‘normalized’, or ‘random_walk’. If not provided, will use the default laplacian type specified in the constructor.

Returns:

L (scipy.sparse.csr_matrix) – The graph Laplacian.

diff_op(anisotropy=1.0, symmetric=False)

Computes the [diffusion operator](https://doi.org/10.1016/j.acha.2006.04.006).

Parameters:
  • anisotropy (float (optional, default None).) – Anisotropy to apply. ‘Alpha’ in the diffusion maps literature. Defaults to the anisotropy defined in the constructor.

  • symmetric (bool (optional, default False).) – Whether to use a symmetric version of the diffusion operator. This is particularly useful to yield a symmetric operator when using anisotropy (alpha > 1), as the diffusion operator P would be assymetric otherwise, which can be problematic during matrix decomposition. Eigenvalues are the same of the assymetric version, and the eigenvectors of the original assymetric operator can be obtained by left multiplying by Kernel.D_inv_sqrt_.

Returns:

  • P (scipy.sparse.csr_matrix) – The graph diffusion operator.

  • Populates the Kernel.D_inv_sqrt_ slot.

shortest_paths(landmark=False, indices=None)

Compute the shortest paths between all pairs of nodes. If landmark is True, the shortest paths are computed between all pairs of landmarks, not all sample nodes.

Parameters:
  • landmark (bool (optional, default False).) – If True, the shortest paths are computed between all pairs of landmarks, not all sample nodes.

  • indices (list of int (optional, default None).) – If None, the shortest paths are computed between all pairs of nodes. Else, the shortest paths are computed between all pairs of nodes and nodes with specified indices.

Returns:

D (scipy.sparse.csr_matrix) – The shortest paths matrix.

get_indices_distances(n_neighbors=None, kernel=True)
_calculate_imputation_error(data, data_prev=None)

Calculates difference before and after imputation by diffusion. This is from [MAGIC](https://github.com/KrishnaswamyLab/MAGIC).

Parameters:
  • data (array-like) – current data matrix

  • data_prev (array-like, optional (default: None)) – previous data matrix. If None, data is simply prepared for comparison and no error is returned

Returns:

  • error (float) – Procrustes disparity value

  • data_curr (array-like) – transformed data to use for the next comparison

impute(Y=None, t=None, threshold=0.01, tmax=10)

Uses the diffusion operator computed from graph build on X to impute the input data Y. Although the idea behind this is far older, it was first reported in single-cell genomics by the Krishnaswamy lab in the MAGIC (Markov Affinity-based Graph Imputation of Cells) [manuscript](https://www.cell.com/cell/abstract/S0092-8674(18)30724-4)

Parameters:
  • Y (np.ndarray (optional, default None).) – The input data to impute. If None, the input data X is imputed (if it was cached by setting the cache_input parameter to True). Otherwise, you’ll have to specify it as Y.

  • t (int (optional, default None).) – The number of steps to perform during diffusion. The default None iterates until the Procrustes disparity value is below the threshold parameter.

  • threshold (float (optional, default 0.001).) – The threshold value for the Procrustes disparity when finding an optimal t.

  • tmax (int (optional, default 30).) – The maximum number of steps to perform during diffusion when estimating an optimal t.

Returns:

Y_imp (np.ndarray) – The imputed data.

_get_landmarks(X, n_landmarks=None)
filter(signal, replicates=None, beta=50, target=None, filterfunc=None, offset=0, order=1, solver='chebyshev', chebyshev_order=100)

This is inspired from [MELD](https://github.com/KrishnaswamyLab/MELD) to estimate sample associated density estimation. However, you can naturally use this for any signal in your data, not just samples of specific conditions. In practice, this is just a simple [PyGSP](https://pygsp.readthedocs.io/en/stable/reference/filters.html#module-pygsp.filters) filter on a graph. Indeed, it calls PyGSP, so you’ll need it installed to use this function.

Parameters:
  • signal (array-like) – Signal(s) to filter - usually sample labels.

  • replicates (array-like, optional (default None)) – Replicate labels for each sample. If None, no replicates are assumed.

  • beta (int (optional, default 50).) – Amount of smoothing to apply. Vary this parameter to get good estimates - this can vary widely from dataset to dataset.

  • target (array-like (optional, default None).) – Similarity matrix to use for filtering. If None, uses the kernel matrix.

  • filterfunc (function (optional, default None).) – Function to use for filtering. If None, the default is to use a Laplacian filter.

  • offset (float (optional, default 0)) – Amount to shift the filter in the eigenvalue spectrum. Recommended to use an eigenvalue from the graph based on the spectral distribution. Should be in interval [0,1]

  • order (int (optional, default 1).) – Falloff and smoothness of the filter. High order leads to square-like filters.

  • solver (string (optional, default 'chebyshev').) – Method to solve convex problem. If ‘chebyshev’, uses a chebyshev polynomial approximation of the corresponding filter. Else, if ‘exact’, uses the eigenvalue solution to the problem

  • chebyshev_order (int (optional, default 100).) – Order of chebyshev approximation to use.

_replicate_normalize_densities(replicates)
is_connected()

Check if the graph is connected (cached). A graph is connected if and only if there exists a (directed) path between any two vertices.

Returns:

connected (bool) – True if the graph is connected, False otherwise.

_connected_components(target=None)

Finds the connected components of the kernel matrix by default. Other matrices can be specified for use with the target parameter.

Parameters:

target (array-like (optional, default None).) – The target matrix to find the connected components of. If None, uses the kernel matrix.

Returns:

  • components (list of np.ndarray) – The connected components of the target matrix.

  • labels (list of int) – The labels of the connected components.

resistance_distance()

Uses the cached Laplacian matrix to compute the resistance distances. See Klein and Randic [manuscript](https://doi.org/10.1007%2FBF01164627) for details.

Returns:

rd (sparse matrix) – Resistance distance matrix

sparsify(epsilon=0.1, maxiter=10, random_state=None)

Sparsify a graph (with the Spielman-Srivastava method). This originally only called PyGSP but now also has some adaptations.

Parameters:
  • epsilon (float (optional, default 0.1).) – Sparsification parameter, which must be between 1/sqrt(N) and 1.

  • maxiter (int (optional, default 10).) – Maximum number of iterations.

  • random_state ({None, int, RandomState, Generator} (optional, default None)) – Seed for the random number generator (for reproducible sparsification).

Returns:

sparse_graph (sparse matrix) – Sparsified graph affinity matrix.

interpolate(f_subsampled, keep_inds, target=None, order=100, reg_eps=0.005)

Interpolate a graph signal.

Parameters:
  • f_subsampled (ndarray) – A graph signal on the graph G.

  • keep_inds (ndarray) – List of indices on which the signal is sampled.

  • target (array-like (optional, default None).) – Similarity matrix to use for interpolation. If None, uses the kernel matrix.

  • order (int) – Degree of the Chebyshev approximation (default = 100).

  • reg_eps (float) – The regularized graph Laplacian is $ar{L}=L+epsilon I$. A smaller epsilon may lead to better regularization, but will also require a higher order Chebyshev approximation.

Returns:

signal_interpolated (ndarray) – Interpolated graph signal on the full vertex set of G.

write_pkl(wd=None, filename='mykernel.pkl')