topo.tpgraph.intrinsic_dim

Classes

IntrinsicDim

Scikit-learn flavored class for estimating the intrinsic dimensionalities of high-dimensional data.

Functions

_get_dist_to_k_nearest_neighbor(K[, n_neighbors])

_get_dist_to_median_nearest_neighbor(K[, n_neighbors])

fsa_local(K[, n_neighbors])

Measure local dimensionality using the Farahmand-Szepesvári-Audibert (FSA) dimension estimator

fsa_global(K[, id_local])

mle_local(K[, n_neighbors, k1])

Maximum likelihood estimator of intrinsic dimension (Levina-Bickel)

mle_global(K[, id_local, n_neighbors, k1])

automated_scaffold_sizing(X[, method, ks, backend, ...])

Unified automated scaffold sizing.

Module Contents

class topo.tpgraph.intrinsic_dim.IntrinsicDim(methods=['fsa', 'mle'], k=[10, 20, 50, 75, 100], backend='hnswlib', metric='euclidean', n_jobs=-1, plot=True, random_state=None, **kwargs)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Scikit-learn flavored class for estimating the intrinsic dimensionalities of high-dimensional data. This class iterates over a range of possible values of k-nearest-neighbors to consider in calculations using two different methods: the Farahmand-Szepesvári-Audibert (FSA) dimension estimator and the Maximum Likelihood Estimator (MLE).

Parameters:
  • methods (list of str, (default ['fsa', 'mle'])) – The dimensionality estimation methods to use. Current options are ‘fsa’ (Farahmand-Szepesvári-Audibert) and ‘mle’ (Maximum Likelihood Estimator).

  • k (int, range or list of ints, (default [10, 20, 50, 75, 100])) – The number of nearest neighbors to use for the dimensionality estimation methods. If a single value of k is provided, the result dictionary has the methods as keys and the dimensionality estimates as values. If multiple values of k are provided, the result dictionary has each value of k as a key; each value is itself a dictionary with the methods as keys and the dimensionality estimates as values.

  • metric (str (default 'euclidean')) – The metric to use when calculating distance between instances in a feature array.

  • backend (str (optional, default 'hnswlib').) – Which backend to use for k-nearest-neighbor computations. Options are ‘nmslib’, ‘hnswlib’, ‘faiss’, ‘annoy’ and ‘sklearn’.

  • n_jobs (int (optional, default -1).) – The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallelization (multiprocessing) is *highly* recommended whenever possible.

  • plot (bool (optional, default True).) – Whether to plot the results when using the fit() method.

  • random_state (int or numpy.random.RandomState() (optional, default None).) – A pseudo random number generator. Used for generating colors for plotting.

  • **kwargs (keyword arguments) – Additional keyword arguments to pass to the backend kNN estimator.

Properties

local_id, global_id : dictionaries containing local and global dimensionality estimates, respectively.

Their structure depends on the value of the k parameter:

  • If a single value of k is provided, then the dictionaries will have keys corresponding to the methods, and values corresponding to the dimensionality estimates.

  • If multiple values of k are provided, then the dictionaries will have keys corresponding to each value of k, and values corresponding to other dictionaries, which have keys corresponding to the methods, and values corresponding to the dimensionality estimates.
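The two layouts can be illustrated with hand-built dictionaries. This is only a sketch: the numbers are placeholders rather than real estimates, the exact key types (for example, int keys for k) are an assumption, and estimates_for is a hypothetical helper that is not part of the module.

```python
# Hypothetical illustration of the two result layouts; values are placeholders.

# Single k: keys are method names.
single_k = {"fsa": 7.9, "mle": 8.3}

# Multiple k: outer keys are the values of k (assumed to be ints here),
# inner keys are method names.
multi_k = {
    10: {"fsa": 7.9, "mle": 8.3},
    20: {"fsa": 8.1, "mle": 8.4},
}


def estimates_for(result, method):
    """Return {k: estimate} for one method, whichever layout is used.

    Hypothetical helper: for the single-k layout the key is None,
    because the dictionary itself does not record which k was used.
    """
    if method in result:  # single-k layout: method names are the keys
        return {None: result[method]}
    return {k: per_method[method] for k, per_method in result.items()}
```

A helper of this kind lets downstream code read either layout without first checking how many values of k were passed.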

methods = ['fsa', 'mle']
use_k = [10, 20, 50, 75, 100]
n_k
backend = 'hnswlib'
metric = 'euclidean'
n_jobs = -1
plot = True
kwargs
random_state = None
local_id
global_id
__repr__()
_parse_random_state()
_compute_id(X)
plot_id(bins=30, figsize=(6, 8), titlesize=22, labelsize=16, legendsize=10)
fit(X, **kwargs)

Estimates the intrinsic dimensionalities of the data.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, X is assumed to be a square, symmetric, semidefinite matrix of k-nearest-neighbors, with its k greater than or equal to the highest value of k used to estimate the intrinsic dimensionalities.

  • **kwargs (keyword arguments) – Additional keyword arguments to pass to the plotting function IntrinsicDim.plot_id().

Returns:

  • Populates the local_id and global_id properties of the class.

  • Shows a plot of the results if plot=True.

transform(X=None)

Does nothing. Here for compatibility with scikit-learn only.

topo.tpgraph.intrinsic_dim._get_dist_to_k_nearest_neighbor(K, n_neighbors=10)
topo.tpgraph.intrinsic_dim._get_dist_to_median_nearest_neighbor(K, n_neighbors=10)
topo.tpgraph.intrinsic_dim.fsa_local(K, n_neighbors=10)

Measure local dimensionality using the Farahmand-Szepesvári-Audibert (FSA) dimension estimator

Parameters:
  • K (sparse matrix) – Sparse matrix of distances between points

  • n_neighbors (int) – Number of neighbors to consider for the kNN graph. Note this is actually half the number of neighbors used in the FSA estimator, for efficiency.

Returns:

local_dim (array) – Local dimensionality estimate for each point
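As a rough illustration of the idea behind the FSA estimator, the local dimension at a point can be estimated as ln(2) / ln(r_2k / r_k), where r_k and r_2k are the distances to the k-th and 2k-th nearest neighbors. The sketch below is not this module's implementation: fsa_local_sketch is a hypothetical name, it takes raw points instead of a sparse distance matrix K, and it uses brute-force distances from the standard library so that it stays self-contained.

```python
import math
import random
import statistics


def fsa_local_sketch(points, n_neighbors=10):
    """Per-point FSA estimate ln(2) / ln(r_2k / r_k), brute force.

    Assumes every point has at least 2 * n_neighbors other points and
    that r_2k > r_k (no ties), otherwise the logarithm is zero.
    """
    dims = []
    for i, p in enumerate(points):
        # Sorted distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r_k = dists[n_neighbors - 1]        # distance to the k-th neighbor
        r_2k = dists[2 * n_neighbors - 1]   # distance to the 2k-th neighbor
        dims.append(math.log(2) / math.log(r_2k / r_k))
    return dims


rng = random.Random(0)
# 400 points on a 2-D plane embedded in 5-D space.
data = [(rng.random(), rng.random(), 0.0, 0.0, 0.0) for _ in range(400)]
local_dim = fsa_local_sketch(data, n_neighbors=10)
median_dim = statistics.median(local_dim)
```

The use of both k and 2k neighbors is why n_neighbors is described above as half the number of neighbors the estimator actually needs.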

topo.tpgraph.intrinsic_dim.fsa_global(K, id_local=None, **kwargs)
topo.tpgraph.intrinsic_dim.mle_local(K, n_neighbors=10, k1=1)

Maximum likelihood estimator of intrinsic dimension (Levina-Bickel)
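For orientation, the Levina-Bickel local estimate at a point is the inverse of the mean log-ratio between the distance to the k-th neighbor and the distances to the closer neighbors. The sketch below is an assumption-laden illustration, not this module's code: mle_local_sketch is a hypothetical name, it works on raw points with brute-force distances instead of a sparse matrix K, and it leaves out the k1 parameter because its role is not documented here.

```python
import math
import random
import statistics


def mle_local_sketch(points, n_neighbors=10):
    """Per-point Levina-Bickel estimate, brute force.

    m_k(x) = (k - 1) / sum_{j=1}^{k-1} ln(T_k(x) / T_j(x)),
    where T_j(x) is the distance from x to its j-th nearest neighbor.
    Assumes no duplicate points (all distances are strictly positive).
    """
    dims = []
    for i, p in enumerate(points):
        # Sorted distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        t_k = dists[n_neighbors - 1]
        log_ratios = [math.log(t_k / t_j) for t_j in dists[: n_neighbors - 1]]
        dims.append((n_neighbors - 1) / sum(log_ratios))
    return dims


rng = random.Random(0)
# 300 points on a 2-D plane embedded in 5-D space.
data = [(rng.random(), rng.random(), 0.0, 0.0, 0.0) for _ in range(300)]
local_dim = mle_local_sketch(data, n_neighbors=10)
mean_dim = statistics.mean(local_dim)
```

With small k the local estimates are noisy, which is one reason to aggregate them into a global estimate.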

topo.tpgraph.intrinsic_dim.mle_global(K, id_local=None, n_neighbors=15, k1=1)
topo.tpgraph.intrinsic_dim.automated_scaffold_sizing(X, method: str = 'fsa', ks=(15, 30, 60), backend='hnswlib', metric='euclidean', n_jobs: int = -1, quantile: float = 0.99, min_components: int = 16, max_components: int = 512, headroom: float = 0.15, random_state=None, use_median: bool = False, return_details: bool = False, **knn_kwargs)

Unified automated scaffold sizing.

method=’fsa’:
  • Compute local FSA i.d. for each k in ks, take per-cell median across ks, then take the upper quantile across cells, add headroom, clamp to bounds.

method=’mle’:
  • Use a single neighborhood size k: ks itself if it is an int, or max(ks) if it is iterable.

  • Compute local MLE i.d. at k, then global i.d. via:
    • median of locals if use_median=True

    • Levina–Bickel harmonic-mean estimator (mle_global) otherwise.

  • Add headroom, clamp to bounds.

Returns:

  • n_components (int)

  • details (dict (if return_details=True))
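The aggregation steps described above can be sketched as follows, starting from local estimates that have already been computed. Several details are assumptions made for illustration and are not confirmed by this reference: headroom is applied multiplicatively and rounded up, the upper quantile uses a nearest-rank rule, and the names _finish, size_fsa and size_mle are hypothetical.

```python
import math
import statistics


def _finish(value, headroom=0.15, min_components=16, max_components=512):
    # ASSUMPTION: headroom is multiplicative and the result is rounded up.
    padded = math.ceil(value * (1.0 + headroom))
    # Clamp to [min_components, max_components].
    return max(min_components, min(max_components, padded))


def size_fsa(local_by_k, quantile=0.99, **bounds):
    """local_by_k maps each k to a list of per-cell local estimates."""
    # Per-cell median across the different values of k.
    per_cell = [statistics.median(vals) for vals in zip(*local_by_k.values())]
    ranked = sorted(per_cell)
    # ASSUMPTION: nearest-rank upper quantile across cells.
    idx = math.ceil(quantile * len(ranked)) - 1
    idx = max(0, min(len(ranked) - 1, idx))
    return _finish(ranked[idx], **bounds)


def size_mle(local, use_median=False, **bounds):
    """local is a list of per-cell MLE estimates at a single k."""
    if use_median:
        global_id = statistics.median(local)
    else:
        # Harmonic mean of the local estimates (requires positive values).
        global_id = statistics.harmonic_mean(local)
    return _finish(global_id, **bounds)


fsa_locals = {
    15: [20.0, 30.0, 40.0],
    30: [22.0, 28.0, 44.0],
    60: [24.0, 26.0, 42.0],
}
n_fsa = size_fsa(fsa_locals)
n_mle = size_mle([4.0, 4.0, 4.0])
```

In this toy input the MLE path lands below min_components, so the clamp rather than the estimate determines the result; that is the intended role of the bounds.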