topo.tpgraph.intrinsic_dim

Module Contents

Classes

IntrinsicDim

Scikit-learn flavored class for estimating the intrinsic dimensionalities of high-dimensional data.

Functions

_get_dist_to_k_nearest_neighbor(K[, n_neighbors])

_get_dist_to_median_nearest_neighbor(K[, n_neighbors])

fsa_local(K[, n_neighbors])

Measure local dimensionality using the Farahmand-Szepesvári-Audibert (FSA) dimension estimator

fsa_global(K[, id_local])

mle_local(K[, n_neighbors, k1])

Maximum likelihood estimator of intrinsic dimension (Levina-Bickel)

mle_global(K[, id_local, n_neighbors, k1])

local_eigengap_experimental(X[, max_n_components, ...])

class topo.tpgraph.intrinsic_dim.IntrinsicDim(methods=['fsa', 'mle'], k=[10, 20, 50, 75, 100], backend='hnswlib', metric='euclidean', n_jobs=-1, plot=True, random_state=None, **kwargs)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Scikit-learn flavored class for estimating the intrinsic dimensionalities of high-dimensional data. This class iterates over a range of possible values of k-nearest-neighbors to consider in calculations using two different methods: the Farahmand-Szepesvári-Audibert (FSA) dimension estimator and the Maximum Likelihood Estimator (MLE).

Parameters:
  • methods (list of str, (default ['fsa', 'mle'])) – The dimensionality estimation methods to use. Current options are ‘fsa’ (Farahmand-Szepesvári-Audibert) and ‘mle’ (Maximum Likelihood Estimator).

  • k (int, range or list of ints, (default [10, 20, 50, 75, 100])) – The number of nearest neighbors to use for the dimensionality estimation methods. If a single value of k is provided, the result dictionary has the methods as keys and the dimensionality estimates as values. If multiple values of k are provided, the result dictionary has the values of k as keys, and each value is itself a dictionary that maps the methods to the dimensionality estimates.

  • metric (str (default 'euclidean')) – The metric to use when calculating distance between instances in a feature array.

  • backend (str (optional, default 'hnswlib').) – Which backend to use for k-nearest-neighbor computations. Options are ‘nmslib’, ‘hnswlib’, ‘faiss’, ‘annoy’ and ‘sklearn’.

  • n_jobs (int (optional, default -1).) – The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallelization (multiprocessing) is *highly* recommended whenever possible.

  • plot (bool (optional, default True).) – Whether to plot the results when using the fit() method.

  • random_state (int or numpy.random.RandomState() (optional, default None).) – A pseudo random number generator. Used for generating colors for plotting.

  • **kwargs (keyword arguments) – Additional keyword arguments to pass to the backend kNN estimator.

Properties

local_id, global_id : dictionaries containing local and global dimensionality estimates, respectively.

Their structure depends on the value of the k parameter:

  • If a single value of k is provided, then the dictionaries will have keys corresponding to the methods, and values corresponding to the dimensionality estimates.

  • If multiple values of k are provided, then the dictionaries will have keys corresponding to the values of k, and values corresponding to other dictionaries, which have keys corresponding to the methods, and values corresponding to the dimensionality estimates.
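
The two layouts described above can be handled uniformly with a small helper. The sketch below is illustrative only: iter_estimates is a hypothetical function, not part of topo; the numbers are placeholders rather than real estimates; and the use of integer keys for k is an assumption.

```python
def iter_estimates(result, single_k=None):
    """Yield (k, method, estimate) rows from either dictionary layout.

    Hypothetical helper, not part of topo.
    """
    for key, value in result.items():
        if isinstance(value, dict):
            # Multiple-k layout: {k: {method: estimate}}
            for method, estimate in value.items():
                yield key, method, estimate
        else:
            # Single-k layout: {method: estimate}
            yield single_k, key, value


# Placeholder values, not measured results.
single = {"fsa": 2.0, "mle": 2.5}
multi = {10: {"fsa": 2.0, "mle": 2.5}, 20: {"fsa": 2.1, "mle": 2.4}}

rows_single = list(iter_estimates(single, single_k=10))
rows_multi = list(iter_estimates(multi))
```

Flattening to rows makes it straightforward to compare estimates across values of k regardless of how many were requested.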

__repr__()

Return repr(self).

_parse_random_state()

_compute_id(X)

plot_id(bins=30, figsize=(6, 8), titlesize=22, labelsize=16, legendsize=10)

fit(X, **kwargs)

Estimates the intrinsic dimensionalities of the data.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – The set of points to compute the kernel matrix for. Accepts np.ndarray and scipy.sparse matrices. If precomputed, X is assumed to be a square, symmetric, semidefinite matrix of k-nearest-neighbors, built with a k greater than or equal to the highest value of k used to estimate the intrinsic dimensionalities.

  • **kwargs (keyword arguments) – Additional keyword arguments to pass to the plotting function IntrinsicDim.plot_id().

Returns:

  • Populates the local and global properties of the class.

  • Shows a plot of the results if plot=True.

transform(X=None)

Does nothing. Here for compatibility with scikit-learn only.

topo.tpgraph.intrinsic_dim._get_dist_to_k_nearest_neighbor(K, n_neighbors=10)

topo.tpgraph.intrinsic_dim._get_dist_to_median_nearest_neighbor(K, n_neighbors=10)

topo.tpgraph.intrinsic_dim.fsa_local(K, n_neighbors=10)

Measure local dimensionality using the Farahmand-Szepesvári-Audibert (FSA) dimension estimator

Parameters:
  • K (sparse matrix) – Sparse matrix of distances between points

  • n_neighbors (int) – Number of neighbors to consider for the kNN graph. Note this is actually half the number of neighbors used in the FSA estimator, for efficiency.

Returns:

local_dim (array) – Local dimensionality estimate for each point
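
For intuition, the FSA estimate at a point compares the distance to its k-th nearest neighbor, R_k, with the distance to its 2k-th nearest neighbor, R_2k, and returns ln 2 / ln(R_2k / R_k). The following is a minimal NumPy sketch of that formula using dense brute-force distances. It is not the library's implementation: fsa_local_sketch is a hypothetical name, and it takes raw points instead of the sparse distance matrix K.

```python
import numpy as np


def fsa_local_sketch(X, k=10):
    """Hypothetical helper: FSA local dimension from brute-force distances."""
    X = np.asarray(X, dtype=float)
    # Dense pairwise Euclidean distances (only suitable for small n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort each row; column 0 is the point itself (distance 0).
    dist.sort(axis=1)
    r_k = dist[:, k]        # distance to the k-th nearest neighbor
    r_2k = dist[:, 2 * k]   # distance to the 2k-th nearest neighbor
    return np.log(2.0) / np.log(r_2k / r_k)


# A 2-D plane embedded in 10-D: the local estimates should sit near 2.
rng = np.random.default_rng(0)
plane = np.zeros((500, 10))
plane[:, :2] = rng.uniform(size=(500, 2))
local_dim = fsa_local_sketch(plane, k=10)
global_dim = float(np.median(local_dim))
```

Taking the median of the local values is one common way to obtain a single global figure; whether fsa_global does exactly this is not stated here.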

topo.tpgraph.intrinsic_dim.fsa_global(K, id_local=None, **kwargs)

topo.tpgraph.intrinsic_dim.mle_local(K, n_neighbors=10, k1=1)

Maximum likelihood estimator of intrinsic dimension (Levina-Bickel)
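
As a rough illustration, the Levina-Bickel estimate at a point with neighbor distances T_1 <= ... <= T_k is the inverse of the mean of log(T_k / T_j) over j = 1, ..., k-1. The sketch below implements that formula with NumPy on dense brute-force distances. It is an assumption-laden example, not the library's code: mle_local_sketch is a hypothetical name, it takes raw points instead of the matrix K, and it does not model the k1 parameter.

```python
import numpy as np


def mle_local_sketch(X, k=10):
    """Hypothetical helper: Levina-Bickel local dimension estimate."""
    X = np.asarray(X, dtype=float)
    # Dense pairwise Euclidean distances (only suitable for small n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    dist.sort(axis=1)
    # Distances to the 1st..k-th neighbors (column 0 is the point itself).
    t = dist[:, 1:k + 1]
    # log(T_k / T_j) for j = 1..k-1, shape (n_samples, k - 1).
    log_ratio = np.log(t[:, -1:] / t[:, :-1])
    return (k - 1) / log_ratio.sum(axis=1)


# Points drawn uniformly from a 3-D cube: estimates should sit near 3.
rng = np.random.default_rng(0)
cube = rng.uniform(size=(500, 3))
local_dim = mle_local_sketch(cube, k=20)
global_dim = float(local_dim.mean())
```

Averaging the local values gives one possible global estimate; the aggregation used by mle_global is not documented here.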

topo.tpgraph.intrinsic_dim.mle_global(K, id_local=None, n_neighbors=15, k1=1)

topo.tpgraph.intrinsic_dim.local_eigengap_experimental(X, max_n_components=30, n_neighbors=30, metric='cosine', verbose=False, **kwargs)