sica.base.StabilizedICA

class sica.base.StabilizedICA(n_components, n_runs, resampling=None, algorithm='fastica_par', fun='logcosh', whiten=True, max_iter=2000, plot=False, normalize=True, reorientation=True, pca_solver='auto', chunked=False, chunk_size=None, zero_center=True, n_jobs=1, verbose=0)[source]

Implement a stabilized version of the Independent Component Analysis algorithm. It fits the matrix factorization model X = AS, where A is the unmixing matrix (n_mixtures, n_sources), S is the source matrix (n_sources, n_observations) and X is the observed mixed data (n_mixtures, n_observations).

Parameters:
n_componentsint

Number of ICA components/sources.

n_runsint

Number of times we run the FastICA algorithm

resamplingstr {None , ‘bootstrap’ , ‘fast_bootstrap’}, optional

Method for resampling the data before each run of the ICA solver.

  • If None, no resampling is applied.

  • If ‘bootstrap’ the classical bootstrap method is applied to the original data matrix, the resampled matrix is

whitened (using the whitening hyperparameters set for the fit method) and the ICA components are extracted.

  • If ‘fast_boostrap’ a fast bootstrap algorithm is applied to the original data matrix and the whitening

operation is performed simultaneously with SVD decomposition and then the ICA components are extracted (see References).

Resampling could lead to quite heavy computations (whitening at each iteration), depending on the size of the input data. It should be considered with care. The default is None.

algorithmstr {‘fastica_par’ , ‘fastica_def’ , ‘picard_fastica’ , ‘picard’ , ‘picard_ext’ , ‘picard_orth’}, optional.

The algorithm applied for solving the ICA problem at each run. Please see the supplementary explanations for more details. The default is ‘fastica_par’, i.e. FastICA from sklearn with parallel implementation.

funstr {‘cube’ , ‘exp’ , ‘logcosh’ , ‘tanh’} or function, optional.

If algorithm is in {‘fastica_par’ , ‘fastica_def’}, it represents the functional form of the G function used in the approximation to neg-entropy. Could be either ‘logcosh’, ‘exp’, or ‘cube’.

If algorithm is in {‘picard_fastica’ , ‘picard’ , ‘picard_ext’ , ‘picard_orth’}, it is associated with the choice of a density model for the sources. See supplementary explanations for more details.

The default is ‘logcosh’.

whitenboolean, optional

If True the matrix X is whitened, i.e. centered then projected in the space defined by its first n_components PCA components and reduced to unit variance along each of these axes.

If False the input X matrix must be already whitened (the rows must be centered, scaled to unit variance and uncorrelated.)

The default is True.

max_iterint

Maximum number of iteration for the FastICA algorithm.

plotboolean, optional

If True plot the stability indexes for each cluster in decreasing order. The default is False.

normalizeboolean, optional

If True normalize the rows of S_ (i.e. the stabilized ICA components) to unit standard deviation. The default is True.

reorientationboolean,optional

If True re-oriente the rows of S_ towards positive heavy tail. The default is True.

pca_solverstr {‘auto’, ‘full’, ‘arpack’, ‘randomized’ , ‘lobpcg’}, optional

Solver for the different PCA methods. Please note that some solvers may not be compatible with some PCA methods. See _whitening.py for more details. The default is “auto”.

chunkedboolean, optional

Parameter for the whitening step, see _whitening.py for more details. The default is False.

chunk_sizeint, optional

Parameter for the whitening step, see _whitening.py for more details. The default is None.

zero_centerboolean, optional

Parameter for the whitening step, see _whitening.py for more details. The default is True.

n_jobsint, optional

Number of jobs to run in parallel. -1 means using all processors. See the joblib package documentation for more explanations. The default is 1.

verbose: int, optional

Control the verbosity: the higher, the more messages. The default is 0.

Attributes:
S_: 2D array, shape (n_components , n_observations)

Array of sources/metagenes, each line corresponds to a stabilized ICA component (i.e. the centrotype of a cluster of components).

stability_indexes_1D array, shape (n_components)

Stability indexes for the stabilized ICA components.

mean_1D array, shape (n_mixtures)

Notes

Here n_components corresponds to the number of ICA sources, n_mixtures corresponds to the number of linear mixtures (i.e. linear mixtures of ICA sources) that we observe, and n_observations corresponds to the number of observations collected for these mixtures. Each time, the user needs to carefully determine which dimension in his data set should correspond to the linear mixtures of ICA sources and which dimension should correspond to the observations. The user should keep in mind that, at the end, he will obtain n_components vectors of dimension n_observations, independent form each other (as finite samples of latent independent distributions). The user guide and the definition of the ICA framework should be helpful.

  • For a data set of discretized sound signals registered by 10 microphones at 100 time points, if we want to

retrieve 5 ICA sources we need to set n_mixtures = 10, n_observations = 100 and n_components = 5.

  • For a gene expression data set with 100 samples and 10000 genes, if we want to retrieve 10 ICA sources **in

the gene space** we need to set n_mixtures = 100, n_observations = 10000 and n_components = 10.

References

Fast bootstrap algorithm:

Fisher A, Caffo B, Schwartz B, Zipunnikov V. Fast, Exact Bootstrap Principal Component Analysis for p > 1 million. J Am Stat Assoc. 2016;111(514):846-860. doi: 10.1080/01621459.2015.1062383. Epub 2016 Aug 18. PMID: 27616801; PMCID: PMC5014451.

ICASSO method :

J. Himberg and A. Hyvarinen, “Icasso: software for investigating the reliability of ICA estimates by clustering and visualization,” 2003 IEEE XIII Workshop on Neural Networks for Signal Processing (IEEE Cat. No.03TH8718), 2003, pp. 259-268, doi: 10.1109/NNSP.2003.1318025 (see https://www.cs.helsinki.fi/u/ahyvarin/papers/Himberg03.pdf).

Picard algorithm and extensions:

Pierre Ablin, Jean-Francois Cardoso, Alexandre Gramfort, “Faster independent component analysis by preconditioning with Hessian approximations” IEEE Transactions on Signal Processing, 2018 (see https://arxiv.org/abs/1706.08171).

Pierre Ablin, Jean-François Cardoso, Alexandre Gramfort “Faster ICA under orthogonal constraint” ICASSP, 2018 (see https://arxiv.org/abs/1711.10873)

UMAP:

For more details about the UMAP (Uniform Manifold Approximation and Projection), see https://pypi.org/project/umap-learn/.

Examples

>>> import pandas as pd
>>> from sica.base import StabilizedICA
>>> df = pd.read_csv("data.csv" , index_col = 0)
>>> sICA = StabilizedICA(n_components = 45 , n_runs = 30, plot = True, n_jobs = -1)
>>> sICA.fit(df)
>>> Sources = pd.DataFrame(sICA.S_ , columns = df.index , index = ['source ' + str(i) for i in range(sICA.S_.shape[0])])
>>> Sources.head()

Methods

fit(X[, y])

Fit the ICA model with X (use stabilization).

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X[, copy])

Transform the mixing matrix back to the mixed data (applying the sources).

projection([method, ax])

Plot the n_components*n_runs ICA components computed during fit() in 2D.

set_params(**params)

Set the parameters of this estimator.

transform(X[, copy])

Apply dimensionality reduction to X (i.e.

fit(X, y=None)[source]

Fit the ICA model with X (use stabilization).

  1. Compute the ICA components of X n_runs times.

  2. Cluster all the n_components*n_runs components with agglomerative hierarchical clustering (average linkage) into n_components clusters.

  3. For each cluster compute its stability index and return its centrotype as the final ICA component.

Parameters:
X2D array-like, shape (n_mixtures, n_observations) or (n_components, n_observations) if whiten is False.

Training data

yIgnored

Ignored.

Returns:
selfobject

Returns the instance itself.

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
Xarray-like of shape (n_samples, n_features)

Input samples.

yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_paramsdict

Additional fit parameters.

Returns:
X_newndarray array of shape (n_samples, n_features_new)

Transformed array.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

inverse_transform(X, copy=True)[source]

Transform the mixing matrix back to the mixed data (applying the sources).

Parameters:
X: 2D array-like, shape (n_mixtures , n_components)
copy: bool, optional

If False, data passed to fit are overwritten. The default is True.

Returns:
X_new: 2D array, shape (n_mixtures, n_observations)
projection(method='mds', ax=None)[source]

Plot the n_components*n_runs ICA components computed during fit() in 2D. Approximate the original dissimilarities between components by Euclidean distance. Each cluster is represented with a different color.

Parameters:
methodstring, optional

Name of the dimensionality reduction method (e.g “tsne” , “mds” or “umap”) The default is “umap”.

axmatplotlib.axes, optional

The default is None.

Returns:
None.

Notes

  • We use the dissimilarity measure sqrt(1 - |rho_ij|) (rho the Pearson correlation) instead of 1 - |rho_ij| to reduce overlapping.

  • Please note that multidimensional scaling (MDS) is more computationally demanding than t-SNE or UMAP. However, it takes into account the global structures of the data set while the others don’t. For t-SNE or UMAP one cannot really interpret the inter-cluster distances.

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

transform(X, copy=True)[source]

Apply dimensionality reduction to X (i.e. recover the mixing matrix applying the pseudo-inverse of the sources).

Parameters:
X2D array-like, shape (n_mixtures, n_observations)
copy: bool, optional

If False, data passed to fit are overwritten. The default is True.

Returns:
A2D array, shape (n_mixtures, n_components)

Unmixing matrix which maps the independent sources to the data (i.e. X.T = AS)