sica.base.StabilizedICA

class sica.base.StabilizedICA(n_components, n_runs, resampling=None, algorithm='fastica_par', fun='logcosh', whiten=True, max_iter=2000, plot=False, normalize=True, reorientation=True, pca_solver='auto', chunked=False, chunk_size=None, zero_center=True, n_jobs=1, verbose=0)[source]

Implement a stabilized version of the Independent Component Analysis algorithm. It fits the matrix factorization model X = AS, where A is the unmixing matrix (n_mixtures, n_sources), S is the source matrix (n_sources, n_observations) and X is the observed mixed data (n_mixtures, n_observations).

Parameters:
n_componentsint

Number of ICA components/sources.

n_runsint

Number of times we run the FastICA algorithm

resamplingstr {None , ‘bootstrap’ , ‘fast_bootstrap’}, optional

Method for resampling the data before each run of the ICA solver.

  • If None, no resampling is applied.

  • If ‘bootstrap’ the classical bootstrap method is applied to the original data matrix, the resampled matrix is

whitened (using the whitening hyperparameters set for the fit method) and the ICA components are extracted.

  • If ‘fast_boostrap’ a fast bootstrap algorithm is applied to the original data matrix and the whitening

operation is performed simultaneously with SVD decomposition and then the ICA components are extracted (see References).

Resampling could lead to quite heavy computations (whitening at each iteration), depending on the size of the input data. It should be considered with care. The default is None.

algorithmstr {‘fastica_par’ , ‘fastica_def’ , ‘picard_fastica’ , ‘picard’ , ‘picard_ext’ , ‘picard_orth’}, optional.

The algorithm applied for solving the ICA problem at each run. Please see the supplementary explanations for more details. The default is ‘fastica_par’, i.e. FastICA from sklearn with parallel implementation.

funstr {‘cube’ , ‘exp’ , ‘logcosh’ , ‘tanh’} or function, optional.

If algorithm is in {‘fastica_par’ , ‘fastica_def’}, it represents the functional form of the G function used in the approximation to neg-entropy. Could be either ‘logcosh’, ‘exp’, or ‘cube’.

If algorithm is in {‘picard_fastica’ , ‘picard’ , ‘picard_ext’ , ‘picard_orth’}, it is associated with the choice of a density model for the sources. See supplementary explanations for more details.

The default is ‘logcosh’.

whitenboolean, optional

If True the matrix X is whitened, i.e. centered then projected in the space defined by its first n_components PCA components and reduced to unit variance along each of these axes.

If False the input X matrix must be already whitened (the rows must be centered, scaled to unit variance and uncorrelated.)

The default is True.

max_iterint

Maximum number of iteration for the FastICA algorithm.

plotboolean, optional

If True plot the stability indexes for each cluster in decreasing order. The default is False.

normalizeboolean, optional

If True normalize the rows of S_ (i.e. the stabilized ICA components) to unit standard deviation. The default is True.

reorientationboolean,optional

If True re-oriente the rows of S_ towards positive heavy tail. The default is True.

pca_solverstr {‘auto’, ‘full’, ‘arpack’, ‘randomized’ , ‘lobpcg’}, optional

Solver for the different PCA methods. Please note that some solvers may not be compatible with some PCA methods. See _whitening.py for more details. The default is “auto”.

chunkedboolean, optional

Parameter for the whitening step, see _whitening.py for more details. The default is False.

chunk_sizeint, optional

Parameter for the whitening step, see _whitening.py for more details. The default is None.

zero_centerboolean, optional

Parameter for the whitening step, see _whitening.py for more details. The default is True.

n_jobsint, optional

Number of jobs to run in parallel. -1 means using all processors. See the joblib package documentation for more explanations. The default is 1.

verbose: int, optional

Control the verbosity: the higher, the more messages. The default is 0.

Attributes:
S_: 2D array, shape (n_components , n_observations)

Array of sources/metagenes, each line corresponds to a stabilized ICA component (i.e. the centrotype of a cluster of components).

stability_indexes_1D array, shape (n_components)

Stability indexes for the stabilized ICA components.

mean_1D array, shape (n_mixtures)

Notes

Here n_components corresponds to the number of ICA sources, n_mixtures corresponds to the number of linear mixtures (i.e. linear mixtures of ICA sources) that we observe, and n_observations corresponds to the number of observations collected for these mixtures. Each time, the user needs to carefully determine which dimension in his data set should correspond to the linear mixtures of ICA sources and which dimension should correspond to the observations. The user should keep in mind that, at the end, he will obtain n_components vectors of dimension n_observations, independent form each other (as finite samples of latent independent distributions). The user guide and the definition of the ICA framework should be helpful.

  • For a data set of discretized sound signals registered by 10 microphones at 100 time points, if we want to

retrieve 5 ICA sources we need to set n_mixtures = 10, n_observations = 100 and n_components = 5.

  • For a gene expression data set with 100 samples and 10000 genes, if we want to retrieve 10 ICA sources **in

the gene space** we need to set n_mixtures = 100, n_observations = 10000 and n_components = 10.

References

Fast bootstrap algorithm:

Fisher A, Caffo B, Schwartz B, Zipunnikov V. Fast, Exact Bootstrap Principal Component Analysis for p > 1 million. J Am Stat Assoc. 2016;111(514):846-860. doi: 10.1080/01621459.2015.1062383. Epub 2016 Aug 18. PMID: 27616801; PMCID: PMC5014451.

ICASSO method :

J. Himberg and A. Hyvarinen, “Icasso: software for investigating the reliability of ICA estimates by clustering and visualization,” 2003 IEEE XIII Workshop on Neural Networks for Signal Processing (IEEE Cat. No.03TH8718), 2003, pp. 259-268, doi: 10.1109/NNSP.2003.1318025 (see https://www.cs.helsinki.fi/u/ahyvarin/papers/Himberg03.pdf).

Picard algorithm and extensions:

Pierre Ablin, Jean-Francois Cardoso, Alexandre Gramfort, “Faster independent component analysis by preconditioning with Hessian approximations” IEEE Transactions on Signal Processing, 2018 (see https://arxiv.org/abs/1706.08171).

Pierre Ablin, Jean-François Cardoso, Alexandre Gramfort “Faster ICA under orthogonal constraint” ICASSP, 2018 (see https://arxiv.org/abs/1711.10873)

UMAP:

For more details about the UMAP (Uniform Manifold Approximation and Projection), see https://pypi.org/project/umap-learn/.

Examples

>>> import pandas as pd
>>> from sica.base import StabilizedICA
>>> df = pd.read_csv("data.csv" , index_col = 0)
>>> sICA = StabilizedICA(n_components = 45 , n_runs = 30, plot = True, n_jobs = -1)
>>> sICA.fit(df)
>>> Sources = pd.DataFrame(sICA.S_ , columns = df.index , index = ['source ' + str(i) for i in range(sICA.S_.shape[0])])
>>> Sources.head()

Methods

fit(X[, y])

Fit the ICA model with X (use stabilization).

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X[, copy])

Transform the mixing matrix back to the mixed data (applying the sources).

projection([method, ax])

Plot the n_components*n_runs ICA components computed during fit() in 2D.

set_inverse_transform_request(*[, copy])

Request metadata passed to the inverse_transform method.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

set_transform_request(*[, copy])

Request metadata passed to the transform method.

transform(X[, copy])

Apply dimensionality reduction to X (i.e.

fit(X, y=None)[source]

Fit the ICA model with X (use stabilization).

  1. Compute the ICA components of X n_runs times.

  2. Cluster all the n_components*n_runs components with agglomerative hierarchical clustering (average linkage) into n_components clusters.

  3. For each cluster compute its stability index and return its centrotype as the final ICA component.

Parameters:
X2D array-like, shape (n_mixtures, n_observations) or (n_components, n_observations) if whiten is False.

Training data

yIgnored

Ignored.

Returns:
selfobject

Returns the instance itself.

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
Xarray-like of shape (n_samples, n_features)

Input samples.

yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_paramsdict

Additional fit parameters.

Returns:
X_newndarray array of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()[source]

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

inverse_transform(X, copy=True)[source]

Transform the mixing matrix back to the mixed data (applying the sources).

Parameters:
X: 2D array-like, shape (n_mixtures , n_components)
copy: bool, optional

If False, data passed to fit are overwritten. The default is True.

Returns:
X_new: 2D array, shape (n_mixtures, n_observations)
projection(method='mds', ax=None)[source]

Plot the n_components*n_runs ICA components computed during fit() in 2D. Approximate the original dissimilarities between components by Euclidean distance. Each cluster is represented with a different color.

Parameters:
methodstring, optional

Name of the dimensionality reduction method (e.g “tsne” , “mds” or “umap”) The default is “umap”.

axmatplotlib.axes, optional

The default is None.

Returns:
None.

Notes

  • We use the dissimilarity measure sqrt(1 - |rho_ij|) (rho the Pearson correlation) instead of 1 - |rho_ij| to reduce overlapping.

  • Please note that multidimensional scaling (MDS) is more computationally demanding than t-SNE or UMAP. However, it takes into account the global structures of the data set while the others don’t. For t-SNE or UMAP one cannot really interpret the inter-cluster distances.

set_inverse_transform_request(*, copy='$UNCHANGED$')[source]

Request metadata passed to the inverse_transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
copystr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for copy parameter in inverse_transform.

Returns:
selfobject

The updated object.

set_output(*, transform=None)[source]

Set output container.

See sphx_glr_auto_examples_miscellaneous_plot_set_output.py for an example on how to use the API.

Parameters:
transform{“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

New in version 1.4: "polars" option was added.

Returns:
selfestimator instance

Estimator instance.

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

set_transform_request(*, copy='$UNCHANGED$')[source]

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
copystr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for copy parameter in transform.

Returns:
selfobject

The updated object.

transform(X, copy=True)[source]

Apply dimensionality reduction to X (i.e. recover the mixing matrix applying the pseudo-inverse of the sources).

Parameters:
X2D array-like, shape (n_mixtures, n_observations)
copy: bool, optional

If False, data passed to fit are overwritten. The default is True.

Returns:
A2D array, shape (n_mixtures, n_components)

Unmixing matrix which maps the independent sources to the data (i.e. X.T = AS)