Why You Should Be Using PHATE for Dimensionality Reduction

As data scientists, we often work with high-dimensional data that has more than 3 features, or dimensions, of interest. In supervised machine learning, we may use this data for training and classification, for example, and may reduce the dimensions to speed up training. In unsupervised learning, we use this type of data for visualization and clustering. In single-cell RNA sequencing (scRNA-seq), for example, we accumulate measurements of tens of thousands of genes per cell for upwards of a million cells. That’s a lot of data, and it provides a window into each cell’s identity, state, and other properties. More importantly, these properties put the cell in relation to the myriad other cells in the dataset. The result, however, is a massive matrix of 1,000,000 cells by 10,000 genes, with each gene representing a dimension or axis. How do we interpret such data? As humans in a 3-D world, we cannot see anything beyond the three physical dimensions, so we need a way to capture the essence of datasets like these without losing anything of value.

How can we compress this dataset to 2 or 3 dimensions such that the essential information is retained? Your first instinct may be to use PCA or tSNE to project the dataset into a 2-D embedding. However, each of these methods carries key tradeoffs that may lead to erroneous conclusions about the dataset during downstream analysis. PCA captures a global representation of the data at the cost of local relationships between data-points (or cells, in our case). Furthermore, PCA is a linear method, which does not accurately capture complex datasets such as single-cell RNA sequencing data, where a myriad of cell types defined by distinct gene expression profiles may simultaneously undergo processes such as cell division, differentiation (the specialization of stem cells into more mature cell types like neurons), and metabolism. tSNE, on the other hand, is a non-linear method that does a better job of preserving local relationships, but at the cost of shattering the global picture. In stem cell differentiation, for example, stem cells do not turn into, say, a neuron at the flick of a switch (or set of genes, in this case), but rather go through a continuum of gradual changes in their transcriptomic profiles. We often describe this continuum as a trajectory. The problem with tSNE is that it shatters these trajectories, resulting in disjointed clusters of cells, with little information on how one cell type relates to another.

So now that I’ve illustrated the downstream issues with this tradeoff, how can we resolve it? That is where PHATE enters the picture. PHATE, which stands for Potential of Heat diffusion for Affinity-based Transition Embedding, is a newcomer to the world of dimensionality reduction. Like tSNE, it is a non-linear, unsupervised technique. However, unlike tSNE, which preserves the local structure of high-dimensional data at the cost of the global structure, PHATE combines the best of PCA and tSNE, preserving both local and global relationships between data-points to accurately reflect the high-dimensional dataset in question.

We will continue to use scRNA-seq data as an example. With this dataset, we essentially have an m-by-n matrix of m cells (rows) and n genes (columns), whose entries represent discrete counts of messenger RNA molecules. We first compute a square matrix of the Euclidean distance between each pair of cells, which is simply the length of the line segment between two points given their Cartesian coordinates. In the context of scRNA-seq, these Cartesian coordinates are the gene expression measurements, so intuitively, we would expect cells a short distance apart to be very similar in gene expression, and hence similar cell types, whereas cells further apart have very different gene expression patterns, and hence reflect starkly different cell types (e.g., a neuron and a red blood cell). However, this metric doesn’t always lend itself well to that interpretation. This is because of the Curse of Dimensionality: with too many dimensions, the data-points in your dataset can appear nearly equidistant from one another, making it hard to identify trends in your data, derive meaningful clusters and local neighborhoods, or determine other types of patterns. To address this problem, we convert our distances to affinities, which quantify local similarities between observations in our data. (This is where the “A” in PHATE, for “Affinity-based”, comes from!) These affinities are inversely related to the distances: the further apart two observations are in Euclidean space, the smaller their affinity; likewise, the closer they are, the greater their affinity. Affinities are commonly calculated by transforming the Euclidean distances with a kernel function. In very simple terms, a kernel is your probability mass or density function minus the factors/coefficients that normalize it to ensure that probabilities lie between 0 and 1. Kernels are used often in other machine learning methods such as support vector machines. One popular kernel, as you might guess, is the Gaussian kernel:

K(x, y) = exp( −‖x − y‖² / ε )

Where x and y are coordinates in a high-dimensional space X, and ε is a bandwidth parameter that sets the “spread” or radius of the neighborhoods captured by the kernel. The authors of the PHATE paper use a slightly more advanced kernel function that does a better job of quantifying similarities, whose details I will spare for brevity. While this is a handy trick for preserving the local structure of our dataset, embedding these affinities alone would fracture the global structure, as in the case of tSNE. Hence, in addition to retaining the local structure, PHATE’s other objective is to maintain the global relationships across the data.

To achieve this, PHATE uses the affinities to “diffuse” through the data via a Markov random walk. By diffuse, we mean the net spread from a region of high concentration to one of low concentration. In the context of affinities, this can be thought of as going from high affinities (i.e., a dense cluster of cells in our high-dimensional dataset) to lower affinities (i.e., more spread-out cells). More intuitively, we can think of this as the spread of heat in a room from a warm source (e.g., a fireplace) to a less warm area (e.g., you on your couch), which can be modeled mathematically by the heat equation, whose solution is the heat kernel. This is where the “H” in PHATE, for “Heat diffusion”, comes from. A random walk, meanwhile, is a trail of successive random steps through our high-dimensional space (i.e., transitions from cell to cell), where each possible step or transition has a defined probability. You can think of this as heat in a room randomly spreading toward one corner at first and then switching directions; more broadly, this is known as a stochastic or random process. In our context, the probability of moving from cell i to cell j depends only on the cell we last visited. Now, with all that terminology out of the way, let’s see how this works in action!
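To make the kernel step concrete, here is a small NumPy sketch (the expression values are made up for illustration) that converts pairwise Euclidean distances into Gaussian-kernel affinities. Note that PHATE itself uses its adaptive, alpha-decaying kernel rather than this plain Gaussian:

```python
import numpy as np

# Toy "expression matrix": 5 cells (rows) x 3 genes (columns); values are made up
X = np.array([
    [1.0, 0.0, 2.0],
    [1.1, 0.1, 1.9],   # very similar to cell 0
    [5.0, 4.0, 0.0],
    [5.2, 3.9, 0.1],   # very similar to cell 2
    [3.0, 2.0, 1.0],
])

# Pairwise squared Euclidean distances between cells
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

# Gaussian kernel: K(x, y) = exp(-||x - y||^2 / epsilon)
epsilon = 2.0  # bandwidth: controls the neighborhood "radius"
K = np.exp(-sq_dists / epsilon)

# Nearby cells get affinities near 1; distant cells get affinities near 0
print(K[0, 1])  # cells 0 and 1 are close: high affinity
print(K[0, 2])  # cells 0 and 2 are far: affinity near zero
```

Shrinking ε tightens the neighborhoods (affinities fall off faster with distance), while growing it blurs them together.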

We first calculate the initial probabilities of our random walk by row-normalizing our previously calculated affinities. This produces the following:

P(x, y) = k(x, y) / ‖k(x, ·)‖₁

where

‖k(x, ·)‖₁ = Σ_y k(x, y)

This gives us an N-by-N matrix of transition probabilities, where each entry is the probability of moving from cell x to cell y within a single time step.

A simpler term for this matrix, used in the PHATE paper, is the diffusion operator. Mathematically, to perform diffusion, we raise the diffusion operator to an optimal number of time steps t to learn the global structure of our data. This gives us the probability of transitioning from cell x to cell y in t time steps. With larger t, we cover more distance in the high-dimensional space and learn more about the global structure without getting bogged down by the locality of our single-step, affinity-based probabilities. You can think of this as scouting a hiking trail to build a map of the surroundings: every few steps, you lay down a marker at a notable spot (e.g., a large tree, a river bank), rather than setting a marker with every step. This lets you build a general map of the area without recording every single twig and branch. For the sake of brevity, I’ll spare the technical details of how the optimal step count t is chosen using von Neumann entropy, which are described in the PHATE paper, but this all feeds into the “T” of our algorithm, for “Transition”.
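In code, the diffusion operator is just a row-normalized affinity matrix, and diffusing for t steps is matrix exponentiation. A minimal NumPy sketch, with a hypothetical 3-cell affinity matrix and a hand-picked t rather than the entropy-selected one:

```python
import numpy as np

# Hypothetical symmetric affinities between 3 cells
K = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

# Row-normalize into a Markov transition matrix: the diffusion operator
P = K / K.sum(axis=1, keepdims=True)

# Raise to the t-th power: P_t[x, y] is the probability of reaching
# cell y from cell x after a t-step random walk
t = 16
P_t = np.linalg.matrix_power(P, t)

# Each row is still a valid probability distribution
print(P_t.sum(axis=1))  # all ones, up to floating point
```

As t grows, each row of P_t mixes information from ever-larger neighborhoods of the graph, which is exactly the global-structure learning described above.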

Alright, we’ve got affinities to capture local relationships between nearby cells in one hand, and a powered diffusion operator to learn our global space in the other. Now what? Embed the powered diffusion operator directly? Not so fast. One limitation is that the resulting probability-based distances between cells (or diffusion distances, as the authors define them) are not very sensitive to distances between far-away points and can suffer stability issues near the boundary of the high-dimensional space (the paper goes into more detail on these shortcomings). This can be resolved, however, via the first letter of our acronym, “P”, which stands for Potential. The authors define a clever metric called the potential distance, inspired by information theory and stochastic dynamics, which measures the distance between log-transformed probabilities from the powered diffusion operator. This increases the sensitivity of the resulting distances and enables PHATE to preserve both local and global structure for visualization purposes. Mathematically, it is defined as follows:

Uₜ(x, y) = ‖ log(pₓᵗ) − log(p_yᵗ) ‖₂

where

pₓᵗ refers to row x of our diffusion operator, which you’ll recall is our transition probability matrix raised to the t’th power.

To see why this distance metric is more sensitive, suppose the transition probability from cell a to cell b is 0.04, while from a to c it is 0.05. The raw difference between these diffusion-based probabilities, 0.01, is tiny, and a lower-dimensional projection may struggle to encapsulate such a relationship. However, if we log-transform the probabilities and take their distance, we obtain a larger value of 0.223, the same as if the probabilities had been 0.4 and 0.5 respectively (recall that log a − log b = log(a/b)). In other words, the potential distance responds to fold-changes rather than absolute differences. Pretty neat!
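We can check this arithmetic directly:

```python
import numpy as np

# Hypothetical transition probabilities from cell a to cells b and c
p_ab, p_ac = 0.04, 0.05

# Raw diffusion distance: tiny, and hard to preserve in a 2-D embedding
raw_dist = abs(p_ac - p_ab)
print(round(raw_dist, 3))   # 0.01

# Potential distance: compare the log-transformed probabilities instead
log_dist = abs(np.log(p_ac) - np.log(p_ab))
print(round(log_dist, 3))   # 0.223

# Only the fold-change matters: 0.4 vs 0.5 gives the same log distance
assert np.isclose(abs(np.log(0.5) - np.log(0.4)), log_dist)
```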

Alright, we’re ready for the “E” part of PHATE: “Embedding”! With diffusion metrics like these, one commonly performs an eigendecomposition (i.e., breaking the matrix, in this case our powered diffusion operator, into its eigenvectors) to derive a diffusion map of the data; diffusion maps were a popular approach for studying differentiation trajectories in scRNA-seq data. The problem with this approach, however, is that it splits trajectories across a myriad of eigenvectors reflecting the diffusion components. This high intrinsic dimensionality consequently renders diffusion maps unamenable to visualization. To bypass this limitation, the authors embed the potential distance matrix using metric Multidimensional Scaling (metric MDS). This is an embedding method tailored to distance matrices that works by minimizing what’s known as a “stress” function:

stress(y₁, …, y_N) = √( Σᵢ<ⱼ ( ‖yᵢ − yⱼ‖ − Uₜ(xᵢ, xⱼ) )² / Σᵢ<ⱼ Uₜ(xᵢ, xⱼ)² )

While this equation may look intimidating, it is essentially measuring goodness of fit: how well the embedded coordinates reproduce the distances in the higher-dimensional data we seek to visualize. The smaller the stress, the better the fit. Thus, if the stress of the embedded points is zero, the data has been perfectly captured in the MDS embedding. Small, non-zero values may occur as a result of noise or a small number of MDS dimensions (i.e., m = 2 or 3). However, this level of distortion is generally tolerable so long as trajectories and other key features are preserved.
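As a sanity check on the “zero stress means a perfect fit” claim, here is one common form of the stress function (exact normalizations vary between MDS implementations), applied to an embedding that reproduces its input distances exactly:

```python
import numpy as np

def pairwise_dists(Y):
    """Pairwise Euclidean distances between the rows of Y."""
    return np.sqrt(np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1))

def stress(D, Y):
    """Kruskal-style stress between target distances D (e.g. potential
    distances) and the pairwise distances of an embedding Y."""
    D_emb = pairwise_dists(Y)
    iu = np.triu_indices_from(D, k=1)      # count each pair once
    num = np.sum((D[iu] - D_emb[iu]) ** 2)
    den = np.sum(D[iu] ** 2)
    return np.sqrt(num / den)

# Three points whose 2-D coordinates ARE the embedding: stress is exactly 0
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
D = pairwise_dists(Y)
print(stress(D, Y))      # 0.0 -> perfect fit
print(stress(D, 2 * Y))  # a distorted embedding -> positive stress
```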

Everything I’ve discussed up to this point is visually summarized in the method-overview figure from the authors of the PHATE paper.

If you’ve stuck around this long, congratulations! We’re now ready to learn how to implement PHATE in Python and see it in action! For simplicity, we’ll use the smaller version of the popular MNIST dataset where we have 8-by-8 images of hand-drawn digits (you can find an example from the authors of the paper applying PHATE to real scRNA-seq data here).

Install the PHATE library using pip or pip3:

pip install phate

Create a new Python file in your favorite editor, or a new Jupyter notebook, and run the following:

Let’s run this program to see the results:

python3 mnist_phate.py

Looking at the PHATE embedding, you can see that we have generally clear separation of the digits, while the clusters themselves have unique shapes and distributions. Let’s see how this compares to PCA and tSNE:

PCA (left) overcrowds the data, making it difficult to draw concrete conclusions. tSNE (center), on the other hand, does a nice job of separating the digits into balled-up clusters, but at the cost of a clear global structure of how the digits relate to one another, as well as suppressing the unique spread of data within clusters. PHATE (right) reconciles these issues: you can clearly make out the clusters and get a sense of how they relate to one another (e.g., 3 and 9 having a lot of similarity). For digits this isn’t such a big deal, but in single-cell biology, when we’re studying continuous processes such as stem cells turning into neurons, that added detail of how cells transition from one cell type to another is very useful.

Source: https://towardsdatascience.com/why-you-should-be-using-phate-for-dimensionality-reduction-f202ef385eb7

PHATE - Potential of Heat-diffusion for Affinity-based Trajectory Embedding¶


PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding) is a tool for visualizing high dimensional data. PHATE uses a novel conceptual framework for learning and visualizing the manifold to preserve both local and global distances.

To see how PHATE can be applied to datasets such as facial images and single-cell data from human embryonic stem cells, check out our Nature Biotechnology publication.

Moon, van Dijk, Wang, Gigante et al. Visualizing Transitions and Structure for Biological Data Exploration. 2019. Nature Biotechnology.

Quick Start¶

If you have loaded a data matrix in Python (cells on rows, genes on columns) you can run PHATE as follows:

import phate
phate_op = phate.PHATE()
data_phate = phate_op.fit_transform(data)

PHATE accepts the following data types: numpy.ndarray, scipy.sparse.spmatrix, pandas.DataFrame, and anndata.AnnData.

Usage¶

To run PHATE on your dataset, create a PHATE operator and run fit_transform. Here we show an example with an artificial tree:

import phate
tree_data, tree_clusters = phate.tree.gen_dla()
phate_operator = phate.PHATE(k=15, t=100)
tree_phate = phate_operator.fit_transform(tree_data)
phate.plot.scatter2d(phate_operator, c=tree_clusters)  # or phate.plot.scatter2d(tree_phate, c=tree_clusters)
phate.plot.rotate_scatter3d(phate_operator, c=tree_clusters)

Help¶

If you have any questions or require assistance using PHATE, please contact us at https://krishnaswamylab.org/get-help

class phate.PHATE(n_components=2, knn=5, decay=40, n_landmark=2000, t='auto', gamma=1, n_pca=100, mds_solver='sgd', knn_dist='euclidean', mds_dist='euclidean', mds='metric', n_jobs=1, random_state=None, verbose=1, potential_method=None, alpha_decay=None, njobs=None, k=None, a=None, **kwargs)[source]

PHATE operator which performs dimensionality reduction.

Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE) embeds high dimensional single-cell data into two or three dimensions for visualization of biological progressions as described in Moon et al, 2017 [1].

Parameters:
  • n_components (int, optional, default: 2) – number of dimensions in which the data will be embedded
  • knn (int, optional, default: 5) – number of nearest neighbors on which to build kernel
  • decay (int, optional, default: 40) – sets decay rate of kernel tails. If None, alpha decaying kernel is not used
  • n_landmark (int, optional, default: 2000) – number of landmarks to use in fast PHATE
  • t (int, optional, default: 'auto') – power to which the diffusion operator is powered. This sets the level of diffusion. If ‘auto’, t is selected according to the knee point in the Von Neumann Entropy of the diffusion operator
  • gamma (float, optional, default: 1) – Informational distance constant between -1 and 1. gamma=1 gives the PHATE log potential, gamma=0 gives a square root potential.
  • n_pca (int, optional, default: 100) – Number of principal components to use for calculating neighborhoods. For extremely large datasets, using n_pca < 20 allows neighborhoods to be calculated in roughly log(n_samples) time.
  • mds_solver ({'sgd', 'smacof'}, optional (default: 'sgd')) – which solver to use for metric MDS. SGD is substantially faster, but produces slightly less optimal results. Note that SMACOF was used for all figures in the PHATE paper.
  • knn_dist (string, optional, default: 'euclidean') – recommended values: ‘euclidean’, ‘cosine’, ‘precomputed’ Any metric from scipy.spatial.distance can be used distance metric for building kNN graph. Custom distance functions of form f(x, y) = d are also accepted. If ‘precomputed’, data should be an n_samples x n_samples distance or affinity matrix. Distance matrices are assumed to have zeros down the diagonal, while affinity matrices are assumed to have non-zero values down the diagonal. This is detected automatically using data[0,0]. You can override this detection with knn_dist=’precomputed_distance’ or knn_dist=’precomputed_affinity’.
  • mds_dist (string, optional, default: 'euclidean') – Distance metric for MDS. Recommended values: ‘euclidean’ and ‘cosine’ Any metric from scipy.spatial.distance can be used. Custom distance functions of form f(x, y) = d are also accepted
  • mds (string, optional, default: 'metric') – choose from [‘classic’, ‘metric’, ‘nonmetric’]. Selects which MDS algorithm is used for dimensionality reduction
  • n_jobs (integer, optional, default: 1) – The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used
  • random_state (integer or numpy.RandomState, optional, default: None) – The generator used to initialize SMACOF (metric, nonmetric) MDS If an integer is given, it fixes the seed Defaults to the global numpy random number generator
  • verbose (int or boolean, optional (default: 1)) – If True or > 0, print status messages
  • potential_method (deprecated.) – Use gamma=1 for log transformation and gamma=0 for square root transformation.
  • alpha_decay (deprecated.) – Use decay=None to disable alpha decay
  • njobs (deprecated.) – Use n_jobs to match sklearn standards
  • k (Deprecated for knn) –
  • a (Deprecated for decay) –
  • kwargs (additional arguments for graphtools.Graph) –
Attributes:
  • X (array-like, shape=[n_samples, n_dimensions]) – Input data
  • embedding (array-like, shape=[n_samples, n_components]) – Stores the position of the dataset in the embedding space
  • diff_op (array-like, shape=[n_samples, n_samples] or [n_landmark, n_landmark]) – The diffusion operator built from the graph
  • graph (graphtools.base.BaseGraph) – The graph built on the input data
  • optimal_t (int) – The automatically selected t, when t = ‘auto’. When t is given, optimal_t is None.

Examples

>>> import phate
>>> import matplotlib.pyplot as plt
>>> tree_data, tree_clusters = phate.tree.gen_dla(n_dim=100, n_branch=20,
...                                               branch_length=100)
>>> tree_data.shape
(2000, 100)
>>> phate_operator = phate.PHATE(knn=5, decay=20, t=150)
>>> tree_phate = phate_operator.fit_transform(tree_data)
>>> tree_phate.shape
(2000, 2)
>>> phate.plot.scatter2d(tree_phate, c=tree_clusters)

References

[1] Moon, van Dijk, Wang, Gigante et al. Visualizing Transitions and Structure for Biological Data Exploration. 2019. Nature Biotechnology.

diff_op – The diffusion operator calculated from the data

diff_potential – Interpolates the PHATE potential to one entry per cell

This is equivalent to calculating infinite-dimensional PHATE, or running PHATE without the MDS step.

Returns: diff_potential
Return type: ndarray, shape=[n_samples, min(n_landmark, n_samples)]
fit(X)[source]

Computes the diffusion operator

Parameters:X (array, shape=[n_samples, n_features]) – input data with n_samples samples and n_dimensions dimensions. Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData. If knn_dist is ‘precomputed’, data should be a n_samples x n_samples distance or affinity matrix
Returns:
  • phate_operator (PHATE)
  • The estimator object
fit_transform(X, **kwargs)[source]

Computes the diffusion operator and the position of the cells in the embedding space

Parameters:
  • X (array, shape=[n_samples, n_features]) – input data with n_samples samples and n_dimensions dimensions. Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData If knn_dist is ‘precomputed’, data should be a n_samples x n_samples distance or affinity matrix
  • kwargs – further keyword arguments, as specified in PHATE.transform()
Returns:

embedding – The cells embedded in a lower dimensional space using PHATE

Return type:

array, shape=[n_samples, n_dimensions]

reset_mds(**kwargs)[source]

Deprecated. Reset parameters related to multidimensional scaling

Parameters:
  • n_components (int, optional, default: None) – If given, sets number of dimensions in which the data will be embedded
  • mds (string, optional, default: None) – choose from [‘classic’, ‘metric’, ‘nonmetric’] If given, sets which MDS algorithm is used for dimensionality reduction
  • mds_dist (string, optional, default: None) – recommended values: ‘euclidean’ and ‘cosine’ Any metric from scipy.spatial.distance can be used If given, sets the distance metric for MDS
reset_potential(**kwargs)[source]

Deprecated. Reset parameters related to the diffusion potential

Parameters:
  • t (int or 'auto', optional, default: None) – Power to which the diffusion operator is powered If given, sets the level of diffusion
  • potential_method (string, optional, default: None) – choose from [‘log’, ‘sqrt’] If given, sets which transformation of the diffusional operator is used to compute the diffusion potential
set_params(**params)[source]

Set the parameters on this estimator.

Any parameters not given as named arguments will be left at their current value.

Parameters:
  • n_components (int, optional, default: 2) – number of dimensions in which the data will be embedded
  • knn (int, optional, default: 5) – number of nearest neighbors on which to build kernel
  • decay (int, optional, default: 40) – sets decay rate of kernel tails. If None, alpha decaying kernel is not used
  • n_landmark (int, optional, default: 2000) – number of landmarks to use in fast PHATE
  • t (int, optional, default: 'auto') – power to which the diffusion operator is powered. This sets the level of diffusion. If ‘auto’, t is selected according to the knee point in the Von Neumann Entropy of the diffusion operator
  • gamma (float, optional, default: 1) – Informational distance constant between -1 and 1. gamma=1 gives the PHATE log potential, gamma=0 gives a square root potential.
  • n_pca (int, optional, default: 100) – Number of principal components to use for calculating neighborhoods. For extremely large datasets, using n_pca < 20 allows neighborhoods to be calculated in roughly log(n_samples) time.
  • mds_solver ({'sgd', 'smacof'}, optional (default: 'sgd')) – which solver to use for metric MDS. SGD is substantially faster, but produces slightly less optimal results. Note that SMACOF was used for all figures in the PHATE paper.
  • knn_dist (string, optional, default: 'euclidean') – recommended values: ‘euclidean’, ‘cosine’, ‘precomputed’ Any metric from scipy.spatial.distance can be used distance metric for building kNN graph. Custom distance functions of form f(x, y) = d are also accepted. If ‘precomputed’, data should be an n_samples x n_samples distance or affinity matrix. Distance matrices are assumed to have zeros down the diagonal, while affinity matrices are assumed to have non-zero values down the diagonal. This is detected automatically using data[0,0]. You can override this detection with knn_dist=’precomputed_distance’ or knn_dist=’precomputed_affinity’.
  • mds_dist (string, optional, default: 'euclidean') – recommended values: ‘euclidean’ and ‘cosine’ Any metric from scipy.spatial.distance can be used distance metric for MDS
  • mds (string, optional, default: 'metric') – choose from [‘classic’, ‘metric’, ‘nonmetric’]. Selects which MDS algorithm is used for dimensionality reduction
  • n_jobs (integer, optional, default: 1) – The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used
  • random_state (integer or numpy.RandomState, optional, default: None) – The generator used to initialize SMACOF (metric, nonmetric) MDS If an integer is given, it fixes the seed Defaults to the global numpy random number generator
  • verbose (int or boolean, optional (default: 1)) – If True or > 0, print status messages
  • k (Deprecated for knn) –
  • a (Deprecated for decay) –

Examples

>>> import phate
>>> import matplotlib.pyplot as plt
>>> tree_data, tree_clusters = phate.tree.gen_dla(n_dim=50, n_branch=5,
...                                               branch_length=50)
>>> tree_data.shape
(250, 50)
>>> phate_operator = phate.PHATE(k=5, a=20, t=150)
>>> tree_phate = phate_operator.fit_transform(tree_data)
>>> tree_phate.shape
(250, 2)
>>> phate_operator.set_params(n_components=10)
PHATE(a=20, alpha_decay=None, k=5, knn_dist='euclidean', mds='metric',
      mds_dist='euclidean', n_components=10, n_jobs=1, n_landmark=2000,
      n_pca=100, njobs=None, potential_method='log', random_state=None,
      t=150, verbose=1)
>>> tree_phate = phate_operator.transform()
>>> tree_phate.shape
(250, 10)
>>> # plt.scatter(tree_phate[:,0], tree_phate[:,1], c=tree_clusters)
>>> # plt.show()
Returns:
Return type:self
transform(X=None, t_max=100, plot_optimal_t=False, ax=None)[source]

Computes the position of the cells in the embedding space

Parameters:
  • X (array, optional, shape=[n_samples, n_features]) – input data with n_samples samples and n_dimensions dimensions. Not required, since PHATE does not currently embed cells not given in the input matrix to PHATE.fit(). Accepted data types: numpy.ndarray, scipy.sparse.spmatrix, pd.DataFrame, anndata.AnnData. If knn_dist is ‘precomputed’, data should be a n_samples x n_samples distance or affinity matrix
  • t_max (int, optional, default: 100) – maximum t to test if t is set to ‘auto’
  • plot_optimal_t (boolean, optional, default: False) – If true and t is set to ‘auto’, plot the Von Neumann entropy used to select t
  • ax (matplotlib.axes.Axes, optional) – If given and plot_optimal_t is true, plot will be drawn on the given axis.
Returns:
  • embedding (array, shape=[n_samples, n_dimensions])
  • The cells embedded in a lower dimensional space using PHATE

© Copyright 2017 Krishnaswamy Lab, Yale University.

Source: https://phate.readthedocs.io/

Lab Projects

Handling the vast amounts of single-cell RNA-sequencing and CyTOF data, which are now being generated in patient cohorts, presents a computational challenge due to the noise, complexity, sparsity and batch effects present. Here, we propose a unified deep neural network-based approach to automatically process and extract structure from these massive datasets.

Our unsupervised architecture, called SAUCIE (Sparse Autoencoder for Unsupervised Clustering, Imputation, and Embedding), simultaneously performs several key tasks for single-cell data analysis including 1) clustering, 2) batch correction, 3) visualization, and 4) denoising/imputation. SAUCIE is trained to recreate its own input after reducing its dimensionality in a 2-D embedding layer which can be used to visualize the data.

Additionally, SAUCIE uses two novel regularizations: (1) an information dimension regularization to penalize entropy as computed on normalized activation values of the layer, and thereby encourage binary-like encodings that are amenable to clustering and (2) a Maximal Mean Discrepancy penalty to correct batch effects. Thus SAUCIE has a single architecture that denoises, batch-corrects, visualizes and clusters data using a unified representation. We show results on artificial data where ground truth is known, as well as mass cytometry data from dengue patients, and single-cell RNA-sequencing data from embryonic mouse brain.

 

Source: https://www.krishnaswamylab.org/projects

PHATE - Visualizing Transitions and Structure for Biological Data Exploration


Quick Start

If you would like to get started using PHATE, check out the following tutorials.

Introduction

PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding) is a tool for visualizing high dimensional data. PHATE uses a novel conceptual framework for learning and visualizing the manifold to preserve both local and global distances.

To see how PHATE can be applied to datasets such as facial images and single-cell data from human embryonic stem cells, check out our publication in Nature Biotechnology.

Moon, van Dijk, Wang, Gigante et al. Visualizing Transitions and Structure for Biological Data Exploration. 2019. Nature Biotechnology.

PHATE has been implemented in Python >=3.5, MATLAB and R.


System Requirements

All other software dependencies are installed automatically when installing PHATE.

Python

Installation with pip

The Python version of PHATE can be installed by running the following from a terminal:

pip install phate

Installation of PHATE and all dependencies should take no more than five minutes.

Installation from source

The Python version of PHATE can be installed from GitHub by running the following from a terminal:

git clone https://github.com/KrishnaswamyLab/PHATE.git
cd PHATE/Python
python setup.py install --user

Quick Start

If you have loaded a data matrix in Python (cells on rows, genes on columns) you can run PHATE as follows:

import phate
phate_op = phate.PHATE()
data_phate = phate_op.fit_transform(data)

PHATE accepts the following data types: numpy.ndarray, scipy.sparse.spmatrix, pandas.DataFrame, and anndata.AnnData.

Tutorial and Reference

For more information, read the documentation on ReadTheDocs or view our tutorials on GitHub: single-cell RNA-seq, artificial tree. You can also access interactive versions of these tutorials on Google Colaboratory: single-cell RNA-seq, artificial tree.

MATLAB

Installation

The MATLAB version of PHATE can be accessed by running the following from a terminal:

git clone https://github.com/KrishnaswamyLab/PHATE.git

Then, add the PHATE/Matlab directory to your MATLAB path.

Installation of PHATE should take no more than five minutes.

Tutorial and Reference

Run any of our scripts to get a feel for PHATE. Documentation is available in the MATLAB help viewer.

R

In order to use PHATE in R, you must also install the Python package.

If Python or pip are not installed, you will need to install them. We recommend Miniconda3 to install Python and pip together, or otherwise you can install pip from https://pip.pypa.io/en/stable/installing/.

Installation from CRAN and PyPi

First install phate in Python by running the following code from a terminal:

pip install phate

Then, install phateR from CRAN by running the following code in R:

install.packages("phateR")

Installation of PHATE and all dependencies should take no more than five minutes.

Installation with devtools and reticulate

The development version of PHATE can be installed directly from R with devtools:

devtools::install_github("KrishnaswamyLab/phateR")

Installation from source

The latest source version of PHATE can be accessed by running the following in a terminal:

If the PHATE/phateR folder is empty, you may have forgotten to use the --recursive option for git clone. You can rectify this by running the following in a terminal:

Quick Start

If you have loaded a data matrix in R (cells on rows, genes on columns) you can run PHATE as follows:
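A minimal sketch, with a random matrix standing in for a real cells-by-genes matrix:

```r
library(phateR)
data <- matrix(rnorm(100 * 50), nrow = 100)  # stand-in for a real data matrix
data_phate <- phate(data)
```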

phateR accepts R matrices, Matrix sparse matrices, data.frames, and any other data type that can be converted to a matrix with the function as.matrix().

Tutorial and Reference

For more information and a tutorial, read the phateR README. Documentation is available at https://CRAN.R-project.org/package=phateR/phateR.pdf or in the R help viewer with help(phate). A tutorial notebook running PHATE on a single-cell RNA-seq dataset is available at http://htmlpreview.github.io/?https://github.com/KrishnaswamyLab/phateR/blob/master/inst/examples/bonemarrow_tutorial.html.

Help

If you have any questions or require assistance using PHATE, please contact us at https://krishnaswamylab.org/get-help.

Source: https://github.com/KrishnaswamyLab/PHATE


Visualizing structure and transitions in high-dimensional biological data

The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data.

Source: https://pubmed.ncbi.nlm.nih.gov/31796933/

3. Visualizing data using PHATE

Once you’ve inspected the principal components of your dataset, it’s time to start visualizing your data using PHATE. We’re going to demonstrate PHATE analysis on a few datasets. We will show:

  • How PHATE works
  • Running PHATE on several datasets
  • How to interpret a PHATE plot
  • Clustering using the diffusion potential
  • How to pick parameters for PHATE
  • How to troubleshoot common issues with PHATE plots

3.0 - What is PHATE and why should you use it?

PHATE is a dimensionality reduction method developed by the Krishnaswamy lab for visualizing high-dimensional data. We use PHATE for every dataset that comes through the lab: scRNA-seq, CyTOF, gut microbiome profiles, simulated data, etc. PHATE was designed to handle noisy, non-linear relationships between data points. PHATE produces a low-dimensional representation that preserves both local and global structure in a dataset so that you can generate hypotheses from the plot about the relationships between the cells present in the dataset. Although PHATE has utility for analysis of many data modalities, we will focus on the application of PHATE to scRNA-seq analysis.

PHATE is inspired by diffusion maps (Coifman et al. (2008)), but includes several key innovations that make it possible to generate a two- or three-dimensional visualization that preserves continuous relationships between cells where they exist. For a full explanation of the PHATE algorithm, please consult the PHATE manuscript.

Installing PHATE

PHATE is available on PyPI, and can be installed by running the following command in a terminal:

PHATE is also available in R and MATLAB, but we’re going to focus on the Python implementation for this tutorial.

3.1 PHATE parameters

As we mentioned in the previous section, PHATE follows the scikit-learn estimator API, so it works just like operators such as sklearn.decomposition.PCA. To use PHATE, you must first instantiate a PHATE object, and then use fit_transform to build a graph from your data and reduce the dimensionality of the data for visualization.

PHATE has three key parameters:

  • n_components - sets the number of dimensions to which PHATE will reduce the input data
  • knn - sets the number of nearest neighbors to use for calculating the kernel bandwidth
  • decay - sets the rate of decay of the kernel tails

I’ll introduce other parameters throughout the tutorial, but you should also check out the full documentation on readthedocs: https://phate.readthedocs.io/en/stable/.

3.2 Wait a second, what’s a kernel?

Now, if you’ve never studied graph theory or discrete mathematics, these last two parameters might be confusing. To understand them, consider a very simple graph, the k-Nearest Neighbor (kNN) graph. Here, each cell is a node in the graph, and edges exist between a cell and its k nearest neighbors. You can also think about this as a graph where all cells are connected, but the connections between non-neighboring cells have a strength or weight of 0, while the edges between neighboring cells have a weight of 1.
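To make this concrete, here is a small sketch (not part of the original tutorial) that builds a binary kNN graph with scikit-learn; every row of the resulting adjacency matrix has exactly k unit-weight edges:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(42)
points = rng.normal(size=(30, 2))  # 30 "cells" in 2-D

k = 4
# binary adjacency: weight 1 to each cell's k nearest neighbors, 0 elsewhere
adj = kneighbors_graph(points, n_neighbors=k, mode="connectivity")
```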

The kNN graph offers a very powerful representation for single cell data, but it also has some drawbacks. For example, consider the following graphs:

On the left, we have a kNN graph where k=4. Notice that all the blue cells, regardless of their proximity to the red cell, have edges of equal weight. Also notice that the green cell in the lower right hand corner has no connection to the red cell, despite being only trivially farther away than the next closest cell. These two properties of kNN graphs, a harsh cutoff for edges and uniform edge weights, mean that the choice of a proper k is critical for any method using this graph.

To overcome these limitations, we use a variation of the radial basis kernel graph on the right. This graph connects cells with edge weights proportional to their distance to the red cell. You can think of this kernel function as a “soft” kNN, where the weights vary smoothly between 0 and 1. Note: in practice, all cells are connected at one stage during graph instantiation, but then edges with weights below a small cutoff are set to 0.
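As an illustration of the idea (a sketch, not PHATE's exact kernel), an exponential kernel with a decay exponent turns distances into smoothly varying weights; a larger exponent makes the "soft" neighborhood cutoff sharper:

```python
import numpy as np

def soft_knn_weights(dists, bandwidth, alpha=4):
    """Decay-style kernel: weights fall smoothly from 1 toward 0 with distance."""
    return np.exp(-((dists / bandwidth) ** alpha))

dists = np.array([0.5, 1.0, 1.5, 2.0])
w = soft_knn_weights(dists, bandwidth=1.0)  # smoothly decreasing weights
```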

Now that you understand this distinction, you can think of the knn parameter as setting a baseline for finding close cells, and decay as setting the “softness” of the edge weighting.

3.3 Instantiating a PHATE estimator

In practice, I usually start by running PHATE on a new dataset with default parameters, which are n_components=2, knn=5, and decay=40. Rarely will I want to change these, but we will cover parameter tuning in a later section.

We can create a new PHATE object just like we did with the PCA class:

and now we can generate an embedding by calling fit_transform and plot the output with a scatter plot. Let’s start with the T cell data from Datlinger et al. (2017).

Note: you can also pass in the PCA-reduced data from earlier, but PHATE can also do this for you at the cost of increased compute time.

You’ll see some output that looks like this, telling you that PHATE is doing stuff:

Note: these times were measured with multithreading enabled; you may find slower performance on a laptop or when using sparse input.

3.4 Plotting PHATE with a scatter plot

Now we can draw a scatter plot of the embedding:

And see the following plot:

From this plot, we can tell that there are several kinds of cells in the data set, but it’s hard to tell much more than that without starting to add some color. Let’s plot the condition label for each cell, the library size, and the expression of a mitochondrial gene.

And we see the following plots:

Examining these plots, we see that there are no regions enriched for high mitochondrial RNA (which would indicate apoptotic cells). Although there are some regions of the plot with higher library size than others, these cells are fairly well distributed over the plot. If we saw a branch of cells shooting off the plot with very high or low library sizes, then we might want to revisit the filtering thresholds established during preprocessing.

Finally, when we examine the distribution of condition labels, we see that there is a good amount of overlap between the two conditions. We’ll get to characterizing the differences between these two conditions in a later tutorial. For now, check out our method, MELD (Manifold Enhancement of Latent Dimensions), on BioRxiv.

3.5 Plotting PHATE as a KDE plot

Scatter plots are a useful way to display data. They also have the appeal of showing you every single point in your dataset. Or do they?

Scatter plots have one drawback, which is that data points that are overlapping in the plot get drawn on top of each other. This makes it difficult to identify where most of the data density lies. You could color your plot by a density estimate, as is commonly done in FACS analysis. However, I find that in these plots, the eye is still drawn to outliers.

Instead, I think it’s useful to look at a Kernel Density Estimate (KDE) plot. These plots use kernel functions (just like in the graph building process) to estimate data density in one or more dimensions.

In Python you can make a KDE plot using the seaborn package. seaborn works seamlessly with numpy and pandas and uses matplotlib as a backend. The function for drawing a KDE plot is called, unsurprisingly, kdeplot.

Let’s generate a KDE plot of the T cell data to get an idea of how many clusters of cells are in the data.

Here, it’s much easier to see that there are around 5 regions of density in the data. Let’s cluster them and figure out what they are!

3.6 Clustering data using PHATE

A common mode of analysis for scRNA-seq is to identify clusters of cells and characterize the transcriptional diversity of these subpopulations. There are many clustering algorithms out there, and many of them have implementations in scikit-learn. One of our favorite clustering algorithms is Spectral Clustering. In the PHATE package, we have developed an adaptation of this method using the PHATE diffusion potential instead of the eigenvectors of the normalized Laplacian.

You can call this method with phate.cluster.kmeans, and use these labels to color the PHATE plot. We’ll set the number of clusters to 5 based on our examination of the KDE plot.
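The idea behind this can be sketched as k-means run on an intermediate representation rather than on raw counts; here synthetic blobs stand in for the diffusion potential coordinates a fitted PHATE operator would provide:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# stand-in for diffusion potential coordinates (cells x features);
# the real pipeline would cluster the potential from a fitted PHATE operator
potential = np.vstack([rng.normal(loc=c, size=(100, 10)) for c in range(5)])

# k-means on the intermediate representation yields one label per cell
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(potential)
```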

Here, we can see that the clusters are localized on the PHATE plot, and now we can begin to characterize them. Note, I won’t claim that this is the best method for identifying clusters. In general, cluster assignments should be seen as “best guess” partitions of the data. For every clustering algorithm, you can change some parameters and increase or decrease the number of clusters identified. The best way to make sure you have “valid” clusters is to inspect each group individually, and make sure you’re happy with the resolution of the results.

Source: https://dburkhardt.github.io/tutorial/visualizing_phate/


Potential of Heat-diffusion for Affinity-based Transition Embedding (PHATE)

Build Status

PHATE is a tool for visualizing high dimensional single-cell data with natural progressions or trajectories. PHATE uses a novel conceptual framework for learning and visualizing the manifold inherent to biological systems in which smooth transitions mark the progressions of cells from one state to another, as described in Moon, van Dijk, Wang et al. (2019), Visualizing Structure and Transitions in High-Dimensional Biological Data, Nature Biotechnology https://doi.org/10.1038/s41587-019-0336-3.

Use

Here we download a csv file containing raw scRNA-seq counts, preprocess it by filtering cells with fewer than 2000 counts, library-size normalizing, and applying a square root transform, then run PHATE and save the low-dimensional embedding results to phate_output.csv in your current working directory.
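The command itself isn't reproduced in this copy; as a sketch of the described preprocessing, with synthetic Poisson counts standing in for the downloaded csv:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# stand-in for the downloaded raw counts matrix (cells x genes)
counts = pd.DataFrame(rng.poisson(lam=25, size=(300, 100)))

# 1. filter cells with fewer than 2000 total counts
libsize = counts.sum(axis=1)
filtered = counts[libsize >= 2000]

# 2. library-size normalize: rescale each cell to the median library size
norm = filtered.div(filtered.sum(axis=1), axis=0) * filtered.sum(axis=1).median()

# 3. square root transform to stabilize variance
transformed = np.sqrt(norm)

# then run PHATE on `transformed` and write the embedding, e.g.:
#   embedding = phate.PHATE().fit_transform(transformed)
#   pd.DataFrame(embedding).to_csv("phate_output.csv")
```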

Validate

Run this command to confirm your container produces correct reference output:

Contact

Scott Gigante ([email protected])

Source: https://data.humancellatlas.org/analyze/methods/phate

