February 20, 2018

The Data Toolkit That Can Analyze More Than 1M Cells

Why the technology is empowering researchers to analyze once-impenetrable data sets.

F. Alexander Wolf, PhD, and his team at the Institute of Computational Biology (ICB) at Helmholtz Zentrum München, the German Research Center for Environmental Health, have already been to the “Data Science Bowl”—a sort of “championship game” for machine learning.

Now, they may be at the forefront of a monumental breakthrough in the analysis of single-cell gene expression data with the advent of SCANPY, a “scalable toolkit” offering “preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks.” In fact, SCANPY, which stands for Single-Cell Analysis in Python (the programming language), may be the only currently available software package that can analyze data sets containing more than 1 million cells, including the Human Cell Atlas, a reference database of maps designed to describe and define the cellular basis of health and disease, which was developed by an international team of researchers.

A summary of Wolf’s work with SCANPY to date was published on February 6 in the journal Genome Biology.

“The Human Cell Atlas could profit from SCANPY,” Wolf, team leader in machine learning at ICB, told Healthcare Analytics News™. “Generating a cell atlas of the whole human body poses unseen computational challenges; we're talking about analyzing millions and millions of cells here. SCANPY makes a very good effort of resolving this.”

Wolf—who developed the software with his colleague Philipp Angerer in the Machine Learning Group of Fabian Theis, PhD, professor of mathematical modelling of biological systems at the Technical University of Munich—said the team has been asked to present SCANPY to the computational analysis committee of the Human Cell Atlas later this year. The Human Cell Atlas is only 1 of many “exploding” data sets (to use Wolf’s description) in healthcare research that, to date, have confounded investigators. Currently available software systems for gene-expression analysis simply haven’t been able to process data sets of this magnitude.

A key to SCANPY’s capabilities lies in the programming language upon which it is based. Python, which is more commonly used in the machine learning field, enables software to be more intuitive than conventional biostatistics packages, which are typically written in the R programming language. With Python as its base, SCANPY is able to combine the preprocessing, cell visualization, and “pseudotemporal ordering” of separate systems in a single platform. Conventional systems treat each cell as a point in a coordinate system, characterized by its expression values for thousands of genes. SCANPY instead uses algorithms (modelled on those used by social media platforms) that place cells on a graph, mapping each cell by identifying its closest neighbors.
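To make the neighbor-graph idea concrete, here is a minimal sketch in plain Python. It is not SCANPY's code: the function name `knn_graph`, the toy expression values, and the choice of Euclidean distance are all assumptions made for illustration. The brute-force search shown here also would not scale to a million cells, where tools generally rely on faster approximate neighbor searches.

```python
import math

def knn_graph(cells, k):
    """Map each cell index to the indices of its k nearest cells (Euclidean distance)."""
    graph = {}
    for i, a in enumerate(cells):
        # Distance from cell i to every other cell, paired with that cell's index.
        dists = [(math.dist(a, b), j) for j, b in enumerate(cells) if j != i]
        dists.sort()
        # Keep only the indices of the k closest cells.
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Hypothetical toy data: 5 cells, each described by expression values for 3 genes.
cells = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.5, 0.5, 0.0],
]

graph = knn_graph(cells, k=2)
for cell, neighbors in graph.items():
    print(cell, "->", neighbors)
```

Once the graph is built, later steps such as clustering or ordering cells along a trajectory can work from each cell's short list of neighbors rather than from its full vector of gene expression values.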

In assessing its capabilities for the Genome Biology paper, Wolf and his colleagues found that SCANPY could perform specific cell analysis steps several times faster than existing platforms. They believe the platform is capable of analyzing 1.3 million cells in just a few hours, without subsampling.

“Quite generally, as soon as large data sets with many observations arise, [or] when you want to integrate data sets from many studies, SCANPY will either enable this or, if it’s possible already, make it much faster,” Wolf said. “Another goal is to use SCANPY as a back-end for data portals that are now created to simplify analyzing data for non-computational-expert users: visualizing cells, clustering them to find new cell types, finding trajectories and branchings, be it in the context of development, disease progression, or dose response, and finding the genes that mark all these effects in an interactive data exploration.”

Although SCANPY is still very much in the developmental stage, experts within the field believe it could have a significant impact on research in the short term. Martin Hemberg, PhD, of Wellcome Trust Sanger Institute, Cambridge, in the United Kingdom, who has expertise in bioinformatics, systems biology, and applied mathematics, told Healthcare Analytics News™ that he can see the software playing a role in “every area” of basic research because “it provides broad support for processing scRNA-seq data.

“Processing scRNA-seq data remains challenging today for 2 main reasons: 1) the field has not reached a consensus for what is the best practice; and 2) large volumes of data are computationally challenging to analyze,” continued Hemberg, who was not involved with the SCANPY project. “SCANPY provides a massive step forward and it makes it much more feasible for researchers to analyze data sets that previously were intractable.”
