Computational Biology

AI-based natural language processing (NLP) to extract structured information about cellular phenotypes from the scientific literature

Project Details

Advances in single-cell transcriptomic technologies are enabling the discovery of many novel cell phenotypes. However, this emerging knowledge remains fragmented across the scientific literature. Natural language processing (NLP) using large language models (LLMs) and other artificial intelligence methods offers a promising way to extract and organize this information at scale. Yet the inconsistent nomenclature used to describe cell phenotypes in the literature limits the effectiveness of straightforward NLP approaches. More advanced NLP methods are needed, and these require well-annotated corpora for development and evaluation. We recently developed the NLM CellLink corpus, a collection of excerpts from full-text articles that contain information about cell phenotypes. The corpus was manually annotated with mentions of human and mouse cell types, which were linked to Cell Ontology (CL) identifiers. This PhD project will use the corpus to develop and evaluate AI and machine learning models that automatically identify cell types in the scientific literature, including novel cell types, together with their relationships to other key biological entities (e.g., marker genes, anatomical structures, perturbation responses, disease states). The extracted information will be translated into standardized, semantically structured assertions and incorporated into the NLM Cell Knowledge Network (NLM-CKN).

The student will join an interdisciplinary team at the National Library of Medicine that applies computational biology and data science techniques to characterize cellular phenotypes and their roles in health and disease at scale. Training will be provided in advanced computational and statistical analysis of multi-omics data, natural language processing using artificial intelligence methods, and the development and use of ontologies and other semantic web technologies for biomedical knowledge representation.

Artificial Intelligence approaches for demystifying cellular phenotypes through semantic knowledge networks

Project Details

Cells are the fundamental units of life. Single cell genomic technologies are revolutionizing our understanding of cellular phenotypes. Large single cell data consortia, including the NIH BRAIN Initiative and the Human BioMolecular Atlas Program (HuBMAP), have generated single cell atlas data from millions of cells/nuclei spanning multiple organs and biological systems. At the National Library of Medicine (NLM), we are building the NLM Cell Knowledge Network (http://cell-kn-mvp.org), a knowledgebase that focuses on representing the cell phenotypes and associated characteristics derived from single cell genomics data. It integrates data-driven information with knowledge from trustworthy reference ontologies, NCBI resources, and text mining efforts, resulting in a large-scale semantic knowledge network for innovative data mining and knowledge discovery.

This project consists of two main research components: 
i) developing novel computational methods for single cell and spatial transcriptomics analysis using machine learning and advanced statistics techniques, and 
ii) developing network analysis strategies for knowledge mining using cutting-edge artificial intelligence technologies. 

Students interested in one or both research components are encouraged to apply. The project team has an interdisciplinary background spanning molecular biology, genetics, statistics, and computer science, providing a strong support system for students’ academic growth. Dr. Yun (Renee) Zhang is a tenure-track investigator at NLM and an alumnus of the University of Oxford.
 

Computational methods to measure DNA replication with single-molecule resolution

Project Details

In the time it takes you to read this sentence, your body will produce millions of new cells. It is critical that each of them replicates its DNA accurately; errors in DNA replication can lead to genome instability and cancer. Cancerous cells often show different patterns of replication compared with healthy human cells, making DNA replication an important therapeutic target. However, studying DNA replication at scale is a challenging problem: existing methods either measure how a population of cells replicates, which “averages out” rare but important behaviour, or they work with single-molecule resolution but have low throughput.

The Boemo Group (https://www.boemogroup.org) is a computational biology laboratory developing artificial intelligence software that measures the movement of replication forks from Oxford Nanopore sequencing data. This method provides a high-throughput, inexpensive, accurate, and automated way to measure replication fork movement. The student will develop novel algorithms and computational approaches to track the movement of replication forks in both human cells and infectious microorganisms. The student will also develop mathematical models of DNA replication that can be used to predict targets for replication-based therapies. This project will be highly collaborative, and there will be the opportunity to learn, or improve upon, software engineering in Python/C/C++, GPU computing, deep learning with TensorFlow, and the processing and management of large datasets.

AI for quantitative modelling and prediction in cellular biology

Project Details

Our progress in understanding and engineering living systems, and in developing therapies, is severely limited by our inability to build predictive, data-driven models of cellular processes. Much of current cellular biology research, including work with human stem cells, microbes, and cell lines, proceeds by optically labelling cellular components such as proteins and by measuring and manipulating physiological signals optically. Microscope imaging is then used to track and quantify the interactions of these signals and components in living cells, including cells that have been genetically engineered or exposed to pharmacological agents. Quantities of interest, such as where proteins aggregate or how rapidly cells grow, are then extracted from images or movies and quantified. This is challenging, slow, and error-prone because the experiments are often done piecemeal, often by hand, and focus on a handful of types of molecules or cellular interactions that are inferred from a condensed snapshot of the data, such as an average protein density.

This project leverages recent advances in AI to analyse image data gathered from microbial populations (E. coli). Our goal is to build predictive models of processes such as cell division and virus infection using high-throughput microscope data. We approach this using a fusion of simulated and real data, with model-based predictions tested in automated, high-throughput experiments. We wish to scale this up to cover other types of cells, including human stem cells and microbiota, through collaboration with suitable groups at NIH. This project would suit trainees with strong quantitative skills, a first degree in a STEM discipline, and proficiency in coding in more than one language.

Mapping phenotypic variance in complex traits to genetic and non-genetic components using molecular data

Project Details

Genetics explains only a small proportion of phenotypic variance, with common diseases typically having 10%-30% heritability (Loh et al. 2017 Nature Genetics). This project aims to explain the remaining 70%-90% of variance using molecular data. Past efforts have attributed genetic variance to expression data (Yao et al. 2020 Nature Genetics) and different tissues (Amariuta et al. 2023 Nature Genetics), yet limited attention has been paid to the non-genetic variance. We aim to develop methods that provide an unbiased estimate of the environmental variance in complex traits that is mediated through molecular traits. Specifically, we are interested in the proportion of non-genetic variance that is mediated by gene expression, protein levels, and metabolite levels. We will use large-scale proteomic and metabolomic data linked to electronic health records to validate the model and provide a molecular explanation for common complex traits.

Computational modelling in large scale imaging datasets to understand hypertensive disease progression after pregnancy

Project Details

Our research group aims to understand hypertensive disease progression in women and their children following pregnancy complications, such as hypertensive pregnancy and preterm birth, in order to identify optimal approaches to reduce long-term risk. This includes the development of new clinical tools to identify, track, and slow disease progression, as well as novel interventions.

This project will apply computational modelling and machine learning to large-scale imaging datasets to study disease progression related to a hypertensive pregnancy across multiple modalities and organs. Insights into the key organ-level structural and functional changes that characterise each stage of disease will be used to identify potential intervention targets.

Furthermore, we will use imaging data collected within our ongoing clinical trials to help us understand how interventions modify the underlying disease development and how this could be incorporated in clinical practice to transform long-term patient outcomes after a hypertensive pregnancy.

Understanding the clonality of drug resistance in cancer

Project Details

We are interested in understanding the clonality of drug resistance in cancer, primarily in a haematological cancer called multiple myeloma (MM). To achieve our goals, we have developed state-of-the-art long-read single-cell sequencing approaches (termed scCOLOR-seq) that allow us to measure single clones within patient samples. This method allows for the simultaneous measurement of gene expression, exon mutations, exon SNPs, and translocations. We apply this technology and develop computational analysis solutions to understand the relationship between clonality and drug resistance in oncology. Our lab is also part of the Human Cell Atlas (HCA) project, and we have several international collaborations with immunologists and cancer biologists to support our work.

The aim of this project is to develop computational analysis strategies that better define the specific MM clones within patients that are resistant to first-line therapeutics. The work will involve combining long-read (Oxford Nanopore Technologies) multi-modal datasets and applying machine learning approaches to better define high-risk patients. This work is important for understanding drug resistance mechanisms in MM and for identifying patients who may respond less well to therapy.

Artificial intelligence in diagnostic prostate MRI to improve outcomes

Project Details

There has been increasing interest in applying computational methods in medicine to make sense of cancer’s ‘big data’ problem, exploiting recent advances in data processing and machine learning to capture and integrate clinical, genomic, and image data collated from hundreds of cancer patients in real time. Such methods can be applied to digital clinical images to extract information about patterns of pixels that are not perceivable to the human eye, allowing characterisation of tumours. Prostate cancer is the second most common cancer in men worldwide, and MRI is the diagnostic tool of choice. However, MRI can miss 10% of significant tumours and leads to unnecessary (invasive) biopsy in around one third of patients who do not have cancer.

We will use a prototype AI system (Pi), developed with Lucida Medical on retrospective data, in a prospective clinical study. We plan to link histological data to imaging features derived from MRI (including texture analysis) to identify predictors of lesion aggressiveness and of the need for sampling, using biopsy cores and surgical specimens from the prospective cohort. Further work will link biopsy tissue to MRI data to identify radiogenomic markers of disease aggressiveness. The project presents an opportunity for AI to answer key clinical questions at the intersection of interpretation, imaging, and biopsy.

The project will involve working with an established interdisciplinary programme of researchers and helping to assess cross-cutting “multi-omic” approaches to cancer assessment, which integrate advanced image analysis with transcriptomic, genomic, tissue, and patient outcome data to inform the design of diagnostic strategies.

Computational investigation of tumor microenvironment

Project Details

We are interested in a variety of topics, including “Deciphering cellular heterogeneity in tumor microenvironment using single cell data”, “Functional characterization of tumor infiltrating T cells”, “Links between embryonic development and cancer”, “Functional characterization of non-coding somatic mutations”, and “Methods for single cell omics”. These projects involve non-trivial methods development as well as sophisticated data analysis, but the focus is always on the biological question. Our lab is involved in several collaborations with immunologists and cancer biologists.

Combined Computational-Experimental Approaches to Predict Acute Systemic Toxicity.

NIH Mentor

Dr. Scott Auerbach, Dr. Nicole Kleinstreuer, and Dr. Nisha Sipes
