AI-based natural language processing (NLP) to extract structured information about cellular phenotypes from the scientific literature

Project Details

Advances in single-cell transcriptomic technologies are enabling the discovery of many novel cell phenotypes. However, this emerging knowledge remains fragmented across the scientific literature. Natural language processing (NLP) using large language models (LLMs) and other artificial intelligence methods offers a promising way to extract and organize this information at scale. But the inconsistent nomenclature used to describe cell phenotypes in the literature limits the effectiveness of straightforward NLP approaches. More advanced NLP methods are needed, and these require well-annotated corpora for development and evaluation. We recently developed the NLM CellLink corpus, a collection of excerpts from full-text articles that contain information about cell phenotypes. The corpus was manually annotated with mentions of human and mouse cell types, which were linked to Cell Ontology (CL) identifiers. This PhD project will use the corpus to develop and evaluate machine learning models that automatically identify cell types, including novel cell types, in the scientific literature, together with their relationships to other key biological entities (e.g., marker genes, anatomical structures, perturbation responses, disease states). The extracted information will be translated into standardized, semantically structured assertions and incorporated into the NLM Cell Knowledge Network (NLM-CKN).
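
As a rough illustration of the task, the sketch below links a cell type mention to a Cell Ontology identifier by exact lookup and wraps the result in a simple subject-predicate-object assertion. It is a toy example under stated assumptions, not the project's method or the NLM-CKN schema: the three-entry lexicon, the `Assertion` class, the `link_mention` and `make_assertion` helpers, and the `has_marker` predicate are all hypothetical names introduced here. It also shows why straightforward matching falls short: a synonym such as "T lymphocyte" is missed by exact lookup.

```python
from dataclasses import dataclass
from typing import Optional

# Toy lexicon for illustration only: three Cell Ontology labels and identifiers.
CL_LEXICON = {
    "t cell": "CL:0000084",
    "b cell": "CL:0000236",
    "neuron": "CL:0000540",
}


@dataclass(frozen=True)
class Assertion:
    """A hypothetical structured assertion (not the NLM-CKN schema)."""
    subject: str    # Cell Ontology identifier of the cell type
    predicate: str  # relation name, e.g. the hypothetical "has_marker"
    obj: str        # related entity, e.g. a gene symbol


def link_mention(mention: str) -> Optional[str]:
    """Return the CL identifier for a mention via exact, case-insensitive lookup."""
    return CL_LEXICON.get(mention.strip().lower())


def make_assertion(mention: str, predicate: str, obj: str) -> Optional[Assertion]:
    """Build an assertion if the mention can be linked, otherwise return None."""
    cl_id = link_mention(mention)
    if cl_id is None:
        return None
    return Assertion(subject=cl_id, predicate=predicate, obj=obj)


if __name__ == "__main__":
    print(make_assertion("B cell", "has_marker", "CD19"))
    # Exact lookup fails on a synonym that is absent from the toy lexicon.
    print(link_mention("T lymphocyte"))
```

Handling synonyms, abbreviations, and previously undescribed cell types is where a trained model and an annotated corpus become necessary; a fixed lookup table cannot cover them.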

The student will join an interdisciplinary team at the National Library of Medicine that applies computational biology and data science techniques to characterize cellular phenotypes and their roles in health and disease at scale. Training will be provided in advanced computational and statistical analysis of multi-omics data, natural language processing using artificial intelligence methods, and the development and use of ontologies and other semantic web technologies for biomedical knowledge representation.
