National Library of Medicine (NLM)

AI-based natural language processing (NLP) to extract structured information about cellular phenotypes from the scientific literature

Project Details

Advances in single-cell transcriptomic technologies are enabling the discovery of many novel cell phenotypes. However, this emerging knowledge remains fragmented across the scientific literature. Natural language processing (NLP) using large language models (LLMs) and other artificial intelligence methods offers a promising approach to extract and organize this information at scale. The inconsistent nomenclature used to describe cell phenotypes in the literature, however, limits the effectiveness of straightforward NLP approaches, so more advanced NLP methods are needed, and these require well-annotated corpora for development and evaluation. We recently developed the NLM CellLink corpus, a collection of excerpts from full-text articles that contain information about cell phenotypes. This corpus was manually annotated with mentions of human and mouse cell types, which were linked to Cell Ontology (CL) identifiers. This PhD project will use the corpus to support the development and evaluation of AI and machine learning models that automatically identify cell types in the scientific literature, including novel cell types, and their relationships with other key biological entities (e.g., marker genes, anatomical structures, perturbation responses, disease states). The extracted relationships will be translated into standardized, semantically structured assertions and incorporated into the NLM Cell Knowledge Network (NLM-CKN).
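To make the nomenclature problem concrete, the following minimal Python sketch shows a straightforward dictionary-based matcher that links cell type mentions to Cell Ontology identifiers and emits a structured assertion. It is an illustration only and is not the project's method: the three-entry lexicon, the `find_cell_mentions` and `to_assertion` helpers, and the `has_marker_gene` predicate are all hypothetical names chosen for this example.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical mini-lexicon mapping surface forms to Cell Ontology (CL) identifiers.
# A real system would draw on the full ontology and its synonyms.
LEXICON = {
    "t cell": "CL:0000084",
    "b cell": "CL:0000236",
    "macrophage": "CL:0000235",
}


@dataclass(frozen=True)
class Mention:
    text: str    # the matched span as written in the sentence
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    cl_id: str   # linked Cell Ontology identifier


def find_cell_mentions(sentence: str) -> list[Mention]:
    """Return lexicon matches (singular or plural), ordered by position."""
    mentions = []
    for name, cl_id in LEXICON.items():
        pattern = re.compile(r"\b" + re.escape(name) + r"s?\b", re.IGNORECASE)
        for m in pattern.finditer(sentence):
            mentions.append(Mention(m.group(0), m.start(), m.end(), cl_id))
    return sorted(mentions, key=lambda mention: mention.start)


def to_assertion(mention: Mention, predicate: str, obj: str) -> tuple[str, str, str]:
    """Build a (subject, predicate, object) triple; the predicate name is hypothetical."""
    return (mention.cl_id, predicate, obj)


found = find_cell_mentions("Macrophages and T cells were enriched.")
triple = to_assertion(found[1], "has_marker_gene", "CD3E")

# The same cell type written with a different name is missed entirely,
# which is the limitation that motivates corpus-trained models.
missed = find_cell_mentions("T lymphocytes infiltrated the tissue.")
```

Here `found` contains two linked mentions, while `missed` is empty because "T lymphocytes" is not in the lexicon. Exact string matching cannot handle synonymous or novel names, which is why annotated corpora are needed to train and evaluate models that can.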

The student will join an interdisciplinary team at the National Library of Medicine that is applying computational biology and data science techniques to characterize cellular phenotypes and their roles in health and disease at scale. Training will be provided in advanced computational and statistical analysis of multi-omics data, natural language processing using artificial intelligence methods, and the development and use of ontologies and other semantic web technologies for biomedical knowledge representation.

Artificial Intelligence approaches for demystifying cellular phenotypes through semantic knowledge networks

Project Details

Cells are the fundamental units of life. Single cell genomic technologies are revolutionizing our understanding of cellular phenotypes. Large single cell data consortia, including the NIH BRAIN Initiative and the Human BioMolecular Atlas Program (HuBMAP), have generated single cell atlas data from millions of cells/nuclei spanning multiple organs and biological systems. At the National Library of Medicine (NLM), we are building the NLM Cell Knowledge Network (http://cell-kn-mvp.org), a knowledgebase that focuses on representing the cell phenotypes and associated characteristics derived from single cell genomics data. It integrates data-driven information with knowledge from trustworthy reference ontologies, NCBI resources, and text mining efforts, resulting in a large-scale semantic knowledge network for innovative data mining and knowledge discovery.

This project consists of two main research components:
i) developing novel computational methods for single cell and spatial transcriptomics analysis using machine learning and advanced statistics techniques, and
ii) developing network analysis strategies for knowledge mining using cutting-edge artificial intelligence technologies.

Students interested in one or both research components are encouraged to apply. The project team has an interdisciplinary background spanning molecular biology, genetics, statistics, and computer science, providing a strong support system for students’ academic growth. Dr. Yun (Renee) Zhang is a tenure-track investigator at NLM and an alumnus of the University of Oxford.

Benjamin Lee

Smartphone based image analysis for malaria diagnosis

Project Details

Malaria is a major burden on global health, with about 200 million cases and 600,000 deaths worldwide per year. Inadequate diagnostics are a major barrier to effective management of cases and elimination of the disease. The current gold standard method for malaria diagnosis is light microscopy of blood films. About 170 million blood films are examined every year for malaria, which involves manually identifying and counting parasites. However, microscopic diagnostics are not standardized and depend heavily on the experience and skill of the microscopist. Many microscopists work in isolation, with no rigorous system in place for maintaining their skills. A false negative leads to an incorrect diagnosis, with unnecessary use of antibiotics, a second consultation, lost days of work, and in some cases progression to severe malaria. A false positive results in unnecessary use of antimalarial drugs and their side effects.

To improve malaria diagnostics, the Lister Hill National Center for Biomedical Communications, an R&D division of the U.S. National Library of Medicine, NIH, and the Mahidol-Oxford Tropical Medicine Research Unit (MORU), University of Oxford, in Bangkok, Thailand, are developing a fully automated low-cost system that uses a mobile phone and a standard light microscope for parasite detection and counting on blood films. Compared to manual counting, automatic parasite counting is more reliable and standardized, reduces the workload of malaria field workers, and reduces diagnostic costs. To count parasites automatically, the system uses image processing methods to find cells infected with parasites in digitized images of blood films. The system is trained on manually annotated images; machine learning methods then discriminate between infected and uninfected cells, detect the type of parasites that are present, and perform the counting. The system uses a regular smartphone and digital images acquired on standard light microscopy equipment, making it ideal for resource-poor settings.
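The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual implementation: it assumes an upstream detector and classifier have already assigned one label to each detected cell, and the function name `summarize_blood_film`, the label strings, and the choice to report the percentage of infected cells are all hypothetical.

```python
from collections import Counter


def summarize_blood_film(cell_labels):
    """Aggregate per-cell classifier labels into counts for one blood film.

    cell_labels: one label per detected cell, either "uninfected" or the
    name of the parasite type predicted for that cell (hypothetical labels).
    """
    counts = Counter(cell_labels)
    total = len(cell_labels)
    infected = total - counts["uninfected"]
    # Percentage of detected cells classified as infected; 0.0 if no cells were found.
    infected_percent = 100.0 * infected / total if total else 0.0
    by_type = {label: n for label, n in counts.items() if label != "uninfected"}
    return {
        "total_cells": total,
        "infected_cells": infected,
        "infected_percent": infected_percent,
        "by_type": by_type,
    }


# Made-up labels standing in for classifier output on 100 detected cells.
labels = ["uninfected"] * 95 + ["P. falciparum"] * 4 + ["P. vivax"]
summary = summarize_blood_film(labels)
```

Keeping the aggregation separate from detection and classification means the same summary can be computed from manual annotations, which makes it straightforward to compare automatic counts against a microscopist's counts.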

This PhD project will develop and test this system for real-world use in malaria diagnosis. It will include optimisation of the system at NIH and testing of the system in the field at MORU. The work will cover the smartphone application interface and performance; the system for connecting the smartphone to standard light microscopes; development of a core set of performance metrics for the application; field testing of the entire system for malaria diagnosis together with government healthcare workers and National Malaria Control Programme staff; structured interviews to gather feedback on the system and its potential role in malaria diagnosis in different settings; a formal field trial of system performance; and development of a system implementation guidance document for National Malaria Control Programmes.

The student will join a dynamic team of image analysis specialists at NLM and epidemiologists, modellers and clinicians at the MORU offices in Bangkok. They will spend time at field sites in malaria-endemic areas and will interact with government staff. Training will be provided at NIH on basic image analysis and smartphone application development, and at MORU on malaria microscopy, clinical study methodology, data analysis and research ethics.