Large Language Models in Life Sciences

Project Charter

Explore the use of Large Language Models for biological research and define best practices for doing so, using target discovery and validation as the initial use case.

The use of Large Language Models (LLMs), such as GPT-4, presents a transformative opportunity for pharmaceutical R&D, particularly in deep data mining that requires the synthesis of large, complex datasets and the integration of proprietary research within the broader context of public information. Through this initiative, we aim to define the role of LLMs in pre-competitive research, demonstrating their potential to accelerate drug discovery and enhance collaboration across the life sciences sector.

  • Phase 1 is now complete; results are in preparation for publication
  • Phase 2 is open for sponsorship
Why Target Discovery?

Target discovery is relevant across all of pharmaceutical R&D and exemplifies the challenges LLMs can address: mining vast and intricate datasets to produce actionable insights. Solving these challenges paves the way for broader applications of LLMs in scientific and industrial research.

Phase 1 Challenge and Outcome

It is attractive to query biological knowledge bases in natural language. Natural language queries are typically translated into structured queries using Large Language Models (LLMs). However, unassisted, naive LLMs often fail to interpret scientific queries correctly and may also produce linguistically correct but factually false output, colloquially known as hallucinations.

The objective of the first phase of the Pistoia Alliance LLMs in life sciences project was to conduct a systematic evaluation of several techniques aimed at querying biological databases in natural language with LLMs. We tested multiple LLMs, both proprietary and open-source, on a knowledge graph representing a subset of the Open Targets database frequently used in drug discovery.

We found that the best balance between accuracy and flexibility in this context is achieved by multiple LLM agents that can challenge each other's outputs and interact with a human user. This strategy is very flexible: it does not require prior knowledge of the data structure, query templates, secondary databases, or adaptor language models, and it exceeds the query accuracy of the other techniques. We also documented a number of other lessons learned, listed below.
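
The multi-agent strategy described above can be illustrated with a minimal sketch. It is not the project's implementation: the two agents are stand-in Python functions rather than real LLM calls, and the names `writer_agent`, `critic_agent`, and `refine_query`, the Cypher-like query text, and the round limit are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins for LLM agents; a real system would call a model here.
def writer_agent(question: str, feedback: Optional[str]) -> str:
    """Draft a graph query, revising it when the critic has left feedback."""
    if feedback is None:
        # The first draft forgets to restrict the result to the named disease.
        return "MATCH (t:Target)-[:ASSOCIATED_WITH]->(d:Disease) RETURN t.symbol"
    return ("MATCH (t:Target)-[:ASSOCIATED_WITH]->(d:Disease) "
            "WHERE d.name = 'asthma' RETURN t.symbol")


def critic_agent(question: str, query: str) -> Optional[str]:
    """Return an objection, or None when the query looks acceptable."""
    if "asthma" in question.lower() and "asthma" not in query.lower():
        return "The query does not filter on the disease named in the question."
    return None


def refine_query(
    question: str,
    writer: Callable[[str, Optional[str]], str],
    critic: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Tuple[str, Optional[str]]]]:
    """Alternate between writer and critic until the critic has no objection.

    Returns the accepted query (or None if the agents never agree) together
    with the full exchange, so that a human user can inspect it and step in.
    """
    feedback: Optional[str] = None
    history: List[Tuple[str, Optional[str]]] = []
    for _ in range(max_rounds):
        query = writer(question, feedback)
        feedback = critic(question, query)
        history.append((query, feedback))
        if feedback is None:
            return query, history
    return None, history


query, history = refine_query(
    "Which targets are associated with asthma?", writer_agent, critic_agent
)
```

In this toy run the critic rejects the first draft and accepts the second. Returning `None` after `max_rounds` is one possible way to hand an unresolved question to the human user.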

Phase 2 Challenge

Create a proper benchmark for the assessment of natural language data mining systems. This benchmark should cover all steps, from named entity recognition to query strategy.
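
To make the idea of a benchmark that covers several steps more concrete, here is a small sketch that scores two stages separately: named entity recognition and the final answer. The data model (`BenchmarkCase`, `SystemOutput`), the metric choices, and the placeholder identifiers such as `GENE_A` are assumptions for illustration only. They do not describe the benchmark the project intends to build, and intermediate stages such as query strategy are left out.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class BenchmarkCase:
    question: str
    expected_entities: FrozenSet[str]  # entities the system should recognize
    expected_answer: FrozenSet[str]    # rows the final query should return


@dataclass(frozen=True)
class SystemOutput:
    entities: FrozenSet[str]
    answer: FrozenSet[str]


def f1(predicted: FrozenSet[str], expected: FrozenSet[str]) -> float:
    """F1 score between two sets; two empty sets count as a perfect match."""
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def score(cases: List[BenchmarkCase], outputs: List[SystemOutput]) -> Dict[str, float]:
    """Report one number per stage so that failures can be localized."""
    pairs = list(zip(cases, outputs))
    entity_f1 = [f1(out.entities, case.expected_entities) for case, out in pairs]
    exact = [out.answer == case.expected_answer for case, out in pairs]
    return {
        "entity_f1": sum(entity_f1) / len(pairs),
        "answer_exact_match": sum(exact) / len(pairs),
    }


# Placeholder cases; the identifiers are invented and carry no biological meaning.
cases = [
    BenchmarkCase("Q1", frozenset({"GENE_A", "DISEASE_X"}), frozenset({"GENE_A"})),
    BenchmarkCase("Q2", frozenset({"GENE_B", "DISEASE_Y"}), frozenset({"GENE_B"})),
]
outputs = [
    SystemOutput(frozenset({"GENE_A", "DISEASE_X"}), frozenset({"GENE_A"})),
    SystemOutput(frozenset({"GENE_B"}), frozenset({"GENE_C"})),
]
result = score(cases, outputs)
```

Reporting a score per stage, rather than a single end-to-end number, shows whether a wrong answer originates in entity recognition or later in the pipeline.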

Additional Lessons Learned in Phase 1
  • In practice, natural language data mining competes with the traditional approach, in which structured queries are written by hand. The interpretation of natural language queries should therefore not introduce any new errors into the data mining process, which in turn means that it has to be 100% accurate on the very first attempt. This is a very high bar to meet.
  • The structure of the database used for queries matters:
    • Providing an automatically generated graph schema does not really help in practice.
    • The traditional way of using Retrieval-Augmented Generation (RAG) with a vector database may implicitly restrict which data the LLM explores and which it does not.
    • However, one can envision a future adaptive database schema that evolves with the influx of new knowledge and new queries from users.
  • The form of the prompt matters. LLMs can more easily produce meaningful answers from prompts that resemble a story than from a dry question, even if the details of the story are irrelevant to the main question asked.
    • There are open-source libraries that help produce such story-like optimized prompts.
  • The winning practical strategy for natural language data mining should include a combination of LLM agents for complex tasks like strategy planning, and deterministic API calls to the data resources for simple data retrieval tasks that do not require AI.
  • If agentic architectures are used for future LLM data mining systems, then agents must be made Findable, Accessible, Interoperable, and Reusable (FAIR) and thus require API standards, perhaps similar to the standards used in Service-Oriented Architectures today.
  • It is extremely important to implement a reliable named entity recognition system.
  • Besides hallucinations, LLMs may also engage in task avoidance behavior, where instead of writing database queries, the model attempts to generate the answer directly from the knowledge embedded in it during training.
  • There is no good biological test set for LLM evaluation. Background knowledge that the model acquired during training may contaminate the results.
  • It follows from the above that it is extremely important to create a proper benchmark for the assessment of natural language data mining systems. This benchmark should cover all steps, from named entity recognition to query strategy. This will be the focus of the second phase of our project.
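
Two of the lessons above, sending simple retrieval to deterministic API calls and guarding against task avoidance, can be sketched together. Everything in this example is hypothetical: the in-memory `TARGET_RECORDS` stands in for a real data resource, the request pattern is invented, the `planner` callable stands in for an LLM agent, and the check assumes a Cypher-style graph query language, which the text above does not specify.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical in-memory records standing in for a real data resource API.
TARGET_RECORDS: Dict[str, Dict[str, object]] = {
    "GENE_A": {"symbol": "GENE_A", "associated_diseases": ["DISEASE_X"]},
}

# Invented pattern for requests simple enough to answer without an LLM.
LOOKUP_PATTERN = re.compile(r"^look up target (?P<symbol>\w+)$", re.IGNORECASE)


def fetch_target(symbol: str) -> Optional[Dict[str, object]]:
    """Deterministic retrieval: no model involved, so no hallucination risk."""
    return TARGET_RECORDS.get(symbol.upper())


def looks_like_query(text: str) -> bool:
    """Crude task-avoidance guard: accept only text that starts like a
    Cypher read query, and reject free-text answers from model memory."""
    return text.lstrip().upper().startswith(("MATCH", "CALL"))


def answer(request: str, planner: Callable[[str], str]) -> Dict[str, object]:
    """Route simple lookups to the API and everything else to the planner."""
    match = LOOKUP_PATTERN.match(request.strip())
    if match:
        return {"route": "api", "result": fetch_target(match.group("symbol"))}
    draft = planner(request)
    if not looks_like_query(draft):
        # The agent answered from background knowledge instead of writing a query.
        return {"route": "rejected", "result": None}
    return {"route": "agent", "result": draft}


simple = answer("look up target GENE_A", lambda request: "")
avoided = answer("Which targets are linked to DISEASE_X?",
                 lambda request: "GENE_A is linked to DISEASE_X.")
planned = answer("Which targets are linked to DISEASE_X?",
                 lambda request: "MATCH (t:Target) RETURN t.symbol")
```

A production guard would need to parse and validate the query properly. The prefix check here only illustrates where such a guard would sit in the flow.
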
Publication

A draft paper is now in the process of submission. A pre-print will be cited here when available.

Get involved

Talk to our project managers to learn more and get involved.


Project Supporters

  • AbbVie
  • AstraZeneca