Project Charter
Explore the use of Large Language Models for biological research, and define best practices for doing so, using target discovery and validation as the initial use case.
The use of Large Language Models (LLMs), such as GPT-4, presents a transformative opportunity for pharmaceutical R&D, particularly in deep data mining that requires the synthesis of large, complex datasets and the integration of proprietary research within the broader context of public information. Through this initiative, we aim to define the role of LLMs in pre-competitive research, demonstrating their potential to accelerate drug discovery and enhance collaboration across the life sciences sector.
- Phase 1 is now complete; results are in preparation for publication
- Phase 2 is open for sponsorship
Why Target Discovery?
This process is universally relevant across pharmaceutical R&D and exemplifies the challenges LLMs can address—namely, mining vast and intricate datasets to produce actionable insights. By solving these challenges, we pave the way for broader applications of LLMs in the scientific and industrial research landscape.
Phase 1 Challenge and Outcome
Querying biological knowledge bases in natural language is an attractive prospect. Natural language queries are typically translated into structured queries using LLMs. However, naive, unassisted LLMs often misinterpret scientific queries, and they may also produce linguistically correct but factually false output, colloquially known as hallucinations.
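As a minimal illustration of this translation step, the sketch below stubs out the LLM call and adds a simple validity check that catches one class of hallucination: queries that reference schema elements that do not exist. The schema labels, the example question, and the hard-coded "translation" are all hypothetical; a real system would prompt an actual model here.

```python
# Sketch of naive NL-to-structured-query translation with a minimal
# schema check. The toy schema and the stubbed "LLM" are illustrative.
import re

SCHEMA_LABELS = {"Target", "Disease", "Drug"}  # toy graph schema

def llm_translate(question: str) -> str:
    """Stand-in for an LLM call that emits a Cypher query."""
    # A real implementation would prompt a model; here we hard-code one
    # plausible translation to keep the sketch self-contained.
    return (
        "MATCH (t:Target)-[:ASSOCIATED_WITH]->(d:Disease {name: 'asthma'}) "
        "RETURN t.symbol"
    )

def node_labels(cypher: str) -> set[str]:
    """Extract node labels so hallucinated schema elements can be spotted."""
    return set(re.findall(r"\(\w*:(\w+)", cypher))

query = llm_translate("Which targets are associated with asthma?")
unknown = node_labels(query) - SCHEMA_LABELS
print("unknown labels:", unknown)  # empty set -> only real labels were used
```

The check is deliberately shallow: it validates the query's vocabulary, not its logic, which is one reason a single unassisted LLM is not enough in practice.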
The objective of the first phase of the Pistoia Alliance LLMs in life sciences project was to conduct a systematic evaluation of several techniques aimed at querying biological databases in natural language with LLMs. We tested multiple LLMs, both proprietary and open-source, on a knowledge graph representing a subset of the Open Targets database frequently used in drug discovery.
We found that the best balance between accuracy and flexibility in this context is achieved by multiple LLM agents that can challenge each other's outputs and interact with a human user. This strategy is highly flexible: it requires no prior knowledge of the data structure, query templates, secondary databases, or adapter language models, and it exceeded the query accuracy of the other techniques we tested. We also documented many other lessons learned; see below.
Phase 2 Challenge
Create a proper benchmark for assessing natural language data mining systems. This benchmark should cover all steps, from named entity recognition to query strategy.
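A benchmark of this kind might score each pipeline step separately, so that a wrong final answer can be attributed to entity recognition, query construction, or retrieval. The sketch below shows that shape with a single invented test case and a deliberately imperfect dummy system; none of it reflects the actual Phase 2 design.

```python
# Toy sketch of a step-level benchmark: each case carries gold entities
# and a gold answer, so failures can be attributed to a specific step.
from dataclasses import dataclass

@dataclass
class Case:
    question: str
    gold_entities: set   # expected named entities
    gold_answer: set     # expected query result

def evaluate(system, cases):
    """Return per-step accuracy over all cases."""
    scores = {"ner": 0, "answer": 0}
    for c in cases:
        entities, result = system(c.question)
        scores["ner"] += entities == c.gold_entities
        scores["answer"] += result == c.gold_answer
    n = len(cases)
    return {step: hits / n for step, hits in scores.items()}

def dummy_system(question):
    # Gets the entity right but the retrieval wrong, to show attribution.
    return {"asthma"}, {"WRONG_GENE"}

cases = [Case("Which targets are associated with asthma?",
              {"asthma"}, {"IL13", "TSLP"})]
report = evaluate(dummy_system, cases)
print(report)
```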
Additional Lessons Learned in Phase 1
- In practice, natural language data mining competes with the traditional approach of writing structured queries by hand. Interpretation of natural language queries should therefore introduce no new errors into the data mining process, which in turn means it must be 100% accurate on the very first attempt. This is a very high bar to meet.
- The structure of the database used for queries matters:
- Providing an automatically generated graph schema does not help much in practice.
- The traditional approach of Retrieval-Augmented Generation (RAG) with a vector database may implicitly restrict which data the LLM explores and which it never sees.
- However, one can imagine a future adaptive database schema that evolves with the influx of new knowledge and new user queries.
- The form of the prompt matters. LLMs produce meaningful answers more easily from prompts that resemble a story rather than a dry question, even when the story's details are irrelevant to the question asked.
- There are open-source libraries that help produce such story-like optimized prompts.
- The winning practical strategy for natural language data mining should combine LLM agents for complex tasks, such as strategy planning, with deterministic API calls to the data resources for simple retrieval tasks that do not require AI.
- If agentic architectures are used in future LLM data mining systems, the agents must be made Findable and Reusable (FAIR), which requires API standards, perhaps similar to those used in Service-Oriented Architectures today.
- It is extremely important to implement a reliable named entity recognition system.
- Besides hallucinations, LLMs may also engage in task avoidance, where instead of writing database queries the model attempts to generate the answer directly from the knowledge embedded in it during training.
- There is no good biological test set for LLM evaluation; the model's background knowledge may contaminate the results.
- It follows from the above that it is extremely important to create a proper benchmark for assessing natural language data mining systems. This benchmark should cover all steps, from named entity recognition to query strategy, and it will be the focus of the second phase of our project.
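Two of the failure modes above, missed entities and task avoidance, can be screened for mechanically. The sketch below shows a minimal, hypothetical guard: a dictionary-based entity lookup (far short of the reliable NER system the lesson calls for) and a check that the model's output actually contains a query rather than an answer recalled from training.

```python
# Minimal guards for two failure modes: a toy dictionary-based entity
# check, and a "did the model actually write a query?" check that flags
# task avoidance. The lexicon and keywords are illustrative only.
KNOWN_ENTITIES = {"asthma", "il13", "tslp"}      # toy entity lexicon
QUERY_KEYWORDS = ("MATCH", "SELECT", "RETURN")   # signs of a real query

def recognized_entities(question: str) -> set:
    """Naive lookup NER: intersect question tokens with the lexicon."""
    tokens = question.lower().replace("?", "").split()
    return KNOWN_ENTITIES & set(tokens)

def looks_like_query(output: str) -> bool:
    """Heuristic: genuine database queries contain query keywords."""
    return any(k in output.upper() for k in QUERY_KEYWORDS)

ents = recognized_entities("Which targets are associated with asthma?")
avoided = not looks_like_query("IL13 is a well-known asthma target.")
print(ents, avoided)  # the second output flags a task-avoidance response
```

Real systems need far more than this (synonyms, identifiers, multi-word entities), which is precisely why the benchmark must test the NER step explicitly.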
Publication
A paper draft is currently being prepared for submission. A preprint will be cited here when available.