Authors: Vladimir Makarov, Pistoia Alliance
Date Submitted: May 11, 2025
Idea Originators (and Companies): Lars Greiffenberg, AbbVie
Other supporting Individuals/Companies: AstraZeneca
Identified Funders: AbbVie
Strategic Priority: AI
Problem Statement:
The use of Large Language Models (LLMs), such as GPT-4, presents a transformative opportunity for pharmaceutical R&D, particularly in deep data mining that requires the synthesis of large, complex datasets and the integration of proprietary research within the broader context of public information. The industry is actively experimenting with Natural Language to Query Language (NL2QL) translation and Scientific Chat applications. Most recently, the Pistoia Alliance completed an investigation into the best strategies for using LLMs for data mining in natural language (NL). The results of our work are published online, and a formal peer-reviewed publication is pending. A key discovery of our study is the lack of appropriate benchmarks for assessing each step of the NL data mining process. The absence of suitable test sets complicates tool development in NL data mining. The proposed project aims to close this gap.
In what follows, we use the terms “NL data mining” and “scientific chat” interchangeably.
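To make the NL2QL task concrete, the fragment below pairs an invented natural language question with the kind of structured query a translation system would be expected to produce. The table names and schema are assumptions made purely for illustration and do not refer to any real database.

```python
# Minimal illustration of the NL2QL task: translating a natural language
# question into a structured query. The question, schema, and query are
# invented for this sketch.
nl_question = "How many Phase 2 trials of JAK inhibitors started in 2023?"

# A correct NL2QL system would emit a query along these lines:
structured_query = """
SELECT COUNT(*)
FROM trials t
JOIN drugs d ON t.drug_id = d.id
WHERE d.drug_class = 'JAK inhibitor'
  AND t.phase = 2
  AND t.start_date BETWEEN '2023-01-01' AND '2023-12-31'
"""
```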
Idea Proposal and Value Proposition
We propose the assessment or, if needed, the development of a series of benchmarks covering all four steps of the NL data mining process (a minimal sketch of a test case spanning these steps follows the list):
· Understanding the question
· Recognition of named entities, synonyms, and disambiguation of terms
· Building the structured query
· Assessment of the overall answer quality
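As one way of making the four steps concrete, the sketch below shows what a single benchmark record covering all of them might look like. Every field name, identifier, and value is an illustrative assumption, not a committed design.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    """One illustrative test case spanning the four NL data mining steps.
    Field names and schema are hypothetical, pending project design."""
    question: str                      # step 1: NL question to be understood
    expected_intent: str               # step 1: gold intent label
    expected_entities: dict[str, str]  # step 2: surface form -> resolved ID
    gold_query: str                    # step 3: reference structured query
    gold_answer: set[str]              # step 4: reference result set

example = BenchmarkCase(
    question="Which approved small-molecule drugs target EGFR?",
    expected_intent="list_drugs_by_target",
    expected_entities={"EGFR": "HGNC:3236"},
    gold_query=(
        "SELECT d.name FROM drugs d "
        "JOIN targets t ON d.target_id = t.id "
        "WHERE t.hgnc_id = 'HGNC:3236' AND d.status = 'approved'"
    ),
    gold_answer={"gefitinib", "erlotinib", "osimertinib"},
)
```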
We believe that our proposal, if completed, will allow pharmaceutical and biotechnology research organizations to make better tooling decisions based on objective evaluation of technologies, and will allow technology vendors to plan product improvements. Regulatory agencies may also benefit. The overall quality and speed of drug discovery R&D will improve.
Targeted Outputs
The main deliverable is a quality assessment framework for Scientific Chat:
· A review of existing and proposed benchmarks for the four steps in the NL data mining process. Although we have good reason to believe this is a scientific gap, learning from earlier attempts should be instructive
· A white paper describing the problem space and the proposed solution
· A set of scientific benchmarks for each of the four listed steps in NL data mining. Each should contain suitable test sets, statistical evaluation criteria, and recommendations for updates (one possible metric is sketched after this list)
· A plan for long-term maintenance and evolution of the proposed benchmarks
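To illustrate what “statistical evaluation criteria” could mean for the query-building step, the sketch below scores execution accuracy (whether the generated query returns exactly the gold result set) with a percentile bootstrap confidence interval. The metric choice and all function names are assumptions offered for discussion, not project commitments.

```python
import random

def execution_accuracy(matches: list[bool]) -> float:
    """Fraction of benchmark cases where the generated query returned
    exactly the gold result set."""
    return sum(matches) / len(matches)

def bootstrap_ci(matches: list[bool], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval, so that scores on small
    test sets are reported with uncertainty rather than as bare points."""
    rng = random.Random(seed)
    scores = sorted(
        execution_accuracy(rng.choices(matches, k=len(matches)))
        for _ in range(n_resamples)
    )
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Example: 100 test cases, 73 exact result-set matches.
matches = [True] * 73 + [False] * 27
low, high = bootstrap_ci(matches)
print(f"execution accuracy = {execution_accuracy(matches):.2f} "
      f"(95% CI {low:.2f}-{high:.2f})")
```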
The Pistoia Alliance will serve as a neutral party in organizing benchmark development and maintenance.
Critical Success Factors
· Involvement of multiple pharmaceutical and technology companies
· Quick action, since academic groups may be pursuing competing efforts in the same space
Why This Is a Good Idea / Why Now
· Use of AI in general, and of LLMs in particular, is now common in the pharmaceutical industry, and AI solutions are widely used in life sciences research. Over 75 AI-discovered molecules have entered clinical trials since 2015, with a compound annual growth rate (CAGR) of over 60% and a Phase I success rate of 80-90%, significantly higher than the historical average of 40-65% (reference: www.drugdiscoverytrends.com/six-signs-ai-driven-drug-discovery-trends-pharma-industry). As a result of these early successes, investment in AI in biotech and pharma is increasing: according to a Pistoia Alliance survey of 200 experts across Europe, the Americas, and APAC, 62% of respondents plan to invest in AI within the next two years (reference: www.drugtargetreview.com/news/153454/the-pistoia-alliance-key-findings-on-ai/).
· The lack of standardized verification techniques for AI quality limits the adoption of AI technologies. Leaderboards such as those hosted by Hugging Face focus on tasks that are not specific enough for our use cases. In the first phase of our LLM project, we attempted both the recovery of potentially useful test sets from academic publications and the creation of our own (much smaller) test set. We observed, in particular, that:
o Existing test sets are small and niche
o Existing test sets are saturated
o Background knowledge contaminates the results, allowing top LLMs to bypass structured query generation and instead hallucinate the results from memorized information (a simple perturbation check for this failure mode is sketched after this list)
· Benchmark development is a nearly ideal pre-competitive work scenario: it benefits the industry as a whole, allows all involved parties to avoid duplicating effort, and affords no unfair advantage to any participant.
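One simple way to detect the background-knowledge contamination noted above is a counterfactual perturbation test: change a fact in the underlying database and check whether the system’s answer follows. The toy sketch below, with entirely invented data and stand-in systems, shows the pattern; it is a heuristic illustration, not a finished test procedure.

```python
# Toy illustration of a counterfactual perturbation test for background
# knowledge contamination. The database, question, and both "systems" are
# stand-ins invented for this sketch; the testing pattern is the point.

DRUGS = {"gefitinib": {"target": "EGFR", "status": "approved"}}

def grounded_system(question: str, db: dict) -> set[str]:
    """Stand-in for a tool that genuinely executes a structured query."""
    return {name for name, row in db.items()
            if row["target"] == "EGFR" and row["status"] == "approved"}

def contaminated_system(question: str, db: dict) -> set[str]:
    """Stand-in for an LLM answering from memorized public knowledge
    while ignoring the database entirely."""
    return {"gefitinib"}

def tracks_database_change(system) -> bool:
    """Flip one fact the gold query depends on and check whether the
    answer changes; a memorized answer will not."""
    question = "Which approved drugs target EGFR?"
    baseline = system(question, DRUGS)
    perturbed = {"gefitinib": {"target": "EGFR", "status": "withdrawn"}}
    return system(question, perturbed) != baseline

print("grounded tracks change:", tracks_database_change(grounded_system))      # True
print("contaminated tracks change:", tracks_database_change(contaminated_system))  # False
```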