Date Submitted: 24-Jun-2025 (V2)
Authors: Nick Tomkinson, Gerd Blanke (updated by Christian Baber after PA Cheminformatics meeting)
Idea Originators (and Companies): Nick Tomkinson, AZ, Cambridge, UK; Gerd Blanke, StructurePendium Technologies GmbH, Essen, Germany
Strategic Priority: Delivering Data Driven Value
Problem Statement
Chemical exchange formats such as CTfile, SMILES, and HELM are well established and described, but their documentation is incomplete and the development paths for accommodating innovation are not as clear as they could be. Ambiguous interpretations can lead to inconsistent handling of chemical structures and reactions across the life-sciences industry, resulting in unnecessary work and cost. This is a particular issue where software from multiple providers is used and where compounds are imported from external suppliers into internal systems.
This affects compound management, the migration of data between electronic lab notebooks and registration systems, and the preparation of data for analysis, including ML/AI. The increasing diversity and complexity of compounds of interest requires exchange file formats to evolve, and these changes need to be well documented, not open to interpretation, and fit for purpose for all producers and consumers.
To ensure a common understanding of these exchange formats between the (competing) software vendors and consumers in the life-sciences industry, a common forum is needed in which to raise identified discrepancies and to discuss the need for further format enhancements within this broader community.
Idea Proposal and Value Proposition
The Chemical Exchange Format (CEF) committee is open to software vendors and industry software users whose work relies on standard chemical exchange formats. To avoid ambiguous interpretations of formats and to ensure a shared understanding of format enhancements, the committee meets once or twice a year to discuss format improvements and resolve any discrepancies in interpretation. A repository to capture discussion and a process for ratification are key deliverables.
Because the exchange formats are implemented in products by competing companies, the Pistoia Alliance provides a well-established ground for this necessary exchange.
Targeted Outputs
Initially, the committee would act as a discussion forum whose documented minutes would be of use to both standard owners and the community as a whole. Formal outputs of the committee could include:
Potential Initial Outputs of Committee
· Curated documentation of standards
· Scope and SWOT analysis of formats
· Identification/documentation of ambiguities in standards
· Identification/documentation of shortcomings in standards (capabilities)
· Identification/documentation of shortcomings/idiosyncrasies in implementations
· Prioritization of requested enhancements
Depending on the availability of funding and the focus of the steering committee, the group's activities could be extended to include:
Possible Future Activities
· Curation/creation of validation sets
· ‘Certification’ of software implementations
· Reference implementation creation/curation
· Standard libraries creation/curation
· Open-source format converter
Estimated Costs
Initial costs for the committee are estimated at ~$60,000 in total for the first year. This could be tightened to a minimum of ~$40,000 while retaining some value, or rise to as much as ~$90,000 if a face-to-face meeting or workshop were also held. The cost is driven primarily by committee facilitation (project management).
An initial fee of $10,000 for each large company and $7,500 for other companies would allow us to establish the committee, provided we had ~6 or more funding members.
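As a rough check on these figures, the short Python sketch below compares membership income with the three first-year cost scenarios for a committee of six funding members. The fees and cost figures are taken from the text above; the split between large and other companies is not stated in the proposal, so every possible split of six members is shown as an assumption.

```python
LARGE_FEE = 10_000  # fee per large company, from the proposal
OTHER_FEE = 7_500   # fee per other company, from the proposal

# Approximate first-year cost scenarios quoted in the proposal.
SCENARIOS = {"minimum": 40_000, "expected": 60_000, "with_workshop": 90_000}


def income(large: int, other: int) -> int:
    """Total membership income for a given mix of funding members."""
    return large * LARGE_FEE + other * OTHER_FEE


def covered(large: int, other: int) -> dict[str, bool]:
    """For each cost scenario, report whether the income meets the cost."""
    total = income(large, other)
    return {name: total >= cost for name, cost in SCENARIOS.items()}


if __name__ == "__main__":
    members = 6  # assumed committee size; the proposal says "~6 or more"
    for large in range(members + 1):
        other = members - large
        print(large, other, income(large, other), covered(large, other))
```

Under these assumptions, six members cover the ~$40,000 minimum in any mix, while the ~$60,000 estimate is only met if all six pay the large-company fee; the ~$90,000 scenario would need additional members or other funding.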
Example Use Case(s) and Benefit(s)
· Public documentation of strengths and weaknesses would allow companies to make informed decisions on which formats to use, both internally and for data exchange, reducing the need for data conversion, customization or refactoring
· Providing prioritized feedback to standard owners (and implementers) should improve the coverage and quality of both standards and implementations while ensuring that developers focus on the needs of the industry
The possible future activities would bring significant further benefits, with an associated increase in the effort required.
Critical Success Factors
· Engagement by both Pharma and Software Developers
· Engagement and acceptance by format owners who are willing to take part in discussions, contribute documentation and test/validation sets, and act on feedback provided
Why This Is a Good Idea / Why Now
The outcome of the committee's work is improved exchange of structure and reaction-related data within life-sciences R&D, providing better-quality data for consumers and making the work of researchers and data analysts more efficient.
Within industry, chemical structure and reaction life cycles become more robust and less maintenance-intensive. Participating software providers are better able to improve their software and gain a forum in which to discuss format issues and validate enhancement ideas.
Further benefits include time savings in quality assurance for compound and reaction registration, and improved accuracy in the representation of compounds of interest, including small molecules and other modalities.
Other Relevant Information
Validation of the exchange process by consumers and producers would benefit from a suite of test cases. Round-trip interconversion validation via public services would also be of value, but would require some independent gold-standard exact-match comparison, which may represent a deliverable in itself.
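To illustrate what such a round-trip check might look like, the Python sketch below runs each test case through a forward and a backward conversion and compares the result with the original by exact match after canonicalization. It is only a sketch under stated assumptions: the function names, the test-case identifiers and the toy converters are hypothetical stand-ins, and no real toolkit, service or agreed gold-standard comparison is implied.

```python
from typing import Callable, Iterable

Converter = Callable[[str], str]


def round_trip_failures(
    cases: Iterable[tuple[str, str]],
    forward: Converter,
    backward: Converter,
    canonicalize: Converter,
) -> list[tuple[str, str, str]]:
    """Return (case_id, expected, actual) for each case that fails the round trip."""
    failures = []
    for case_id, original in cases:
        expected = canonicalize(original)
        actual = canonicalize(backward(forward(original)))
        if actual != expected:  # exact-match comparison
            failures.append((case_id, expected, actual))
    return failures


# Hypothetical stand-ins for real format converters.
def identity(text: str) -> str:
    """Toy lossless converter: returns the input unchanged."""
    return text


def drop_stereo(text: str) -> str:
    """Toy lossy converter: strips '@', so stereo markers are lost."""
    return text.replace("@", "")


# Hypothetical test cases written as SMILES-like strings.
CASES = [
    ("plain-1", "CCO"),
    ("stereo-1", "C[C@H](N)C(=O)O"),
]

if __name__ == "__main__":
    for case_id, expected, actual in round_trip_failures(
        CASES, drop_stereo, identity, str.strip
    ):
        print(f"{case_id}: expected {expected!r}, got {actual!r}")
```

Passing the converters and the canonicalization step in as functions keeps the harness independent of any one vendor's implementation, which is the property a shared validation suite would need. The choice of canonicalization is the open question noted above.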