By Ilaria Maresi, The Hyve and Ian Harrow, the Pistoia Alliance


Data for drug discovery and healthcare is often trapped in silos, which hampers effective interpretation and reuse. To remedy this, such data needs to be linked both internally and to external sources to create a FAIR data landscape that can power semantic models and knowledge graphs.


In our recent webinar, we explored how data generation for drug discovery begins with the identification of targets and finishes, often over a decade later, with a submission to a drug regulatory authority such as the FDA. Between discovery research, clinical development and submission, data has to move effectively across disciplines, research activities and laboratories. However, across these stages of research, data is often identified inconsistently: the same entity carries different identifiers and names, and related records are not linked, which makes the data challenging to interpret and reuse effectively. Knowledge graphs address this by linking data and metadata both internally and to external sources, following FAIR data guidelines, for comprehensive integration, better interpretation and more reuse.


Knowledge graphs start with subject matter experts working with data engineers to capture understanding through concept models. These are combined with relevant ontologies, which define concepts and the relationships between them, to create semantic models using, for example, the Resource Description Framework (RDF). The semantic models use Uniform Resource Identifiers (URIs) for the data to provide a formal representation of meaning, which can be read by machines for interpretation and analysis at scale. A knowledge graph is a dynamic store for diverse sources of data and metadata: concepts and relationships are held as nodes and edges in a graph, and semantics capture encoded or inferred meaning. Linked data and metadata are more likely to be Findable, Accessible, Interoperable and Reusable (FAIR).
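To make the idea of URIs, nodes and edges concrete, here is a minimal sketch in plain Python of how statements can be held as subject, predicate, object triples and matched by pattern. It is an illustration only, not the model shown in the webinar: the `http://example.org/` namespace, the entity names and the `inhibits` and `investigates` relationships are all hypothetical, and a real project would normally use an RDF library and a triple store rather than a Python list.

```python
from typing import Iterator, List, Optional, Tuple

# A triple is (subject, predicate, object), each identified by a URI string.
Triple = Tuple[str, str, str]

# Hypothetical namespace and terms, for illustration only.
EX = "http://example.org/"
# rdf:type is the standard RDF predicate for stating the class of a node.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples: List[Triple] = [
    (EX + "compound/C1", RDF_TYPE, EX + "Compound"),
    (EX + "target/T1", RDF_TYPE, EX + "Target"),
    (EX + "compound/C1", EX + "inhibits", EX + "target/T1"),
    (EX + "study/S1", EX + "investigates", EX + "compound/C1"),
]


def match(
    graph: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given pattern.

    A value of None acts as a wildcard for that position.
    """
    for triple in graph:
        if s is not None and triple[0] != s:
            continue
        if p is not None and triple[1] != p:
            continue
        if o is not None and triple[2] != o:
            continue
        yield triple


def inhibited_targets(graph: List[Triple], compound: str) -> List[str]:
    """Follow the hypothetical 'inhibits' edge from a compound to its targets."""
    return [obj for _, _, obj in match(graph, s=compound, p=EX + "inhibits")]


targets = inhibited_targets(triples, EX + "compound/C1")
typed_nodes = list(match(triples, p=RDF_TYPE))
```

Because every entity is named by a URI rather than a local label, the same `compound/C1` node can be referenced from a study record and from a target record without ambiguity, which is what allows data from different silos to be joined.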


Our webinar described, as an example, a semantic model for data in clinical studies, which contains over 1,300 triples (a triple is subject + predicate + object), 65 classes, 153 properties and 13 ontologies. The ontologies are selected by relevance to ensure interoperability, as they include terms for harmonisation of data, annotation of metadata, links to controlled vocabularies and provenance for the data. A second semantic model, more focussed on the drug discovery process, was instantiated with data and metadata in order to create a knowledge graph containing ~14 million triples, which enables different views of an evolving model. Queries against this knowledge graph typically return results within seconds, although more complex queries could be slower. Pitfalls include the need to validate the model, which can be addressed using something like the Shapes Constraint Language (SHACL). Unique and persistent identifiers (URIs) need to be maintained by an appropriate infrastructure service. Sometimes a semantic model may be overengineering, and a common data model may be sufficient. Besides the advantages mentioned already, knowledge graphs can be used in combination with machine learning algorithms for powerful analysis that can include semantics.
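As a rough illustration of what validating a model involves, the sketch below checks one shape-style rule in plain Python: every node of a given class must have at least a minimum number of values for a given property. This mirrors the kind of constraint SHACL expresses with `sh:targetClass`, `sh:path` and `sh:minCount`, but it is not SHACL and not the validation used in the webinar. The namespace, the `ClinicalStudy` class, the `hasIdentifier` property and the sample data are hypothetical.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object).
Triple = Tuple[str, str, str]

# Hypothetical namespace and terms, for illustration only.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def check_min_count(
    graph: List[Triple], target_class: str, path: str, min_count: int = 1
) -> List[str]:
    """Return the nodes of target_class that have fewer than min_count
    values for the property given by path."""
    # Find every node declared to be an instance of the target class.
    nodes = [s for s, p, o in graph if p == RDF_TYPE and o == target_class]
    violations = []
    for node in nodes:
        # Count how many values this node has for the required property.
        count = sum(1 for s, p, _ in graph if s == node and p == path)
        if count < min_count:
            violations.append(node)
    return violations


graph: List[Triple] = [
    (EX + "study/S1", RDF_TYPE, EX + "ClinicalStudy"),
    (EX + "study/S1", EX + "hasIdentifier", "placeholder-id-1"),
    # S2 is declared as a study but has no identifier, so it should be flagged.
    (EX + "study/S2", RDF_TYPE, EX + "ClinicalStudy"),
]

violations = check_min_count(graph, EX + "ClinicalStudy", EX + "hasIdentifier")
```

Here `violations` lists the studies that break the rule, so problems can be caught when data is loaded instead of surfacing later as missing query results. A SHACL processor applies the same principle declaratively, with the shapes themselves written as RDF.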


A useful list of materials to aid with the practical construction of knowledge graphs can be found at the end of the webinar and we plan to share “A Data Engineer’s Guide to Semantic Models” soon.


You can watch the webinar and download the slides here.


For further information about the Pistoia Alliance’s FAIR implementation for the life science industry project, contact:

For further information about the webinar presentation, contact: