If Data is the currency of AI, are we doing enough around data quality?

AI and machine learning (AI/ML) are certainly at peak hype at the moment, given their current and potential future impact on Life Sciences and Healthcare, with a wealth of companies, large and small, offering new ways to interrogate existing data and build decision workflows supporting Life Science.

But is there something lacking in all the hype around AI/ML?

Maybe the biggest hindrance to using AI/ML effectively is the volume and quality of the data that exists. The models being built will only be as good as the data used to build them, and we all know the saying ‘Rubbish in, Rubbish out’ (or variants of this ;-)).

As an industry, are we being overly optimistic and simplistic about the quality of the data that we are feeding into our smart new AI/ML pipelines? Can we assess the quality of the insights that these AI/ML tools are producing, and are these tools giving any better decisions than would be obtained through more traditional methods or human endeavour?

Take early drug discovery and the experiments that are run today. Great strides have been made in digitising the data that is captured, including the spread of platforms such as Electronic Lab Notebooks (ELNs) across the industry.

However, having an experiment in an ELN does not necessarily mean that the data captured is of any better quality than it was in previous tools or approaches. If anything, there are times when ELNs have simply been buckets for records, with little enhanced metadata to support future exploitation and integration. Yet this metadata is critical if AI/ML is to produce correct insights.

We propose that more effort needs to go into giving the data captured during our core experiments enhanced metadata, with key linkages to existing data management best practice, including ontologies and other semantic and Linked Data principles.

The FAIR data principles and Data Commons have grown for many reasons, but there is strong synergy between them and our view that data needs to be much better described to be effective for AI/ML purposes.

As with all things to do with data, it’s rarely one step that makes the difference; it’s more the combination of a number of smaller steps that moves things forward.

We want to encourage debate about the quality of the data that we use, and about how we can make improving it a key next stage in the full usage of AI/ML in Life Science and Health.

So, some questions going forward:

  • What areas would provide quick wins for data quality improvement?
  • How might we achieve these goals?
  • How can we measure the quality of the data that we generate across the drug discovery workflow?
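To make the last question a little more concrete, here is a minimal sketch of one possible starting point: scoring how completely a set of experiment records is described by metadata. The field names in `REQUIRED_FIELDS` and the functions `completeness` and `summarise` are purely illustrative assumptions on our part, not an agreed standard or any particular ELN's schema. Completeness is also only one dimension of data quality, and it says nothing about whether the values are correct.

```python
from collections import Counter

# Hypothetical metadata fields, chosen only for illustration.
# A real list would come from your own data standards and ontologies.
REQUIRED_FIELDS = (
    "assay_type",
    "compound_id",
    "units",
    "protocol_id",
    "ontology_term",
)


def is_filled(value) -> bool:
    """Treat None and blank strings as missing; anything else counts as filled."""
    return value is not None and str(value).strip() != ""


def completeness(record: dict) -> float:
    """Fraction of the required metadata fields that are filled in a record."""
    filled = sum(1 for name in REQUIRED_FIELDS if is_filled(record.get(name)))
    return filled / len(REQUIRED_FIELDS)


def summarise(records: list[dict]) -> dict:
    """Mean completeness plus a count of how often each field is missing."""
    missing = Counter()
    for record in records:
        for name in REQUIRED_FIELDS:
            if not is_filled(record.get(name)):
                missing[name] += 1
    mean = sum(completeness(r) for r in records) / len(records) if records else 0.0
    return {"mean_completeness": mean, "missing_by_field": dict(missing)}
```

Running `summarise` over an export of experiment records would show which metadata fields are most often left empty, which might be one way to spot candidate quick wins.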

We welcome your comments…

More information on CoE for AI/ML in Life Sciences and FAIR plans

References

  1. Augmented Intelligence: AI and Life Science 
  2. FAIR principles

Posted in Pistoia Alliance Blog.