William Spooner – Chief Science Officer of Eagle Genomics

Being a MELGEN early stage researcher (ESR) gives us the opportunity to taste the industrial side of science. One of MELGEN’s industrial partners is Eagle Genomics, a bioinformatics, computational and systems biology company that provides rare expertise in handling “big data” for its clients. Sathya Muralidhar (ESR06) and I (ESR05) spent one month at Eagle Genomics in Cambridge for our secondment placement. During that stay we gained many new perspectives and ways of thinking about melanoma data. Why is this important for us? This kind of experience makes us aware that melanoma research is a very broad term, one that also touches on industrial aspects.

To get a clear view of the work of Eagle Genomics and its contribution to melanoma research, I put several questions to the company’s Chief Science Officer and bioinformatician, William Spooner.

How does Eagle Genomics tackle big data?

This is a very big question. As a modern informatics company, we have made big data central to our business; it is something we have come to understand very well over the past eight years. When managing big data, we take into account the three V’s:

– To cope with its “volume” we avoid moving it around wherever possible; big data acts as an anchor, and it is often easier to move the compute to the data than the data to the compute.

– To cope with its “velocity” we rely on scalable and elastic cloud infrastructure and our portable Eaglehive data workflows.

– “Variety” is a huge issue as we tackle integrated, multi-omics studies (genomics, transcriptomics, proteomics). Data can only be integrated if it is accurately described. Descriptions are often inadequate in the source information and have to be improved via manual biocuration. Automating curation is a strategic research priority for Eagle.

There are a couple of other big data V’s that we take into account:

– “Veracity”, which is a combination of accuracy, understandability, and provenance. Veracity can be improved by careful quality assessment and yet more biocuration.

– Finally, the data’s “value”, which we define as its usefulness and relevance to the research question(s) in hand. These questions may differ from the purpose for which the data were originally collected. We have developed a unique statistical method for measuring the scientific value of data, which turns out to be very useful in the management of big data. There is an old adage, “you can’t manage what you can’t measure”, and this is just as true for big data as elsewhere.
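The statistical method itself is not spelled out in this interview, so the snippet below is only a toy sketch of the underlying idea: making data value measurable by scoring each dataset against the question at hand. The attributes (completeness, relevance, provenance), the weights and the example scores are all invented for illustration.

```python
# Toy sketch of question-driven data valuation (hypothetical; not Eagle's
# actual method): score each dataset by weighted attributes that reflect
# its usefulness for the research question at hand.

from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    completeness: float  # fraction of required metadata present (0-1)
    relevance: float     # curator-assigned match to the question (0-1)
    provenance: float    # traceability of origin and processing (0-1)

def value_score(ds: Dataset, weights=(0.4, 0.4, 0.2)) -> float:
    """Combine attribute scores into a single, comparable value measure."""
    w_c, w_r, w_p = weights
    return w_c * ds.completeness + w_r * ds.relevance + w_p * ds.provenance

datasets = [
    Dataset("TCGA skin cutaneous melanoma", 0.9, 0.95, 0.9),
    Dataset("legacy microarray study", 0.5, 0.7, 0.4),
]

# "You can't manage what you can't measure": rank datasets by value.
for ds in sorted(datasets, key=value_score, reverse=True):
    print(f"{ds.name}: {value_score(ds):.2f}")
```

Which attributes enter such a score, and with what weights, would fall out of the research question itself, which is the point of the question-driven approach described next.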

Does Eagle Genomics have any specific approach to handling “big data”?

When faced with a “big data” problem we always take a question-driven approach: what are the business/scientific questions our customers are trying to answer? The questions lead to a data valuation model: what are the relevant entities (genes, diseases, associations etc.) and what makes one instance of an entity more valuable than another? The valuation model informs the data model: what attributes are required to feed the valuation model? Armed with this understanding we can catalogue the relevant datasets, curate the required metadata, build the data quality control (QC) and processing pipelines, and finally perform the informatics analysis steps. If this works properly, the information we hand over to the biologists for interpretation should be as good (fit for purpose) as it can be.
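As a rough illustration of that flow (a sketch under assumptions, not Eagle’s actual tooling), the snippet below shows how a valuation model can dictate the data model: each entity type carries the attributes needed to value it, and catalogued records missing those attributes are flagged for biocuration before any analysis. The entity types, attribute names and example record are hypothetical.

```python
# Hypothetical sketch of the question-driven flow described above: the
# research question fixes the entities of interest, the valuation model
# fixes the attributes each entity must carry, and a simple check flags
# catalogued records that cannot yet feed the valuation model.

REQUIRED_ATTRIBUTES = {
    "gene": {"symbol", "assembly", "coordinates"},
    "disease": {"ontology_id", "label"},
    "association": {"gene", "disease", "p_value", "source"},
}

def missing_attributes(entity_type: str, record: dict) -> set:
    """Return the attributes a record still lacks before it can be valued."""
    return REQUIRED_ATTRIBUTES[entity_type] - record.keys()

# A catalogued association record with incomplete metadata.
record = {"gene": "BRAF", "disease": "melanoma", "p_value": 3e-8}
gaps = missing_attributes("association", record)
if gaps:
    print(f"Needs biocuration before analysis: missing {sorted(gaps)}")
```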

How could this improve the quality of the melanoma research?

Scientific data collection and annotation is generally adequate for its primary purpose. It is therefore in data reuse and cross-functional integration that the approaches outlined above really come to the fore. Appreciation of such reuse is growing as systems approaches to cancer research become more sophisticated. In melanoma research there are already some great data resources: TCGA, ICGC, COSMIC and others. We have used these datasets in combination with our approaches to reveal genetic (gene haplotype) associations with skin cancer prognosis. Such analyses would have been far too expensive and time-consuming to contemplate using traditional approaches.

Does your company have any influence in data collection as well?

Our methods also have an impact on experimental design. When designing a study, we advise clients to consider any other potential uses for the data. This allows them to ensure that the datasets they collect have increased scientific value, a value that can be statistically measured. In many cases this is as simple as annotating data using established standards and ontologies. The case for making data FAIR (findable, accessible, interoperable, reusable) is as strong in the melanoma community as elsewhere in biomedical research.
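As a small illustrative example of what such annotation can look like (the sample is invented, and the ontology term IDs are quoted from memory and should be verified against the source ontologies), the snippet below replaces free-text metadata with identifiers from established ontologies such as the NCBI Taxonomy, the Experimental Factor Ontology (EFO) and Uberon.

```python
# Minimal illustration (not Eagle's tooling) of annotating a sample with
# established ontology terms instead of free text, keeping the record
# findable and interoperable. Term IDs quoted from memory; verify them
# against the source ontologies before use.

free_text = {"organism": "human", "disease": "melanoma", "tissue": "skin"}

ontology_annotated = {
    "organism": "NCBITaxon:9606",  # Homo sapiens (NCBI Taxonomy)
    "disease": "EFO:0000756",      # melanoma (Experimental Factor Ontology)
    "tissue": "UBERON:0002097",    # skin of body (Uberon)
}

# Identifiers, unlike free text, integrate cleanly across datasets: two
# studies that both tag EFO:0000756 are unambiguously about melanoma.
for field, term in ontology_annotated.items():
    print(f"{field}: {free_text[field]!r} -> {term}")
```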

Interview conducted by Joanna Pozniak (ESR05)

References:

Move the computation to the data:
http://www.nature.com/nature/journal/v498/n7453/full/498255a.html

Eaglehive data workflow engine:
https://www.eaglegenomics.com/efficient-data-processing-with-hpc-and-cloud-back-to-the-blackboard/

Data veracity:
http://www.datasciencecentral.com/profiles/blogs/data-veracity

Data valuation:
http://www.techrepublic.com/article/data-curation-takes-the-value-of-big-data-to-a-new-level/

TCGA:
http://cancergenome.nih.gov

ICGC:
http://icgc.org

COSMIC:
http://cancer.sanger.ac.uk/cosmic

Eagle Genomics:
https://www.eaglegenomics.com/