Pop quiz! What do the following items have in common?
- 2,120 PlayStation 4 consoles
- 33,125 iPhones (32GB)
- 21 million mp3 music files
And the answer is: 1,060,000,000,000,000 bytes of data. That’s 1.06 petabytes, which is an awful lot of information!
Interestingly, this also happens to be the amount of data stored by the International Cancer Genome Consortium (ICGC), a data bank that contains large-scale, publicly available genomic data on 50 different cancer types from across the globe. The ICGC is one of the largest banks of cancer data, but it is certainly not the only one. There are many other data banks around the world, which means that 1.06 petabytes is only a part of all the genomic data available on cancers!
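For the curious, the sums behind the quiz can be checked in a few lines of Python. This is only a rough sketch: the 32 GB iPhone capacity comes from the list above, but the 500 GB console capacity and the use of decimal units (1 GB = 10^9 bytes) are assumptions made here for illustration.

```python
# Decimal storage units (an assumption): 1 GB = 10**9 bytes, 1 PB = 10**15 bytes.
GB = 10**9
PB = 10**15

TOTAL_BYTES = 1_060_000_000_000_000  # the figure quoted in the quiz answer


def total_bytes(count, capacity_gb):
    """Combined storage, in bytes, of `count` devices of `capacity_gb` each."""
    return count * capacity_gb * GB


iphones = total_bytes(33_125, 32)    # 32 GB is stated in the quiz
consoles = total_bytes(2_120, 500)   # 500 GB per console is assumed, not stated

print(iphones == TOTAL_BYTES)
print(consoles == TOTAL_BYTES)
print(TOTAL_BYTES / PB, "petabytes")
```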
So what exactly is ‘cancer data’ and what is it used for?
Cancer researchers working on different types of cancer gather vast amounts of information, including genomic data, information about donor patients and the samples those patients have generously donated. These data are valuable resources and are stored and handled with the utmost security and caution, according to the rules laid out in the respective organisation’s or country’s Data Protection laws and Human Tissue laws.
Combined, this information helps researchers answer a wide range of questions, such as: why do some donor patients respond to therapy while others don’t? Can we predict the progression of a specific cancer type? What effects do certain lifestyle factors have on the development of certain cancers? These questions have long puzzled cancer scientists and are very important for us to answer for patients. Advancements in Bioinformatics and Systems Biology now help scientists use cancer data as a resource to answer them better.
How is cancer data generated and why do we have so much of it?
Cancer data is generated from the DNA (or RNA) in patient samples such as blood. The DNA, for example, is sequenced and stored as data, much like music is stored on a CD! Depending on the technology used, a single sample can produce anywhere from kilobytes to gigabytes of data. ‘Cancer genome studies’ are those that produce such data for groups of donor patients. Combining many cancer genome studies therefore adds up to vast amounts of data, which can be accessed via data banks such as the ICGC.
Why is it important to ‘maintain’ this vast amount of data?
The 1.06 petabytes of data from ICGC are contributions from research groups all over the world, each with their own way of working with data. When all of their data come together in a single data bank, it is very important that the information is easily understandable and reusable. For example, a melanoma research group (Group A) has performed an experiment, filed its sample details under the label ‘Sample Name’ and uploaded the corresponding data to ICGC. Another melanoma research group (Group B) is working on something similar, but has filed its sample details under the label ‘Sample ID’. Both labels are perfectly valid and the difference might appear trivial. However, to a third research group (Group C) interested in both of the above groups’ data, it might not be immediately apparent that the two labels refer to the same thing. When such ‘mismatches’ happen for many other labels, both groups’ data become difficult to reuse, which is such a shame given the resources that go into producing data! Thus, data maintenance bodies provide guidelines that can act as a ‘template’ for scientists who submit their data to a data bank. This way, valuable data can be used to its full capacity to help scientists understand and eventually treat cancer more efficiently.
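To make the idea of a ‘template’ concrete, here is a minimal Python sketch of how Group C might bring the two labels together. It is an illustration only, not how ICGC or any particular data bank actually works: the shared label `sample_id`, the `harmonise` function and the sample records are all made up for this example.

```python
# Hypothetical mapping from each group's own label to one shared template label.
LABEL_SYNONYMS = {
    "Sample Name": "sample_id",  # Group A's label
    "Sample ID": "sample_id",    # Group B's label
}


def harmonise(record, synonyms=LABEL_SYNONYMS):
    """Return a copy of `record` with known labels renamed to the template label.

    Labels that are not in `synonyms` are kept unchanged.
    """
    return {synonyms.get(label, label): value for label, value in record.items()}


# Made-up records standing in for the two groups' uploads.
group_a = {"Sample Name": "A-001", "Tissue": "skin"}
group_b = {"Sample ID": "B-042", "Tissue": "skin"}

combined = [harmonise(group_a), harmonise(group_b)]
for row in combined:
    print(row)
```

Once both records use the same label, Group C can search, sort or merge them without first guessing which labels mean the same thing.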
How important is data management in melanoma research and how is it relevant to MELGEN?
In recent years, melanoma researchers have been producing genomic data to help understand important facts about melanoma that could be relevant to patients. Some members of MELGEN work with such data, which makes us interested in data management practices. In trying to better understand how this works, Joanna Pozniak and I, two research students from the University of Leeds, spent 4 weeks at Eagle Genomics, one of the MELGEN industrial collaborators. Eagle Genomics offers many data management services. One such service is Biocuration: the process of making different types of data easily understandable and reusable by ‘fitting’ them to a common template. Guided by our experience working with Biocurators at Eagle Genomics, we plan to apply these data management principles to our research projects, hoping to take one (tiny) step towards complementing melanoma research!
For a technical description of data management in melanoma research, please refer to the interview with Dr. Will Spooner, Chief Science Officer at Eagle Genomics.
References and Links
- International Cancer Genome Consortium data portal- https://dcc.icgc.org/repositories
- Eagle Genomics homepage- https://www.eaglegenomics.com/
- ICGC homepage- http://icgc.org/
Glossary of Terms
- Genomic data– Data generated from the genome, i.e. the entire DNA of an organism. For example, ‘melanoma genomic data’ refers to data generated from a donor melanoma sample.
- Data Protection laws and Human Tissue laws– These laws require that data stored about people be kept safe and confidential and used only for the defined purposes, that the data be destroyed securely when no longer needed, and that the data be kept up to date and maintained in such a way that it can be used optimally.
- Bioinformatics– The science of processing and storing biological data.
- Systems Biology– The science of studying an organism as a whole system, instead of its individual parts.
- For ‘what is DNA?’ refer to- https://en.wikipedia.org/wiki/DNA
- For ‘what is RNA?’ refer to- https://en.wikipedia.org/wiki/RNA
- Sequencing– The process of ‘reading’ the DNA code.