There’s no doubt that Big Data is one of those concepts that is completely transforming the way we do research, but what other capabilities does a facility like the Pawsey Supercomputing Centre have that can help marine science?
In June, the Pawsey Supercomputing Centre is offering courses on a range of topics. WAMSI researchers, partners and friends are encouraged to consider taking advantage of the courses on offer or to contact WAMSI Data Manager Luke Edwards to find out more about exploring the potential of the Pawsey supercomputing capacity.
On 8-9 June, the sessions are aimed at those new to supercomputing (8th), and those wanting to understand more about using the systems (9th).
The largest of these systems is Magnus. It has over 35,000 cores, delivering in excess of 1 petaFLOP of computing power. It is the most powerful public research supercomputer in the Southern Hemisphere and debuted at #41 in the Top500 list. Marine researchers can apply for time on Magnus (details here). Magnus and other supercomputers are designed for highly parallel distributed programs.
Marine researchers who don’t have problems or software programs that can make use of Magnus should also be aware of the NeCTAR research cloud. It enables researchers to create VMs (virtual machines), similar to Amazon Web Services, and deploy research tools and software without having to run their own physical servers. This dramatically reduces the overhead for researchers to run research applications while allowing them to scale up or down the amount of processing power required.
One of the important parts of the WAMSI program is to ensure the data collected through research is protected in the longer term and held in a place where it can be accessed both for management and planning purposes.
WAMSI Data Manager Luke Edwards is based at the Pawsey Supercomputing Centre. The centre’s huge data storage capacity is available not just to radio astronomers linked to the Square Kilometre Array, but to all researchers, including marine researchers.
With more than 40 petabytes of data storage available, researchers are encouraged to apply for storage if they have more than 5 TB of data. The aim of the storage is to facilitate sharing and collaboration with research partners. To discover more about how to apply for storage visit here.
The Pawsey Supercomputing Centre Visualisation team can also provide a package of hardware, software and expertise that can assist marine researchers.
Applying visualisation techniques to difficult datasets can require specialist hardware, including high-end graphics cards for handling large datasets in real time, novel display technologies that fully exploit the human visual system, and user interface devices that make interaction easier. To find out more visit here.
“The big question raised by researchers is how do I best document my data so in ten years’ time (WAMSI 4), I can use all this great data from the Kimberley and the Dredging Node,” Mr Edwards said.
“It’s the simple stuff that is the key to good data management, like making sure you have good conventions for file names and complete metadata,” he said.
WAMSI has a requirement that all data is made publicly available and therefore it’s important that its researchers make it discoverable through creating metadata.
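The file naming and metadata conventions Mr Edwards describes can be checked automatically. The sketch below is illustrative only: the naming pattern (project, site, date, instrument, version) and the list of required metadata fields are assumptions made for this example, not a WAMSI or Pawsey standard.

```python
import re

# Hypothetical convention (an assumption for this example):
#   <project>_<site>_<YYYYMMDD>_<instrument>_v<version>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<site>[a-z0-9]+)_(?P<date>\d{8})_"
    r"(?P<instrument>[a-z0-9]+)_v(?P<version>\d+)\.(?P<ext>csv|nc)$"
)

# Hypothetical minimum set of metadata fields (also an assumption).
REQUIRED_FIELDS = (
    "title", "abstract", "creator", "date_collected", "location", "licence",
)


def parse_filename(name):
    """Return the parts of a conforming file name, or None if it breaks the convention."""
    match = FILENAME_PATTERN.match(name)
    return match.groupdict() if match else None


def missing_metadata(record):
    """Return the required fields that are absent or blank in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


print(parse_filename("kimberley_site03_20140612_ctd_v1.csv"))
print(parse_filename("final data NEW (2).csv"))
print(missing_metadata({"title": "CTD casts", "creator": "A. Researcher"}))
```

Running a check like this each time files are added is one way to catch naming and metadata gaps early, rather than at the end of a project.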
“The big mistake researchers can make is to start off with a little bit of data not thinking they’ll need formal data management,” Mr Edwards said. “They might think they can handle it and back it up on a portable hard drive. But as the project continues and you incrementally collect more data, problems get bigger and bigger and then at the end of the project there’s a massive problem and it’s really inefficient to go back through that data. Managing it correctly from the start is much more efficient.”
There are four main reasons to implement good data management:
Data security is important. Backing up data, which is a major asset for any project, is an important risk management strategy. Managing risk in relation to sensitive data, privacy issues, Intellectual Property and private industry data is also important.
Transparency is about protecting yourself. Some climate scientists, for example, have had to deal with people suggesting that they’re making up their work, that it’s a big conspiracy. Having all your data readily accessible and discoverable ensures transparency and defends you against unfounded accusations.
Studies now show that researchers who make their data open get more exposure and get more citations.
Open data is becoming more of a mandate: funders like WAMSI and peer-reviewed journals are requiring data to be made publicly accessible.
Data management at WAMSI
In terms of workflow, when a WAMSI project is finished or almost complete, its data must be deposited in the Pawsey Data Portal or in the CSIRO or AIMS data centre. Once that’s done and the metadata is finalised, there’s an 18-month embargo period, which gives the researcher time to write up papers for publication before the data is made public.
Figure: WAMSI data management workflow
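The 18-month embargo is simple date arithmetic. The sketch below assumes the embargo clock starts on the day the metadata is finalised and that a release date falling on a non-existent day (for example 31 February) is moved back to the last day of that month. Both rules are assumptions made for illustration; the article does not specify them.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 18  # embargo length stated in the article


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    If the target month is shorter than start.day, clamp to its last day
    (an assumed rule, chosen for this example).
    """
    month_index = start.year * 12 + (start.month - 1) + months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def public_release_date(metadata_finalised):
    """Date on which deposited data would become public under the assumed rules."""
    return add_months(metadata_finalised, EMBARGO_MONTHS)


print(public_release_date(date(2015, 6, 30)))
```

Under these assumptions, metadata finalised on 30 June 2015 gives a public release date of 30 December 2016.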
“So it’s all about how do I discover this data and how do I access this data,” Mr Edwards said. “The idea is to create metadata, which feeds into the national infrastructure, so you should eventually be able to do a Google search for ‘Kimberley WAMSI data’ and Google should come up with the website where the data can be accessed.”
Figure: Pawsey data portal page for WA Node Ocean Data Network
How is it being used?
The Pawsey data portal offers researchers good functionality for working with raw data. Some researchers are already using it to collaborate under restricted access.
Big data analytics enable us to find new cures and better understand and predict the spread of diseases. Police forces use big data to catch criminals and even predict criminal activity, and credit card companies use big data analytics to detect fraudulent transactions. A number of cities are even using big data analytics with the aim of turning themselves into Smart Cities, where a bus would know to wait for a delayed train and where traffic signals predict traffic volumes and operate to minimize jams.
Why is it so important?
The biggest reason big data is important to everyone is that it’s a trend that’s only going to grow.
As the tools to collect and analyze the data become less and less expensive and more and more accessible, we will develop more and more uses for it.
And, if you live in the modern world, it’s not something you can escape.
For researchers, making data FAIR (Findable, Accessible, Interoperable and Reusable) will ensure it doesn’t go to waste.
This article is based on a presentation given by Luke Edwards at the 2015 WAMSI Research Conference