FAIRifying Open Simulation Datasets to Further Mine Insights from Large Hadron Collider (LHC) Data

For scientists in observational disciplines, data is the lifeblood of research. Collecting, organizing and sharing data both within and across fields drives pivotal discoveries that benefit society and help make it more secure.

Making data open and available, however, is only part of the answer to the question of how different scientists – often with very different training – can draw useful conclusions from the same dataset. To promote and guide the curation and exchange of data, researchers have developed a set of principles intended to make data more findable, accessible, interoperable and reusable (FAIR) for both people and machines.
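In practice, "findable" and "machine-actionable" often come down to structured, standardized metadata. As an illustrative sketch – not a record from the study itself, and with a placeholder identifier and URL – a dataset landing page might carry a schema.org `Dataset` description in JSON-LD so that both people and harvesting services can locate and interpret it:

```python
import json

# Illustrative sketch (not from the study): a machine-readable metadata
# record of the kind that supports each letter of FAIR. All identifiers
# and URLs below are placeholders.
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Open simulation dataset (illustrative record)",
    "description": "Simulated Higgs boson decays with a quark and "
                   "gluon jet background.",            # Findable: rich description
    "identifier": "https://doi.org/10.XXXX/example",   # Findable: persistent ID (placeholder)
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",  # Reusable: clear terms
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "application/x-hdf5",        # Interoperable: open format
        "contentUrl": "https://example.org/dataset.h5",  # Accessible: standard protocol
    },
}

record = json.dumps(metadata, indent=2)
print(record)
```

Because the record is plain JSON-LD, a search engine, data catalog or script can parse it without any knowledge of particle physics – which is exactly the machine-actionability the principles call for.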

Although these FAIR principles were first published in 2016, researchers are still figuring out how they apply to particular datasets. In a new study, researchers from UC San Diego, the U.S. Department of Energy’s Argonne National Laboratory, MIT and the Universities of Minnesota and Illinois Urbana-Champaign have laid out a set of new practices to guide the curation of high energy physics datasets that make them more FAIR.

The research, published in Scientific Data, demonstrates how to FAIRify an open simulation dataset – consisting of simulated Higgs boson decays and a quark and gluon jet background – produced by the CMS Collaboration at the CERN Large Hadron Collider (LHC).

“The dataset is extremely complex even for expert particle physicists, so a major question related to FAIRness we sought to address was how to convey the necessary information even to nonexperts,” said Javier Duarte, a co-author of the paper, CMS collaborator and assistant professor of physics at UC San Diego. “To really enable the reusability of the massive datasets that will be produced by the LHC and other experiments, we have to ensure any scientist can understand the data.”

The production of FAIR data and other digital objects has become a powerful notion throughout the research world, aimed at increasing successful data integration and allowing for seamless service provision across multiple resources and organizations. The San Diego Supercomputer Center’s (SDSC) Research Data Services (RDS) Division Director Christine Kirkpatrick and Chief Strategist Melissa Cragin serve as leaders in FAIR data efforts via the U.S. GO FAIR Office, led out of SDSC.

“FAIR focuses attention on the need to more closely align research data management practices toward machine-actionable data, code, workflows, AI models and other digital objects,” explained Kirkpatrick.

To assist researchers from other domains and highlight the interplay between AI research and scientific visualization, the recent study also provided software tools to visualize and explore this FAIR dataset.
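A minimal sketch of that kind of exploration, using toy data only: the published dataset ships in its own documented format with its own schema, so the arrays below merely stand in for one jet-level feature (a mass in GeV is assumed here) to show the sort of quick comparison of signal and background that exploratory tools enable.

```python
import numpy as np

# Toy stand-ins for one jet-level feature (assumed: a mass in GeV).
# These are NOT drawn from the published dataset.
rng = np.random.default_rng(0)
signal = rng.normal(125.0, 10.0, size=5_000)      # toy Higgs-like mass peak
background = rng.exponential(60.0, size=20_000)   # toy falling QCD-like spectrum

# Histogram both samples over a common binning, 10 GeV per bin.
bins = np.linspace(0.0, 250.0, 26)
sig_counts, _ = np.histogram(signal, bins=bins)
bkg_counts, _ = np.histogram(background, bins=bins)

# Crude text histogram of the combined sample.
for lo, hi, s, b in zip(bins[:-1], bins[1:], sig_counts, bkg_counts):
    print(f"{lo:5.0f}-{hi:3.0f} GeV | " + "#" * int((s + b) // 250))
```

Even this toy example shows why visual exploration matters for nonexperts: a localized peak over a smoothly falling background is immediately legible in a plot, long before any machine learning model is trained on the data.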

“The FAIR principles were created to serve as goals for data producers and publishers to improve data management and stewardship practices,” said Argonne computational scientist Eliu Huerta, an author of the study. “The community expects that adhering to these principles will enhance the capabilities of machines to automate the finding and use of data, thereby streamlining the reuse of data for humans.”

In addition to building FAIR datasets, the research team also sought to understand the FAIRness of AI models. “To have a FAIR AI model, we believe you need to have a FAIR dataset to train it on,” said Yifan Chen, the first author of the paper and a graduate student at the University of Illinois Urbana-Champaign and Argonne’s Data Science and Learning division. “Applying the FAIR principles to AI models will automate and streamline the design and use of those models for scientific discovery.”

“Our goal is to shed new light on the interplay of AI models and experimental data and help create a rigorous framework for the development of AI tools to address the biggest challenges in science,” Huerta added.

“For the first five years, the focus of FAIR was on data. The conversation and practices have now moved on to making all aspects of data- and computationally intensive research FAIR, including workflows, science gateways, software and AI models,” said Kirkpatrick.

Ultimately, Huerta said, the goal of FAIRness is to create an agreed-upon set of best practices and methodologies, which will maximize the impact of AI and pave the way for the development of next-generation AI tools. “We’re looking at the entire discovery cycle, from data production and curation, design and deployment of smart and modern computing environments and scientific data infrastructures, and the combination of these to create AI frameworks that power disruptive advances in our understanding of scientific phenomena,” he said.
