“There must be some reward for sharing research data”
Publicly accessible research data (Open Research Data, or ORD) should make academic findings more transparent and easier to reproduce. The data can also be reused to study new questions or to train AI. Experts discussed where Switzerland stands in terms of ORD at an SCNAT symposium.
Research data ought to be publicly accessible. This is what various initiatives are calling for in the name of Open Research Data (ORD). swissuniversities, the ETH Domain, the Swiss National Science Foundation (SNSF) and the Swiss Academies of Arts and Sciences were mandated by the Federal government to develop an ORD Strategy and an Action Plan; the Strategy was published in 2021, followed by the Action Plan in 2022. Experiences gained since then, and the nature of the challenges, were discussed by around one hundred participants at the SCNAT symposium “Open Research Data – are we on track?” on 31 May 2024 in Bern.
Gilles Dubochet, Head of the Open Science Initiative at the Swiss Federal Institute of Technology in Lausanne (EPFL) and Head of the ORD Coordination Group, explained that the ORD Strategy has identified two fundamental areas where action is required. First, a practice must be established among researchers whereby they share their research data with their colleagues and use ORD for their own work. Second, an appropriate technical and organisational infrastructure is needed. Investments in data infrastructure should be placed on the same level as those in traditional research infrastructures. “It's one of the key achievements of the ORD Strategy that this is now on the political agenda in Switzerland,” Dubochet points out.
Switzerland is only at the beginning
“Freely available data is important so that research is transparent and reproducible,” said Evie Vergauwe of the Swiss Reproducibility Network, who is a member of the jury for the ORD Prize of the Swiss Academies of Arts and Sciences. “If research data is public, the original analyses can be reviewed, other hypotheses can be tested, or the data can be integrated into meta-studies,” Vergauwe commented.
Using an analysis of SNSF projects, she showed that the sharing of scientific data is on the increase in Switzerland, but is still at a fairly early stage. According to this analysis, at least one dataset was published for 20 to 33 per cent of projects completed in 2022, depending on the discipline. Surveys have revealed that researchers regard ORD as particularly important for the exchange of knowledge in research and the reproducibility of results. Over 50 per cent of respondents stated that they had already published research data. They see a shortage of time as the main impediment to sharing data, but they also view the absence of rights as an obstacle. They believe that the effort involved in preparing, documenting and archiving data should be recompensed. But this, they say, is not yet happening.
Communities instead of individual researchers
“ORD is fundamentally changing the way we do science,” said Jérôme Kasparian, Professor of Applied Physics at the University of Geneva. Contributions by individual researchers are visible at present because they are cited by name. This is becoming less and less the case because of ORD. When it comes to training artificial intelligence with large volumes of data – if not before – individual contributions will become less important. “We are facing a paradigm shift,” Kasparian notes.
This is why Christophe Dessimoz, Executive Director of the Swiss Institute of Bioinformatics and Head of Elixir Switzerland, believes it is important that recognition should go not only to those researchers who make and publish a scientific discovery, but also to those who supply the data for it. “Making data publicly available should no longer be a by-product but a recognised part of the research process,” he says. It may well be expedient for scientific progress if someone gives up control over their data. On the other hand, of course, it is necessary to ensure that competitors do not simply help themselves to the life's work of others and take the credit for it.
Dubochet adds that in many disciplines, science has long since ceased to function according to traditional ideas, and the contributions of individuals are absorbed into the collective. One good example of this is CERN in Geneva. The particle physicists there work as a research community. As well as sharing the costly infrastructure, they also work closely together in other ways and even develop a joint research agenda. “It is the community that produces scientific output and sees itself as the joint owner of that output,” Dubochet comments. In any case, many of today's scientific questions can only be addressed in large aggregates such as these. ORD could drive this trend forward.
Not all data should be public
Nevertheless, there are good reasons for not publishing all data, Vergauwe believes. Brain scans, for example, would raise issues of privacy and data protection. Furthermore, Kasparian adds, not all data can easily be put to further use. Certain data can only ever be understood in the context where it was collected. “It is worthless without this meta-information, which often adds up to many times more than the actual data.” Reproducibility is also impossible in many such cases.
In Dessimoz's opinion, research data, which is mostly very specific, must first be processed before it can be widely used and become ORD. He calls this “democratisation”. As well as context-dependent meta-information, there is a need for uniform structuring and general data standards. This is all the more important when AI is used. “AI is only as good as the data used to train it,” Dessimoz comments.
Commercial interests
Data is regarded as the new oil – and among tech companies, it is arousing the same kind of appetite. What does this mean for science and academia? In the research world, Kasparian says, it is normal for people to find it hard to control what happens to the knowledge that is produced. “I think it's basically okay if someone else earns money from it.” It becomes problematic, however, if researchers go to major additional effort and expense to make their data available. Dessimoz cites the Human Genome Project as an example of how companies have utilised publicly accessible data for commercial purposes. On the one hand, he says, this has led to valuable innovations – but on the other, it could have become problematic if patents had restricted the further scientific use of the gene sequences. Financial compensation from the companies also continues to be a sensitive issue. “At present, nobody is prepared to pay for data in advance without knowing whether they can turn a profit from it,” he says. But it is important to have a debate about whether and how researchers should be compensated for providing their data.
The discussions about ORD offer a good opportunity to reflect on the relationship between science as a publicly funded data producer and owner on the one hand, and private companies that do business with this data on the other, Dubochet believes. “ORD is often misunderstood as meaning simply that everything is available to everyone, freely and without control.” This is not the case. “But science needs to get a clear idea of what it wants to achieve here.”
An ORD culture is needed
Dessimoz considers that scientists and academics also have a responsibility for establishing an ORD culture and bringing it to life. The way they collect their data should be as standardised as possible; they should develop their own data collections and collaborate with existing data infrastructures. “Training, best practices and role models that colleagues can use for guidance are also important here,” Vergauwe adds. “And we must be aware that ORD means additional effort and expense for researchers,” she says. That's why appropriate incentives are required as well. Dessimoz shares this view: sharing data must have a positive career impact, in the same way as publishing a paper.
The four experts take a critical view of handing over data preparation and management to external data stewards. “Data-intensive research is largely a reality nowadays,” Dubochet says, “and that's why it should be taken for granted that adept data handling is part of academic research.”
Investments in infrastructure
Alongside the researchers' initiative, building up an appropriate data infrastructure is another important pillar. “Switzerland is increasing its investments for this purpose, but we haven't yet reached the level that's necessary,” Dubochet notes. Furthermore, according to Dessimoz, the existing data platforms ought to press ahead with standardising their data so it can be used more widely. Closer collaboration with researchers is also needed to achieve this.
Finally, research institutions and funders should stipulate the publication of standardised data for projects, and they ought to reward the efforts involved. But, Dubochet adds, it is unrealistic to expect politicians to award separate funding for ORD. “ORD has to be funded within research projects.” He goes on to say that it is important for research institutions to grant academics more freedom so they can explore what is possible in the field of ORD with the help of new tools or AI.
First flagship projects
Adriano Rutz of ETH Zurich showed the symposium participants what this can lead to, using the example of the LOTUS Initiative, which he helped to develop. This freely accessible data platform links molecular structures with the biological organisms in which they are found. The platform, which contains around 750,000 referenced structure-organism pairs, offers new ways of sharing and expanding knowledge in natural products research. The project's pioneering nature earned it the ORD Prize of the Swiss Academies of Arts and Sciences in 2023.