Developing Data Capabilities for AI at ZIB

Interview with ZIB researcher Tim Conrad

Editorial Team: Mr Conrad, thank you for taking the time to talk to us. Can you briefly explain what the Data Lake at the Zuse Institute Berlin (ZIB) is and what role it plays?

Tim Conrad: My pleasure. The Data Lake at ZIB is a central data infrastructure – or repository – that can be used to store, manage, and share research data of (almost) any size. The interesting part is that all kinds of data can be stored, whether structured (think of tables), unstructured (think of images or videos), or somewhere in between (for example, logging data). A big advantage of the lake is that you can keep track of the various versions of your datasets, as you might already know from Git repositories. You can even create multiple branches of the same dataset, so that different team members can work on the same data without changing it for the rest of the team. Overall, I believe this allows researchers at ZIB and beyond to work with and exchange data more efficiently.

Editorial Team: How exactly can the Data Lake be used in research?

Tim Conrad: One key aspect is the support for data-intensive research projects. Our scientists can store very large datasets from simulations or experiments and share them with others – which can be more complicated with standard file-system directories. However, scientists whose datasets are not so large can also benefit from the Data Lake, in particular when they want to use modern analysis methods such as artificial intelligence and machine learning, which can be run more efficiently on a data lake.

Editorial Team: Could you elaborate on how AI and machine learning benefit from the Data Lake?

Tim Conrad: Absolutely. Machine learning and AI models require high-quality data for effective training. The Data Lake’s integrated metadata and search functions help researchers quickly find and reuse relevant datasets without duplication. This is especially valuable in interdisciplinary projects, where different team members may need access to specific datasets while applying distinct methodological approaches.

Editorial Team: That sounds like an efficient way to manage and use research data. Can you give an example of a specific project at ZIB that is already benefiting from the Data Lake?

Tim Conrad: One such project is MaRDI, the Mathematical Research Data Initiative, where ZIB is one of the project partners. MaRDI focuses on the systematic collection, management, and use of mathematical research data. A key component of this initiative is the development of standards and technologies for data management in mathematics, facilitating the exchange of mathematical models and research findings. By leveraging the Data Lake, MaRDI provides a structured environment where researchers can store and access high-quality, reusable datasets, particularly when these are too large for standard data repositories.

Editorial Team: How do the Data Lake at ZIB and MaRDI complement each other?

Tim Conrad: MaRDI directly benefits from the Data Lake as it provides a reliable infrastructure for storing and managing mathematical data. At the same time, MaRDI contributes important insights into the development of specific solutions for managing complex mathematical datasets. The close integration of both systems ensures the sustainable use of mathematical research data and enables more efficient collaboration within the mathematical community.

Editorial Team: That’s really impressive. How do you see the future of the Data Lake at ZIB?

Tim Conrad: The future lies in expanding its capabilities. We’re working on integrating more advanced search and indexing functions to make data discovery even easier. Another exciting direction is enabling real-time data analysis within the lake, so researchers can process and analyze data directly without having to move large datasets. Finally, we want to enhance interoperability with other national and international research infrastructures to facilitate broader scientific collaboration.

Editorial Team: Thank you for these insights, Mr Conrad. It’s clear that the Data Lake at ZIB can indeed become an important resource for supporting data-driven research.

Tim Conrad: Thank you, it was a pleasure discussing this!