Publications

EXA4MIND

EXA4MIND relies on a co-design approach, where technology partners from computing centres and universities and application partners from industry, academia and SMEs design an Extreme Data infrastructure in close collaboration.

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

Scientific Publication

27

Authors: Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, Josef Sivic.

Publication date: 2025.

Published at the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions. This is a challenging problem as it requires generating multi-step image sequences to achieve a complex goal while being grounded in a specific environment.

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Scientific Publication

27

Authors: Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic.

Publication date: 2024.

Published at the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Authors address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, their generated images preserve the environment and transform objects in the initial image.

OccFeat: Self-supervised Occupancy Feature Prediction for Pretraining BEV Segmentation Networks

Scientific Publication

26

Authors: Sirko-Galouchenko, S. et al.

Publication date: 2024.

Published at the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird’s-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides a 3D geometric understanding of the scene to the model. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach.

Halton Scheduler For Masked Generative Image Transformer

Scientific Publication

25

Authors: Besnier, V., Chen, M., Hurych, D., Valle, E., Cord, M.

Publication date: 2025.

Published at ICLR 2025.

Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT’s token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects the token’s position according to a quasi-random, low-discrepancy Halton sequence. Intuitively, that method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it allows reducing non-recoverable sampling errors, leading to simpler hyper-parameter tuning and better quality images.
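
As a rough illustration of the idea of choosing token positions from a Halton sequence, the sketch below orders the cells of a token grid using the standard 2D Halton construction (radical inverses in bases 2 and 3). It is not the authors' implementation: the function names, the choice of bases, the grid size and the rule of skipping already visited cells are all assumptions made for this example.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer, a value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_token_order(height, width):
    """Return every (row, col) cell of a height x width grid in Halton order.

    Hypothetical helper: bases 2 and 3 and the skip-duplicates rule are
    assumptions for illustration, not details taken from the paper.
    """
    order, seen = [], set()
    index = 1
    while len(order) < height * width:
        row = int(radical_inverse(index, 2) * height)
        col = int(radical_inverse(index, 3) * width)
        if (row, col) not in seen:  # keep only the first visit to each cell
            seen.add((row, col))
            order.append((row, col))
        index += 1
    return order


if __name__ == "__main__":
    # Print the first few positions that would be unmasked on a 4 x 4 grid.
    print(halton_token_order(4, 4)[:6])
```

Because consecutive Halton points land far apart, any prefix of this ordering covers the grid fairly evenly, which is the spatial spreading the abstract describes.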

Database-aided Workflows on HPC for Plasma Physics Simulation

Scientific Publication

24

Authors: Pauw, V., Ruhl, H., Martinovič, J., Hachinger, S.

Publication date: 2025.

Published at the International Symposium on Grids & Clouds (ISGC 2025).

The on-going work presented at ISGC25 Physics and Engineering track and in this article explores different technical approaches for management and analysis of data obtained from large physics simulations, with the goal of optimizing data-driven workflows across Cloud-Computing (IaaS) and HPC systems. The work is connected to research activities on database usage in the context of High Performance Computing conducted through the EXA4MIND Horizon Europe project. The case presented here is a large-scale parameter study of plasma physics simulations carried out on supercomputing systems at LRZ (Garching b.M., DE).

Terazi: AI Fairness Tool for Doubly Imbalanced Data

Scientific Publication

23

Authors: Ozturk, A. U., Sever, Y., Yalcin, A., Pauw, V., Hachinger, S., Toroslu, I. H., Karagoz, P.

Publication date: 2025.

Published at the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25).

The field of Artificial Intelligence (AI) fairness focuses on developing unbiased approaches for machine learning problems, with many contributions and ready-to-use tools. However, existing solutions fall short when both sensitive attributes and target labels have imbalanced representations in a given dataset. Our proposed algorithm and its tool, terazi, aim to propose a fair AI solution for this doubly imbalanced case. The proposed solution is based on finding the optimal distribution within the imbalanced data to balance fairness and classification performance, and the tool facilitates using this solution. In this demonstration, we showcase the capabilities of our algorithm, and the easy-to-use GUI of our web application for data scientists, researchers, and AI practitioners.

Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by Utilizing Generative LLMs

Scientific Publication

22

Authors: Ozturk, A. U., Cekinel, R. F., Karagoz, P.

Publication date: 2025.

Published at the 18th Workshop on Building and Using Comparable Corpora (BUCC).

Satire detection is essential for accurately extracting opinions from textual data and combating misinformation online. However, the lack of diverse corpora for satire leads to the problem of stylistic bias which impacts the models’ detection performances. This study proposes a debiasing approach for satire detection, focusing on reducing biases in training data by utilizing generative large language models. The approach is evaluated in both cross-domain (irony detection) and cross-lingual (English) settings. Results show that the debiasing method enhances the robustness and generalizability of the models for satire and irony detection tasks in Turkish and English. However, its impact on causal language models, such as Llama-3.1, is limited. Additionally, this work curates and presents the Turkish Satirical News Dataset with detailed human annotations, with case studies on classification, debiasing, and explainability.

Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies

Scientific Publication

21

Authors: Cekinel, R. F., Karagoz, P., Çöltekin, C.

Publication date: 2025.

Published at the 31st International Conference on Computational Linguistics.

This study evaluates the effectiveness of Vision Language Models (VLMs) in representing and utilizing multimodal content for fact-checking. To be more specific, we investigate whether incorporating multimodal content improves performance compared to text-only models and how well VLMs utilize text and image information to enhance misinformation detection. Furthermore, we propose a probing classifier based solution using VLMs. Our approach extracts embeddings from the last hidden layer of selected VLMs and inputs them into a neural probing classifier for multi-class veracity classification. Through a series of experiments on two fact-checking datasets, we demonstrate that while multimodality can enhance performance, fusing separate embeddings from text and image encoders yielded superior results compared to using VLM embeddings. Furthermore, the proposed neural classifier significantly outperformed KNN and SVM baselines in leveraging extracted embeddings, highlighting its effectiveness for multimodal fact-checking.

Explaining Veracity Predictions with Evidence Summarization: A Multi-Task Model Approach

Scientific Publication

20

Authors: Cekinel, R. F., Karagoz, P.

Publication date: 2024.

Published at the 2024 IEEE International Conference on Big Data (BigData).

The rapid dissemination of misinformation through social media increased the importance of automated fact-checking. Furthermore, studies on what deep neural models pay attention to when making predictions have increased in recent years. While significant progress has been made in this field, it has not yet reached a level of reasoning comparable to human reasoning. To address these gaps, we propose a multi-task explainable neural model for misinformation detection. More specifically, this work formulates an explanation generation process of the model’s veracity prediction as a text summarization problem. Additionally, the performance of the proposed model is discussed on publicly available datasets and the findings are evaluated with related studies.

Toward Accurate RNA Folding Thermodynamics: Evaluation of Enhanced Sampling Methods for Force Field Benchmarking

Scientific Publication

19

Authors: Petra Kührová, Vojtěch Mlýnský, Michal Otyepka, Jiří Šponer, and Pavel Banáš.

Publication date: 2026.

Published in the Journal of Chemical Theory and Computation (JCTC), an ACS journal.

Biologically functional RNAs operate near marginal stability, and their rugged free-energy landscapes and profound structural dynamics – typically not captured by structural biology experiments – play decisive roles. Atomistic molecular dynamics (MD) simulations provide a unique means to characterize these features. However, the applicability of atomistic MD is currently limited by accessible simulation timescales and, most importantly, by force-field (FF) accuracy. Folding free energies (ΔG°fold) of small RNA motifs represent well-defined targets for quantitative benchmarking of RNA FFs. In practice, however, obtaining thermodynamic estimates that are sufficiently robust for direct comparison with experimental data remains highly challenging, even for small RNA systems, and many published studies rely on sampling that is not fully converged. Here, we systematically assess the performance of widely used advanced enhanced-sampling techniques using the 8-mer r(gcGAGAgc) tetraloop as a representative benchmark system. We test temperature replica exchange (T-REMD), two solute-tempering variants of replica exchange (REST2 and REHT), as well as well-tempered metadynamics and on-the-fly probability enhanced sampling combined with solute tempering (ST-MetaD and ST-OPES).

Destabilization of Structured RNAs by OPC and TIP4PD Water Models

Scientific Publication

18

Authors: Krepl M., Mlýnský V., Rusnáková A., Banáš P., Otyepka M., Šponer J.

Publication date: 2026.

Published by the Journal of Chemical Theory and Computation.

The four-point OPC water model has recently gained a reputation as the preferred choice for molecular dynamics (MD) simulations of nucleic acids and proteins, providing more realistic reproduction of bulk physical properties of water than the older three-point models. It has been shown to improve, for example, simulations of unstructured biomolecules such as RNA tetranucleotides or intrinsically disordered proteins. However, the performance for folded RNA structures was not specifically explored. Here we present extensive testing of the OPC water model on three different RNAs with intricate tertiary structures – the ribosomal L1 stalk RNA-protein protuberance, the mini tetraloop-tetraloop receptor (miniTTR-6) folded RNA, and the GAAA tetraloop-tetraloop receptor homodimer.

Data management for distributed computational workflows: An iRODS-based setup and its performance

Scientific Publication

17

Authors: Hayek M., Golasowski M., Hachinger S., García-Hernández R.J., Munke J., Lindner G., Slaninová K., Tunka P., Vondrák V., Kranzlmüller D., Martinovič J.

Publication date: 2025.

Published by PLOS ONE.

Modern data-management frameworks promise a flexible and efficient management of data and metadata across storage backends. However, such claims need to be put to a meaningful test in daily practice. We conjecture that such frameworks should be fit to construct a data backend for workflows which use geographically distributed high-performance and cloud computing systems. Cross-site data transfers within such a backend should largely saturate network bandwidth, in particular when parameters such as buffer sizes are optimized. To explore this further, we evaluate the “integrated Rule-Oriented Data System” iRODS with EUDAT’s B2SAFE module as data backend for the “Distributed Data Infrastructure” within the LEXIS Platform for complex computing workflow orchestration and distributed data management.

The Kink-Turn 7 Motif: An Additional Test for RNA Force Field Performance

Scientific Publication

16

Authors: Lemmens, Toon; Mlýnský, Vojtěch; Šponer, Jiří; Pykal, Martin; Banáš, Pavel; Otyepka, Michal; Krepl, Miroslav.

Publication date: December 2025.

Published by the Journal of Chemical Theory and Computation.

The kink-turn is a recurrent RNA structural motif that induces a sharp bend (kink) in the A-form RNA helix. It is defined by key structural features, including consecutive sheared AG base pairs, an A-minor interaction, and multiple base–sugar interactions. An accurate representation of these densely packed noncanonical interactions by molecular dynamics simulations poses a significant challenge for contemporary force fields (FFs). Here, we present extended simulations of the ribosomal kink-turn 7 (Kt-7) from H.m., the so-called “consensual” kink-turn, using a broad spectrum of pair-additive and polarizable RNA FFs. None of the tested FFs manage to flawlessly describe all of the structural features of the Kt-7 although several FFs provide rather acceptable results and should not cause problems in simulations of larger RNAs containing a kink-turn. On aggregate, the widely used OL3 (ff99bsc0χOL3) and polarizable AMOEBA FFs achieve the best performance for this motif. Interestingly, some more recently parametrized FF variants struggle to describe the Kt-7’s tertiary A-minor interaction – a ubiquitous tertiary contact in RNA. This raises some concerns about the broader applicability of these FFs and suggests that they may be overfitted to small model systems, such as RNA tetranucleotides. In some cases, irreversible unkinking of the entire kink-turn motif can also be observed. The kink-turn motif is highly sensitive to variations in RNA FFs, and we strongly recommend its inclusion in training and benchmarking data sets as an important regression test to improve the robustness and accuracy of RNA FF parametrization.

Annealed Winner-Takes-All for Motion Forecasting

Scientific Publication

15

Authors: Yihong Xu, Victor Letzelter, Mickaël Chen, Éloi Zablocki, Matthieu Cord.

Publication date: 2025.

Published at the IEEE International Conference on Robotics and Automation (ICRA 2025).

In autonomous driving, motion prediction aims at forecasting the future trajectories of nearby agents, helping the ego vehicle to anticipate behaviors and drive safely. A key challenge is generating a diverse set of future predictions, commonly addressed using data-driven models with Multiple Choice Learning (MCL) architectures and Winner-Takes-All (WTA) training objectives. However, these methods face initialization sensitivity and training instabilities. Additionally, to compensate for limited performance, some approaches rely on training with a large set of hypotheses, requiring a post-selection step during inference to significantly reduce the number of predictions. To tackle these issues, we take inspiration from annealed MCL, a recently introduced technique that improves the convergence properties of MCL methods through an annealed Winner-Takes-All loss (aWTA). In this paper, we demonstrate how the aWTA loss can be integrated with state-of-the-art motion forecasting models to enhance their performance using only a minimal set of hypotheses, eliminating the need for the cumbersome post-selection step. Our approach can be easily incorporated into any trajectory prediction model normally trained using WTA and yields significant improvements.
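
The difference between a plain Winner-Takes-All objective and an annealed one can be sketched as follows. This is a minimal illustration, assuming the annealed loss weights each hypothesis by a softmin of its error at a temperature that is lowered during training; the function names and the exact weighting are assumptions for this example and are not taken from the paper. In a real training loop the weights would typically be treated as constants when computing gradients.

```python
import math


def awta_weights(losses, temperature):
    """Softmin weights over per-hypothesis losses (assumed annealing scheme)."""
    best = min(losses)
    # Subtracting the best loss keeps the exponentials numerically stable.
    scores = [math.exp(-(loss - best) / temperature) for loss in losses]
    total = sum(scores)
    return [score / total for score in scores]


def awta_loss(losses, temperature):
    """Weighted sum of hypothesis losses; approaches min(losses) as temperature -> 0."""
    weights = awta_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))


if __name__ == "__main__":
    hypothesis_losses = [1.0, 2.0, 4.0]  # hypothetical per-trajectory errors
    for temperature in (10.0, 1.0, 0.1, 0.01):
        print(temperature, awta_loss(hypothesis_losses, temperature))
```

At high temperature every hypothesis receives a similar share of the training signal, which is one way to reduce sensitivity to initialization; as the temperature drops, the objective concentrates on the best hypothesis, as in standard WTA.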

DocSpider: a dataset of cross-domain natural language querying for MongoDB

Scientific Publication

14

Authors: Arif Görkem Özer, Recep Firat Cekinel, Ismail Hakki Toroslu, Pinar Karagoz.

Publication date: 2025.

Published by Cambridge University Press.

Natural language querying allows users to formulate questions in a natural language without requiring specific knowledge of the database query language. Large language models have been very successful in addressing the text-to-SQL problem, which is about translating given questions in textual form into SQL statements. Document-oriented NoSQL databases are gaining popularity in the era of big data due to their ability to handle vast amounts of semi-structured data and provide advanced querying functionalities. However, studies on text-to-NoSQL systems, particularly on systems targeting document databases, are very scarce. In this study, we utilize large language models to create a cross-domain natural language to document database query dataset, DocSpider, leveraging the well-known text-to-SQL challenge dataset Spider. As a document database, we use MongoDB. Furthermore, we conduct experiments to assess the effectiveness of the DocSpider dataset to fine-tune a text-to-NoSQL model against a cross-language transfer learning approach, SQL-to-NoSQL, and zero-shot instruction prompting. The experimental results reveal a significant improvement in the execution accuracy of fine-tuned language models when utilizing the DocSpider dataset.

Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation

Scientific Publication

13

Authors: A. Vobecky, D. Hurych, O. Siméoni, S. Gidaris, A. Bursuc, P. Pérez and J. Sivic.

Publication date: 2025.

Published by the International Journal of Computer Vision.

Semantic image segmentation models typically require extensive pixel-wise annotations, which are costly to obtain and prone to biases. This work investigates learning semantic segmentation in urban scenes without any manual annotation. Researchers propose a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras and LiDAR sensors, thus eliminating the need for manual labeling. Researchers show the generalization capabilities of their method by testing on four different testing datasets (Cityscapes, Dark Zurich, Nighttime Driving, and ACDC) without any fine-tuning. They present an in-depth experimental analysis of the proposed model including results when using another pre-training dataset, per-class and pixel accuracy results, confusion matrices, PCA visualization, k-NN evaluation, ablations of the number of clusters and LiDAR’s density, supervised finetuning as well as additional qualitative results and their analysis.

Can We Ever Develop an Ideal RNA Force Field? Lessons Learned from Simulations of the UUCG RNA Tetraloop and Other Systems

Scientific Publication

12

Authors: Vojtěch Mlýnský, Petra Kührová, Martin Pykal, Miroslav Krepl, Petr Stadlbauer, Michal Otyepka, Pavel Banáš and Jiří Šponer.

Publication date: 2025.

Published by the Journal of Chemical Theory and Computation.

In this work, researchers present a comprehensive evaluation of widely used pair-additive and polarizable RNA ffs using the challenging UUCG tetraloop (TL) benchmark system. Extensive standard MD simulations, initiated from the NMR structure of the 14-mer UUCG TL, revealed that most ffs did not maintain the native state, instead favoring alternative loop conformations. Notably, three very recent variants of pair-additive ffs, OL3_CP–gHBfix21, DES-Amber, and OL3_R2.7, successfully preserved the native structure over a 10 × 20 μs time scale. To further assess these ffs, researchers performed enhanced sampling folding simulations of the shorter 8-mer UUCG TL, starting from the single-stranded conformation. Estimated folding free energies (ΔG°_fold) varied significantly among these three ffs, with values of 0.0 ± 0.6, 2.4 ± 0.8, and 7.4 ± 0.2 kcal/mol for OL3_CP–gHBfix21, DES-Amber, and OL3_R2.7, respectively. The ΔG°_fold value predicted by the OL3_CP–gHBfix21 ff was closest to experimental estimates, ranging from −1.6 to −0.7 kcal/mol. In contrast, the higher ΔG°_fold values obtained using DES-Amber and OL3_R2.7 were unexpected, suggesting that key interactions are inaccurately described in the folded, unfolded, or misfolded ensembles. These discrepancies led them to further test DES-Amber and OL3_R2.7 ffs on additional RNA and DNA systems, where further performance issues were observed. The results emphasize the complexity of accurately modeling RNA dynamics and suggest that creating an RNA ff capable of reliably performing across a wide range of RNA systems remains extremely challenging. In conclusion, the study provides valuable insights into the capabilities of current RNA ffs and highlights key areas for future ff development.
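
To give a feel for what the folding free energies quoted above mean, the sketch below converts ΔG°_fold into a folded population using the textbook two-state relation ΔG°_fold = −RT ln(p_folded / p_unfolded). The two-state assumption, the temperature of 298 K and the function names are assumptions made for this illustration; the abstract does not state how the published values were derived or at which temperature.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal/(mol*K)


def folded_fraction(delta_g_fold, temperature=298.0):
    """Folded population for a two-state system (temperature is an assumed default)."""
    rt = GAS_CONSTANT * temperature
    return 1.0 / (1.0 + math.exp(delta_g_fold / rt))


def folding_free_energy(p_folded, temperature=298.0):
    """Inverse relation: free energy in kcal/mol from a folded population in (0, 1)."""
    rt = GAS_CONSTANT * temperature
    return -rt * math.log(p_folded / (1.0 - p_folded))


if __name__ == "__main__":
    # Values quoted in the abstract above, in kcal/mol.
    for delta_g in (-1.6, -0.7, 0.0, 2.4, 7.4):
        print(delta_g, folded_fraction(delta_g))
```

Under these assumptions a value of 0.0 kcal/mol corresponds to equal folded and unfolded populations, negative values favor the folded state, and increasingly positive values correspond to a rapidly shrinking folded population.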

Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive?

Scientific Publication

11

Authors: Yihong Xu, Loïck Chambon, Éloi Zablocki, Mickaël Chen, Alexandre Alahi, Matthieu Cord, Patrick Pérez.

Publication date: 2024.

Published at the IEEE International Conference on Robotics and Automation (ICRA 2024).

Motion forecasting is crucial in enabling autonomous vehicles to anticipate the future trajectories of surrounding agents. To do so, it requires solving mapping, detection, tracking, and then forecasting problems, in a multi-step pipeline. In this complex system, advances in conventional forecasting methods have been made using curated data, i.e., with the assumption of perfect maps, detection, and tracking. This paradigm, however, ignores any errors from upstream modules. Meanwhile, an emerging end-to-end paradigm, that tightly integrates the perception and forecasting architectures into joint training, promises to solve this issue. However, the evaluation protocols between the two methods were so far incompatible and their comparison was not possible. In fact, conventional forecasting methods are usually not trained nor tested in real-world pipelines (e.g., with upstream detection, tracking, and mapping modules). In this work, we aim to bring forecasting models closer to the real-world deployment. First, we propose a unified evaluation pipeline for forecasting methods with real-world perception inputs, allowing us to compare conventional and end-to-end methods for the first time. Second, our in-depth study uncovers a substantial performance gap when transitioning from curated to perception-based data. In particular, we show that this gap (1) stems not only from differences in precision but also from the nature of imperfect inputs provided by perception modules, and that (2) is not trivially reduced by simply finetuning on perception outputs. Based on extensive experiments, we provide recommendations for critical areas that require improvement and guidance towards more robust motion forecasting in the real world.

Regularizing Self-supervised 3D Scene Flows with Surface Awareness and Cyclic Consistency

Scientific Publication

10

Authors: Patrik Vacek, David Hurych, Karel Zimmermann, Patrick Perez, Tomas Svoboda.

Publication date: 2024.

Published on arXiv, 2024.

Learning without supervision how to predict 3D scene flows from point clouds is essential to many perception systems. We propose a novel learning framework for this task which improves the necessary regularization. Relying on the assumption that scene elements are mostly rigid, current smoothness losses are built on the definition of “rigid clusters” in the input point clouds. The definition of these clusters is challenging and has a significant impact on the quality of predicted flows. We introduce two new consistency losses that enlarge clusters while preventing them from spreading over distinct objects. In particular, we enforce temporal consistency with a forward-backward cyclic loss and spatial consistency by considering surface orientation similarity in addition to spatial proximity. The proposed losses are model-independent and can thus be used in a plug-and-play fashion to significantly improve the performance of existing models, as demonstrated on two most widely used architectures. We also showcase the effectiveness and generalization capability of our framework on four standard sensor-unique driving datasets, achieving state-of-the-art performance in 3D scene flow estimation.
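
The general idea behind a forward-backward cyclic loss can be sketched as follows: move each point by its predicted forward flow, apply the predicted backward flow at the new location, and penalize the distance to the starting position. This is a generic illustration and not the loss defined in the paper; the function name, the callable used to query the backward flow and the plain mean of Euclidean distances are all assumptions.

```python
import math


def cyclic_consistency_loss(points, forward_flow, backward_flow_at):
    """Mean distance between each point and its position after a forward-backward trip.

    points and forward_flow are sequences of (x, y, z) tuples; backward_flow_at is a
    hypothetical callable returning the backward flow vector at a given 3D location.
    """
    total = 0.0
    for point, flow in zip(points, forward_flow):
        warped = tuple(p + f for p, f in zip(point, flow))
        back = backward_flow_at(warped)
        returned = tuple(w + b for w, b in zip(warped, back))
        total += math.dist(point, returned)
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
    fwd = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    # A backward flow that exactly undoes the forward motion is fully consistent.
    print(cyclic_consistency_loss(pts, fwd, lambda q: (-1.0, 0.0, 0.0)))
```

A loss of this shape needs no ground-truth flow labels, which is why it fits a self-supervised setting.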

Let-It-Flow: Simultaneous Optimization of 3D Flow and Object Clustering

Scientific Publication

9

Authors: Patrik Vacek, David Hurych, Tomáš Svoboda, Karel Zimmermann.

Publication date: 2024.

Published in IEEE Transactions on Intelligent Vehicles.

We study the problem of self-supervised 3D scene flow estimation from real large-scale raw point cloud sequences, which is crucial to various tasks like trajectory prediction or instance segmentation. In the absence of ground truth scene flow labels, contemporary approaches concentrate on optimizing flow across sequential pairs of point clouds by incorporating structure-based regularization on flow and object rigidity. The rigid objects are estimated by a variety of 3D spatial clustering methods. While state-of-the-art methods successfully capture overall scene motion using the Neural Prior structure, they encounter challenges in discerning multi-object motions. We identified the structural constraints and the use of large and strict rigid clusters as the main pitfall of the current approaches and we propose a novel clustering approach that allows for combination of overlapping soft clusters as well as non-overlapping rigid clusters representation. Flow is then jointly estimated with progressively growing non-overlapping rigid clusters together with fixed size overlapping soft clusters. We evaluate our method on multiple datasets with LiDAR point clouds, demonstrating the superior performance over the self-supervised baselines reaching new state-of-the-art results. Our method especially excels in resolving flow in complicated dynamic scenes with multiple independently moving objects close to each other which includes pedestrians, cyclists and other vulnerable road users.

MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments

Scientific Publication

8

Authors: Spyros Gidaris, Andrei Bursuc, Oriane Simeoni, Antonin Vobecky, Nikos Komodakis, Matthieu Cord, Patrick Pérez.

Publication date: 2024.

Published in Transactions on Machine Learning Research (TMLR) 2024.

Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. Doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols with a training that is at least 3 times faster than prior methods.

Management of Physics Simulations with Databases

Scientific Publication

7

Authors: Viktoria Pauw, David Číž, Vojtěch Mlýnský, Pavel Banáš, Michal Otyepka, Stephan Hachinger and Jan Martinovič.

Publication date: 2024.

Published in Proceedings of Science.

The on-going work presented in this article explores different technical approaches and systems for management and analysis of data obtained from large physics simulations, optimising the respective data-driven workflows across Cloud-Computing (IaaS) and HPC systems. The work is carried out in the context of the EXA4MIND Horizon Europe project, which produces an Extreme Data processing platform, bringing together specialised data management systems and powerful computing infrastructures. We evaluate two typical use cases with physics simulations carried out on supercomputing systems at LRZ and IT4Innovations. These use cases come from different areas of physics – they focus on the treatment of low energy many-body systems of molecules, and of high-energy (relativistic) elementary particles, respectively.

Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish

Scientific Publication

6

Authors: R. F. Cekinel, Ç. Çöltekin, P. Karagoz.

Publication date: 2024.

Published at the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation.

The rapid spread of misinformation through social media platforms has raised concerns regarding its impact on public opinion. While misinformation is prevalent in other languages, the majority of research in this field has concentrated on the English language. Hence, there is a scarcity of datasets for other languages, including Turkish. To address this concern, we have introduced the FCTR dataset, consisting of 3238 real-world claims. This dataset spans multiple domains and incorporates evidence collected from three Turkish fact-checking organizations. Additionally, we aim to assess the effectiveness of cross-lingual transfer learning for low-resource languages, with a particular focus on Turkish.

Three Pillars improving Vision Foundation Model Distillation for Lidar

Scientific Publication

5

Authors: G. Puy, S. Gidaris, A. Boulch, O. Siméoni, C. Sautier, P. Pérez, A. Bursuc, R. Marlet.

Publication date: 2024.

Published at the CVPR Conference.

Self-supervised image backbones can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. Ideally, 3D backbones for lidar should be able to inherit these properties after distillation of these powerful 2D features. The most recent methods for image-to-lidar distillation on autonomous driving data show promising results, obtained thanks to distillation methods that keep improving. Yet, we still notice a large performance gap when measuring the quality of distilled and fully supervised features by linear probing. In this work, instead of focusing only on the distillation method, we study the effect of three pillars for distillation: the 3D backbone, the pretrained 2D backbones, and the pretraining dataset. In particular, thanks to our scalable distillation method named ScaLR, we show that scaling the 2D and 3D backbones and pretraining on diverse datasets leads to a substantial improvement of the feature quality. This allows us to significantly reduce the gap between the quality of distilled and fully-supervised 3D features, and to improve the robustness of the pretrained backbones to domain gaps and perturbations.

IndexAI: AI Based Index Selection for NoSQL Databases (pre-print version)

Scientific Publication

4

Authors: M. M. Khosravi, P. Karagoz, I. H. Toroslu.

Publication date: 2024.

Published at the IEEE International Conference on Big Data.

In this work, we consider the automated index selection for NoSQL databases and investigate the feasibility of supervised learning and reinforcement learning based solutions. The experiments conducted on the YCSB dataset show that reinforcement learning improves index selection performance as in relational databases, and supervised learning gives promising results and can be considered applicable given a sufficient amount of training data.

POP3D: Open-Vocabulary 3D Occupancy Prediction from Images

Scientific Publication

3

Authors: A. Vobecky, O. Siméoni, D. Hurych, S. Gidaris, A. Bursuc, P. Pérez, J. Sivic.

Publication date: 2023.

Published at the Advances in Neural Information Processing Systems (NeurIPS) Conference.

This research describes an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold: a new model architecture for open-vocabulary 3D semantic occupancy prediction; a tri-modal self-supervised learning algorithm that leverages three modalities, (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations; and a quantitative demonstration of the strengths of the proposed model on several open-vocabulary tasks.

Simple Adjustment of Intranucleotide Base-Phosphate Interaction in the OL3 AMBER Force Field Improves RNA Simulations

Scientific Publication

2

Authors: V. Mlýnský, P. Kührová, P. Stadlbauer, M. Krepl, M. Otyepka, P. Banáš, J. Šponer.

Publication date: 2023.

Published in the Journal of Chemical Theory and Computation.

Molecular dynamics (MD) simulations represent an established tool to study RNA molecules. The outcome of MD studies depends, however, on the quality of the force field (ff). Here researchers suggest a correction for the widely used AMBER OL3 ff by adding a simple adjustment of the nonbonded parameters. The research suggests that the combination of OL3 RNA ff and NBfix0BPh modification is a viable option to improve RNA MD simulations.

Wine in the Cloud, or: Smart Vineyards with a Distributed "Extreme Data Database" and Supercomputing

Scientific Publication

1

Authors: P. Harsh, S. Hachinger, M. Derquennes, A. Edmonds, P. Karagoz, M. Golasowski, M. Hayek and J. Martinovič.

Publication date: 2023.

Published in the Proceedings of Science journal.

In this contribution, researchers sketch an application of Earth System Sciences and Cloud-/Big-Data-based IT, which shall soon leverage European supercomputing facilities: smart viticulture, as put into practice by Terraview. TerraviewOS is a smart vineyard ‘operating system’, allowing wine cultivators to optimise irrigation, harvesting dates and measures against plant diseases. The system relies on satellite and drone imagery as well as in-situ sensors where available. The substantial need for computing power in TerraviewOS, in particular for training AI-based models to generate derived data products, makes the further development of some of its modules a prime application case for the EXA4MIND project.

Data Management Plan (Updated)

Public Deliverable

8.3

The Data Management Plan lays out our concept for handling main aspects of the life cycle of the project data (data organisation and long-term storage, access, preservation and sharing). This document also includes a specification of outputs (datasets generated during the project).

Dissemination and Communication Final Report

Public Deliverable

7.4

This deliverable provides the final overview of communication and dissemination activities carried out throughout the EXA4MIND project. It consolidates the outreach actions implemented during the project lifetime and assesses how these activities contributed to the visibility of project results, scientific outputs and ecosystem collaborations.

Demonstration of the Tool for Industry Application Case and KPI Measures

Public Deliverable

5.3

This deliverable presents the final demonstration and validation of the tools and AI models developed for the EXA4MIND Industry Application Case, specifically targeting Advanced Driving Assistance Systems (ADAS).

Validation and Assessment Report of the ADAMS4SIMS System

Public Deliverable

4.3

This deliverable reports the validation and assessment of the ADAMS4SIMS system as an integrated scientific data platform.

Final Version of the ADAMS4SIMS System

Public Deliverable

4.2

The Automated Data Mining System for Systematic Improvement of Molecular Simulations (ADAMS4SIMS) integrates a comprehensive suite of functionalities essential for the preprocessing, storage, and analysis of molecular dynamics (MD) data. The system consists of a back-end, a web-based GUI, and a PostgreSQL database. The HDF5 format is employed for managing large trajectory datasets effectively. While the preprocessing tools focus on metadata extraction and data preparation, the analytical tools available provide users with the capability to perform scalable analyses of their simulations, enabling systematic improvements in structural modeling.

Collection of APIs to External Applications

Public Deliverable

3.4

This deliverable presents the collection of Application Programming Interfaces (APIs) integrated by the EXA4MIND project to enable interaction between the platform and external applications. The document describes the APIs that expose the core services of the EXA4MIND ecosystem, including AI inference capabilities, workflow orchestration, distributed data processing, database access, and supporting infrastructure services.

Report on FAIRification of Database Data

Public Deliverable

2.4

This deliverable describes the module (including a FAIR Data Subsetting Application Programming Interface as a novel approach to FAIR data in databases) and gives examples of how datasets from EXA4MIND are published. The work described is important for implementing EXA4MIND’s Data Management Plan and for reaching project objectives related to FAIR data publication and persistent-identifier assignment.

Final Architecture, Lessons Learned, Benchmarks and Validation

Public Deliverable

1.4

This deliverable describes the final architecture of the solution, a multitude of lessons learned collected across the work packages throughout the project period, and our successful qualitative and quantitative platform validation efforts, including benchmarks.

Demonstration of the TERRAVIEW Application Case

Public Deliverable

6.2

This deliverable demonstrates the outcome of the SME/Smart – Viticulture application case of EXA4MIND. The work has focused on a Soil Moisture Content (SMC) monitoring system capable of processing large-scale satellite imagery to support viticulture. This “Aquaview” platform is shown, assessed and validated. A central component of this endeavour is a rigorous benchmarking study that evaluates the efficiency of the EXA4MIND “Extreme Data” architecture against traditional data retrieval methods.

Query Performance Report

Public Deliverable

3.3

This deliverable introduces the “Toolboxes” module of the EXA4MIND Platform, extending and covering the idea of a Data Preprocessing Toolbox developed within Work Package 3 (WP3, Extreme Data Analytics and Processing) of the EXA4MIND project.

Data Preprocessing Toolbox

Public Deliverable

3.2

Data Management Plan

Public Deliverable

8.1

The Data Management Plan lays out the planning for handling main aspects of the life cycle of the project data (data organisation and long-term storage, access, preservation, and sharing). This document also includes a preliminary specification of outputs (what data will be generated during the project). It is a living document and will be continuously updated during the project.

Dissemination and Communication Interim Report

Public Deliverable

7.2

This interim report evaluates the first 18 months of communication and dissemination activities for the EXA4MIND project, assessing progress towards objectives set in Deliverable D7.1 Impact Master Plan (EXA4MIND 2023). It focuses on stakeholder collaboration and dissemination efforts, detailing achievements, metrics, and outreach evaluations.

Impact Master Plan

Public Deliverable

7.1

This deliverable outlines the planning of the dissemination, communication, exploitation and standardisation strategies for the EXA4MIND Horizon Europe project. This planning will be of relevance throughout the duration of the project and will be revisited periodically as it progresses.

Intermediate Version of the ADAMS4SIMS System

Public Deliverable

4.1

The Automated Data Mining System for Systematic Improvement of Molecular Simulations (ADAMS4SIMS) is currently at an intermediate stage of development. This version of ADAMS4SIMS integrates a suite of functionalities essential for the preprocessing, storage, and analysis of molecular dynamics data. The system consists of a back-end, a web-based GUI, and a PostgreSQL database. The HDF5 format is employed for managing large trajectory datasets effectively. While the preprocessing tools focus on metadata extraction and data preparation, the analytical tools available provide users with the capability to perform analyses of their simulations. The intermediate version of ADAMS4SIMS presented with this deliverable lays the groundwork for future enhancements, with a roadmap that includes improving data accessibility, preprocessing and advancing analytical methodologies.

Architecture of AQIS

Public Deliverable

3.1

One of the core objectives of the EXA4MIND project is enabling Extreme Data Analytics and Processing. To fulfil this objective, one of the major modules of the project is the Advanced Querying and Indexing System (AQIS), the architecture of which is presented in this document. With the development of AQIS, we aim to provide a data query system having a uniform querying mechanism for varying data models. The system includes modular components to cope with complex data and a variety of querying capabilities/languages over them.

Data and Workflow Management Toolbox Alpha Status Report

Public Deliverable

2.2

The EXA4MIND project connects pre-eminent databases and data management systems to supercomputing systems and European Data Spaces as well as the world of FAIR research data. The core purpose of this endeavour is running next-generation Extreme Data workflows, with emphasis on data analytics, Machine Learning / Artificial Intelligence, or classical simulations. This deliverable reports on the Data and Workflow Management Toolbox provided for this purpose, building upon the successful LEXIS Platform (delivered by the H2020 project, GA 825532). Furthermore, it illustrates the first workflows run by our application cases at supercomputing centres.

Extreme Data Flow Patterns

Public Deliverable

2.1

This deliverable of the EXA4MIND project collects and analyses data flow patterns from all the project application cases. The collected data flow descriptions are used to identify a set of commonly occurring patterns that will be taken into account when designing the Extreme Data Database.

Continuous Improvements of Architecture, and Release Management Process and System

Public Deliverable

1.2

The EXA4MIND project builds a platform for Extreme Data Analytics, leveraging European supercomputing centres and best-of-breed data stores, ranging from SQL and NoSQL databases through object stores to classical High-Performance-Computing backend file systems. The platform is to connect to the European data ecosystem with European Data Spaces, the European Open Science Cloud (EOSC) and EUDAT. The platform and its components, in particular its core, the “Extreme Data Database”, need to be developed, released and deployed (or also published) in a well-coordinated manner. A formalisation of architectural improvement cycles, release management processes and deployment or publication processes is needed, but should not inhibit research work. This deliverable devises the necessary formalisms. It discusses our approach to categorise software products into modules and submodules, and to manage individual submodule and entire module releases. Then, it describes our current modular and portable EXA4MIND architecture, explaining its concept, its parts and its usage in practice. The deliverable closes by laying out our plan for handling architectural improvement cycles.

Application Cases And Architecture Requirements

Public Deliverable

1.1

This deliverable contains requirements provided by the project’s application-case work packages WP4-WP6 and their mapping to the EXA4MIND Platform features. The document is roughly divided into two parts. The first part contains a unified description of each application case and its requirements. The second part contains the mapping of the requirements to the technical features of the EXA4MIND Platform and the project objectives provided by the technical work packages WP1-WP3.

13

Thank you very much, and farewell from all of us at EXA4MIND!

Welcome to the final newsletter of the EXA4MIND Project – Thank you very much, and farewell from all of us at EXA4MIND! In this edition you will find the ‘end-of-project reflections’ of Jan Martinovič, EXA4MIND project coordinator, and our wrap-up campaign, which contains interviews with representatives from each consortium member to discuss the project’s future impact on the European data sector.

12

Heading into the final stretch of EXA4MIND

Welcome to the twelfth newsletter of the EXA4MIND Project – Heading into the final stretch of EXA4MIND. In this edition you will find all about our last plenary meeting in Ostrava, which provided partners with the perfect opportunity to say goodbye in person just two months before the project ends, the latest events that EXA4MIND has taken part in, and five new EXAKI training courses.

11

2025: a successful year for EXA4MIND!

Welcome to the eleventh newsletter of the EXA4MIND Project – 2025: a successful year for EXA4MIND! In this edition you will find an editorial by Jan Martinovič, EXA4MIND project coordinator, and Stephan Hachinger, EXA4MIND science and co-design coordinator. You will also find all about our presence in key events, new scientific contributions and new EXA4MIND graphical materials.

10

EXA4MIND in action: Highlights from a Supercomputing 'Back-to-School'!

Welcome to the tenth newsletter of the EXA4MIND Project – EXA4MIND in action: Highlights from a Supercomputing ‘Back-to-School’! In this edition you will find all about our participation in the Czech-Bavarian Supercomputing Summer School 2025, our presence at EuChemS Comp Chem 2025 in Naples, a glimpse into Researchers’ Night 2025, and the upcoming events.

9

Introducing EXAKI, the EXA4MIND Kickstarter Initiative

Welcome to the special edition of the EXA4MIND newsletter – EXAKI. In this special edition, you will find out all about EXAKI, the EXA4MIND Kickstarter Initiative.

8

EXA4MIND's meteoric career continues

Welcome to the eighth newsletter of the EXA4MIND Project – EXA4MIND’s meteoric career continues. In this edition, you will find all about our last Plenary Meeting in Ankara, a new scientific publication, our participation in international events, the final video in the ‘Faces of EXA4MIND’ campaign, updates on the DataNexus Cluster, and all the upcoming events.

7

Prague Plenary sets the stage for 2025

Welcome to the seventh newsletter of the EXA4MIND Project – Prague Plenary sets the stage for 2025. In this edition, you will find all about our last Plenary Meeting, the latest Scientific Contributions and new interviews of the ‘Faces of EXA4MIND’ campaign.

6

Two years of Extreme Data innovations and progress

Welcome to the sixth newsletter of the EXA4MIND Project – Two years of Extreme Data innovations and progress. In this edition, you will find an editorial by Stephan Hachinger, Science and Co-design Coordinator of EXA4MIND, our participation in international events such as EBDVF together with the DataNexus Cluster, ELIXIR CZ Annual Conference 2024 and SC24, new videos of the ‘Faces of EXA4MIND’ campaign and the latest DataNexus Cluster videos.

5

In the middle of the journey

Welcome to the fifth newsletter of the EXA4MIND Project – In the middle of the journey. In this edition, you will find an editorial by Jan Martinovič, the project coordinator, who reflects and takes stock of what has been achieved so far in his letter ‘In the middle of the journey’, an article about how EXA4MIND contributes to the development of European Data Spaces, the latest scientific development from our partner Valeo: ‘Valeo4Cast: A Modular Approach to End-to-End Forecasting’, and our latest Synergies and Partnerships: the DataNexus cluster and the collaboration with StandICT.eu.

4

Time to reach the next level!

Welcome to the fourth newsletter of the EXA4MIND Project – Time to reach the next level! In this edition, you will find all about our last plenary meeting, a new method to improve the safety and reliability of autonomous driving by our consortium member Antonín Vobecký from Czech Technical University in Prague, EXA4MIND’s participation in international events, new videos of the ‘Faces of EXA4MIND’ campaign featuring our consortium members, and upcoming events.

3

On to the second year of the project!

Welcome to the third newsletter of the EXA4MIND Project – On to the second year of the project! In this edition, you will find the consortium partners’ review of the first year of the project and the objectives for the second year, a presentation of the EXA4MIND External Advisory Board, highlights of international events attended by the project in the last months, and our latest campaign, ‘Faces of EXA4MIND’.

2

The journey continues!

Welcome to the second newsletter of the EXA4MIND Project – The journey continues! In this edition, you will find details about the last plenary meeting and co-design meeting with Application Cases partners, highlights from all national and international events attended by the EXA4MIND project in the last months, and a preview of upcoming events.

1

We are glad to have you on board!

Welcome to the first newsletter of the EXA4MIND project. We are glad to have you on board! In this edition, you will find a warm welcome from the EXA4MIND project coordinator, information about the organisations driving the project and their expectations of EXA4MIND, the presentation of our application cases, a recap of the events in which EXA4MIND has been actively involved, and interesting news about TerraviewOS, a consortium partner, which has emerged as the winner of the Gravity05 global sustainability challenge.

EXA4MIND AI Inference Service Solution

Poster

7

Authors: Mohamad Hayek, Stephan Hachinger, Huseyn Gurbanov, Viktoria Pauw, Martin Golasowski, Jan Martinovič, Pinar Karagoz, Ismail Hakki Toroslu.

Presented at the SCA/HPCAsia 2026 Conference.

Tapping Data Spaces in HPC: Enabling, Exploring and Benchmarking Database & Object Store Usage

Poster

6

Authors: Mohamad Hayek, Stephan Hachinger, Huseyn Gurbanov, Viktoria Pauw, Martin Golasowski, Jan Martinovič, Pinar Karagoz, Ismail Hakki Toroslu.

Presented at the SCA/HPCAsia 2026 Conference.

How to Make Data Pipeline Scalable: A Case Study for Healthcare Data Analytics

Poster

5

Authors: Atakan Gelecek, Pinar Karagoz, Ismail Hakki Toroslu, Stephan Hachinger, Jan Martinovič.

Presented at the workshop entitled ‘RNA modelling across scales’ organised by the Centre Européen de Calcul Atomique et Moléculaire (CECAM).

A universal molecular force field format for transparent and accessible parameters for MD simulations of biomolecular systems

Poster

5

Authors: Banáš, P., Mlýnský, V., Číž, D., Furmánek, R., Pilát, N., Špačková, A., Pauw, V., Hachinger, S., Šponer, J., Martinovič, J., Otyepka, M.

Presented at the workshop entitled ‘RNA modelling across scales’ organised by the Centre Européen de Calcul Atomique et Moléculaire (CECAM).

Integrated Database of Force-Field Parameters, Experimental Measurements, and Molecular Dynamics Simulations as a Tool for Force Field Development

Poster

4

Authors: Banáš, P., Mlýnský, V., Číž, D., Furmánek, R., Pilát, N., Špačková, A., Pauw, V., Hachinger, S., Šponer, J., Martinovič, J., Otyepka, M.

Presented at the EuChemS CompChem 2025 conference.

ADAMS4SIMS - Database Curation for MD Simulations

Poster

3

Authors: Číž, D., Furmánek, R., Pauw, V., Martinovič, J., Otyepka, M., Banáš, P., Mlýnský, V.

Presented at the High Performance Computing in Science and Engineering 2024 conference (HPCSE 2024).

Optimised DBMS Deployments and DBMS Benchmarks for Extreme Data Workflows

Poster

2

Authors: Gurbanov, H., Brunati, R., Karagoz, P., Toroslu, I. H., Hayek, M., Golasowski, M., Hachinger, S., Martinovič, J.

Presented at the High Performance Computing in Science and Engineering 2024 conference (HPCSE 2024).

Tuning and tweaking of the state-of-the-art amber OL3 RNA force field

Poster

1

Authors: Mlýnský, V., Kührová, P., Krepl, M., Stadlbauer, P., Otyepka, M., Bussi, G., Banáš, P., Šponer, J.

Presented at the EuChemS CompChem 2023 conference.