Assessing the Potential of Large Language Models for Streamlined Data Collection in Research Data Management

January 2025

The master’s thesis should explore the potential of Large Language Models (LLMs) for data collection processes in the field of research data management.

Large-scale research data management relies heavily on data being extracted from documents and reports of ongoing research projects and subsequently evaluated by experts. This process is problematic in two ways. First, manual extraction is often cumbersome, resource-intensive (in both cost and time), and error-prone. Second, and more importantly, data must be collected and evaluated even after projects have been completed, to facilitate continuous monitoring.
To achieve this, relevant sources need to be identified and explored in order to extract data about activities and the project partners involved (which in turn facilitates maintenance of the research data platform).

To address these issues, the thesis should investigate existing methods for gathering unstructured data (e.g., collected from the websites of various organizations) and converting it into structured data that conforms to a pre-defined structure. The research focus should be primarily on open-source methods and applications, including:

Web Scrapers: For collecting data from various sources.
Large Language Models (LLMs) (and supporting technologies): For identifying, retrieving, and transforming unstructured data into structured formats.
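As an illustration of what "conforming to a pre-defined structure" could mean in practice, the following minimal sketch checks a JSON string (such as one returned by an LLM) against a fixed record layout. The record type `ProjectRecord`, its field names, and the function `parse_llm_output` are hypothetical and chosen only for this example; the actual target structure would be defined as part of the thesis.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class ProjectRecord:
    """Hypothetical target structure; field names are illustrative assumptions."""
    project_title: str
    partners: list[str] = field(default_factory=list)
    activities: list[str] = field(default_factory=list)


def parse_llm_output(raw: str) -> ProjectRecord:
    """Parse a JSON string into a ProjectRecord.

    Raises ValueError if the text is not valid JSON or does not match
    the expected structure.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    title = data.get("project_title")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("missing or empty 'project_title'")

    lists: dict[str, list[str]] = {}
    for key in ("partners", "activities"):
        value = data.get(key, [])  # optional fields default to an empty list
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{key}' must be a list of strings")
        lists[key] = value

    return ProjectRecord(project_title=title.strip(), **lists)
```

Rejecting malformed output at this boundary, rather than storing it, is one possible design choice; it keeps invalid records out of the platform and makes failures countable for later evaluation.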

The approach may also incorporate interactions between LLMs and additional components such as knowledge graphs, databases, APIs, and human experts. The approaches considered should be compared using metrics such as success rate, hallucination rate of the LLM components, ease of use, traceability and explainability of the results, scalability, and extensibility.
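Two of the metrics named above can be made concrete with a small sketch. The definitions used here are assumptions made for illustration, not definitions prescribed by the topic description: success rate is taken as the share of reference (gold) fields that were extracted with exactly the expected value, and hallucination rate as the share of extracted values that do not occur verbatim in the source text, which is only a crude proxy.

```python
from __future__ import annotations


def success_rate(extracted: dict[str, str], gold: dict[str, str]) -> float:
    """Share of gold fields whose extracted value matches exactly (assumed definition)."""
    if not gold:
        return 0.0
    correct = sum(1 for key, value in gold.items() if extracted.get(key) == value)
    return correct / len(gold)


def hallucination_rate(extracted: dict[str, str], source_text: str) -> float:
    """Share of extracted values not found verbatim in the source (crude proxy)."""
    if not extracted:
        return 0.0
    text = source_text.lower()
    unsupported = sum(1 for value in extracted.values() if value.lower() not in text)
    return unsupported / len(extracted)
```

A verbatim containment check will misclassify paraphrased or normalized values as hallucinations, so a real evaluation would likely need a more tolerant matching strategy or expert review.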

The framework should be designed to be modular and extensible, enabling the integration of new data sources and extraction methods when necessary. The evaluation should include a comparison with existing manual and semi-automated methods to highlight the improvements in efficiency and accuracy.
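One common way to obtain this kind of modularity is a registry into which extraction methods are plugged under a name. The sketch below is a minimal illustration under that assumption; the names `register`, `run_extractor`, and `keyword_extractor` are hypothetical and do not refer to an existing framework.

```python
from __future__ import annotations

from typing import Callable

# An extractor takes raw text and returns a dictionary of extracted fields.
Extractor = Callable[[str], dict]

_EXTRACTORS: dict[str, Extractor] = {}


def register(name: str) -> Callable[[Extractor], Extractor]:
    """Decorator that adds an extraction method to the registry under `name`."""
    def decorator(func: Extractor) -> Extractor:
        if name in _EXTRACTORS:
            raise ValueError(f"extractor '{name}' is already registered")
        _EXTRACTORS[name] = func
        return func
    return decorator


def run_extractor(name: str, text: str) -> dict:
    """Look up a registered extractor by name and apply it to the text."""
    if name not in _EXTRACTORS:
        raise KeyError(f"unknown extractor '{name}'")
    return _EXTRACTORS[name](text)


@register("keyword")
def keyword_extractor(text: str) -> dict:
    """Toy extractor used only to demonstrate the registry."""
    return {"mentions_llm": "llm" in text.lower()}
```

With such a registry, a new extraction method (for example an LLM-based one) could be added by registering one more function, without changing the code that calls `run_extractor`.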

Thesis Goal

The goal of this master’s thesis is to first gain an overview of existing approaches for automated data extraction and storage, and of how LLMs can be leveraged in this context. Based on this, an automated data collection and management framework should be developed and evaluated on a real-world dataset.

Supervision

This thesis is co-supervised by Prof. Dipl.-Ing. Mag. Dr.techn. Alexandra Mazak-Huemer from the Austrian Council for Sciences, Technology and Innovation.

Michael Vierhauser
