Thesis

Options

There are three main options for the thesis:

  1. Choose a topic from the list below, or propose your own, and work on it independently (this is the default option)
  2. Collaborate with local start-ups or companies (paid Master’s thesis)
  3. Work together with a research partner organisation

Template

Feel free to use the LaTeX thesis template (based on input from Karl Voit and Keith Andrews):

Template and preview

A collection of helpful tips for the Master’s thesis, provided by Annemarie Harzl. For printing the thesis, one can choose the CopyShop or an online service (e.g., masterprint). After the thesis is finished, beware of predatory publishers that offer to print it for free!

Topics

Cooperations

ESG reporting

Artificial Intelligence has reached a level at which it can support humans in more and more tasks. In this Master’s thesis, the research question is to what extent AI can support sustainability reporting, which is becoming mandatory for large companies in the EU. What are the opportunities and what are the challenges? For example, could AI provide insights into the CO2 emissions throughout the company or the supply chain? Can AI be used to inform employees about possible or required changes in behaviour, or to suggest measures to management to improve sustainability? The work on the thesis will be supported by AI experts, who provide expertise and insights into the inner workings of contemporary AI tools, in particular Large Language Models (LLMs), as well as by sustainability reporting experts.

Excellence in the Austrian Research Ecosystem: A Quantitative Citation Analysis

This Master’s thesis analyses the excellence of the Austrian research ecosystem by means of a quantitative citation analysis. Citation data for articles from various institutions, including universities, universities of applied sciences, and non-university research institutions, are meticulously collected and analysed via the Scopus and OpenCitations APIs. The analysis provides insights into the influence and quality of the research output of these institutions. This research requires knowledge of Python and a keen interest in the science of science.
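As a minimal sketch, citation counts could be fetched as follows, assuming the OpenCitations COCI citation-count endpoint (the Scopus API additionally requires an API key and is omitted here); the DOI list is a placeholder:

    import requests

    # Placeholder DOIs standing in for the articles of one institution.
    DOIS = ["10.1038/nature12373", "10.1145/3292500.3330701"]

    def citation_count(doi: str) -> int:
        # The COCI endpoint returns a JSON list with one object whose
        # "count" field holds the number of incoming citations.
        url = f"https://opencitations.net/index/coci/api/v1/citation-count/{doi}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return int(response.json()[0]["count"])

    for doi in DOIS:
        print(doi, citation_count(doi))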

Evaluating the Impact of Scientific Publications: A Citation Context Analysis Using Large Language Models

Citations are crucial for assessing the impact of scientific publications, yet quantitative analyses often overlook the varied contributions of individual citations. To address these limitations, this thesis introduces an automated classification of citation context using Large Language Models (LLMs), offering a nuanced approach that combines the strengths of quantitative and qualitative analyses. This method aims to enhance the understanding of a publication’s impact by accurately identifying the intent behind each citation. This research requires knowledge of Python and a keen interest in LLMs and the science of science.
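As an illustration, the classification step could look roughly like the following sketch; call_llm is a hypothetical placeholder for whichever LLM client is used, and the label set is an assumption:

    # call_llm is a hypothetical placeholder for the actual LLM client.
    INTENTS = ["background", "method", "comparison", "extension", "criticism"]

    PROMPT = """Classify the intent of the citation marked as [CITATION]
    in the following passage. Answer with exactly one label from:
    {labels}

    Passage: {context}
    Label:"""

    def classify_citation(context: str) -> str:
        answer = call_llm(PROMPT.format(labels=", ".join(INTENTS), context=context))
        return answer.strip().lower()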

Web Application for Machine Learning

The goal is to develop a web application for sensitivity analysis and uncertainty quantification of computational model data. The work is carried out in conjunction with UQtab. This topic can easily be combined with many scientific aspects.
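For the sensitivity-analysis backend, a Sobol analysis could be sketched as follows, assuming the SALib library and a toy stand-in model:

    import numpy as np
    from SALib.sample import saltelli
    from SALib.analyze import sobol

    # Toy model standing in for the real computational model.
    def model(X):
        return X[:, 0] ** 2 + 3 * X[:, 1]

    problem = {
        "num_vars": 2,
        "names": ["x1", "x2"],
        "bounds": [[0.0, 1.0], [0.0, 1.0]],
    }

    X = saltelli.sample(problem, 1024)  # quasi-random parameter samples
    Y = model(X)
    Si = sobol.analyze(problem, Y)      # first-order and total-order indices
    print(Si["S1"], Si["ST"])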

Analyse the Research Groups in Austria

In an initial step, a crawler is to be built to collect information about Austrian research institutions. Code for the crawler technology is already available. The work is in conjunction with an EU project. Next, the collected information is made available to an LLM via a RAG (retrieval-augmented generation) system.
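A minimal crawler sketch using requests and BeautifulSoup; the start URL is a placeholder for the actual seed list of institutions:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    # Placeholder seed page; the real list would cover Austrian institutions.
    START_URL = "https://example.org/research-groups"

    def crawl(url: str, visited: set, max_pages: int = 50) -> None:
        if url in visited or len(visited) >= max_pages:
            return
        visited.add(url)
        page = requests.get(url, timeout=30)
        soup = BeautifulSoup(page.text, "html.parser")
        print(url, soup.title.string if soup.title else "")
        # Follow only links that stay on the same site.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith(START_URL):
                crawl(link, visited, max_pages)

    crawl(START_URL, set())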

Technology Scouting

Similar to the previous topic, the goal here is to crawl websites discussing certain technologies. Again, the goal is to provide a user-friendly UI to operate an LLM in combination with a RAG system.
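The retrieval half of such a RAG system could be sketched as follows, assuming the sentence-transformers library; the document snippets are placeholders for crawled content:

    from sentence_transformers import SentenceTransformer, util

    # Placeholder snippets standing in for crawled technology descriptions.
    documents = [
        "Solid-state batteries promise higher energy density.",
        "Perovskite solar cells are approaching commercial efficiency.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode(documents, convert_to_tensor=True)

    def retrieve(query: str, top_k: int = 3) -> list:
        # Rank documents by cosine similarity to the query; the top hits
        # would then be inserted into the LLM prompt as context.
        q_emb = model.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, doc_emb, top_k=top_k)[0]
        return [documents[h["corpus_id"]] for h in hits]

    print(retrieve("battery technology"))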

Information Extraction from Historical Text

The 19th century is known for its big changes in politics, technology, and society, and is also called the long nineteenth century. The goal of the thesis is to use NLP to support historians in their work to better understand events, including their causes and how they were perceived. To this end, sources like historical newspapers are to be collected, and information extraction methods should be applied. In particular, the extraction of causes and effects related to historical events is of interest.

For example, collect text from archive.org and analyse the change in language (frequency of words, frequency of phrases and/or grammar).
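A minimal sketch of tracking relative word frequencies over time with standard-library tools; the corpus dictionary is a placeholder for texts collected from archive.org:

    import re
    from collections import Counter

    # Placeholder corpus: year -> document text (from archive.org in practice).
    corpus = {
        1848: "revolution spreads through the empire ...",
        1890: "the factory and the railway change daily life ...",
    }

    def relative_frequency(text: str, word: str) -> float:
        tokens = re.findall(r"[a-zäöüß]+", text.lower())
        return Counter(tokens)[word] / max(len(tokens), 1)

    for year, text in sorted(corpus.items()):
        print(year, relative_frequency(text, "revolution"))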

LLMs

Few-Shot LLM Pruning

Develop a few-shot prompt for a given task, e.g., named entity recognition, classification, etc. Next, analyse the network activations, and in a succeeding step prune away neurons that are not used for the target task. Study the impact of the level of pruning on the performance of the network.
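A rough sketch of the activation-recording and masking steps in PyTorch, using a single linear layer as a stand-in for one LLM block; the threshold is an arbitrary assumption:

    import torch
    import torch.nn as nn

    # Placeholder layer standing in for one feed-forward block of the LLM.
    layer = nn.Linear(16, 32)
    activations = []

    # Record the layer's outputs on the few-shot task inputs via a hook.
    layer.register_forward_hook(lambda m, i, out: activations.append(out.detach()))

    task_inputs = torch.randn(64, 16)  # stand-in for embedded few-shot prompts
    layer(task_inputs)

    # Neurons whose mean absolute activation stays below a threshold are
    # treated as unused for the task and masked (pruned) away.
    mean_act = torch.cat(activations).abs().mean(dim=0)
    mask = (mean_act > 0.1).float()
    layer.weight.data *= mask.unsqueeze(1)
    layer.bias.data *= mask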

LLMs and Counterfactual Inference

Are LLMs able to understand causality? On an existing dataset, a number of “old” LLMs have been analysed. The goal is to update this work and collect data with contemporary LLMs. In particular, it is interesting to find out which tests do not work well. In the best case, one can find patterns where even contemporary LLMs struggle.

NLP

Change in Authorship Style (e.g., GPT-3)

The writing style is unique to a person, but it is also subject to change. A long text might be authored by multiple people (or by a generative language model, such as GPT-3). The goal is to detect where these changes in style happen.
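One simple baseline compares stylometric features between consecutive windows of the text; a sketch with two hand-picked features (many more would be needed in practice):

    import re
    import numpy as np

    def style_features(window: str) -> np.ndarray:
        # Crude stylometric features: average sentence length (in words)
        # and average word length within the window.
        sentences = [s for s in re.split(r"[.!?]", window) if s.strip()]
        words = re.findall(r"\w+", window)
        avg_sent = len(words) / max(len(sentences), 1)
        avg_word = np.mean([len(w) for w in words]) if words else 0.0
        return np.array([avg_sent, avg_word])

    def change_scores(text: str, size: int = 500) -> list:
        # A large distance between adjacent windows hints at a style change.
        windows = [text[i:i + size] for i in range(0, len(text), size)]
        feats = [style_features(w) for w in windows]
        return [float(np.linalg.norm(a - b)) for a, b in zip(feats, feats[1:])]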

Shift in Reporting

How do newspapers report about certain topics, and when do they use certain words? Are articles written differently if they use “Europe” vs. articles using “European Union”? Are there events that change the way these topics are reported?

Causal NLP

Causal Expression Extraction with LLMs

Develop prompt engineering techniques for the extraction of causal expressions from text, e.g., scientific papers. In a next step, infer more general concepts and relations with LLMs. Finally, convert the extracted causal information into knowledge graphs.
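The extraction step could be prompted roughly as follows, with each extracted pair becoming an edge of the knowledge graph; call_llm is again a hypothetical placeholder for the actual LLM client:

    import json

    # call_llm is a hypothetical placeholder for the actual LLM client.
    PROMPT = """Extract all causal statements from the text below as a JSON
    list of objects with the keys "cause" and "effect".

    Text: {text}
    JSON:"""

    def extract_causal_edges(text: str) -> list:
        pairs = json.loads(call_llm(PROMPT.format(text=text)))
        # Each pair becomes a (cause, "causes", effect) triple for the graph.
        return [(p["cause"], "causes", p["effect"]) for p in pairs]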

Causal Expressions in IPCC Reports

Build an automatic extraction pipeline for causal expressions and apply it to longer documents, such as the IPCC reports. The causal expressions are then collected and aggregated to give a quick overview of the main arguments.

Causal Inference in NLP

Measure the strength of causal effects via textual resources. How much does an event change the way people write about a topic? The event here could be a governmental intervention, a natural disaster, an accident, or a personal experience. Part of this project is to collect data via controlled experiments.

Extraction of Causal Patterns for Knowledge Base Completion

Extract causal knowledge from a specific domain and transform the extracted information into structured form. The goal is to build (or extend) a knowledge graph. Here, the domain can be freely chosen.

Privacy-Preserving

Dataset Anonymisation

Given a textual dataset (containing sensitive information, e.g., specific words), the goal is to remove any sensitive information. For example, if the age of a person is mentioned, it should be replaced. Person names should be identified and removed.
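A minimal sketch using spaCy's named entity recogniser to redact sensitive spans; the label set is an assumption to be tuned per dataset:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline with NER
    SENSITIVE = {"PERSON", "DATE", "GPE", "ORG"}  # assumed sensitive labels

    def anonymise(text: str) -> str:
        # Replace every sensitive entity span with its label as a placeholder.
        doc = nlp(text)
        out, last = [], 0
        for ent in doc.ents:
            if ent.label_ in SENSITIVE:
                out.append(text[last:ent.start_char])
                out.append(f"[{ent.label_}]")
                last = ent.end_char
        out.append(text[last:])
        return "".join(out)

    print(anonymise("Alice Meier, 42, moved to Graz in 2019."))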

Privacy-Preservation

Based on a dataset, define some sensitive attribute x_s, which has some predictive power on a target variable y. First, compute the part of x_s that is helpful for prediction (the correlation between x_s and y). Next, inject this information into 1) a new variable, 2) an existing variable (changing its values), or 3) all other variables (changing all values a tiny bit). Finally, the effectiveness of the methods needs to be evaluated (i.e., the same classification performance, but no remaining correlation between x_s and y), as well as the shift in the distribution of the other variables.
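A sketch of the first variant (injecting the predictive part of x_s into a new variable), on toy data and under a simplifying linear-correlation view:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: sensitive attribute x_s with predictive power on y.
    n = 1000
    x_s = rng.normal(size=n)
    y = 0.8 * x_s + rng.normal(scale=0.5, size=n)

    # Under a linear view, the part of x_s useful for predicting y is
    # beta * x_s, with beta the regression coefficient of y on x_s.
    beta = np.cov(x_s, y)[0, 1] / np.var(x_s)
    useful = beta * x_s

    # Variant 1: move the useful part into a new variable and replace
    # x_s itself with pure noise.
    x_new = useful + rng.normal(scale=0.1, size=n)
    x_s_anon = rng.normal(size=n)

    # Check: x_new keeps the signal, the anonymised x_s no longer correlates with y.
    print(np.corrcoef(x_new, y)[0, 1], np.corrcoef(x_s_anon, y)[0, 1])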

Analyse Contracts

Analyse textual contracts in the German language and extract legally relevant terms and phrases from the text. This also includes dates and time ranges. In addition, one can create a reference list with phrases that are expected and then match it against an existing text. Finally, this can be used to create a classification scheme for the contracts.

Web Crawling and NLP

Climate Change Efforts

There are many local efforts to address climate change at the regional level, which are reported on the respective web pages. The goal is to systematically crawl regional websites and identify climate change actions taken by local governments. This topic combines the engineering required for web crawling with NLP tools for information extraction.

Data Science and Machine Learning

Dataset Augmentation for Tabular Data

Based on a paper on causal GANs, reimplement the algorithm and evaluate it on your own datasets.

Split Features into Neighbourhood and Similarity

Many machine learning and data science tasks assume the features to be semantically equal. The idea is to split the feature set into two sets: the first representing features that encode the closeness of instances, and the second encoding the similarity between instances. This approach can then be implemented in, for example, the Local Outlier Factor or other methods.
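A sketch of this idea in a plain nearest-neighbour setting: neighbours are found on one feature subset and outlierness is scored on the other; this is a simplification of a full Local Outlier Factor variant:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def split_scores(X_neigh, X_sim, k=10):
        # Find each point's k nearest neighbours using only the
        # "neighbourhood" feature set ...
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_neigh)
        _, idx = nn.kneighbors(X_neigh)
        # ... then score outlierness as the mean distance to those same
        # neighbours measured in the "similarity" feature set.
        return np.array([
            np.linalg.norm(X_sim[i] - X_sim[idx[i, 1:]], axis=1).mean()
            for i in range(len(X_sim))
        ])

    X = np.random.default_rng(0).normal(size=(200, 6))
    scores = split_scores(X[:, :3], X[:, 3:])  # first 3 features = neighbourhood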

Custom Loss for Privacy-Preservation via Causality

Develop a loss function for training, e.g., a Variational Autoencoder that additionally includes a loss term for the “leakage” of sensitive information.
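A sketch of such a combined loss in PyTorch; penalising the linear correlation between the latent code and the sensitive attribute is one simple choice for the leakage term among many:

    import torch
    import torch.nn.functional as F

    def vae_privacy_loss(x, x_hat, mu, logvar, z, s, lam=1.0):
        # Standard VAE objective: reconstruction error plus KL divergence.
        recon = F.mse_loss(x_hat, x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # Leakage term: squared correlation of each latent dimension with
        # the sensitive attribute s (captures linear leakage only).
        z_c = z - z.mean(dim=0, keepdim=True)
        s_c = (s - s.mean()).unsqueeze(1)
        corr = (z_c * s_c).mean(dim=0) / (z_c.std(dim=0) * s_c.std() + 1e-8)
        return recon + kl + lam * corr.pow(2).mean()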

Causal Outlier/Anomaly Detection

Goals: 1) Given a dataset (potentially including unlabelled outliers) and a causal structure, research to what extent the knowledge of the causal structure helps to identify outliers. 2) Given a dataset and labelled outliers, research to what extent this helps for causal discovery.

Privacy-Preservation/Fairness via Causality

Based on existing datasets, define some sensitive attributes x_s for which we want to protect the relationship between these attributes and the target attribute y (e.g., the impact of gender on salary). Based on the knowledge about the dataset, derive a causal model, e.g., a causal graph. Research methods to remove the correlation between x_s and y (e.g., via introducing a new synthetic confounder attribute).

Data Quality

A Comprehensive Review of Methods, Tools, and Metrics for Quality Assessment of Datasets in AI Applications

Data quality plays an important role in AI tools and applications. The goal of the thesis is to assess the current state of the art in quality assessment approaches and, based on this knowledge, to derive a methodology for the assessment of datasets. In particular, the focus is not only on the utility of the dataset (e.g., how well a prediction will work), but also on other aspects like bias or private information.

Software Development

Web App for Graph Database

Goal: Develop a web application to visualise a graph database (e.g., Neo4J) and make its content available for intuitive search requests. The graph database stores a knowledge graph, either taken from existing datasets (e.g., ConceptNet, ATOMIC) or built from your own data. The user can then interact with the graph and drill down to individual nodes.
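The backend's query layer could be a thin wrapper around the official neo4j Python driver; the connection details and the name property are placeholders:

    from neo4j import GraphDatabase

    # Placeholder connection details.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    def search_nodes(term: str, limit: int = 25) -> list:
        # Simple substring search over a node "name" property; the web app
        # would call this from its search endpoint and render the result.
        query = (
            "MATCH (n) WHERE toLower(n.name) CONTAINS toLower($term) "
            "RETURN n.name AS name, labels(n) AS labels LIMIT $limit"
        )
        with driver.session() as session:
            return [dict(r) for r in session.run(query, term=term, limit=limit)]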