Thesis

Options

There are three main options for the thesis:

  1. Choose a topic from the list below, or propose your own, and work on it independently (this is the default option)
  2. Collaborate with a local start-up or company (paid Master’s thesis)
  3. Work together with a research partner organisation

Template

Feel free to use the LaTeX thesis template (based on input from Karl Voit and Keith Andrews):

Template, and preview

A collection of helpful tips for the Master’s thesis is provided by Annemarie Harzl. To print the thesis, one can use the CopyShop or an online service (e.g., masterprint). Once the thesis is finished, beware of predatory publishers that offer to print it for free!

Topics

NLP

Information Extraction from Historical Text The 19th century is known for major changes in politics, technology and society, and is also called the long nineteenth century. The goal of the thesis is to use NLP to support historians in better understanding events, including their causes and how they were perceived. To this end, sources such as historical newspapers are to be collected and information extraction methods applied. In particular, the extraction of causes and effects related to historical events is of interest.

For example, collect text from archive.org and analyse the change in language (frequency of words, frequency of phrases and/or grammar).
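The word-frequency comparison mentioned above could be sketched as follows. This is only a minimal illustration using the Python standard library; the function names and the two toy text snippets are assumptions made for this example and do not come from archive.org.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def frequency_shift(early_text, late_text):
    """Difference in relative frequency (late minus early) for every word."""
    early = relative_frequencies(early_text)
    late = relative_frequencies(late_text)
    vocabulary = set(early) | set(late)
    return {w: late.get(w, 0.0) - early.get(w, 0.0) for w in vocabulary}


# Hypothetical toy snippets standing in for two periods of newspaper text.
early_sample = "the railway opened and the telegraph followed"
late_sample = "the telephone replaced the telegraph in the city"

shift = frequency_shift(early_sample, late_sample)
```

Words with a large positive or negative shift are candidates for a closer look; phrases or grammatical features would need a richer tokenisation than this sketch uses.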

Recursive Word Sense Induction Collect a textual dataset, then split and pre-process the data. The occurrences of each word are then clustered, and pure clusters are used to split the word: each of its occurrences is replaced with a representation of its cluster. This process continues until no more pure clusters can be found.

Change in Authorship Style The writing style is unique to a person, but it is also subject to change. For example, if a person is exposed to a certain situation, the writing style might also change. There are a number of sub-topics here, including real-world experiments and dataset collection & analysis.

For example, study the impact of the tool used to write text: 1) decide on two tools (e.g., desktop, mobile), 2) recruit participants, 3) ask participants to write on a specific topic, 4) collect the text and analyse the differences.

Shift in Reporting How do newspapers report on certain topics, and when do they use certain words? Are articles that use “Europe” written differently from articles that use “European Union”?

Zero-Shot Learning Recently, pre-trained models have been studied for their ability to solve problems without an explicit training phase. For example, given just a few examples, a model can be adapted for sentiment detection or information extraction. As a starting point, OpenPrompt can be used.

Causal NLP

Causal Inference in NLP Measure the strength of a causal effect via textual resources. How much does an event change the way people write about a topic? The event here could be a governmental intervention, a natural disaster, an accident, or a personal experience.

Part of this project is to collect data via controlled experiments.

Extraction of Causal Patterns for Knowledge Base Completion Extract causal knowledge from a specific domain and transform the extracted information into a structured form. The goal is to build (or extend) a knowledge graph.

Here the domain can be freely chosen.

Privacy-Preserving

Dataset Anonymisation Given a textual dataset (containing sensitive information, e.g., specific words), the goal is to remove any sensitive information. For example, if the age of a person is mentioned, it should be replaced. Person names should be identified and removed.
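As a rough illustration of the replacement step, the sketch below masks ages and a given list of names with regular expressions. The age pattern, the placeholder tokens `[AGE]` and `[PERSON]`, and the assumption that names are known in advance are all hypothetical simplifications; identifying names in practice would likely require a named-entity recogniser.

```python
import re

# Hypothetical pattern: a number directly followed by "years old" or "-year-old".
AGE_PATTERN = re.compile(r"\b\d{1,3}(?=[- ]years?[- ]old\b)")


def anonymise(text, known_names):
    """Replace ages with [AGE] and the listed person names with [PERSON]."""
    text = AGE_PATTERN.sub("[AGE]", text)
    # Longest names first, so "Anna Berg" is masked before a shorter overlap.
    for name in sorted(known_names, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", "[PERSON]", text)
    return text


example = anonymise("Anna Berg is 34 years old and met Tom.", ["Anna Berg", "Tom"])
print(example)  # [PERSON] is [AGE] years old and met [PERSON].
```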

Privacy-Preservation Based on a dataset, define some sensitive attribute x_s, which has some predictive power on a target variable y. First, compute the part of x_s that is helpful for prediction (correlation between x_s and y). Next, inject this information into 1) a new variable, 2) an existing variable (change the values), 3) all other variables (change all values a tiny bit). Finally, the effectiveness of the methods needs to be evaluated (i.e., same classification performance, but no more correlation between x_s and y), as does the shift in the distribution of the other variables.
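The first step and the final evaluation both rely on measuring the correlation between x_s and y. A minimal sketch of that measurement, assuming Pearson correlation is the intended measure (the text does not fix one) and using made-up toy data, might look like this:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: x_s is a binary sensitive attribute, y the target.
x_s = [0, 1, 0, 1, 1, 0]
y = [1.0, 3.0, 1.5, 3.5, 2.5, 0.5]

before = pearson(x_s, y)
```

After applying one of the three injection strategies, the same function could be evaluated again on the modified data to check whether the correlation has dropped.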

Web Crawling and NLP

Climate Change Efforts There are many local efforts to address climate change at the regional level, which are reported on the respective web pages. The goal is to systematically crawl regional websites and identify climate change actions taken by local governments. This topic combines the engineering required for web crawling and NLP tools for information extraction.
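To give a feel for how the two parts fit together, the sketch below parses one already downloaded page, collects its links for further crawling, and checks the text for climate-related keywords. The keyword list, class name and sample HTML are assumptions for illustration; fetching pages, respecting robots.txt and real information extraction are left out.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; a thesis would replace this with an NLP model.
CLIMATE_KEYWORDS = ("climate", "emission", "renewable", "solar")


class TextAndLinks(HTMLParser):
    """Collects link targets and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def analyse_page(html):
    """Return the links to follow and the keywords found on the page."""
    parser = TextAndLinks()
    parser.feed(html)
    text = " ".join(parser.text_parts).lower()
    hits = [keyword for keyword in CLIMATE_KEYWORDS if keyword in text]
    return parser.links, hits


sample_html = '<p>The town installs solar panels to cut emissions.</p><a href="/news">News</a>'
links, hits = analyse_page(sample_html)
```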

Causal Data Science and Machine Learning

Custom Loss for Privacy-Preservation via Causality Develop a loss function for training, e.g., a Variational Autoencoder that additionally includes a loss term for the “leak” of sensitive information.
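One possible shape of such a loss is sketched below in plain Python, without a deep learning framework. The leak measure used here (the squared gap between the mean latent value of two sensitive groups), the scalar latent, and the weight `leak_weight` are assumptions chosen for brevity; the KL term of a Variational Autoencoder is omitted, and a causality-based leak term is what the thesis would actually need to develop.

```python
def reconstruction_loss(inputs, outputs):
    """Mean squared error between inputs and their reconstructions."""
    return sum((a - b) ** 2 for a, b in zip(inputs, outputs)) / len(inputs)


def leak_penalty(latents, sensitive):
    """Squared gap between the mean latent value of the two sensitive groups.

    Assumes one scalar latent per sample and a binary sensitive attribute.
    """
    group0 = [z for z, s in zip(latents, sensitive) if s == 0]
    group1 = [z for z, s in zip(latents, sensitive) if s == 1]
    if not group0 or not group1:
        return 0.0  # with a single group there is nothing to compare
    gap = sum(group1) / len(group1) - sum(group0) / len(group0)
    return gap ** 2


def total_loss(inputs, outputs, latents, sensitive, leak_weight=1.0):
    """Reconstruction error plus a weighted penalty for leaked information."""
    return reconstruction_loss(inputs, outputs) + leak_weight * leak_penalty(latents, sensitive)
```

A larger `leak_weight` trades reconstruction quality for a latent representation that differs less between the sensitive groups.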

Causal Outlier/Anomaly Detection Goals: 1) Given a dataset (including potentially unlabelled outliers) and a causal structure, research to what extent knowledge of the causal structure helps to identify outliers. 2) Given a dataset and labelled outliers, research to what extent this helps with causal discovery.
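For the first goal, one simple baseline is to use the causal structure to decide which variable to predict from which, and then flag points whose value deviates strongly from that prediction. The sketch below assumes a known edge from a single cause to a single effect, a linear relationship, and a z-score threshold of 2.0; all three are illustrative assumptions, not part of the topic description.

```python
import statistics


def fit_line(causes, effects):
    """Least-squares line predicting the effect from its cause."""
    mean_c = statistics.fmean(causes)
    mean_e = statistics.fmean(effects)
    numerator = sum((c - mean_c) * (e - mean_e) for c, e in zip(causes, effects))
    denominator = sum((c - mean_c) ** 2 for c in causes)
    slope = numerator / denominator
    return slope, mean_e - slope * mean_c


def causal_outliers(causes, effects, threshold=2.0):
    """Indices whose residual (effect minus prediction) has a large z-score."""
    slope, intercept = fit_line(causes, effects)
    residuals = [e - (slope * c + intercept) for c, e in zip(causes, effects)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []  # every point lies exactly on the fitted line
    center = statistics.fmean(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - center) / spread > threshold]


# Hypothetical toy data: the effect is twice the cause, except at index 4.
cause = [1, 2, 3, 4, 5, 6, 7, 8, 9]
effect = [2, 4, 6, 8, 30, 12, 14, 16, 18]

flagged = causal_outliers(cause, effect)
```

Note that the flagged point has an unremarkable cause value and would be easy to miss when looking at that variable alone; it only stands out relative to what the cause predicts.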

Privacy-Preservation/Fairness via Causality Based on existing datasets, define some sensitive attributes x_s, where we want to protect the relationship between these attributes and the target attribute y (e.g., impact of gender on salary). Based on the knowledge about the dataset, derive a causal model, e.g., a causal graph. Research methods to remove the correlation between x_s and y (e.g., via introducing a new synthetic confounder attribute).

Software Development

Web App for Causal Exploration Goal: Extend an existing web app consisting of three parts: 1) a part where one can draw a simple causal graph, 2) a part where one can upload and view a simple dataset (e.g., via a .csv file), 3) a results part (e.g., an unbiased estimate of dependencies). The results part is updated depending on the causal graph.

Web App for Dataset Generation Goal: Develop an easy-to-use web application that allows users to generate various datasets. For example, it can be used to produce tabular data, or alternatively, it may produce time series data. The generated data should be as realistic as possible.
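The generation back end of such an application could start from something like the sketch below, which produces reproducible tabular and time series data. The column names, the dependence of income on age, and the random-walk model are invented for this example and are far from realistic; making the data realistic is the actual challenge of the topic.

```python
import random


def generate_tabular(n_rows, seed=0):
    """Generate rows with an age column and an income that depends on age."""
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    rows = []
    for _ in range(n_rows):
        age = rng.randint(18, 80)
        income = round(20000 + 500 * age + rng.gauss(0, 5000), 2)
        rows.append({"age": age, "income": income})
    return rows


def generate_time_series(n_steps, seed=0):
    """Generate a simple random walk with Gaussian steps."""
    rng = random.Random(seed)
    value = 0.0
    series = []
    for _ in range(n_steps):
        value += rng.gauss(0, 1)
        series.append(value)
    return series


table = generate_tabular(5, seed=42)
walk = generate_time_series(10, seed=42)
```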