Thesis

Options

There are three main options for the thesis:

  1. Choose a topic from the list below, or propose your own, and work on it independently (this is the default option)
  2. Collaborate with a local start-up or company (paid Master’s thesis)
  3. Work together with a research partner organisation

Template

Feel free to use the LaTeX thesis template (based on input from Karl Voit and Keith Andrews):

Template, and preview

A collection of helpful tips for the Master’s thesis is provided by Annemarie Harzl. To print the thesis, you can use the CopyShop or an online service (e.g., masterprint). Once the thesis is finished, beware of predatory publishers that offer to print it for free!

Topics

NLP

Information Extraction from Historical Text: The 19th century, also called the long nineteenth century, is known for its big changes in politics, technology and society. The goal of the thesis is to use NLP to support historians in better understanding events, including their causes and how they were perceived. To this end, sources such as historical newspapers are to be collected and information extraction methods applied. In particular, the extraction of causes and effects related to historical events is of interest.

For example, collect text from archive.org and analyse the change in language (frequency of words, frequency of phrases and/or grammar).
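As a starting point, the word-frequency comparison could look like the minimal sketch below. It assumes the texts of two periods have already been downloaded and are available as plain strings; the function names and the toy snippets are illustrative only and not part of any existing tool.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def frequency_shift(early, late):
    """Rank words by the change in relative frequency from early to late."""
    f_early = relative_frequencies(early)
    f_late = relative_frequencies(late)
    vocab = set(f_early) | set(f_late)
    shifts = [(w, f_late.get(w, 0.0) - f_early.get(w, 0.0)) for w in vocab]
    return sorted(shifts, key=lambda item: item[1], reverse=True)


# Toy snippets standing in for two periods; not real archive data.
early_text = "the coach arrived by road and the coach departed by road"
late_text = "the train arrived by rail and the train departed by rail"

shifts = frequency_shift(early_text, late_text)
print(shifts[:2])   # words that gained the most
print(shifts[-2:])  # words that lost the most
```

Relative rather than absolute frequencies are used so that periods with different amounts of text remain comparable. Phrase frequencies could be handled the same way by counting n-grams instead of single tokens.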

Change in Authorship Style (e.g., GPT-3): Writing style is unique to a person, but it is also subject to change. A long text might be authored by multiple people (or by a generative language model, such as GPT-3). The goal is to detect where these changes in style happen.
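To make the task concrete, here is a deliberately naive sketch that looks for a single style change. It assumes the text is already split into sentences and uses mean word length as the only style feature; both choices are placeholders, and a thesis would need richer features and support for several change points.

```python
import re


def mean_word_length(sentence):
    """Crude style feature: average number of letters per word."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def strongest_style_change(sentences):
    """Return (b, gap), where splitting into sentences[:b] and sentences[b:]
    gives the largest gap between the average feature values of both parts."""
    features = [mean_word_length(s) for s in sentences]
    best_index, best_gap = None, -1.0
    for b in range(1, len(features)):
        before = sum(features[:b]) / b
        after = sum(features[b:]) / (len(features) - b)
        gap = abs(before - after)
        if gap > best_gap:
            best_index, best_gap = b, gap
    return best_index, best_gap


# Invented example: three plain sentences followed by two ornate ones.
sentences = [
    "I like dogs.",
    "We ran home.",
    "It was fun.",
    "Notwithstanding considerable methodological limitations, "
    "interdisciplinary investigations proliferated.",
    "Subsequently, comprehensive quantitative evaluations "
    "substantiated preliminary hypotheses.",
]

boundary, gap = strongest_style_change(sentences)
print(boundary, round(gap, 2))
```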

Shift in Reporting: How do newspapers report on certain topics, and when do they use certain words? Are articles that use “Europe” written differently from articles that use “European Union”? Are there events that change how these topics are reported?

Causal NLP

Causal Expressions in IPCC Reports: Build an automatic extractor for causal expressions and apply it to longer documents, such as the IPCC report. The causal expressions are then collected and aggregated to give a quick overview of the main arguments.
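A very simple baseline for such an extractor is cue-phrase matching, sketched below. The cue lists are hypothetical and far from complete, the sentence splitter is naive, and only one pair per sentence is extracted; the sketch is meant to illustrate the input and output of the task, not a recommended method.

```python
import re
from collections import Counter

# Hypothetical, deliberately small cue lists.
CAUSE_FIRST = ["leads to", "causes", "results in"]        # "<cause> CUE <effect>"
EFFECT_FIRST = ["because of", "due to", "is caused by"]   # "<effect> CUE <cause>"


def extract_causal_pairs(text):
    """Return (cause, effect) pairs found via simple cue phrases."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        sentence = sentence.rstrip(".!?")
        for cue in CAUSE_FIRST + EFFECT_FIRST:
            left, sep, right = sentence.partition(f" {cue} ")
            if not sep:
                continue  # cue not present in this sentence
            if cue in CAUSE_FIRST:
                pairs.append((left.strip(), right.strip()))
            else:
                pairs.append((right.strip(), left.strip()))
            break  # one pair per sentence keeps the sketch simple
    return pairs


# Invented sentences, not quotations from any report.
text = (
    "Warming leads to glacier retreat. "
    "Coastal flooding increases due to sea level rise. "
    "The report has three parts."
)

pairs = extract_causal_pairs(text)
# Aggregation step: count how often each cause is mentioned.
cause_counts = Counter(cause.lower() for cause, _ in pairs)
print(pairs)
print(cause_counts.most_common())
```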

Causal Inference in NLP: Measure the strength of a causal effect via textual resources. How much does an event change the way people write about a topic? The event here could be a governmental intervention, a natural disaster, an accident or a personal experience. Part of this project is to collect data via controlled experiments.

Extraction of Causal Patterns for Knowledge Base Completion: Extract causal knowledge from a specific domain and transform the extracted information into a structured form. The goal is to build (or extend) a knowledge graph. The domain can be freely chosen.

Privacy-Preserving

Dataset Anonymisation: Given a textual dataset that contains sensitive information (e.g., specific words), the goal is to remove that information. For example, if the age of a person is mentioned, it should be replaced. Person names should be identified and removed.
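The two examples (ages and person names) can be illustrated with a small rule-based sketch. It assumes English text and a list of names known in advance; identifying unknown names would require a named-entity recogniser, which is exactly the harder part of the topic. The placeholders `[PERSON]` and `[AGE]` are arbitrary choices.

```python
import re

# Assumption: names are known in advance; real data would need an NER step.
KNOWN_NAMES = ["Anna Berger", "Max Huber"]

# A number directly followed by "years old" or "-year-old".
AGE_PATTERN = re.compile(r"\b\d{1,3}(?=\s*(?:years? old|-year-old))")


def anonymise(text, names=KNOWN_NAMES):
    """Replace listed person names and stated ages with placeholders."""
    # Longest names first, so that a full name is replaced before any
    # shorter name contained in it.
    for name in sorted(names, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", "[PERSON]", text)
    return AGE_PATTERN.sub("[AGE]", text)


# Invented sample sentence.
sample = "Anna Berger is 34 years old and met Max Huber, a 41-year-old engineer."
anonymised = anonymise(sample)
print(anonymised)
```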

Privacy-Preservation: Based on a dataset, define a sensitive attribute x_s that has some predictive power on a target variable y. First, compute the part of x_s that is helpful for prediction (correlation between x_s and y). Next, inject this information into 1) a new variable, 2) an existing variable (changing its values), or 3) all other variables (changing all values a tiny bit). Finally, evaluate the effectiveness of the methods (i.e., same classification performance, but no more correlation between x_s and y) and the shift in the distribution of the other variables.
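The correlation check used in the evaluation can be written in a few lines. The sketch below also shows one possible reading of the first step, under the strong assumption that the relation between x_s and y is linear: x_s is split into a part that is a linear function of y and a remainder that is uncorrelated with y. This reading, the function names and the toy numbers are assumptions for illustration, not the method the topic prescribes.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def split_linear_part(x_s, y):
    """Split x_s into a part linearly explained by y and a residual
    (least-squares regression of x_s on y)."""
    mx, my = mean(x_s), mean(y)
    slope = (sum((v - my) * (u - mx) for u, v in zip(x_s, y))
             / sum((v - my) ** 2 for v in y))
    explained = [mx + slope * (v - my) for v in y]
    residual = [u - e for u, e in zip(x_s, explained)]
    return explained, residual


# Toy values, invented for illustration only.
x_s = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

explained, residual = split_linear_part(x_s, y)
before = pearson(x_s, y)
after = pearson(residual, y)
print(round(before, 3), abs(after) < 1e-9)
```

Pearson correlation only captures linear dependence, so a value near zero after the transformation does not rule out non-linear leakage of x_s.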

Analyse Contracts: Analyse textual contracts in the German language and extract legally relevant terms and phrases from the text, including dates and time ranges. In addition, one can create a reference list of expected phrases and match it against an existing text. Finally, this can be used to create a classification scheme for the contracts.
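Date extraction and matching against a reference list could be prototyped as follows. The sketch only handles numeric dates in the common German `DD.MM.YYYY` form, does not validate them, and uses a hypothetical three-entry reference list; written-out dates, time ranges and inflected word forms are left open.

```python
import re

# Matches numeric German dates such as 01.03.2024 or 1.3.2024.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

# Hypothetical reference list of phrases expected in a contract.
EXPECTED_PHRASES = ["Kündigungsfrist", "Vertragslaufzeit", "Gerichtsstand"]


def extract_dates(text):
    """Return dates as ISO strings (YYYY-MM-DD) in order of appearance."""
    return [f"{y}-{int(m):02d}-{int(d):02d}"
            for d, m, y in DATE_PATTERN.findall(text)]


def check_phrases(text, phrases=EXPECTED_PHRASES):
    """Map each expected phrase to whether it occurs in the text."""
    lowered = text.casefold()
    return {p: p.casefold() in lowered for p in phrases}


# Invented contract excerpt.
contract = (
    "Die Vertragslaufzeit beginnt am 01.03.2024 und endet am 28.02.2025. "
    "Die Kündigungsfrist beträgt drei Monate."
)

dates = extract_dates(contract)
found = check_phrases(contract)
print(dates)
print(found)
```

The resulting phrase map (which expected phrases are present or missing) is one possible feature vector for the contract classification mentioned above.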

Web Crawling and NLP

Climate Change Efforts: There are many local efforts to address climate change at the regional level, which are reported on the respective web pages. The goal is to systematically crawl regional web sites and identify climate change actions taken by local governments. This topic combines the engineering required for web crawling with NLP tools for information extraction.

Data Science and Machine Learning

Split Features into Neighbourhood and Similarity: Many machine learning and data science tasks assume the features to be semantically equal. The idea is to split the feature set into two sets, the first encoding the closeness of instances and the second encoding the similarity between instances. This approach can then be implemented in, for example, Local Outlier Factor or other methods.
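One way to read this idea is sketched below: the neighbourhood features alone decide which instances count as neighbours, and the similarity features alone decide how much an instance deviates from those neighbours. This is a simplified stand-in and not the Local Outlier Factor algorithm; the scoring rule, the function name and the toy data are assumptions made for illustration.

```python
import math


def split_feature_scores(neigh, sim, k=2):
    """Score each instance by the distance between its similarity features
    and the mean similarity features of its k nearest neighbours, where
    'nearest' is measured on the neighbourhood features only."""
    scores = []
    for i in range(len(neigh)):
        others = [j for j in range(len(neigh)) if j != i]
        others.sort(key=lambda j: math.dist(neigh[i], neigh[j]))
        nearest = others[:k]
        centre = [sum(sim[j][d] for j in nearest) / len(nearest)
                  for d in range(len(sim[i]))]
        scores.append(math.dist(sim[i], centre))
    return scores


# Toy data: positions (neighbourhood set) and one measured value (similarity set).
positions = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
values = [(10.0,), (11.0,), (10.5,), (10.0,), (25.0,)]

scores = split_feature_scores(positions, values, k=2)
most_unusual = max(range(len(scores)), key=scores.__getitem__)
print(most_unusual, scores)
```

In this toy example the last instance lies in the middle of the others spatially but has a very different measured value, so it receives the highest score. Because the plain mean is used, an outlier also inflates the scores of its neighbours; a median or a density-based score would be more robust.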

Custom Loss for Privacy-Preservation via Causality: Develop a loss function for training, e.g., a Variational Autoencoder that additionally includes a loss term for the “leak” of sensitive information.

Causal Outlier/Anomaly Detection: There are two goals. 1) Given a dataset (including potentially unlabelled outliers) and a causal structure, research to what extent knowledge of the causal structure helps to identify outliers. 2) Given a dataset and labelled outliers, research to what extent this helps with causal discovery.

Privacy-Preservation/Fairness via Causality: Based on existing datasets, define some sensitive attributes x_s, where we want to protect the relationship between these attributes and the target attribute y (e.g., the impact of gender on salary). Based on knowledge about the dataset, derive a causal model, e.g., a causal graph. Research methods to remove the correlation between x_s and y (e.g., by introducing a new synthetic confounder attribute).

Software Development

Web App for Graph Database: The goal is to develop a web application that visualises a graph database (e.g., Neo4j) and makes its content available for intuitive search requests. The graph database stores a knowledge graph, either taken from existing datasets (e.g., ConceptNet, ATOMIC) or built from your own data. The user can then interact with the graph and drill down to individual nodes.