Natural Language Processing

VU (706.230)

This course provides insights into the field of text analysis, with a focus on written text. The theoretical part presents a wide range of algorithms designed to extract structured information from unstructured textual resources. In practical projects from varying application areas, these text analytics methods will be applied and systematically evaluated.

The theoretical part is mainly presented as classical lectures; in the practical part, each student is expected to work on a project.

The main instructor of the course is Roman Kern.

The grading depends on the quality of the work conducted in the practical project (which can be freely chosen) and on a homework assignment submitted via TeachCenter.

About


Motivation

Today, society at large is heavily influenced by machines making decisions, including what information is presented to users; this ongoing trend is associated with the term artificial intelligence. The real-world consequences range from influencing buying behaviour to influencing voting behaviour. One key element of artificial intelligence is the analysis of the text being written and disseminated. The key is being able to extract valuable information from text, leading to a range of tasks associated with natural language processing.

About the Course

Content, theoretical, and practical goals.

Content

Course topics include:

  • History of NLP
  • Processing Pipelines
  • Embedding Techniques
  • Stylometry
  • Sentiment Detection
  • Word Sense Disambiguation
  • Causality in NLP

Theoretical Goals

In this course, students will learn to:

  • Understand key challenges and limitations in NLP
  • Analyse an NLP problem and its data
  • Apply NLP techniques to text
  • Evaluate the performance of NLP methods

Practical Goals

At the end of this course, students will be able to:

  • Preprocess (big) textual datasets
  • Perform feature engineering on textual data
  • Apply NLP methods and libraries to text
  • Conduct evaluations to test performance

Topics


Lecture

The lectures take place in the lecture hall (HS i9), but are also recorded and made available via TUbe. Attending the lectures is not mandatory, and the course can be completed remotely.

Topic Notes

Course Organization

Overview of the administrative aspect of the course (e.g., important dates, project overview, ...).

History of NLP

Brief overview of the discipline of NLP, how it evolved, and the main tasks in NLP.

Traditional NLP Pipeline

Overview of the traditional approach to NLP, including sentence splitting, tokenisation, PoS-tagging, sentiment detection, ...
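
The first two pipeline stages can be sketched with naive regular-expression rules; this is a minimal illustration only, since real pipelines (e.g., in spaCy or NLTK) handle abbreviations, quotes, and many other edge cases:

```python
import re

def split_sentences(text):
    # Naive: split after ., ! or ? followed by whitespace.
    # Real splitters also handle abbreviations ("Dr."), quotes, etc.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Separate word tokens from punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

for sentence in split_sentences("The cat sat. It purred!"):
    print(tokenize(sentence))
```

For the example above, this prints the token lists `['The', 'cat', 'sat', '.']` and `['It', 'purred', '!']`.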

Evaluation & Hypothesis-Testing

Overview of the evaluation approaches to NLP.

Dror, R., Baumer, G., Shlomov, S. and Reichart, R. 2018. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. (2018), 1383–1392.
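
As a rough illustration of significance testing in NLP, the following sketches one simple variant of the paired bootstrap test over per-item accuracies; the exact resampling scheme and the percentile-style p-value are simplifying assumptions, and Dror et al. discuss several alternatives:

```python
import random

def paired_bootstrap(gold, sys_a, sys_b, n_resamples=10000, seed=0):
    # Repeatedly resample the test set with replacement and check how
    # often system A fails to outperform system B on the resample.
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum((sys_a[i] == gold[i]) - (sys_b[i] == gold[i]) for i in idx)
        if diff <= 0:
            not_better += 1
    # Small value: A's advantage is unlikely to be due to chance.
    return not_better / n_resamples

gold = [1] * 20
good = [1] * 20             # 100% accuracy
weak = [1] * 10 + [0] * 10  # 50% accuracy
print(paired_bootstrap(gold, good, weak, n_resamples=2000))  # near zero
```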

Word Embeddings

Overview of contemporary word embedding techniques, starting with Word2Vec and GloVe.

Pennington, J., Socher, R. and Manning, C.D. 2014. GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP) (2014), 1532–1543.
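
To illustrate how embeddings are typically used, here is a minimal sketch of cosine similarity over made-up toy vectors (real Word2Vec/GloVe vectors have hundreds of dimensions; the values below are illustrative only):

```python
import math

def cosine(u, v):
    # Cosine similarity: the standard measure of relatedness
    # between embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-dimensional "embeddings" (invented for illustration).
king  = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.2]
apple = [0.1, 0.2, 0.9]

print(cosine(king, queen) > cosine(king, apple))  # semantically closer pair wins
```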

Deep Learning

Recent advances in NLP via deep learning, starting with sequence-to-sequence models up to BERT/BART/GPT-3.

Singh, S., & Mahmood, A. (2021). The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures, 1–27.
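
The core operation shared by these Transformer-based models is scaled dot-product attention. The following pure-Python sketch shows a single attention head without learned projection matrices, which is a deliberate simplification of the full mechanism:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: each query attends to all keys,
    # and the output is the attention-weighted average of the values.
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

A query aligned with the first key yields an output dominated by the first value row, which is the "soft lookup" intuition behind attention.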

Word Sense Disambiguation

Short introduction to the ambiguities of words, e.g., looking at the difference between sense and meaning, and an overview of the main approaches to resolving them.

Darmon, A.N.M., Bazzi, M., Howison, S.D. and Porter, M.A. 2018. Pull out all the stops: Textual analysis via punctuation sequences.

Klein, D. and Murphy, G. 2002. Paper has been my ruin: conceptual relations of polysemous senses. Journal of Memory and Language. 47, 4 (Nov. 2002), 548–570. DOI:https://doi.org/10.1016/S0749-596X(02)00020-7.
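
One classic baseline for word sense disambiguation is the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the context. The glosses below are made-up toy definitions, not entries from a real lexical resource such as WordNet:

```python
def lesk(context_words, senses):
    # Simplified Lesk: score each sense by the word overlap between
    # its gloss and the context, and return the best-scoring sense.
    context = set(context_words)
    best, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

senses = {
    "bank/finance": "institution that accepts deposits and lends money",
    "bank/river":   "sloping land beside a body of water",
}
print(lesk("i deposited money at the bank".split(), senses))
# picks the finance sense, since "money" overlaps with its gloss
```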

Stylometry

Overview of stylometric features used in NLP, for tasks like authorship attribution and plagiarism detection.
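
A minimal sketch of a few classic surface-level stylometric features; an actual attribution system would combine many more signals, e.g., function-word frequencies, character n-grams, and punctuation patterns:

```python
import re

def style_features(text):
    # Simple surface statistics of a text sample.
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),  # in tokens
        "type_token_ratio": len(set(tokens)) / len(tokens),   # vocabulary richness
    }

print(style_features("The cat sat. The cat ran."))
```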

Causality in NLP

Causality relates to NLP in two ways: i) directly identifying causal relationships in text, e.g., the relationship between a drug and its response; ii) using causal structures to improve NLP tasks, e.g., improving text classification by exploiting a better understanding of the text generation process.

Privacy-Preservation, Fairness, and Bias in NLP

Even text may contain sensitive information, which needs to be identified and removed in order to protect privacy. Word embeddings in particular have been studied for their inherent bias. Discriminatory information in text may have implications for fairness, including legal concerns.

Reports


The report describes the work on the practical project and the evaluation results.

Overview Guidelines

  • Overview of the problem (clearly stated, maybe introduce a notation)
  • Related literature (how has the problem been solved before, is there any prior work)
  • Description of the method (what has been done, with explanations for decisions)
  • Description of the dataset, if applicable (charts of distributions or tables with key characteristics are expected)
  • All preprocessing steps, dataset splits, hyperparameter tuning, etc.
  • Important: Evaluation results (how well is the problem solved?)

As a general guideline, the report should be written so that the results can be reproduced, and it should adhere to the usual reporting standards. Consult the ARR Responsible NLP Research checklist for help in deciding what and how to report. If you like, you may also add an Impact Statement.

Structure of the Report

Here are a number of suggestions for the written report and how to structure it:

  • Clear description of the problem setting in the introduction
  • Provide some related work to make clear that the authors are aware of the current state of the art and, optionally, of alternative solutions (this might be part of the introduction or a separate related-work section).
  • Fundamental techniques do not need to be described (e.g., back-propagation, logistic regression); if the approach makes use of lesser-known techniques, these should be briefly introduced to make the report self-contained.
  • Consider introducing a formal notation of the problem/task (and use the notation consistently throughout the report)
  • The report should be written in a way that makes it possible to reproduce the results (i.e., to reimplement the approach). Optionally, make use of an appendix to list (hyper-)parameters that might be relevant but would detract from the readability of the report.
  • Evaluation results, including:
    • Evaluation methodology (how the evaluation was conducted)
    • Description of the dataset (including relevant statistics of the dataset)
    • Provide baseline results to help the reader understand how hard the problem is
    • Consider an in-depth analysis of why the approach worked (maybe conduct an ablation study; demonstrate and discuss the feature importance; discuss special cases where the approach did not work, ...)
  • Maybe split the evaluation part into two sections: 1) the objective results, 2) the interpretation of the results (typically the discussion section)
  • The discussion section should provide the key insights and the context (i.e., not only list what worked (or not), but also provide possible reasons why it worked (or not), or when it should work). Also, there should be a clear limitations section or paragraph that makes clear that the authors are aware of potential limitations.
  • The conclusions should contain the key insights and takeaway message (i.e., not only a repetition of what has been done)
  • The appendix should contain a table with the contributions of the individual authors and a second table with the responsible NLP checklist.

Resources


There are a few seminal textbooks on the topic of NLP and text mining:

  • Papers with Code - a great website with example implementations for many publications

Projects


Project Rules

There is no limitation on what tools or programming language to use.

You are free to make use of existing tools and libraries.

There is also no limitation on the language in which the project is conducted (English, German, Klingon, ...), unless otherwise stated. Since the availability of tools for specific languages varies, the grading will be adapted to account for the additional effort required for "exotic" languages. See the bottom of this page for further dataset recommendations.

Project Topic Suggestions

  • TU Graz Data Team: Participate in a (text-based) challenge together with the members of the Data Team. If interested, join the Discord.
  • Extract Causal Expressions: Build a system to automatically extract causal expressions from (unseen) text. Training data suggestions: BECAUSE (English); for German, a new resource that also includes a baseline performance: Rehbein, I. and Ruppenhofer, J. 2020. A new resource for German causal language. Proceedings of the 12th Language Resources and Evaluation Conference (Marseille, France), 5968–5977.
  • Emotion and Motivation Extraction: Extract emotions and sentiment from text, optionally together with a trigger. This can also be extended to toxic text (e.g., flame wars, shitstorms, ...), e.g., what triggered a certain response. See Poria, S., Majumder, N., Hazarika, D., Ghosal, D., Bhardwaj, R., Jian, S.Y.B. and Mihalcea, R. 2020. Recognizing Emotion Cause in Conversations.
  • Event Extraction: Extract events and temporal statements from text. Typically, newspaper articles are a starting point for identifying (ongoing) events. See Caselli, T. and Vossen, P. 2017. The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction, 77–86. https://doi.org/10.18653/v1/w17-2711
  • Diachronic Changes in Text: Language is always on the move; each generation introduces its own changes. The aim of this project is to study change over time, which may include changes in topics, grammar, word sense, ... Recommended starting point for word senses: SemEval 2020, Task 1.
  • NLP on Historic Text: In the context of cultural heritage initiatives, more and more historical textual data is made available, typically generated by OCR from scanned sources. Here the challenge is not only the change in language, but often the poor quality of the sources and the errors introduced in the processing chain. All of the other tasks mentioned are applicable to historic text. Possible data sources: Archiv.org, ANNO, Europeana.
  • NLP on Legal Text: Laws and other legal texts deviate to a great extent from typical language use. The project may 1) analyse and quantify the differences between legal text and common text, or 2) study a use case to support legal activities. Starting point: the CLAIM repository and the paper Zhong, H., Xiao, C., Tu, C., Zhang, T., Liu, Z. and Sun, M. 2020. How does NLP benefit legal system: A summary of legal artificial intelligence. arXiv preprint arXiv:2004.12158.
  • Text Simplification: Often text is written for a specific target audience, with the result that the general public does not fully understand it. Examples are legal, medical, and engineering texts. This project aims to develop text simplification for a domain-specific dataset in order to allow a wider audience to understand the content. It may also include the use of domain-specific knowledge bases (graphs) to look up terminology, and could also be tackled as a text translation task.
  • Knowledge Graph #1: Use an existing knowledge graph for tasks like named entity recognition. The knowledge graph might be domain-specific.
  • Knowledge Graph #2: Starting with an existing knowledge graph, use a (large) textual corpus to add information to the knowledge graph (completion).
  • Trending Topics: What are the current trends? Identify how topics shift in online conversation. Dataset suggestions: Twitter, Reddit, newspapers.
  • Fake News: Develop an algorithm that helps identify fabricated news. There are many reasons how and why news or facts are fabricated (accident, misunderstanding, wishful thinking, propaganda, ...). In this project, pick one scenario together with a dataset and develop a method to identify fake news. Options range from unsupervised systems that look at linguistic clues to supervised and knowledge-based systems (database of facts). Recommended starting point: Murayama, T. 2021. Dataset of Fake News Detection and Fact Verification: A Survey. arXiv preprint arXiv:2111.03299.
  • Readability: Are articles from one newspaper more (or more easily) readable than those from another? Are there differences between Wikipedia articles based on the topic? Do classical readability scores really measure how readable a text is? Note: You will need to obtain your own dataset.
  • Tool-Specific Writing Style Dataset: Collect a dataset of text pairs. The pairs should cover the same topic, but be written via different tools, e.g., 1) desktop vs. mobile interfaces, 2) with and without spelling suggestions/corrections. The project consists of planning the collection, the collection itself, and an in-depth analysis of the dataset.
  • Situation-Specific Writing Style Dataset: Collect a dataset of text pairs, where the difference is the situation the writer is in at the time of writing. Situations may range from writing in the bus/tram, to writing at home, to writing under pressure. All factors apart from the situation should be controlled (e.g., same phone, same text length, ...). The project consists of planning the collection, the collection itself, and an in-depth analysis of the dataset.
  • DerStandard Forum Writing Style: Develop methods to identify specific users based on 1) writing style (certain key phrases), 2) time, 3) replies to certain topics/other users. Note: Other tasks with this dataset can also be proposed.
  • Fine-Grained Sentiment Detection: Develop (or apply) methods for opinion mining/sentiment detection that go beyond simple positive/neutral/negative classification. This can be 1) more detailed emotions/sentiment, or 2) aspect-oriented sentiment detection (as in the example sentence "the food is good, but the service is bad"). The dataset depends on the actual task; have a look at the SemEval initiative.
  • Quora Authorship Attribution: The goal is to develop an authorship attribution method. Since texts on Quora are longer than, e.g., on Twitter, the method is expected to perform better. The dataset needs to be crawled from Quora.
  • Authorship Dataset: Collect a (large) dataset for authorship attribution, i.e., recruit (a lot of) people and have them write the same story using their own words/style. Analyse the dataset and propose ways to algorithmically identify individual authors/styles. Note: A tool to collect the data is available.
  • Plagiarism Detection: For a given document, specify which parts have been copied from another document (or not). This can be 1) external (with a reference corpus) or 2) intrinsic (just by style change). For datasets, have a look at the PAN competition.
  • Text Classification: A classical NLP task: given an unseen document and a predefined set of classes (e.g., spam, non-spam), determine which class the document belongs to. Typically, this is tackled as a supervised machine learning task. For datasets, see the NLP Datasets section below.
  • Information Extraction: Another classical NLP task: identify entities and relations in text. In the classical setting, the entities are named entities (person, location, ...) together with a specific, predefined relation. Typically, this task is solved via machine learning (sequence classification). For datasets, see the NLP Datasets section below.
  • Interaction Information: Given a large dataset, 1) split the text into words, 2) compute the interaction information (multivariate mutual information) between tuples of three words, 3) analyse the results and categorise the relationship types of the tuples (specifically, identify cases where the sign reverses). Advanced: create a synthetic dataset for in-depth analysis.
  • Privacy-Preservation on Text: Often we want to share datasets (e.g., for research), but the dataset contains private information. This is especially problematic for text, and for specific domains, e.g., the medical domain. Develop a method to convert a sensitive dataset into one that can be shared with others. See this paper for an overview and the datasets used: Mahendran, D., Luo, C. and McInnes, B.T. 2021. Review: Privacy-preservation in the context of natural language processing. IEEE Access, 9, 147600–147612. https://doi.org/10.1109/ACCESS.2021.3124163
  • Open Information Extraction for German: Proposition extraction from newspaper articles, with the goal of extracting factual statements. Either adapt and use an existing approach, or propose a dedicated approach of your own to extract the core facts (propositions) from text. Note: You will need to obtain your own dataset.
  • Your Own: You are free to choose your own problem (in the area of NLP). Note: Please send your proposal to the instructor for approval.
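
For the readability project above, a starting point is the classic Flesch reading-ease score (higher means easier). The sketch below approximates syllable counts by counting vowel groups, a rough heuristic for English that real implementations refine:

```python
import re

def flesch_reading_ease(text):
    # Flesch reading ease:
    #   206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    words = re.findall(r"[a-zA-Z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Approximate syllables as vowel groups, at least one per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

print(flesch_reading_ease("The cat sat. The dog ran."))  # high score: easy text
```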

NLP Datasets

There are a number of NLP-related datasets which may be used for the practical projects.