Knowledge Discovery and Data Mining 2

VU (706.715)

The goal of this course is to continue and build upon the theory covered in Knowledge Discovery and Data Mining 1. In KDDM2 the emphasis lies on practical aspects, and therefore the practical exercise is an integral part of this course. In addition, a number of algorithmic approaches related to the project topics will be covered in detail. Participants may choose one out of a number of proposed projects, covering different stages of the Knowledge Discovery process and different data sets.

The instructors of the course are Roman Kern and Maximilian Toller, with further support from a study assistant.

About


Motivation

Knowledge Discovery and Data Mining are closely related to the concepts of Big Data, Data Science and Data Analytics. Data science encompasses a number of diverse skills, ranging from software engineering to data mining and statistics, following a scientific, evaluation-driven approach. This course aims to develop some of these important skills, with a diverse set of focus areas. In addition, strong skills in analysing and preprocessing big data sets are a necessary prerequisite for many Knowledge Discovery applications.

The slides and resources from the previous years are available here: 2022, 2021, 2020 (WS), 2020, 2019, 2018, 2017, 2016, 2015, 2014

Content

Course topics include:

  • Data Mining
  • Time Series Data
  • Causality
  • Anomaly Detection
  • Fairness and Bias
  • Pattern Mining
  • Machine Learning
  • Information Retrieval

Theoretical Goals

In this course the students will learn about:

  • Non-IID data such as temporal, spatial, and graph data
  • Advanced data mining algorithms
  • Causality, Assumptions, Bias, Fairness
  • Correct evaluation

Practical Goals

At the end of this course the students will:

  • Know how to work with non-IID data
  • Be able to use advanced data mining algorithms
  • Understand how to conduct a fair and unbiased data analysis
  • Have gained significant practical data mining experience

Topics


Lectures


Course Organization

Introduction to the course and the administrative aspects.

Ensemble Methods

Combination of multiple learning algorithms/models/hypotheses

Each learning algorithm might have different strengths and weaknesses. The idea of an ensemble is to combine (weak) learners (e.g., base classifiers) in a way that eliminates some of their weaknesses and amplifies some of their strengths.
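As a minimal illustration, a majority-vote ensemble can be sketched in a few lines of plain Python. The three `clf_*` learners below are purely hypothetical stand-ins for trained base classifiers:

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine weak learners by letting each cast one vote for a label."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three illustrative (hypothetical) weak learners over a 1-D input:
clf_a = lambda x: "pos" if x > 0 else "neg"
clf_b = lambda x: "pos" if x > -1 else "neg"
clf_c = lambda x: "pos" if x > 1 else "neg"

print(majority_vote([clf_a, clf_b, clf_c], 0.5))  # "pos" (two of three vote pos)
```

Even though `clf_c` individually misclassifies the input, the ensemble decision is correct, which is exactly the effect the combination is meant to achieve.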

Time Series Data

Time series data requires specific preprocessing and analysis, since we cannot expect that observations in the future are independent of observations in the past.
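One practical consequence is that a time series must be split chronologically rather than randomly, so that the model is never trained on observations from the future. A minimal sketch, with a toy list standing in for a real sensor stream:

```python
def chronological_split(series, test_fraction=0.2):
    """Split a time series without shuffling: train on the past, test on the future."""
    cut = int(len(series) * (1 - test_fraction))
    return series[:cut], series[cut:]

observations = list(range(10))          # stand-in for a real sensor stream
train, test = chronological_split(observations)
print(train, test)                      # [0..7] and [8, 9]
```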

Anomalies in Data

Outliers/Anomalies/Surprise/Noise - all the same, or different?

Graph and Spatial Data Science

Graphs can model complex systems, such as social, biological and web networks, and even molecules and meshes. Here, graphs are assumed to model dependencies as relations between nodes. For spatial data, these dependencies are considered to be closeness, e.g., geographical distance.

Causal Data Science

Causal relations are a special kind of dependency in the real world, which leaves imprints in the observational data. Causal data science studies this process.

Fairness and Privacy in Data Science

Many real-world systems are considered unfair, and so is the data derived from them. Can we mend this situation in data science? How do we deal with sensitive, confidential, and personal data?

Bias & Assumptions in Data Science

What assumptions do we have to make to effectively conduct data science? Is our data biased, or our algorithms? And why? And what can we do about that?

Evaluation

How do we know that our analysis and prediction models truly perform in the real world?
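One standard answer is to evaluate on data the model has never seen, e.g. via k-fold cross-validation. The index bookkeeping behind it can be sketched framework-free (`k_fold_indices` is an illustrative helper, not part of any required library):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

for train, test in k_fold_indices(6, 3):
    print(test)   # [0, 1], then [2, 3], then [4, 5]
```

Each observation is used for testing exactly once, so the averaged score is less dependent on one lucky (or unlucky) split.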

Materials


Poster Templates

Exercise & Projects


Overview

There are several practical projects from different phases of the Knowledge Discovery process to choose from. The work on the projects will be conducted by groups of students with a default size of two to three. It is also possible to form larger groups or to conduct the project on your own; in these cases the project scope is adjusted appropriately. The students are expected to present their work as a poster presentation.

For all projects, the evaluation of the work is considered an integral part. One needs to be able to state how well a solution works and what its expected limitations are.

There are data sets proposed for most of the projects, but participants are free to come up with data sets of their own, or to make project proposals of their own.

Poster

Questions that should be covered by the poster

  • What problem are you working on?
  • What are the key characteristics of the data set?
  • Why did you choose this approach?
  • How have you tackled the problem?
  • What are your evaluation results (is the problem solved)?
  • What have you learnt (new insights)?
  • Did something unexpected happen?
  • Would the solution apply to other scenarios (and how well)?

Project Topics

Challenges

The KDDM2 Challenge

This is the main recommended project for all teams.

This year's KDDM2 challenge will be about digital twins. Your task is to take historical data describing the past state(s) of the real counterpart and build a reasonable digital model. To evaluate your model's quality, you will be asked to forecast the future state of the real counterpart, based on its historical data. There will be a leaderboard that shows which team is currently best at forecasting the future state, as well as the performance of some baselines.

This is the default practical part of KDDM2. You will hear further details on November 23 in the lecture unit on digital twins. If you want to start earlier than that, or if you want to do a different project, please send an email to the course instructors.

TUG Data Team

Task: Take part in a scientific challenge (shared task), or a Kaggle challenge of your choice. More details about the data team and how to get in touch can be found on the Data-Team Homepage.

Data Science Topics

Anomaly Detection

Task: Given some observations (data), find the instances that do not conform to the remainder of the observations.

Approach: Select a dataset and setting. Analyze the dataset, search for appropriate anomaly detectors, implement & run them on the dataset, and report their performance.

Suggested datasets:

  • ODDS
    Large collection of outlier detection datasets with ground truth: ODDS
  • DAMI
    Datasets used in scientific publications (complete with performance results): DAMI

Advanced #1: Implement your own anomaly detection algorithm and compare its performance against the baseline.

Advanced #2: Work with time series datasets (if interested, please contact instructors).
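For a feel of the basic setting, a simple z-score detector can serve as a baseline (a deliberately naive sketch; real projects should use established detectors and the ground-truth labels of the datasets above):

```python
def zscore_outliers(values, threshold=3.0):
    """Flag observations more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

data = [10.1, 9.8, 10.0, 10.2, 9.9, 42.0]    # one obvious anomaly
print(zscore_outliers(data, threshold=2.0))  # [42.0]
```

Note that the anomaly itself inflates the mean and standard deviation, which is one reason more robust detectors exist.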

Privacy Preservation

Task: Given a dataset which contains some sensitive information, transform it into a representation (e.g., a modified version of that dataset) which no longer contains the sensitive information.

The type of the sensitive information is defined beforehand and could either concern membership (e.g., whether a certain person is part of the dataset) or some attribute (e.g., the income).
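A minimal illustration of one such transformation is generalization, where exact attribute values are replaced by coarser ranges (a toy sketch; real privacy preservation would also remove or generalize quasi-identifiers such as names and check formal criteria like k-anonymity):

```python
def generalize_age(records, bin_width=10):
    """Replace exact ages with coarse ranges so individuals are harder to re-identify."""
    out = []
    for rec in records:
        low = (rec["age"] // bin_width) * bin_width
        out.append(dict(rec, age=f"{low}-{low + bin_width - 1}"))
    return out

records = [{"name": "A", "age": 34}, {"name": "B", "age": 37}]
print(generalize_age(records))   # both ages become "30-39"
```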

Fairness in Data Science

Task: Propose a dataset which may contain information that might lead to unfair results unless special methods are applied. It should be clear that in a default data science pipeline some instances (e.g., a sub-population) would be treated unfairly.

It is expected that a definition of fairness is first introduced for the dataset (i.e., what we consider fair in this context).
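One very simple fairness check under such a definition is demographic parity: comparing the positive-prediction rate across sub-populations (an illustrative sketch with made-up predictions and group labels):

```python
def positive_rate_by_group(predictions, groups):
    """Positive-prediction rate per sub-population (demographic parity check)."""
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        rates[g] = sum(predictions[i] for i in idx) / len(idx)
    return rates

preds  = [1, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]
print(positive_rate_by_group(preds, groups))   # group 'a': 2/3, group 'b': 1/3
```

A large gap between the rates hints that the pipeline treats one sub-population differently, which the chosen fairness definition may or may not deem acceptable.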

Detect Causality

Task: Identify causal relationships (i.e., cause and effect) directly from the data. This is an advanced data science task with great potential.

Data-Set: Data from a past challenge (complete with prior research): Causality Challenge #1.

Papers:

  • Runge, J., Petoukhov, V., Donges, J. F., Hlinka, J., Jajcay, N., Vejmelka, M., ... & Kurths, J. (2015). Identifying causal gateways and mediators in complex spatio-temporal systems. Nature communications, 6, 8502.
Dataset Collection

Task: Collect datasets (e.g., open-government datasets), analyse them, and assess their suitability for a number of application scenarios.

Approach: Research datasets, assess their key characteristics, apply data science methods to assess their usefulness.

Advanced: Build a database (or similar) that allows collecting and updating the relevant key parameters of each dataset.

Health Index

Task: In data science we often want to prevent (or predict) failures. For example, we want to avoid accidents caused by material fatigue. Therefore, it is necessary to derive a health index from the data that predicts, e.g., the remaining life span of parts.

Papers:

  • Arias Chao, M., Kulkarni, C., Goebel, K., & Fink, O. (2021). Aircraft Engine Run-to-Failure Dataset under Real Flight Conditions for Prognostics and Diagnostics. Data, 6(1), 5.
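As a toy illustration of the idea, a health index that degrades roughly linearly can be extrapolated with a least-squares line to estimate the remaining life (a deliberately simplified sketch; real prognostics models are far more involved):

```python
def remaining_life(health, threshold=0.0):
    """Fit a line to a degrading health index and extrapolate when it hits threshold."""
    n = len(health)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(health) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, health)) / \
            sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    t_fail = (threshold - intercept) / slope   # time when the fitted line hits threshold
    return t_fail - (n - 1)                    # steps left after the last observation

health = [1.0, 0.9, 0.8, 0.7, 0.6]             # linear degradation toward 0
print(remaining_life(health))                  # about 6 steps remaining
```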

Machine Learning Topics

Machine Learning

Task: Automatic tagging of resources, using either an unsupervised or a supervised approach. The goal is to apply tags to an unseen resource.

Approach: For this project the approach may vary widely. A supervised approach requires a training dataset and may include classification algorithms. Unsupervised approaches may either look at a set of resources or a single resource at a time.

Suggested data-sets:

  • Stack Exchange
    One example for tagged data are the stack exchange pages, which can be downloaded here: Stack Exchange Dump
  • Last.fm
    The Million Song Dataset, which contains tracks, similar tracks as well as tags. The dataset is already split into a training and testing dataset and can be accessed here: Last.fm Dataset

Advanced: Implement an unsupervised and a supervised approach, and then compare the two approaches. Measure their differences in accuracy as well as discuss their individual strengths and weaknesses.
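As a minimal unsupervised baseline, one could suggest tags from the most frequent content words of a resource (an illustrative sketch with a hand-picked stopword list; real approaches would use TF-IDF weighting or trained classifiers):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "how"}

def suggest_tags(text, k=3):
    """Naive unsupervised tagging: the k most frequent non-stopword terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

post = "How to tune garbage collection in the JVM? JVM garbage collection pauses..."
print(suggest_tags(post))   # ['garbage', 'collection', 'jvm']
```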

Explainable AI

Task: Compare two machine learning models with each other and explain/interpret what they have learnt. For example, which features have been picked by the respective models.

Approach: Train two separate machine learning models - they should be distinctively different, for example a CNN and a linear model.

Advanced: Compare approaches that are predominantly physics-driven (i.e., the features are derived from expert knowledge or physical laws) with data-driven approaches (i.e., no explicit feature engineering).

Time Series Analytics Topics

Timeseries Prediction

Task: Given a set of sensor data from multiple streams (e.g., temperature, power consumption), the goal is to predict the future values of these signals, optimally including a confidence range.

Approach: Take the stream of data and remove the last ~10% of the data. Build a prediction algorithm that is able to predict the future values of the streams as accurately as possible and compare against the values you removed.
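Any prediction algorithm should at least beat a naive baseline. The evaluation loop described above might be sketched like this, using a last-value forecast scored with mean absolute error (an illustrative, framework-free sketch):

```python
def naive_forecast_eval(series, holdout_fraction=0.1):
    """Hold out the last part of the stream and score a last-value (naive) forecast."""
    cut = int(len(series) * (1 - holdout_fraction))
    train, test = series[:cut], series[cut:]
    forecast = [train[-1]] * len(test)          # always predict the last observed value
    mae = sum(abs(f - t) for f, t in zip(forecast, test)) / len(test)
    return mae

series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(naive_forecast_eval(series))   # 1.0 (predicting 9 for the held-out 10)
```

Your own model's error on the same held-out values can then be compared directly against this baseline.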

Suggested data-sets:

  • Intel Berkeley Research Lab
    Take for example the data from the sensors of the Intel Berkeley Research Labs, see Stream Data Mining Repository. The data is in a format used by many machine learning frameworks, e.g. Weka.
  • Powersupply Stream
    Use the power supply stream from the same data source: Stream Data Mining Repository. Here the challenge is to integrate seasonality into the analysis.
  • UCI Repository
    Repository of multiple data-sets, including timeseries.

Advanced: Investigate your prediction algorithm and try to determine under which (controlled) circumstances it makes correct predictions.

Time Series Classification

Task: Classification of time series data.

Approach: Pick a dataset from the linked repository of time series datasets and try to reproduce (or surpass) the posted performance values.

Suggested data-sets:

Advanced: Analyse how the performance (i.e., classification accuracy) drops as less of the time series is used, to simulate an early classification problem.

Pattern Mining in Time Series

Task: Given a set of sensor data from multiple streams (e.g., temperature, power consumption), apply sequential pattern mining to a time series dataset.

Suggested datasets:

  • UCI Repository
    Repository of multiple data-sets, including timeseries.

Advanced: Preprocess the data using the Matrix Profile technique and analyze the effect this has on your results.
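The Matrix Profile records, for every subsequence, the distance to its nearest non-overlapping match; low values indicate repeated motifs, high values indicate discords. A brute-force sketch conveys the idea (real implementations such as STUMPY use z-normalized distances and much faster algorithms; this toy version uses plain Euclidean distance):

```python
def matrix_profile(ts, m):
    """For each subsequence of length m, distance to its nearest non-overlapping match."""
    n = len(ts) - m + 1
    subs = [ts[i:i + m] for i in range(n)]
    profile = []
    for i in range(n):
        best = min(
            sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])) ** 0.5
            for j in range(n) if abs(i - j) >= m   # exclude trivial (overlapping) matches
        )
        profile.append(best)
    return profile

ts = [0, 1, 0, 1, 0, 1, 5, 0, 1]
mp = matrix_profile(ts, 3)
print(mp)   # windows covering the value 5 stand out with large profile values
```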

Seasonality in Time Series

Task: Detect the seasonality (i.e., the number of observations spanned by the dominant repeating pattern in a time series).

Approach: Take the stream of data, build your own season length detection algorithm and compare against an existing algorithm.
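A simple starting point for your own detector is to pick the lag with the highest autocorrelation (an illustrative sketch; published methods such as SAZED are considerably more robust):

```python
def season_length(series, max_lag=None):
    """Estimate season length as the lag with the highest autocorrelation."""
    n = len(series)
    max_lag = max_lag or n // 2
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)

    def acf(lag):
        return sum((series[i] - mean) * (series[i + lag] - mean)
                   for i in range(n - lag)) / var

    # Start at lag 2, since lag 1 is trivially high for smooth series.
    return max(range(2, max_lag + 1), key=acf)

signal = [0, 1, 2, 1] * 8          # repeating pattern of length 4
print(season_length(signal))       # 4
```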

Suggested data-sets:

  • CRAN Time Series
    Time series dataset, see paper on how to use this dataset: Toller, M., Santos, T. and Kern, R. (2019) ‘SAZED: parameter-free domain-agnostic season length estimation in time series data’, Data Mining and Knowledge Discovery. doi: 10.1007/s10618-019-00645-z.
  • NOAA Water Level
For example, the currents and water level data from NOAA: NOAA Water Level.

Graph data

Graph dataset collection

Task: Collect a graph dataset, analyze its properties (e.g., attributes, homophily, graph structural properties), and define a prediction task on it, such as, node classification or link prediction.

Example: crawl a network of articles on a certain topic of your choice. Edges can be the web links between the articles. You can use the InfoBox or the first paragraph to create node features/labels.
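When analysing such a dataset, one relevant structural property is edge homophily: the fraction of edges that connect nodes with the same label (a minimal sketch over a toy graph):

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints share the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

labels = {0: "ml", 1: "ml", 2: "db", 3: "db"}
edges = [(0, 1), (1, 2), (2, 3)]
print(edge_homophily(edges, labels))   # 2 of 3 edges connect same-label nodes
```

High homophily suggests that neighborhood-aggregating models (e.g., standard GNNs or label propagation) are well suited to the prediction task; low homophily calls for different model choices.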

Prediction on a graph dataset

Task: Select a publicly available graph dataset with a task (e.g., node property prediction, edge property prediction). Analyze the dataset, and provide approaches for solving the respective task.

Tools: network analysis (e.g., with networkx), label/attribute propagation, graph embedding, graph neural networks.

Resources for graph data: PyTorch Geometric, SNAP, OGB (for large scale data).

Transform into a graph

Task: Select a publicly available non-graph dataset (e.g., text corpus, tabular data, image data, ...) with a certain task (e.g., classification, regression). Find a way to build a graph from this data so that graph models can improve on the performance of traditional models. Study the graph you built by analysing its homophily, and determine which models are suitable for the task. Train your models to solve the respective task. Your solution is not required to significantly outperform the traditional one.

Tools: network analysis (e.g., with networkx), label/attribute propagation, graph embedding, graph neural networks.

For inspiration, see Section 5.2 (Downstream Tasks) in this paper and the Text GCN section in this paper.