The goal of this course is to continue and build upon the theory covered in Knowledge Discovery and Data Mining 1. In KDDM2 the emphasis lies on practical aspects, and therefore the practical exercise is an integral part of this course. In addition, a number of algorithmic approaches related to the project topics will be covered in detail. Participants may choose one of a number of proposed projects, which cover different stages of the Knowledge Discovery process and different data sets.
The instructors of the course are Roman Kern and Maximilian Toller. If you have open questions, please feel free to send an e-mail with the prefix [KDDM2] in the subject.
Knowledge Discovery and Data Mining are closely related to the concepts of Big Data, Data Science and Data Analytics. Data science encompasses a number of diverse skills, ranging from software engineering to data mining and statistics, and follows a scientific, evaluation-driven approach. This course aims to develop some of these important skills across a diverse set of focus areas. In addition, strong skills in analysing and preprocessing big data sets are a necessary prerequisite for many Knowledge Discovery applications.
The slides and resources from the previous years are available here: 2020 (WS), 2020, 2019, 2018, 2017, 2016, 2015, 2014
Course topics include:
In this course the students will learn about:
At the end of this course the students will be able to apply:
The lectures are provided as videos and slides, which can be accessed via TeachCenter. In addition, there are weekly supplementary sessions on Thursdays, 14:00 - 15:00, online via WebEx (even weeks) and on site in the lecture hall (odd weeks) - please check the timetable. In these sessions an instructor will be present and available for questions, short tutorials and hands-on assistance with projects/homework (feel free to join).
Topic | Notes |
---|---|
Course Organization | Introduction to the course and the administrative aspects. |
Ensemble Methods | Combination of multiple learning algorithms/models/hypotheses. Each learning algorithm might have different strengths and weaknesses. The idea of an ensemble is to combine the (weak) learners (e.g., base classifiers) into a combination that eliminates some of the weaknesses and combines some of the strengths. Today, ensemble algorithms like Random Forests or Gradient Boosting are go-to methods for many data science tasks. |
Time Series Data | Time series data requires specific preprocessing and analysis. |
Anomalies in Data | Outliers/Anomalies/Surprise/Noise - all the same, or different? |
Graph Data Science | Graphs can model complex systems such as social, biological and web networks, and even molecules and meshes. Graphs require specific processing and analysis compared to other data structures, such as tabular, image, or time series data. |
Causal Data Science | How does causality help in Data Science? |
Privacy-Preserving and Fairness in Data Science | Confidentiality and fairness in Data Science. Introduction to the main concepts, including k-Anonymity, Differential Privacy and Federated Learning. |
Bias & Assumptions in Data Science | What assumptions do we have to make to effectively conduct data science? Is our data biased, or our algorithms? And why? And what can we do about that? |
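
To illustrate the Ensemble Methods topic, the following is a minimal sketch of majority voting over weak learners. The three threshold rules (`learner_a`, `learner_b`, `learner_c`) are hypothetical stand-ins for trained base classifiers and are not taken from the course material. Practical ensembles such as Random Forests or Gradient Boosting build and combine their base learners in more elaborate ways.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most base learners."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical weak learners: each one is a simple threshold rule
# on a two-dimensional input x = (x0, x1).
def learner_a(x):
    return 1 if x[0] > 0.5 else 0


def learner_b(x):
    return 1 if x[1] > 0.5 else 0


def learner_c(x):
    return 1 if x[0] + x[1] > 1.0 else 0


def ensemble_predict(x, learners=(learner_a, learner_b, learner_c)):
    """Combine the individual predictions by majority vote."""
    return majority_vote([learner(x) for learner in learners])


if __name__ == "__main__":
    for point in [(0.9, 0.2), (0.6, 0.3), (0.2, 0.9)]:
        print(point, ensemble_predict(point))
```

With an odd number of binary classifiers there are no ties, which is why three learners are used in this sketch.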
Created with Ganttproject: kddm2-project-plan-ws2021.gan
There are several practical projects from different phases of the Knowledge Discovery process to choose from. Projects are normally carried out by individual students (groups of one), but it is also possible to form groups of two, in which case the project scope is expanded accordingly (see the Advanced items). The students are expected to present their work as a video presentation (with a short teaser video).
For all projects the evaluation of the work is considered to be an integral part. One needs to be able to state how well a solution works and what its expected limitations are.
Data sets are proposed for each of the projects, but participants are free to come up with data sets of their own or to make their own project proposals.
Topic | Notes |
---|---|
TUG Data Team | Task: Take part in a scientific challenge (shared task) or a Kaggle challenge of your choice. More details about the data team and how to get in touch can be found on the Data-Team Homepage. |
Topic | Notes |
---|---|
Anomaly Detection | Task: Given some observations (data), find the instances that do not conform to the remainder of the observations. Approach: Select a dataset and setting. Analyze the dataset, search for appropriate anomaly detectors, implement & run them on the dataset, and report their performance. Suggested datasets:
Advanced #1: Implement your own anomaly detection algorithm and compare its performance against the baseline. Advanced #2: Work with time series datasets (if interested, please contact instructors). |
Privacy Preservation | Task: Given a dataset that contains some sensitive information, transform it into a representation (e.g., a modified version of that dataset) that no longer contains the sensitive information. The type of the sensitive information is defined beforehand and may concern either membership (e.g., whether a certain person is part of the dataset) or some attribute (e.g., the income). |
Fairness in Data Science | Task: Propose a dataset that may contain information leading to unfair results unless special methods are applied. It should be clear that in a default data science pipeline some instances (e.g., a sub-population) would be treated unfairly. It is expected that a definition of fairness is first introduced for the dataset (i.e., what do we consider as fair in this context). |
Detect Causality | Task: Identify causal relationships (i.e., cause and effect) directly from the data. This is an advanced data science task with big potential. Data-Set: Data from a past challenge (complete with prior research): Causality Challenge #1. Papers:
|
Dataset Collection | Task: Collect datasets (e.g., open governmental datasets), analyse them and assess their suitability for a number of application scenarios. Approach: Research datasets, assess their key characteristics, and apply data science methods to assess their usefulness. Advanced: Build a database (or similar) for collecting and updating the relevant key parameters of each dataset. |
Health Index | Task: In data science we often want to prevent (or predict) failures. For example, we want to avoid accidents caused by material fatigue. Therefore, it is necessary to derive a health index from the data that predicts, e.g., the remaining life span of parts. Data-Set: A suggested starting point: NASA Prognostics Data Repository, especially the dataset mentioned in the paper below. Papers:
|
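
As a possible starting point for the Anomaly Detection project, the sketch below flags observations that do not conform to the remainder of the data, using a modified z-score based on the median and the median absolute deviation (MAD). The choice of this detector, the function name `mad_anomalies`, the scaling constant 0.6745 and the threshold 3.5 (a commonly used convention) are assumptions for illustration and are not prescribed by the course.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # At least half of the values equal the median: the score is undefined.
        return []
    return [
        i for i, dev in enumerate(deviations)
        if 0.6745 * dev / mad > threshold
    ]


if __name__ == "__main__":
    readings = [10, 11, 9, 10, 12, 10, 50, 11]
    print(mad_anomalies(readings))
```

A simple detector like this can serve as the baseline against which a more elaborate or self-implemented detector is compared.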
Topic | Notes |
---|---|
Machine Learning | Task: Automatic tagging of resources, using either an unsupervised or a supervised approach. The goal is to apply tags to an unseen resource. Approach: For this project the approach may vary widely. A supervised approach requires a training dataset and may include classification algorithms. Unsupervised approaches may either look at a set of resources or a single resource at a time. Suggested data-sets:
Advanced: Implement an unsupervised and a supervised approach, and then compare the two. Measure the difference in accuracy and discuss their individual strengths and weaknesses. |
Explainable AI | Task: Compare two machine learning models with each other and explain/interpret what they have learnt, for example which features have been picked by the respective models. Approach: Train two separate machine learning models - they should be distinctly different, for example a CNN and a linear model. Advanced: Compare approaches that are predominantly physics-driven (i.e., the features are derived from expert knowledge or physical laws) with data-driven approaches (i.e., no explicit feature engineering). |
Topic | Notes |
---|---|
Time Series Prediction | Task: Given a set of sensor data for multiple streams (e.g., temperature, power consumption), predict the future values of these signals, ideally including a confidence range. Approach: Take the stream of data and remove the last ~10% of the data. Build a prediction algorithm that predicts the future values of the streams as accurately as possible and compare against the values you removed. Suggested data-sets:
Advanced: Investigate your prediction algorithm and try to determine under which (controlled) circumstances it makes correct predictions. |
Time Series Classification | Task: Classification of time series data. Approach: Pick a dataset from the linked repository of time series datasets and try to reproduce (or surpass) the posted performance values. Suggested data-sets:
Advanced: Analyse how the performance (i.e., classification accuracy) drops as less data is used (fewer parts of the time series), to simulate an early classification problem. |
Pattern Mining in Time Series | Task: Given a set of sensor data for multiple streams (e.g., temperature, power consumption), apply sequential pattern mining on a time series dataset. Suggested datasets:
Advanced: Preprocess the data using the Matrix Profile technique and analyze the effect this has on your results. |
Seasonality in Time Series | Task: Detect the seasonality (i.e., the number of observations in the dominant repeating pattern of the time series). Approach: Take the stream of data, build your own season length detection algorithm and compare it against an existing algorithm. Suggested data-sets:
|
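
The evaluation procedure described for the Time Series Prediction project (remove the last ~10% of the data, predict it, compare against the removed values) can be sketched as follows. The function names, the last-value baseline and the choice of mean absolute error as the metric are assumptions made for illustration. An actual project would replace `last_value_forecast` with its own prediction algorithm.

```python
def split_holdout(series, holdout_fraction=0.1):
    """Split a series into a training part and the held-out final part."""
    n_test = max(1, int(len(series) * holdout_fraction))
    return series[:-n_test], series[-n_test:]


def last_value_forecast(train, horizon):
    """Naive baseline: repeat the last observed value for every future step."""
    return [train[-1]] * horizon


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # Hypothetical sensor stream with a linear trend.
    stream = [float(t) for t in range(20)]
    train, test = split_holdout(stream)
    forecast = last_value_forecast(train, horizon=len(test))
    print("MAE of the baseline:", mean_absolute_error(test, forecast))
```

Reporting the error of such a naive baseline alongside the actual model makes it easier to judge how much the model really contributes.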
Topic | Notes |
---|---|
Query Completion | Scenario: A user is starting to search by typing in some words... Task: The system should automatically suggest word completions, depending on the already entered words. Suggested data-sets:
Advanced #1: Provide an estimate of the number of hits for each completion suggestion. Advanced #2: Provide not only completions, but also similar queries (yielding similar search results), which may also include synonyms. |
Blog Search | Task: Provide a list of matching resources for a given piece of text. The goal is to produce a ranked list of items relevant to a context. Use-Case: Consider a user writing a text, for instance a blog post. While typing, the user is presented with a list of suggested items, which might be relevant or helpful. Suggested data-sets: Same as previous task. Advanced: Identify Wikipedia concepts within the written text. For example, if the text contains the word Graz it should be linked to the corresponding Wikipedia page. |
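
For the Query Completion scenario, one simple baseline is to suggest the most frequent previously seen queries that start with the text typed so far. The sketch below assumes a hypothetical query log as its data source. The function names `build_index` and `complete` are illustrative only, and a real system would typically use a more efficient structure (e.g., a trie) and take the context of earlier words into account.

```python
from collections import Counter


def build_index(queries):
    """Count how often each (normalised) query occurs in the log."""
    return Counter(q.strip().lower() for q in queries if q.strip())


def complete(index, typed, k=3):
    """Return up to k logged queries starting with the typed text.

    Suggestions are ordered by descending frequency, then alphabetically.
    """
    prefix = typed.lower()
    matches = [(q, n) for q, n in index.items() if q.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]


if __name__ == "__main__":
    # Hypothetical query log.
    log = ["data mining", "data science", "data mining",
           "deep learning", "Data Science", "data mining"]
    index = build_index(log)
    print(complete(index, "da"))
```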
Topic | Notes |
---|---|
Graph dataset collection | Task: Collect a graph dataset, analyze its properties (e.g., attributes, homophily, graph structural properties), and define a prediction task on it, such as node classification or link prediction. Example: Crawl a network of articles on a topic of your choice. Edges can be the web links between the articles. You can use the InfoBox or the first paragraph to create node features/labels. |
Prediction on a graph dataset | Task: Select a publicly available graph dataset with a task (e.g., node property prediction, edge property prediction). Analyze the dataset, and provide approaches for solving the respective task. Tools: network analysis (e.g., with networkx), label/attribute propagation, graph embedding, graph neural networks. Resources for graph data: PyTorch Geometric, SNAP, OGB (for large scale data). |
Transform into a graph | Task: Select a publicly available non-graph dataset (e.g., text corpus, tabular data, image data, ...) with a certain task (e.g., classification, regression). Find a way to build a graph from this data to improve the performance of traditional models through graph models. Study your built graph by analyzing homophily and find which models are suitable for the task. Train your models to solve the respective task. It's not required that your solution significantly outperforms the traditional solution. Tools: network analysis (e.g., with networkx), label/attribute propagation, graph embedding, graph neural networks. For inspiration, see section 5.2. Downstream Tasks in this paper and Text GCN section in this paper. |
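
Several of the graph projects ask for an analysis of homophily. One common way to quantify it is the edge homophily ratio, i.e., the fraction of edges that connect two nodes with the same label. The following is a minimal sketch under that assumption. The function name `edge_homophily` and the toy graph are hypothetical, and other homophily measures exist.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label.

    edges:  iterable of (u, v) node pairs
    labels: mapping from node to its class label
    """
    edges = list(edges)
    if not edges:
        raise ValueError("the graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


if __name__ == "__main__":
    # Hypothetical toy graph: four articles linked in a cycle, two topics.
    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
    labels = {"a": "sports", "b": "sports", "c": "politics", "d": "politics"}
    print(edge_homophily(edges, labels))
```

A value close to 1 indicates that linked nodes mostly share a label, which tends to favour label propagation and standard graph neural networks, while a low value suggests that such models may need adaptation.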