(706.715) Knowledge Discovery and Data Mining 2

About

Motivation

Knowledge Discovery and Data Mining are closely related to the concepts of Big Data, Data Science and Data Analytics. Data science encompasses a number of diverse skills, ranging from software engineering to data mining and statistics, and follows a scientific, evaluation-driven approach. This course aims to develop some of these important skills across a diverse set of focus areas. In addition, strong skills in analysing and preprocessing big data sets are a necessary prerequisite for many Knowledge Discovery applications.

The slides and resources from the previous years are available here: 2020, 2019, 2018, 2017, 2016, 2015, 2014

About the Course Content, theoretical, and practical goals.

Content

Course topics include:

Data Mining
Information Retrieval
Pattern Mining
Machine Learning
Text Mining
Time Series
Causality

Theoretical Goals

In this course the students will learn about:

The KDD process in detail
Working with big data sets
Real-world problem settings
Advanced statistics and algorithms

Practical Goals

At the end of this course the students will be able to:

Preprocess (big) data sets
Engineer features on (heterogeneous) data
Apply clustering and classification algorithms
Apply information retrieval and recommender algorithms

Topics

Lectures

The lectures are provided as videos and slides, which can be downloaded directly from this web site. In addition, there are online Q&A sessions on Thursdays, 14:00 - 15:00, via WebEx, where the lecturer is available for questions regarding the topics and projects (feel free to join).

Course Organization

Videos:

Introduction to course (13 min)

Slides: Course Organisation

Introduction to the course and the administrative aspects.

Ensemble Methods

Videos:

Ensemble Introduction (15 min)
Ensemble Methods (54 min)

Slides: Ensemble Methods

Combination of multiple learning algorithms/models/hypotheses.

Each learning algorithm might have different strengths and weaknesses. The idea of an ensemble is to combine the (weak) learners (e.g., base classifiers) into a combination that eliminates some of the weaknesses and combines some of the strengths.

Today, ensemble algorithms like Random Forests or Gradient Boosting are go-to methods for many data science tasks.

Time Series Data Analysis

Videos:

Overview & Stationarity (27 min)
Forecasting (23 min)
Classification & Representation (14 min)

Slides: Slides for Time Series Data Analysis

Time series data requires specific preprocessing and analysis.

Anomalies in Data

Videos:

Definition & Overview (22 min)
Robust Statistics (12 min)
Anomaly Detection (23 min)

Slides: Slides for Anomalies in Data

Outliers/Anomalies/Surprise/Noise - all the same, or different?

Causal Data Science

Videos:

Causality Introduction (21 min)
Correlation without Reason (43 min)
Potential Outcomes (11 min)
Structural Causal Model (8 min)
Causal Graphs (24 min)
Causal Inference (28 min)
Causal Discovery (11 min)
Conclusions & Practical Aspects (10 min)

Slides: Slides for Causal Data Science

How does causality help in Data Science?

Privacy-Preserving Data Science

Video:

Video (22 min)

Slides: Slides for Privacy-Preserving Data Science

Privacy protection and confidentiality in Data Science. Introduction to the main concepts, including k-Anonymity, Differential Privacy and Federated Learning.

Bias & Assumptions in Data Science

Videos:

Introduction (8 min)
Assumptions (26 min)
Bias (31 min)
Fairness (14 min)
Shifts (13 min)
Extra content: Benford's Law (3 min)

Slides: Slides for Assumption & Bias in Data Science

What assumptions do we have to make to effectively conduct data science? Is our data biased, or our algorithms? And why? And what can we do about that?

Projects

Videos:

Teaser Videos (25 min, password protected)
All videos and slides of the participating teams can be accessed via TU Graz Cloud (password protected)

All practical projects, together with a teaser video for an overview!
Additionally, please consider providing feedback via these forms:

Team #1, Team #3, Team #4, Team #6, Team #7, Team #8, Team #9, Team #10, Team #11, Team #12, Team #15, Team #16, Team #17, Team #18, Team #19, Team #20, Team #21, Team #22, Team #23, Team #24, Team #25, Team #26, Team #27, Team #28, Team #29, Datateam Kidney #1, Datateam Kidney #3
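
To make the ensemble idea from the Ensemble Methods lecture above more concrete, the following is a minimal sketch of majority voting, one of the simplest ways to combine base classifiers. It is not taken from the course material; the function name `majority_vote` and the three prediction lists are made up for illustration. Each hypothetical classifier is wrong on a different sample, so the combined vote recovers the true labels.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine several classifiers' label lists into one label per sample.

    `predictions` is a list of equally long label lists, one per classifier.
    """
    combined = []
    for labels in zip(*predictions):
        # Pick the label predicted by the most classifiers for this sample.
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical example data: each base classifier errs on a different sample.
truth = [1, 0, 1, 1, 0]
clf_a = [0, 0, 1, 1, 0]  # wrong on sample 0
clf_b = [1, 1, 1, 1, 0]  # wrong on sample 1
clf_c = [1, 0, 1, 0, 0]  # wrong on sample 3

ensemble = majority_vote([clf_a, clf_b, clf_c])
print(ensemble == truth)
```

Random Forests and Gradient Boosting use more elaborate combination schemes (bagging of trees and sequential fitting of residuals, respectively), so this sketch only illustrates the basic principle.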

Calendar

Deadlines

The main deadlines for the course are:

Enrol for the course: 02.10.2020
Please register for the course in TUG Online.
Group registration: 29.10.2020
Send an e-mail to the instructor with the name(s) of the group members and the chosen topic/dataset. You will get a group number in response.
Submit teaser video: 21.01.2021
Upload a short summary of your presentation, less than a minute long, to the KDDM2 Dropzone. It should motivate the other participants of the course to watch your full presentation video! Please include the team number in the video!
Submit presentation/code & homework: 29.01.2021
Upload the video of your presentation and the slides, together with the source code, to the KDDM2 Dropzone. Please name your files according to your group number. This is also the deadline for submitting the homework via TeachCenter. The full video of the presentation should be around 10-20 minutes long.

Work Plan

Created with Ganttproject: kddm2-project-plan-ws2020.gan

Materials

Presentation Templates

Slide templates for LaTeX: Presentation with Beamer
Alternative slide templates for LaTeX: Presentation with Beamer
Slide templates for PowerPoint: PowerPoint Template

Web Resources

List of Machine Learning tools
Introduction to Information Retrieval (Book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze)
Tutorial on Classification
Slides on Gradient Boosted Regression Trees

Examination & Grading

Grading distribution

The grading consists of two parts:

Homework - published via TeachCenter; 10 questions with 2 points each (20 points in total).
Project - a total of 60 points with the following distribution:
- Presentation: 15 points
  - How is the problem & solution being presented?
- Source code: 5 points
  - Quality of the code (i.e., comments, documentation)
  - Portability of the code
- Project: 40 points
  - How is the problem tackled?
  - What is the complexity of the solution?
  - How is the solution evaluated?

Grading scheme

0-40 points: 5
41-50 points: 4
51-60 points: 3
61-70 points: 2
71-80 points: 1
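
The grading scheme above can be written as a small lookup function. This is only an illustrative sketch; the function name `grade` is an assumption, and the maximum of 80 points follows from the 20 homework points plus the 60 project points.

```python
def grade(points: int) -> int:
    """Map the total points (0-80) to a grade according to the scheme above."""
    if not 0 <= points <= 80:
        raise ValueError("points must be between 0 and 80")
    if points <= 40:
        return 5
    if points <= 50:
        return 4
    if points <= 60:
        return 3
    if points <= 70:
        return 2
    return 1


# Example: 18 homework points plus 47 project points.
print(grade(18 + 47))
```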

Exercise & Projects

Overview

There are several practical projects from different phases of the Knowledge Discovery process to choose from. By default, students work on the projects on their own (groups of one), but it is also possible to form groups of two people, in which case the project scope is expanded appropriately (see the "Advanced" part of each topic). The students are expected to present their work as a video presentation.

For all projects the evaluation is considered to be an integral part. One needs to be able to state how well a solution works and what its expected limitations are.

Data sets are proposed for each of the projects, but participants are free to come up with data sets of their own, or to make their own project proposals.

Presentations

Questions that should be covered by the presentation

What problem are you working on?
What are the key characteristics of the data set?
Why did you choose this approach?
How have you tackled the problem?
What are your evaluation results (is the problem solved)?
What have you learnt (new insights)?
Did something unexpected happen?
Would the solution apply to other scenarios (and how well)?

Project Topics

Challenges

TUG Data Team

Task: Take part in a scientific challenge (shared task), or a Kaggle challenge of your choice. More details about the data team and how to get in touch can be found on the Data-Team Homepage.

Data Science Topics

Outlier Detection

Task: Given some observations (data), find the instances that do not conform to the remainder of the observations.

Approach: Select a data-set and setting. Research appropriate algorithmic approaches, implement selected outlier detection algorithms, apply them to the selected dataset and report their performance.

Suggested data-sets:

ODDS
Large collection of outlier detection datasets with ground truth: ODDS
DAMI
Datasets used in scientific publications (complete with performance results): DAMI

Advanced: Implement your own outlier detection algorithm and compare its performance against the baseline.

Privacy Protection

Task: Given a dataset that contains some sensitive information, transform it into a representation (e.g., a modified version of that dataset) that no longer contains the sensitive information. The type of the sensitive information is defined beforehand and could concern either membership (e.g., whether a certain person is part of the dataset) or some attribute (e.g., the income).

Sensor Analytics (Studentlab)

Task: Collect sensor data and detect certain states (only for teams).

Option A: Biosensors (e.g. heart rate, ...), detect positions
Option B: Industrial sensors (e.g. temperature, ...), estimate the number of people within a room
Option C: Fluid & gas sensors (e.g. CO2, ...), detect certain liquids

Data-Set: Needs to be collected in the course of the project; alternatively, use the one from the "Databases 2" course.

Detect Causality

Task: Identification of causal relationships (i.e., cause and effect) directly from the data. This is an advanced data science task with big potential.

Data-Set: Data from a past challenge (complete with prior research): Causality Challenge #1.

Papers: Runge, J., Petoukhov, V., Donges, J. F., Hlinka, J., Jajcay, N., Vejmelka, M., ... & Kurths, J. (2015). Identifying causal gateways and mediators in complex spatio-temporal systems. Nature communications, 6, 8502.

Dataset Collection

Task: Collect datasets (e.g., open governmental datasets), analyse them and assess their suitability for a number of application scenarios.

Approach: Research datasets, assess their key characteristics, apply data science methods to assess their usefulness.

Advanced: Build a database (or similar) that allows the relevant key parameters of each dataset to be collected and updated.
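
As a possible starting point for the Outlier Detection topic above, the following sketch flags outliers with the modified z-score, which is based on the median and the median absolute deviation (MAD) instead of mean and standard deviation. This is one common robust baseline, not a method prescribed by the course. The function name `mad_outliers` and the example values are made up; the constant 0.6745 and the threshold 3.5 are the commonly used defaults for this score.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`."""
    med = median(values)
    # Median absolute deviation: a robust estimate of the spread.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical sensor readings with one obvious anomaly at index 5.
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 25.0, 10.2]
print(mad_outliers(readings))
```

On a dataset with ground truth (such as those suggested above), the returned indices can be compared against the labelled outliers to report precision and recall.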

Machine Learning Topics

Machine Learning

Task: Automatic tagging of resources, using either an unsupervised or a supervised approach. The goal is to apply tags to an unseen resource.

Approach: For this project the approach may vary widely. A supervised approach requires a training dataset and may include classification algorithms. Unsupervised approaches may either look at a set of resources or a single resource at a time.

Suggested data-sets:

Stack Exchange
One example of tagged data is the Stack Exchange pages, which can be downloaded here: Stack Exchange Dump
Last.fm
The Million Song Dataset, which contains tracks, similar tracks as well as tags. The dataset is already split into a training and a testing dataset and can be accessed here: Last.fm Dataset

Advanced: Implement an unsupervised and a supervised approach, and then compare the two approaches. Measure their differences in accuracy and discuss their individual strengths and weaknesses.

Explainable AI

Task: Compare two machine learning models with each other and explain/interpret what they have learnt, for example, which features have been picked by the respective models.

Approach: Train two separate machine learning models - they should be distinctly different, for example a CNN and a linear model.

Advanced: Compare approaches that are predominantly physics-driven (i.e., the features are derived from expert knowledge or physical laws) with data-driven approaches (i.e., no explicit feature engineering).

Time Series Analytics Topics

Timeseries Prediction

Task: Given a set of sensor data for multiple streams (e.g., temperature, power consumption), predict the future values of these signals, optimally including a confidence range.

Approach: Take the stream of data and build a prediction algorithm that is able to predict the future values of the streams as accurately as possible.

Suggested data-sets:

Intel Berkeley Research Lab
Take for example the data from the sensors of the Intel Berkeley Research Labs, see Stream Data Mining Repository. The data is in a format used by many machine learning frameworks, e.g. Weka.
Powersupply Stream
Use the power supply stream from the same data source: Stream Data Mining Repository. Here the challenge is to integrate seasonality into the analysis.
UCI Repository
Repository of multiple data-sets, including timeseries.

Advanced: Try to detect events (e.g. meetings) within the data. This is a hard task, as there is no ground truth to evaluate against, thus it is part of the project to come up with strategies on how to measure the quality of the algorithms.

Time Series Classification

Task: Classification of time series data.

Approach: Pick a dataset from the linked repository of time series datasets and try to reproduce (or surpass) the posted performance values.

Suggested data-sets:

Timeseries Repository
Web site of timeseries datasets together with the performance of a number of different algorithms: Welcome to the UEA & UCR Time Series Classification Repository.

Advanced: Analyse how the performance (i.e., classification accuracy) drops as less data is used (fewer parts of the timeseries), to simulate an early classification problem.

Pattern Mining in Time Series

Task: Given a set of sensor data for multiple streams (e.g., temperature, power consumption), apply sequential pattern mining on a time series data-set (optionally apply SAX beforehand).

Suggested data-sets:

UCI Repository
Repository of multiple data-sets, including timeseries.

Advanced: Pre-process the data via piecewise linear approximation.

Seasonality in Time Series

Task: Detect the seasonality (= number of observations of the dominant repeating pattern in the time series).

Approach: Take the stream of data and build a prediction algorithm that is able to predict the future values of the streams as accurately as possible.

Suggested data-sets:

CRAN Time Series
Time series dataset, see the following paper on how to use this dataset: Toller, M., Santos, T. and Kern, R. (2019) ‘SAZED: parameter-free domain-agnostic season length estimation in time series data’, Data Mining and Knowledge Discovery. doi: 10.1007/s10618-019-00645-z.
NOAA Water Level
For example, the data from the currents, see NOAA Water Level.
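
For the Seasonality in Time Series topic above, a simple baseline is to pick the lag with the highest autocorrelation as the season length. The sketch below illustrates this naive idea on a synthetic sine wave; it is not the SAZED method from the cited paper, and the names `autocorrelation` and `estimate_season_length` are assumptions made for this example.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def estimate_season_length(series, max_lag=None):
    """Return the lag (>= 2) with the highest autocorrelation."""
    if max_lag is None:
        max_lag = len(series) // 2
    return max(range(2, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Synthetic example: a sine wave that repeats every 12 observations.
signal = [math.sin(2 * math.pi * t / 12) for t in range(120)]
print(estimate_season_length(signal))
```

Real data usually contains trend and noise, so detrending the series first and comparing against a published method would be sensible next steps.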

Information Retrieval

Query Completion

Scenario: A user is starting to search by typing in some words...

Task: The system should automatically suggest word completions, depending on the already entered words.

Suggested data-sets:

Wikipedia
Wikipedia can be downloaded as a dump, either as XML or as a MySQL database, from the Wikimedia website.
Europeana
Instead of using a dataset to retrieve relevant items from, one can directly use a search engine. The Europeana project directly supports a JSON query interface, which can be accessed with an API key: Europeana API Portal

Advanced #1: Provide an estimate of the number of hits for each completion suggestion.

Advanced #2: Provide not only completions, but also similar queries (yielding similar search results), which may also include synonyms.

Blog Search

Task: Provide a list of matching resources for a given piece of text. The goal is to produce a ranked list of items relevant to a context.

Use-Case: Consider a user writing a text, for instance a blog post. While typing, the user is presented with a list of suggested items, which might be relevant or helpful.

Suggested data-sets: Same as previous task.

Advanced: Identify Wikipedia concepts within the written text. For example, if the text contains the word Graz it should be linked to the corresponding Wikipedia page.
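
For the Query Completion topic above, a very small baseline is prefix lookup in a sorted vocabulary. The sketch below uses binary search from the Python standard library; the function names `build_index` and `complete` and the example terms are made up, and a real project would build the vocabulary from one of the suggested data-sets and rank the suggestions (e.g., by frequency).

```python
from bisect import bisect_left


def build_index(terms):
    """Return a sorted, lower-cased, de-duplicated list of terms."""
    return sorted({term.lower() for term in terms})


def complete(index, prefix, limit=5):
    """Return up to `limit` terms from `index` that start with `prefix`."""
    prefix = prefix.lower()
    # All terms sharing the prefix form one contiguous run in the sorted list.
    start = bisect_left(index, prefix)
    results = []
    for term in index[start:]:
        if not term.startswith(prefix):
            break
        results.append(term)
        if len(results) == limit:
            break
    return results


# Hypothetical vocabulary for illustration.
index = build_index(
    ["Graz", "graph", "gradient boosting", "grammar", "Greece", "data mining"]
)
print(complete(index, "gra"))
```

The suggestions come back in alphabetical order; taking the previously entered words into account, as the task asks, would require additional context such as n-gram counts.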

Knowledge Discovery and Data Mining 2 (2020/21)

VU (706.715)
