Knowledge Discovery and Data Mining 2

Roman Kern (website)

Content

Course topics include:

Data Mining
Information Retrieval
Pattern Mining
Machine Learning
Text Mining
Time Series

Theoretical Goals

In this course the students will learn about:

Learn about the KDD process in detail
Learn about working with big data sets
Learn about real-world problem settings
Learn about advanced statistics and algorithms

Practical Goals

At the end of this course the students will know how to:

Preprocess (big) data sets
Feature engineering on (textual) data
Clustering and classification algorithms
Information retrieval and recommender algorithms

Lectures

The lectures take place in the HS i8 (Inffeldgasse 13, ground floor), on Thursday, 12:15 - 13:45. For exceptions, please see below!

14.03.2019: Course Organization + Ensemble Methods
21.03.2019: Text Mining
28.03.2019: Time Series Data Analysis
04.04.2019: Information Retrieval
02.05.2019: Pattern Mining
09.05.2019: Q & A Session
23.05.2019: Deep Learning
27.06.2019: Presentation: KDDM2 Conference (Start 10:00, HS i9)
Poster
Team #7 (most popular poster),
Team #8, Team #11, Team #14, Team #16, Team #17, Team #19
Reports
Team #2, Team #5, Team #6, Team #12, Team #18
Presentations
Team #1

Work Plan

Created with Ganttproject: kddm2-project-plan-2019.gan

Templates

Poster

Poster templates for: Scribus
Poster templates for: PowerPoint

Report

Report should use the ACM Style

Web Resources

Introduction to Information Retrieval (Book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze)
Tutorial on Classification
Slides on Gradient Boosted Regression Trees
List of Machine Learning tools

All students are required to register for the VU in TUGOnline until 07.03.2019, 23:59.

It is not planned that there is a written (or oral) examination, as the focus lies on the practical exercise. Therefore, the grading will entirely will depend on how the exercise is conducted and on its results.

Overview

There are several practical projects from different phases of the Knowledge Discovery process to choose from.

The work on the projects will be conducted by single students on their own (groups of one). But there is also the possibility to form groups of two people, where the project scope is then expanded appropriately (see advanced). The students are expected to present the progress of their work as a poster presentation (or alternatively, as a written report).

For all projects the evaluation is considered to be an integral part. One needs to be able to state how good a solution works and what are the expected limitations.

There are data sets proposed for each of the project, but students are free to come up with data sets of their own, or make project proposals.

Grading is conducted on these criteria:

How is the problem tackled?
What is the complexity of the solution?
How is the solution evaluated?
How is the problem & solution being presented?

Poster Presentations

Questions being covered by the poster

What problem are you working on?
Why did you choose this approach?
How have you tackled the problem?
What are your evaluation results (is the problem solved)?
What have you learnt (new insights)?
Did something unexpected happen?
Would the solution apply to other scenarios (and how well)?

Topics

I. Participate in a challenge

TUG Data Team

Task: Take part in an scientific challenge (shared task), or an Kaggle challenge of your choice. More details about the data team and how to get in touch can be found on the Data-Team Homepage.

Suggested challenges:

PAN
Homepage: Bots and Gender Profiling, Celebrity Profiling, Cross-Domain Authorship Attribution, or Style Change Detection

II. Text Mining Topics

Document Representation

Task: Build and evaluate different approaches to combine multiple word vectors into a single document vector.

Data-Set: Stanford Sentiment Treebank Dataset

Advanced: Develop and evaluate your own document representation method.

Keywords: Word2Vec, Doc2Vec

Papers:

Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Sarcasm and Irony Detection for German

Task: Build a data-set for irony and sarcasm based on a provided web-platform and then design an algorithm to distinguish between "normal" text and irony/sarcasm.

Data-Set: Collect yourself via platform - you will need to "hire" a number of volunteers to provide sufficiently much data.

Papers:

Ling, J., & Klinger, R. (2016, May). An Empirical, Quantitative Analysis of the Differences between Sarcasm and Irony. In International Semantic Web Conference (pp. 203-216). Springer, Cham.

E-Mail Parsing

Task: Parse a semi-structured data set with the goal to add more structure to the data. The goal is to separate correct sentences from other textual fragments, e.g. ascii tables, greetings, etc.

Use-Case: Pre-processing of textual data for further processing, e.g. for prediction of potential receivers.

Data-Set: The data being worked on are e-mails, which already are semi-structured and the header contains information like the sender.

Suggested data-sets:

Apache Mailing List
e.g. take the MBox files (details for downloading) or alternatively use Amazon
Enron
Mails from Enron employees: Details

Papers:

Lampert, A., Dale, R., & Paris, C. (2009, August). Segmenting email message text into zones. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2 (pp. 919-928). Association for Computational Linguistics.

Advanced: Write a prediction algorithms that is able to recover the sender and the receivers once removed from the mail.

III. Data Science Topics

Outlier Detection

Task: Given some observations (data), find the instances that do not conform to the remainder of the observations.

Approach: Select a data-set and setting. Research appropriate algorithmic approaches, implement selected outlier detection algorithms, apply them on selected dataset and report their performance.

Suggested data-sets:

ODDS
Large collection of outlier detection datasets with ground truth: ODDS
DAMI
Datasets use in scientific publications (complete with performance results): DAMI

Advanced: Implement your own outlier detection algorithm and compare its performance against the baseline.

Sensor Analytics (Studentlab)

Task: Collect sensor data and detect certain states (only for teams)

Option A: Biosensors (e.g. heartrate, ...), detect positions

Option B: Industrial sensors (e.g. temperature, ...), estimate the numner of people within a room

Option C: Fluid & gas sensors (e.g. CO2, ...), detect certain liquids

Data-Set: Needs to be collected in course of the project, or alternatively use the one from "Databases 2" course.

Detect Causality

Task: Identification of causality relationship directly from the data (i.e., cause, effect). This is an advanced data science task with big potential.

Data-Set: Data from a past challenge (complete with prior research): Causality Challenge #1.

Papers:

Runge, J., Petoukhov, V., Donges, J. F., Hlinka, J., Jajcay, N., Vejmelka, M., ... & Kurths, J. (2015). Identifying causal gateways and mediators in complex spatio-temporal systems. Nature communications, 6, 8502.

Dataset Collection

Task: Collect datasets (e.g., open-governmental) datasets, analyse these and assess the suitability of these datasets for a number of application scenarios.

Approach: Research datasets, assess their key characteristics, apply data science methods to assess their usefulness.

Advanced: Build a database (or similar) to allow to collect and update the relevant key parameters of each dataset.

IV. Machine Learning Topics

Machine Learning

Task: Automatic tagging of resources, using either an unsupervised or a supervised approach. The goal is to apply tags to an unseen resource.

Approach: For this project the approach may vary widely. A supervised approach requires a training dataset and may include classification algorithms. Unsupervised approaches may either look at a set of resources or a single resource at a time.

Suggested data-sets:

Stack Exchange
One example for tagged data are the stack exchange pages, which can be downloaded here: Stack Exchange Dump
Last.fm
The Million Song Dataset, which contains tracks, similar tracks as well as tags. The dataset is already split into a training and testing dataset and can be accessed here: Last.fm Dataset

Advanced: Implement an unsupervised and a supervised approach, and then compare the two approaches. Measure their differences in accuracy as well as discuss their individual strengths and weaknesses.

V. Information Retrieval

Query Completion

Scenario: A user is starting to search by typing in some words...

Task: The system should automatically suggest word completions, depending on the already entered words.

Suggested data-sets:

Wikipedia
The Wikipedia can be downloaded as dump, either as XML or as MySQL Database from the Wikimedia website.
Europeana
Instead of using a dataset to retrieve relevant items from, one can directly use a search engine. The Europeana project directly supports a JSON query interface, which can be accessed with an API key: Europeana API Portal

Framework:

For processing of the text you might use: Sensium

Advanced: Identify Wikipedia concepts within the written text. For example, if the text contains the word Graz it should be linked to the corresponding Wikipedia page.

Blog Search

Task: Provide a list of matching resources for a given piece of text. The goal is to produce a ranked list of items relevant to a context.

Use-Case: Consider a user writing a text, for instance a blog post. While typing the user is presented a list of suggested items, which might be relevant or helpful.

Suggested data-sets: Same as previous task.

Framework:

For processing of the text you might use: Sensium

Advanced: Identify Wikipedia concepts within the written text. For example, if the text contains the word Graz it should be linked to the corresponding Wikipedia page.

VI. Time Series Analytics Topics

Timeseries Prediction

Task: Given a set of sensor data for multiple streams, e.g. temperature, power consumption. The goal is to predict the future values of these signals, optimally including a confidence range.

Approach: Take the stream of data and build a prediction algorithm, that is able to predict the future values of the streams a accurately as possible.

Suggested data-sets:

Intel Berkeley Research Lab
Take for example the data from the sensors of the Intel Berkeley Research Labs, see Stream Data Mining Repository. The data is in a format used by many machine learning frameworks, e.g. Weka.
Powersupply Stream
Use the power supply stream from the same data source: Stream Data Mining Repository. Here the challenge is to integrate seasonality into the analysis.
UCI Repository
Repository of multiple data-sets, including timeseries.

Advanced: Try to detect events (e.g. meetings) within the data. This is a hard task, as there is no ground truth to evaluate against, thus it is part of the project to with strategies on how to measure the quality of the algorithms.

Timeseries Classification

Task: Classification of timeseries data.

Approach: Pick a dataset from the linked repository of time series datasets and try to reproduce (or surpass) the posted performance values.

Suggested data-sets:

Timeseries Repository
Web-site of timeseries dataset together with the performance of a number of different algorithms: Welcome to the UEA & UCR Time Series Classification Repository.

Advanced: Analyse how the performance (i.e., classification accuracy) drops, the fewer data is used (fewer parts of the timeseries).

Pattern Mining in Timeseries

Task: Given a set of sensor data for multiple streams, e.g. temperature, power consumption. Apply sequential pattern mining on a time series data-set (optionally apply SAX beforehand).

Suggested data-sets:

UCI Repository
Repository of multiple data-sets, including timeseries.

Advanced: Pre-process the data via piecewise linear approximation.

KDDM2

Knowledge Discovery and Data Mining 2

VU (706.715)

Instructors

About the Course Content, theoretical, and practical goals.

Content

Theoretical Goals

Practical Goals

Calender VU (707.004)

Lectures

Work Plan

Materials

Templates

Poster

Report

Web Resources

Examination

Exercise - Projects

Overview

Poster Presentations

Questions being covered by the poster

Topics

I. Participate in a challenge

TUG Data Team

II. Text Mining Topics

Document Representation

Sarcasm and Irony Detection for German

E-Mail Parsing

III. Data Science Topics

Outlier Detection

Sensor Analytics (Studentlab)

Detect Causality

Dataset Collection

IV. Machine Learning Topics

Machine Learning

V. Information Retrieval

Query Completion

Blog Search

VI. Time Series Analytics Topics

Timeseries Prediction

Timeseries Classification

Pattern Mining in Timeseries