The goal of this course is to continue and build upon the theory covered in Knowledge Discovery and Data Mining 1. In KDDM2, however, the emphasis lies on practical aspects, and the practical exercise is therefore an integral part of this course. In addition, a number of algorithmic approaches related to the project topics will be covered in detail. Participants may choose one of a number of proposed projects, which cover different stages of the Knowledge Discovery process and different data sets.
Knowledge Discovery and Data Mining are closely related to the concepts of Big Data, Data Science and Data Analytics. Data science encompasses a number of diverse skills, ranging from software engineering to data mining and statistics, and follows a scientific, evaluation-driven approach. This course should help to develop some of these skills, with a focus on the areas of Natural Language Processing, Machine Learning and Information Retrieval. In addition, strong skills in analysing and preprocessing big data sets are a necessary prerequisite for many Knowledge Discovery applications.
The slides and resources from the previous years are available here: 2018, 2017, 2016, 2015, 2014
Course topics include:
In this course the students will learn about:
At the end of this course the students will know how to:
The lectures take place in the HS i8 (Inffeldgasse 13, ground floor), on Thursday, 12:15 - 13:45. For exceptions, please see below!
Created with Ganttproject: kddm2-project-plan-2019.gan
All students are required to register for the VU in TUGOnline by 07.03.2019, 23:59.
No written (or oral) examination is planned, as the focus lies on the practical exercise. Therefore, the grading will depend entirely on how the exercise is conducted and on its results.
There are several practical projects from different phases of the Knowledge Discovery process to choose from.
Projects are normally carried out by individual students (groups of one). It is also possible to form groups of two people, in which case the project scope is expanded accordingly (see Advanced). The students are expected to present the progress of their work as a poster presentation (or alternatively, as a written report).
For all projects the evaluation is considered to be an integral part. One needs to be able to state how well a solution works and what its expected limitations are.
Data sets are proposed for each of the projects, but students are free to come up with data sets of their own or to make their own project proposals.
Grading is conducted on these criteria:
Task: Take part in a scientific challenge (shared task), or a Kaggle challenge of your choice. More details about the data team and how to get in touch can be found on the Data-Team Homepage.
Suggested challenges:
Task: Build and evaluate different approaches to combine multiple word vectors into a single document vector.
Data-Set: Stanford Sentiment Treebank Dataset
Advanced: Develop and evaluate your own document representation method.
Keywords: Word2Vec, Doc2Vec
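As a starting point, two simple baselines for combining word vectors can be sketched as follows: an unweighted mean and an IDF-weighted mean. This is a minimal sketch and not a prescribed method; the toy vectors in WORD_VECTORS and the function names are assumptions made for illustration. In the project, the vectors would come from a trained model such as Word2Vec.

```python
import math
from collections import Counter

# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
WORD_VECTORS = {
    "good": [1.0, 0.0, 0.0],
    "movie": [0.0, 1.0, 0.0],
    "bad": [-1.0, 0.0, 0.0],
    "the": [0.0, 0.0, 1.0],
}


def mean_vector(tokens, vectors):
    """Unweighted average of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token has a vector
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def idf_weights(corpus):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def weighted_vector(tokens, vectors, idf):
    """IDF-weighted average: rare words contribute more than frequent ones."""
    known = [(vectors[t], idf.get(t, 1.0)) for t in tokens if t in vectors]
    if not known:
        return None
    total = sum(w for _, w in known)
    dim = len(known[0][0])
    return [sum(v[i] * w for v, w in known) / total for i in range(dim)]


corpus = [["the", "good", "movie"], ["the", "bad", "movie"]]
idf = idf_weights(corpus)
plain = mean_vector(["the", "good"], WORD_VECTORS)
weighted = weighted_vector(["the", "good"], WORD_VECTORS, idf)
```

In the weighted variant the frequent word "the" contributes less than the rarer word "good", which is the main difference a comparison of the two baselines would measure.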
Papers:
Task: Build a data-set for irony and sarcasm based on a provided web-platform and then design an algorithm to distinguish between "normal" text and irony/sarcasm.
Data-Set: Collect the data yourself via the platform. You will need to "hire" a number of volunteers to provide a sufficient amount of data.
Papers:
Task: Parse a semi-structured data set with the goal of adding more structure to the data. The goal is to separate correct sentences from other textual fragments, e.g. ASCII tables, greetings, etc.
Use-Case: Pre-processing of textual data for further processing, e.g. for prediction of potential receivers.
Data-Set: The data are e-mails, which are already semi-structured; the header contains information such as the sender.
Suggested data-sets:
Papers:
Advanced: Write a prediction algorithm that is able to recover the sender and the receivers once they have been removed from the mail.
Task: Given some observations (data), find the instances that do not conform to the remainder of the observations.
Approach: Select a data-set and setting. Research appropriate algorithmic approaches, implement selected outlier detection algorithms, apply them to the selected data-set and report their performance.
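One possible baseline for univariate data is the modified z-score, which uses the median and the median absolute deviation (MAD) instead of mean and standard deviation, so that the outliers themselves do not distort the estimate. The sketch below is one option among many; the function name and the sample readings are assumptions, and the threshold of 3.5 is a commonly used rule of thumb rather than a requirement.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    # 0.6745 rescales the MAD so that it is comparable to a standard deviation
    # for normally distributed data.
    return [
        i for i, x in enumerate(values)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]


# Hypothetical sensor readings with one obvious outlier at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
flagged = mad_outliers(readings)
```

A baseline like this also gives a reference point against which more elaborate detectors can be compared.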
Suggested data-sets:
Advanced: Implement your own outlier detection algorithm and compare its performance against the baseline.
Task: Collect sensor data and detect certain states (only for teams)
Option A: Biosensors (e.g. heartrate, ...), detect positions
Option B: Industrial sensors (e.g. temperature, ...), estimate the number of people within a room
Option C: Fluid & gas sensors (e.g. CO2, ...), detect certain liquids
Data-Set: Needs to be collected in the course of the project; alternatively, use the one from the "Databases 2" course.
Task: Identification of causal relationships (i.e., cause and effect) directly from the data. This is an advanced data science task with big potential.
Data-Set: Data from a past challenge (complete with prior research): Causality Challenge #1.
Papers:
Task: Collect datasets (e.g., open government data), analyse them and assess their suitability for a number of application scenarios.
Approach: Research datasets, assess their key characteristics, apply data science methods to assess their usefulness.
Advanced: Build a database (or similar) that allows the relevant key parameters of each dataset to be collected and updated.
Task: Automatic tagging of resources, using either an unsupervised or a supervised approach. The goal is to apply tags to an unseen resource.
Approach: For this project the approach may vary widely. A supervised approach requires a training dataset and may include classification algorithms. Unsupervised approaches may either look at a set of resources or a single resource at a time.
Suggested data-sets:
Advanced: Implement an unsupervised and a supervised approach, and then compare the two approaches. Measure their differences in accuracy as well as discuss their individual strengths and weaknesses.
Scenario: A user is starting to search by typing in some words...
Task: The system should automatically suggest word completions, depending on the already entered words.
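A very small baseline for this task is a bigram model: candidate completions are ranked by how often they followed the previous word in a training corpus, with a fallback to overall word frequency. The class name, method names and the toy corpus below are assumptions made for illustration; a real system would be trained on one of the suggested data-sets.

```python
from collections import Counter, defaultdict


class Completer:
    """Suggests word completions from unigram and bigram counts."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for sentence in sentences:
            tokens = sentence.lower().split()
            self.unigrams.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[prev][cur] += 1

    def suggest(self, text, k=3):
        """Return up to k completions for the last, partially typed word."""
        tokens = text.lower().split()
        if not text or text.endswith(" "):
            prefix, context = "", tokens  # the user is starting a new word
        else:
            prefix, context = tokens[-1], tokens[:-1]
        prev = context[-1] if context else None
        followers = self.bigrams.get(prev, Counter()) if prev else Counter()
        # First choice: words that followed the previous word in the corpus.
        ranked = [w for w, _ in followers.most_common() if w.startswith(prefix)]
        # Back off to overall frequency for the remaining candidates.
        for w, _ in self.unigrams.most_common():
            if w.startswith(prefix) and w not in ranked:
                ranked.append(w)
        return ranked[:k]


# Hypothetical toy corpus.
completer = Completer([
    "data mining is fun",
    "data mining finds patterns",
    "data science is broad",
    "machine learning is fun",
])
suggestions = completer.suggest("data m")
```

Here "mining" is ranked before "machine" because it was observed after "data", while "machine" only matches the typed prefix.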
Suggested data-sets:
Framework:
Advanced: Identify Wikipedia concepts within the written text. For example, if the text contains the word Graz it should be linked to the corresponding Wikipedia page.
Task: Provide a list of matching resources for a given piece of text. The goal is to produce a ranked list of items relevant to a context.
Use-Case: Consider a user writing a text, for instance a blog post. While typing, the user is presented with a list of suggested items which might be relevant or helpful.
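One common baseline for producing such a ranked list is TF-IDF weighting combined with cosine similarity. The sketch below uses only the standard library; the function names and the three toy documents are assumptions for illustration, and an existing search framework could be used instead.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; tokens unknown to the index are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def build_index(docs):
    """Compute IDF weights and one TF-IDF vector per document."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter(t for toks in tokenized.values() for t in set(toks))
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = {doc_id: tfidf(toks, idf) for doc_id, toks in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, idf, vectors, k=5):
    """Return up to k (doc_id, score) pairs, best match first."""
    q = tfidf(tokenize(query), idf)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in vectors.items()]
    scored = [(d, s) for d, s in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical resources to recommend.
docs = {
    "a": "graz is a city in austria",
    "b": "python is a programming language",
    "c": "vienna is the capital of austria",
}
idf, vectors = build_index(docs)
results = rank("city in austria", idf, vectors)
```

Documents without any overlap with the query receive a score of zero and are dropped from the list.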
Suggested data-sets: Same as previous task.
Framework:
Advanced: Identify Wikipedia concepts within the written text. For example, if the text contains the word Graz it should be linked to the corresponding Wikipedia page.
Task: Given a set of sensor data for multiple streams (e.g. temperature, power consumption), predict the future values of these signals, ideally including a confidence range.
Approach: Take the stream of data and build a prediction algorithm that is able to predict the future values of the streams as accurately as possible.
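A simple baseline that produces both a prediction and a confidence range is simple exponential smoothing, with the range derived from the past one-step-ahead errors. This is a minimal sketch under the assumption that the errors are roughly normally distributed; the function name, the smoothing factor and the sample values are illustrative choices, not part of the task description.

```python
import math


def ses_forecast(series, alpha=0.5, z=1.96):
    """One-step-ahead forecast with an approximate 95% range.

    Returns (forecast, lower, upper). The range is forecast +/- z * RMSE of the
    historical one-step-ahead errors.
    """
    if len(series) < 3:
        raise ValueError("need at least three observations")
    level = series[0]
    errors = []
    for x in series[1:]:
        errors.append(x - level)  # error of the forecast made before seeing x
        level = alpha * x + (1 - alpha) * level
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return level, level - z * rmse, level + z * rmse


# Hypothetical temperature readings.
temperatures = [21.0, 21.4, 21.1, 21.6, 21.3, 21.5]
forecast, lower, upper = ses_forecast(temperatures)
```

More accurate models can then be judged by whether they beat this baseline and whether their ranges actually cover the observed values at the stated rate.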
Suggested data-sets:
Advanced: Try to detect events (e.g. meetings) within the data. This is a hard task, as there is no ground truth to evaluate against; it is therefore part of the project to come up with strategies for measuring the quality of the algorithms.
Task: Classification of time series data.
Approach: Pick a dataset from the linked repository of time series datasets and try to reproduce (or surpass) the posted performance values.
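A common reference point for time series classification is a 1-nearest-neighbour classifier using dynamic time warping (DTW) as the distance. The sketch below is an unoptimised illustration without a warping window; the function names and the toy training set are assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]


def classify_1nn(query, train):
    """Return the label of the training series closest to the query."""
    return min(train, key=lambda item: dtw_distance(query, item[0]))[1]


# Hypothetical training set of (series, label) pairs.
train = [
    ([0, 0, 1, 2, 1, 0, 0], "peak"),
    ([0, 0, 0, 0, 0, 0, 0], "flat"),
]
label = classify_1nn([0, 1, 2, 1, 0, 0, 0], train)
```

The query contains the same peak shifted by one step; DTW aligns the two shapes, so the shifted series is still assigned to the "peak" class.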
Suggested data-sets:
Advanced: Analyse how the performance (i.e., classification accuracy) drops as less data is used (shorter parts of the time series).
Task: Given a set of sensor data for multiple streams (e.g. temperature, power consumption), apply sequential pattern mining on a time series data-set (optionally apply SAX beforehand).
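SAX (Symbolic Aggregate approXimation) turns a numeric series into a short string, which can then be fed to a sequential pattern mining algorithm. The usual steps are z-normalisation, piecewise aggregate approximation (PAA) and discretisation with breakpoints that split the standard normal distribution into equally probable regions. The sketch below follows these steps under simplifying assumptions: the series length must be divisible by the number of segments, and the function name and default alphabet are illustrative.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length n_segments."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")
    # 1. z-normalise so that the Gaussian breakpoints are meaningful.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalised = [0.0] * len(series)
    else:
        normalised = [(x - mu) / sigma for x in series]
    # 2. PAA: replace each equal-width segment by its mean.
    width = len(series) // n_segments
    paa = [mean(normalised[i:i + width]) for i in range(0, len(series), width)]
    # 3. Discretise using equiprobable regions of the standard normal.
    size = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / size) for i in range(1, size)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


# Hypothetical rising signal, reduced to four symbols.
word = sax([1, 1, 2, 2, 3, 3, 4, 4], n_segments=4)
```

The number of segments and the alphabet size control how much detail survives the conversion, so both are worth varying when evaluating the mined patterns.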
Suggested data-sets:
Advanced: Pre-process the data via piecewise linear approximation.