
Knowledge Discovery and Data Mining 1 (KU)

Simon Walk

IICM, TU Graz

October 22, 2015

Simon Walk (IICM) KDDM1 October 22, 2015 1 / 11

KDDM 1 (KU) - Introduction

Introduction

Simon Walk

Institute for Information Systems & Computer Media

Inffeldgasse 16c/I

Office: D.2.07

E-Mail: [email protected]

Research Interests:

Knowledge & Data Mining

Social Network Analysis

Semantic Web & Ontologies

Dynamical Systems & Complex Networks

Machine Learning


Course Context & Goals

Why should you be interested in KDDM1 (KU)?

To consolidate and reinforce the (theoretical) knowledge you obtained in KDDM1 (VO) with practical “hands-on” experience.

Helps a LOT for the final exam!

Good preparation for KDDM2!

“Feel” like a data scientist!

If interested: Continue with Master Project or Master’s Thesis


KDDM 1 (KU) - Organization

Course Organization

You have to

1. form small groups of up to two students.

2. choose one of two practical assignments.

3. work on your chosen assignment.

4. give two presentations (in English) on the progress and results of your assignment.

After forming a group, send one e-mail to [email protected] and include the names and student IDs (Matrikelnummern) of all group members. All e-mails have to include [KDDM1] in the subject!


KDDM 1 (KU) - Project Descriptions

Project 1 - Crawling, Cleaning and Clustering

Objective: Group (semantically) similar pages of a website according to their most relevant terms!

Write a web crawler to collect pages/documents that contain text.

Strip all markup and unwanted content (e.g., HTML, JavaScript) from the crawled pages.

Calculate similarities between the pages (i.e., by calculating similarities between the TF-IDF vectors of the pages).

Group similar pages (e.g., by using a clustering algorithm such as k-means).

Hint: Python, scikit-learn [1], SciPy [2] and NumPy [3] already provide you with most of the functionality required to solve this task!

[1] http://scikit-learn.org/stable/
[2] http://www.scipy.org
[3] http://www.numpy.org
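
The cleaning, similarity and clustering steps of Project 1 could be sketched roughly as follows. This is only an illustration under several assumptions, not the required solution: the helper names `TextExtractor`, `clean_html` and `cluster_pages` and the four tiny sample pages are made up for this example, cosine similarity is one possible similarity measure, and English stop-word removal is an optional choice.

```python
from html.parser import HTMLParser

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    """Return the text of a page with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def cluster_pages(pages, n_clusters):
    """Return the pairwise cosine similarities and one k-means label per page."""
    texts = [clean_html(page) for page in pages]
    # One TF-IDF vector per page (rows of a sparse matrix).
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    similarities = cosine_similarity(tfidf)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(tfidf)
    return similarities, labels


# Hypothetical sample pages; a real dataset would come from the crawler.
pages = [
    "<html><body><h1>Cats</h1><p>Cats purr and cats chase mice.</p></body></html>",
    "<html><body><p>Kittens are young cats that purr.</p>"
    "<script>var x = 1;</script></body></html>",
    "<html><body><p>Stock markets fell as investors sold shares.</p></body></html>",
    "<html><body><p>Investors bought shares when stock markets rose.</p></body></html>",
]
similarities, labels = cluster_pages(pages, n_clusters=2)
print(labels)
```

For a real crawl, the number of clusters is a parameter you have to choose and justify yourself; the value 2 above only fits the toy input.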


A word of warning: Be careful when crawling websites! Don’t hammer the servers, or you risk getting banned!

Either select smaller websites for crawling (complete crawl) or choose an appropriate sampling strategy for selecting the pages to analyze!

Rule of thumb: Your dataset should contain at least 1,000 pages!
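
One way to avoid hammering a server is to crawl a single host breadth-first, cap the number of pages, and pause between requests. The sketch below uses only the Python standard library; the names `LinkExtractor`, `fetch_url` and `crawl`, the one-second delay and the UTF-8 decoding are assumptions made for illustration. A real crawler should additionally respect the site's robots.txt (for example via `urllib.robotparser`), which is left out here.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_url(url):
    """Default fetcher: download a page and decode it as UTF-8 text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=1000, delay_seconds=1.0, fetch=fetch_url):
    """Breadth-first crawl of one host, pausing after every request."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            html = None  # download failed; skip this page
        if html is not None:
            pages[url] = html
            extractor = LinkExtractor()
            extractor.feed(html)
            for href in extractor.links:
                # Resolve relative links and drop "#fragment" parts.
                link, _fragment = urldefrag(urljoin(url, href))
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
        time.sleep(delay_seconds)  # throttle so the server is not hammered
    return pages
```

Passing a custom `fetch` function makes it possible to try the crawl logic on a small in-memory "site" before pointing it at a real server.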


Project 2 - Movie Recommender

Objective: Recommend similar movies to users, using matrix factorization!

Crawl or download a movie-ratings dataset [4].

Create/extract the required utility matrix and minimize noise (e.g., subtract averages).

Perform UV decomposition to obtain $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{d \times m}$ with $d = 2$ or $d = 3$.

Plot and interpret findings.
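
The utility-matrix and UV-decomposition steps could look roughly like the NumPy sketch below. It is a minimal illustration under stated assumptions, not the prescribed method: the function names, the toy ratings, the choice of subtracting per-user means, and plain gradient descent with a fixed learning rate are all assumptions of this example. It also assumes every user has at least one rating.

```python
import numpy as np


def build_utility_matrix(ratings, n_users, n_items):
    """Build an n_users x n_items matrix from (user, item, rating) triples.

    Missing ratings are marked with NaN.
    """
    matrix = np.full((n_users, n_items), np.nan)
    for user, item, rating in ratings:
        matrix[user, item] = rating
    return matrix


def normalize(matrix):
    """Subtract each user's mean rating from that user's known ratings."""
    user_means = np.nanmean(matrix, axis=1, keepdims=True)
    return matrix - user_means, user_means


def uv_decompose(matrix, d=2, steps=5000, learning_rate=0.01, seed=0):
    """Fit U (n x d) and V (d x m) to the known entries by gradient descent."""
    rng = np.random.default_rng(seed)
    known = ~np.isnan(matrix)
    target = np.where(known, matrix, 0.0)
    n, m = matrix.shape
    u = rng.normal(scale=0.1, size=(n, d))
    v = rng.normal(scale=0.1, size=(d, m))
    for _ in range(steps):
        # Error on known entries only; unknown entries contribute nothing.
        error = np.where(known, u @ v - target, 0.0)
        # Simultaneous gradient step on half the summed squared error.
        u, v = u - learning_rate * (error @ v.T), v - learning_rate * (u.T @ error)
    return u, v


def rmse(matrix, u, v):
    """Root-mean-square error of U V over the known entries of matrix."""
    known = ~np.isnan(matrix)
    return float(np.sqrt(np.mean((matrix[known] - (u @ v)[known]) ** 2)))


# Hypothetical toy ratings: (user index, movie index, rating).
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 2),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 2), (3, 2, 4), (3, 3, 5),
]
utility = build_utility_matrix(ratings, n_users=4, n_items=4)
centered, user_means = normalize(utility)
u, v = uv_decompose(centered, d=2)
print("RMSE on known ratings:", rmse(centered, u, v))
```

With $d = 2$, each row of `u` and each column of `v` is a point in the plane, so users and movies can be shown directly in a scatter plot.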

Hint: Python, scikit-learn, SciPy and NumPy already provide you with many of the functions and tools required to solve this task!

[4] We suggest using MovieLens 100k: http://grouplens.org/datasets/movielens/
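
If you use MovieLens 100k, the ratings come in a file named `u.data` with one tab-separated record per line: user id, item id, rating, timestamp. A small parser for that layout might look like this; the function name `parse_ratings` and the file path in the comment are assumptions, and the conversion to 0-based indices is merely a convenience for building a matrix later.

```python
import io


def parse_ratings(lines):
    """Yield (user, item, rating) triples with 0-based user and item indices."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        user_id, item_id, rating, _timestamp = line.split("\t")
        yield int(user_id) - 1, int(item_id) - 1, float(rating)


# Two sample lines in the u.data layout, used instead of the real file.
sample = io.StringIO("196\t242\t3\t881250949\n186\t302\t3\t891717742\n")
triples = list(parse_ratings(sample))
print(triples)

# With the real dataset (the path is an assumption about where you unpacked it):
# with open("ml-100k/u.data") as handle:
#     triples = list(parse_ratings(handle))
```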


Project Presentations

The presentations will take place after Partial Exams 2 and 3, on 03.12.2015 and 21.01.2016.

For 03.12.2015 prepare a 5-minute presentation (strict) with 3 slides:

First slide: Dataset

Second slide: Experimental setup

Third slide: Preliminary results

For 21.01.2016 prepare a 10-minute presentation (strict) with 5 slides:

First slide: Introduction/Motivation

Second slide: Methodology

Third slide: Experimental setup

Fourth slide: Results

Fifth slide: Discussion


Send the slides to [email protected] as a PDF by 02.12.2015 23:59 for presentation 1 and by 20.01.2016 23:59 for presentation 2.

The subject of the e-mail must include [KDDM1].

Note that presentations that run longer than the allotted 5 or 10 minutes will be interrupted and stopped!

Grading for the KU depends on your presentation and your results!


Questions?


Thanks!
