CS246 by John Cho 1 CS246: Web Information Systems Junghoo “John” Cho Spring 2014


CS246: Web Information Systems

Junghoo “John” Cho

Spring 2014


Course Information

Web page: http://oak.cs.ucla.edu/cs246/
Topic: Web information management
Time: MW 2:00-3:50 pm
Place: Boelter Hall 5422
Instructor: Junghoo “John” Cho
  Office: 3531H Boelter Hall
  Email: [email protected]
    Please use subject “CS246: …”
  Office hours: Mon 1-2 pm


Who is this class for?

Strong interest in research
Interest in Web information systems
Time commitment:
  Around 2-3 papers every week
    Typically one full day of paper reading
  One independent project
    Similar to paper writing
      In fact, we read papers from past student projects!
    Or an interesting application implementation


Today’s Topics

Overview of the course topics
Course logistics
  Paper reading assignments
  Class project


Prerequisite

Introductory database, e.g., CS143
  E.g.: query? SQL?
Basic algorithms and data structures
Basic probability and statistics
  P(A|C), Bayes rule, …
Design and implementation experience
  Basic C++
Quick test: Grab a sample paper
  See if you can read, understand and build it


Tell Us About You

Name
Department & Program
Before coming to UCLA
Brief history at UCLA
Technical/research interests
Expectations for the class


Information Galore

[Figure: information sources on the Web, labeled: legacy database, plain text files, biblio server]


Central Problem

How to manage/access information on the Web?

Three major approaches
  Central indexing
    E.g., Web search engine
  Dynamic integration
    E.g., comparison shopping services
  Data extraction
    E.g., spamming companies


Topic: Web Search (Central Indexing)

Central Index
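One common way to realize a central index is an inverted index, which maps each word to the pages that contain it. The sketch below is only an illustration under that assumption; the slides do not prescribe a data structure, and the function names and example.com URLs are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical toy collection standing in for crawled Web pages.
pages = {
    "http://example.com/a": "used cars for sale",
    "http://example.com/b": "apartments for rent",
    "http://example.com/c": "cars and apartments",
}
index = build_index(pages)
hits = search(index, "cars apartments")
print(hits)
```

The index is built once over the whole collection, so each query only needs to intersect a few posting sets instead of scanning every page.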


Topic: Web Search (Central Indexing)

Web: collection of passive HTML pages
Find Web pages relevant to a query
Traditional Information Retrieval:
  Web = collection of HTML pages
  HTML page = a bag of words
More than that?
  Links, structure of the Web
  User access patterns
  HTML tags (markups)
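The "bag of words" view can be made concrete with a short sketch. It assumes a deliberately crude tag stripper and a simple term-overlap score; both are illustrative choices, not the course's method, and the function names are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(html):
    """Strip tags and count word occurrences, ignoring order and markup."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag removal, enough for a sketch
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, bag):
    """Sum the page's counts of each query word (a toy relevance score)."""
    return sum(bag[word] for word in query.lower().split())


page = "<html><body><h1>Web Search</h1><p>Search the Web for pages.</p></body></html>"
bag = bag_of_words(page)
score = overlap_score("web search", bag)
print(bag, score)
```

Note what this representation throws away: the `<h1>` markup, any links, and all word order. That lost information is what the "More than that?" items point at.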


Topic: Dynamic Integration

Cars.com
Amazon.com
Apartments.com
401carfinder.com


Topic: Dynamic Integration

Mediator
  Wrapper → Source 1
  Wrapper → Source 2
  …
  Wrapper → Source n
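The mediator/wrapper layout in the diagram can be sketched as follows. This is a minimal illustration, assuming each wrapper translates its source's native records into one shared schema and the mediator merges the results at query time; the class names, record formats, and car listings are all hypothetical.

```python
class Wrapper:
    """Adapts one source's native records to a shared {title, price} schema."""

    def __init__(self, source_name, fetch, to_common):
        self.source_name = source_name
        self._fetch = fetch          # callable returning the source's native records
        self._to_common = to_common  # callable: native record -> common dict

    def search(self, keyword):
        results = []
        for record in self._fetch():
            item = self._to_common(record)
            if keyword.lower() in item["title"].lower():
                item["source"] = self.source_name
                results.append(item)
        return results


class Mediator:
    """Sends a query to every wrapper and merges the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def search(self, keyword):
        merged = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.search(keyword))
        return sorted(merged, key=lambda item: item["price"])


# Two hypothetical sources with different native formats.
source1 = Wrapper(
    "Source 1",
    lambda: [("Honda Civic", 8000), ("Ford Focus", 6500)],
    lambda r: {"title": r[0], "price": r[1]},
)
source2 = Wrapper(
    "Source 2",
    lambda: [{"name": "Honda Accord", "usd": "7200"}],
    lambda r: {"title": r["name"], "price": int(r["usd"])},
)
mediator = Mediator([source1, source2])
offers = mediator.search("honda")
print(offers)
```

Unlike central indexing, nothing is collected ahead of time here: the sources are contacted when the query arrives, which is why the approach is called dynamic integration.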


Topic: Data Extraction

WWW → Structured data:

Beatles  $10
Madonna  $20
NSync    $20

How can we extract “structured data” from free text automatically?
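As a toy illustration of the question, the sketch below pulls (name, price) records out of a flat text string with a hand-written regular expression. It assumes every record looks like "Name $price"; real extraction from free text is much harder, and this pattern is a hypothetical stand-in rather than a technique from the course.

```python
import re

# A name (letters and spaces) followed by a dollar amount, e.g. "Beatles $10".
RECORD = re.compile(r"([A-Za-z][A-Za-z ]*?)\s*\$(\d+)")


def extract_records(text):
    """Return (name, price) tuples for every 'Name $price' pattern in text."""
    return [(name.strip(), int(price)) for name, price in RECORD.findall(text)]


text = "Beatles $10 Madonna $20 NSync $20"
records = extract_records(text)
print(records)
```

A fixed pattern like this breaks as soon as the page layout changes, which is the motivation for extracting structure automatically.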


Main Course Workload

Paper reading
  Paper reading assignments
  Class discussion
  We mainly focus on “central indexing”
Independent projects


High-Level Goal

Learn core ideas and techniques
  Some of the techniques can be useful for other fields
Learn how to read papers
Hopefully learn what it is like to do research
  Sometimes very frustrating but often very rewarding


Paper Reading

Why:
  Something that you will do all the time as a researcher
  Learn to be critical and communicate well
  Acquire knowledge to conduct research/project
About 20 papers from
  Conferences: SIGMOD, VLDB, WWW, and …
Before the class:
  Everyone: read and review the paper
During the class:
  Instructor: present his own understanding and lead class discussion
  Everyone: participate!!!


How to Get Papers

From the class homepage
  http://oak.cs.ucla.edu/cs246/
Some of the materials are password protected
  User name: cs246
  Password: papers
Let me know if there is any problem


How to Read Papers

Understand the “Big Picture”
  What is the problem?
  Why is it important?
  Why is it difficult?
  What has this paper done?
  What have others done?


Paper Reviews (1)

Due by the preceding Sunday
  Submit through our Web submission interface on the class Web page
Required components: at most 3 paragraphs
  Summary (1 paragraph): your own words
    This paper discusses how to optimize queries with...
  Comments/criticisms (1-2 paragraphs): the good & the bad
    It addresses a real problem and the solution is interesting …
    But I feel the experiments are not realistic because...
Optional: questions, as many as you want
  Why do the authors assume that queries are independent?


Paper Reviews (2)

May skip 3 paper summaries without penalty
Most reviews will get full score unless they are written extremely poorly


Class Project

Why: Work on a specific problem and learn to find a solution
40% of the class grade
Team of up to 3
Topic: any problem related to the general problem
Open style
  Rigorous study of a research problem, or
  Any interesting system implementation


Class Project Schedule

Important Milestones
  Group formation: 4/09 (2nd week Wed)
  Project proposal: 4/20 (3rd week Sun)
  Project progress: 5/07 (6th week Wed)
  Final report: 5/21 (8th week Sun)
  Project presentation: 9th and 10th weeks
You are responsible for staying on track
  Make appointments with the instructor as needed


Project: Please Remember

Set your aims high, but be realistic
Expect to read at least 4-5 papers along the way
Start early
  Don’t do it right before the deadline
  There are always unexpected obstacles
  Some students could not finish in previous quarters
Please, please start early
You are responsible for staying on track


Grading

Midterm: 40%
Paper reviews: 20%
Project: 40%


Announcements

First review due Sunday 4/06
Three papers for class 3 and 4
  Graph structure in the Web
  The Anatomy of a Large-Scale Hypertextual …
  Authoritative sources in a hyperlinked environment