ICS 101 Spring 2016: Data Management
Assoc. Prof. Lipyeow Lim
Information & Computer Science Department
University of Hawaii at Manoa



Survey Question: What do you think data management is?

A.  How to file papers in a file cabinet.
B.  How to put data files in the right directories on your computer.
C.  How to organize, store, and find data using computers.
D.  How to ensure that your data cannot be accessed or altered by an unauthorized person.

Objectives

At the end of this lecture, the successful student should know:
•  What is data and where does it come from?
•  What is data management?
•  What is a database?
•  What is the relational data model?
•  How to search a database
•  What is a transaction?
•  How to search unstructured data

What is “data”?

•  Data are known facts that can be recorded and that have implicit meaning.
•  Three broad categories of data
   –  Structured data
   –  Semi-structured data
   –  Unstructured data
•  “Structure” of data refers to the organization within the data that is identifiable.

Where does data come from?

•  How much data do we have?

Units of Data

Multiplier (Decimal)   Notation   Name
1                      B          byte (= 8 bits)
1000                   kB         kilobyte
1000^2                 MB         megabyte
1000^3                 GB         gigabyte
1000^4                 TB         terabyte
1000^5                 PB         petabyte
1000^6                 EB         exabyte
1000^7                 ZB         zettabyte
1000^8                 YB         yottabyte

Examples annotated on the slide: a music file; a DVD-quality movie; all the data in a typical library.
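
The decimal multipliers in the table can be applied mechanically. The sketch below is illustrative only: the helper name `format_bytes` and the sample sizes are assumptions, not figures from the lecture.

```python
# Decimal (SI) unit notations from the table, smallest to largest.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def format_bytes(n_bytes: int) -> str:
    """Express a byte count using the largest decimal unit that fits."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the last unit if the number is larger than the table covers.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(format_bytes(5_000_000))                    # a hypothetical music file
print(format_bytes(2_500_000_000_000_000_000))    # a hypothetical very large total
```

Each step up the table divides by 1000, so the loop simply counts how many divisions are needed before the number becomes readable.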


How much data is there in the world?

•  According to IBM, 2.5 exabytes of data were generated every day in 2012.
•  75% of data is unstructured, coming from sources such as text, voice, and video.

Who owns the data?

Can we handle all that data?

•  How?
•  Structured data – Database technology
•  Unstructured data – Search engine technology

The Data Management Problem

•  Where is the photo I took last Christmas?
•  Where did I read about “Turing Machines”?
•  Where is the invoice for this computer?
•  Which product is the most profitable?

[Figure labels: User, Queries, Data, “?”]

What is a database?

•  A database: a collection of related data.
   –  Represents some aspect of the real world (aka universe of discourse).
   –  Logically coherent collection of data
   –  Designed and built for a specific purpose
•  A data model is a collection of concepts for describing/organizing the data.
•  A schema is a description of a particular collection of data, using a given data model.

The Relational Data Model

•  Relational database: a set of relations
•  A relation is made up of 2 parts:
   –  Instance: a table, with rows and columns. #Rows = cardinality, #fields = degree/arity.
   –  Schema: specifies name of relation, plus name and domain/type of each column or attribute.
      •  E.g., Students(sid: string, name: string, login: string, age: integer, gpa: real).
•  Can think of a relation as a set of rows or tuples (i.e., all rows are distinct).
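
The "set of tuples" view can be made concrete in a few lines. This is a minimal sketch: the `Student` class mirrors the Students schema above, while the rows themselves are made up for illustration.

```python
from typing import NamedTuple

class Student(NamedTuple):
    """One tuple (row) of the Students relation; fields follow the schema."""
    sid: str
    name: str
    login: str
    age: int
    gpa: float

# A relation instance is a *set* of tuples, so duplicate rows collapse.
# These rows are hypothetical.
students = {
    Student("10001", "Ada", "ada@cs", 20, 3.9),
    Student("10002", "Alan", "alan@math", 21, 3.7),
    Student("10001", "Ada", "ada@cs", 20, 3.9),  # exact duplicate of the first row
}

cardinality = len(students)      # number of distinct rows
degree = len(Student._fields)    # number of columns (attributes)
print(cardinality, degree)
```

Because the container is a set, inserting the same row twice leaves the cardinality unchanged, which is exactly the "all rows are distinct" property.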


Example Instance of Students Relation

sid     name    login        age   gpa
53666   Jones   jones@cs     18    3.4
53688   Smith   smith@eecs   18    3.2
53650   Smith   smith@math   19    3.8

•  Q1. What is the cardinality of the relation instance? (a) 1  (b) 2  (c) 3  (d) 4
•  Q2. What is the degree/arity of the relation instance? (a) 2  (b) 3  (c) 4  (d) 5

Why is the relational model useful?

•  Supports simple and powerful query capabilities!
•  Structured Query Language (SQL)

SELECT S.name
FROM Students S
WHERE S.gpa > 3.5

sid     name    login        age   gpa
53666   Jones   jones@cs     18    3.4
53688   Smith   smith@eecs   18    3.2
53650   Smith   smith@math   19    3.8
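
To try the query, any relational DBMS will do. The sketch below assumes SQLite through Python's built-in `sqlite3` module (the lecture does not name a particular system), loads the three rows from the table, and selects the `name` column as it appears in the table header.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
# The three rows of the example instance.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@eecs", 18, 3.2),
        ("53650", "Smith", "smith@math", 19, 3.8),
    ],
)

# Names of students whose GPA is above 3.5.
rows = conn.execute("SELECT S.name FROM Students S WHERE S.gpa > 3.5").fetchall()
print(rows)
conn.close()
```

Only the row with gpa 3.8 satisfies the WHERE clause, so a single name comes back.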


What is a DBMS?

•  A database management system (DBMS) is a collection of programs that enables users to
   –  Create new DBs and specify the structure using a data definition language (DDL)
   –  Query data using a query language or data manipulation language (DML)
   –  Store very large amounts of data
   –  Support durability in the face of failures, errors, misuse
   –  Control concurrent access to data from many users

Types of Databases

•  On-line Transaction Processing (OLTP)
   –  Banking
   –  Airline reservations
   –  Corporate records
•  On-line Analytical Processing (OLAP)
   –  Data warehouses, data marts
   –  Business intelligence (BI)
•  Specialized databases
   –  Multimedia
   –  XML
   –  Geographical Information Systems (GIS)
   –  Real-time databases (telecom industry)
•  Special Applications
   –  Customer Relationship Management (CRM)
   –  Enterprise Resource Planning (ERP)
•  Hosted DB Services
   –  Amazon, Salesforce

A Bit of History

•  1970: Edgar F. Codd (aka “Ted”) invented the relational model in the seminal paper “A Relational Model of Data for Large Shared Data Banks”
   –  Main concept: relation = a table with rows and columns.
   –  Every relation has a schema, which describes the columns.
•  Prior to 1970, there was no standard data model.
   –  Network model used by CODASYL
   –  Hierarchical model used by IMS
•  After 1970, IBM built System R as a proof-of-concept for the relational model and used SQL as the query language. SQL eventually became a standard.

Transactions

•  A transaction is the DBMS’s abstract view of a user program: a sequence of reads and writes.
   –  E.g., User 1 views available seats and reserves seat 22A.
•  A DBMS supports multiple users, i.e., multiple transactions may be running concurrently.
   –  E.g., User 2 views available seats and reserves seat 22A.
   –  E.g., User 3 views available seats and reserves seat 23D.
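
The seat example can be sketched as three small transactions. Everything here is an assumption made for illustration: SQLite via Python's `sqlite3`, a hypothetical `Seats` table, and a `reserve` helper. The three reservations run one after another rather than truly concurrently, which is enough to show why only one of two users can end up with seat 22A.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Seats (seat TEXT PRIMARY KEY, passenger TEXT)")
conn.executemany("INSERT INTO Seats VALUES (?, NULL)", [("22A",), ("23D",)])
conn.commit()

def reserve(conn: sqlite3.Connection, seat: str, user: str) -> bool:
    """Run one reservation as a transaction; return True if the seat was won."""
    with conn:  # commits on success, rolls back if an exception is raised
        # The check ("is the seat free?") and the write ("claim it") happen in
        # one statement, so an already-taken seat is left untouched.
        cur = conn.execute(
            "UPDATE Seats SET passenger = ? WHERE seat = ? AND passenger IS NULL",
            (user, seat),
        )
        return cur.rowcount == 1

results = [
    reserve(conn, "22A", "User 1"),
    reserve(conn, "22A", "User 2"),  # same seat as User 1
    reserve(conn, "23D", "User 3"),
]
print(results)
```

User 2's update matches no rows because the seat is no longer free, so the helper reports failure instead of silently overwriting User 1.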


ACID Properties of Transactions

•  Atomicity: all-or-nothing execution of transactions
•  Consistency: constraints on data elements are preserved
•  Isolation: each transaction executes as if no other transaction is executing concurrently
•  Durability: the effect of an executed transaction must never be lost
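
Atomicity and consistency are easiest to see when a transaction fails halfway. The sketch below assumes SQLite via Python's `sqlite3`; the `Accounts` table, the `transfer` helper, and the balances are hypothetical. A CHECK constraint forbids negative balances, and a transfer that would violate it is undone as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Accounts (owner TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE owner = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE owner = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK constraint, so the earlier credit
        # was rolled back along with it.
        return False

ok = transfer(conn, "alice", "bob", 500)  # more than alice has
balances = dict(conn.execute("SELECT owner, balance FROM Accounts"))
print(ok, balances)
```

The credit to bob succeeds on its own, yet it does not survive: the failed debit aborts the whole transaction and both balances stay as they were.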


Q3. Why use a DBMS?

a)  The data is too large to manage in Excel files
b)  I do not want to write my own programs to find something in the data
c)  I do not want to write my own program to manage multiple users and transactions
d)  All of the above.

The Data Management Problem

•  Where is the photo I took last Christmas?
•  Where did I read about “Turing Machines”?
•  Where is the invoice for this computer?
•  Which product is the most profitable?

[Figure labels: User, Queries, Data, “?”]

Unstructured Data

•  What are some examples of unstructured data?
•  How do we model unstructured data?
•  How do we query unstructured data?
•  How do we process queries on unstructured data?
•  How do we index unstructured data?

Unstructured Text Data

•  Field of “Information Retrieval”
•  Data Model
   –  Collection of documents
   –  Each document is a bag of words (aka terms)
•  Query Model
   –  Keyword + Boolean combinations
   –  E.g., DBMS and SQL and tutorial
•  Details:
   –  Not all words are equal. “Stop words” (e.g., “the”, “a”, “his” ...) are ignored.
   –  Stemming: convert words to their basic form. E.g., “surfing” and “surfed” become “surf”.
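
The bag-of-words model with stop words and stemming can be sketched in a few lines. The stop list and the suffix-stripping `stem` function below are deliberately crude stand-ins chosen for illustration; real systems use much larger stop lists and proper stemmers such as the Porter stemmer.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "his"}   # tiny illustrative stop list
SUFFIXES = ("ing", "ed", "s")      # crude stand-in for a real stemmer

def stem(word: str) -> str:
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text: str) -> Counter:
    """Lowercase, tokenize, drop stop words, stem, and count the terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)

bag = bag_of_words("The surfer surfed his board; surfing is a sport.")
print(bag)
```

Word order is discarded: the document is reduced to term counts, with “surfed” and “surfing” both counted under “surf”.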


Inverted Indexes

•  Recall: an index is a mapping of search key to data entries
   –  What is the search key?
   –  What is the data entry?
•  Inverted Index:
   –  For each term, store a list of postings
   –  A posting consists of <docid, position> pairs

Lexicon    Posting lists
DBMS       doc01: 10 18 20    doc02: 5 38     doc03: 13
SQL        doc06: 1 12        doc09: 4 9      doc20: 12
trigger    doc01: 12 15       doc09: 14 21    doc10: 11 25 55
...        ...

What is the data in an inverted index sorted on?
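
One way to build such an index is sketched below. The function name, the three-document collection, and the choice of 1-based word offsets as positions are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each term to {doc_id: [positions]}; positions are 1-based word offsets."""
    index: dict[str, dict[str, list[int]]] = defaultdict(dict)
    for doc_id in sorted(docs):  # visit documents in doc-ID order
        words = docs[doc_id].lower().split()
        for pos, term in enumerate(words, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

# Hypothetical three-document collection.
docs = {
    "doc01": "SQL tutorial for DBMS users",
    "doc02": "DBMS trigger tutorial",
    "doc03": "SQL trigger and SQL view",
}
index = build_inverted_index(docs)
print(index["sql"])
```

The dictionary keys play the role of the lexicon, and each value is the posting list for that term, grouped by document.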


Lookups using Inverted Indexes

•  Given a single keyword query “k” (e.g., SQL)
   –  Find k in the lexicon
   –  Retrieve the posting list for k
   –  Scan the posting list for document IDs [and positions]
•  What if the query is “k1 and k2”?
   –  Retrieve document IDs for k1 and k2
   –  Perform intersection

Lexicon    Posting lists
DBMS       doc01: 10 18 20    doc02: 5 38     doc03: 13
SQL        doc06: 1 12        doc09: 4 9      doc20: 12
trigger    doc01: 12 15       doc09: 14 21    doc10: 11 25 55
...        ...
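
The “k1 and k2” case can be sketched as a merge-style intersection of two sorted doc-ID lists. The function names are hypothetical; the doc IDs follow the figure, with the DBMS list read as doc01, doc02, doc03 and positions omitted.

```python
def intersect(p1: list[str], p2: list[str]) -> list[str]:
    """Intersect two doc-ID lists that are each sorted in ascending order."""
    i, j = 0, 0
    out: list[str] = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])  # doc contains both keywords
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1             # advance whichever list is behind
        else:
            j += 1
    return out

# Doc IDs per term (positions left out for brevity).
postings = {
    "DBMS": ["doc01", "doc02", "doc03"],
    "SQL": ["doc06", "doc09", "doc20"],
    "trigger": ["doc01", "doc09", "doc10"],
}

def and_query(k1: str, k2: str) -> list[str]:
    """Answer 'k1 and k2' by intersecting the two posting lists."""
    return intersect(postings.get(k1, []), postings.get(k2, []))

print(and_query("DBMS", "trigger"))
print(and_query("SQL", "trigger"))
```

Because both lists are sorted by doc ID, the intersection needs only one pass over each list.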


Too Many Matching Documents

•  Rank the results by “relevance”!
•  Vector-Space Model
   –  Documents are vectors in high-dimensional space
   –  Each dimension in the vector represents a term
   –  Queries are represented as vectors similarly
   –  Vector similarity (dot product) between the query vector and a document vector gives the ranking criterion
   –  Weights can be used to tweak relevance
•  PageRank (later)

[Figure: documents plotted against two term axes, “Star” and “Diet”: a doc about astronomy, a doc about movie stars, a doc about behavior.]
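
A two-term version of the vector-space model, using the “star” and “diet” axes from the figure, might look like the sketch below. The document texts are invented, and raw term counts stand in for the weights; practical systems use more refined weighting schemes.

```python
# Term axes of the vector space; every document and query is a vector over these.
TERMS = ["star", "diet"]

def to_vector(text: str) -> list[int]:
    """Count how often each axis term occurs in the text (raw term frequency)."""
    words = text.lower().split()
    return [words.count(term) for term in TERMS]

def dot(u: list[int], v: list[int]) -> int:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical documents loosely matching the three in the figure.
docs = {
    "astronomy": "star star star galaxy",
    "movie stars": "star diet star diet",
    "behavior": "diet diet diet habit",
}

def rank(query: str) -> list[str]:
    """Return document names ordered by dot-product score, highest first."""
    q = to_vector(query)
    return sorted(docs, key=lambda name: dot(q, to_vector(docs[name])), reverse=True)

print(rank("star"))
print(rank("diet"))
```

A query for “star” ranks the astronomy document first, while a query for “diet” puts the behavior document on top; the movie-stars document sits in between in both cases because it mentions both terms.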


Q4. Which of the following is the most similar to an inverted index?

a)  Bookmarks.
b)  Contents page of a book.
c)  The index at the end of a book.
d)  A deck of playing cards.

Internet Search Engines

[Figure: search engine architecture. Components: World Wide Web, Web Crawler, Web Page Repository, Indexer, Inverted Index, Search Engine Web Server. Labeled flows: Keyword Query, Query, Postings etc., Doc IDs, Snippets, Ranked Results.]

Basic Web Search

•  http://www.googleguide.com/advanced_operators_reference.html

Query Expression           What it means
Biking italy               Biking AND italy
Recycle steel OR iron      Recycle AND (steel OR iron)
“I have a dream”           “I have a dream” treated as one term
Salsa -dance               Salsa AND NOT dance

Other nifty expressions    What it means
12 + 34 - 56 * 7 / 8       Evaluates the arithmetic expression
300 Euros in USD           Converts 300 euros to US currency

Ranking Web Pages

•  Google’s PageRank
   –  Links in web pages provide clues to how important a web page is.
•  Take a random walk
   –  Start at some web page p
   –  Randomly pick one of the links and go to that web page
   –  Repeat for all eternity
•  The number of times the walker visits a page is an indication of how important the page is.

[Figure: graph with six vertices numbered 1 to 6. Vertices represent web pages. Edges represent web links.]
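
The random walk can be simulated directly. The link graph below is hypothetical (the edges of the slide's six-page graph are not recoverable from the transcript), and "all eternity" is approximated by a fixed number of steps. Real PageRank also lets the walker occasionally jump to a random page, which this sketch leaves out.

```python
import random
from collections import Counter

# Hypothetical link graph: page -> pages it links to. Every page has at
# least one out-link, so the walker never gets stuck.
LINKS = {
    1: [2, 3],
    2: [3],
    3: [1, 4],
    4: [5, 6],
    5: [3],
    6: [3],
}

def random_walk(links: dict[int, list[int]], start: int, steps: int, seed: int = 0) -> Counter:
    """Follow random out-links for `steps` hops and count visits per page."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    visits: Counter = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])
    return visits

STEPS = 100_000
visits = random_walk(LINKS, start=1, steps=STEPS)
for page, count in visits.most_common():
    print(page, count / STEPS)  # fraction of time spent on each page
```

In this particular graph four of the six pages link to page 3, so the walker spends the largest share of its time there, which is the sense in which links signal importance.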


Semi-structured Search

Query Expression                        What it means
define:imbroglio                        Find definitions of “imbroglio”
Halloween site:www.census.gov           Restrict search for “halloween” to US census website
Form 1098-T IRS filetype:pdf            Find the US tax form 1098-T in PDF format
link:warriorlibrarian.com               Find pages that link to Warrior Librarian's website
Dan Shugar intext:Powerlight            Find pages mentioning Dan Shugar where his company, Powerlight, is included in the text of the page, i.e., less likely to be from the corporate website.
allintitle: Google Advanced Operators   Search for pages with titles containing "Google," "Advanced," and "Operators"

Web pages are not really unstructured! Click “view source” to view HTML.

Summary

•  Data Management Problem
   –  How do we pose and answer queries on data?
•  Structured data
   –  Relational Data Model
   –  SQL
   –  Relational DBMS
   –  Transactions
•  Unstructured data
   –  Bag of terms
   –  Boolean combination of keyword queries
   –  Inverted Indexes (Web Search Engines)
•  Semi-structured data
   –  Could use techniques from either structured or unstructured
   –  More sophisticated keyword queries

Survey Question: I learned a lot about data management from this lecture.

A.  Strongly Agree
B.  Agree
C.  Neutral
D.  Disagree
E.  Strongly Disagree