K236: Basis of Data Science Lecture 2. Data and Databases Lecturer: Tu Bao Ho and Hieu Chi Dam TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai



Schedule of K236

1. Introduction to data science データ科学入門 6/9

2. Introduction to data science データ科学入門 6/13

3. Data and databases データとデータベース 6/16

4. Review of univariate statistics 単変量統計 6/20

5. Review of linear algebra 線形代数 6/23

6. Data mining software データマイニングソフトウェア 6/27

7. Data preprocessing データ前処理 6/30

8. Classification and prediction (1) 分類と予測 (1) 7/4

9. Knowledge evaluation 知識評価 7/7

10. Classification and prediction (2) 分類と予測 (2) 7/11

11. Classification and prediction (3) 分類と予測 (3) 7/14

12. Mining association rules (1) 相関ルールの解析 7/18

13. Mining association rules (2) 相関ルールの解析 7/21

14. Cluster analysis クラスター解析 7/25

15. Review and Examination レビューと試験 (the date is not fixed) 7/27

Outline

1. Much more data around us than before
2. Data management
3. Data quality problems

This lecture aims to give you an idea of how data are collected, represented, and organized.

Data collection, representation, organization and inference

• How are data collected, represented, and organized?
  - Collection: a sample or all available data
  - Representation: vectors, sequences, lists, graphs, etc.
  - Organization: databases, warehouses, etc.

• Inference
  - Induction: given $x_i$, infer $f(x)$
  - Deduction: given $f(x)$ and $x_i$, deduce $f(x_i)$

[Figure: data (low level of abstraction) → knowledge (high level of abstraction), by generalization (inductive learning)]
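The contrast between induction and deduction on this slide can be sketched in a few lines of Python. This is only an illustration: the sample points, the assumption that f is a straight line, and the names `induce_line` and `deduce` are hypothetical choices, not part of the lecture.

```python
from statistics import mean


def induce_line(xs, ys):
    """Induction: infer f(x) = a*x + b from observed pairs by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def deduce(f_params, x):
    """Deduction: given f (as its parameters) and a point x, compute f(x)."""
    a, b = f_params
    return a * x + b


# Hypothetical observations, generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

f = induce_line(xs, ys)        # generalize from the data to a model
prediction = deduce(f, 10.0)   # apply the model to a new point
print(f, prediction)
```

Induction goes from particular observations to a general rule and can be wrong if the assumed form of f is wrong; deduction only applies a rule that is already given.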

Astronomical data 天文学的量のデータ

Astronomy is facing a major data avalanche: 天文学ではデータ崩壊の危機に瀕している

Multi-terabyte sky surveys and archives (soon: multi-petabyte), billions of detected sources, hundreds of measured attributes per source … 何テラバイトもの天空観測データ,何十億もの観測源,観測源ごとに何百もある属

Earthquake  data 地震データ

[Figures: earthquake maps. Labels: 1932-1996; 04/25/92 Cape Mendocino, CA; Japanese earthquakes 日本の地震, 1961-1994]

Explosion of biological data 爆発的な生物学データ

10,267,507,282 bases in 9,092,760 records.

• Genomics: 25,000 genes
• Proteomics: 2,000,000 proteins
• Metabolomics: 3,000 metabolites

What do biological data look like? 生物学データの形式など

A portion of a DNA sequence consisting of 1.6 million characters is given below (about 350 characters, 4570 times smaller than the whole): 160万文字からなるDNA配列の一部 (4570分の一)

…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTACCCTTAGAAAAATATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT…

Many other kinds of biological data


Text: huge sources of knowledge テキスト:知識の大きな源

• Approximately 80% of the world’s data is held in unstructured formats (source: Oracle Corporation) 世界中のデータの約80%が構造をもたないデータ(オラクルによる)

• Example: MEDLINE is a source of life sciences and biomedical information, with nearly eleven million records 例:生命科学・生物医学の情報源であるMEDLINEには約1100万件の論文情報がある
  - About 60,000 abstracts on hepatitis (うち肝炎については6万件)

36003: Biomed Pharmacother. 1999 Jun;53(5-6):255-63. Pathogenesis of autoimmune hepatitis. Institute of Liver Studies, King's College Hospital, London, United Kingdom.

Autoimmune hepatitis (AIH) is an idiopathic disorder affecting the hepatic parenchyma. There are no morphological features that are pathognomonic of the condition but the characteristic histological picture is that of an interface hepatitis without other changes that are more typical of other liver diseases. It is associated with hypergammaglobulinaemia, high titres of a wide range of circulating auto-antibodies, often a family history of other disorders that are thought to have an autoimmune basis, and a striking response to immunosuppressive therapy. The pathogenetic mechanisms are not yet fully understood but there is now considerable circumstantial evidence suggesting that: (a) there is an underlying genetic predisposition to the disease; (b) this may relate to several defects in immunological control of autoreactivity, with consequent loss of self-tolerance to liver auto-antigens; (c) it is likely that an initiating factor, such as a hepatotropic viral infection or an idiosyncratic reaction to a drug or other hepatotoxin, is required to induce the disease in susceptible individuals, …


Web  link  data        ウェブのリンクデータ

Outline

1. Much more data around us than before
2. Data management
   - Data models
   - Data types
   - Structures of data
   - Various kinds of databases
3. Data quality problems

Data models

• Model: Simplified description or abstraction of a reality.

• Data model: a description of data by a set of concepts covering
  - The structure of a database, which typically includes
    · elements (e.g., data types),
    · groups of elements (e.g., entity, record, table), and
    · relationships among such groups.
  - The operations for manipulating these structures, specifying database retrievals and updates:
    · basic model operations (e.g., insert, delete operations)
    · user-defined operations (e.g., compute_student_average_score)
  - Certain constraints (restrictions on valid data) that the database should obey.
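The three ingredients of a data model (structure, operations, constraints) can be sketched with Python's built-in `sqlite3` module. The `scores` table, its rows, and the course code `K228` are hypothetical; only the operation name `compute_student_average_score` echoes the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structure: typed elements (columns) grouped into a table.
# Constraint: the CHECK clause restricts valid scores to 0..100.
conn.execute(
    """
    CREATE TABLE scores (
        sid    TEXT    NOT NULL,
        course TEXT    NOT NULL,
        score  INTEGER NOT NULL CHECK (score BETWEEN 0 AND 100),
        PRIMARY KEY (sid, course)
    )
    """
)

# Basic model operations: insert and delete.
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("s1", "K236", 80), ("s1", "K228", 90), ("s2", "K236", 70)],
)
conn.execute("DELETE FROM scores WHERE sid = 's2'")


def compute_student_average_score(connection, sid):
    """User-defined operation built on top of the basic retrieval operation."""
    row = connection.execute(
        "SELECT AVG(score) FROM scores WHERE sid = ?", (sid,)
    ).fetchone()
    return row[0]


average = compute_student_average_score(conn, "s1")

# The constraint rejects data that violates it.
try:
    conn.execute("INSERT INTO scores VALUES ('s3', 'K236', 150)")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

print(average, constraint_enforced)
```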


Approaches to data models

• External model (views): describes how users see the data for a particular purpose
  - Course_info(cid: string, enrollment: integer)

• Conceptual model: defines the logical structure*
  - Students(sid: string, name: string, login: string, age: integer, gpa: real)
  - Courses(cid: string, cname: string, credits: integer)
  - Enrolled(sid: string, cid: string, grade: string)

• Internal (physical) model: describes how data is stored in the computer
  - Relations stored as unordered files.
  - Index on first column of Students.
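One way to see the three levels side by side is the sketch below, which uses Python's built-in `sqlite3` module. The schemas come from the slide; the sample rows, the index name `idx_students_sid`, and the way `enrollment` is computed (a count of enrolled students per course) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Conceptual level: the logical structure.
    CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
    CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
    CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

    -- Internal (physical) level: an index on the first column of Students.
    CREATE INDEX idx_students_sid ON Students (sid);

    -- External level: a view tailored to one purpose.
    CREATE VIEW Course_info AS
        SELECT c.cid AS cid, COUNT(e.sid) AS enrollment
        FROM Courses AS c
        LEFT JOIN Enrolled AS e ON e.cid = c.cid
        GROUP BY c.cid;
    """
)

# Hypothetical rows.
conn.executemany(
    "INSERT INTO Courses VALUES (?, ?, ?)",
    [("K236", "Basis of Data Science", 2), ("K228", "Another course", 2)],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [("s1", "K236", "A"), ("s2", "K236", "B"), ("s1", "K228", "A")],
)

course_info = conn.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid"
).fetchall()
print(course_info)
```

Users of `Course_info` never see the base tables or the index, so the physical storage can change without affecting the view.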

[Figure: three levels. External level: View 1, View 2. Conceptual level: conceptual model. Physical level: physical model.]

* A conceptual model is an underlying model that is capable of supporting any valid (and perhaps changing) external view that falls within its scope. https://en.wikipedia.org/wiki/Data_model#cite_note-MW99-3

Types of data models

• Flat model: a single, two-dimensional array of data elements.

• Hierarchical model: data is organized into a tree-like structure, implying a single upward link in each record to describe the nesting.

• Network model: built on two constructs: records, which contain fields, and sets, which define one-to-many relationships between records.

• Relational model: a database as a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values.

• Object-relational model: a relational database model in which objects, classes, and inheritance are directly supported in database schemas and in the query language.

• Star schema: the simplest style of data warehouse schema.

Data types

• SYMBOLIC
  - Indexing: e.g., names, tags, case numbers, or serial numbers that identify a respondent or group of respondents.
  - Binary: two values, e.g., YES or NO, SUCCESS or FAILURE, MALE or FEMALE, WHITE or NON-WHITE, FOR or AGAINST, and so on.
  - Boolean: two values, TRUE or FALSE; may also have the value UNKNOWN.
  - Nominal: character-string values (green, blue, red, …)
  - Ordinal: values for this character-string data type are linearly ordered (Small, Middle, Large, …)

• NUMERIC
  - Integer: values are just integer numbers.
  - Continuous: real numbers.

Why care about data types?

Symbols or numbers:
• Symbolic data: combinatorial search in hypothesis spaces (machine learning) 仮説空間における組合わせ探索
• Numerical data: often matrix-based computation (multivariate data analysis) 通常は行列ベースの計算(多変量データ解析)

Attribute   Type                                       Structure           Operations       Examples
Symbolic    Nominal or categorical (binary, Boolean)   No structure        =, ≠             Places, Color
Symbolic    Ordinal                                    Ordinal structure   =, ≠, ≥          Rank, Resemblance
Numerical   Measurable                                 Ring structure      =, ≠, ≥, +, ×    Integer: Age, Temperature; Continuous: Income, Length

Possible analysis operations (and thus methods and algorithms) depend on data types.

Advances: Data Transformation
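The dependence of analysis operations on data types can be made concrete with a small Python sketch. The classes `Color` and `Size` and all sample values are hypothetical; the point is only which operations are meaningful for each type.

```python
from enum import Enum
from statistics import mean, mode


class Color(Enum):
    """Nominal: only equality (=, ≠) is meaningful."""
    GREEN = "green"
    BLUE = "blue"
    RED = "red"


class Size(Enum):
    """Ordinal: the integer values encode the order Small < Middle < Large."""
    SMALL = 1
    MIDDLE = 2
    LARGE = 3


colors = [Color.RED, Color.BLUE, Color.RED]
sizes = [Size.SMALL, Size.LARGE, Size.MIDDLE]
ages = [21, 35, 28]  # numeric (integer): arithmetic is meaningful

# Nominal data: counting equal values is allowed, so the mode is defined.
most_common_color = mode(colors)

# Ordinal data: comparisons are allowed, so max and median are defined.
largest_size = max(sizes, key=lambda s: s.value)
middle_size = sorted(sizes, key=lambda s: s.value)[len(sizes) // 2]

# Numeric data: addition and division are allowed, so the mean is defined.
mean_age = mean(ages)

# Ordering nominal values is not defined; a plain Enum refuses to compare them.
try:
    Color.RED < Color.BLUE
    nominal_is_ordered = True
except TypeError:
    nominal_is_ordered = False

print(most_common_color, largest_size, middle_size, mean_age, nominal_is_ordered)
```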

Structures of data

• Structured data
  - Can be stored in an SQL database, in tables with rows and columns.
  - Only about 5-10% of all available data.

• Semi-structured data
  - Does not reside in a relational database but has some organizational properties that make it easier to analyze.
  - XML documents and the documents in NoSQL databases are semi-structured.

[Figure: articles in a LaTeX database]

• Unstructured data
  - Unstructured data represent around 80% of data. They often include text and multimedia content. Examples: e-mail messages, Word documents, videos, photos, audio files, web pages, and many other kinds of business documents.
  - A key issue in data science is how to represent unstructured data. Example: the DNA sequence “…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTC…” can be represented in different ways for computation, such as sliding windows, motifs, kernel functions, etc., or the web link representation.
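As one illustration of the sliding-window idea, the sketch below turns the DNA fragment quoted on the slide into a fixed-length vector of 3-character window counts. The window width of 3 and the function name `kmer_counts` are assumptions chosen for the example; the lecture does not prescribe them.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Slide a window of width k over the sequence and count each substring."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


fragment = "TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTC"
counts = kmer_counts(fragment, 3)

# Fix an order over all 4**3 possible windows so every sequence maps to a
# vector of the same length, which numerical methods can then work with.
alphabet = "ACGT"
all_kmers = [a + b + c for a in alphabet for b in alphabet for c in alphabet]
vector = [counts[kmer] for kmer in all_kmers]

print(len(vector), sum(vector))
```

A sequence of any length becomes a 64-dimensional count vector, at the cost of discarding where in the sequence each window occurred.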


Databases

• The most popular format for organizing data in a database is the rectangular table (also called a data array or data matrix) data arrayやdata matricesとも呼ばれる矩形テーブルはデータベースを構築する最も有名な方法です
  - Each row represents the values of all variables on a single multivariate observation. 各行は単一の多変量観測の全ての変数の要素を表し
  - Each column represents the values of a single variable for each observation. 各列は各観測のための単一の変数の値を表しています

• A typical database table having n multivariate observations taken on r variables will be represented by an (r × n)-matrix when each observation is stored as a column; the row-per-observation table described above is its transpose, with n rows and r columns. 典型的なデータベースはnの多変量観測とrの変数の(r x n) - マトリックスで表される
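The two layouts can be compared directly in Python. The observations below are hypothetical (age, income, credit-info) triples; plain tuples are used so that no external library is assumed.

```python
# Table layout: one row per observation, one column per variable.
observations = [
    (21, 27000.0, 1),
    (35, 41000.0, 2),
    (28, 33000.0, 1),
    (44, 52000.0, 3),
]

n = len(observations)      # number of multivariate observations (rows)
r = len(observations[0])   # number of variables (columns)

# Matrix layout with one column per observation: the transpose, r rows by n columns.
by_variable = list(zip(*observations))
ages = by_variable[0]      # all values of the first variable

print(n, r, len(by_variable), len(by_variable[0]), ages)
```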


Elements of database systems

• A database management system (DBMS) is a software system that manages data and provides controlled access to the database. データベースマネジメントシステム(DBMS)はデータを管理し,データベースへのアクセスを提供するソフト

• A database system (consisting of databases, a DBMS, and application programs) is typically used for managing large quantities of data and is regarded as two entities:
  - a server (or backend), which holds the DBMS, and
  - a set of clients (or frontend), each consisting of a hardware and a software component, including application programs.

データベースはサーバーとクライアントからなる大量のデータを管理するシステム

Big data landscape

[Figure: big data landscape. Labels: Structured (RDBMS), Unstructured (NoSQL DB), Commercial, Open source. Source: Cisco]

Structured query language (SQL)

• Users communicate with a DBMS through a declarative query language, typically SQL (Structured Query Language). ユーザーは通常SQLと呼ばれる宣言型クエリ言語を通じてRDBMSと通信を行う

• SQL has two main sublanguages: SQLは主な2つの準言語がある
  - a data definition language (DDL), used by the database administrator to define data structures by creating, altering, or destroying database objects. データ定義言語(DDL),管理者が使うデータ構造を定義する言語
  - a data manipulation language (DML), which allows users to retrieve, delete, and update existing data and to add new data to the database. データ操作言語(DML),ユーザーがデータベース上のデータを操作するための言語

• Examples
  - create table <table name> (<table elements>);
  - select <columns> from <table name> where <condition>;
  - select max(<column>) as max, min(<column>) as min from <table name> where <condition>;
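The statement templates on this slide can be tried out with Python's built-in `sqlite3` module, as in the sketch below. The `Students` schema is borrowed from the conceptual-model slide; the rows, the added `email` column, and the aliases `max_gpa` and `min_gpa` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: create a database object.
conn.execute(
    "create table Students (sid text, name text, login text, age integer, gpa real)"
)

# DML: add new data.
conn.executemany(
    "insert into Students values (?, ?, ?, ?, ?)",
    [
        ("s1", "Aiko", "aiko", 21, 3.6),
        ("s2", "Ben", "ben", 19, 3.1),
        ("s3", "Chen", "chen", 24, 3.9),
    ],
)

# DML: retrieve data with a condition, then with aggregate functions.
high_gpa = conn.execute(
    "select name from Students where gpa > 3.5 order by sid"
).fetchall()
gpa_range = conn.execute(
    "select max(gpa) as max_gpa, min(gpa) as min_gpa from Students where age >= 20"
).fetchone()

# DML: update and delete existing data.
conn.execute("update Students set gpa = 3.3 where sid = 's2'")
conn.execute("delete from Students where sid = 's3'")
remaining = conn.execute("select sid, gpa from Students order by sid").fetchall()

# DDL: alter, then destroy, the database object.
conn.execute("alter table Students add column email text")
conn.execute("drop table Students")

print(high_gpa, gpa_range, remaining)
```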


Flat  model:  labeled  data

[Figure: the eight objects H1-H4 and C1-C4]

教師つきデータ Supervised data (labeled)

ID   color   #nuclei   #tails   status
H1   light   1         1        healthy
H2   dark    1         1        healthy
H3   light   1         2        healthy
H4   light   2         1        healthy
C1   dark    1         2        cancerous
C2   dark    2         1        cancerous
C3   light   2         2        cancerous
C4   dark    2         2        cancerous

Descriptive attributes: Color: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2}

Class attribute: Status: {cancerous, healthy}

Flat  model:  unlabeled  data

[Figure: the eight objects H1-H4 and C1-C4]

教師無しデータ Unsupervised data (unlabeled)

ID   color   #nuclei   #tails   status
H1   light   1         1        healthy
H2   dark    1         1        healthy
H3   light   1         2        healthy
H4   light   2         1        healthy
C1   dark    1         2        cancerous
C2   dark    2         1        cancerous
C3   light   2         2        cancerous
C4   dark    2         2        cancerous

In the unlabeled setting the status column is not given to the learner; only the descriptive attributes are used.

Descriptive attributes: Color: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2}
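The difference between the labeled and unlabeled versions of this flat table can be sketched in Python. The rows are the eight records from the slides; the field names and the variable names `X` and `y` are conventions assumed for the example.

```python
from collections import Counter

fields = ("id", "color", "nuclei", "tails", "status")
rows = [
    ("H1", "light", 1, 1, "healthy"),
    ("H2", "dark", 1, 1, "healthy"),
    ("H3", "light", 1, 2, "healthy"),
    ("H4", "light", 2, 1, "healthy"),
    ("C1", "dark", 1, 2, "cancerous"),
    ("C2", "dark", 2, 1, "cancerous"),
    ("C3", "light", 2, 2, "cancerous"),
    ("C4", "dark", 2, 2, "cancerous"),
]

# Labeled (supervised) data: descriptive attributes plus the class attribute.
labeled = [dict(zip(fields, row)) for row in rows]
X = [(rec["color"], rec["nuclei"], rec["tails"]) for rec in labeled]
y = [rec["status"] for rec in labeled]

# Unlabeled (unsupervised) data: the class attribute is withheld.
unlabeled = [{k: v for k, v in rec.items() if k != "status"} for rec in labeled]

class_counts = Counter(y)
print(class_counts, unlabeled[0])
```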


Relational  databases

A relational database is a collection of tables, each of which is assigned a unique name, and consists of a set of attributes and a set of tuples.

customer
  Cust-ID | name | address | age | income | credit-info | …
  C1 | Smith, Sandy | 5463 E Hasting, Burnaby, BC V5A 459, Canada | 21 | $27000 | 1 | …
  … | … | … | … | … | … | …

item
  Item-ID | name | brand | category | type | price | place-made | supplier | cost
  I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NIkoX | $600.00
  I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00
  … | … | … | … | … | … | … | … | …

employee
  Emp-ID | name | category | group | salary | commission
  E35 | Jones, Jane | home entertainment | manager | $18,000 | 2%
  … | … | … | … | … | …

branch
  Branch-ID | name | address
  B1 | City square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
  … | … | …

purchases
  Trans-ID | cust-ID | empl-ID | date | time | method-paid | amount
  T100 | C1 | B55 | 01/21/98 | 15:45 | Visa | $1357.00
  … | … | … | … | … | … | …

item-sold
  Trans-ID | item-ID | qty
  T100 | I3 | 1
  T100 | I8 | 2
  … | … | …

works-at
  Empl-ID | branch-ID
  E55 | B1
  … | …
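What makes the collection of tables relational is that rows in different tables are linked through shared attribute values. The sketch below loads a reduced version of the slide's tables into Python's built-in `sqlite3` module and joins them; the column subset, the underscore-style names, and the foreign-key declarations are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer  (cust_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE item      (item_id TEXT PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE purchases (trans_id TEXT PRIMARY KEY,
                            cust_id  TEXT REFERENCES customer (cust_id));
    CREATE TABLE item_sold (trans_id TEXT REFERENCES purchases (trans_id),
                            item_id  TEXT REFERENCES item (item_id),
                            qty      INTEGER);

    INSERT INTO customer  VALUES ('C1', 'Smith, Sandy');
    INSERT INTO item      VALUES ('I3', 'high-res-TV', 988.00);
    INSERT INTO item      VALUES ('I8', 'multidisc-CDplayer', 369.00);
    INSERT INTO purchases VALUES ('T100', 'C1');
    INSERT INTO item_sold VALUES ('T100', 'I3', 1);
    INSERT INTO item_sold VALUES ('T100', 'I8', 2);
    """
)

# Which customer bought which items, and how many, in transaction T100?
rows = conn.execute(
    """
    SELECT c.name, i.name, s.qty
    FROM purchases AS p
    JOIN customer  AS c ON c.cust_id  = p.cust_id
    JOIN item_sold AS s ON s.trans_id = p.trans_id
    JOIN item      AS i ON i.item_id  = s.item_id
    WHERE p.trans_id = 'T100'
    ORDER BY i.item_id
    """
).fetchall()
print(rows)
```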


Data warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. データウエアハウスは複数のリソースから統合されたスキーマに収集された情報のレポジトリです.通常単一のサイトにあります

[Figure: data sources in Chicago, New York, Vancouver, and Toronto → clean, transform, integrate, load → data warehouse → query and analysis tool → clients]
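The clean, transform, integrate, and load steps can be sketched as follows. Everything in this example is hypothetical: the two source formats, the field names, and the exchange rate `CAD_TO_USD` are invented to show why a unified schema is needed, and a Python list stands in for the warehouse.

```python
# Two hypothetical sources that record the same kind of fact differently.
chicago = [
    {"item": "TV ", "price_usd": "988.00"},
    {"item": "cd player", "price_usd": "369.00"},
]
toronto = [
    {"product": "TV", "price_cad": 1300.0},
]

CAD_TO_USD = 0.75  # hypothetical exchange rate, for illustration only


def clean(name):
    """Clean: remove stray whitespace and normalize letter case."""
    return name.strip().lower()


def from_chicago(rec):
    """Transform a Chicago record into the unified schema."""
    return {"site": "Chicago", "item": clean(rec["item"]),
            "price_usd": float(rec["price_usd"])}


def from_toronto(rec):
    """Transform a Toronto record into the unified schema (prices in USD)."""
    return {"site": "Toronto", "item": clean(rec["product"]),
            "price_usd": round(rec["price_cad"] * CAD_TO_USD, 2)}


# Integrate and load: all records now share one schema at a single site.
warehouse = [from_chicago(r) for r in chicago] + [from_toronto(r) for r in toronto]

# Query and analysis: compare the price of the same item across sites.
tv_prices = [rec["price_usd"] for rec in warehouse if rec["item"] == "tv"]
print(tv_prices)
```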


Transactional databases

• A transactional database consists of a file where each record represents a transaction.

• A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

Trans_ID   list of item_ID
T100       beer, cake, onigiri
T200       beer, cake
T300       beer, onigiri
T400       beer, onigiri
T500       cake
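A typical first computation on transactional data is counting how often items, and pairs of items, occur together. The sketch below does this for the five transactions on the slide; the variable names are illustrative, and support is taken here to mean the fraction of transactions containing an item.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": ["beer", "cake", "onigiri"],
    "T200": ["beer", "cake"],
    "T300": ["beer", "onigiri"],
    "T400": ["beer", "onigiri"],
    "T500": ["cake"],
}

# Number of transactions containing each item.
item_counts = Counter(
    item for items in transactions.values() for item in set(items)
)

# Number of transactions containing each pair of items.
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(set(items)), 2)
)

# Support: fraction of all transactions that contain the item.
support = {item: count / len(transactions) for item, count in item_counts.items()}

print(item_counts, pair_counts, support)
```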


Advanced database systems

• Object-Oriented Databases
• Object-Relational Databases
• Spatial Databases
• Temporal Databases and Time-Series Databases
• Text Databases and Multimedia Databases
• Heterogeneous Databases and Legacy Databases
• The World Wide Web

Spatial databases

• Spatial databases contain spatial-related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.

• Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas located at various altitudes, etc.

[Figure: Japanese earthquakes, 1961-1994]

Temporal and time-series databases

• Both store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).

• Data analytics finds the characteristics of object evolution and trends of change for objects: e.g., stock exchange data can be mined to uncover trends in investment strategies.
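A very simple way to expose a trend in a time series is a moving average, sketched below. The closing prices are hypothetical, and the window width of 3 is an arbitrary choice; real trend analysis would use more careful methods.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily closing prices.
closing_prices = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0]

smoothed = moving_average(closing_prices, 3)

# The raw series dips on day 3, but the smoothed series rises throughout.
rising = all(later > earlier for earlier, later in zip(smoothed, smoothed[1:]))
print(smoothed, rising)
```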


Text and multimedia databases

• Text databases contain documents, usually highly unstructured or semi-structured. They are mined to uncover general descriptions of object classes, keywords, content associations, clustering behavior of text objects, etc.

• Multimedia databases store image, audio, and video data, supporting picture content-based retrieval, voice-mail systems, video-on-demand systems, speech-based user interfaces, etc.

The World Wide Web

The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.

Example: when the data collected from Internet Mart were examined, heavily trodden paths gave BT hints about the regions of the site that were of key interest to its visitors.

Outline

1. Much more data around us than before
2. Data management
3. Data quality problems

Noise, inconsistencies, outliers

Common properties of large real-world databases: 現実上の巨大のデータベースに共通する特徴

• Incomplete: lacking attribute values or certain attributes of interest 不完全:データの欠落.属性が適切な物かどうか

• Noisy: containing errors or outliers ノイズ:エラーや異常値

• Inconsistent: containing discrepancies in codes or names 矛盾:コードや名前の不一致

No quality data, no quality data mining results! 質の良くないデータからは価値のある結果が得られない!
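The three kinds of quality problem can each be detected with a simple check, as in the sketch below. The records, the plausible age range of 0 to 120, and the set `KNOWN_COUNTRIES` are all hypothetical; real checks depend on the domain.

```python
records = [
    {"id": "C1", "age": 21, "country": "Canada"},
    {"id": "C2", "age": None, "country": "Canada"},  # incomplete: missing value
    {"id": "C3", "age": 210, "country": "Canada"},   # noisy: implausible value
    {"id": "C4", "age": 35, "country": "CA"},        # inconsistent: different code
]

KNOWN_COUNTRIES = {"Canada", "Japan"}  # hypothetical canonical names


def quality_problems(record):
    """Return the list of quality problems found in one record."""
    problems = []
    if any(value is None for value in record.values()):
        problems.append("incomplete")
    age = record["age"]
    if age is not None and not 0 <= age <= 120:
        problems.append("noisy")
    if record["country"] not in KNOWN_COUNTRIES:
        problems.append("inconsistent")
    return problems


report = {rec["id"]: quality_problems(rec) for rec in records}
print(report)
```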


KDD nuggets

www.kdnuggets.com is a website of the data mining community.

Homework for K236-L2

• Carefully study the slides. You can consult the book chapter “Data and Databases” provided on the website. Raise questions about anything you have not yet clearly understood.

• Choose 4 datasets from www.statsci.org/datasets.html and summarize each of them (the area where the data were collected, data type, number of features and objects, etc.). The datasets you select must relate to different kinds of data (categorical, ordinal, integer, real number, etc.) and different data representations (vector, sequence, list, graph, etc.).

• The report for this homework must be submitted no later than one week after the class (June 23).