K236: Basis of Data Science
Lecture 2. Data and Databases
Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai
Schedule of K236
1. Introduction to data science 6/9
2. Introduction to data science 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and examination (the date is not fixed) 7/27
Outline
1. Much more data around us than before
2. Data management
3. Data quality problems
This lecture aims to give you an idea of how data are collected, represented, and organized.
Data collection, representation, organization and inference
[Figure: generalization (inductive learning) moves from a low level of abstraction (data) to a high level of abstraction (knowledge).]
• How are data collected, represented, and organized?
  - Collection: a sample or all available data
  - Representation: vectors, sequences, lists, graphs, etc.
  - Organization: databases, warehouses, etc.
• Inference
  - Induction: given $x_i$, infer $f(x)$ (vs. deduction: given $f(x)$ and $x_i$, deduce $f(x_i)$)
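The contrast between induction and deduction can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: the observations are made-up numbers, each x_i is assumed to come with a measured value, and the rule f is assumed to be a straight line fitted by least squares. The name `induce_line` is hypothetical.

```python
# Induction vs. deduction on a toy example (made-up data, assumed linear rule).

def induce_line(xs, ys):
    """Induction: estimate f(x) = a*x + b from observed pairs by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

# Observations x_i with their measured values (hypothetical numbers).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

f = induce_line(xs, ys)   # induction: from specific examples to a general rule
prediction = f(4.0)       # deduction: apply the known rule to a specific x_i
```

Induction generalizes beyond the data and can be wrong; deduction only applies a rule that is already given.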
Astronomical data
Astronomy is facing a major data avalanche:
Multi-terabyte sky surveys and archives (soon multi-petabyte), billions of detected sources, hundreds of measured attributes per source, …
Explosion of biological data
• 10,267,507,282 bases in 9,092,760 records
• Genomics: 25,000 genes
• Proteomics: 2,000,000 proteins
• Metabolomics: 3,000 metabolites
What do biological data look like?
A portion of a DNA sequence consisting of 1.6 million characters is given below (about 350 characters, roughly 4,570 times shorter):
…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTACCCTTAGAAAAATATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT…
Many other kinds of biological data
Text: huge sources of knowledge
• Approximately 80% of the world's data is held in unstructured formats (source: Oracle Corporation).
• Example: MEDLINE is a source of life sciences and biomedical information, with nearly eleven million records.
  - About 60,000 abstracts on hepatitis
36003: Biomed Pharmacother. 1999 Jun;53(5-6):255-63. Pathogenesis of autoimmune hepatitis. Institute of Liver Studies, King's College Hospital, London, United Kingdom.
Autoimmune hepatitis (AIH) is an idiopathic disorder affecting the hepatic parenchyma. There are no morphological features that are pathognomonic of the condition but the characteristic histological picture is that of an interface hepatitis without other changes that are more typical of other liver diseases. It is associated with hypergammaglobulinaemia, high titres of a wide range of circulating auto-antibodies, often a family history of other disorders that are thought to have an autoimmune basis, and a striking response to immunosuppressive therapy. The pathogenetic mechanisms are not yet fully understood but there is now considerable circumstantial evidence suggesting that: (a) there is an underlying genetic predisposition to the disease; (b) this may relate to several defects in immunological control of autoreactivity, with consequent loss of self-tolerance to liver auto-antigens; (c) it is likely that an initiating factor, such as a hepatotropic viral infection or an idiosyncratic reaction to a drug or other hepatotoxin, is required to induce the disease in susceptible individuals, …
Outline
1. Much more data around us than before
2. Data management
  - Data models
  - Data types
  - Structures of data
  - Various kinds of databases
3. Data quality problems
Data models
• Model: a simplified description or abstraction of reality.
• Data model: a description of data by a set of concepts covering
  - The structure of a database, which typically includes
    - elements (e.g., data types),
    - groups of elements (e.g., entity, record, table), and
    - relationships among such groups.
  - The operations for manipulating these structures, specifying database retrievals and updates:
    - basic model operations (e.g., insert, delete)
    - user-defined operations (e.g., compute_student_average_score)
  - Certain constraints (restrictions on valid data) that the database should obey.
Approaches to data models
• External model (views): describes how users see the data for a particular purpose
  - Course_info(cid: string, enrollment: integer)
• Conceptual model: defines the logical structure*
  - Students(sid: string, name: string, login: string, age: integer, gpa: real)
  - Courses(cid: string, cname: string, credits: integer)
  - Enrolled(sid: string, cid: string, grade: string)
• Internal (physical) model: describes how data are stored in the computer
  - Relations stored as unordered files.
  - Index on the first column of Students.
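The three levels can be tried out in a small in-memory database. The following is a minimal sketch using Python's built-in sqlite3 module, not a prescribed implementation: the sample rows, the index name `idx_students_sid`, and the definition of `enrollment` as a count of enrolled students are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Conceptual level: the logical schema from the slide.
CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

-- Physical level: an index on the first column of Students.
-- (How the rows themselves are stored is left to the engine.)
CREATE INDEX idx_students_sid ON Students (sid);

-- External level: a view for one particular purpose.
-- Assumption: enrollment means the number of students enrolled in a course.
CREATE VIEW Course_info AS
    SELECT cid, COUNT(sid) AS enrollment FROM Enrolled GROUP BY cid;
""")

# Hypothetical sample data.
cur.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [("s1", "K236", "A"), ("s2", "K236", "B"), ("s1", "K101", "A")],
)

course_info = cur.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid"
).fetchall()
conn.close()
```

Users of the view never see how Enrolled is stored or indexed, which is the point of separating the levels.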
[Figure: three levels. External level: View 1, View 2. Conceptual level: conceptual model. Physical level: physical model.]
* A conceptual model is an underlying model that is capable of supporting any valid (and perhaps changing) external view that falls within its scope. https://en.wikipedia.org/wiki/Data_model#cite_note-MW99-3
Types of data models
• Flat model: a single, two-dimensional array of data elements.
• Hierarchical model: data are organized into a tree-like structure, implying a single upward link in each record to describe the nesting.
• Network model: two constructs: records contain fields, and sets define one-to-many relationships between records.
• Relational model: a database as a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values.
• Object-relational model: a relational database model in which objects, classes, and inheritance are directly supported in database schemas and in the query language.
• Star schema: the simplest style of data warehouse schema.
Data types
• SYMBOLIC
  - Indexing: e.g., names, tags, case numbers, or serial numbers that identify a respondent or group of respondents.
  - Binary: two values, e.g., YES or NO, SUCCESS or FAILURE, MALE or FEMALE, WHITE or NON-WHITE, FOR or AGAINST, and so on.
  - Boolean: two values, TRUE or FALSE; may also take the value UNKNOWN.
  - Nominal: character-string values (green, blue, red, …)
  - Ordinal: character-string values that are linearly ordered (Small, Middle, Large, …)
• NUMERIC
  - Integer: values are integer numbers.
  - Continuous: values are real numbers.
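The data type determines which summary operations are meaningful. The sketch below illustrates this on hypothetical records; the attribute names, the values, and the `SIZE_ORDER` list are assumptions introduced only for the example.

```python
from statistics import mean, mode

# Hypothetical records mixing nominal, ordinal, integer, and continuous attributes.
records = [
    {"color": "green", "size": "Small",  "age": 21, "income": 27000.0},
    {"color": "blue",  "size": "Large",  "age": 35, "income": 52000.5},
    {"color": "green", "size": "Middle", "age": 48, "income": 41000.0},
]

# Nominal: only equality is meaningful, so the mode is a valid summary.
most_common_color = mode([r["color"] for r in records])

# Ordinal: values are ordered, so sorting and a median level make sense,
# but the distance between two levels is not defined.
SIZE_ORDER = ["Small", "Middle", "Large"]
ranks = sorted(SIZE_ORDER.index(r["size"]) for r in records)
median_size = SIZE_ORDER[ranks[len(ranks) // 2]]

# Numeric (integer or continuous): arithmetic is meaningful, so a mean is valid.
mean_age = mean([r["age"] for r in records])
mean_income = mean([r["income"] for r in records])
```

Taking the mean of `color`, or even of `size` after coding it as 0, 1, 2, would produce a number without a defined meaning.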
Symbols or numbers
Why care about data types? The possible analysis operations (and thus the methods and algorithms) depend on the data types.
• Symbolic data: combinatorial search in hypothesis spaces (machine learning)
• Numerical data: often matrix-based computation (multivariate data analysis)

Attribute structure | Operations | Numerical | Symbolic
No structure | =, ≠ | - | Nominal or categorical (Binary, Boolean), e.g., Places, Color
Ordinal structure | =, ≠, ≥ | - | Ordinal, e.g., Rank, Resemblance
Ring structure | =, ≠, ≥, +, × | Measurable: Integer (Age, Temperature), Continuous (Income, Length) | -

Advances: data transformation
Structures of data
• Structured data
  - Can be stored in an SQL database, in tables with rows and columns.
  - Only about 5-10% of all available data.
• Semi-structured data
  - Do not reside in a relational database but have some organizational properties that make them easier to analyze.
  - XML documents and NoSQL database documents are semi-structured.
[Figure: articles in a LaTeX database]
Structures of data
• Unstructured data
  - Unstructured data represent around 80% of all data and often include text and multimedia content. Examples: e-mail messages, word-processing documents, videos, photos, audio files, web pages, and many other kinds of business documents.
  - A key issue in data science is how to represent unstructured data. Example: the DNA sequence "…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTC…" can be represented in different ways for computation, such as sliding windows, motifs, and kernel functions, or by a web-link representation.
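One of the representations named above, the sliding window, is easy to sketch. The code below is one possible reading, not the lecture's own method: it assumes a window of length k = 3 moved one position at a time, and it turns the fragment into a fixed-length count vector. The function name `kmer_counts` is hypothetical.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every length-k window obtained by sliding one position at a time."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Fragment copied from the example sequence on the slide.
fragment = "TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTC"
counts = kmer_counts(fragment, k=3)

# A fixed-length vector over all 4**3 possible 3-mers, usable by numeric methods.
alphabet = "ACGT"
all_kmers = [a + b + c for a in alphabet for b in alphabet for c in alphabet]
vector = [counts[kmer] for kmer in all_kmers]
```

Sequences of different lengths all map to vectors of the same length (64 here), which is what makes matrix-based methods applicable to them.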
Databases
• The most popular format for organizing data in a database is the rectangular table (also called a data array or data matrix).
  - Each row represents the values of all variables for a single multivariate observation.
  - Each column represents the values of a single variable across all observations.
• A typical database table with n multivariate observations taken on r variables is represented by an (r × n)-matrix, that is, the transpose of the table, with one column per observation.
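The relation between the table layout (one row per observation) and an (r × n)-matrix can be checked on a tiny example. This sketch assumes the matrix is simply the transpose of the table; the variable names and values are hypothetical.

```python
# Hypothetical table: n = 3 observations (rows) on r = 2 variables (columns).
variables = ["age", "income"]
table = [
    [21, 27000.0],
    [35, 52000.0],
    [48, 41000.0],
]

n, r = len(table), len(table[0])

# Transposing gives the (r x n) data matrix: one row per variable,
# one column per observation.
matrix = [list(column) for column in zip(*table)]
```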
Elements of database systems
• A database management system (DBMS) is a software system that manages data and provides controlled access to the database.
• A database system (consisting of databases, a DBMS, and application programs) is typically used for managing large quantities of data. It can be regarded as two entities:
  - a server (or backend), which holds the DBMS, and
  - a set of clients (or frontend), each consisting of a hardware and a software component, including application programs.
Structured query language (SQL)
• Users communicate with a DBMS through a declarative query language, typically SQL (Structured Query Language).
• SQL has two main sublanguages:
  - a data definition language (DDL), used by database administrators to define data structures by creating, altering, or destroying database objects;
  - a data manipulation language (DML), which allows users to retrieve, delete, and update existing data and to add new data to the database.
• Examples
  - create table <table name> (<table elements>);
  - select <columns> from <table name> where <condition>;
  - select max(<column>) as max, min(<column>) as min from <table name> where <condition>;
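The statement templates above can be run against a real engine. Here is a minimal sketch using Python's built-in sqlite3 module; the table name `item`, its columns, and the sample rows are illustrative assumptions, and other SQL systems may differ in details such as type names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create table <table name> (<table elements>);
cur.execute("CREATE TABLE item (item_id TEXT, name TEXT, price REAL)")

# DML: add new data (illustrative rows).
cur.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [("I3", "high-res-TV", 988.0), ("I8", "multidisc-CDplayer", 369.0)],
)

# DML: select <columns> from <table name> where <condition>;
expensive = cur.execute(
    "SELECT item_id, name FROM item WHERE price > ?", (500,)
).fetchall()

# DML: select max(<column>) as max, min(<column>) as min from <table name> where <condition>;
price_range = cur.execute(
    "SELECT max(price) AS max_price, min(price) AS min_price FROM item WHERE price > 0"
).fetchone()
conn.close()
```

The queries state what is wanted, not how to find it; that is what "declarative" means here.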
Flat model: labeled data
[Figure: eight cells, labeled H1 to H4 and C1 to C4]
Supervised data (labeled)
ID | color | #nuclei | #tails | status
H1 | light | 1 | 1 | healthy
H2 | dark | 1 | 1 | healthy
H3 | light | 1 | 2 | healthy
H4 | light | 2 | 1 | healthy
C1 | dark | 1 | 2 | cancerous
C2 | dark | 2 | 1 | cancerous
C3 | light | 2 | 2 | cancerous
C4 | dark | 2 | 2 | cancerous
Descriptive attributes: color: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2}
Class attribute: status: {cancerous, healthy}
Flat model: unlabeled data
[Figure: eight cells, labeled H1 to H4 and C1 to C4]
Unsupervised data (unlabeled)
ID | color | #nuclei | #tails | status
H1 | light | 1 | 1 | healthy
H2 | dark | 1 | 1 | healthy
H3 | light | 1 | 2 | healthy
H4 | light | 2 | 1 | healthy
C1 | dark | 1 | 2 | cancerous
C2 | dark | 2 | 1 | cancerous
C3 | light | 2 | 2 | cancerous
C4 | dark | 2 | 2 | cancerous
Descriptive attributes: color: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2}
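The difference between the labeled and unlabeled versions of the flat table comes down to whether the class attribute is available. A minimal sketch, assuming the unlabeled view simply drops the status column (the variable names `X`, `y`, and `unlabeled` are conventions chosen here, not taken from the slides):

```python
# The eight cells from the table: (ID, color, #nuclei, #tails, status).
cells = [
    ("H1", "light", 1, 1, "healthy"),
    ("H2", "dark",  1, 1, "healthy"),
    ("H3", "light", 1, 2, "healthy"),
    ("H4", "light", 2, 1, "healthy"),
    ("C1", "dark",  1, 2, "cancerous"),
    ("C2", "dark",  2, 1, "cancerous"),
    ("C3", "light", 2, 2, "cancerous"),
    ("C4", "dark",  2, 2, "cancerous"),
]

# Labeled (supervised) view: descriptive attributes X plus class attribute y.
X = [(color, nuclei, tails) for _, color, nuclei, tails, _ in cells]
y = [status for *_, status in cells]

# Unlabeled (unsupervised) view: the same descriptions without the class attribute.
unlabeled = X
```

A supervised method learns a mapping from `X` to `y`; an unsupervised method sees only `unlabeled` and has to find structure on its own.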
Relational databases
A relational database is a collection of tables, each of which is assigned a unique name and consists of a set of attributes and a set of tuples.

customer
Cust-ID | name | address | age | income | credit-info | …
C1 | Smith, Sandy | 5463 E Hasting, Burnaby BC V5A 459, Canada | 21 | $27000 | 1 | …
… | … | … | … | … | … | …

item
Item-ID | name | brand | category | type | price | place-made | supplier | cost
I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NIkoX | $600.00
I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00
… | … | … | … | … | … | … | … | …

employee
Emp-ID | name | category | group | salary | commission
E35 | Jones, Jane | home entertainment | manager | $18,000 | 2%
… | … | … | … | … | …

branch
Branch-ID | name | address
B1 | City square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
… | … | …

purchases
Trans-ID | cust-ID | empl-ID | date | time | method-paid | amount
T100 | C1 | E55 | 01/21/98 | 15:45 | Visa | $1357.00
… | … | … | … | … | … | …

item-sold
Trans-ID | item-ID | qty
T100 | I3 | 1
T100 | I8 | 2
… | … | …

works-at
Empl-ID | branch-ID
E55 | B1
… | …
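Because the tables share identifiers (cust-ID, Trans-ID, item-ID), information spread across them can be recombined with joins. The sketch below loads a reduced version of the example into sqlite3; only a few columns are kept, and the column name `qty` is an assumption for the third column of item-sold, which appears as "sty" in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer  (cust_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE item      (item_id TEXT PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE purchases (trans_id TEXT PRIMARY KEY,
                        cust_id  TEXT REFERENCES customer(cust_id));
CREATE TABLE item_sold (trans_id TEXT REFERENCES purchases(trans_id),
                        item_id  TEXT REFERENCES item(item_id),
                        qty      INTEGER);  -- assumed meaning: quantity sold

INSERT INTO customer  VALUES ('C1', 'Smith, Sandy');
INSERT INTO item      VALUES ('I3', 'high-res-TV', 988.00),
                             ('I8', 'multidisc-CDplayer', 369.00);
INSERT INTO purchases VALUES ('T100', 'C1');
INSERT INTO item_sold VALUES ('T100', 'I3', 1), ('T100', 'I8', 2);
""")

# Who bought what in transaction T100? Follow the shared keys across four tables.
rows = cur.execute("""
SELECT c.name, i.name, s.qty
FROM purchases AS p
JOIN customer  AS c ON c.cust_id  = p.cust_id
JOIN item_sold AS s ON s.trans_id = p.trans_id
JOIN item      AS i ON i.item_id  = s.item_id
WHERE p.trans_id = 'T100'
ORDER BY i.item_id
""").fetchall()
conn.close()
```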
Data warehouses
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
[Figure: data sources in Chicago, New York, Vancouver, and Toronto are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools.]
Transactional databases
• A transactional database consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

Trans_ID | list of item_IDs
T100 | beer, cake, onigiri
T200 | beer, cake
T300 | beer, onigiri
T400 | beer, onigiri
T500 | cake
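A typical first computation on transactional data is to count how often items, and pairs of items, occur together. The sketch below does this for the five transactions listed; representing each transaction as a Python set and the names `item_support` and `pair_support` are choices made here for illustration.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the table, each as a set of items.
transactions = {
    "T100": {"beer", "cake", "onigiri"},
    "T200": {"beer", "cake"},
    "T300": {"beer", "onigiri"},
    "T400": {"beer", "onigiri"},
    "T500": {"cake"},
}

# Number of transactions containing each item.
item_support = Counter(
    item for items in transactions.values() for item in items
)

# Number of transactions containing each pair of items.
pair_support = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)
```

Counts of this kind are the starting point for the association-rule mining scheduled later in the course.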
Advanced database systems
• Object-oriented databases
• Object-relational databases
• Spatial databases
• Temporal databases and time-series databases
• Text databases and multimedia databases
• Heterogeneous databases and legacy databases
• The World Wide Web
Spatial databases
• Spatial databases contain spatially related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.
• Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas located at various altitudes, etc.
[Figure: Japanese earthquakes, 1961-1994]
Temporal and time-series databases
• These databases store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).
• Data analytics finds the characteristics of object evolution and trends of change: e.g., stock exchange data can be mined to uncover trends in investment strategies.
Text and multimedia databases
• Text databases contain documents, usually highly unstructured or semi-structured. They are mined to uncover general descriptions of object classes, keywords, content associations, clustering behavior of text objects, etc.
• Multimedia databases store image, audio, and video data. Applications include content-based picture retrieval, voice-mail systems, video-on-demand systems, speech-based user interfaces, etc.
The World Wide Web
The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.
Example: When examining the data collected from Internet Mart, heavily trodden paths gave BT hints about the regions of the site that were of key interest to its visitors.
Common properties of large real-world databases (noise, inconsistencies, outliers):
• Incomplete: lacking attribute values or certain attributes of interest
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names
No quality data, no quality data mining results!
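Each of the three problems can be detected with a simple check. The sketch below is illustrative only: the records are hypothetical, the plausible age range of 0 to 120 is an assumption, and the `CANONICAL` mapping of country spellings is invented for the example.

```python
# Hypothetical customer records showing the three problems.
records = [
    {"id": "C1", "age": 21,   "country": "Canada"},
    {"id": "C2", "age": None, "country": "CA"},      # incomplete, inconsistent code
    {"id": "C3", "age": 230,  "country": "Canada"},  # noisy: implausible age
    {"id": "C4", "age": 35,   "country": "canada"},  # inconsistent spelling
]

# Incomplete: some attribute value is missing.
incomplete = [r["id"] for r in records if any(v is None for v in r.values())]

# Noisy: value outside an assumed plausible range.
noisy = [
    r["id"] for r in records
    if r["age"] is not None and not 0 <= r["age"] <= 120
]

# Inconsistent: value differs from its assumed canonical form.
CANONICAL = {"canada": "Canada", "ca": "Canada"}
inconsistent = [
    r["id"] for r in records
    if r["country"] != CANONICAL.get(r["country"].lower(), r["country"])
]
```

Real data cleaning needs domain knowledge to choose such ranges and canonical forms; the checks themselves are the easy part.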
Homework for K236-L2
• Carefully study the slides. You can consult the book chapter "Data and Databases" provided on the website. Raise questions about anything you have not yet clearly understood.
• Choose 4 datasets from www.statsci.org/datasets.html and summarize each of them (the area where the data were collected, data types, number of features and objects, etc.). The datasets you select must cover different kinds of data (categorical, ordinal, integer, real number, etc.) and different data representations (vector, sequence, list, graph, etc.).
• The report for this homework must be submitted no later than one week after the class (June 23).