


Second meeting:

Aliis exterendum

• Quetelet invited to attend a meeting of the British Association for the Advancement of Science in 1833; he decided to give a paper on the relationship between statistics of crime and age in France and Belgium

• Charles Babbage was so intrigued by Quetelet's presentation that he suggested that a new section of the BA be formed to deal expressly with statistics

• Consent was given for the new section with the proviso that the new group should deal in facts and stay away from opinion and interpretation; the BA did not want to become mired in politics over interpretation of social statistics such as crime and social conditions

• The new section was an immediate success; it became clear to the leading figures of the section that a statistical...

Fitting distributions

• Quetelet and others saw normal distributions everywhere; data sets of the sort used by Quetelet were published and were used to inform early statistics research

• From correlation to the t-test, these data were often used in an exploratory fashion or through simulation to work out results that depended on normal distributions

[Figure: histogram of 5,738 chest measurements; x-axis 33–47 inches, y-axis frequency 0–1,000]

Suppose... that one wished to make a thousand copies of a statue, say the Gladiator. Like astronomical observations of a single object, these copies would be subject to a variety of errors -- in measuring the various dimensions, in workmanship and so on. The independent errors are like terms of a binomial, and combine in a characteristic fashion. Hence the variation among the copies would be governed by a profound regularity, the error law or normal curve, with the dimensions of the original Gladiator at the mean. But this is an impossible experiment. How did Quetelet know what the result would be?

“I shall perhaps astonish you very much by stating that the experiment has been already made. Yes, surely, more than a thousand copies have been measured of a statue, which I do not assert to be that of the Gladiator, but which in all cases differs little from it. These copies were living ones...” (1846, p. 136)

from The Empire of Chance, p. 54

From “On Criminal Anthropometry and the Identification of Criminals” by W. R. MacDonell, Biometrika, Vol. 1, 1902


from the first page of the article, to clarify why measurements of criminals were in the “public data domain”...

and later some conclusions...

the t-distribution

• here are Gosset’s data via a simple histogram

• keep in mind these plots represent the entire population; from this collection of 3,000 numbers he draws “cards” or elements from this distribution to simulate a normal random variable

[Figure: “Histogram of Criminal’s Heights”; x-axis heights 55–75 inches, y-axis Frequency 0–400]
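Gosset’s card-drawing scheme is easy to mimic in code. The sketch below is a hypothetical reconstruction, not his actual procedure or data: lacking MacDonell’s 3,000 heights, it synthesizes a stand-in population, then repeatedly draws small samples and computes the t-like ratio whose distribution he was after.

```python
import random
import statistics

# Hypothetical stand-in for MacDonell's 3,000 criminal heights (inches);
# the real values are in Biometrika, Vol. 1.
random.seed(0)
population = [random.gauss(65.5, 2.5) for _ in range(3000)]
pop_mean = statistics.mean(population)

# Gosset's "cards": draw samples of size 4 from the fixed population and
# compute t = (sample mean - population mean) / (s / sqrt(n))
t_values = []
for _ in range(750):
    sample = random.sample(population, 4)
    m = statistics.mean(sample)
    s = statistics.stdev(sample)
    t_values.append((m - pop_mean) / (s / 2.0))  # sqrt(4) = 2

# With only 3 degrees of freedom, extreme t's turn up more often
# than a normal curve would predict
print(len(t_values))
```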

Fitting distributions

• After we dealt with our own “bell curve” revolution, researchers started seeing other kinds of distributions in data

• Heavy-tail (or fat-tail, “hockey-stick”, or “power law”) distributions became objects of study in their own right in the mid-to-late 90s; they seemed to crop up everywhere

• These researchers made the same kind of intellectual moves, hypothesizing different toy simulations that gave rise to these distributions and then reckoning that nature must operate similarly
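A toy simulation of this kind (an illustration only, not Carlson and Doyle’s actual model) might draw event sizes from a Pareto distribution and tabulate how many events exceed each threshold; on a log-log scale the counts fall roughly on a straight line.

```python
import random
import math

random.seed(1)
alpha = 1.0  # power-law exponent (assumed for illustration)

# Inverse-transform sampling from a Pareto(alpha) distribution on [1, inf)
sizes = [(1.0 / (1.0 - random.random())) ** (1.0 / alpha) for _ in range(10000)]

# Cumulative counts: how many events exceed each size threshold?
thresholds = [1, 10, 100, 1000]
counts = [sum(1 for s in sizes if s > t) for t in thresholds]

# Counts drop by roughly a factor of 10 per decade of size
for t, c in zip(thresholds, counts):
    print(t, c, round(math.log10(max(c, 1)), 2))
```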

Detour: Carlson and Doyle

[Figure: US power outages, 1984–1997; log-log plot of the frequency of outages affecting more than N customers against N = # of customers affected by outage, for N from 10^4 to 10^7]

[Figure: cumulative frequency vs. size of events on log (base 10) scales; forest fires in 1,000 km² (Malamud), WWW file sizes in Mbytes (Crovella), data compression codeword lengths (Huffman), and the Los Alamos fire]

Highly Optimized Tolerance (HOT)

• Complex systems in biology, ecology, technology, sociology, economics, …

• are driven by design or evolution to high-performance states which are also tolerant to uncertainty in the environment and components.

• This leads to specialized, modular, hierarchical structures, often with enormous “hidden” complexity,

• with new sensitivities to unknown or neglected perturbations and design flaws.

• “Robust, yet fragile!”

Robustness of HOT systems

• Robust to known and designed-for uncertainties, yet fragile to unknown or rare perturbations

[Diagram: the interplay of uncertainties, complexity, robustness, and interconnection; aim: simplest possible story]

[Figure: the same cumulative size-frequency plot on log (base 10) scales, with power-law fits for fires, web files, and codewords; a slope of −1/2 is annotated]

Detour: Carlson and Doyle

Project ideas: Making data physical

In this piece, the computer is used to make a catalog, or database of every shot of many shows. Each shot is then indexed according to the categories seen on the shelf of Video CDs. In some ways, the result is scrambled, in other ways it is highly ordered according to the logic of the database.

Every shot/every episode, Jennifer and Kevin McCoy, 2001

20 episodes of “Starsky and Hutch” divided by scene into 300 categories

“Every Anvil”, Jennifer and Kevin McCoy, 2001

“448 is enough”, Jennifer and Kevin McCoy, 2002

Dark Source is an artwork that shows the inner workings of a commercial electronic voting machine, the Diebold AccuVote-TS™ touch-screen voting terminal that has recently been adopted in many U.S. states...The artwork presents over 2,000 pages of software code, a printout of 49,609 lines of C++ that constitute version 4.3.1 of the AccuVote-TS™ source code...

Calling its source code a trade secret, Diebold has asserted its proprietary interest in protecting its intellectual property. Therefore in Dark Source the code, which had been obtained freely over the internet following a 2002 security failure at Diebold, has been blacked out in its entirety in order to comply with trade secrecy laws.

Dark Source, Ben Rubin, 2005

[murmur]

• A project to help share recordings taken on the scene

• This was from a park in San Jose; the sound files were recorded by visitors to the park

(Click the red dots to hear the stories!)

D-tower

• Daily surveys assess the mood of Doetinchem, recording four states (happiness, love, fear and hate)

• The data are then represented both through hand-drawn glyphs on the web as well as...

Lars Spuybroek, D-Tower, Netherlands

and today...

Preparing for lab

• We’ll now go over some elementary database concepts; this is probably best done in the context of a particular set of data

• We will now look at two in particular: the first comes from the so-called Reality Mining project at MIT, and the second is a large text database released by the SEC

Why these?

• Much of our readings for the next two meetings deal with personal identification; physical measurements (literally, height, length of index finger, etc.) or images

• In computer mediated settings, we often leave traces that might also be used to identify us; recall the recent AOL debacle

Why these?

• At a practical level, we chose these because they are relatively complex and provide us with some room to examine database technology

• The next few slides introduce you to the data, providing you with a little background; we’ll then introduce some basic database concepts in context

Machine Perception and Learning of Complex Social Systems

Reality Mining defines the collection of machine-sensed environmental data pertaining to human social behavior. This new paradigm of data mining makes possible the modeling of conversation context, proximity sensing, and temporospatial location throughout large communities of individuals. Mobile phones (and similarly innocuous devices) are used for data collection, opening social network analysis to new methods of empirical stochastic modeling.

The original Reality Mining experiment is one of the largest mobile phone projects attempted in academia. Our research agenda takes advantage of the increasingly widespread use of mobile phones to provide insight into the dynamics of both individual and group behavior. By leveraging recent advances in machine learning we are building generative models that can be used to predict what a single user will do next, as well as model behavior of large organizations.

We have captured communication, proximity, location, and activity information from 100 subjects at MIT over the course of the 2004-2005 academic year. This data represents over 350,000 hours (~40 years) of continuous data on human behavior. Such rich data on complex social systems have implications for a variety of fields. The research questions we are addressing include:

• How do social networks evolve over time?

• How entropic (predictable) are most people's lives?

• How does information flow?

• Can the topology of a social network be inferred from only proximity data?

• How can we change a group's interactions to promote better functioning?

If you have a Nokia Symbian Series 60 Phone (such as the Nokia 6600) with a data plan, you can participate. Additionally, we have cleaned the 2004-2005 data of identifiable information and are making it available to other researchers within the academic community. Both the mobile phone application and the resultant dataset can be downloaded here.

Reality Mining

• The project starts with the idea that just about everyone has or will have a mobile phone; these devices act as a kind of wearable sensor

• When using a mobile phone, your location is roughly known because you “associate” (to borrow the language from a previous project) with a cell tower

• In addition, many mobile phones have the ability to contact other nearby wireless devices via the Bluetooth protocol; this means that it is possible to passively track social interactions and exhibit daily patterns of contact

Reality Mining

• In addition, mobile phones provide information about users’ communication patterns: Who did you call? Who did you text?

• Here, for example, is a representation of user locations with links indicating current calls

• We can also collect information about the device itself; Is it idle? Is the battery charging?


Bluetooth

• This protocol was introduced by Ericsson in 1994, and by 2006 it was available in about 90% of PDAs, 80% of laptops and 75% of mobile phones; in November of 2006, the number of installed Bluetooth devices crossed the one billion mark

• You might own a Bluetooth wireless headset, keyboard, or mouse; the protocol was initially designed to form ad hoc networks so that nearby devices could cooperate in some sense

• One feature of this protocol is something called device discovery; a Bluetooth phone can identify information on other Bluetooth devices within 5-10 meters

• The Media Lab group created a small piece of monitoring software that runs on certain mobile phones and records data on devices carried by people nearby

The data

• The Reality Mining study consisted of tracking 100 people for 9 months using specially prepared Nokia 6600 smart phones

• 75 of the users were students or faculty in the Media Lab, while the remaining 25 came from the Sloan Business School

• Researchers collected call logs, records of Bluetooth devices nearby and the cell towers the user “associated” with, and statistics related to application usage and phone status

The data

• So far, while we have discussed several projects that rely on data and we have read about large (albeit often antique) data collection efforts, we haven’t talked very much about formal methods for data organization

• In industry, in large (and even not-so-large) applications, you will be called upon to access data from relational databases; these are often commercial systems from vendors like Oracle or even SAS

• The data from the Reality Mining project were provided to us in the form of a MySQL database

What is a database?

• an organized body of related information

• A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions. The software used to manage and query a database is known as a database management system (DBMS). The properties of database systems are studied in information science.

• Data stored on computer files or on CD-ROM. A database may contain bibliographic, textual or numeric data. The data are usually structured so that they may be searched in a number of ways. A variety of databases is accessible via this website.

• A database is an organised collection of information records that can be accessed electronically.

• is an organized collection of information stored on a computer.

• A database is a collection of data that is organized so that its contents can easily be accessed, managed and updated.

• A collection of information that has been systematically organized for easy access and analysis. Databases typically are computerized.

• A collection of information arranged into individual records to be searched by computer.

• Any organized collection of information; it may be paper or electronic.

• a standardized collection of information in computerized format, searchable by various parameters; in libraries often refers to electronic catalogs and indexes.

• A collection of electronic records having a standardized format and using specific software for computer access.

• A collection of information organized and presented to serve a specific purpose. A computerized database is an updated, organized file of machine readable information that is rapidly searched and retrieved by computer.

• A set of data that is structured and organized for quick access to specific information.

MySQL

• At a technical level, MySQL is a multithreaded, multi-user, structured query language (SQL) database management system (DBMS)

• It is distributed free under the GPL (we may or may not have a discussion about software licenses - but the GPL implies that you can take code, make changes, and redistribute it as long as you do so under the GPL and you make the source code available)

• It is owned and sponsored by a for-profit company that sells support and service contracts, as well as commercially-licensed copies of MySQL

Benefits of a database

• The data can be shared and accessed by many users; it is also possible to regulate the kind of access granted to each user, introducing a layer of security

• Redundancy is reduced in the sense that not everyone has to have their own private copy of the data; as a corollary we also have the opportunity to reduce inconsistencies in the data and maintain better control over data integrity

• Finally, the act of creating a database and deciding on how data are to be represented often forces discussions by the users about what services they require from the database; it also allows for the introduction of data standards which could enable the interchange of data with other systems and organizations

Data models

• To ground our discussion somewhat, a database provides a kind of organizational structure to the data it contains and defines operations that can be performed on the data

• A list is the simplest kind of “flat” data model; a list is essentially a two-dimensional array of data elements where all members of a given column have similar values, and members of each row are related in some way
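As a concrete sketch (with made-up rows, not the actual participant list), a flat list in Python is just a list of tuples, and a column summary is one pass over the rows:

```python
from collections import Counter

# A "list" data model: rows are tuples, columns are positions
# (hypothetical participants, not the actual Reality Mining roster)
participants = [
    ("alice", "mlgrad", "rarely"),
    ("bob",   "sloan",  "never"),
    ("carol", "mlgrad", "occasionally"),
]

# Summarize a column by counting its values, as in the breakdowns below
positions = Counter(row[1] for row in participants)
print(positions)  # Counter({'mlgrad': 2, 'sloan': 1})
```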

Reality Mining

• As an example, the researchers at MIT needed a list to keep track of their participants

• In addition to each person’s name, password, phone number and email address, they asked each person to fill out a survey

• This provided them with information about their status with MIT (faculty, undergraduate, graduate student, staff), the hours they are usually on campus, how often they forget their phone at home, and so on

oid

name

password

email

phonenumber_oid

survey_Position

survey_Neighborhood

survey_Hours

survey_Regular

survey_Hangouts

survey_Predictable_life

survey_Forget_phone

survey_Run_out_of_batteries

survey_How_often_get_sick

survey_Sick_Recently

survey_Travel

survey_Data_Plan

survey_Provider

survey_Calling_plan

survey_Minutes

survey_Texts

survey_Like_Intros

survey_ML_Community

Reality Mining

• Here are the first five rows of the list of participants for the project; each entry is separated by a comma

• Note that there are lots of “missing values” (don’t we just love social science research?)

•,,,,

mlgrad,Fresh Pond*,9am-5pm,somewhat,working with community centers around boston*

masfrosh,MIT,F9am-5pm*,very,restaurant/bar; friends; classes

sloan,,,,

newgrad,Boston,11am-8pm,somewhat,restaurant/bar; gym; library*

Reality Mining

• We can perform simple summaries on each column in the participant list; here we show the breakdown of users’ forgetfulness as well as their position at MIT

• A list is the most basic data structure, and in some ways is almost invisible to us; we’re used to the idea of recording information in this way

survey_Position

sloan 27

student 11

mlgrad 10

newgrad 9

masfrosh 6

NA 4

1styeargrad 4

3rdyeargrad 3

staff 3

2ndyeargrad 2

6years 2

5 year media lab 1

5thyear 1

6thyeargrad 1

8thyear 1

Germany 1

Ml_urop 1

csail 1

graduated 1

ml5 1

mlUrop 1

mlfrosh 1

neil 1

prof 1

professor 1

researcher 1

urop_hd 1

survey_Forget_phone

NA 36

rarely 27

occasionally 15

never 12

Never 2

Rarely - once/month 2

often 2

-- 1

The problem with lists

• In a list, each row is intended to stand on its own; that means that the same information may be needed in several places

• For example, researchers in the Reality Mining project want to record the cell towers their participants associate with; they can create a list that records each event (the time, the participant and the cell tower)

• In a table of events, we’d end up recording the participant and cell tower information multiple times; if we decide to update the location of a cell tower, for example, we will have to change all the entries in our list that refer to that tower

The problem with lists

• In addition to replication, it may happen that certain information does not appear at all in a simple transaction list; if a known cell tower near campus is never accessed by our participants, its information would never appear

• In short, each row of a list can hold data about different “themes”; a simple list for association events is also being asked to hold data about participants and cell towers

• Redundancy and multiply themed row elements make lists difficult structures to manage

The relational model

• A relational database contains multiple tables, each similar to the flat model; the word relation is borrowed from mathematics, where it essentially means table

• Next, a series of operators exist through which users query the database, deriving new tables from old ones; for example, we might extract a subset of a given table, perform simple computations, or merge the data in several tables

Reality Mining

• As we mentioned previously, the Reality Mining dataset was offered as a dump of a MySQL database; in Lab we will use a Query Browser to have a look at relations

Some terminology

• Each table is called a relation; rows in the table are called tuples and columns are called attributes

• The degree of a table refers to the number of columns, and the cardinality the number of rows

• An entity is an abstraction of an object that will be represented in the database; an instance is a particular occurrence of the object

• For the person relation, this means some facts about their phone and a series of survey results; in our database we have 97 instances, i.e., separate users

Relational databases

• In the Reality Mining example, we see several tables that are related in the sense that data in one might refer to data in another; to be useful, the data need to be joined back together

• Keys are used to match up rows of data in different tables; a key is simply a collection of one or more attributes

• In our example, this structure is pretty simple...
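A minimal sketch of the idea, using SQLite from Python with two made-up tables that mirror the Reality Mining schema (the names and values below are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE celltower (oid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cellname  (oid INTEGER PRIMARY KEY, name TEXT,
                            person_oid INTEGER, celltower_oid INTEGER);
    INSERT INTO celltower VALUES (1, 'tower-A'), (2, 'tower-B');
    INSERT INTO cellname  VALUES (1, 'home', 7, 1), (2, 'lab', 7, 2);
""")

# celltower_oid is the key that matches rows of cellname to rows of celltower
rows = con.execute("""
    SELECT cellname.name, celltower.name
    FROM cellname JOIN celltower
      ON cellname.celltower_oid = celltower.oid
    ORDER BY cellname.oid
""").fetchall()
print(rows)  # [('home', 'tower-A'), ('lab', 'tower-B')]
```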

Reality Mining

• There are several tables related to the objects under study; in some cases data have been removed to protect the privacy of the participants

• cellname: oid, name, person_oid, celltower_oid

• celltower: oid, name

• device: oid, macaddr, person_oid, name

• person: name, password, email, phonenumber_oid,...

• coverspan: oid, number

Reality Mining

• In addition, there are several tables related to events that were captured during the monitoring period

• activityspan: oid, endtime, starttime, person_oid

• callspan: oid, endtime, starttime, person_oid, phonenumber_oid, callid, contact, description, direction, duration, number, status, remote

• cellspan: oid, endtime, starttime, person_oid, celltower_oid

• coverspan: oid, endtime, starttime, person_oid

• devicespan: oid, endtime, starttime, person_oid, device_oid

Relational databases

• DBMSs are optimized to handle a set of queries made by users; again, these queries essentially let us make new tables out of old by subsetting, merging or performing simple computations

• A query language allows users to interact with the database, reducing the data and summarizing it before retrieving the results

• The Structured Query Language (SQL) is widely used, and is supported by most commercial databases

Relational databases

• Kroenke identifies four components of a database system; the users, the database application, the database management system and the database itself

• The Query Browser interface we will be using is an example of a database application; it provides us with a view into the database

• The DBMS controls all activities that take place in the database, acting as an intermediary between database applications and the database; it will answer user queries, enforce access permissions, deal with concurrency issues, provide backups, handle security, ensure data integrity, and so on

• The database itself is the set of tables; in the case of the Reality Mining project, it is the set of activities, persons, cell towers and so on

SQL

• The SELECT statement is used to retrieve data from a database; you specify the table you want to draw from and various conditions you want to impose to extract a subset of the data

SELECT * FROM person;

• This will return the complete person table (which in this case is not very large)
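To try this outside of the Query Browser, the same statement can be run against a toy person table with SQLite (the rows below are invented, not actual participants):

```python
import sqlite3

# Hypothetical three-row person table standing in for the real one
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (oid INTEGER, name TEXT, email TEXT)")
con.executemany("INSERT INTO person VALUES (?, ?, ?)",
                [(1, "alice", "a@mit.edu"),
                 (2, "bob",   "b@mit.edu"),
                 (3, "carol", "c@mit.edu")])

# SELECT * returns every column of every row
rows = con.execute("SELECT * FROM person ORDER BY oid").fetchall()
print(rows)  # [(1, 'alice', 'a@mit.edu'), (2, 'bob', 'b@mit.edu'), (3, 'carol', 'c@mit.edu')]
```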

Reality Mining

• Here is the same view of our person list, described a few slides ago; here we’ve issued a command to the database, a query, asking for all the data from the person relation

SQL

• We can restrict the columns we retrieve by replacing * with a comma-separated list

• For example, the table cellname contains the names that users assigned to their locations when they were associated with a given cell tower; we can select just the name and the person who assigned the name with the command

SELECT name, person_oid FROM cellname;

SQL

• We can consider just the names given to the first cell tower by restricting our search

SELECT name, person_oid FROM cellname
WHERE celltower_oid = 1;

SQL

• ... or, if we want to restrict our search to the first two cell towers we would use another construction

SELECT name, person_oid FROM cellname
WHERE celltower_oid IN (1,2);
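Both forms of the WHERE clause behave the same way on a small made-up cellname table (the rows below are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cellname (name TEXT, person_oid INTEGER, celltower_oid INTEGER)")
con.executemany("INSERT INTO cellname VALUES (?, ?, ?)",
                [("home", 7, 1), ("lab", 7, 2), ("gym", 9, 3)])

# Equality keeps rows for one tower; IN keeps rows for any listed tower
one = con.execute("SELECT name, person_oid FROM cellname "
                  "WHERE celltower_oid = 1").fetchall()
two = con.execute("SELECT name, person_oid FROM cellname "
                  "WHERE celltower_oid IN (1, 2) ORDER BY celltower_oid").fetchall()
print(one)  # [('home', 7)]
print(two)  # [('home', 7), ('lab', 7)]
```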

SQL

• The general form of this command is

SELECT column(s) FROM relation(s)
[WHERE constraints];

• Our constraints can become a lot more complex...

SELECT * FROM callspan
WHERE duration > 1200 AND person_oid = 1;

Reality Mining

• The next couple of examples will relate to the table callspan; each row in this table describes a call made by one of the participants

• A database schema is used to describe the structure of the tables in a database; here is the portion of the schema for the Reality Mining database that relates to callspan

SQL Review

• SQL is primarily designed for data retrieval; it is not a computational language nor is it a statistical language

• It does, however, contain certain summarizing capabilities in the form of functions that can be applied over rows in a table; the GROUP BY clause can be used to identify subsets over which to apply them

• COUNT: returns the number of tuples (rows)

• SUM: the total for the attribute

• AVG: the average for the attribute

• MIN: the minimum across the attribute

• MAX: the maximum across the attribute

SQL

• We add these additional clauses as follows; here we compute the average amount of time people spent on the phone

SELECT person_oid, AVG(duration)
FROM callspan GROUP BY person_oid;
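On a toy callspan table (invented durations, in seconds), GROUP BY partitions the rows by person before AVG is applied:

```python
import sqlite3

# Hypothetical call durations for two participants
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE callspan (person_oid INTEGER, duration INTEGER)")
con.executemany("INSERT INTO callspan VALUES (?, ?)",
                [(1, 60), (1, 180), (2, 300), (2, 100), (2, 200)])

# One output row per person: the average of that person's durations
rows = con.execute("SELECT person_oid, AVG(duration) FROM callspan "
                   "GROUP BY person_oid ORDER BY person_oid").fetchall()
print(rows)  # [(1, 120.0), (2, 200.0)]
```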

SQL

• This will find the average duration of calls for each ID

• We can take it a step further with the following construction

SELECT person_oid, AVG(duration)
FROM callspan
GROUP BY person_oid
HAVING MAX(duration) < 10000;
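HAVING filters entire groups after aggregation; in this made-up example, person 2's group is dropped because one of that person's calls exceeds 10,000 seconds:

```python
import sqlite3

# Hypothetical durations; person 2 has one very long call
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE callspan (person_oid INTEGER, duration INTEGER)")
con.executemany("INSERT INTO callspan VALUES (?, ?)",
                [(1, 600), (1, 1200), (2, 300), (2, 20000)])

# The HAVING condition is evaluated per group, not per row
rows = con.execute("""
    SELECT person_oid, AVG(duration) FROM callspan
    GROUP BY person_oid
    HAVING MAX(duration) < 10000
""").fetchall()
print(rows)  # [(1, 900.0)]
```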

SQL

• As one final example, suppose we want to examine the number of incoming versus outgoing calls

SELECT description, direction, COUNT(*) AS count
FROM callspan
GROUP BY description, direction;
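Grouping on two columns produces one output row per (description, direction) pair; here on four invented call records:

```python
import sqlite3

# Hypothetical call records tagged by type and direction
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE callspan (description TEXT, direction TEXT, duration INTEGER)")
con.executemany("INSERT INTO callspan VALUES (?, ?, ?)",
                [("voice", "incoming", 60), ("voice", "outgoing", 90),
                 ("voice", "incoming", 30), ("sms",   "outgoing", 0)])

# COUNT(*) tallies the rows that fall into each (description, direction) group
rows = con.execute("""
    SELECT description, direction, COUNT(*) AS count
    FROM callspan
    GROUP BY description, direction
    ORDER BY description, direction
""").fetchall()
print(rows)  # [('sms', 'outgoing', 1), ('voice', 'incoming', 2), ('voice', 'outgoing', 1)]
```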

Enron

• The Reality Mining data set came to us in a particularly tidy form; it was already arranged as a series of tables

• The next project will not be so clean; some relevant links are

• http://www.chron.com/news/specials/enron/timeline.html

• http://www.cs.cmu.edu/~enron/

• http://www.stat.ucla.edu/~cocteau/klimt-ecml04-1.pdf

• http://www.stat.ucla.edu/~cocteau/Enron_Employee_Status.htm

Enron emails

• As part of its investigation into Enron, the Federal Energy Regulatory Commission released the emails of about 150 of its top executives

• These data were then cleaned up by groups at MIT and SRI and are now publicly available through the CMU CS Department

• To respect the privacy of the individuals involved, I have replaced the body of each email with x’s; our interest is not in what was said but who sent email to whom

Organization of the data

• The data itself is organized into a series of directories (folders), each named after an executive

• Under each directory (folder), you may find more directories (folders), each representing a different mail folder

• At the lowest level, you have a series of email messages, one per file; the files in each directory are named 1., 2., 3., etc.

An example

• Here we select the ex-Vice President for Regulatory Affairs, Shelley Corman

• We see the 11 mail folders; selecting the calendar folder, we exhibit the content of mail 2.

• Note again, that all textual content has been replaced by x’s; we are only interested in (at best) the pattern of communication

Some questions

• What is the distribution of numbers of emails per user?

• Are the users organizing their email into folders?

• Are certain folders common to all users?

• What is the distribution of emails per folder?
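The first of these questions can be answered with a short directory walk; the sketch below assumes the layout described above (user/folder/1., 2., ...) and simply counts the files in each mail folder:

```python
import os

def email_counts(root):
    """Return {user: {folder: number of messages}} for the directory layout above."""
    counts = {}
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        counts[user] = {}
        # Walk each user's tree; every file at any depth is one message
        for dirpath, _subdirs, filenames in os.walk(user_dir):
            if filenames:
                folder = os.path.relpath(dirpath, user_dir)
                counts[user][folder] = len(filenames)
    return counts
```

Summing each user's folder counts then gives the distribution of emails per user, and the per-folder counts bear on the remaining questions directly.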