capps programoninformationsciencebrownbag

Big Data activities at the U.S. Census Bureau Cavan Capps Big Data Lead U.S. Census Bureau February 13, 2014 Prepared for MIT Libraries Program on Information Science Brown Bag Talk Feb 2014

Upload: micah-altman

Post on 07-May-2015


Category:

Technology



DESCRIPTION

Our guest speaker, Cavan Capps, Big Data Lead at the U.S. Census Bureau, presented this talk as part of the Program on Information Science Brown Bag Series.

Big Data provides both challenges and opportunities for the official statistical community. The difficult issues of privacy, statistical reliability, and methodological transparency will need to be addressed in order to make full use of Big Data in the official statistical community. Improvements in statistical coverage at small geographies, new statistical measures, and more timely data at perhaps lower costs are the potential opportunities. This talk provides an overview of some of the research being done by the Census Bureau as it explores the use of “Big Data” for statistical agency purposes.

Speaker Bio: Cavan Capps is the U.S. Census Bureau’s Lead on Big Data processing. In that role he is focusing on new Big Data sources for use in official statistics, best-practice private-sector processing techniques, and software/hardware configurations that may be used to improve statistical processes and products. Previously, Mr. Capps initiated, designed, and managed a multi-enterprise, fully distributed statistical network called the DataWeb. The DataWeb is a data library of networked statistical databases from all federal statistical data domains, with sophisticated visualization, descriptive analytics, data integration, and dashboard construction tools. The DataWeb is the source of the official API to Census data products.

TRANSCRIPT

Page 1: Capps programoninformationsciencebrownbag

Big Data activities at the U.S. Census Bureau

Cavan Capps, Big Data Lead

U.S. Census Bureau, February 13, 2014

Prepared for

MIT Libraries Program on Information Science Brown Bag Talk

Feb 2014

Page 2: Capps programoninformationsciencebrownbag

Big Data Challenge at the Census Bureau


“The world is now producing large amounts of data ... data from Internet searches, credit card transactions, retail scanners, and social media.”

“There also are more and more digital administrative data (e.g., tax records, social security records, Medicare/Medicaid records, food stamp records, HUD records). Some of these data are not directly linked to the populations we study; some have item missing data problems; none offer a real replacement for our surveys, but many will be useful as auxiliary data sources.”

“Designed Data” vs. “Organic Data”

Page 3: Capps programoninformationsciencebrownbag

Big Data Challenge at the Census Bureau


Big Data is about creating information to make Big Decisions from novel, and often massive, data sources.

Page 4: Capps programoninformationsciencebrownbag

Big Data creates new Statistical Agency Challenges


A recent meeting of International Statistical Agencies observed:

1. The volume of data generated outside the government statistical systems is increasing much faster than the volume of data collected by the statistical systems; almost all of these data are digitized in electronic files.

2. As this occurs, the leaders expect that the relative cost, timeliness, and effectiveness of traditional survey and census approaches of the agencies may become less attractive.

Page 5: Capps programoninformationsciencebrownbag

Big Data creates new Statistical Agency Challenges


A recent meeting of International Statistical Agencies observed:

3. Blending together multiple available data sources (administrative records, commercial electronic transactions, internet web-page data, search frequency data, Twitter, Facebook, etc.) with traditional surveys and censuses (using paper, telephone, and face-to-face interviewing) to create high-quality, timely statistics that tell a coherent story of economic, social, and environmental progress must become a major focus of central government statistical agencies.

4. This requires efficient record linkage capabilities, the building of master universe frames that act as core infrastructure to the blending of data sources, and the use of modern statistical modeling to combine data sources with the highest accuracy.

Page 6: Capps programoninformationsciencebrownbag

Big Data creates new Statistical Agency Challenges


A recent meeting of International Statistical Agencies observed:

5. The Agencies will need to develop the analytical capabilities to distill insights from more integrated views of the world and impart a stronger systems view across different government and private sector information systems to provide more geographical and industry detail.

6. There are growing demands from researchers and policy-related organizations to analyze the micro-data collected by the agencies, to extract more timely and detailed information from the data.

Page 7: Capps programoninformationsciencebrownbag

Big Data Development Challenges for Statistical Agencies


The Meeting Recommended that Statistical Agencies develop:

1. High-speed, “big data” software/hardware systems for record linkage and extraction of key information from massive files.

2. Efficient and sophisticated imputation procedures needed to make the combined data sources jointly useful.

3. More use of statistical modeling for statistical estimation, to provide more:

   1. Timely estimates

   2. Small area estimates

   3. New measures

4. New ways to give secure access to micro-data for legitimate policy and research purposes, to increase the impact of their work.

Page 8: Capps programoninformationsciencebrownbag

In Summary, massive challenges for the Statistical Agencies:


1. The Internet and Private E-Transactions are generating data faster and more cheaply than Statistical agencies can afford to do.

2. To be reliable sources of information on the Demographics, Economy and Social change in the U.S., this information needs to be mashed together with traditional surveys and adjusted for bias.

3. The sizes of the files and the number of computations to mash up the data will be larger.

4. Spoiled by the Internet, users expect more timely and detailed data provided at lower costs.

5. Privacy/Confidentiality must be maintained.

Page 9: Capps programoninformationsciencebrownbag

Big Data Projects at the Census Bureau


Data Collection

- Multi-Mode Data Survey Collection model

- New Data sources (Web, E-Transactions, Admin Recs)

Data Integration & Analysis

- Record Linkage

- Small Area Estimation modeling & “Now Casting”

Data Release

- Data Review for Release

- Confidentialize data for public release

The Census Bureau “Big Data” Information Life Cycle

Page 10: Capps programoninformationsciencebrownbag

Big Data: Current Process and Future Process (exploring)

Current Process

• Designed Data

• Proprietary Software

• Batch Processing

• Long processing times

Future Process (exploring)

• Designed & Organic Data

• Next Generation Open-Source & Proprietary Software

• More Parallel Processing

• Faster processing times

Page 11: Capps programoninformationsciencebrownbag

Big Data Collection: Improving Survey Logistics & Cost


Improving Survey Collection and Imputation Operations (Adaptive Design)

1. Multi-modal data collection to reduce operational costs of data collection

– More effective use of existing data such as administrative records

– Incorporating new data into decennial operations

• Paradata from Internet Data Capture

• Information from Social Media Feeds

2. Edits and Imputations

3. Data Review

Page 12: Capps programoninformationsciencebrownbag

Big Data Collection: Evaluating Web Data as Inputs


Potential Internet Data Collection

1. Examine Google & Bing search frequency trend data

2. Examine “Web Scraping” of housing data, price data, local tax data, crime data, corporate profits etc.

3. Examine Twitter, and other social media trend data

Page 13: Capps programoninformationsciencebrownbag

Big Data Collection: Evaluating Commercial E-Transaction Input Data


1. Housing:

– Foreclosures: Use vendor data on new residential properties in foreclosure to aid analysis of data on new construction and sales.

– Building Permits: Web scrape opportunity to access local jurisdictions and state agencies posting public records online.

2. Construction:

– Difficulty obtaining electronic data from numerous state and local agencies

– Data are needed immediately to tabulate the monthly economic indicators.

3. Retail Sales: Evaluating electronic payment processing to fill data gaps such as geographical detail and revenue measures by firm size

– New data products

– Improvements to current data quality

Page 14: Capps programoninformationsciencebrownbag

Big Data Integration & Analysis: (Current processes)


Data Integration Expertise:

• Record linkage

– Gov’t Admin Records to other Gov’t Admin Records

– Gov’t Admin Records to Gov’t Surveys

– Commercial records to Gov’t Admin Records

• Model-based integration

– Small Area Poverty & Income Estimates

– Small Area Health & Income Estimates

– Longitudinal Economic & Housing Dynamics
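As a rough illustration of the record linkage idea listed on this slide, the sketch below matches administrative records to survey records. It is not the Census Bureau's matching software: the field names (`id`, `name`, `zip`), the ZIP-code blocking rule, the 0.85 threshold, and the sample records are all assumptions made up for the example, and production linkage typically uses probabilistic models rather than a single string-similarity score.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and replace punctuation with spaces so trivial differences do not block a match."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def link_records(admin, survey, threshold=0.85):
    """Return (admin_id, survey_id, score) for the best survey match of each admin record.

    Candidates are compared only within the same ZIP code ("blocking"),
    which keeps the number of pairwise comparisons manageable on large files.
    """
    by_zip = defaultdict(list)
    for rec in survey:
        by_zip[rec["zip"]].append(rec)

    links = []
    for a in admin:
        best = None
        for s in by_zip.get(a["zip"], []):
            score = SequenceMatcher(None, normalize(a["name"]), normalize(s["name"])).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (a["id"], s["id"], score)
        if best is not None:
            links.append(best)
    return links


# Hypothetical records, invented for illustration only.
admin = [
    {"id": "A1", "name": "Acme Hardware, Inc.", "zip": "02139"},
    {"id": "A2", "name": "Riverside Bakery", "zip": "20233"},
]
survey = [
    {"id": "S1", "name": "ACME Hardware Inc", "zip": "02139"},
    {"id": "S2", "name": "Riverside Bakery", "zip": "02139"},
]

links = link_records(admin, survey)
```

Here `A1` links to `S1` because the names agree after normalization, while `A2` stays unlinked because the only record with the same name sits in a different ZIP block. That is the usual trade-off of blocking: fewer comparisons at the cost of some missed matches.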

Page 15: Capps programoninformationsciencebrownbag

Big Data Integration & Analysis: Exploring “Now Casting”


Exploring “Now Casting” to improve Statistical Timeliness:

1. Some “real time” Internet data correlate with Official Statistics:

– Google search data modeled to match BLS unemployment & CDC flu spread

– Univ. of Michigan Twitter unemployment

– MIT Billion Prices Project match to BLS CPI

2. Census experiments with Gov’t Pension data
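The “now casting” idea above can be sketched as a simple regression: fit a timely indicator (such as a search-frequency index) against an official series for the months where both exist, then use the fit to estimate the current month before the official figure is published. This is only a minimal sketch under stated assumptions. The numbers are synthetic, the variable names are hypothetical, and real nowcasting models are considerably richer than a one-variable least-squares line.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic history (not real data): a search-frequency index and the
# official rate later published for the same months.
search_index = [40.0, 50.0, 60.0, 70.0]
official_rate = [6.0, 6.5, 7.0, 7.5]

slope, intercept = fit_line(search_index, official_rate)

# The current month's index is available immediately; the official rate is not.
current_index = 80.0
nowcast = slope * current_index + intercept
```

A model like this is only as good as the stability of the relationship it was fitted on, which is why such estimates need to be checked against the official series as it is released.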

Page 16: Capps programoninformationsciencebrownbag

Big Data Lab

1. Setting up an experimental Cluster

2. Testing performance of Hardware

3. Testing value of Software

– Open Source Big Data Software: Hadoop, Mahout, Distributed R, HBase, Pig, Hive, Cassandra, Mongo, Flume, Neo4j, I-Graph, AllegroGraph

– Internally Developed software: TEA, DataWeb, Matching software

Page 17: Capps programoninformationsciencebrownbag

On the Horizon, Development of Big Data Center

1. Proposal to create a new center that will include members from academia and Census staff to:

   1. Help lead Census Bureau work on practices to make sense of Big Data. Develop principles for applying Big Data to federal statistics.

   2. Facilitate the Census Bureau (CB) as an unbiased provider of information collected as Big Data

   3. Validate new techniques and data sources at a low cost (field staff allow us to do ground checks, survey questions)

   4. Lead on methods to integrate Big Data and develop standards

   5. The Center should provide a way to bring both faculty and graduate students to Census to facilitate Big Data capacity building at the Census Bureau

2. We will explore partnerships with others doing research in this area: universities and Silicon Valley.

Research, capacity building and economic Big Data Processing: