orchestrating collective intelligence
Posted on 15-Apr-2017
TRANSCRIPT
@josephreisinger @premisedata
WHAT PREMISE MEASURES
Bringing visibility to the world’s hardest-to-see places. 130 cities, 30 countries.
Modernizing Economic Measurement
“I have been constantly surprised at how little quantitative information can be brought to bear on fundamental policy questions [...] This experience illustrates the need for flexibility in data collection, especially when policymakers consider extending new policies or need to evaluate them in real time for other reasons. Ideally, some sort of ‘rapid response’ data gathering capacity.”
— Alan Krueger, “Stress Testing Economic Data”
“The collection of statistics needs to be modernized; it is time to use the new technologies to start collecting data.
…particularly important in developing countries where the prevalence of mobile phones now offers an unprecedented opportunity to measure the economy.”
— Diane Coyle, “GDP”
OMGWTFGDP
“However, at this moment in survey research, uncertainty reigns. Participation rates in household surveys are declining throughout the developed world. Surveys seeking high response rates are experiencing crippling cost inflation. Traditional sampling frames that have been serviceable for decades are fraying at the edges.”
— Robert Groves, “Three Eras of Survey Research”
Orchestrating Collective Intelligence
PREMISE APP
Directed on-the-ground data acquisition
Crowdsourcing vs Orchestration
[Diagram, built up across frames: under both crowdsourcing and orchestration, a survey is decomposed into tasks and distributed to workers]
PLATFORM
[Diagram: end-user → survey campaign → allocation → quality control → analytics, with data contributors feeding the loop]
- The end user poses a question that is best answered via actual, on-the-ground observation at scale.
- The question is translated into an internal "specification" of the data points needed to answer it: type, location, frequency, coverage, etc.
- The inventory of data points is automatically allocated to the data-contributor pool, taking into account budget, agent profiles, and geography. Data points are dynamically priced.
- Contributors collect data in the field using Android phones, which is sent back to the Premise network.
- QC is a mix of automated (outlier detection, machine learning, computer vision) and manual (directed sampling using oDesk) checks.
- Analytics: automated capabilities to explore the data and expose trends or patterns, hypothesize new features to explain variation, suggest specification refinements, and improve automated verification.
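The automated QC stage is described as a mix of outlier detection, machine learning, and computer vision. Premise's actual pipeline is not public, but a minimal sketch of the simplest layer, a robust median/MAD outlier flag over repeated observations of the same measurable, might look like this (function name and data are illustrative):

```python
import statistics

def flag_outliers(observations, threshold=3.5):
    """Flag observations far from the median using the modified z-score
    (median absolute deviation), which is robust to the outliers themselves."""
    med = statistics.median(observations)
    mad = statistics.median(abs(x - med) for x in observations)
    if mad == 0:
        # All observations agree; nothing to flag.
        return [False] * len(observations)
    # 0.6745 scales MAD to be comparable to a standard deviation under normality.
    return [abs(0.6745 * (x - med) / mad) > threshold for x in observations]

# A batch of reported prices for the same item; one entry is suspicious.
prices = [10.2, 9.8, 10.5, 10.1, 42.0, 9.9]
flags = flag_outliers(prices)  # only the 42.0 observation is flagged
```

A median-based rule is preferable to a mean/stddev rule here because a single fraudulent report can drag the mean toward itself and mask its own anomaly.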
Resource Scarcity and Access Risk
Average wait times are ~10 minutes longer in Maracaibo than in Caracas.
Police are present ~80% of the time in Maracaibo, but only 30-40% of the time in Caracas.
Machine Learning
[The platform slide repeats, highlighting in turn the allocation, analytics, and quality control components as the places where machine learning is applied.]
Optimizing Task Allocation
TASKS
CAMPAIGN DEFINITION
[Diagram, built up across frames: a grid of locations × measurables, partitioned into survey period 1 and survey period 2]
TASK ALLOCATION
[Diagram, built up across frames: survey period 1 tasks assigned to user 1, user 2, and user 3 over allocation periods 1, 2, and 3]
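The campaign-definition step can be sketched minimally: a campaign is a grid of locations × measurables per survey period, and each cell of that grid becomes one allocatable task. This is an illustrative reconstruction, not Premise's actual schema:

```python
from itertools import product

def expand_campaign(locations, measurables, survey_periods):
    """Expand a campaign definition into its task inventory: one task per
    (location, measurable, survey period) cell of the grid."""
    return [
        {"location": loc, "measurable": m, "period": p}
        for loc, m, p in product(locations, measurables, survey_periods)
    ]

tasks = expand_campaign(
    locations=["market_a", "market_b"],
    measurables=["rice_price", "fuel_price"],
    survey_periods=[1, 2],
)
# 2 locations x 2 measurables x 2 periods = 8 tasks
```

Each task in the inventory would then be priced and offered to contributors during its survey period's allocation rounds.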
TASK COMPLETION RATE MODEL
[Plot: payout vs predicted task completion rate (pTCR), annotated with "uptake risk"]
Model features: user-history, task-history / location-history, task-user, location-user
Issues: data sparsity in marginal vs conditional estimates; uptake counterfactuals (non-iid sampling); path dependence / lock-in
Linear functional model
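The slide names a linear functional model of predicted task completion rate (pTCR) over user-history, task/location-history, and interaction features. A hedged sketch follows; the logistic link and every feature name and weight below are assumptions for illustration, not Premise's actual model:

```python
import math

def p_tcr(weights, features):
    """Predicted task completion rate (pTCR): a logistic link over a linear
    combination of features. The linear functional form follows the slide;
    the link choice and feature set are illustrative assumptions."""
    z = sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights over the feature families named on the slide:
weights = {
    "user_past_completion_rate": 2.0,  # user-history
    "task_historical_uptake": 1.5,     # task-history / location-history
    "user_task_affinity": 1.0,         # task-user interaction
    "log_payout": 0.8,                 # incentive term driving uptake
    "bias": -2.0,
}

features = {
    "user_past_completion_rate": 0.9,
    "task_historical_uptake": 0.4,
    "user_task_affinity": 0.2,
    "log_payout": math.log(1.5),
    "bias": 1.0,
}

rate = p_tcr(weights, features)  # a probability in (0, 1)
```

With a positive payout weight, raising the payout raises pTCR, which is how a scheduler could trade payout against "uptake risk" when allocating tasks.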
Exploration vs Survey Consistency
[Diagram: the locations × measurables grid across period 1 and period 2]
TASK REFINEMENT
ITERATIVE LOCATION DISCOVERY
Exploration vs Survey Consistency
- Campaign layers: separate discovery and survey
- Iteratively refine attribute and geospatial targeting
- Monitor correlation in item responses and appearance of new attributes
- Monitor residual endogeneity
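The first bullet above separates a discovery layer from the survey layer. One minimal way to realize that split, assuming a fixed core panel of locations for cross-period consistency and a per-period discovery budget (all names below are illustrative):

```python
import random

def allocate_locations(core_panel, candidate_pool, n_discovery, seed=0):
    """Split a period's location budget into two campaign layers: the core
    survey panel is kept fixed across periods for consistency, while a
    small discovery layer samples new candidate locations each period."""
    rng = random.Random(seed)
    pool = [loc for loc in candidate_pool if loc not in core_panel]
    discovery = rng.sample(pool, min(n_discovery, len(pool)))
    return {"survey": list(core_panel), "discovery": discovery}

period = allocate_locations(
    core_panel=["market_a", "market_b"],
    candidate_pool=["market_a", "market_c", "market_d", "market_e"],
    n_discovery=2,
)
```

Locations that perform well in the discovery layer could then be promoted into the core panel in later periods, which is the iterative refinement the bullets describe.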
Fraud and Coalition Formation
Coalitions vs Referrals
- Referrals are necessary to reach the most remote areas
- However, we need to be able to partition the Premise graph into independent subnetworks, e.g. for re-evaluation, experimentation, and sample stratification
CONTRIBUTOR AFFINITY MODEL
Model features:
direct referral, account features, upload location, visit histories, geographic area, response correlation
Issues: bootstrapping affinity scores for new users; the optimal scheduler is antagonistic to coalition discovery
Sampling from Large Graphs [Leskovec & Faloutsos; 2006]
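Partitioning the Premise graph into independent subnetworks can be sketched as connected components over thresholded affinity edges, here via union-find. The affinity scores would come from a model over the features listed above; everything below is an illustrative reconstruction:

```python
def partition_contributors(contributors, affinity_edges, threshold=0.5):
    """Partition contributors into independent subnetworks: connect two
    contributors when their affinity score (e.g. from referrals, shared
    upload locations, or response correlation) exceeds the threshold,
    then take connected components via union-find."""
    parent = {c: c for c in contributors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in affinity_edges:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for c in contributors:
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())

components = partition_contributors(
    contributors=["u1", "u2", "u3", "u4"],
    affinity_edges=[("u1", "u2", 0.9), ("u2", "u3", 0.2), ("u3", "u4", 0.8)],
)
# The weak u2-u3 edge is dropped, so {u1, u2} and {u3, u4} end up
# in separate subnetworks.
```

Components that survive the threshold can then be treated as units for re-evaluation, experimentation, or stratified sampling, since low-affinity pairs are assumed independent.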
RECAP
- Orchestrating collective intelligence
- Optimizing task allocation via dynamic scheduling and incentives
- Exploration and discovery while maintaining survey consistency
- Fraud and coalition formation in networks
QUESTIONS?
instagram/premisedata
(all images in this talk)
joe@premise.com | @josephreisinger
[QC pipeline diagram: proof → automated QC → manual QC → revalidation]
“The problem of changing statistics is that you lose the ability to compare across time. The longer the time-series, the harder it is to change it, but you want to be able to compare. How do you replace GDP? And if you do, you lose the past sixty years of relevance. This has been a problem for centuries—take the Spanish silver trade. Anything you measure will become increasingly irrelevant over time.”
— Hans Rosling
[Zachary Karabell, The Leading Indicators]
“You need to focus on quality. You’ll be better off with a small but carefully structured sample rather than a large sloppy sample.”
— Hal Varian, Google
“Big Data is bullshit”
— Harper Reed
Big Data, n.: the belief that any sufficiently large pile of shit contains a pony with probability approaching one
—@grimmelm
“dividing by bieber”