Managing Confidential Information – Trends and Approaches
DESCRIPTION
Personal information is ubiquitous, and it is becoming increasingly easy to link information to individuals. Laws, regulations, and policies governing information privacy are complex, but most intervene by restricting access or by anonymizing data at the time of publication. Trends in information collection and management (cloud storage, "big" data, and debates about the right to limit access to published but personal information) complicate data management and make traditional approaches to managing confidential data decreasingly effective. This session, presented as part of the Program on Information Science seminar series, examines trends in information privacy. It also discusses emerging approaches and research on managing confidential research information throughout its lifecycle.
TRANSCRIPT
Prepared for
MIT Libraries Program on Information Research Brown Bag Talk
September 2013
Managing Confidential Information – Trends and Approaches
Dr. Micah Altman <[email protected]>
Director of Research, MIT Libraries
Information Privacy Across the Research Lifecycle
Standard Disclaimer
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
• Privacy Tools for Sharing Research Data Team (Salil Vadhan, P.I.)
http://privacytools.seas.harvard.edu/people
• Research Support
Supported in part by NSF grant CNS-1237235
Related Work
Main Project:
• Privacy Tools for Sharing Research Data
http://privacytools.seas.harvard.edu/
Related publications:
• Novak, K., Altman, M., Broch, E., Carroll, J. M., Clemins, P. J., Fournier, D., Laevart, C., et al. (2011). Communicating Science and Engineering Data in the Information Age. Computer Science and Telecommunications. National Academies Press.
• Vadhan, S., et al. 2010. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections”. Available from: http://dataprivacylab.org/projects/irb/Vadhan.pdf
• Altman, M. (2012). “Mitigating Threats To Data Quality Throughout the Curation Lifecycle”. In G. Marciano, C. Lee, & H. Bowden (Eds.), Curating For Quality. datacuration.web.unc.edu
These slides & most reprints available from: informatics.mit.edu
Level Setting
Identifying Information Is Common
• Includes information from a variety of sources, such as…
– Research data, even if you aren’t the original collector
– Student “records” such as e-mail, grades
– Logs from web-servers, other systems
• Lots of things are potentially identifying:
– Under some federal laws: IP addresses, dates, zipcodes, …
– Birth date + zipcode + gender uniquely identify ~87% of people in the U.S. [Sweeney 2002]
Try it: http://aboutmyinfo.org/index.html
– With date and place of birth, can guess first five digits of social security number (SSN) > 60% of the time. (Can guess the whole thing in under 10 tries, for a significant minority of people.) [Acquisti & Gross 2009]
– Analysis of writing style or eclectic tastes has been used to identify individuals
• Tables, graphs and maps can also reveal identifiable information
[Brownstein, et al., 2006, NEJM 355(16)]
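The point about birth date + zipcode + gender can be checked on any dataset by counting how many records have a combination that no other record shares. The sketch below is illustrative only: the seven records are toy values, and the helper names unique_fraction and coarsen are made up for this example. It measures uniqueness on the full key and again after keeping only birth year and a three-digit zipcode prefix.

```python
from collections import Counter

# Toy records for illustration only: (birthdate MMDDYYYY, zipcode, gender).
RECORDS = [
    ("01011961", "02145", "M"),
    ("02021961", "02138", "M"),
    ("11111972", "94043", "M"),
    ("12121972", "94043", "M"),
    ("03251972", "94041", "F"),
    ("03251972", "02127", "F"),
    ("08081989", "02138", "F"),
]


def unique_fraction(keys):
    """Return the fraction of records whose key occurs exactly once."""
    keys = list(keys)
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


def coarsen(record):
    """Keep only the birth year and the first three zipcode digits."""
    birthdate, zipcode, gender = record
    return (birthdate[-4:], zipcode[:3], gender)


full = unique_fraction(RECORDS)
coarse = unique_fraction(coarsen(r) for r in RECORDS)
print(f"unique on full key: {full:.0%}; after coarsening: {coarse:.0%}")
```

Coarsening lowers the share of unique records but does not remove it, which is why later slides treat recoding as only one part of disclosure limitation.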
Some Sources of Confidentiality Restrictions for University Held Research and Education Information
• Overlapping laws
• Different laws apply to different cases
• Additional data usage agreements and license terms apply
Different Requirements and Definitions
• Coverage
– FERPA: Students in educational institutions
– HIPAA: Medical information in “covered entities”
– Common Rule: Living persons in research by funded institutions
– MA 201 CMR 17: Mass. residents
• Identification criteria
– FERPA: Direct; indirect; linked; bad intent (!)
– HIPAA: Direct; indirect; linked
– Common Rule: Direct; indirect; linked
– MA 201 CMR 17: Direct
• Sensitivity criteria
– FERPA: Any non-directory information
– HIPAA: Any medical information
– Common Rule: Private information, based on harm
– MA 201 CMR 17: Financial, state, federal identifiers
• Management requirements
– FERPA: Directory opt-out; [implied] good practice
– HIPAA: Consent; specific technical safeguards; breach notification
– Common Rule: Consent; [implied] risk minimization
– MA 201 CMR 17: Specific technical safeguards; breach notification
Recognized Benefits of Data Sharing
• Pioneering NRC report [Fienberg, et al. 1985] on data sharing recommended:
– Sharing data should be a regular practice.
– Investigators should share their data by the time of publication of initial major results of analyses of the data except in compelling circumstances.
– Data relevant to public policy should be shared as quickly and widely as possible.
– Plans for data sharing should be an integral part of a research plan whenever data sharing is feasible.
• Numerous subsequent reports recommend data sharing.
Private Information & Information Services
• Recommendations
• Annotations & Tagging
• Class discussion forum
• Social Highlighting
Access Control Model
[Figure: Access Control Model. Labels: Access Control, Client, Resource, Credentials, Authentication, Authorization, Request/Response, Auditing, Log, External Auditor, Resource Control Model]
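The elements named in the access control figure (credentials, authentication, authorization, request/response, auditing, log) can be sketched as a single request handler. This is a minimal illustration under stated assumptions, not a description of any system discussed in the talk: the user, password, resource, and permission table are hypothetical, and the unsalted SHA-256 hash stands in for the salted, slow password hash a real deployment would use.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical users, permissions, and resources, for illustration only.
USERS = {"alice": hashlib.sha256(b"correct horse").hexdigest()}
PERMISSIONS = {("alice", "survey.csv"): {"read"}}
RESOURCES = {"survey.csv": "id,answer\n1,yes\n"}
AUDIT_LOG = []  # stand-in for an append-only log an external auditor could review


def authenticate(user, password):
    """Check the presented credentials against the stored hash."""
    stored = USERS.get(user)
    offered = hashlib.sha256(password.encode()).hexdigest()
    return stored is not None and hmac.compare_digest(stored, offered)


def authorize(user, resource, action):
    """Check whether the user may perform the action on the resource."""
    return action in PERMISSIONS.get((user, resource), set())


def request(user, password, resource, action="read"):
    """Handle one request: authenticate, authorize, and log the decision."""
    allowed = authenticate(user, password) and authorize(user, resource, action)
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "resource": resource,
        "action": action,
        "allowed": allowed,
    })
    return RESOURCES[resource] if allowed else None
```

Every decision is logged, including denials, so the audit trail does not depend on the request having succeeded.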
Disclosure Limitation Data Input/Output Model
[Figure: DATA in, Published Outputs out. Output types shown:]
• Public use sample microdata, e.g. “* Jones * * 1961 021*”
• Summary statistics
• Contingency table
• Information Visualization
• Prose statements, e.g. “The correlation between X and Y was large and statistically significant”
Example
Exemplar: Social Media Analysis
Attribute Type - Examples
• Data: Structure - Network
• Data: Attribute Types - Continuous/Discrete; Scale: ratio/interval/ordinal/nominal
• Data: Performance Characteristics - 10M-1B observations; sample from stream of continuously updated corpus; dozens of dimensions/measures
• Measurement: Unit of Observation - Individuals; Interactions
• Measurement: Measurement type - Observational
• Measurement: Performance characteristic - High volume; complex network structure; sparsity; systematic and sparse metadata
• Management Constraints - License; Replication
• Analysis methods - Bespoke algorithms (clustering); nonlinear optimization; Bayesian methods
• Desired Outputs - Summary scalars (model coefficients); summary table; static/interactive visualization
More Information
• Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.
• King, Gary, Jennifer Pan, and Molly Roberts. "How censorship in China allows government criticism but silences collective expression." APSA 2012 Annual Meeting Paper. 2012.
• Lazer, David, et al. "Life in the network: the coming age of computational social science." Science (New York, NY) 323.5915 (2009): 721.
What’s wrong with this picture?
Name            SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones        12341  01011961   02145    M       Raspberry           0
B. Jones        12342  02021961   02138    M       Pistachio           0
C. Jones        12343  11111972   94043    M       Chocolate           0
D. Jones        12344  12121972   94043    M       Hazelnut            0
E. Jones        12345  03251972   94041    F       Lemon               0
F. Jones        12346  03251972   02127    F       Lemon               1
G. Jones        12347  08081989   02138    F       Peach               1
H. Smith        12348  01011973   63200    F       Lime                2
I. Smith        12349  02021973   63300    M       Mango               4
J. Smith        12350  02021973   63400    M       Coconut             16
K. Smith        12351  03031974   64500    M       Frog                32
L. Smith        12352  04041974   64600    M       Vanilla             64
M. Smith        12353  04041974   64700    F       Pumpkin             128
N. Smith-Jones  12354  04041974   64800    F       Allergic            256
Callouts:
• HIPAA & MA identifier
• Identifier & sensitive
• HIPAA identifier
• HIPAA identifier
• Sensitive
• Unexpected response?
• Mass. resident
• FERPA too?
• Californian
• Twins, separated at birth?
• Indirect identifier
Help, help, I’m being suppressed…
Name       SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
[Name 1]   12341  *1961      021*     M       Raspberry           .1
[Name 2]   12342  *1961      021*     M       Pistachio           -.1
[Name 3]   12343  *1972      940*     M       Chocolate           0
[Name 4]   12344  *1972      940*     M       Hazelnut            0
[Name 5]   12345  *1972      940*     F       Lemon               .6
[Name 6]   12346  *1972      021*     F       Lemon               .6
[Name 7]   12347  *1989      021*     *       Peach               64.6
[Name 8]   12348  *1973      632*     F       Lime                3
[Name 9]   12349  *1973      633*     M       Mango               3
[Name 10]  12350  *1973      634*     M       Coconut             37.2
[Name 11]  12351  *1974      645*     M       *                   37.2
[Name 12]  12352  *1974      646*     M       Vanilla             37.2
[Name 13]  12353  *1974      647*     F       *                   64.4
[Name 14]  12354  *1974      648*     F       Allergic            256
Labels on the slide: Row; Var; Synthetic; Global Recode; Local Suppression; Aggregation + Perturbation
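Two of the techniques labeled on this slide, global recoding and local suppression, are simple enough to sketch directly. The following is a minimal illustration under assumptions: the function names and the set of "rare" values are hypothetical, chosen so that one row of the original table comes out looking like its masked counterpart. Perturbation of the crime counts is not shown.

```python
def global_recode(record):
    """Coarsen birthdate to its year and zipcode to its first three digits."""
    out = dict(record)
    out["birthdate"] = "*" + record["birthdate"][-4:]
    out["zipcode"] = record["zipcode"][:3] + "*"
    return out


def local_suppress(record, field, rare_values):
    """Blank a single cell when its value is rare enough to stand out."""
    out = dict(record)
    if out[field] in rare_values:
        out[field] = "*"
    return out


row = {"name": "K. Smith", "birthdate": "03031974", "zipcode": "64500",
       "gender": "M", "ice_cream": "Frog"}

# Assumed for illustration: "Frog" and "Pumpkin" are treated as rare values.
masked = local_suppress(global_recode(row), "ice_cream", {"Frog", "Pumpkin"})
masked["name"] = "[Name 11]"  # replace the direct identifier with a placeholder
print(masked)
```

Global recoding applies the same rule to every row of a variable, while local suppression touches only the individual cells that would otherwise stand out.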
k-anonymous – but not protected
Name     SSN  Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
* Jones  *    * 1961     021*     M       Raspberry           0
* Jones  *    * 1961     021*     M       Pistachio           0
* Jones  *    * 1972     9404*    *       Chocolate           0
* Jones  *    * 1972     9404*    *       Hazelnut            0
* Jones  *    * 1972     9404*    *       Lemon               0
* Jones  *    *          021*     F       Lemon               1
* Jones  *    *          021*     F       Peach               1
* Smith  *    * 1973     63*      *       Lime                2
* Smith  *    * 1973     63*      *       Mango               4
* Smith  *    * 1973     63*      *       Coconut             16
* Smith  *    * 1974     64*      M       Frog                32
* Smith  *    * 1974     64*      M       Vanilla             64
* Smith  *    04041974   64*      F       Pumpkin             128
* Smith  *    04041974   64*      F       Allergic            256
Law, policy, ethics
Research design …
Information security
Disclosure limitation
Additional background
Homogeneity
Sort Order/Structure
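The "k-anonymous but not protected" point can be made concrete by grouping the table's rows on their quasi-identifiers and inspecting the sensitive values inside each group. The sketch below is a minimal illustration: it assumes surname, birthdate, zipcode, and gender are the quasi-identifiers and that the crime count is the sensitive attribute, and the function name equivalence_classes is made up for this example.

```python
from collections import defaultdict

# Rows from the k-anonymous table: (surname, birthdate, zipcode, gender, crimes).
ROWS = [
    ("Jones", "*1961", "021*", "M", 0),
    ("Jones", "*1961", "021*", "M", 0),
    ("Jones", "*1972", "9404*", "*", 0),
    ("Jones", "*1972", "9404*", "*", 0),
    ("Jones", "*1972", "9404*", "*", 0),
    ("Jones", "*", "021*", "F", 1),
    ("Jones", "*", "021*", "F", 1),
    ("Smith", "*1973", "63*", "*", 2),
    ("Smith", "*1973", "63*", "*", 4),
    ("Smith", "*1973", "63*", "*", 16),
    ("Smith", "*1974", "64*", "M", 32),
    ("Smith", "*1974", "64*", "M", 64),
    ("Smith", "04041974", "64*", "F", 128),
    ("Smith", "04041974", "64*", "F", 256),
]


def equivalence_classes(rows):
    """Group the sensitive values by the quasi-identifier columns."""
    classes = defaultdict(list)
    for *quasi_id, sensitive in rows:
        classes[tuple(quasi_id)].append(sensitive)
    return classes


classes = equivalence_classes(ROWS)

# k is the size of the smallest group sharing one quasi-identifier combination.
k = min(len(values) for values in classes.values())

# A group is homogeneous when every member has the same sensitive value,
# so knowing someone is in the group reveals that value despite k >= 2.
homogeneous = [qi for qi, values in classes.items() if len(set(values)) == 1]

print(f"k = {k}; homogeneous groups: {len(homogeneous)} of {len(classes)}")
```

Under these assumptions the table satisfies k-anonymity with k = 2, yet half of its groups disclose the crime count of every member.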
Climate
Commercial Data Breaches
• Data from 100 million individuals exposed this year…
• Only a portion of breaches are reported
• Difficult to trace impacts… but estimated 8.3M identity thefts in 2005
Source: http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
Cloud computing risks
• Cloud computing decouples physical and computing infrastructure
• Increasingly used for core-IT, research computing, data collection, storage, and analysis
• Confidentiality issues
– Auditing and compliance
– Access and commingling of data
– Location of data and services and legal jurisdiction
– Vulnerabilities of network communication using single well-known key
– Vulnerability of key storage
Legal & Cultural Challenges
• EU right to be forgotten; French “le droit à l'oubli”; California social media privacy act
• Consumer privacy bill of rights; Do not track; Privacy Icons
• Evolving case law on locational privacy
• Public records, mug shots, and revenge porn
• State-level action on privacy regulation
• Attitudes towards sharing; surveillance
New Data – New Challenges
• How to limit disclosure without completely destroying utility?
– The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
– The “GIS Problem”: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman 2008]
– The “Facebook Problem”: masked network data can be identified if only a few nodes are controlled [Backstrom, et al. 2007]
– The “Blog Problem”: pseudonymous communication can be linked through textual analysis [Tomkins, et al. 2004]
[For more examples see Vadhan, et al. 2010]
Source: [Calabrese 2008; Real Time Rome Project 2007]
Weather
Possible Legal/Regulatory Changes for 2013-15
• Likely
– New information privacy laws in selected states
– Increased open data requirements from federal funders
– Adoption of data availability requirements by increasing numbers of journals
Law, policy, ethics
Research design …
Information security
Disclosure limitation
Research
Traditional approaches are failing
• Modal traditional approach:
– removing subjects’ names
– storing descriptive information in a locked filing cabinet
– publishing summary tables
– (sometimes) releasing a public use version with descriptive information suppressed and recoded
• Problems
– law is changing; requirements are becoming more complex
– research computing is moving towards the cloud and other distributed storage
– researchers are using new forms of data that create new privacy issues
– advances in the formal analysis of disclosure risk imply the impracticality of “de-identification” as required by law
A National Science Foundation Secure and Trustworthy Cyberspace Project
Supported by award #1237235
Privacy Tools for Sharing Research Data
The Dataverse Network will Distribute and Manage Confidential Databases
Policy tools Guide Information Management Across the Research Lifecycle
Differentially Private Algorithms Shield Individuals in Databases
Approaches
• Policy
– Legal Reforms
– Information Accountability
– Economic rights
– Information transparency
  • Aboutmydata.com
– Privacy Nudges
– Privacy Icons
• Cryptography
– Multiparty computation
– Zero knowledge protocols
– Functional encryption
– Homomorphic encryption
• Statistics
– Synthetic data
– Reidentification risk
– K-anonymity; homogeneity
– Differential privacy
• Information Lifecycle & Infrastructure
– Open consent
– Metadata frameworks
– Information accountability
– Policy aware filesystems
  • iRODS
– Data Vaults
  • Project VRM
– Secure data enclave
– Standardized Data Use Agreements
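Of the statistical approaches listed, differential privacy is the easiest to show in a few lines. The sketch below uses the standard Laplace mechanism for a counting query; it is an illustration under assumptions, not the project's implementation. The function names, the epsilon value, and the toy column of crime counts are chosen for this example only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(2013)  # seeded only so the sketch is repeatable
crimes = [0, 0, 0, 0, 0, 1, 1, 2, 4, 16, 32, 64, 128, 256]  # toy column
noisy = private_count(crimes, lambda c: c > 0, epsilon=0.5, rng=rng)
print(f"noisy count of people with at least one crime: {noisy:.1f}")
```

Smaller epsilon values give stronger protection and noisier answers, which is the privacy and utility tradeoff that recurs throughout the talk.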
Recent Work – Economics & Public Policy Research/Outreach
• March 2013 – Dwork & Vadhan lead roundtable in Differential Privacy and Law and Policy (conference), Cardozo Law School
• March 2013 – Altman provided oral comments (recorded) on Public Workshop on Revisions to the Common Rule, National Academies, on limits of HIPAA approach to privacy.
• May 2013 – Altman & Crosas submitted written testimony to Public Access to Federally-Supported Research and Development Data, National Academies; including approaches to management of privacy for data sharing.
• June 2013 – Dwork, Sweeney, & Vadhan invited & participated in Privacy Law Scholars Conference, George Washington Law School/Berkeley Law School
• June 2013 – Yiling Chen, Stephen Chong, Ian Kash, Tal Moran, and Salil Vadhan. “Truthful Mechanisms for Agents that Value Privacy”, Proceedings of the 14th ACM Conference on Electronic Commerce (EC), June 2013.
• September – Integrating Approaches to Privacy across the Research Lifecycle Workshop
• In Progress – Rewrite and expansion of Vadhan, S., et al. 2010, “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections”, proposing a framework for integrating modern privacy concepts into Human Subjects protections.
Information Life Cycle Model
Creation/Collection
Storage/Ingest
Processing
Internal Sharing
Analysis
External dissemination/publication
Re-use
• Scientometric
• Education
• Scientific
• Policy
Long-term access
Research methods
Data Management Systems
Legal / Policy Frameworks
Statistical / Computational Frameworks
Example: Stakeholder Concerns Across Lifecycle
Stakeholder: Concerns
• Research sources (research subjects; owners of subject material; owners of supplementary data): privacy; confidentiality; intellectual property
• Research sponsors (home institution; funding sources): replicable research; policy relevance; accessibility of research; protect IP; avoid third party IP/privacy issues
• Project personnel (investigators; research staff): replicable research; publish; promote use of publications; track use
• Research publishers (print publishers; research archives): replicable research; promote use of their publications; protect publisher IP; avoid third party IP/privacy issues
• Research consumers (readers; secondary researchers): replicate and extend; secondary analysis; link research
Legal issues at information transfer (in slide order):
• Licensing; Copyright; DMCA; Informed Consent; Privacy; Trade secrets
• Licensing; Freedom of Information; Copyright
• Copyright
• Copyright; Licensing
• Fair Use
Modeling Features
Features Characteristics
Data - Structure; Source; Unit of observation; Attribute types; Dimensionality; Number of observations; homogeneity; frequency of updates; quality characteristics
Analytic Results - Form of output; analysis methodology; analysis/inferential goal; utility/loss/quality
Disclosure scenario - Source of threat; areas of vulnerability; attacker objectives, background knowledge, capability; breach criteria/disclosure concept
Stakeholders - Stakeholder types; capacities; trust relationships; budgets
Lifecycle characteristics - Lifecycle stages controlled/in scope; policies used; stakeholders involved at each stage
Current privacy management approach - Regulation/policy; legal controls; statistical/computational disclosure methods; information security controls
Legal/Policy Frameworks
[Figure: diagram with regions labeled Contract, Intellectual Property, Access Rights, and Confidentiality. Labels within the diagram:]
Copyright
Fair Use
DMCA
Database Rights
Moral Rights
Intellectual Attribution
Trade Secret
Patent
Trademark
Common Rule 45 CFR 46
HIPAA
FERPA
EU Privacy Directive
Privacy Torts (Invasion, Defamation)
Rights of Publicity
Sensitive but Unclassified
Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …)
Classified
FOIA
CIPSEA
State Privacy Laws
EAR
State FOI Laws
Journal Replication Requirements
Funder Open Access
Contract
License
Click-Wrap TOU
ITAR
Export Restrictions
Risk Assessment
• [NIST 800-100, simplification of NIST 800-30]
Law, policy, ethics
Research design …
Information security
Disclosure limitation
System Analysis
Threat Modeling
Vulnerability Identification
Analysis
- likelihood
- impact
- mitigating controls
Institute Selected Controls
Testing and Auditing
Information Security Control Selection Process
Systems Policy Research questions deriving from Information Lifecycle Analysis
• Infrastructure requirements analysis
– Data acquisition, storage, dissemination
– Identification, authorization, authentication
– Metadata, protocols
• System design: potential implementation cost of interactive privacy:
– Information security: hardening
– Information security: certification & auditing
– Model server development, provisioning, maintenance, reliability, availability
• System design: information security tradeoffs of interactive privacy mechanisms:
– Availability risks: denial of service attack
– Availability/integrity risks: privacy budget exhaustion attacks
– Integrity risks: modification of delivered results (e.g. man-in-the-middle attacks)
– Secrecy/privacy: breach of authentication/authorization layer
• System design: optimizing privacy & utility across lifecycle
– When does limiting disclosive data collection dominate methods at the data analysis stage?
– When do restricted virtual data enclaves + public synthetic data dominate interactive mechanisms?
• System design: Information use/reuse
– Support of scientific analysis use cases (model diagnostics, exploratory data analysis, integration of external data) within interactive privacy systems.
– Align informational assumptions across stages & incorporate informative priors?
– Requirements for scientific replication/verification of results produced by model servers?
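The privacy budget exhaustion risk named above arises because an interactive mechanism must stop answering once its cumulative epsilon is spent, so one heavy user can deny service to everyone else. The sketch below is a hypothetical illustration: the PrivacyBudget class and its per-user cap are assumptions introduced here as one possible mitigation, not a design from the talk, and it uses only basic sequential composition (epsilons add up).

```python
class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon, per_user_cap):
        self.total = total_epsilon        # budget for the whole dataset
        self.per_user_cap = per_user_cap  # assumed mitigation: limit per user
        self.spent = 0.0
        self.spent_by_user = {}

    def charge(self, user, epsilon):
        """Record a query's cost, or refuse it if a limit would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        used = self.spent_by_user.get(user, 0.0)
        if used + epsilon > self.per_user_cap:
            return False  # this user has used up their share
        if self.spent + epsilon > self.total:
            return False  # the dataset's overall budget is exhausted
        self.spent_by_user[user] = used + epsilon
        self.spent += epsilon
        return True


budget = PrivacyBudget(total_epsilon=1.0, per_user_cap=0.5)
print(budget.charge("analyst_a", 0.25))  # accepted while under both limits
```

A per-user cap only helps if users are authenticated, which ties this risk back to the authentication/authorization layer listed on the same slide.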
Legal Policy Research questions deriving from Information Lifecycle Analysis
• Legal requirements across lifecycle stages
• Legal instruments: capturing scientific privacy concepts in legal instruments consistently across lifecycle
– service level agreements
– consent terms
– deposit agreement
– data usage agreements
– regulatory language
Public Policy Research Questions
• Where does the market fail for sharing confidential research data?
– What market conditions are theoretically violated?
– What is the empirical evidence of the degree of violation?
– How does the degree of violation vary by policy context & use case?
• Policy equilibria
– What are contribution and privacy equilibria for data sharing under different privacy concepts?
• Interventions
– How do proposed interventions (e.g. advise & consent; “privacy icons”; uniform regulations; breach notification; information accountability; anonymization) correspond to sources of market failures?
Beyond Legal Research: Market Theory
• Conditions on markets
– No political/legal distortions [See, e.g., Posner 1978]
– Common knowledge
– No barriers to entry
• Conditions on agents [See, e.g., Acquisti 2010; Tsai, Egelman, Cranor & Acquisti 2010]
– Perfect rationality
– Self-interested
– Infinitely many agents
– Stable preferences
• Conditions on goods
– Consumptive goods
– Excludable goods
– Decreasing returns to scale
– Transferability
– No externalities
• Conditions on exchange [See, e.g., Benisch, Kelley, Sadeh, & Cranor 2011; McDonald & Cranor 2010]
– No transaction costs
– No information asymmetries
• Conditions on equilibrium valuation
– Pareto optimality vs. economic surplus
– Ignorability of distributional concern
Private Goods
• Excludable
• Consumable
• No externalities
Commons
• Non-excludable
• Consumable
• Negative externalities
Public Good
• Non-excludable
• Non-consumable
• Positive externalities
Toll Good
• Partially non-excludable
• Non-consumable
• Positive externalities
Bibliography (Selected)
• L. Willenborg and T. D. Waal. Elements of Statistical Disclosure Control, volume 155 of Lecture Notes in Statistics. Springer Verlag, New York, NY, 2001.
• Higgins, Sarah. "The DCC curation lifecycle model." International Journal of Digital Curation 3.1 (2008): 134-140. www.dcc.ac.uk/resources/curation-lifecycle-model
• ESSNET, Handbook on Statistical Disclosure Control. 2011. neon.vb.cbs.nl/casc/SDC_Handbook.pdf
• Fung, Benjamin, et al. "Privacy-preserving data publishing: A survey of recent developments." ACM Computing Surveys (CSUR) 42.4 (2010): 14.
• Altman, M. (2012). “Mitigating Threats To Data Quality Throughout the Curation Lifecycle”. In G. Marciano, C. Lee, & H. Bowden (Eds.), Curating For Quality. datacuration.web.unc.edu
Questions?
E-mail: [email protected]
informatics.mit.edu