Poster: Hye-Chung Kum, PhD, Darshana Pathak, Gautam Sanka
TRANSCRIPT
THREE MODELS OF DATA ACCESS
1. DECOUPLED DATA SYSTEM
• A simple but powerful data system for Privacy
Preserving Interactive Record Linkage.
• Decouples (i.e., isolates) sensitive data (SD) from the
personally identifying information (PII).
• Provides both error management in data
integration and privacy protection by blocking
attribute disclosure and minimizing identity disclosure.
4 STEPS IN DECOUPLING DATA
1. Split the data set into two tables: one for the
identifying information (PII) and the other for the
remaining, mostly sensitive, information (SD):
$T = T_{PII} + T_{SD}$
2. Shuffling: Randomly shuffle the rows in the PII table, $T_{PII}$.
3. Chaffing: Add fake rows of PII to $T_{PII}$.
4. Encryption: Apply asymmetric encryption to lock
the row association between $T_{PII}$ and $T_{SD}$.
2. MINIMUM INFORMATION SHARING
Information Suppression during clerical review
Identity disclosure without sensitive attribute
disclosure has little potential for harm.
3. CHAFFING & UNIVERSE MANIPULATION
We evaluate three methods for
information disclosure:
1. Chaffing
2. Manipulation of universe
2.1 Fabrication
2.2 Non-disclosure
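The four decoupling steps (split, shuffle, chaff, lock the association) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function name `decouple`, the field names, and the random row identifiers are hypothetical, and the `link_key` dictionary is only a stand-in for the asymmetric encryption step, which the poster does not specify in detail. In a real system the association would be encrypted rather than stored in plain form.

```python
import random
import secrets


def decouple(records, pii_fields, fake_pii_rows, rng=None):
    """Split records into a PII table and an SD table (illustrative only)."""
    rng = rng or random.Random()
    pii_table = []   # identifying information only
    sd_table = {}    # sensitive data, keyed by a random row id
    link_key = {}    # STAND-IN for the encrypted PII-to-SD row association

    for rec in records:
        pii_id = secrets.token_hex(8)
        sd_id = secrets.token_hex(8)
        # Step 1: split each record into a PII part and an SD part.
        pii_row = {f: rec[f] for f in pii_fields}
        pii_row["pii_id"] = pii_id
        pii_table.append(pii_row)
        sd_table[sd_id] = {f: v for f, v in rec.items() if f not in pii_fields}
        # Step 4 (placeholder): record the association separately from both
        # tables. A real system would lock this with asymmetric encryption.
        link_key[pii_id] = sd_id

    # Step 3: chaff the PII table with fake rows that have no SD counterpart.
    for fake in fake_pii_rows:
        pii_table.append({**fake, "pii_id": secrets.token_hex(8)})

    # Step 2: shuffle the PII rows. The shuffle is done after chaffing in this
    # sketch so that the fake rows do not all sit at the end of the table.
    rng.shuffle(pii_table)
    return pii_table, sd_table, link_key
```

A reviewer who sees only `pii_table` can compare names and birth dates but cannot tell which rows are real or which sensitive attributes belong to them; only the holder of `link_key` can rejoin the two tables.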
• Why is record linkage (RL) important?
There is a constant need for record linkage to create
a coherent "Big Data" system from data originating
in heterogeneous, uncoordinated systems.
• Why is record linkage challenging?
Redundant and fragmented datasets are split over
multiple systems. Missing and erroneous attribute
values, with no unique, error-free identifiers, require
approximate record linkage, which results in errors
from false or uncertain matches [3,4].
• What is Privacy Preserving Record Linkage?
Identifying the records in one or more datasets that
represent the same real-world entity without
compromising the privacy of the subjects involved [5,8].
• What is Interactive Record Linkage?
Record linkage in which people tune and manage the
false matches from approximate record linkage
algorithms. We define the properly tuned output of
a hybrid human-machine data integration system as
high quality record linkage [7].
ABSTRACT
Ambiguous links must be manually reviewed during
approximate record linkage to enable accurate data
integration. This requirement would seem to make it
impossible for researchers to protect patients’ privacy
when integrating health informatics data. To address
this problem, we propose a novel decoupled data
system that blocks sensitive attribute disclosure via
encryption and chaffing. We also evaluate three
methods—Chaffing, Display control for clerical review
and Manipulation of universe around the data—that can
minimize identity disclosure.
INTRODUCTION
PRIVACY PRESERVING RECORD LINKAGE
• First generation: Hash-based exact match (2003) [1].
• Second generation: Improves the quality of linkages by
allowing approximate matching, using privacy preserving
approximate string comparison operations such as
Bloom filters (2009) [6].
• Third generation [our model]: High quality RL using a
hybrid human-machine data integration system for
privacy preserving interactive record linkage (2012) [5].
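The difference between the first two generations above can be sketched in Python. This is a simplified illustration, not the exact scheme of the cited papers: the function names, the filter size of 1009 bits, the 20 hash functions, and the use of HMAC with SHA-1 and SHA-256 for double hashing are all assumptions chosen for the example. A keyed hash matches only identical values, while a Bloom filter built from character bigrams lets two parties compare encoded names approximately with the Dice coefficient.

```python
import hashlib
import hmac


def keyed_hash(value, key):
    """First generation: exact match on a keyed hash of the value."""
    return hmac.new(key, value.lower().encode(), hashlib.sha256).hexdigest()


def bigrams(value):
    """Character bigrams of the padded, lower-cased value."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, key, size=1009, num_hashes=20):
    """Second generation: set of bit positions of a bigram Bloom filter."""
    bits = set()
    for gram in bigrams(value):
        h1 = int(hmac.new(key, gram.encode(), hashlib.sha1).hexdigest(), 16)
        h2 = int(hmac.new(key, gram.encode(), hashlib.sha256).hexdigest(), 16)
        # Double hashing; size is prime and step is non-zero, so each bigram
        # sets num_hashes distinct bit positions.
        step = h2 % (size - 1) + 1
        for i in range(num_hashes):
            bits.add((h1 + i * step) % size)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient between two sets of bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))
```

With a shared secret key, `keyed_hash("Smith", key)` and `keyed_hash("Smyth", key)` differ, so an exact-match scheme misses the typo, whereas `dice(bloom_encode("Smith", key), bloom_encode("Smyth", key))` stays well above the score for an unrelated name such as "Jones".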
PROBLEM STATEMENT: PRIVACY PRESERVING INTERACTIVE RECORD LINKAGE (PPIRL)
Decoupled Information System for
Privacy Preserving Interactive Record Linkage
A tractable computational model for privacy preserving
interactive record linkage (PPIRL) focusing on protection
against attribute disclosure.
Three techniques SDLink utilizes for privacy protection:
1. Strict decoupling via TPM – Trusted Platform Module
based encryption (pseudonym method)
2. Minimum information sharing during human
interaction via information suppression.
3. Chaffing – adding fake data to block attribute
inference from group membership
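The second technique, minimum information sharing, can be sketched as a display filter for clerical review. This is a hypothetical illustration rather than the SDLink implementation: the function `review_view`, the mask string, and the field names are assumptions. The idea is that the reviewer sees only the fields in an allowed set (the minimal information needed for a linkage decision), while every other field is suppressed.

```python
def review_view(rec_a, rec_b, allowed_fields, mask="***"):
    """Build a side-by-side view of a candidate pair for a human reviewer.

    Only fields listed in allowed_fields are shown; all others are masked.
    """
    view = []
    for field in sorted(set(rec_a) | set(rec_b)):
        if field in allowed_fields:
            # Missing values show up as None so the reviewer can see the gap.
            a, b = rec_a.get(field), rec_b.get(field)
        else:
            a = b = mask
        view.append((field, a, b))
    return view
```

For example, calling `review_view(a, b, {"name", "dob"})` on two records that also carry a diagnosis shows the names and birth dates side by side but displays the diagnosis as masked for both records.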
[Figure: approximate record linkage with a human in the loop to resolve ambiguous links, and the resulting threat of sensitive attribute disclosure]
Let
$I_{PPIRL}$ = the category of information $I$ in the Minimal
Sharing model;
$h$ = a person manually tuning the false matches;
$\alpha, \varepsilon$ = respective error terms;
such that
• $\mathrm{InteractiveRL}(h, \alpha)$ is the minimum amount of
information the person $h$ needs to make linkage decisions
with high confidence;
• $\mathrm{Disclosure}(h, \varepsilon)$ is the level of information disclosed to
the honest-but-curious user $h$.
Then Privacy Preserving Interactive Record Linkage (PPIRL) is
defined as the query operation $\mathrm{PPIRL}(D_R, D_S, I_{PPIRL}, h)$ in
the minimal sharing model*, where $D_R$ and $D_S$ are the two
tables to be linked, $h$ is an honest-but-curious human in the
loop making a final judgment on linkage, and $I_{PPIRL}$ is the
minimal information to be shared with the human $h$.
METHOD: SECURE DECOUPLED LINKAGE (SDLink)
KEY INSIGHT
• The innovation in decoupling data is the focus on
revealing information rather than hiding it.
• The key is to understand the minimum information
required for quality linkage, and then to design protocols
that reveal only that information in a secure manner.
• The survey results confirmed that chaffing and either falsifying or
not defining the universe around the data were effective in
introducing uncertainty to the information disclosed.
• Under non-disclosure of universe, 56% of the participants were
uncertain about the identity given a common name.
• Even for rare names, if the list is chaffed and the universe is not
defined, 66% of the participants were uncertain about the identity.
*Minimal Sharing Model [Agrawal 2003]
Let there be two parties R (receiver) and S (sender) with databases
$D_R$ and $D_S$ respectively. Given a database query Q spanning the
tables in $D_R$ and $D_S$, and some categories of information I, compute
the answer to Q and return it to R without revealing any additional
information to either party except for information contained in I.
$\mathrm{InteractiveRL}(h, \alpha) \le I_{PPIRL} < \mathrm{Disclosure}(h, \varepsilon)$
It is important to note that the current norm for data integration
in the US is full disclosure of all information to a fully trusted
human entity. For example, full disclosure of both attribute and identity
to certain trusted parties is HIPAA compliant.
CONCLUSION
Information suppression is essential during clerical
review to avoid sensitive attribute disclosure.
Furthermore, when chaffing is used in combination
with non-disclosure of the universe, even rare names
can be displayed with minimum risk of attribute
disclosure during clerical review. Our proposed
methods are effective in the presence of missing and
erroneous data.
REFERENCES
1. Agrawal R, Evfimievski A, Srikant R. Information sharing across private
databases. In SIGMOD 2003, pp 86-97, New York, NY, USA, 2003. ACM.
2. Boyd A, Saxman P, Hunscher D, et al. The University of Michigan Honest Broker:
A Web-based Service for Clinical and Translational Research and Practice. J Am
Med Inform Assoc. 2009 Nov-Dec; 16(6): 784-791.
3. Elfeky M, Verykios V, Elmagarmid A. TAILOR: A Record Linkage Tool Box. In
ICDE 2002. IEEE Computer Society, Washington, DC, USA.
4. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey.
IEEE Trans. Knowl. Data Eng. 2007; 19(1): 1-16.
5. Kum H.C., Ahalt S, Pathak D. Privacy Preserving Data Integration Using
Decoupled Data. In Security and Privacy in Social Networks, Y. Elovici, Y.
Altshuler, A. Cremers, N. Aharony, A. Pentland (Eds), Springer 2012.
6. Schnell R, Bachteler T, Reiher J. Privacy-preserving record linkage using
Bloom filters. BMC Medical Informatics and Decision Making. 2009; 9(41).
7. Wang J, Kraska T, Franklin MJ, Feng J. CrowdER: Crowdsourcing Entity
Resolution. Proceedings of Very Large Data Bases (PVLDB) 5(11), 2012.
8. Vatsalan D, Christen P, Verykios VS. A taxonomy of privacy-preserving
record linkage techniques. Information Systems. Available online 27 Nov 2012.
CONTACT / ACKNOWLEDGMENTS
Hye-Chung Kum, PhD ([email protected])
We thank Mike Reiter and Ashwin Machanavajjhala for their insightful comments,
Fabian Monrose for supporting the research, and Ian Sang-Jun Kim and Ren Bauer
for their assistance with the experiment. This research was supported in part by
funding from the NC Department of Health and Human Services, NIH CTSA
UL1TR000083, and NSF award no. CNS-0915364.
FUTURE WORK: POPULATION INFORMATICS
Today, nearly all of our activities from birth until death
leave digital traces in large databases. Together, these
digital traces capture our social genome,
the footprints of our society. Like the human genome,
the social genome holds much that is buried in massive,
almost chaotic data. If properly analyzed and
interpreted, this social genome could offer crucial
insights into many of the most challenging problems
facing our society (e.g., affordable and accessible
quality healthcare). The burgeoning field of population
informatics is the systematic study of populations via
secondary analysis of massive data collections (termed
“big data”) about people. In particular, health
informatics analyzes electronic health records to
improve health outcomes for a population.