innovating compliantly and transparently ‐ road blocks, myths … · 2017-03-29 · innovating...
Innovating compliantly and transparently ‐
road blocks, myths and solutions
Jana Diesner, PhD
Assistant Professor
School of Information Sciences/ The iSchool at Illinois
University of Illinois at Urbana‐Champaign
Enablers And Benefits of Openness
Regulations and Norms
Transparency
Openness
Reproducibility
Trust
Value Added
Incentive mechanisms, business models
Infrastructures
Open what? It’s complicated
• Privacy statements
– Users pay little attention, hard to understand (McDonald & Cranor 2008, Acquisti & Grossklags 2005)
• Regulations for human‐centered and online data
– Researchers pay little attention, hard to understand
– IRB (1979): “To protect the rights and welfare of humans participating as subjects in the research”
  • Respect for people, beneficence, minimize risk
  • For intervention or interaction with living individuals and/or identifiable private information
– Golden times? Listen, don’t ask (passive measurement, Zevenbergen et al. 2015) and measure/ don’t estimate
Diesner J (2015) Small Decisions with Big Impact on Data Analytics. Big Data & Society, special issue: Assumptions of Sociality.
Working with Human‐Centered and Online Data: Some Practical Questions
• Awareness:
– If an IRB does not apply to our project, is there an ethics or privacy review board, protocol or process?
– What governs data use in commercial settings?
• Knowledge:
– What's the relationship between copyright, terms of service and privacy? What trumps what?
– Does “personal use” include “research use”?
– We got different answers from the IRB, legal, and the library. How to make a decision? (Digital literacy)
• Skills:
– How do we practically implement terms of use?
– How can we anonymize social network data?
– How can we guarantee non‐consumptive use?
• So what makes this all complicated?
What Types of Regulations are out there?
1. Institutional and organizational norms and regulations
– Health Insurance Portability and Accountability Act (HIPAA), Fair Information Practice Principles (FIPPs), Menlo Report (Ethical Principles Guiding Information and Communication Technology Research) (2012)
2. Privacy regulations and law
3. Security regulations and law
4. Intellectual property law, copyright
– Snippets of (appropriated) content
5. Terms of use/ service
6. Technical constraints (robots.txt, APIs)
7. Personal values
– People apply them consciously or unconsciously
– Depend on gender (Gilligan 1987), culture (Graham et al. 2011)
– 16+: Conventional morality (comply with (group) norms) versus 10‐15% post‐conventional morality (own principles) (Kohlberg 1984)
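As a small illustration of the "technical constraints" item, the sketch below checks a robots.txt policy with Python's standard `urllib.robotparser` before any page is requested. The robots.txt content, the user-agent name `research-bot`, and the example.org URLs are hypothetical. Passing this check only covers the technical constraint; it says nothing about terms of service, copyright, or privacy regulations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real project would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
AGENT = "research-bot"  # hypothetical user-agent name

# Ask whether the agent may fetch two hypothetical URLs.
forum_allowed = parser.can_fetch(AGENT, "https://example.org/forum/thread-1")
private_allowed = parser.can_fetch(AGENT, "https://example.org/private/members")
delay = parser.crawl_delay(AGENT)  # seconds requested between requests, or None

print(forum_allowed, private_allowed, delay)
```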
Different people/ fields driven by different practical/ real‐world approaches
Driven by pragmatics
• Utilitarian ethics
• Technical feasibility
• E.g., some of Web Science
Driven by rule compliance
• Vs. learning from examples and common practice
• 73% of respondents (N = 263, from academia, industry, government) consider it permissible to “scrape data from online forums” and 21% are neutral, i.e., 94% do not object (Vitak, Shilton & Ashktorab 2016)
• Set quasi standards
• Lack of standards
Driven by ethics/ personal values
• Shweder 1997:
– Autonomy (protect individual rights and justice)
– Community‐oriented (preserve institutions and social order, effect: sense of duty, respect, loyalty)
– Divinity (protect people from degradation, e.g. due to selfishness)
Open, Free!?
• Open Science, Open Intelligence, Open you name it…
– Gratis versus libre (FLOSS, Stallman, GNU)
– User‐generated data from 3rd party platforms often “free to see” (alternative copyright models, Lessig, Creative Commons)
• Browsewrap agreements not necessarily enforceable:
– “Terms of Use” hyperlinks “not sufficiently conspicuous” (obvious) for “reasonably prudent internet consumer” (plaintiff “did not manifest unambiguous assent to be bound by Terms of Use”) (Long v. Provide Commerce, Inc., 2016 WL 1056555, Cal. Ct. App., 03/17/2016)
• Working with online data is kind of like archival research (Kosinski et al. 2015)
– No consent needed if 1) users consciously made their data public, 2) collected data are anonymized, 3) researchers do not interact with participants, 4) no identifiable user information is published
Solutions
1. Education
2. Compliance
3. Technology (room for improvement)
4. Policy, Legislation
5. For pay models, subscriptions
Accuracy and Transparency at Scale
“In viel weiterem Umfange, als man sich klar zu machen pflegt, ruht unsre moderne Existenz von der Wirtschaft, die immer mehr Kreditwirtschaft wird, bis zum Wissenschaftsbetrieb, in dem die Mehrheit der Forscher unzählige, ihnen gar nicht nachprüfbare Resultate anderer verwenden muß, auf dem Glauben an die Ehrlichkeit des andern.”
“To a much wider degree than we often think, our modern existence, from business and trade, which is turning more and more into a credit economy, to the pursuit of science, where the majority of researchers have to work with results that were produced by others and that they cannot verify themselves, relies upon the belief in the honesty of other people.”
Simmel, G. (1908). Das Geheimnis und die geheime Gesellschaft. In Soziologie: Untersuchungen über die Formen der Vergesellschaftung (pp. 256‐304). Berlin: Duncker & Humblot.
Entity Resolution in Graphs
Mark Newman, UMich
Mark Newman, UMich
• Splitting
– Same surface form, different social entities
– 46,157 John Smith, 562 Mark Newman
• Merging/ consolidation
– Collect all references to same unique entity
– Aka co‐reference resolution, record linkage
– Craig Evans, Craig S. Evans, C.S. Evans, …
Diesner J, Evans C, Kim J (2015) Impact of entity disambiguation errors on social network properties. International AAAI Conference on Web and Social Media (ICWSM), Oxford, UK
Why Bother?
• Impact and propagation of errors (magnitude, upper and lower bound) on (robustness of) (network) data, properties, findings, conclusions largely unknown
• Worth the efforts and costs?
• Highly accurate algorithmic solutions exist (accuracy in the 90% range)
• Payoff from incremental improvements?
Disambiguation: What do we know already?
• Big deal in bibliometrics: heuristics, rules
– First initial based disambiguation: M. Newman = M. Newman
– All initial based disambiguation: M. E. Newman != M. W. Newman
– Justification: upper and lower bound of true number of nodes (Newman, 2001): Is that true?
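The two initials-based heuristics can be sketched as key functions: every name that maps to the same key is treated as one node. This is a minimal illustration, not the procedure used in the cited studies. The parsing rule (last token is the surname, all other tokens contribute one initial each) and the function names are assumptions; the example names are the ones shown on the slides.

```python
import re

def parse_name(full_name: str) -> tuple[str, list[str]]:
    """Split a 'Given M. Surname' string into (surname, list of initials)."""
    parts = re.sub(r"[.,]", " ", full_name).split()
    surname = parts[-1].lower()
    initials = [p[0].lower() for p in parts[:-1]]
    return surname, initials

def first_initial_key(full_name: str) -> str:
    """Surname plus first initial only: merges aggressively."""
    surname, initials = parse_name(full_name)
    return f"{surname}_{initials[0]}" if initials else surname

def all_initials_key(full_name: str) -> str:
    """Surname plus all initials: merges less, may split one person."""
    surname, initials = parse_name(full_name)
    return f"{surname}_{''.join(initials)}"

names = ["M. E. Newman", "M. W. Newman", "Craig Evans", "Craig S. Evans", "C.S. Evans"]

first_initial_nodes = {first_initial_key(n) for n in names}
all_initials_nodes = {all_initials_key(n) for n in names}

print(len(first_initial_nodes), sorted(first_initial_nodes))
print(len(all_initials_nodes), sorted(all_initials_nodes))
```

On these five name strings, the first-initial key yields two nodes (both Newmans merged) and the all-initials key yields four ("Craig Evans" split from "Craig S. Evans"), which is the sense in which the two heuristics are used as lower and upper bounds.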
Data | Enron | MEDLINE
Time range | 10/1999-07/2002 | 01/2005-12/2009
Number of documents | 520,458 | 101,162
Domain | Email | Co-publishing
Context | Corporate, internal communication | Scientific, external/ public communication
Mainly subject to | Merging | Splitting
Data Preparation: Enron: Consolidation
• Semi‐automated and manually vetted mapping of email addresses to people, incl. full names, job histories, locations (Diesner et al. 2005)
• “Service learning assignment” in graduate courses

# email addresses/ person | # people with that # of addresses | Person (* indicted)
26 | 1 | Kenneth Lay, Chairman*
11 | 3 | Jeffrey Skilling, CEO*; David Delainey, Energy Trader*; Vince Kaminski, MD Research
10 | 3 | Susan Scott; Steven Kean, EVP, Chief of Staff; Mark Haedicke, General Counsel
9 | 4 | Mark Taylor, Asst Gen Counsel; Grant Masson, VP Research; Patrice Mims; Jeff Dasovich, Exec - Gov Affairs
8 | 5 |
7 | 13 |
6 | 17 |
5 | 36 |
4 | 63 |
3 | 160 |
2 | 1,218 |
1 | 21,753 |

1,523 people have more than one email address (average 2.4, median 2)
Data Preparation: Enron: Networks
• Raw (worst):
– Simple directed graph
– Baseline for no effort
• Disambiguated (better):
– Actual social entities
– Only @enron.com email addresses
• Scrubbed (best for now):
– More consolidation and verification
– No mailing lists

Number of | Raw | Disambig. (Diff to Raw) | Scrubbed (Diff to Raw) (Diff to Disambig.)
Senders | 19,466 | 6,205 (-68%) | 5,441 (-72%) (-12%)
Receivers | 72,713 | 19,700 (-73%) | 15,297 (-79%) (-22%)
Addresses | 81,811 | 20,332 (-75%) | 15,526 (-81%) (-24%)
Edges | 332,683 | 212,768 (-36%) | 188,045 (-43%) (-12%)
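The consolidation step and the "Diff to Raw" columns can be illustrated with a short sketch. It maps email addresses to people, rebuilds the edge set, and reports relative changes as (new - old) / old. The addresses, names, and mapping are hypothetical toy data, and dropping self-loops and duplicate edges after merging is an assumption rather than a documented detail of the study. The last lines apply the same formula to the sender counts reported above.

```python
def pct_change(new: float, old: float) -> float:
    """Relative change of `new` against the baseline `old`, in percent."""
    return 100.0 * (new - old) / old

def consolidate(edges, address_to_person):
    """Replace addresses by person ids; drop self-loops and duplicates (assumption)."""
    merged = set()
    for sender, receiver in edges:
        s = address_to_person.get(sender, sender)
        r = address_to_person.get(receiver, receiver)
        if s != r:
            merged.add((s, r))
    return merged

def nodes_of(edges):
    """All endpoints that occur in a set of (sender, receiver) pairs."""
    return {node for edge in edges for node in edge}

# Hypothetical raw email edges (sender address, receiver address).
raw_edges = {
    ("alice@example.com", "bob@example.com"),
    ("a.smith@example.com", "bob@example.com"),
    ("bob@example.com", "alice@example.com"),
    ("alice@example.com", "a.smith@example.com"),
    ("robert@example.com", "carol@example.com"),
}
# Hypothetical, manually vetted mapping of addresses to people.
ADDRESS_TO_PERSON = {
    "alice@example.com": "Alice Smith",
    "a.smith@example.com": "Alice Smith",
    "bob@example.com": "Bob Jones",
    "robert@example.com": "Bob Jones",
    "carol@example.com": "Carol Lee",
}

person_edges = consolidate(raw_edges, ADDRESS_TO_PERSON)
node_change = pct_change(len(nodes_of(person_edges)), len(nodes_of(raw_edges)))
edge_change = pct_change(len(person_edges), len(raw_edges))
print(f"toy nodes: {node_change:+.0f}%, toy edges: {edge_change:+.0f}%")

# Same formula applied to the reported sender counts.
senders_raw_to_disambig = pct_change(6205, 19466)
senders_disambig_to_scrubbed = pct_change(5441, 6205)
print(f"{senders_raw_to_disambig:+.0f}% {senders_disambig_to_scrubbed:+.0f}%")
```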
Data Preparation: MEDLINE: Disambiguation
• From National Library of Medicine (1950 onwards)
• 2012: 20 million publications
• Medical subject heading (MeSH): brain, 2005‐2009, ~110k articles from 3,700 journals
• Disambiguation: Authority database (Torvik & Smalheiser 2009, 98‐99% accurate), 101K pubs.
• 3 networks
– Algorithmic (best)
– All initial (worse)
– First initial (worst)
 | Algorithmic | All-initials (Diff to alg.) | First-initials (Diff to alg.) (Diff to all-initials)
Name Instances | 557,662 | 557,662 | 557,662
Unique Entities | 258,971 | 207,256 (-20.0%) | 182,421 (-29.6%) (-11.9%)
Edges | 1,335,366 | 1,317,894 (-1.3%) | 1,303,957 (-2.4%) (-1.1%)
Email networks (Raw, Manual Disamb., Scrubbed): consolidation of nodes, elimination of errors
Co-publishing networks (Algorithmic, All-initials, First-initial): splitting up of nodes, introduction of errors

Network Properties | Raw (worst) | Manual Disamb. (better) | Scrubbed (best) | Algorithmic (best) | All-initials (worse) | First-initial (worst)
No. of Vertices | 81,811 | 20,332 (-75.15%) | 15,526 (-81.02%) | 258,971 | 207,256 (-19.97%) | 182,421 (-29.56%)
No. of Edges | 332,683 | 212,768 (-36.04%) | 188,045 (-43.48%) | 1,335,366 | 1,317,894 (-1.31%) | 1,303,957 (-2.35%)
Density | 4.97E-05 | 5.14E-04 (+934%) | 7.80E-04 (+1,469%) | 3.98E-05 | 6.14E-05 (+54.27%) | 7.84E-05 (+96.98%)
Clustering Coefficient | 0.07637 | 0.09421 (+18.94%) | 0.10698 (+28.61%) | 0.39 | 0.20 (-48.72%) | 0.19 (-51.28%)
Diameter | 18 (Directed), 15 (Undirected) | 10 (Directed), 10 (Undirected) | 10 (Directed), 7 (Undirected) | 22 | 19 (-13.64%) | 18 (-18.18%)
Avg. Shortest Path Length | 4.33 | 3.56 (-17.78%) | 3.56 (-17.78%) | 6.70 | 5.21 (-22.24%) | 4.78 (-28.66%)
No. of Components | 978 | 10 (-98.98%) | 5 (-99.49%) | 10,182 | 5,028 (-50.62%) | 3,100 (-69.55%)
Ratio of Largest Component | 96.82% | 99.91% (+3.09%p) | 99.95% (+3.13%p) | 80.91% | 90.47% (+9.56%p) | 93.63% (+12.72%p)
Degree Centralization | N/A | N/A | N/A | 1.83E-03 | 6.98E-03 (+281.42%) | 8.40E-03 (+359.02%)
In Degree Centralization | 0.01635 | 0.03052 (+86.67%) | 0.03561 (+117.80%) | N/A | N/A | N/A
Out Degree Centralization | 0.01909 | 0.07858 (+311.63%) | 0.07858 (+311.63%) | N/A | N/A | N/A
Eigenvector Centralization | 0.99588 | 0.98552 (-1.04%) | 0.98213 (-1.38%) | 0.212 | 0.195 (-8.02%) | 0.187 (-11.79%)
Betweenness Centralization | 0.01041 | 0.02014 (+93.47%) | 0.02728 (+164.65%) | 9.85E-03 | 2.26E-02 (+129.44%) | 2.09E-02 (+112.18%)
Cl 0 228 0 238
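The density row follows from the vertex and edge counts. The sketch below assumes the standard definitions, E / (N(N-1)) for the directed email networks and E / (N(N-1)/2) for the undirected co-publishing networks; the slides do not state the formula. Under that assumption it reproduces the reported densities of the raw email network and of the three co-publishing networks.

```python
def density(n_nodes: int, n_edges: int, directed: bool) -> float:
    """Share of possible node pairs that are connected by an edge."""
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2  # each unordered pair counts once
    return n_edges / possible

# Counts taken from the table above.
enron_raw = density(81_811, 332_683, directed=True)
medline_algorithmic = density(258_971, 1_335_366, directed=False)
medline_all_initials = density(207_256, 1_317_894, directed=False)
medline_first_initial = density(182_421, 1_303_957, directed=False)

for label, value in [
    ("Enron raw", enron_raw),
    ("MEDLINE algorithmic", medline_algorithmic),
    ("MEDLINE all-initials", medline_all_initials),
    ("MEDLINE first-initial", medline_first_initial),
]:
    print(f"{label}: {value:.2E}")
```

Because density divides by roughly N squared, merging nodes raises density sharply even when the edge count falls, which is why the initials-based networks look denser than the algorithmically disambiguated one.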
Results: Most powerful/ influential individuals
Degree Centrality
Rank | Enron: Raw | Enron: Disambiguated | Enron: Scrubbed | MEDLINE: Algorithmic | MEDLINE: All Initials | MEDLINE: First Initial
1 | [email protected] | Beck, Sally | Beck, Sally | Krause, W | Wang, Y | Wang, J
2 | [email protected] | OUTLOOK TEAM | Lay, Kenneth | Fulop, L | Wang, J | Wang, Y
3 | [email protected] | Forster, David | Forster, David | Nawa, H | Wang, X | Lee, J
4 | [email protected] | Lay, Kenneth | Jones, Tana | Su, Y | Chen, Y | Kim, J
5 | [email protected] | TECHNOLOGY | Kaminski, Vince | Medarova, Z | Li, X | Wang, X

Closeness Centrality
Rank | Enron: Raw | Enron: Disambiguated | Enron: Scrubbed | MEDLINE: Algorithmic | MEDLINE: All Initials | MEDLINE: First Initial
1 | [email protected] | Lay, Kenneth | Beck, Sally | Trojanowski, JQ | Wang, J | Wang, J
2 | [email protected] | Beck, Sally | Lay, Kenneth | Kretzschmar, HA | Wang, Y | Wang, Y
3 | [email protected] | OUTLOOK TEAM | Kitchen, Louise | Toga, AW | Wang, X | Wang, X
4 | [email protected] | Kitchen, Louise | Kean, Steven | Thompson, PM | Li, X | Lee, J
5 | [email protected] | Lavorato, John | Lavorato, John | Barkhof, F | Zhang, J | Zhang, J

Betweenness Centrality
Rank | Enron: Raw | Enron: Disambiguated | Enron: Scrubbed | MEDLINE: Algorithmic | MEDLINE: All Initials | MEDLINE: First Initial
1 | [email protected] | Beck, Sally | Beck, Sally | Toga, AW | Wang, J | Wang, J
2 | [email protected] | Kaminski, Vince | Lay, Kenneth | Kretzschmar, HA | Wang, Y | Lee, J
3 | [email protected] | Lay, Kenneth | Kaminski, Vince | Thompson, PM | Wang, X | Wang, Y
4 | [email protected] | Skilling, Jeffrey | Jones, Tana | Trojanowski, JQ | Li, J | Wang, X
5 | [email protected] | OUTLOOK TEAM | Hayslett, Rod | Barkhof, F | Lee, J | Zhang, J

Eigenvector Centrality
Rank | Enron: Raw | Enron: Disambiguated | Enron: Scrubbed | MEDLINE: Algorithmic | MEDLINE: All Initials | MEDLINE: First Initial
1 | [email protected] | Kitchen, Louise | Kitchen, Louise | Futreal, PA | Wang, Y | Wang, Y
2 | [email protected] | Beck, Sally | Beck, Sally | Stratton, MR | Liu, Y | Wang, J
3 | [email protected] | Haedicke, Mark | Haedicke, Mark | Edkins, S | Wang, J | Liu, Y
4 | [email protected] | Lavorato, John | Lavorato, John | Omeara, S | Wang, X | Wang, X
5 | [email protected] | Forster, David | Forster, David | Stevens, C | Li, X | Zhang, J
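One possible reading of the MEDLINE columns is that initials-based keys merge several distinct authors who share a common surname into one artificially well-connected node. The toy sketch below shows that mechanism for degree centrality. All author names and co-authorship pairs are hypothetical, and `first_initial` is an illustrative key function, not the procedure used in the study.

```python
from collections import defaultdict

def first_initial(name: str) -> str:
    """'Lee, Jae' -> 'Lee, J' (surname plus first initial)."""
    surname, given = name.split(", ")
    return f"{surname}, {given[0]}"

def degree_ranking(coauthor_pairs, key=lambda name: name):
    """Rank nodes by number of distinct co-authors after applying `key`."""
    neighbors = defaultdict(set)
    for a, b in coauthor_pairs:
        ka, kb = key(a), key(b)
        if ka != kb:  # ignore self-loops created by merging
            neighbors[ka].add(kb)
            neighbors[kb].add(ka)
    degrees = {node: len(adjacent) for node, adjacent in neighbors.items()}
    return sorted(degrees.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical co-authorship pairs: one author with three co-authors and
# three different people named 'Lee, J...' with two co-authors each.
pairs = [
    ("Miller, Ann", "Ito, Ken"), ("Miller, Ann", "Roy, Uma"), ("Miller, Ann", "Diaz, Eva"),
    ("Lee, Jae", "Cho, Min"), ("Lee, Jae", "Park, Sun"),
    ("Lee, Jin", "Abe, Rei"), ("Lee, Jin", "Khan, Ali"),
    ("Lee, Joy", "Wolf, Tim"), ("Lee, Joy", "Sato, Yui"),
]

exact_ranking = degree_ranking(pairs)
merged_ranking = degree_ranking(pairs, key=first_initial)

print(exact_ranking[0])   # top node when every full name is its own node
print(merged_ranking[0])  # top node under first-initial merging
```

With full names the single hub ranks first; with first-initial keys the three merged "Lee, J" authors form one node with six neighbors and take the top rank.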
Results: Differences in Topologies
Enron (left): log‐log plot of node degree (in, out)
MEDLINE (right): log‐log plot of node degree
Email networks:
• Duplicates ‐> network seems bigger, less coherent, less integrated
• Overestimates need for interaction
Co‐publishing networks:
• Failing to split nodes ‐> scientific sector seems more dense, integrated, cohesive, and authors more productive, collaborative, diverse
• Underestimates need for (inter‐disciplinary) collaboration and support
Conclusions
• Majority of metrics heavily biased, topologies misidentified, key players more robust
• Big Data does not fix this issue
• Data preparation and analysis loaded with decisions
– Inherent in data collection, tools, algorithms, …
– Decisions sometimes not considered or not made explicit
– Poor awareness and understanding of their impact
• Data quality key ingredient for reliable results
• Silver lining/ possible positive side: Closely interacting with data and forcing ourselves to understand them can help us to move from being able to precisely model and formally describe effects in society to also understanding and explaining them.
Acknowledgement
• Regulations: Supported by the Ford Foundation and the National Center for Supercomputing Applications (NCSA).
• Disambiguation: Supported by KISTI (Korea Institute of Science and Technology Information). The disambiguated MEDLINE dataset was provided by Vetle Torvik and Brent Fegley from iSchool/ UIUC.
• Chie‐Li (Julian) Chin and Jinseok Kim from my lab
References Citations
• Acquisti, A., & Grossklags, J. (2005). Privacy and rationality in individual decision making. IEEE Security & Privacy, 3(1), 26‐33.
• Dittrich, D., & Kenneally, E. (2012). The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. Tech. rep., U.S. Department of Homeland Security.
• Gilligan, C. (1987). Moral orientation and moral development.
• Graham, J., Nosek, B. A., Haidt, J., Iyer, R., Koleva, S., & Ditto, P. H. (2011). Mapping the moral domain. Journal of Personality and Social Psychology, 101(2), 366.
• Kohlberg, L. (1984). The psychology of moral development: The nature and validity of moral stages (Vol. 2). Harpercollins College Div.
• Kosinski, M., Matz, S. C., Gosling, S. D., Popov, V., & Stillwell, D. (2015). Facebook as a research tool for the social sciences: Opportunities, challenges, ethical considerations, and practical guidelines. American Psychologist, 70(6), 543‐556.
• McDonald, A. M., & Cranor, L. F. (2008). The cost of reading privacy policies. ISJLP, 4, 543.
• Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2), 404‐409.
• Shweder, R. A., Much, N. C., Mahapatra, M., & Park, L. (1997). The “Big Three” of Morality (Autonomy, Community, Divinity) and the “Big Three” Explanations of Suffering. In A. M. Brandt & P. Rozin (Eds.), Morality and Health, 119‐172.
• Torvik, V. I., & Smalheiser, N. R. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1‐29.
• Vitak, J., Shilton, K., & Ashktorab, Z. (2016). Beyond the Belmont Principles: Ethical Challenges, Practices, and Beliefs in the Online Data Research Community. Paper presented at the 19th ACM Conference on Computer‐Supported Cooperative Work and Social Computing (CSCW 2016), San Francisco, CA.
• Zevenbergen, B., Mittelstadt, B., Véliz, C., Detweiler, C., Cath, C., Savulescu, J., & Whittaker, M. (2015). Philosophy meets Internet Engineering: Ethics in Networked Systems Research. GTC workshop outcomes paper: Oxford Internet Institute, University of Oxford.
References Images
• World clock: https://en.wikipedia.org/wiki/File:Globe-with-clock.svg
• Free speech: http://www.gbcnv.edu/rights_responsibilities/free_speech.html
• Free beer: https://openclipart.org/detail/73603/beer
• Flowers: http://publicdomainpictures.net/view-image.php?image=119670&picture=&jazyk=pt
Publications on Regulatory Issues and Impact of Pre‐Processing on Network Analysis
• Diesner J, Chin C (2016) Seeing the forest for the trees: considering applicable types of regulation for the responsible collection and analysis of human centered data. Human‐Centered Data Science (HCDS) Workshop at 19th ACM Conference on Computer‐Supported Cooperative Work and Social Computing (CSCW 2016), San Francisco, CA.
• Diesner J, Chin C (2016) Gratis, Libre, or Something Else? Regulations and Misassumptions Related to Working with Publicly Available Text Data, ETHI‐CA² Workshop (ETHics In Corpus Collection, Annotation & Application), 10th Language Resources and Evaluation Conference (LREC), Portoroz, Slovenia.
• Diesner J, Chin C (2015) Usable Ethics: Practical Considerations for Responsibly Conducting Research with Social Trace Data. Workshop: Beyond IRBs: Ethical Review Processes for Big Data Research, Future of Privacy Forum, Washington DC.
• Diesner J, Evans C, Kim J (2015) Impact of entity disambiguation errors on social network properties. International AAAI Conference on Web and Social Media (ICWSM), Oxford, UK
• Kim J, Diesner J (accepted) Less than expected: Over‐time measurement of triadic closure in scientific collaboration networks. Journal for Social Network Analysis and Mining (SNAM).
• Kim J, Diesner J (2015) The Effects of Data Pre‐Processing on Understanding the Evolution of Collaboration Networks. Journal of Informetrics, 9(1), 226‐236.
Thank you!
• Questions, comments, feedback, follow‐up: Jana Diesner
Email: [email protected]
http://jdiesnerlab.ischool.illinois.edu/