Dongmei Zhang and Tao Xie. Pathways to Technology Transfer and Adoption: Achievements and Challenges. In Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), Software Engineering in Practice (SEIP), Mini-Tutorial, San Francisco, CA, May 2013. http://people.engr.ncsu.edu/txie/publications/icse13seip-techtransfer.pdf
Pathways to Technology Transfer and Adoption: Achievements and Challenges
Dongmei Zhang
Microsoft Research Asia
Tao Xie
North Carolina State University
ICSE 2013 SEIP Mini-Tutorial
May 23, 2013
taoxie@gmail.com, dongmeiz@microsoft.com
Successful Samples: Research to Practice
ICSE 2013 SEIP 2
[Figure: examples of research transferred to practice: MSR SAGE, ASTRÉE, Statechart, MSRA (shown twice), SPIN, …]
ACM SIGSOFT Impact Project
http://www.sigsoft.org/impact
Goals of the Impact Project
• Scholarly objective case-based evaluation
• Deliverables
  • peer-reviewed papers
  • presentation materials and outreach activities
  • expertise
• Community building
• Prospective for future research investment
• Lessons learned for “successful” research
  • but only with respect to transfer into practice (there are other measures of research success)
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
An Argument: Research/Product Timing, SCM
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
Impact Trace Graph: Middleware
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
ICSE Papers: Industry vs. Academia
Source: © Carlo Ghezzi
OSDI 2008: 26 vs. xSE: developers, programmers, architects among all attendees
ICSM 11 Keynote
ICSE 09 Keynote
MSR 12 Keynote
MSR 11 Keynote
SCAM 12 Keynote
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Redwine and Riddle Study (1985)
• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine, Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation: Middleware
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
15-20 years between first publication of an idea and widespread availability in products
Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce impact on practice? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need for long-term/blue-sky research]
NSF Workshop on Formal Methods
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
December 2012
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Researcher’s View: SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher’s Observation in HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay
A Researcher’s Observation in HCI Research Community
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.
• Sometimes designers fail with the simplest of things, like documentation for their language.
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.
http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Wired.com
Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations that apply tools to industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses/evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
• Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to address them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice (not consulting/working with industry)
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes; isolated complex data structures; real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real world
  • Considers many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: device driver verification
  • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
  • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded
May 2009
Group members
12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research Topics (vertical)
  • Software Development Process
  • Software Systems
  • Software Users
• Technology Pillars (horizontal)
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing
Connection to practice
MSR 2012 39
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
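The four steps above can be pictured with a minimal Python sketch. Only the step names come from the slide; the function names, the normalization rule, the use of difflib for fine matching, and the thresholds are illustrative assumptions and not XIAO's actual algorithm. Each pair of functions is compared independently, which is what would let the loop in find_clones be split across source code partitions.

```python
import difflib
import re
from itertools import combinations


def preprocess(source):
    """Step 1 (assumed form): turn each non-blank line into a normalized
    token string in which every identifier or keyword becomes 'id'."""
    statements = []
    for line in source.splitlines():
        line = line.strip()
        if line:
            tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
            statements.append(
                " ".join("id" if re.match(r"[A-Za-z_]", t) else t for t in tokens)
            )
    return statements


def coarse_match(a, b, min_shared=2):
    """Step 2 (assumed form): cheap filter that keeps a pair only if the two
    functions share enough distinct normalized statements."""
    return len(set(a) & set(b)) >= min_shared


def fine_match(a, b):
    """Step 3 (assumed form): similarity of the two statement sequences, in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def prune(candidates, threshold):
    """Step 4 (assumed form): drop candidate pairs below the similarity threshold."""
    return [(x, y, score) for x, y, score in candidates if score >= threshold]


def find_clones(functions, threshold=0.7):
    """Run the four steps over a dict mapping function name to source text."""
    normalized = {name: preprocess(src) for name, src in functions.items()}
    candidates = []
    # Each pair is independent, so this loop could be distributed by partition.
    for x, y in combinations(sorted(normalized), 2):
        if coarse_match(normalized[x], normalized[y]):
            candidates.append((x, y, fine_match(normalized[x], normalized[y])))
    return prune(candidates, threshold)


# Hypothetical input: f and g differ only in identifier names.
example = {
    "f": "a++;\nb++;\nc = foo(a, b);",
    "g": "x++;\ny++;\nz = foo(x, y);",
    "h": "return 0;",
}
clones = find_clones(example)
```

Here f and g are reported as a clone pair because they are identical after normalization, while h is filtered out at the coarse matching step.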
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
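To make the tuning knobs above concrete, here is a minimal Python sketch applied to the two snippets from the slide. The metric, the parameter names (stmt_threshold, ordered), and the matching strategies are assumptions chosen for illustration; the slides do not give XIAO's actual metric. stmt_threshold controls how different two statements may be and still count as the same, and ordered switches between respecting statement order and tolerating disordered statements.

```python
import difflib


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return difflib.SequenceMatcher(None, s1.split(), s2.split()).ratio()


def _ordered_matches(a, b, same):
    """Longest common subsequence length under a fuzzy equality test."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, sa in enumerate(a, 1):
        for j, sb in enumerate(b, 1):
            if same(sa, sb):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def _unordered_matches(a, b, same):
    """Greedy first-fit matching that ignores statement order."""
    unused = list(b)
    count = 0
    for sa in a:
        for sb in unused:
            if same(sa, sb):
                unused.remove(sb)
                count += 1
                break
    return count


def snippet_similarity(a, b, stmt_threshold=0.8, ordered=True):
    """Share of statements with a counterpart in the other snippet.

    Unmatched statements are the inserted/deleted/modified ones, so a
    lower score means more of them.
    """
    if not a and not b:
        return 1.0

    def same(s1, s2):
        return statement_similarity(s1, s2) >= stmt_threshold

    matcher = _ordered_matches if ordered else _unordered_matches
    return 2 * matcher(a, b, same) / (len(a) + len(b))


# The two snippets from the slide, one statement per entry, tokens spaced out.
SNIPPET_A = [
    "for ( i = 0 ; i < n ; i ++ )",
    "a ++", "b ++",
    "c = foo ( a , b )",
    "d = bar ( a , b , c )",
    "e = a + c",
]
SNIPPET_B = [
    "for ( i = 0 ; i < n ; i ++ )",
    "c = foo ( a , b )",
    "a ++", "b ++",
    "d = bar ( a , b , c )",
    "e = a + d",
    "e ++",
]

strict = snippet_similarity(SNIPPET_A, SNIPPET_B, stmt_threshold=1.0, ordered=True)
loose = snippet_similarity(SNIPPET_A, SNIPPET_B, stmt_threshold=0.8, ordered=False)
```

With exact statement matching and strict ordering the pair scores lower than with a relaxed statement threshold and order ignored, so the same pair can be reported or suppressed depending on how the user tunes the parameters.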
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
MSRC: Microsoft Security Response Center
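The search scenario above (a code snippet as input, an index built ahead of time, a fast lookup) can be sketched in Python as a small inverted index. This is only an assumed illustration: the class name CloneIndex, the shingle size k, the in-memory dictionary, and the file names are hypothetical, and the slides do not describe how the real service indexes code.

```python
from collections import defaultdict


def shingles(lines, k=2):
    """Set of k consecutive whitespace-normalized statements."""
    stmts = [" ".join(line.split()) for line in lines if line.strip()]
    return {tuple(stmts[i:i + k]) for i in range(len(stmts) - k + 1)}


class CloneIndex:
    """Inverted index from statement shingles to the files containing them."""

    def __init__(self, k=2):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, name, source):
        """Index one file; this work is done once, before any query arrives."""
        for shingle in shingles(source.splitlines(), self.k):
            self.postings[shingle].add(name)

    def search(self, snippet, min_hits=1):
        """Return (file, shared shingle count) pairs, best match first.

        A snippet shorter than k statements yields no shingles and
        therefore no results.
        """
        hits = defaultdict(int)
        for shingle in shingles(snippet.splitlines(), self.k):
            for name in self.postings.get(shingle, ()):
                hits[name] += 1
        ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
        return [(name, count) for name, count in ranked if count >= min_hits]


# Hypothetical codebases: two products share a copied allocation pattern.
index = CloneIndex()
index.add("product_a/font.c",
          "len = read_len(src);\nbuf = alloc(len);\ncopy(buf, src, len);")
index.add("product_b/font.c",
          "n = read_len(src);\nbuf = alloc(len);\ncopy(buf, src, len);\nfree(buf);")
index.add("product_c/util.c", "x = 1;\ny = 2;")

results = index.search("buf = alloc(len);\ncopy(buf, src, len);")
```

The query cost depends on the size of the snippet rather than on the size of the indexed codebases, which is the property a near-real-time search service needs.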
Vulnerability investigation workflow
[Figure: workflow steps: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Labels: MSRC; Team A, Team B, Team C; Code snippet; Code clones; Manual & ad hoc investigation]
Vulnerability investigation workflow
[Figure: workflow steps: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Labels: Clone search service; Completeness is the key; Web service API for automation; Code snippet; Code clones; Automated investigation]
More secure Microsoft products
Automated laborious manual efforts
Faster response time, critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034: Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching for similar snippets to fix a bug once
Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
ICSE 2013 SEIP 79
Successful Samples Research Practice
ICSE 2013 SEIP 2
hellip
MSR SAGE
ASTREacuteE
Statechart
MSRA MSRA
SPIN
ACM SIGSOFT Impact Project
httpwwwsigsoftorgimpact
Goals of the Impact Projectbull Scholarly objective case-based evaluation
bull Deliverablesbull peer-reviewed papersbull presentation materials and outreach activitiesbull expertise
bull Community building
bull Prospective for future research investment
bull Lessons learned for ldquosuccessfulrdquo researchbull but only with respect to transfer into practice
(there are other measures of research success)
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
An Argument ResearchProduct Timing SCM
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
Impact Trace Graph Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
ICSE Papers Industry vs Academia
Sourcecopy Carlo Ghezzi
ICSE Papers Industry vs Academia
Sourcecopy Carlo Ghezzi
OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees
ICSE Papers Industry vs Academia
Sourcecopy Carlo Ghezzi
OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees
ICSM 11 KeynoteICSE 09 Keynote
MSR 12 KeynoteMSR 11 Keynote
SCAM 12 Keynote
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Redwine and Riddle Study (1985)
bull From idea to ldquothe point it can be popularized and disseminated to the technical community at largerdquobull Worst case 23 yearsbull Best case 11 yearsbull Mean 17 years
bull75 years from developed technology to wide availability
SourcecopyS L Pfleeger
Sam Redwine Jr William Riddle Software Technology Maturation In Proc ICSE 1985
Technology Maturation Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
15-20 years between first
publication of an idea and widespread availability in products
Technology Maturation Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
15-20 years between first
publication of an idea and widespread availability in productsShall we just stay in our comfort zone
to wait for 15-20 years for our research to (or not to) produce
practice impact How about the research that we did
15-20 years ago[Caveat donrsquot forget the need of long-termblue-sky research]
NSF Workshop on Formal Methods
bull Goal to identify the future directions in research in formal methods and its transition to industrial practice
bull The workshop brought together researchers and identified primary challenges in the field both foundational infrastructural and in transitioning ideas from research labs to developer tools
httpgotoucsdedu~rjhalaNSFWorkshop
Recently related fields (eg formal methods) have already looked into transitioning research to industrial practice Time for us to do too
December 2012
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Researcherrsquos View -SCM Impact Study Findings
bullResearchers tend to consider thathellipbull precedence
bull concepts
bull prototypes
bull are sufficient as impact and ignorehellipbull efficiency
bull usability
bull reliability
bulldismissing them as ldquoengineering common senserdquo
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
A Researcher's Observation in the HCI Research Community
• "The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks."
• "This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper."
• "When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?"
• "We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics."
"I give up on CHI/UIST" by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: "Research in Programming Languages"
• "Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun." – PHP, JavaScript, Python, Ruby
• "one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them"
• "reverse the trend of placing software research under the auspices of science and engineering [alone]"
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/
Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don't always have practical objectives. There's a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd without thinking about what's needed to actually make it useful
• Sometimes designers fail with the simplest of things, like documentation for their language
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Source: © C. Garling
Industrial Evaluations= Real Adoption
• Papers on industrial studies/evaluations that apply tools to industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: "Pointer Analysis"
"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind PASTE'01]
"During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, "Haven't we solved this problem yet?" With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."
Section 4.3: Designing an Analysis for a Client's Needs
"Barbara Ryder expands on this topic: "… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.""
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind & B. Ryder
MS Academic Search: "Clone Detection"
Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already matured after n years of effort on intermediate steps.
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa/
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely asks "When scale is up, will my solution still work?"
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engines, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
  • But in practice, a simple solution tends to be the scalable one (performance, maintenance, …)
• Academia tends to value sophistication > simplicity
• Ex.: EchelonMS [Srivastava and Thiagarajan ISSTA'02], Klee [Cadar et al. OSDI'08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to tackle one problem at a time (relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice (does not consult or work with industry)
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, the focus of our recent work [Thummalapenta et al. ESEC/FSE'09, OOPSLA'11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) rather than a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (or at least does not compromise other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN'09]
  • Coverity [Bessey et al. CACM'10] vs. MSRA XIAO [Dang et al. ACSAC'12] / PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • It is too much to include both the approach/tool itself and its usability evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA'06] vs. Daikon [Ernst et al. ICSE'99]
  • Debugging user study [Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers; only 5 evaluated with actual programmers
[Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost; precision and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS'08]
    • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA'10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
• Mission: utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research topics (vertical)
  • Software Development Process
  • Software Systems
  • Software Users
• Technology pillars (horizontal)
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
• Code Clone Analysis [Dang et al. ACSAC'12]
  • Detecting near-duplicated code
  • Released with Visual Studio 2012
• StackMine [Han et al. ICSE'12]
  • Performance debugging in the large via mining millions of stack traces
  • Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…
What does "getting real" mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
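The four steps named on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not XIAO's actual algorithm: the function names (`preprocess`, `coarse_match`, `fine_match`, `find_clones`), the whitespace-stripping normalization, the shared-statement filter, and the Jaccard score are all hypothetical stand-ins chosen to keep the example small. The one point it tries to show is that pre-processing works on each source partition independently, so it can run in parallel.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def preprocess(source):
    """Step 1 (assumed form): one whitespace-free string per non-empty line."""
    return [re.sub(r"\s+", "", line) for line in source.splitlines() if line.strip()]


def index_partition(partition):
    """Pre-process one partition of the code base; partitions are independent."""
    return {name: preprocess(src) for name, src in partition.items()}


def coarse_match(functions, min_shared=2):
    """Step 2: cheaply keep only pairs sharing at least min_shared statements."""
    buckets = defaultdict(set)
    for name, statements in functions.items():
        for statement in set(statements):
            buckets[statement].add(name)
    shared = defaultdict(int)
    for names in buckets.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                shared[(first, second)] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, pair):
    """Step 3: score a candidate pair (here: Jaccard overlap of statement sets)."""
    left, right = (set(functions[name]) for name in pair)
    return len(left & right) / len(left | right)


def find_clones(partitions, threshold=0.5):
    """Run all four steps; partitions is a list of {function_name: source} dicts."""
    functions = {}
    with ThreadPoolExecutor() as pool:
        for indexed in pool.map(index_partition, partitions):
            functions.update(indexed)
    scored = [(pair, fine_match(functions, pair)) for pair in coarse_match(functions)]
    # Step 4: prune candidate pairs whose score falls below the threshold.
    return sorted((pair, score) for pair, score in scored if score >= threshold)
```

Calling `find_clones` with one dictionary per source partition returns the surviving function pairs with their scores; raising `threshold` prunes more aggressively.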
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements
Snippet 1:
    for (i = 0; i < n; i++)
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
Snippet 2:
    for (i = 0; i < n; i++)
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
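To make the tuning knobs on this slide concrete, here is a minimal sketch of a tunable clone similarity score. It is not the metric used in XIAO; the parameter names `stmt_threshold` and `order_weight`, the use of `difflib.SequenceMatcher` for statement similarity, and the blend of an order-sensitive and an order-insensitive match count are all assumptions made for illustration. `LEFT` and `RIGHT` are the two snippets from the slide, with tokens separated by spaces.

```python
from difflib import SequenceMatcher

# The two snippets from the slide, one statement per entry, tokens space-separated.
LEFT = ["for ( i = 0 ; i < n ; i ++ )", "a ++", "b ++",
        "c = foo ( a , b )", "d = bar ( a , b , c )", "e = a + c"]
RIGHT = ["for ( i = 0 ; i < n ; i ++ )", "c = foo ( a , b )", "a ++", "b ++",
         "d = bar ( a , b , c )", "e = a + d", "e ++"]


def statement_similarity(a, b):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def ordered_matches(xs, ys, same):
    """Longest common subsequence length under the `same` predicate."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        cur = [0]
        for j, y in enumerate(ys, 1):
            cur.append(prev[j - 1] + 1 if same(x, y) else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def unordered_matches(xs, ys, same):
    """Greedy one-to-one matching that ignores statement order."""
    remaining = list(ys)
    count = 0
    for x in xs:
        for y in remaining:
            if same(x, y):
                remaining.remove(y)
                count += 1
                break
    return count


def clone_similarity(xs, ys, stmt_threshold=0.75, order_weight=0.5):
    """Blend order-sensitive and order-insensitive statement matching.

    stmt_threshold: how similar two statements must be to count as a match.
    order_weight:   1.0 rewards preserved structure only, 0.0 ignores ordering.
    """
    longest = max(len(xs), len(ys))
    if longest == 0:
        return 1.0

    def same(a, b):
        return statement_similarity(a, b) >= stmt_threshold

    ordered = ordered_matches(xs, ys, same) / longest
    unordered = unordered_matches(xs, ys, same) / longest
    return order_weight * ordered + (1 - order_weight) * unordered
```

With `order_weight=1.0` the moved `c = foo(a, b)` statement lowers the score, while `order_weight=0.0` tolerates the reordering. Raising `stmt_threshold` to 1.0 stops `e = a + c` and `e = a + d` from counting as a modified pair.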
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project – XIAO
• Tons of papers published in the past 10 years
  • 6 years of International Workshop on Software Clones (IWSC) since 2006
  • Dagstuhl Seminars
    • Software Clone Management towards Industrial Application (2012)
    • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
  • No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest, but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0day attack
5. Patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
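The search scenario differs from batch detection in that the code base is indexed once and then queried with a code snippet. The sketch below illustrates that shape only. It is an assumption-laden toy, not the deployed MSRC service: the `CloneIndex` class, its `add` and `search` methods, the whitespace-stripping `normalize` helper, and the `min_coverage` parameter are all hypothetical names introduced here.

```python
import re
from collections import defaultdict


def normalize(line):
    """Drop all whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", "", line)


class CloneIndex:
    """Toy inverted index from normalized statements to code locations."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, location, source):
        """Index every non-empty line of `source` under `location`."""
        for line in source.splitlines():
            key = normalize(line)
            if key:
                self._postings[key].add(location)

    def search(self, snippet, min_coverage=0.6):
        """Return (location, coverage) pairs, best first.

        coverage is the fraction of distinct query statements found at a location.
        """
        keys = {normalize(line) for line in snippet.splitlines()} - {""}
        hits = defaultdict(int)
        for key in keys:
            for location in self._postings.get(key, ()):
                hits[location] += 1
        ranked = [(location, count / len(keys)) for location, count in hits.items()
                  if count / len(keys) >= min_coverage]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Because indexing happens ahead of time, each query touches only the posting lists of the statements in the snippet, which is the property that makes a near-real-time response plausible at large scale. A production service would also need approximate statement matching and a web API, both omitted here.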
Vulnerability investigation workflow
• Workflow: issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix
• Variants finding: MSRC sends a code snippet to Team A, Team B, Team C; manual & ad hoc investigation returns code clones
Vulnerability investigation workflow
• Workflow: issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix
• Variants finding: automated investigation; a code snippet goes to the clone search service, which returns code clones
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in the security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed and seven privately reported vulnerabilities were involved. One of them was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense blog about this bulletin:
"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story: felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Redwine and Riddle Study (1985)
• From idea to "the point it can be popularized and disseminated to the technical community at large"
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine, Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
15-20 years between first
publication of an idea and widespread availability in products
Technology Maturation Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
15-20 years between first
publication of an idea and widespread availability in productsShall we just stay in our comfort zone
to wait for 15-20 years for our research to (or not to) produce
practice impact How about the research that we did
15-20 years ago[Caveat donrsquot forget the need of long-termblue-sky research]
NSF Workshop on Formal Methods
bull Goal to identify the future directions in research in formal methods and its transition to industrial practice
bull The workshop brought together researchers and identified primary challenges in the field both foundational infrastructural and in transitioning ideas from research labs to developer tools
httpgotoucsdedu~rjhalaNSFWorkshop
Recently related fields (eg formal methods) have already looked into transitioning research to industrial practice Time for us to do too
December 2012
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Researcherrsquos View -SCM Impact Study Findings
bullResearchers tend to consider thathellipbull precedence
bull concepts
bull prototypes
bull are sufficient as impact and ignorehellipbull efficiency
bull usability
bull reliability
bulldismissing them as ldquoengineering common senserdquo
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
A Researchers Observation in HCI Research Community
bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
Does our research community
have similar issues
Evaluation of DesignPLldquoResearch in Programming Languagesrdquo
bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby
bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo
bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo
Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes
Why Do Some Programming Languages Live and Others Die
bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem
bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like
documentation for their language
bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
httpwwwwiredcomwiredenterprise201206berkeley-programming-languages
Wiredcom
SourcecopyC Garling
Industrial Evaluations= Real Adoption
bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties
bull Non-target users (such as students)
bull Target users but not developers of the industrial code
bull Developers of the industrial code
bull Apply one-time (hitamprun) or continuous adoption
Need to value real adoption (eg in reviewing papers)
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
MS Academic Search ldquoPointer Analysisrdquo
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search: “Clone Detection”
Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already matured after n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to tackle one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes, then isolated complex data structures, then real-world classes, the focus of our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research topics: Software Development Process, Software Systems, Software Users
• Technology pillars: Information Visualization, Analysis Algorithms, Large-scale Computing
(Figure axis labels: Vertical, Horizontal)
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real: real data, real problems, real users, real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
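The four steps named on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions, not XIAO's actual implementation: the function names (preprocess, coarse_match, fine_match, prune), the identifier masking, the shared-statement filter, and the use of Python's difflib for scoring are all hypothetical choices made for the example.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Step 1 (assumed): one normalized statement per non-empty line."""
    statements = []
    for line in source.splitlines():
        line = " ".join(line.split())  # collapse whitespace
        if line:
            # Mask every word token so that renamed variables still match.
            statements.append(re.sub(r"[A-Za-z_]\w*", "ID", line))
    return statements


def coarse_match(functions, min_shared=2):
    """Step 2 (assumed): cheap filter, keep pairs sharing enough statements."""
    index = defaultdict(set)
    for name, statements in functions.items():
        for statement in set(statements):
            index[statement].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, candidates):
    """Step 3 (assumed): order-aware similarity, only on candidate pairs."""
    return {
        (a, b): SequenceMatcher(None, functions[a], functions[b]).ratio()
        for a, b in candidates
    }


def prune(scores, threshold=0.7):
    """Step 4 (assumed): drop pairs below the similarity threshold."""
    return {pair: score for pair, score in scores.items() if score >= threshold}


def find_clones(sources, threshold=0.7):
    """Run the four steps over a mapping of function name to source text."""
    functions = {name: preprocess(text) for name, text in sources.items()}
    return prune(fine_match(functions, coarse_match(functions)), threshold)
```

Because pre-processing touches each function independently, that step could be run per source partition and the results merged before coarse matching, which is one way to read the "parallelizable based on source code partition" bullet.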
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
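The "balance between code structure and disordered statements" can be illustrated with a small tunable score. This is a sketch under assumptions, not XIAO's metric: the names normalize, ordered_similarity, unordered_similarity, and the order_weight parameter are hypothetical, and the two snippets are adapted from the slide with one statement per line and braces omitted.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize(snippet):
    """One statement per non-empty line, with all whitespace removed."""
    return ["".join(line.split()) for line in snippet.splitlines() if line.strip()]


def ordered_similarity(a, b):
    """Order-sensitive score: rewards statements that appear in the same order."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def unordered_similarity(a, b):
    """Order-insensitive score: multiset overlap of statements."""
    if not a and not b:
        return 1.0
    overlap = sum((Counter(a) & Counter(b)).values())
    return 2 * overlap / (len(a) + len(b))


def similarity(a, b, order_weight=0.5):
    """Blend both scores; order_weight=1.0 tolerates no reordering at all."""
    return (order_weight * ordered_similarity(a, b)
            + (1 - order_weight) * unordered_similarity(a, b))


FIRST = normalize("""
for (i = 0; i < n; i++)
a++;
b++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c;
""")

SECOND = normalize("""
for (i = 0; i < n; i++)
c = foo(a, b);
a++;
b++;
d = bar(a, b, c);
e = a + d;
e++;
""")

if __name__ == "__main__":
    for weight in (0.0, 0.5, 1.0):
        print(weight, round(similarity(FIRST, SECOND, weight), 3))
```

Lowering order_weight makes the moved `c = foo(a, b);` statement cost less, so the pair scores higher; raising it penalizes the reordering. A reporting threshold on the blended score would then play the role of the tuning knob described on the slide.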
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
(Timeline figure) Initial vulnerability reported in product A; security bulletin released; similar vulnerability found in product B by attackers; potential 0-day attack; patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
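The search scenario (a code snippet as input, a near-real-time answer over already-indexed codebases) can be illustrated with a token n-gram inverted index. This is a minimal sketch under assumptions, not the deployed service: the class name CloneSearchIndex, the tokenizer, the shingle size k, the scoring rule, and the file paths in the test data are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed tokenizer: identifiers, integer literals, or any single symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, k=4):
    """Set of k-token windows, so whitespace and layout do not matter."""
    tokens = TOKEN.findall(code)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneSearchIndex:
    """Inverted index from token k-grams to the locations containing them."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        """Index one piece of code ahead of time (the expensive, offline part)."""
        for gram in shingles(code, self.k):
            self.postings[gram].add(location)

    def search(self, snippet, min_score=0.5):
        """Rank locations by the fraction of the snippet's k-grams they contain."""
        grams = shingles(snippet, self.k)
        if not grams:
            return []
        hits = defaultdict(int)
        for gram in grams:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        ranked = [
            (location, count / len(grams))
            for location, count in hits.items()
            if count / len(grams) >= min_score
        ]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

The design point is that indexing cost is paid once per codebase, while each query only looks up the k-grams of the submitted snippet; lowering min_score trades precision for the completeness that matters when hunting for every copy of a vulnerable fragment.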
Vulnerability investigation workflow
• Workflow steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix
• Manual & ad hoc investigation
(Figure labels: MSRC; Team A, Team B, Team C; code snippet; code clones)
Vulnerability investigation workflow
• Workflow steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix
• Automated investigation via the clone search service
  • Completeness is the key
  • Web service API for automation
(Figure labels: code snippet; code clones)
More secure Microsoft products
• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in the security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and 7 privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a ‘Cloned Code Detection’ system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once, everywhere it was copied
• Finding refactoring opportunities
Summary
• Mindset change needed for our community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Industrial Evaluations= Real Adoption
bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties
bull Non-target users (such as students)
bull Target users but not developers of the industrial code
bull Developers of the industrial code
bull Apply one-time (hitamprun) or continuous adoption
Need to value real adoption (eg in reviewing papers)
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
MS Academic Search ldquoPointer Analysisrdquo
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research topics (vertical): Software Development Process, Software Systems, Software Users
• Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real: real data, real problems, real users, real tools
Creating real impact
Code Clone Analysis [Dang et al., ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al., ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199 http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
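The four step names on this slide can be read as a filter pipeline. The sketch below is only an illustration of that reading, not XIAO's actual algorithm: the function names, the line-per-statement tokenisation, the shared-statement filter, and the use of Python's difflib for fine matching are all assumptions introduced here.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Step 1 (pre-processing): turn each non-empty line into a normalised token string."""
    statements = []
    for line in source.splitlines():
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
        if tokens:
            statements.append(" ".join(tokens))
    return statements


def coarse_match(functions, min_shared):
    """Step 2 (coarse matching): cheap filter keeping pairs that share enough identical statements."""
    candidates = []
    for a, b in combinations(sorted(functions), 2):
        shared = set(functions[a]) & set(functions[b])
        if len(shared) >= min_shared:
            candidates.append((a, b))
    return candidates


def fine_match(stmts_a, stmts_b):
    """Step 3 (fine matching): order-sensitive similarity in [0, 1] over statement sequences."""
    return SequenceMatcher(None, stmts_a, stmts_b).ratio()


def prune(scored_pairs, min_similarity):
    """Step 4 (pruning): drop candidate pairs whose similarity is below the threshold."""
    return [pair for pair in scored_pairs if pair[2] >= min_similarity]


def detect_clones(sources, min_shared=2, min_similarity=0.7):
    """Run the four steps over one partition (a dict: function name -> source text)."""
    functions = {name: preprocess(src) for name, src in sources.items()}
    candidates = coarse_match(functions, min_shared)
    scored = [(a, b, fine_match(functions[a], functions[b])) for a, b in candidates]
    return prune(scored, min_similarity)


# Hypothetical partition with three tiny functions (names are made up).
PARTITION = {
    "f1": "a++;\nb++;\nc = foo(a, b);\nd = bar(a, b, c);",
    "f2": "c = foo(a, b);\na++;\nb++;\nd = bar(a, b, c);",
    "f3": "return 0;",
}
clones = detect_clones(PARTITION)
```

Because detect_clones only looks at one partition, several partitions could be handed to separate workers. How XIAO partitions the source code and combines the results is not described on the slide.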
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements
Snippet 1:
    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }
Snippet 2:
    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
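One way to picture the tuning knobs listed on this slide is a metric with a statement-similarity threshold and a weight that trades code structure (statement order) against disordered statements. The sketch below is a hypothetical illustration, not the metric defined in the XIAO paper: the parameter names min_stmt_sim and order_weight, the greedy pairing, and the use of difflib are assumptions made here.

```python
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1.split(), s2.split()).ratio()


def count_unordered(a, b, min_stmt_sim):
    """Greedily pair each statement in a with its best unused partner in b, ignoring order."""
    unused = list(range(len(b)))
    matched = 0
    for stmt in a:
        if not unused:
            break
        best = max(unused, key=lambda j: statement_similarity(stmt, b[j]))
        if statement_similarity(stmt, b[best]) >= min_stmt_sim:
            unused.remove(best)
            matched += 1
    return matched


def count_ordered(a, b, min_stmt_sim):
    """Longest common subsequence length, where two statements match if similar enough."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            hit = 1 if statement_similarity(x, y) >= min_stmt_sim else 0
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + hit))
        prev = cur
    return prev[-1]


def snippet_similarity(a, b, min_stmt_sim=0.75, order_weight=0.5):
    """Blend order-preserving and order-free matches into one score in [0, 1]."""
    if not a and not b:
        return 1.0
    ordered = count_ordered(a, b, min_stmt_sim)
    unordered = count_unordered(a, b, min_stmt_sim)
    blended = order_weight * ordered + (1.0 - order_weight) * unordered
    return 2.0 * blended / (len(a) + len(b))


# The two snippets from the slide, one statement per entry, tokens separated by spaces.
SNIPPET_1 = [
    "for ( i = 0 ; i < n ; i ++ )",
    "a ++", "b ++",
    "c = foo ( a , b )",
    "d = bar ( a , b , c )",
    "e = a + c",
]
SNIPPET_2 = [
    "for ( i = 0 ; i < n ; i ++ )",
    "c = foo ( a , b )",
    "a ++", "b ++",
    "d = bar ( a , b , c )",
    "e = a + d",
    "e ++",
]
score = snippet_similarity(SNIPPET_1, SNIPPET_2)
```

Raising order_weight penalises the moved `c = foo(a, b)` statement, while raising min_stmt_sim stops `e = a + c` from pairing with the modified `e = a + d`.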
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
• Initial vulnerability reported in product A
• Security bulletin released
• Similar vulnerability found in product B by attackers
• Potential 0day attack
• Patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
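The search scenario above (a code snippet as input, many codebases indexed ahead of time, a fast answer) can be pictured as an inverted index from statements to the functions containing them. The sketch below is a deliberately simplified assumption made for illustration: it matches whitespace-normalised statements exactly, whereas the slides describe a service for near-duplicate code, and the class name CloneIndex, the min_coverage parameter, and the function identifiers are all hypothetical.

```python
from collections import Counter, defaultdict


class CloneIndex:
    """Toy inverted index: normalised statement -> ids of functions containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _normalise(line):
        # Collapse whitespace so indentation differences do not matter.
        return " ".join(line.split())

    def add(self, function_id, source):
        """Index every non-empty line of a function under its identifier."""
        for line in source.splitlines():
            stmt = self._normalise(line)
            if stmt:
                self._postings[stmt].add(function_id)

    def search(self, snippet, min_coverage=0.5):
        """Return (function_id, coverage) pairs, best first; coverage is the
        fraction of distinct query statements found in that function."""
        query = {self._normalise(line) for line in snippet.splitlines()}
        query.discard("")
        if not query:
            return []
        hits = Counter()
        for stmt in query:
            for function_id in self._postings.get(stmt, ()):
                hits[function_id] += 1
        results = [
            (function_id, count / len(query))
            for function_id, count in hits.items()
            if count / len(query) >= min_coverage
        ]
        return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical codebases and function names, for illustration only.
index = CloneIndex()
index.add("product_a/parse_font", "if (len > size)\n    return ERROR;\ncopy(dst, src, len);")
index.add("product_b/load_font", "check(src);\nif (len > size)\n    log(len);\ncopy(dst, src, len);")
index.add("product_b/draw", "draw(dst);")
QUERY = "if (len > size)\n    return ERROR;"
matches = index.search(QUERY)
```

Building the index once and answering each query with a few dictionary lookups is what makes a near-real-time response plausible in this sketch. Lowering min_coverage returns more candidates, which matters when completeness is more important than precision.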
Vulnerability investigation workflow
• Steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix
• Variants finding by manual & ad hoc investigation: MSRC passes the code snippet to Team A, Team B, and Team C, which report code clones
Vulnerability investigation workflow
• Steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix
• Variants finding by automated investigation: the code snippet goes to the clone search service, which returns code clones
  • Completeness is the key
  • Web service API for automation
More secure Microsoft products
• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching similar snippets for fixing a bug once
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Redwine and Riddle Study (1985)
• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation: Middleware
15-20 years between first publication of an idea and widespread availability in products
Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need for long-term/blue-sky research]
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
NSF Workshop on Formal Methods (December 2012)
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field, both foundational and infrastructural, as well as in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop/
Related fields (e.g., formal methods) have recently looked into transitioning research to industrial practice. Time for us to do so too.
Researcher’s View – SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• …are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• …dismissing them as “engineering common sense”
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher’s Observation in HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece, or create a new interaction technique and test it tightly in 8-12 weeks, and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/ Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations that apply tools to industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind, PASTE’01]
“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa/
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, simple solutions tend to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan, ISSTA’02], KLEE [Cadar et al., OSDI’08]
http://dl.acm.org/citation.cfm?id=566187 http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or to tackle one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice (not consulting/working with industry)
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al., ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al., DSN’09]
  • Coverity [Bessey et al., CACM’10] vs. MSRA XIAO [Dang et al., ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374 http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
[Figure: timeline]
1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0day attack
5. Patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
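The difference between the search scenario and batch detection can be illustrated with a small inverted index: the codebases are indexed once, and each query is a code snippet that must be answered quickly. The sketch below is an assumption-laden toy, not the service transferred to MSRC: the class name `CloneIndex`, the token 3-gram shingling, and the overlap threshold are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def shingles(code, k=3):
    """Tokenize code and return the set of its token k-grams."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Toy inverted index from token k-grams to the locations containing them."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        """Index one code unit (for example a function) under a location label."""
        for gram in shingles(code, self.k):
            self.postings[gram].add(location)

    def search(self, snippet, min_overlap=0.5):
        """Return (location, overlap) pairs covering enough of the snippet's k-grams."""
        grams = shingles(snippet, self.k)
        if not grams:
            return []
        hits = Counter()
        for gram in grams:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        ranked = [(loc, n / len(grams)) for loc, n in hits.items()
                  if n / len(grams) >= min_overlap]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

A query only touches the posting lists of the snippet's own k-grams, which is what makes a near-real-time response plausible at large scale; a production service would additionally need normalization of identifiers and a fine-grained matching pass over the candidates.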
Vulnerability investigation workflow
[Figure: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. For variants finding, MSRC sends a code snippet to Team A, Team B, and Team C; their manual & ad hoc investigation returns code clones.]
Vulnerability investigation workflow
[Figure: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. Variants finding becomes an automated investigation: a code snippet goes to the clone search service, which returns code clones.]
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
• Microsoft Security Research & Defense Blog about this bulletin:
  “However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once, everywhere it appears
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Redwine and Riddle Study (1985)
• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine Jr., William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation: Middleware
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
• 15-20 years between first publication of an idea and widespread availability in products
• Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need for long-term/blue-sky research]
NSF Workshop on Formal Methods (December 2012)
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field, both foundational, infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
Researcher’s View - SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher’s Observation in HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” (PHP, JavaScript, Python, Ruby)
• “One striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them!”
• “Reverse the trend of placing software research under the auspices of science and engineering [alone].”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages
Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful.
• Sometimes designers fail with the simplest of things, like documentation for their language.
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations apply tools to industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?
Need to value real adoption (e.g., in reviewing papers)
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
• “During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
• Section 4.3, Designing an Analysis for a Client’s Needs: “Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps.
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava&Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or address them one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → Isolated complex data structures → Real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin&Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin&Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: Device driver verification
    • MSR SAGE: Security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah&Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
• Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research Topics (vertical)
  • Software Development Process
  • Software Systems
  • Software Users
• Technology Pillars (horizontal)
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools
Creating real impact
• Code Clone Analysis [Dang et al. ACSAC’12]
  • Detecting near-duplicated code
  • Released with Visual Studio 2012
• StackMine [Han et al. ICSE’12]
  • Performance debugging in the large via mining millions of stack traces
  • Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
ICSE 2013 SEIP 49
1 Clone navigation based on source tree hierarchy
2 Pivoting of folder level statistics
3 Folder level statistics
4 Clone function list in selected folder
5 Clone function filters
6 Sorting by bug or refactoring potential
7 Tagging
1 2 3 4 5 6
7
1 Block correspondence
2 Block types
3 Block navigation
4 Copying
5 Bug filing
6 Tagging
1
2
3
4
1
6
5
How to navigate through the large number of detected clones
How to quickly review a pair of clones
Collaboration
bull Collaboration models
bull Communication
bull Champion in product teams
bull Getting engineering support
ICSE 2013 SEIP 50
Collaboration models
ICSE 2013 SEIP 51
Pull
Push
Join
Communication ndash getting connected
bull Reaching-out to practitioners
bull Understanding their business
bull Speaking practitionersrsquo languages
bull Finding out their pain pointsbull Understanding their scenarios
bull Experiencing their pain
bull Articulating their problems
ICSE 2013 SEIP 52
Communication ndash forming partnership
bull Finding and defining shared goals
bull Setting the right expectation
bull Building a roadmap
bull Forming virtual team (creating an email alias)
bull Adopting a milestone approach
bull Conducting regular sync-up
ICSE 2013 SEIP 53
Example project ndash XIAO
bull Tons of papers published in the past 10 years
bull 6 years of International Workshop on Software Clones (IWSC) since 2006
bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)
bull Duplication Redundancy and Similarity in Software (2006)
bull No code clone analysis tools in MS
bull No product offering
ICSE 2013 SEIP 54
Source httpwwwdagstuhlde12071
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Motivation
bull Copy-and-paste is a common developer behavior
bull A real tool widely adopted internally and externally
ICSE 2013 SEIP 55
Reaching out (1)
bull Demonstrating XIAO at TechFest
bull Posting XIAO at internal website
bull Active ldquosellingrdquo to various teams
bull What we gainedbull Opportunities to run XIAO on different codebases and
produce rich results
bull Feedback to improve both algorithm and system
bull Expanded network
ICSE 2013 SEIP 56
Reaching out (2)
bull What did not land well internallybull Wide interest but no concrete takers
bull Why no takersbull What exactly is the valuable proposition
bull Long way to go from code clones to bugs
bull High cost for code refactoring
bull Product prioritization
bull Lessons learnedbull Killer scenarios needed for value proposition
bull Security is a big stick
ICSE 2013 SEIP 57
Potential 0day vulnerability disclosure
ICSE 2013 SEIP 58
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC
bull Search scenario vs detection scenariobull Code snippet as input
bull Much larger scale of codebases
bull Near-real-time response
bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases
bull Deployed in used by and transferred to MSRC
bull Champion in MSRC worked with us all the waybull Providing feedback and update
bull Prompting within MSRC
ICSE 2013 SEIP 59
Microsoft Security Response Center
Vulnerability investigation workflow
ICSE 2013 SEIP 60
DesignImplementTest fix
Variants finding
Root cause investigation amp
source location
Issue reproducing
Team A
MSRC
Manual amp ad hoc investigation
Code snippet
Team B
Team C
Code clones
Vulnerability investigation workflow
ICSE 2013 SEIP 61
Clone search service
Completeness is the key Web service API for
automation
Code snippet
Code clones
Automated Investigation
Code snippet
Code clones
DesignImplementTest fix
Variants finding
Root cause investigation amp
source location
Issue reproducing
More secure Microsoft products
Automated laborious manual efforts
Faster response time, critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
ICSE Papers: Industry vs. Academia
Source: © Carlo Ghezzi
OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees
ICSM’11 Keynote; ICSE’09 Keynote
MSR’12 Keynote; MSR’11 Keynote
SCAM’12 Keynote
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Redwine and Riddle Study (1985)
• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation: Middleware
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
15-20 years between first publication of an idea and widespread availability in products
Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need for long-term/blue-sky research]
NSF Workshop on Formal Methods
December 2012
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
Researcher’s View: SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher’s Observation in the HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece, or create a new interaction technique and test it tightly in 8-12 weeks, and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages. Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful.
• Sometimes designers fail with the simplest of things, like documentation for their language.
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.
http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Wired.com. Source: © C. Garling
Industrial Evaluations= Real Adoption
• Papers on industrial studies/evaluations of applying tools on industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?
Need to value real adoption (e.g., in reviewing papers)
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses/evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or address one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes; isolated complex data structures; real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers; only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: device driver verification
  • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
  • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics (vertical): Software Users; Software Development Process; Software Systems
Technology Pillars (horizontal): Information Visualization; Analysis Algorithms; Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real: Real Data, Real Problems, Real Users, Real Tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
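The four step names on this slide can be made concrete with a toy pipeline. The sketch below is an illustration under stated assumptions, not XIAO's actual algorithm: all function names, the ID/NUM normalization, the `min_shared` and `threshold` parameters, and the use of Python's `difflib` for fine matching are hypothetical stand-ins.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

KEYWORDS = {"for", "if", "else", "while", "return"}


def preprocess(source):
    """Step 1, pre-processing: one normalized statement per non-empty line.
    Identifiers become ID and numbers NUM, so renamed copies still match."""
    statements = []
    for line in source.splitlines():
        normalized = []
        for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", line):
            if tok in KEYWORDS:
                normalized.append(tok)
            elif tok[0].isalpha() or tok[0] == "_":
                normalized.append("ID")
            elif tok.isdigit():
                normalized.append("NUM")
            else:
                normalized.append(tok)
        if normalized:
            statements.append(" ".join(normalized))
    return statements


def coarse_match(functions, min_shared=2):
    """Step 2, coarse matching: cheap candidate pairs via an inverted index."""
    index = defaultdict(set)
    for name, stmts in functions.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return sorted(pair for pair, n in shared.items() if n >= min_shared)


def fine_match(a, b):
    """Step 3, fine matching: order-aware similarity of two statement lists."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def detect_clones(sources, threshold=0.7, min_statements=3):
    """Run all four steps; step 4 (pruning) drops weak or tiny matches."""
    functions = {name: preprocess(src) for name, src in sources.items()}
    clones = []
    for a, b in coarse_match(functions):
        score = fine_match(functions[a], functions[b])
        big_enough = min(len(functions[a]), len(functions[b])) >= min_statements
        if score >= threshold and big_enough:
            clones.append((a, b, score))
    return clones


sources = {
    "f1": "x = x + 1\ny = foo(x)\nreturn y",
    "f2": "a = a + 1\nb = foo(a)\nreturn b",   # renamed copy of f1
    "f3": "while (1)\nprint(z)",               # unrelated
}
print(detect_clones(sources))
```

Pre-processing touches each source unit independently, so in a design like this the sources could be partitioned and normalized in parallel before the index is merged.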
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i ++)
    a ++;
    b ++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;

    for (i = 0; i < n; i ++)
    c = foo(a, b);
    a ++;
    b ++;
    d = bar(a, b, c);
    e = a + d;
    e ++;
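One way to read the tuning knobs above is as parameters of a scoring function over the two example snippets. The following is a hedged sketch, not XIAO's metric: `clone_similarity`, `stmt_threshold`, and `order_weight` are hypothetical names, and the blend of an order-sensitive score with an order-insensitive one is just one possible way to balance code structure against disordered statements.

```python
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1.split(), s2.split(), autojunk=False).ratio()


def clone_similarity(a, b, stmt_threshold=0.8, order_weight=0.5):
    """Blend an order-sensitive and an order-insensitive similarity.

    stmt_threshold: how alike two statements must be to count as "modified"
                    rather than inserted/deleted.
    order_weight:   1.0 scores preserved structure only; 0.0 ignores order.
    """
    if not a and not b:
        return 1.0
    # Order-sensitive part: identical statements matched in sequence.
    ordered = SequenceMatcher(None, a, b, autojunk=False).ratio()
    # Order-insensitive part: greedy one-to-one pairing of similar statements.
    unused = list(b)
    matched = 0
    for stmt in a:
        best = max(unused, key=lambda other: statement_similarity(stmt, other),
                   default=None)
        if best is not None and statement_similarity(stmt, best) >= stmt_threshold:
            matched += 1
            unused.remove(best)
    unordered = 2 * matched / (len(a) + len(b))
    return order_weight * ordered + (1 - order_weight) * unordered


# The two snippets from the slide, one statement per entry, tokens space-separated.
SNIPPET_1 = [
    "for ( i = 0 ; i < n ; i ++ )",
    "a ++ ;", "b ++ ;",
    "c = foo ( a , b ) ;",
    "d = bar ( a , b , c ) ;",
    "e = a + c ;",
]
SNIPPET_2 = [
    "for ( i = 0 ; i < n ; i ++ )",
    "c = foo ( a , b ) ;",
    "a ++ ;", "b ++ ;",
    "d = bar ( a , b , c ) ;",
    "e = a + d ;",
    "e ++ ;",
]

for weight in (1.0, 0.5, 0.0):
    print(weight, round(clone_similarity(SNIPPET_1, SNIPPET_2, order_weight=weight), 3))
```

With these inputs, structure-only scoring rates the pair lower than order-insensitive scoring, because the moved `foo` call breaks the in-order match while the modified last assignment still pairs up.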
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
[Timeline diagram labels] Initial vulnerability reported in product A; Patch release of product B; Potential 0day attack; Security bulletin released; Similar vulnerability found in product B by attackers
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
ICSE Papers Industry vs Academia
Sourcecopy Carlo Ghezzi
OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees
ICSE Papers Industry vs Academia
Sourcecopy Carlo Ghezzi
OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees
ICSM 11 KeynoteICSE 09 Keynote
MSR 12 KeynoteMSR 11 Keynote
SCAM 12 Keynote
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Redwine and Riddle Study (1985)
bull From idea to ldquothe point it can be popularized and disseminated to the technical community at largerdquobull Worst case 23 yearsbull Best case 11 yearsbull Mean 17 years
bull75 years from developed technology to wide availability
SourcecopyS L Pfleeger
Sam Redwine Jr William Riddle Software Technology Maturation In Proc ICSE 1985
Technology Maturation Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
15-20 years between first
publication of an idea and widespread availability in products
Technology Maturation Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
15-20 years between first
publication of an idea and widespread availability in productsShall we just stay in our comfort zone
to wait for 15-20 years for our research to (or not to) produce
practice impact How about the research that we did
15-20 years ago[Caveat donrsquot forget the need of long-termblue-sky research]
NSF Workshop on Formal Methods
bull Goal to identify the future directions in research in formal methods and its transition to industrial practice
bull The workshop brought together researchers and identified primary challenges in the field both foundational infrastructural and in transitioning ideas from research labs to developer tools
httpgotoucsdedu~rjhalaNSFWorkshop
Recently related fields (eg formal methods) have already looked into transitioning research to industrial practice Time for us to do too
December 2012
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Researcherrsquos View -SCM Impact Study Findings
bullResearchers tend to consider thathellipbull precedence
bull concepts
bull prototypes
bull are sufficient as impact and ignorehellipbull efficiency
bull usability
bull reliability
bulldismissing them as ldquoengineering common senserdquo
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
A Researchers Observation in HCI Research Community
bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
A Researcher's Observation in HCI Research Community
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/ Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don't always have practical objectives. There's a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what's needed to actually make it useful.
• Sometimes designers fail with the simplest of things, like documentation for their language.
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations on applying tools on industrial code: who applies the tools?
  • Authors themselves, instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run), or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven't We Solved This Problem Yet?” [Hind PASTE'01]
“During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, ‘Haven't we solved this problem yet?’ With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Section 4.3: Designing an Analysis for a Client's Needs
“Barbara Ryder expands on this topic: ‘… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.’”
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, simple solution tends to be scalable (performance, maintenance, …)
• Academia tend to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA'02], Klee [Cadar et al. OSDI'08]
http://dl.acm.org/citation.cfm?id=566187 http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or tackle one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes; isolated complex data structures; real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE'09, OOPSLA'11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN'09]
  • Coverity [Bessey et al. CACM'10] vs. MSRA XIAO [Dang et al. ACSAC'12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374 http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA'06] vs. Daikon [Ernst et al. ICSE'99]
  • Debugging user study [Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=302467 http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
  • Cost: e.g., human cost (inspecting false positives)
  • Benefit: e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: Device driver verification
  • MSR SAGE: Security testing of binaries [Godefroid et al. NDSS'08]
  • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA'10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users
Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199 http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process
  • Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
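To make the four-step shape concrete, here is a minimal sketch of a pre-processing, coarse matching, fine matching, and pruning pipeline. It is not XIAO's actual algorithm: the statement normalization, the inverted index, the use of `difflib`, the thresholds, and all function names are assumptions chosen only for illustration. Partitioning is shown for the pre-processing step alone, and a thread pool stands in for whatever parallel infrastructure a real deployment would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1: turn each function body into a list of whitespace-free statements."""
    return {
        name: ["".join(line.split()) for line in body.splitlines() if line.strip()]
        for name, body in functions.items()
    }


def coarse_match(stmts, min_shared=2):
    """Step 2: cheap candidate generation with an inverted index on statements."""
    index = defaultdict(set)
    for name, lines in stmts.items():
        for line in set(lines):
            index[line].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(stmts, candidates, threshold=0.5):
    """Step 3: costlier sequence comparison, run on the candidates only."""
    clones = {}
    for a, b in candidates:
        score = SequenceMatcher(None, stmts[a], stmts[b]).ratio()
        if score >= threshold:
            clones[(a, b)] = score
    return clones


def prune(stmts, clones, min_statements=3):
    """Step 4: drop clone pairs that are too small to be worth reporting."""
    return {
        pair: score
        for pair, score in clones.items()
        if min(len(stmts[pair[0]]), len(stmts[pair[1]])) >= min_statements
    }


def detect_clones(functions, workers=2):
    """Run the four steps; pre-processing is split across source partitions."""
    names = sorted(functions)
    partitions = [names[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(
            lambda part: preprocess({n: functions[n] for n in part}), partitions))
    stmts = {}
    for part in parts:
        stmts.update(part)
    return prune(stmts, fine_match(stmts, coarse_match(stmts)))


if __name__ == "__main__":
    example = {
        "f": "a++;\nb++;\nc = foo(a, b);\nd = bar(a, b, c);\ne = a + c;",
        "g": "c = foo(a, b);\na++;\nb++;\nd = bar(a, b, c);\ne = a + d;\ne++;",
        "h": "return 0;",
    }
    print(detect_clones(example))
```

The point of the staging is that the expensive comparison in step 3 only runs on pairs that survived the cheap filter in step 2, which is one plausible reason a simple design can scale.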
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i ++) {
        a ++;
        b ++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i ++) {
        c = foo(a, b);
        a ++;
        b ++;
        d = bar(a, b, c);
        e = a + d;
        e ++;
    }
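As a rough illustration of what tuning such a metric could look like, the sketch below compares the two loop bodies from the slide. It is an assumption-laden toy, not the metric defined in the XIAO paper: the tokenizer, the greedy statement matching, the `stmt_threshold` parameter, and the way disordered statements are counted are all hypothetical choices.

```python
import re
from difflib import SequenceMatcher


def tokens(stmt):
    """Split a statement into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", stmt)


def stmt_similarity(s, t):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, tokens(s), tokens(t)).ratio()


def match_statements(a, b, stmt_threshold):
    """Greedy one-to-one matching of statements whose similarity passes the threshold."""
    used, pairs = set(), []
    for i, s in enumerate(a):
        best_j, best_sim = None, 0.0
        for j, t in enumerate(b):
            if j in used:
                continue
            sim = stmt_similarity(s, t)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim >= stmt_threshold:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def disordered(pairs):
    """Matched statements that fall outside the longest in-order subsequence."""
    js = [j for _, j in pairs]
    lis = []
    for k, j in enumerate(js):
        lis.append(1 + max((lis[m] for m in range(k) if js[m] < j), default=0))
    return len(js) - max(lis, default=0)


def compare(a, b, stmt_threshold=0.6):
    """Summarize how two snippets differ under a given statement threshold."""
    pairs = match_statements(a, b, stmt_threshold)
    total = len(a) + len(b)
    return {
        "similarity": 2 * len(pairs) / total if total else 1.0,
        "unmatched": (len(a) - len(pairs)) + (len(b) - len(pairs)),
        "disordered": disordered(pairs),
    }


# Loop bodies of the two snippets shown on the slide.
LEFT = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
RIGHT = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]

if __name__ == "__main__":
    for threshold in (0.6, 0.9):
        print(threshold, compare(LEFT, RIGHT, threshold))
```

Raising `stmt_threshold` makes the modified statement (`e = a + c` versus `e = a + d`) count as unmatched instead of matched, so the same pair of snippets moves from near-identical to noticeably different. That is the kind of knob the slide's "tunable at fine granularity" bullet suggests.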
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
• Understanding their scenarios
• Experiencing their pain
• Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
• Initial vulnerability reported in product A
• Security bulletin released
• Similar vulnerability found in product B by attackers
• Potential 0day attack
• Patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
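The search scenario differs from batch detection in that the codebase is indexed once and each query is a single snippet. A minimal sketch of that idea follows, assuming a token n-gram inverted index. The `CloneIndex` class, its methods, the `min_overlap` parameter, and the file locations are hypothetical names for illustration and do not describe the service actually deployed at MSRC.

```python
import re
from collections import Counter, defaultdict


def shingles(code, n=3):
    """Set of token n-grams for a piece of code."""
    toks = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


class CloneIndex:
    """Toy inverted index: token n-gram -> locations that contain it."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, location, code):
        """Index one code fragment under a location label (done ahead of time)."""
        for gram in shingles(code, self.n):
            self.postings[gram].add(location)

    def search(self, snippet, min_overlap=0.6):
        """Return (overlap, location) pairs sharing enough n-grams with the snippet."""
        grams = shingles(snippet, self.n)
        if not grams:
            return []
        hits = Counter()
        for gram in grams:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        ranked = [(count / len(grams), location) for location, count in hits.items()]
        return sorted((r for r in ranked if r[0] >= min_overlap), reverse=True)


if __name__ == "__main__":
    index = CloneIndex()
    index.add("productA/parse.c", "if (len > size) return; memcpy(dst, src, len);")
    index.add("productB/render.c", "if (n > size) return; memcpy(out, src, n);")
    index.add("productC/util.c", "return a + b;")
    query = "if (len > size) return; memcpy(dst, src, len);"
    print(index.search(query, min_overlap=0.3))
```

Because a query only touches the posting lists for its own n-grams, lookup cost depends on the snippet rather than on the total indexed code, which is one way a near-real-time response over very large codebases can be approached.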
Vulnerability investigation workflow
[Figure: MSRC workflow: Issue reproducing, Root cause investigation & source location, Variants finding, Design/Implement/Test fix]
• Manual & ad hoc investigation: MSRC sends a code snippet to Team A, Team B, and Team C, which report code clones back
• Automated investigation: the code snippet goes to the clone search service, which returns code clones
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a ‘Cloned Code Detection’ system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story: felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Redwine and Riddle Study (1985)
• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine, Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation: Middleware
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
15-20 years between first publication of an idea and widespread availability in products
Shall we just stay in our comfort zone to wait for 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don't forget the need for long-term/blue-sky research]
NSF Workshop on Formal Methods (December 2012)
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop/
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Researcherrsquos View -SCM Impact Study Findings
bullResearchers tend to consider thathellipbull precedence
bull concepts
bull prototypes
bull are sufficient as impact and ignorehellipbull efficiency
bull usability
bull reliability
bulldismissing them as ldquoengineering common senserdquo
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
A Researchers Observation in HCI Research Community
bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
Does our research community
have similar issues
Evaluation of DesignPLldquoResearch in Programming Languagesrdquo
bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby
bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo
bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo
Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes
Why Do Some Programming Languages Live and Others Die
bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem
bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like
documentation for their language
bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
httpwwwwiredcomwiredenterprise201206berkeley-programming-languages
Wiredcom
SourcecopyC Garling
Industrial Evaluations= Real Adoption
bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties
bull Non-target users (such as students)
bull Target users but not developers of the industrial code
bull Developers of the industrial code
bull Apply one-time (hitamprun) or continuous adoption
Need to value real adoption (eg in reviewing papers)
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
MS Academic Search ldquoPointer Analysisrdquo
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to tackle one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes; isolated complex data structures; real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12] / PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374 http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467 http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: device driver verification
  • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
  • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics: Software Development Process, Software Systems, Software Users
Technology Pillars: Information Visualization, Analysis Algorithms, Large-scale Computing
Vertical / Horizontal
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199 http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
ICSE 2013 SEIP 42
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development
  • Early adoption
  • Technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
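The slide names the four stages but not their internals. As a rough illustration only, the sketch below wires four toy stages together in Python. Everything inside each stage is an assumption made for this example (line-based statements, an inverted index for the coarse filter, Jaccard overlap for the fine filter, a minimum-length rule for pruning), and the names `preprocess`, `coarse_match`, `fine_match`, `prune`, and `detect_clones` are hypothetical, not XIAO's API.

```python
import re
from collections import defaultdict
from itertools import combinations


def preprocess(source):
    """Step 1 (pre-processing): one normalized token string per non-empty line."""
    return [" ".join(re.findall(r"\w+|[^\w\s]", line))
            for line in source.splitlines() if line.strip()]


def coarse_match(functions, min_shared=2):
    """Step 2 (coarse matching): cheap candidate filter via an inverted index.

    Returns pairs of names sharing at least `min_shared` identical statements.
    """
    index = defaultdict(set)
    for name, statements in functions.items():
        for stmt in set(statements):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


def fine_match(functions, candidates, min_similarity=0.5):
    """Step 3 (fine matching): Jaccard overlap of statement sets (assumed metric)."""
    clones = []
    for a, b in candidates:
        sa, sb = set(functions[a]), set(functions[b])
        similarity = len(sa & sb) / len(sa | sb)
        if similarity >= min_similarity:
            clones.append((a, b, similarity))
    return clones


def prune(functions, clones, min_statements=3):
    """Step 4 (pruning): drop clones between very short code units."""
    return [(a, b, s) for a, b, s in clones
            if min(len(functions[a]), len(functions[b])) >= min_statements]


def detect_clones(sources):
    # Each unit is pre-processed independently of the others, so this step
    # could be distributed per partition of the source code.
    functions = {name: preprocess(text) for name, text in sources.items()}
    return prune(functions, fine_match(functions, coarse_match(functions)))


# Hypothetical input: three tiny code units keyed by name.
SOURCES = {
    "f": "a++;\nb++;\nc = foo(a, b);\nd = bar(a, b, c);",
    "g": "a++;\nb++;\nc = foo(a, b);\nd = baz(a);",
    "h": "return 0;",
}
```

Calling `detect_clones(SOURCES)` reports `f` and `g` as a clone pair and ignores `h`. The point of the staging is that the cheap coarse filter keeps the more expensive fine comparison away from most pairs.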
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
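To make the tuning idea concrete, here is a minimal Python sketch of a statement-level similarity with one tunable threshold, applied to the two loop bodies on the slide. It is not XIAO's actual metric: the greedy, order-insensitive pairing, the use of `difflib.SequenceMatcher`, and the names `statement_similarity`, `snippet_similarity`, and `stmt_threshold` are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def tokens(statement):
    """Split a statement into identifiers, numbers, '++', and single symbols."""
    return re.findall(r"\w+|\+\+|[^\w\s]", statement)


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, between 0.0 and 1.0."""
    return SequenceMatcher(None, tokens(s1), tokens(s2)).ratio()


def snippet_similarity(left, right, stmt_threshold=0.8):
    """Share of statements that find a partner in the other snippet.

    Statements are paired greedily and regardless of position, so reordered
    statements still count. `stmt_threshold` is the tuning knob: it decides
    how different two statements may be and still be treated as a match.
    """
    unmatched = list(right)
    matched = 0
    for stmt in left:
        if not unmatched:
            break
        best = max(unmatched, key=lambda other: statement_similarity(stmt, other))
        if statement_similarity(stmt, best) >= stmt_threshold:
            matched += 1
            unmatched.remove(best)
    total = len(left) + len(right)
    return 2 * matched / total if total else 1.0


# The two snippets from the slide, one statement per entry.
LEFT = [
    "for (i = 0; i < n; i++) {",
    "a++;",
    "b++;",
    "c = foo(a, b);",
    "d = bar(a, b, c);",
    "e = a + c;",
]
RIGHT = [
    "for (i = 0; i < n; i++) {",
    "c = foo(a, b);",
    "a++;",
    "b++;",
    "d = bar(a, b, c);",
    "e = a + d;",
    "e++;",
]
```

With a looser `stmt_threshold`, the modified statement (`e = a + c` vs. `e = a + d`) still counts as a match and the overall score rises; with a stricter one it does not. The inserted `e++` lowers the score in both cases, while the moved `c = foo(a, b)` costs nothing because pairing ignores order.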
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
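The search scenario differs from batch detection in that a single snippet is the query and the codebase is indexed ahead of time. One common way to get fast lookups of this kind is an inverted index over token n-grams. The sketch below assumes that design purely for illustration; the slides do not say how the MSRC service is built. The class `CloneIndex`, its `add` and `search` methods, the `min_coverage` parameter, and the file names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def shingles(code, k=3):
    """Set of k-token windows ("shingles") over a piece of code."""
    toks = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


class CloneIndex:
    """Toy inverted index: shingle -> locations that contain it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        """Index one code unit under a caller-chosen location label."""
        for gram in shingles(code, self.k):
            self.postings[gram].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best first.

        Coverage is the fraction of the query's shingles found at a location.
        """
        query = shingles(snippet, self.k)
        if not query:
            return []
        hits = Counter()
        for gram in query:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        ranked = [(loc, n / len(query)) for loc, n in hits.items()
                  if n / len(query) >= min_coverage]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical codebases: product B holds a renamed copy of product A's code.
index = CloneIndex()
index.add("productA/parse.c", "if (len > size) return; memcpy(dst, src, len);")
index.add("productB/parse.c", "if (n > size) return; memcpy(dst, src, n);")
index.add("productC/util.c", "return a + b;")
results = index.search("memcpy(dst, src, len);")
```

Here `results` ranks the exact copy in product A first and the renamed variant in product B second, while product C is not returned. Because the index is built once and each query only touches the postings of its own shingles, this shape suits the near-real-time requirement named on the slide.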
Vulnerability investigation workflow
Stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix
Manual & ad hoc investigation: MSRC, Team A, Team B, Team C; code snippet in, code clones out
Automated investigation: clone search service; code snippet in, code clones out
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034: combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Redwine and Riddle Study (1985)
• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation: Middleware
15-20 years between first publication of an idea and widespread availability in products
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need for long-term/blue-sky research]
NSF Workshop on Formal Methods (December 2012)
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop
Related fields (e.g., formal methods) have recently looked into transitioning research to industrial practice. Time for us to do so too.
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Researcher’s View: SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher’s Observation in HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” (PHP, JavaScript, Python, Ruby)
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them!”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/ Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academia to try to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations that apply tools to industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tons of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
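The four steps named on this slide can be illustrated with a minimal Python sketch. This is not XIAO's implementation: the slide gives only the step names, so the function names (`preprocess`, `coarse_match`, `fine_match`, `detect_clones`), the identifier normalisation, the shared-statement heuristic, and the thresholds are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

KEYWORDS = {"for", "while", "if", "else", "return"}  # assumed, deliberately tiny


def preprocess(body):
    """Step 1 (pre-processing): one normalised token string per non-empty line."""
    statements = []
    for line in body.splitlines():
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
        if tokens:
            # Identifiers become "ID" so that renamed variables still match.
            statements.append(" ".join(
                t if t in KEYWORDS or not (t[0].isalpha() or t[0] == "_") else "ID"
                for t in tokens))
    return statements


def coarse_match(functions, min_shared=2):
    """Step 2 (coarse matching): cheap candidate pairs via an inverted index."""
    index = defaultdict(set)
    for name, stmts in functions.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return sorted(p for p, n in shared.items() if n >= min_shared)


def fine_match(stmts_a, stmts_b):
    """Step 3 (fine matching): ordered statement-level similarity in [0, 1]."""
    return SequenceMatcher(None, stmts_a, stmts_b).ratio()


def detect_clones(sources, threshold=0.6):
    """Run all four steps over a dict {function name: source text}."""
    functions = {name: preprocess(body) for name, body in sources.items()}
    scored = [(a, b, fine_match(functions[a], functions[b]))
              for a, b in coarse_match(functions)]
    # Step 4 (pruning): keep only candidate pairs at or above the threshold.
    return [(a, b, round(s, 2)) for a, b, s in scored if s >= threshold]
```

Because `preprocess` looks at one function at a time, a partition of the source tree could be pre-processed independently on separate workers, which is one plausible reading of "parallelizable based on source code partition".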
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements
Example pair of code snippets:
    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
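To make the tuning knobs concrete, here is a small Python sketch applied to the loop bodies of the two snippets on this slide. It is an illustration under stated assumptions, not XIAO's actual metric: the parameter names `min_stmt_sim` and `max_diff_ratio`, the greedy matching, and the choice to ignore statement order entirely are all hypothetical simplifications.

```python
from difflib import SequenceMatcher


def statement_similarity(s, t):
    """Token-level similarity of two whitespace-tokenised statements, in [0, 1]."""
    return SequenceMatcher(None, s.split(), t.split()).ratio()


def is_clone(a, b, min_stmt_sim=0.8, max_diff_ratio=0.4):
    """Decide whether statement lists a and b form a clone pair.

    min_stmt_sim   -- how similar two statements must be to count as the same
    max_diff_ratio -- tolerated fraction of unmatched (inserted/deleted) statements
    Statement order is ignored, so reordered statements still match.
    """
    unmatched_b = list(b)
    matched = 0
    for s in a:
        best = max(unmatched_b, key=lambda t: statement_similarity(s, t), default=None)
        if best is not None and statement_similarity(s, best) >= min_stmt_sim:
            unmatched_b.remove(best)
            matched += 1
    differing = (len(a) - matched) + len(unmatched_b)
    return differing / max(len(a), len(b)) <= max_diff_ratio


# Loop bodies of the two snippets above, pre-tokenised with spaces.
first = ["a ++", "b ++", "c = foo ( a , b )", "d = bar ( a , b , c )", "e = a + c"]
second = ["c = foo ( a , b )", "a ++", "b ++", "d = bar ( a , b , c )", "e = a + d", "e ++"]
```

With the default settings, `is_clone(first, second)` accepts the pair: the reordered statements match exactly, `e = a + c` is treated as a modified form of `e = a + d`, and only `e ++` is left over. Requiring exact statements (`min_stmt_sim=1.0`) rejects the pair unless `max_diff_ratio` is raised to 0.5, which is the kind of trade-off the slide's title points at.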
Explorability
How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
Timeline: initial vulnerability reported in product A → security bulletin released → similar vulnerability found in product B by attackers → potential 0-day attack → patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and update
  • Promoting within MSRC
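The search scenario differs from detection in that the codebases are indexed once and each query is a single code snippet. A minimal Python sketch of that idea follows. The slides do not describe how the MSRC service is built, so the `CloneIndex` class, the line-window "shingles", and the file names in the usage note are hypothetical.

```python
from collections import defaultdict


def shingles(lines, k=2):
    """Windows of k consecutive non-empty lines, with whitespace normalised."""
    stmts = [" ".join(line.split()) for line in lines if line.strip()]
    return [tuple(stmts[i:i + k]) for i in range(len(stmts) - k + 1)]


class CloneIndex:
    """Toy inverted index from line shingles to the locations containing them."""

    def __init__(self, k=2):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, source):
        """Index one source file (or function) under a location label."""
        for shingle in shingles(source.splitlines(), self.k):
            self.postings[shingle].add(location)

    def search(self, snippet, min_hits=1):
        """Return locations sharing at least min_hits shingles with the snippet."""
        hits = defaultdict(int)
        for shingle in shingles(snippet.splitlines(), self.k):
            for location in self.postings.get(shingle, ()):
                hits[location] += 1
        return sorted((loc for loc, n in hits.items() if n >= min_hits),
                      key=lambda loc: (-hits[loc], loc))
```

For example, after indexing `productA/font.c` and `productB/render.c` (made-up names) that both contain the same two-line bounds-check fragment, searching for that fragment returns both locations. Indexing up front is what makes a near-real-time response plausible at query time; an exact-shingle index like this one would miss modified clones, which a production service would need to handle.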
Vulnerability investigation workflow
Before: issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix. For variants finding, MSRC sends a code snippet to Team A, Team B, and Team C, whose manual & ad hoc investigation returns code clones.
With the clone search service: the same four steps, but variants finding becomes an automated investigation in which a code snippet is sent to the clone search service and code clones are returned.
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching similar snippets for fixing bug once
• Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Redwine and Riddle Study (1985)
• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source © S. L. Pfleeger
Sam Redwine Jr., William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation: Middleware
Source © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
15-20 years between first publication of an idea and widespread availability in products
Shall we just stay in our comfort zone to wait for 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need of long-term/blue-sky research]
NSF Workshop on Formal Methods, December 2012
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
Researcher’s View – SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”
Source © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher’s Observation in HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/ Source © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.
http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Wired.com, Source © C. Garling
Industrial Evaluations= Real Adoption
• Papers on industrial studies/evaluations on applying tools on industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?
Need to value real adoption (e.g., in reviewing papers)
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of efforts on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tend to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or tackle one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real-world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
ICSE 2013 SEIP 49
1 Clone navigation based on source tree hierarchy
2 Pivoting of folder level statistics
3 Folder level statistics
4 Clone function list in selected folder
5 Clone function filters
6 Sorting by bug or refactoring potential
7 Tagging
1 2 3 4 5 6
7
1 Block correspondence
2 Block types
3 Block navigation
4 Copying
5 Bug filing
6 Tagging
1
2
3
4
1
6
5
How to navigate through the large number of detected clones
How to quickly review a pair of clones
Collaboration
bull Collaboration models
bull Communication
bull Champion in product teams
bull Getting engineering support
ICSE 2013 SEIP 50
Collaboration models
ICSE 2013 SEIP 51
Pull
Push
Join
Communication ndash getting connected
bull Reaching-out to practitioners
bull Understanding their business
bull Speaking practitionersrsquo languages
bull Finding out their pain pointsbull Understanding their scenarios
bull Experiencing their pain
bull Articulating their problems
ICSE 2013 SEIP 52
Communication ndash forming partnership
bull Finding and defining shared goals
bull Setting the right expectation
bull Building a roadmap
bull Forming virtual team (creating an email alias)
bull Adopting a milestone approach
bull Conducting regular sync-up
ICSE 2013 SEIP 53
Example project ndash XIAO
bull Tons of papers published in the past 10 years
bull 6 years of International Workshop on Software Clones (IWSC) since 2006
bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)
bull Duplication Redundancy and Similarity in Software (2006)
bull No code clone analysis tools in MS
bull No product offering
ICSE 2013 SEIP 54
Source httpwwwdagstuhlde12071
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Motivation
bull Copy-and-paste is a common developer behavior
bull A real tool widely adopted internally and externally
ICSE 2013 SEIP 55
Reaching out (1)
bull Demonstrating XIAO at TechFest
bull Posting XIAO at internal website
bull Active ldquosellingrdquo to various teams
bull What we gainedbull Opportunities to run XIAO on different codebases and
produce rich results
bull Feedback to improve both algorithm and system
bull Expanded network
ICSE 2013 SEIP 56
Reaching out (2)
bull What did not land well internallybull Wide interest but no concrete takers
bull Why no takersbull What exactly is the valuable proposition
bull Long way to go from code clones to bugs
bull High cost for code refactoring
bull Product prioritization
bull Lessons learnedbull Killer scenarios needed for value proposition
bull Security is a big stick
ICSE 2013 SEIP 57
Potential 0day vulnerability disclosure
ICSE 2013 SEIP 58
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
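The search scenario above (a code snippet as input, a large pre-built index, a fast answer) can be illustrated with a small inverted index. This is only a sketch under stated assumptions: the class `CloneSearchIndex`, the statement-pair "shingles", the `min_overlap` parameter, and the file names are all hypothetical and are not taken from the XIAO paper or the MSRC service.

```python
from collections import defaultdict


class CloneSearchIndex:
    """Toy inverted index from runs of n consecutive statements to locations.

    Hypothetical illustration only; not the XIAO index format.
    """

    def __init__(self, n=2):
        self.n = n
        self.postings = defaultdict(set)

    def _shingles(self, lines):
        # Collapse whitespace so formatting differences do not matter.
        stmts = [" ".join(line.split()) for line in lines if line.strip()]
        return {tuple(stmts[i:i + self.n])
                for i in range(len(stmts) - self.n + 1)}

    def add(self, location, lines):
        """Index one code unit (done ahead of time, once per codebase)."""
        for shingle in self._shingles(lines):
            self.postings[shingle].add(location)

    def search(self, snippet_lines, min_overlap=0.5):
        """Return (location, score) pairs sharing enough shingles with the snippet."""
        query = self._shingles(snippet_lines)
        if not query:
            return []
        hits = defaultdict(int)
        for shingle in query:
            for location in self.postings.get(shingle, ()):
                hits[location] += 1
        ranked = sorted(((count / len(query), loc) for loc, count in hits.items()),
                        reverse=True)
        return [(loc, score) for score, loc in ranked if score >= min_overlap]


index = CloneSearchIndex()
index.add("productA/parse.c", ["len = read_len(buf);", "p = alloc(len);",
                               "copy(p, buf, len);", "return p;"])
index.add("productB/parse.c", ["n = 0;", "len = read_len(buf);", "p = alloc(len);",
                               "copy(p, buf, len);", "log(p);"])
index.add("productC/util.c", ["x = 1;", "return x;"])
results = index.search(["len = read_len(buf);", "p = alloc(len);",
                        "copy(p, buf, len);"])
```

Because the index is built ahead of time, a query touches only the posting lists of its own shingles, which is what makes a near-real-time response plausible at large scale.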
Vulnerability investigation workflow
[Figure: workflow stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Labels: MSRC; Team A; Team B; Team C; Code snippet; Code clones; Manual & ad hoc investigation]
Vulnerability investigation workflow
[Figure: workflow stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Labels: Clone search service; Code snippet; Code clones; Automated investigation. Callouts: Completeness is the key; Web service API for automation]
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in a security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
[Figure labels: Searching similar snippets for fixing a bug once; Finding refactoring opportunities]
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Redwine and Riddle Study (1985)
• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.
Technology Maturation: Middleware
15-20 years between first publication of an idea and widespread availability in products
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need for long-term/blue-sky research]
NSF Workshop on Formal Methods (December 2012)
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Researcher’s View: SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher’s Observation in HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful
• Sometimes designers fail with the simplest of things, like documentation for their language
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations on applying tools on industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
  • Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], KLEE [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to address them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research topics (vertical)
  • Software Users
  • Software Development Process
  • Software Systems
• Technology pillars (horizontal)
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
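The four step names above can be made concrete with a toy pipeline. The slide gives only the names, so everything inside each step here is an assumption for illustration: identifier normalization for pre-processing, a shared-statement inverted index for coarse matching, `difflib.SequenceMatcher` for fine matching, and a similarity threshold for pruning. The function names and parameters (`min_shared`, `threshold`) are hypothetical, not XIAO's.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

KEYWORDS = {"for", "if", "else", "while", "return"}


def preprocess(source):
    """Step 1 (assumed): normalize each line into a token-abstracted statement."""
    statements = []
    for line in source.splitlines():
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
        if not tokens:
            continue
        normalized = []
        for tok in tokens:
            if tok in KEYWORDS:
                normalized.append(tok)
            elif tok[0].isalpha() or tok[0] == "_":
                normalized.append("ID")    # identifiers become a placeholder
            elif tok.isdigit():
                normalized.append("NUM")   # literals become a placeholder
            else:
                normalized.append(tok)
        statements.append(" ".join(normalized))
    return statements


def coarse_match(functions, min_shared=2):
    """Step 2 (assumed): cheap candidate pairs via shared normalized statements."""
    index = defaultdict(set)
    for name, stmts in functions.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, pairs):
    """Step 3 (assumed): order-aware similarity on candidate pairs only."""
    return {(a, b): SequenceMatcher(None, functions[a], functions[b]).ratio()
            for a, b in pairs}


def prune(scored, threshold=0.6):
    """Step 4 (assumed): drop pairs below the similarity threshold."""
    return {pair: score for pair, score in scored.items() if score >= threshold}


def detect_clones(sources, min_shared=2, threshold=0.6):
    functions = {name: preprocess(text) for name, text in sources.items()}
    return prune(fine_match(functions, coarse_match(functions, min_shared)),
                 threshold)


clones = detect_clones({
    "f1": "a++\nb++\nc = foo(a, b)\nd = bar(a, b, c)",
    "f2": "x++\ny++\nz = foo(x, y)\nw = bar(x, y, z)\nw++",
    "f3": "return 0",
})
```

In this sketch, pre-processing is independent per function, and coarse matching could be run per partition of the source tree and merged afterwards, which is one way to read "parallelizable based on source code partition".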
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++)
a++;
b++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c;

for (i = 0; i < n; i++)
c = foo(a, b);
a++;
b++;
d = bar(a, b, c);
e = a + d;
e++;
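One way to picture a tunable metric over the two snippets above is to count matching, inserted, deleted, and modified statements and compare the changed share against a threshold. The sketch below is an assumption for illustration, not XIAO's metric: it uses Python's `difflib.SequenceMatcher` to align statements, and `edit_profile`, `is_clone`, and `max_changed_ratio` are hypothetical names.

```python
from difflib import SequenceMatcher


def edit_profile(left, right):
    """Count statements that match, are modified, inserted, or deleted."""
    counts = {"equal": 0, "replace": 0, "insert": 0, "delete": 0}
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            counts["equal"] += i2 - i1
        elif tag == "replace":
            counts["replace"] += max(i2 - i1, j2 - j1)
        elif tag == "insert":
            counts["insert"] += j2 - j1
        else:  # "delete"
            counts["delete"] += i2 - i1
    return counts


def is_clone(left, right, max_changed_ratio=0.5):
    """Tunable decision: accept the pair if the changed share is small enough."""
    counts = edit_profile(left, right)
    changed = counts["replace"] + counts["insert"] + counts["delete"]
    total = changed + counts["equal"]
    return total > 0 and changed / total <= max_changed_ratio


left = ["for (i = 0; i < n; i++)", "a++;", "b++;",
        "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
right = ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
         "d = bar(a, b, c);", "e = a + d;", "e++;"]

loose = is_clone(left, right, max_changed_ratio=0.6)
strict = is_clone(left, right, max_changed_ratio=0.1)
```

Raising or lowering `max_changed_ratio` changes which pairs are reported, which is the sense in which the result follows the tuning. A sequence alignment like this treats a reordered statement as one deletion plus one insertion; handling disordered statements more gracefully would need a different alignment.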
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Technology Maturation Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
15-20 years between first
publication of an idea and widespread availability in products
Technology Maturation Middleware
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
15-20 years between first
publication of an idea and widespread availability in productsShall we just stay in our comfort zone
to wait for 15-20 years for our research to (or not to) produce
practice impact How about the research that we did
15-20 years ago[Caveat donrsquot forget the need of long-termblue-sky research]
NSF Workshop on Formal Methods
bull Goal to identify the future directions in research in formal methods and its transition to industrial practice
bull The workshop brought together researchers and identified primary challenges in the field both foundational infrastructural and in transitioning ideas from research labs to developer tools
httpgotoucsdedu~rjhalaNSFWorkshop
Recently related fields (eg formal methods) have already looked into transitioning research to industrial practice Time for us to do too
December 2012
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Researcherrsquos View -SCM Impact Study Findings
bullResearchers tend to consider thathellipbull precedence
bull concepts
bull prototypes
bull are sufficient as impact and ignorehellipbull efficiency
bull usability
bull reliability
bulldismissing them as ldquoengineering common senserdquo
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
A Researcher’s Observation in HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL
“Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/. Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful
• Sometimes designers fail with the simplest of things, like documentation for their language
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/. Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations of applying tools to industrial code: who applies the tools?
  • Authors themselves, instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, ‘Haven’t we solved this problem yet?’ With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: ‘… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.’”
[Hind PASTE’01]. Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa/
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA’02], KLEE [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or to tackle them one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real-world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users
Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
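The slide names the four stages and the partition-based parallelism but not their internals. The Python sketch below shows one way a pipeline of that shape could be organized. Every stage body here (line normalization, shared-statement filtering, `difflib` scoring, a minimum-size rule) and every name is an illustrative assumption, not XIAO's actual algorithm. For brevity it only compares functions within the same partition, so clones that span partitions are missed.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Stage 1 (assumed): normalize a function body into stripped, non-blank lines."""
    return tuple(line.strip() for line in source.splitlines() if line.strip())


def coarse_match(functions):
    """Stage 2 (assumed): cheap filter keeping pairs that share many statements."""
    candidates = []
    for (n1, s1), (n2, s2) in combinations(functions.items(), 2):
        shared = len(set(s1) & set(s2))
        if shared * 2 >= min(len(set(s1)), len(set(s2))):
            candidates.append((n1, n2))
    return candidates


def fine_match(functions, candidates, threshold):
    """Stage 3 (assumed): score each candidate pair with a sequence similarity."""
    scored = []
    for n1, n2 in candidates:
        score = SequenceMatcher(None, functions[n1], functions[n2]).ratio()
        if score >= threshold:
            scored.append((n1, n2, score))
    return scored


def prune(clones, functions, min_statements=3):
    """Stage 4 (assumed): drop clones between functions too small to matter."""
    return [c for c in clones
            if min(len(functions[c[0]]), len(functions[c[1]])) >= min_statements]


def analyze_partition(partition, threshold=0.6):
    """Run all four stages on one partition (a dict of name -> source text)."""
    functions = {name: preprocess(src) for name, src in partition.items()}
    candidates = coarse_match(functions)
    return prune(fine_match(functions, candidates, threshold), functions)


def analyze(partitions, workers=4):
    """Analyze partitions independently, then concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = list(pool.map(analyze_partition, partitions))
    return [clone for clones in per_partition for clone in clones]


# Hypothetical input: two partitions of a tiny codebase.
partitions = [
    {"f": "a = 1\nb = 2\nc = a + b\nreturn c",
     "g": "a = 1\nb = 2\nc = a + b\nd = c * 2\nreturn d",
     "h": "return 0"},
    {"k": "x = 1\nreturn x"},
]
clones = analyze(partitions)
print(clones)
```

Because each partition is processed without shared state, the same structure could be moved to processes or machines; handling cross-partition pairs would need an extra step that this sketch leaves out.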
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
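To make the tuning knobs on this slide concrete, here is a minimal Python sketch of a statement-level similarity with two adjustable parameters: a per-statement similarity threshold and a switch for tolerating reordered statements. The metric, the tokenizer, and all function names are assumptions made for illustration; the slide does not give XIAO's actual formula. The two inputs are the loop bodies shown on the slide.

```python
import re
from difflib import SequenceMatcher

# Identifiers, numbers, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(statement):
    """Split a statement into identifiers, numbers, and single symbols."""
    return TOKEN.findall(statement)


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in the range [0, 1]."""
    return SequenceMatcher(None, tokens(s1), tokens(s2)).ratio()


def matched_in_order(a, b, threshold):
    """Longest common subsequence of statements under fuzzy equality."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, s in enumerate(a):
        for j, t in enumerate(b):
            if statement_similarity(s, t) >= threshold:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]


def matched_any_order(a, b, threshold):
    """Greedy one-to-one matching that ignores statement order."""
    used, count = set(), 0
    for s in a:
        for j, t in enumerate(b):
            if j not in used and statement_similarity(s, t) >= threshold:
                used.add(j)
                count += 1
                break
    return count


def snippet_similarity(a, b, threshold=0.8, keep_order=True):
    """Share of statements that have a counterpart in the other snippet."""
    if not a and not b:
        return 1.0
    match = matched_in_order if keep_order else matched_any_order
    return 2 * match(a, b, threshold) / (len(a) + len(b))


left = ["for (i = 0; i < n; i++)", "a++;", "b++;",
        "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
right = ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
         "d = bar(a, b, c);", "e = a + d;", "e++;"]

strict = snippet_similarity(left, right, keep_order=True)
relaxed = snippet_similarity(left, right, keep_order=False)
print(f"order-sensitive: {strict:.3f}, order-insensitive: {relaxed:.3f}")
```

With order enforced, the moved `c = foo(a, b);` statement lowers the score; ignoring order recovers it, and only the inserted `e++;` stays unmatched. Lowering `threshold` lets more heavily modified statements count as matches.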
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
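The search scenario above (a code snippet as input, a large indexed codebase, near-real-time response) is commonly served by an inverted index. The Python sketch below illustrates that general idea with token n-grams ("shingles"). It is an assumed design for illustration only, not the service deployed at MSRC; the class, the parameters, and the file paths are all hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, k=4):
    """Set of all runs of k consecutive tokens in the code."""
    toks = TOKEN.findall(code)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


class CloneIndex:
    """Inverted index from token shingles to the locations containing them."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        """Index one code fragment under a location label."""
        for shingle in shingles(code, self.k):
            self.postings[shingle].add(location)

    def search(self, snippet, min_overlap=0.5):
        """Return (location, overlap) pairs, best first, above min_overlap."""
        query = shingles(snippet, self.k)
        if not query:
            return []
        hits = defaultdict(int)
        for shingle in query:
            for location in self.postings.get(shingle, ()):
                hits[location] += 1
        ranked = [(loc, count / len(query)) for loc, count in hits.items()
                  if count / len(query) >= min_overlap]
        return sorted(ranked, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical codebases: the same unchecked copy appears in two products.
index = CloneIndex()
index.add("product_a/font.c", "if (len > size) return ERROR; memcpy(dst, src, len);")
index.add("product_b/font.c", "if (n > size) return ERROR; memcpy(dst, src, n);")
index.add("other/util.c", "return a + b;")

query = "if (len > size) return ERROR; memcpy(dst, src, len);"
results = index.search(query)
print(results)
```

Lookup cost depends on the size of the query rather than the size of the codebase, which is what makes a snippet-in, clones-out service plausible at this scale. A real service would also need normalization of identifiers and a finer verification step on the candidates.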
Vulnerability investigation workflow
• MSRC workflow: issue reproducing, root cause investigation & source location, variants finding, design/implement/test fix
• Manual & ad hoc investigation: MSRC sends a code snippet to Team A, Team B, and Team C, and receives code clones back
• Automated investigation: MSRC sends a code snippet to the clone search service and receives code clones back
  • Completeness is the key
  • Web service API for automation
More secure Microsoft products
Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching similar snippets for fixing bug once
• Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Technology Maturation: Middleware
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
15-20 years between first publication of an idea and widespread availability in products
Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need for long-term/blue-sky research]
NSF Workshop on Formal Methods
December 2012
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop/
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Researcherrsquos View -SCM Impact Study Findings
bullResearchers tend to consider thathellipbull precedence
bull concepts
bull prototypes
bull are sufficient as impact and ignorehellipbull efficiency
bull usability
bull reliability
bulldismissing them as ldquoengineering common senserdquo
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
A Researchers Observation in HCI Research Community
bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
Does our research community
have similar issues
Evaluation of DesignPLldquoResearch in Programming Languagesrdquo
bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby
bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo
bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo
Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes
Why Do Some Programming Languages Live and Others Die
bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem
bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like
documentation for their language
bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
httpwwwwiredcomwiredenterprise201206berkeley-programming-languages
Wiredcom
SourcecopyC Garling
Industrial Evaluations= Real Adoption
bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties
bull Non-target users (such as students)
bull Target users but not developers of the industrial code
bull Developers of the industrial code
bull Apply one-time (hitamprun) or continuous adoption
Need to value real adoption (eg in reviewing papers)
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
MS Academic Search ldquoPointer Analysisrdquo
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for the others) rather than a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al., DSN'09]
  • Coverity [Bessey et al., CACM'10] vs. MSRA XIAO [Dang et al., ACSAC'12] / PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effects, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al., ISSTA'06] vs. Daikon [Ernst et al., ICSE'99]
  • Debugging user study [Parnin & Orso, ISSTA'11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers; only 5 evaluated with actual programmers
[Parnin & Orso, ISSTA'11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al., NDSS'08]
    • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA'09]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research topics (vertical): Software Development Process, Software Systems, Software Users
Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real: real data, real problems, real users, real tools
Creating real impact
Code Clone Analysis [Dang et al., ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al., ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…
What does "getting real" mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
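The four steps named on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions, not XIAO's actual algorithm: the slide gives the step names and the partition-based parallelism, nothing more. All function names, the shared-statement heuristic for coarse matching, and the default thresholds below are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1 (assumed): normalize each function into a tuple of whitespace-free statements."""
    return {name: tuple("".join(s.split()) for s in body if s.strip())
            for name, body in functions.items()}


def coarse_match(pre):
    """Step 2 (assumed): cheap candidate generation.
    Two functions become a candidate pair if they share at least one identical statement."""
    index = defaultdict(set)
    for name, stmts in pre.items():
        for s in stmts:
            index[s].add(name)
    candidates = set()
    for names in index.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates


def fine_match(pre, candidates, threshold):
    """Step 3 (assumed): score each candidate pair by statement-sequence similarity."""
    result = []
    for a, b in sorted(candidates):
        score = SequenceMatcher(None, pre[a], pre[b]).ratio()
        if score >= threshold:
            result.append((a, b, score))
    return result


def prune(pre, pairs, min_statements):
    """Step 4 (assumed): drop clones too small to be worth reporting."""
    return [(a, b, s) for a, b, s in pairs
            if min(len(pre[a]), len(pre[b])) >= min_statements]


def analyze_partition(functions, threshold=0.6, min_statements=3):
    """Run the four steps on one source-code partition."""
    pre = preprocess(functions)
    return prune(pre, fine_match(pre, coarse_match(pre), threshold), min_statements)


def analyze(partitions, workers=4):
    """Analyze partitions independently and concatenate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(analyze_partition, partitions)
    return [pair for part in results for pair in part]
```

Each partition is a dict mapping a function name to its list of statements. The sketch only finds clones inside a partition; finding clones across partitions would need an extra step that the slide does not describe. A thread pool keeps the example simple; CPU-bound work in CPython would need a process pool or separate machines to run truly in parallel.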
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++)
a++;
b++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c;

for (i = 0; i < n; i++)
c = foo(a, b);
a++;
b++;
d = bar(a, b, c);
e = a + d;
e++;
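To make the tuning knobs concrete, here is a minimal sketch of a tunable clone check applied to the two snippets above. It is not XIAO's metric: the greedy matching, the use of `difflib`, and the parameter names `stmt_threshold` and `max_diff_ratio` are assumptions made for illustration, and the punctuation in the snippets is restored by guesswork from the slide.

```python
from difflib import SequenceMatcher


def normalize(stmt):
    """Collapse all whitespace so formatting differences do not count."""
    return "".join(stmt.split())


def statement_similarity(a, b):
    """Character-level similarity of two normalized statements, in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_clone(left, right, stmt_threshold=0.8, max_diff_ratio=0.4):
    """Greedy, order-insensitive statement matching (so reordered statements still match).

    stmt_threshold: how similar two statements must be to count as the same.
    max_diff_ratio: tolerated share of inserted/deleted/modified statements.
    Returns (verdict, observed share of differing statements).
    """
    unmatched = list(right)
    matched = 0
    for s in left:
        best = max(unmatched, key=lambda t: statement_similarity(s, t), default=None)
        if best is not None and statement_similarity(s, best) >= stmt_threshold:
            matched += 1
            unmatched.remove(best)
    different = (len(left) - matched) + len(unmatched)
    ratio = different / max(len(left), len(right))
    return ratio <= max_diff_ratio, ratio


left = ["for (i = 0; i < n; i++)", "a++;", "b++;",
        "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
right = ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
         "d = bar(a, b, c);", "e = a + d;", "e++;"]
```

With the default settings, `e = a + c;` and `e = a + d;` are similar enough to match, only `e++;` is left over, and the pair is reported as a clone. Raising `stmt_threshold` to 0.9 turns the modified statement into a difference on both sides, which pushes the pair over `max_diff_ratio` and rejects it. That is the sense in which the user gets what they tune.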
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
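The search scenario differs from detection in that a code snippet is the query and the codebases are indexed ahead of time, which is what makes near-real-time response plausible. The sketch below shows one simple way to do this with an inverted index over normalized statements. It is a hypothetical illustration, not the MSRC service: the `CloneIndex` class, its methods, the coverage score, and every codebase and function name used with it are made up.

```python
from collections import defaultdict


def normalize(stmt):
    """Collapse all whitespace so formatting differences do not count."""
    return "".join(stmt.split())


class CloneIndex:
    """Inverted index: normalized statement -> set of (codebase, function) locations."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, codebase, function, statements):
        """Index one function; done offline, once per codebase."""
        for s in statements:
            self._postings[normalize(s)].add((codebase, function))

    def search(self, snippet, min_coverage=0.5):
        """Return locations containing at least min_coverage of the snippet's statements,
        best coverage first. Only index lookups are needed at query time."""
        stmts = {normalize(s) for s in snippet if s.strip()}
        if not stmts:
            return []
        hits = defaultdict(int)
        for s in stmts:
            for loc in self._postings.get(s, ()):
                hits[loc] += 1
        ranked = [(loc, n / len(stmts)) for loc, n in hits.items()
                  if n / len(stmts) >= min_coverage]
        return sorted(ranked, key=lambda r: (-r[1], r[0]))
```

A query touches only the postings of the statements in the snippet, so its cost depends on the snippet and the number of hits rather than on the total size of the indexed code. Exact statement matching is the simplest possible choice here; a real service would also need to tolerate modified statements.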
Vulnerability investigation workflow
[Figure: issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix. For variants finding, MSRC sends a code snippet to Team A, Team B, and Team C, whose manual & ad hoc investigation returns code clones.]
Vulnerability investigation workflow
[Figure: the same workflow with the clone search service. A code snippet goes in and code clones come back through automated investigation. Completeness is the key; a Web service API supports automation.]
ICSE 2013 SEIP 62
Automated laborious manual efforts
Faster response time critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified
and addressed
Example ndash MS security bulletin MS12-034Combined security update for Microsoft Office Windows NET Framework and Silverlight published Tuesday May 08 2012
3 publicly disclosed vulnerabilities and seven privately reported involved Specifically one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32ksys
Cloned copy in gdiplusdll ogldll (office) Silverlight and Windows Journal viewer
Microsoft Security Research amp Defense Blog about this bulletin
ldquoHowever we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base To that end we have been working with Microsoft Research to develop a ldquoCloned Code Detectionrdquo system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034rdquo
ICSE 2013 SEIP 63httpblogstechnetcombsrdarchive20120508ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surfaceaspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-of-band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
NSF Workshop on Formal Methods
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field, both foundational and infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
December 2012
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Researcher's View: SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as "engineering common sense"
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher's Observation in HCI Research Community
• "The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks."
• "This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece, or create a new interaction technique and test it tightly in 8-12 weeks, and get a full CHI paper."
• "When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?"
• "We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics."
"I give up on CHI/UIST" by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: "Research in Programming Languages"
• "Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun." (PHP, JavaScript, Python, Ruby)
• "one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them"
• "reverse the trend of placing software research under the auspices of science and engineering [alone]"
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/. Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don't always have practical objectives. There's a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd without thinking about what's needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling
Industrial Evaluations= Real Adoption
• Papers on industrial studies/evaluations applying tools to industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
MS Academic Search: "Pointer Analysis"
"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind, PASTE'01]
"During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, 'Haven't we solved this problem yet?' With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."
Section 4.3: Designing an Analysis for a Client's Needs
"Barbara Ryder expands on this topic: '… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.'"
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: "Clone Detection"
Work typically focuses and is evaluated on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
[Timeline figure. Labels, in extraction order:]
• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0-day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and update
  • Promoting within MSRC
MSRC: Microsoft Security Response Center
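The search scenario differs from detection in that the codebases are indexed once and each query is a single code snippet. A minimal sketch of that idea, assuming an inverted index keyed by whitespace-normalized statements, is shown below. The class name `CloneSearchIndex`, its methods, the `min_coverage` parameter, and the codebase and function names in the usage lines are all hypothetical; the slides do not describe the service's real data structures or API.

```python
import re
from collections import Counter, defaultdict


def normalize(statement):
    """Remove all whitespace so formatting differences do not affect lookup."""
    return re.sub(r"\s+", "", statement)


class CloneSearchIndex:
    """Toy inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_function(self, codebase, function, statements):
        """Index one function; done ahead of time, once per codebase."""
        for stmt in statements:
            key = normalize(stmt)
            if key:
                self._postings[key].add((codebase, function))

    def search(self, snippet_statements, min_coverage=0.6):
        """Rank locations by the fraction of query statements they contain."""
        keys = {normalize(s) for s in snippet_statements}
        keys.discard("")
        if not keys:
            return []
        hits = Counter()
        for key in keys:
            for location in self._postings.get(key, ()):
                hits[location] += 1
        ranked = [(loc, count / len(keys)) for loc, count in hits.items()
                  if count / len(keys) >= min_coverage]
        # Highest coverage first; ties broken by location for stable output.
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical codebases and functions, for illustration only.
index = CloneSearchIndex()
index.add_function("product_a", "parse_font", ["len = read(buf)", "copy(dst, buf, len)", "return len"])
index.add_function("product_b", "load_glyphs", ["len = read( buf )", "copy(dst, buf, len)", "free(buf)"])
index.add_function("product_c", "draw", ["clear()", "paint()"])

results = index.search(["len = read(buf)", "copy(dst, buf, len)", "return len"])
```

Because lookups touch only the postings for the query's statements, response time depends on the snippet size rather than on the total amount of indexed code, which is the property a near-real-time service needs.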
Vulnerability investigation workflow
[Workflow figure. Stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Labels: MSRC; Team A, Team B, Team C; code snippet; code clones; manual & ad hoc investigation]
Vulnerability investigation workflow
[Workflow figure. Stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Labels: clone search service; code snippet; code clones; automated investigation]
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once, everywhere it occurs
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Researcher’s View - SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
A Researcher's Observation in HCI Research Community
• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/ Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Source: © C. Garling
Industrial Evaluations = Real Adoption
• Papers on industrial studies/evaluations of applying tools to industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
• “During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic one may wonder, “Haven't we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
• Section 4.3, Designing an Analysis for a Client’s Needs: “Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com/
http://www.blackducksoftware.com/
http://research.microsoft.com/en-us/groups/sa/
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tend to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or address one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → Isolated complex data structures → Real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded
May 2009
Group members
12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research Topics (vertical)
  • Software Development Process
  • Software Systems
  • Software Users
• Technology Pillars (horizontal)
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …
What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Researcherrsquos View -SCM Impact Study Findings
bullResearchers tend to consider thathellipbull precedence
bull concepts
bull prototypes
bull are sufficient as impact and ignorehellipbull efficiency
bull usability
bull reliability
bulldismissing them as ldquoengineering common senserdquo
SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf
A Researcher's Observation in HCI Research Community
• "The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks."
• "This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper."
• "When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?"
• "We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics."
"I give up on CHI/UIST" by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: "Research in Programming Languages"
• "Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun." (PHP, JavaScript, Python, Ruby)
• "one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them"
• "reverse the trend of placing software research under the auspices of science and engineering [alone]"
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages. Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don't always have practical objectives. There's a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what's needed to actually make it useful.
• Sometimes designers fail with the simplest of things, like documentation for their language.
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages. Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations on applying tools on industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Apply one-time (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: "Pointer Analysis"
"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind PASTE'01]
"During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, 'Haven't we solved this problem yet?' With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."
Section 4.3: Designing an Analysis for a Client's Needs
"Barbara Ryder expands on this topic: '… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.'"
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: "Clone Detection"
Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask "When scale is up, will my solution still work?"
  • Tend to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
• Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava/Thiagarajan ISSTA'02], Klee [Cadar et al. OSDI'08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or address them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes, isolated complex data structures, real-world classes as focused on by our recent work [Thummalapenta et al. ESEC/FSE'09, OOPSLA'11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN'09]
  • Coverity [Bessey et al. CACM'10] vs. MSRA XIAO [Dang et al. ACSAC'12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA'06] vs. Daikon [Ernst et al. ICSE'99]
  • Debugging user study [Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Consider many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: device driver verification
  • MSR SAGE: security testing of binaries [Godefroid et al. NDSS'08]
  • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA'09]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research topics (vertical): Software Development Process, Software Systems, Software Users
Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Connection to practice
MSR 2012 39
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…
What does "getting real" mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project - XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process
  • Pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
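The slide names only the four steps and the partition-based parallelism; it does not say what each step does. As a rough illustration of how such a pipeline can be wired together, here is a minimal Python sketch. Everything inside the steps is an assumption made for the example (whitespace normalization, an inverted index of statements for coarse matching, `difflib` for fine matching, a length filter for pruning), and all function names and thresholds are hypothetical, not XIAO's.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(partition):
    """Step 1 (assumed): normalize every function body in one source partition."""
    return {name: [re.sub(r"\s+", "", s) for s in body if s.strip()]
            for name, body in partition.items()}


def coarse_match(functions, min_shared=2):
    """Step 2 (assumed): cheap candidate pairs via an inverted index of statements."""
    index = defaultdict(set)
    for name, stmts in functions.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, candidates, min_similarity=0.7):
    """Step 3 (assumed): precise comparison, run on candidate pairs only."""
    clones = []
    for a, b in candidates:
        score = SequenceMatcher(None, functions[a], functions[b]).ratio()
        if score >= min_similarity:
            clones.append((a, b, score))
    return clones


def prune(functions, clones, min_statements=3):
    """Step 4 (assumed): drop clones too short to be worth reviewing."""
    return [(a, b, score) for a, b, score in clones
            if min(len(functions[a]), len(functions[b])) >= min_statements]


def detect_clones(partitions):
    """Pre-process each partition independently, then match across all of them."""
    functions = {}
    with ThreadPoolExecutor() as pool:
        for part in pool.map(preprocess, partitions):
            functions.update(part)
    return prune(functions, fine_match(functions, coarse_match(functions)))
```

The thread pool only illustrates that partitions can be processed independently; at real scale the same split would be spread over processes or machines. Coarse matching exists so that the more expensive fine matching never has to compare all pairs.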
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++)
a++;
b++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c;

for (i = 0; i < n; i++)
c = foo(a, b);
a++;
b++;
d = bar(a, b, c);
e = a + d;
e++;
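To make the tuning knobs concrete, the following Python sketch scores the slide's two snippets with a tunable similarity. It is not XIAO's metric: the token-level statement similarity from `difflib`, the parameter names `min_stmt_sim` and `order_weight`, and the way ordered and unordered matches are blended are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def tokens(statement):
    """Split a statement into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", statement)


def statement_similarity(a, b):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


def ordered_matches(xs, ys, min_stmt_sim):
    """Longest common subsequence of similar-enough statements (order respected)."""
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            best = max(dp[i][j + 1], dp[i + 1][j])
            if statement_similarity(x, y) >= min_stmt_sim:
                best = max(best, dp[i][j] + 1)
            dp[i + 1][j + 1] = best
    return dp[len(xs)][len(ys)]


def unordered_matches(xs, ys, min_stmt_sim):
    """Greedy one-to-one matching that ignores statement order."""
    unused = list(ys)
    count = 0
    for x in xs:
        for y in unused:
            if statement_similarity(x, y) >= min_stmt_sim:
                unused.remove(y)
                count += 1
                break
    return count


def clone_similarity(xs, ys, min_stmt_sim=0.8, order_weight=0.5):
    """Blend order-sensitive and order-insensitive match counts (hypothetical knobs)."""
    longest = max(len(xs), len(ys))
    if longest == 0:
        return 0.0
    ordered = ordered_matches(xs, ys, min_stmt_sim)
    unordered = unordered_matches(xs, ys, min_stmt_sim)
    return (order_weight * ordered + (1 - order_weight) * unordered) / longest


SNIPPET_1 = ["for (i = 0; i < n; i++)", "a++;", "b++;",
             "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
SNIPPET_2 = ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
             "d = bar(a, b, c);", "e = a + d;", "e++;"]
```

Raising `min_stmt_sim` to 1.0 stops `e = a + c;` and `e = a + d;` from counting as the same statement (a modified statement), while moving `order_weight` toward 0 makes the reordered `c = foo(a, b);` cost nothing (disordered statements). The unmatched `e++;` plays the role of an inserted statement.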
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication - getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication - forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project - XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
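The search scenario differs from detection in that the code bases are indexed once, ahead of time, and each query is a single snippet that must be answered quickly. The slides do not describe how the service is built, so the Python sketch below is only a toy stand-in: the `CloneIndex` class, the whitespace-only normalization, and the `min_coverage` threshold are hypothetical choices for illustration, and the product and file names are made up.

```python
import re
from collections import Counter, defaultdict


def normalize(line):
    """Drop all whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", "", line)


class CloneIndex:
    """Toy inverted index: normalized statement -> files that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_file(self, codebase, path, lines):
        """Index one source file; done offline, before any query arrives."""
        for line in lines:
            key = normalize(line)
            if key:
                self._postings[key].add((codebase, path))

    def search(self, snippet_lines, min_coverage=0.6):
        """Return files containing at least min_coverage of the snippet's statements."""
        keys = {normalize(line) for line in snippet_lines} - {""}
        if not keys:
            return []
        hits = Counter()
        for key in keys:
            for location in self._postings.get(key, ()):
                hits[location] += 1
        ranked = [(location, count / len(keys)) for location, count in hits.items()
                  if count / len(keys) >= min_coverage]
        # Best coverage first; ties broken by location for a stable order.
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = CloneIndex()
index.add_file("product_a", "font.c",
               ["if (len > max)", "    return;", "copy(buf, src, len);"])
index.add_file("product_b", "glyph.c",
               ["if (len > max)", "return;", "copy(buf, src, len);", "log();"])
index.add_file("product_c", "misc.c", ["log();"])
```

A query such as `index.search(["if (len > max)", "return;", "copy(buf, src, len);"])` touches only the posting lists of the snippet's own statements, which is what keeps response time largely independent of the total amount of indexed code.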
Vulnerability investigation workflow
[Figure, manual workflow. Stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Labels: MSRC; Team A, Team B, Team C; code snippet; code clones; manual & ad hoc investigation.]
[Figure, workflow with clone search service. Same stages. Labels: clone search service; code snippet; code clones; automated investigation; "Completeness is the key"; "Web service API for automation".]
More secure Microsoft products
Code clone search service integrated into vulnerability investigation process of MSRC
Automated laborious manual efforts
Faster response time, critical in security context
Real security issues proactively identified and addressed
Example: MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a 'Cloned Code Detection' system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
A Researchers Observation in HCI Research Community
bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
Does our research community
have similar issues
Evaluation of DesignPLldquoResearch in Programming Languagesrdquo
bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby
bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo
bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo
Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes
Why Do Some Programming Languages Live and Others Die
bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem
bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like
documentation for their language
bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
httpwwwwiredcomwiredenterprise201206berkeley-programming-languages
Wiredcom
SourcecopyC Garling
Industrial Evaluations= Real Adoption
bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties
bull Non-target users (such as students)
bull Target users but not developers of the industrial code
bull Developers of the industrial code
bull Apply one-time (hitamprun) or continuous adoption
Need to value real adoption (eg in reviewing papers)
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
MS Academic Search ldquoPointer Analysisrdquo
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al., ISSTA’06] vs. Daikon [Ernst et al., ICSE’99]
  • Debugging user study [Parnin & Orso, ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso, ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al., NDSS’08]
    • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA’10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
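The contrast between a single metric and many dimensions can be illustrated with a small calculation. This is only a sketch under stated assumptions: the `CheckerReport` fields, the unit of "benefit", and the `net_value` formula are hypothetical and are not taken from the talk or from any of the cited tools.

```python
from dataclasses import dataclass


@dataclass
class CheckerReport:
    """Hypothetical summary of one analysis tool's output on a codebase."""
    name: str
    warnings: int               # number of warnings reported
    precision: float            # fraction of warnings that are true bugs
    minutes_per_warning: float  # human cost of inspecting one warning
    value_per_true_bug: float   # assumed benefit (in minutes saved) per true bug


def net_value(report: CheckerReport) -> float:
    """Benefit of the true bugs found minus the human cost of inspecting everything."""
    benefit = report.warnings * report.precision * report.value_per_true_bug
    cost = report.warnings * report.minutes_per_warning
    return benefit - cost


def rank_by_net_value(reports: list[CheckerReport]) -> list[str]:
    """Order tools by net value, highest first."""
    return [r.name for r in sorted(reports, key=net_value, reverse=True)]


# Illustrative numbers only: a precise tool that finds low-severity issues
# versus a noisier tool that finds severe ones.
STYLE_CHECKER = CheckerReport("style", warnings=100, precision=0.9,
                              minutes_per_warning=5, value_per_true_bug=4)
SECURITY_CHECKER = CheckerReport("security", warnings=20, precision=0.5,
                                 minutes_per_warning=10, value_per_true_bug=50)
```

Ranking by precision alone would favor the first tool; once inspection cost and bug severity are included, the ordering can flip, which is the point the slide makes about real-world evaluation.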
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research Topics (vertical): Software Development Process, Software Systems, Software Users
• Technology Pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al., ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al., ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process
  • Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
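The four-step process (pre-processing, coarse matching, fine matching, pruning) and the partition-based parallelism can be sketched as follows. This is a minimal illustration, not XIAO's implementation: every function name, the shared-statement filter, the use of `difflib` for fine matching, and the default thresholds are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source: str) -> list[str]:
    """Step 1 (assumed form): one normalized statement per non-empty line."""
    return ["".join(line.split()) for line in source.splitlines() if line.strip()]


def preprocess_partition(partition: dict[str, str]) -> dict[str, list[str]]:
    """Partitions are independent, so each one can be pre-processed in parallel."""
    return {name: preprocess(src) for name, src in partition.items()}


def coarse_match(funcs: dict[str, list[str]], min_shared: int) -> set[tuple[str, str]]:
    """Step 2: cheap filter that keeps pairs sharing at least min_shared statements."""
    index: dict[str, set[str]] = {}
    for name, stmts in funcs.items():
        for stmt in set(stmts):
            index.setdefault(stmt, set()).add(name)
    shared: dict[tuple[str, str], int] = {}
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] = shared.get(pair, 0) + 1
    return {pair for pair, count in shared.items() if count >= min_shared}


def fine_match(a: list[str], b: list[str]) -> float:
    """Step 3: more expensive sequence similarity on the surviving candidates."""
    return SequenceMatcher(None, a, b).ratio()


def detect_clones(partitions, min_shared=2, threshold=0.5, min_len=3):
    """Run the four steps; threads stand in for whatever workers a real system uses."""
    with ThreadPoolExecutor() as pool:
        processed = list(pool.map(preprocess_partition, partitions))
    funcs = {name: stmts for part in processed for name, stmts in part.items()}
    results = []
    for first, second in coarse_match(funcs, min_shared):
        score = fine_match(funcs[first], funcs[second])
        # Step 4 (pruning): drop weak matches and trivially short functions.
        if score >= threshold and min(len(funcs[first]), len(funcs[second])) >= min_len:
            results.append(((first, second), score))
    return sorted(results)
```

The design point the sketch tries to capture is that the cheap coarse filter runs before the expensive comparison, and that pre-processing needs no cross-partition state, so it scales out with the number of source partitions.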
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
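The three tuning knobs listed on this slide can be made concrete with a small sketch. It is an illustration only and not XIAO's actual metric: the parameter names `stmt_sim`, `order_weight`, and `max_changed`, the use of `difflib` for statement similarity, and the way ordered and unordered matches are blended are all assumptions.

```python
from difflib import SequenceMatcher

# The two loop bodies from the slide, one statement per entry.
SNIPPET_A = ["for (i = 0; i < n; i++) {", "a++;", "b++;",
             "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
SNIPPET_B = ["for (i = 0; i < n; i++) {", "c = foo(a, b);", "a++;", "b++;",
             "d = bar(a, b, c);", "e = a + d;", "e++;"]


def stmt_similar(s: str, t: str, stmt_sim: float) -> bool:
    """Knob 1: two statements match when their character-level ratio reaches stmt_sim."""
    return SequenceMatcher(None, s, t).ratio() >= stmt_sim


def ordered_matches(a: list[str], b: list[str], stmt_sim: float) -> int:
    """Longest common subsequence: matched statements that keep their relative order."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, s in enumerate(a, 1):
        for j, t in enumerate(b, 1):
            if stmt_similar(s, t, stmt_sim):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def unordered_matches(a: list[str], b: list[str], stmt_sim: float) -> int:
    """Greedy one-to-one matching that ignores statement order."""
    unused = list(b)
    count = 0
    for s in a:
        for k, t in enumerate(unused):
            if stmt_similar(s, t, stmt_sim):
                del unused[k]
                count += 1
                break
    return count


def similarity(a, b, stmt_sim=0.8, order_weight=0.5) -> float:
    """Knob 3: order_weight=1 rewards structure only, 0 tolerates any reordering."""
    if not a and not b:
        return 1.0
    matched = (order_weight * ordered_matches(a, b, stmt_sim)
               + (1 - order_weight) * unordered_matches(a, b, stmt_sim))
    return 2 * matched / (len(a) + len(b))


def is_clone(a, b, stmt_sim=0.8, order_weight=0.5, max_changed=0.3) -> bool:
    """Knob 2: accept the pair when the unmatched share stays within max_changed."""
    return 1 - similarity(a, b, stmt_sim, order_weight) <= max_changed
```

With the loose settings the slide's pair is reported as a clone (the moved `c = foo(a, b);` and the modified last assignment are tolerated); with exact statement matching and strict ordering it is not, which is the "what you tune is what you get" behavior.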
Explorability
How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0day attack
5. Patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
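The search scenario differs from batch detection in that the codebases are indexed once and each query is a single snippet that must be answered quickly. A minimal sketch of that idea, assuming an inverted index over pairs of adjacent normalized statements, is shown here. The class `CloneIndex`, its methods, and the overlap score are hypothetical and are not the design of the MSRC service.

```python
from collections import defaultdict


def shingles(source: str, k: int = 2) -> set[tuple[str, ...]]:
    """Windows of k consecutive statements, with whitespace removed from each line."""
    stmts = ["".join(line.split()) for line in source.splitlines() if line.strip()]
    return {tuple(stmts[i:i + k]) for i in range(len(stmts) - k + 1)}


class CloneIndex:
    """Hypothetical in-memory index mapping statement windows to code locations."""

    def __init__(self, k: int = 2):
        self.k = k
        self.postings: dict[tuple[str, ...], set[str]] = defaultdict(set)

    def add(self, location: str, source: str) -> None:
        """Index one function or file once, ahead of any query."""
        for window in shingles(source, self.k):
            self.postings[window].add(location)

    def search(self, snippet: str, min_overlap: float = 0.5) -> list[tuple[str, float]]:
        """Return locations covering at least min_overlap of the snippet's windows."""
        query = shingles(snippet, self.k)
        if not query:
            return []
        hits: dict[str, int] = defaultdict(int)
        for window in query:
            for location in self.postings.get(window, ()):
                hits[location] += 1
        ranked = [(loc, count / len(query)) for loc, count in hits.items()
                  if count / len(query) >= min_overlap]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Because a query only touches the posting lists of its own windows, lookup cost depends on the snippet size rather than on the total indexed code, which is one plausible way to reach near-real-time response at this scale.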
Vulnerability investigation workflow
• Workflow steps: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix
• Before (figure labels): MSRC; Team A, Team B, Team C; Code snippet; Code clones; Manual & ad hoc investigation
• After (figure labels): Clone search service; Code snippet; Code clones; Automated investigation
  • Completeness is the key
  • Web service API for automation
More secure Microsoft products
Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and 7 privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
A Researcher’s Observation in HCI Research Community
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay
http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them!”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages
Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations that apply tools to industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind, PASTE’01]
“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind, PASTE’01]
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses/evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
ICSE 2013 SEIP 49
1 Clone navigation based on source tree hierarchy
2 Pivoting of folder level statistics
3 Folder level statistics
4 Clone function list in selected folder
5 Clone function filters
6 Sorting by bug or refactoring potential
7 Tagging
1 2 3 4 5 6
7
1 Block correspondence
2 Block types
3 Block navigation
4 Copying
5 Bug filing
6 Tagging
1
2
3
4
1
6
5
How to navigate through the large number of detected clones
How to quickly review a pair of clones
Collaboration
bull Collaboration models
bull Communication
bull Champion in product teams
bull Getting engineering support
ICSE 2013 SEIP 50
Collaboration models
ICSE 2013 SEIP 51
Pull
Push
Join
Communication ndash getting connected
bull Reaching-out to practitioners
bull Understanding their business
bull Speaking practitionersrsquo languages
bull Finding out their pain pointsbull Understanding their scenarios
bull Experiencing their pain
bull Articulating their problems
ICSE 2013 SEIP 52
Communication ndash forming partnership
bull Finding and defining shared goals
bull Setting the right expectation
bull Building a roadmap
bull Forming virtual team (creating an email alias)
bull Adopting a milestone approach
bull Conducting regular sync-up
ICSE 2013 SEIP 53
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
[Timeline figure; labels: initial vulnerability reported in product A; patch release of product B; potential 0-day attack; security bulletin released; similar vulnerability found in product B by attackers]
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
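The search scenario (code snippet in, matching functions out, near-real-time) is commonly served by an index built ahead of time. The Python sketch below shows one minimal way to do that with an inverted index over normalized statements. It is an illustration under assumptions, not the MSRC service: `CloneIndex`, `add_function`, `search`, `min_share`, and the function identifiers are all hypothetical names.

```python
from collections import defaultdict


def normalize(statement):
    """Remove all whitespace so formatting differences do not matter."""
    return "".join(statement.split())


class CloneIndex:
    """Inverted index: normalized statement -> functions containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_function(self, func_id, statements):
        """Index one function; done offline, before any query arrives."""
        for statement in statements:
            self._postings[normalize(statement)].add(func_id)

    def search(self, snippet, min_share=0.6):
        """Return functions containing at least min_share of the snippet's
        distinct statements, best match first (ties broken by id)."""
        query = {normalize(s) for s in snippet}
        hits = defaultdict(int)
        for statement in query:
            for func_id in self._postings.get(statement, ()):
                hits[func_id] += 1
        ranked = [(count / len(query), func_id)
                  for func_id, count in hits.items()
                  if count / len(query) >= min_share]
        return [func_id for _, func_id in
                sorted(ranked, key=lambda item: (-item[0], item[1]))]


if __name__ == "__main__":
    index = CloneIndex()
    index.add_function("product_a/parse", ["len = read_len(buf)",
                                           "copy(dst, buf, len)",
                                           "return dst"])
    index.add_function("product_b/parse", ["len = read_len(buf)",
                                           "copy(dst, buf, len)",
                                           "log(len)",
                                           "return dst"])
    index.add_function("product_c/other", ["return 0"])
    print(index.search(["len=read_len(buf)", "copy(dst,buf,len)"]))
```

Query cost depends on the snippet size and the posting lists it touches, not on the total indexed code, which is why indexing ahead of time suits a near-real-time search over a very large codebase.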
Vulnerability investigation workflow
[Workflow figure. Steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Variants finding is a manual & ad hoc investigation: MSRC passes a code snippet to Team A, Team B, and Team C and gets code clones back.]
Vulnerability investigation workflow
[Workflow figure. Same steps, with variants finding as an automated investigation: a code snippet goes to the clone search service and code clones come back.]
Completeness is the key
Web service API for automation
More secure Microsoft products
Automated laborious manual efforts
Faster response time, critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets to fix a bug once
Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
A Researcher’s Observation in HCI Research Community
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay
Does our research community have similar issues?
Evaluation of Design/PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages
Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling
Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations on applying tools on industrial code: who applies the tools?
  • Authors themselves, instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], KLEE [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to tackle them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
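To make the contrast concrete, the Python sketch below reports the usual academic measures (precision, recall) next to a crude cost-benefit figure for the same set of tool findings. Everything in it is an assumption made for illustration: the `Finding` record, the `evaluate` function, and the two conversion rates (`minutes_per_inspection`, `minutes_saved_per_severity_point`) do not come from the talk or from any of the cited tools.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One warning reported by a tool (hypothetical record)."""
    is_real_bug: bool
    severity: int = 0  # benefit points if the bug is real; 0 otherwise


def evaluate(findings, total_real_bugs,
             minutes_per_inspection=10,
             minutes_saved_per_severity_point=60):
    """Report precision/recall alongside a rough human cost and benefit."""
    true_positives = sum(1 for f in findings if f.is_real_bug)
    precision = true_positives / len(findings) if findings else 0.0
    recall = true_positives / total_real_bugs if total_real_bugs else 0.0
    # Cost: every finding, real or not, has to be inspected by a person.
    cost = len(findings) * minutes_per_inspection
    # Benefit: only real bugs pay off, weighted by severity.
    benefit = minutes_saved_per_severity_point * sum(
        f.severity for f in findings if f.is_real_bug)
    return {
        "precision": precision,
        "recall": recall,
        "cost_minutes": cost,
        "benefit_minutes": benefit,
        "net_minutes": benefit - cost,
    }


if __name__ == "__main__":
    report = evaluate(
        [Finding(True, 3), Finding(False), Finding(False), Finding(True, 1)],
        total_real_bugs=5,
    )
    print(report)
```

Two tools with the same precision and recall can differ widely in `net_minutes` once inspection effort and bug severity are counted, which is the point of looking at more than one or two dimensions.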
Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenue)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research topics (vertical): Software Development Process, Software Systems, Software Users
Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real: real data, real problems, real users, real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
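The four steps named on this slide can be laid out as a small pipeline. The Python sketch below is only a toy under stated assumptions: the slides give the step names but not their algorithms, so the whitespace normalization, the Jaccard filter, the `difflib` score, the thresholds, and every function name here are stand-ins rather than XIAO's implementation. Threads are used just to show the partitioning; a real deployment would spread partitions over processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def preprocess(functions):
    """Step 1: normalize each statement (here: just strip whitespace)."""
    return {name: ["".join(stmt.split()) for stmt in body]
            for name, body in functions.items()}


def coarse_match(a, b, min_jaccard=0.3):
    """Step 2: cheap set-based filter that discards clearly unrelated pairs."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return bool(union) and len(sa & sb) / len(union) >= min_jaccard


def fine_match(a, b):
    """Step 3: order-aware statement similarity in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def prune(candidates, bodies, threshold, min_statements=3):
    """Step 4: drop weak matches and functions too short to be interesting."""
    return [(x, y, score) for x, y, score in candidates
            if score >= threshold
            and min(len(bodies[x]), len(bodies[y])) >= min_statements]


def scan_partition(partition, bodies):
    """Compare each function of one partition with all later functions."""
    names = sorted(bodies)
    found = []
    for x in partition:
        for y in names[names.index(x) + 1:]:
            if coarse_match(bodies[x], bodies[y]):
                found.append((x, y, fine_match(bodies[x], bodies[y])))
    return found


def detect_clones(functions, threshold=0.6, workers=2):
    """Run the four steps, scanning source partitions concurrently."""
    bodies = preprocess(functions)
    names = sorted(bodies)
    partitions = [names[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda part: scan_partition(part, bodies), partitions)
        candidates = [pair for chunk in chunks for pair in chunk]
    return sorted(prune(candidates, bodies, threshold))


if __name__ == "__main__":
    code = {
        "f": ["for (i = 0; i < n; i++)", "a++", "b++",
              "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"],
        "g": ["for (i = 0; i < n; i++)", "c = foo(a, b)", "a++", "b++",
              "d = bar(a, b, c)", "e = a + d", "e++"],
        "h": ["x = 1", "y = 2", "return 0"],
    }
    print(detect_clones(code))
```

Because each partition only needs the shared, read-only `bodies` map, partitions can be scanned independently and their candidate lists concatenated before pruning.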
A Researchers Observation in HCI Research Community
bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo
bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo
ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay
Does our research community
have similar issues
Evaluation of DesignPLldquoResearch in Programming Languagesrdquo
bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby
bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo
bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo
Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes
Why Do Some Programming Languages Live and Others Die
bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem
bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like
documentation for their language
bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
httpwwwwiredcomwiredenterprise201206berkeley-programming-languages
Wiredcom
SourcecopyC Garling
Industrial Evaluations= Real Adoption
bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties
bull Non-target users (such as students)
bull Target users but not developers of the industrial code
bull Developers of the industrial code
bull Apply one-time (hitamprun) or continuous adoption
Need to value real adoption (eg in reviewing papers)
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
MS Academic Search ldquoPointer Analysisrdquo
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search: “Clone Detection”
Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature after n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
  • But in practice, a simple solution tends to be the scalable one (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make simplifying assumptions, or to address one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes -> isolated complex data structures -> real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12] / PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tend not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability (and its evaluation) in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: device driver verification
  • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
  • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
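To make the contrast between a single metric (precision) and a cost-benefit view concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Report` type, the 0 to 3 severity scale, the per-report inspection cost, and the value assigned to each severity point are hypothetical and do not come from any of the cited tools.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One warning shown to an engineer (hypothetical model)."""
    is_true_bug: bool
    severity: int = 0  # assumed scale: 0 (no bug) to 3 (critical)


def precision(reports):
    """Fraction of reports that are real bugs: the single 'academic' metric."""
    return sum(r.is_true_bug for r in reports) / len(reports)


def net_benefit(reports, inspect_cost=1.0, value_per_severity=4.0):
    """Benefit of the bugs found minus the human cost of inspecting every report."""
    benefit = sum(r.severity * value_per_severity for r in reports if r.is_true_bug)
    cost = inspect_cost * len(reports)
    return benefit - cost


# Tool A: precise, but only finds low-severity issues.
tool_a = [Report(True, 1)] * 4 + [Report(False)]
# Tool B: noisier, but the bugs it finds are critical.
tool_b = [Report(True, 3)] * 2 + [Report(False)] * 3

print(precision(tool_a), net_benefit(tool_a))  # 0.8 11.0
print(precision(tool_b), net_benefit(tool_b))  # 0.4 19.0
```

Under these made-up weights, the tool with the lower precision still comes out ahead once bug severity and inspection cost are both counted, which is the kind of trade-off a product team weighs.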
Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research Topics: Software Development Process, Software Systems, Software Users
• Technology Pillars: Information Visualization, Analysis Algorithms, Large-scale Computing
• Diagram labels: Vertical, Horizontal
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project - XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
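The four steps named above could be wired together roughly as in the following Python sketch. The slides give only the step names, so the inside of each step here is an assumption chosen for illustration (whitespace normalization, counting shared statements, Jaccard similarity, a fixed threshold); it is not XIAO's actual algorithm, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def preprocess(functions):
    """Step 1: turn each function body into a list of whitespace-normalized statements."""
    return {name: [" ".join(line.split()) for line in body.splitlines() if line.strip()]
            for name, body in functions.items()}


def coarse_match(statements, min_shared=2):
    """Step 2: cheaply propose candidate pairs that share at least min_shared statements."""
    index = defaultdict(set)  # statement -> names of functions containing it
    for name, stmts in statements.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(statements, candidates):
    """Step 3: score each candidate pair more precisely (here: Jaccard similarity)."""
    scored = []
    for a, b in candidates:
        sa, sb = set(statements[a]), set(statements[b])
        scored.append((a, b, len(sa & sb) / len(sa | sb)))
    return scored


def prune(scored, threshold=0.5):
    """Step 4: drop pairs whose similarity falls below the threshold."""
    return [(a, b, score) for a, b, score in scored if score >= threshold]


def find_clones(functions, threshold=0.5):
    """Run the four steps in order on a {name: source} mapping."""
    statements = preprocess(functions)
    return prune(fine_match(statements, coarse_match(statements)), threshold)


sample = {
    "f": "a++\nb++\nc = foo(a, b)",
    "g": "a++\nb++\nc = foo(a, b)\nd = 1",
    "h": "x = 1\ny = 2",
}
print(find_clones(sample))  # [('f', 'g', 0.75)]
```

In this sketch the pre-processing step touches one function at a time, so it could be run independently per source partition; the cheap coarse step exists only to keep the more expensive fine step off most pairs.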
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
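As a rough illustration of what "tunable" can mean for the two loop snippets on this slide, the Python sketch below exposes two knobs: how similar two statements must be to count as a match (`stmt_threshold`), and whether statement order must be preserved (`keep_order`). The metric itself is an assumption made for this example (token-level `difflib` ratios plus greedy or order-preserving matching); the slide does not spell out XIAO's real metric.

```python
import re
from difflib import SequenceMatcher


def tokens(stmt):
    """Split a statement into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", stmt)


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, between 0.0 and 1.0."""
    return SequenceMatcher(None, tokens(s1), tokens(s2)).ratio()


def snippet_similarity(a, b, stmt_threshold=0.8, keep_order=False):
    """Share of statements in a and b that can be paired up.

    stmt_threshold: how alike two statements must be to count as a match.
    keep_order:     True tolerates no reordering; False ignores statement order.
    """
    def same(x, y):
        return statement_similarity(x, y) >= stmt_threshold

    if keep_order:
        # Longest common subsequence under the fuzzy 'same' relation.
        prev = [0] * (len(b) + 1)
        for x in a:
            cur = [0]
            for j, y in enumerate(b, 1):
                hit = prev[j - 1] + (1 if same(x, y) else 0)
                cur.append(max(prev[j], cur[j - 1], hit))
            prev = cur
        matched = prev[-1]
    else:
        # Greedy pairing: each statement of a takes its best unused partner in b.
        free = list(b)
        matched = 0
        for x in a:
            best = max(free, key=lambda y: statement_similarity(x, y), default=None)
            if best is not None and same(x, best):
                free.remove(best)
                matched += 1
    total = len(a) + len(b)
    return 2 * matched / total if total else 1.0


LEFT = ["for (i = 0; i < n; i++)", "a++", "b++",
        "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
RIGHT = ["for (i = 0; i < n; i++)", "c = foo(a, b)", "a++", "b++",
         "d = bar(a, b, c)", "e = a + d", "e++"]

print(round(snippet_similarity(LEFT, RIGHT), 3))                      # 0.923
print(round(snippet_similarity(LEFT, RIGHT, keep_order=True), 3))     # 0.769
print(round(snippet_similarity(LEFT, RIGHT, stmt_threshold=1.0), 3))  # 0.769
```

Loosening either knob raises the score for the same pair: ignoring order forgives the moved `c = foo(a, b)`, and a statement threshold of 0.8 forgives the modified last assignment, while the inserted `e++` is never matched.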
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication - getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication - forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project - XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0-day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
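The search scenario (a code snippet as input, answers in near real time over pre-indexed codebases) can be pictured with the small Python sketch below. It is an assumption-laden illustration, not the deployed service: the token n-gram inverted index, the `CloneIndex` class, the overlap score, and the file names are all hypothetical.

```python
import re
from collections import defaultdict


def token_ngrams(code, n=3):
    """Set of n-token windows over the identifiers, numbers, and symbols in code."""
    toks = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


class CloneIndex:
    """Inverted index from token n-grams to the locations that contain them."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, location, code):
        """Index one piece of code ahead of time (the expensive, offline part)."""
        for gram in token_ngrams(code, self.n):
            self.postings[gram].add(location)

    def search(self, snippet, min_overlap=0.6):
        """Return (location, overlap) pairs, best first; overlap is the
        fraction of the snippet's n-grams found at that location."""
        grams = token_ngrams(snippet, self.n)
        hits = defaultdict(int)
        for gram in grams:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        results = [(loc, count / len(grams)) for loc, count in hits.items()
                   if count / len(grams) >= min_overlap]
        return sorted(results, key=lambda r: (-r[1], r[0]))


index = CloneIndex()
index.add("product_a/parse.c", "if (len > size) return; memcpy(dst, src, len);")
index.add("product_b/parse.c", "if (n > size) return; memcpy(dst, src, n);")
index.add("other/util.c", 'printf("hello");')

query = "if (len > size) return; memcpy(dst, src, len);"
results = index.search(query, min_overlap=0.5)
print([location for location, _ in results])
# ['product_a/parse.c', 'product_b/parse.c']
```

Building the index once and answering each query with a few dictionary lookups is what lets a search of this kind respond quickly even when the indexed code is large; the renamed variable in the second file lowers the overlap but does not hide the copy.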
Vulnerability investigation workflow
[Figure: workflow stages Issue reproducing, Root cause investigation & source location, Variants finding, Design/Implement/Test fix. Variants finding is a manual & ad hoc investigation: MSRC takes a code snippet to Team A, Team B, and Team C to find code clones.]
Vulnerability investigation workflow
[Figure: workflow stages Issue reproducing, Root cause investigation & source location, Variants finding, Design/Implement/Test fix. Variants finding becomes an automated investigation: a code snippet goes to the clone search service, which returns code clones. Callouts: “Completeness is the key” and “Web service API for automation”.]
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Example - MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Evaluation of Design PL: “Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” - PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them!”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes. http://tagide.com/blog/2012/03/research-in-programming-languages Source: © C. Lopes
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Wired.com
Source: © C. Garling
Industrial Evaluations= Real Adoption
bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties
bull Non-target users (such as students)
bull Target users but not developers of the industrial code
bull Developers of the industrial code
bull Apply one-time (hitamprun) or continuous adoption
Need to value real adoption (eg in reviewing papers)
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
MS Academic Search ldquoPointer Analysisrdquo
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
[Figure labels: Initial vulnerability reported in product A; Patch release of product B; Potential 0day attack; Security bulletin released; Similar vulnerability found in product B by attackers]
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and update
  • Promoting within MSRC
MSRC: Microsoft Security Response Center
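The search scenario differs from whole-codebase detection in that the input is a single snippet and the answer must come back quickly, which suggests building an index ahead of time. The Python sketch below shows one simple way such a lookup could work, using an inverted index from normalized statements to function identifiers. The `CloneIndex` class, its methods, the `min_coverage` parameter, and the function identifiers in the usage are hypothetical; the slides do not describe how the actual service is indexed.

```python
from collections import defaultdict
import re


def normalize(statement):
    """Collapse whitespace differences by re-joining the token stream."""
    return " ".join(re.findall(r"\w+|[^\w\s]", statement))


class CloneIndex:
    """Toy inverted index: normalized statement -> ids of functions using it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, function_id, source):
        """Index every non-blank line of one function body."""
        for line in source.splitlines():
            if line.strip():
                self._postings[normalize(line)].add(function_id)

    def search(self, snippet, min_coverage=0.6):
        """Return (function_id, coverage) pairs, best match first.

        coverage is the fraction of distinct query statements that
        also occur in the indexed function.
        """
        query = {normalize(l) for l in snippet.splitlines() if l.strip()}
        hits = defaultdict(int)
        for stmt in query:
            for function_id in self._postings.get(stmt, ()):
                hits[function_id] += 1
        ranked = [(fid, n / len(query)) for fid, n in hits.items()
                  if n / len(query) >= min_coverage]
        return sorted(ranked, key=lambda r: (-r[1], r[0]))


index = CloneIndex()
index.add("product_a/parse_font", "len = read(buf)\ncopy(dst, buf, len)\nreturn dst")
index.add("product_b/load_font", "len = read(buf)\ncheck(len)\ncopy(dst,buf,len)")
index.add("product_c/unrelated", "x = 1\nreturn x")
matches = index.search("len = read(buf)\ncopy(dst, buf, len)")
```

Because the index is built once and each query only touches the postings for its own statements, lookup cost depends on the snippet rather than on the size of the indexed codebases. An exact-statement index like this one would miss modified statements, so a real service would need a fuzzier matching step on top.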
Vulnerability investigation workflow
[Figure: manual workflow. Steps: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Variants finding is a manual & ad hoc investigation in which a code snippet is exchanged between MSRC and Teams A, B, and C to locate code clones.]
Vulnerability investigation workflow
[Figure: automated workflow. Same steps; variants finding becomes an automated investigation in which a code snippet is submitted to the clone search service and code clones are returned.]
• Completeness is the key
• Web service API for automation
More secure Microsoft products
Code clone search service integrated into vulnerability investigation process of MSRC
Automated laborious manual efforts
Faster response time, critical in security context
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching for similar snippets to fix a bug once
Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.
http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Source: © C. Garling, Wired.com
Industrial Evaluations= Real Adoption
• Papers on industrial studies/evaluations on applying tools on industrial code: who apply?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Apply one-time (hit & run) or continuous adoption?
Need to value real adoption (e.g., in reviewing papers)
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: “Pointer Analysis”
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
Section 4.3: Designing an Analysis for a Client’s Needs
“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com/
http://www.blackducksoftware.com/
http://research.microsoft.com/en-us/groups/sa/
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tend to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava and Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or address them one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice (not consult/work w/ industry)
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → Isolated complex data structures → Real-world classes, as focused by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: Device driver verification
    • MSR SAGE: Security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users
Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  Prototype development → Early adoption → Technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Industrial Evaluations= Real Adoption
bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties
bull Non-target users (such as students)
bull Target users but not developers of the industrial code
bull Developers of the industrial code
bull Apply one-time (hitamprun) or continuous adoption
Need to value real adoption (eg in reviewing papers)
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
MS Academic Search ldquoPointer Analysisrdquo
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (and likely worse for the others) rather than a comprehensive solution
  • May offer no way to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (or at least does not compromise other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al., DSN'09]
  • Coverity [Bessey et al., CACM'10] vs. MSRA XIAO [Dang et al., ACSAC'12] / PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • It is too much to include both the approach/tool itself and its usability evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise or willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al., ISSTA'06] vs. Daikon [Ernst et al., ICSE'99]
  • Debugging user study [Parnin & Orso, ISSTA'11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso, ISSTA'11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al., NDSS'08]
    • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA'09]
http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
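The cost and benefit dimensions can be made concrete with a small calculation. The sketch below is illustrative only: the `Report` record, the severity weights, and the per-report inspection cost are hypothetical values, not figures from the talk or from any of the cited tools. It shows how a tool that finds more real bugs can still have lower net value once the human cost of inspecting false positives is counted.

```python
from dataclasses import dataclass

@dataclass
class Report:
    is_true_bug: bool
    severity: int  # hypothetical benefit weight; 0 for a false positive

def precision(reports):
    """Fraction of reported warnings that are real bugs."""
    if not reports:
        return 0.0
    return sum(r.is_true_bug for r in reports) / len(reports)

def net_value(reports, inspection_cost):
    """Benefit of the real bugs found minus the human cost of triaging every report."""
    benefit = sum(r.severity for r in reports if r.is_true_bug)
    cost = inspection_cost * len(reports)
    return benefit - cost

# Hypothetical tool A: few reports, mostly real, high-severity bugs.
tool_a = [Report(True, 10), Report(True, 8), Report(False, 0)]
# Hypothetical tool B: one more real bug, but buried in 17 false positives.
tool_b = [Report(True, 10), Report(True, 8), Report(True, 2)] + [Report(False, 0)] * 17

for name, reports in (("A", tool_a), ("B", tool_b)):
    print(name, round(precision(reports), 2), net_value(reports, inspection_cost=1.5))
```

Measured on recall alone, tool B looks better; measured on net value with a nonzero inspection cost, tool A does.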
Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenue)
• Academia (research innovations) vs. industry (likely involving engineering effort)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructure, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
• Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research topics (vertical)
• Software development process
• Software systems
• Software users
Technology pillars (horizontal)
• Information visualization
• Analysis algorithms
• Large-scale computing
Connection to practice
MSR 2012 39
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al., ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al., ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…
What does "getting real" mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
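The slide names the four steps but not the algorithms behind them, so the following is only a minimal sketch of how such a pipeline can be staged, not XIAO's implementation. Every choice in it is an assumption made for illustration: statements are split on `;`, coarse matching keeps function pairs that share at least one normalized statement, fine matching uses Python's `difflib.SequenceMatcher` over statement sequences, and pruning drops very small functions. The `partitions` input stands in for the source code partitions that could be pre-processed independently.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def preprocess(function_source):
    """Split a function body into statements and strip all whitespace (crude normalization)."""
    stmts = [re.sub(r"\s+", "", s) for s in function_source.split(";")]
    return [s for s in stmts if s]

def coarse_match(functions):
    """Cheap filter: candidate pairs are functions sharing at least one statement."""
    index = defaultdict(set)
    for name, stmts in functions.items():
        for s in stmts:
            index[s].add(name)
    candidates = set()
    for names in index.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates

def fine_match(functions, candidates, threshold):
    """Costlier check, run only on candidates: statement-sequence similarity."""
    matches = []
    for a, b in sorted(candidates):
        score = SequenceMatcher(None, functions[a], functions[b]).ratio()
        if score >= threshold:
            matches.append((a, b, score))
    return matches

def prune(functions, matches, min_statements):
    """Drop clone pairs too small to be worth an engineer's attention."""
    return [(a, b, s) for a, b, s in matches
            if min(len(functions[a]), len(functions[b])) >= min_statements]

def detect_clones(partitions, threshold=0.6, min_statements=3):
    functions = {}
    for partition in partitions:  # each partition could be pre-processed by a separate worker
        for name, source in partition.items():
            functions[name] = preprocess(source)
    candidates = coarse_match(functions)
    return prune(functions, fine_match(functions, candidates, threshold), min_statements)

# Hypothetical input: two partitions of a codebase, keyed by function name.
partitions = [
    {"f": "a++; b++; c = foo(a, b); d = bar(a, b, c);"},
    {"g": "a++; b++; c = foo(a, b); e = a + c;", "h": "return 0;"},
]
clones = detect_clones(partitions)
print(clones)
```

The point of the staging is that the cheap coarse step shrinks the set of pairs before the expensive fine step runs, which is what lets the simple design scale.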
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • # of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
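To make the tunable knobs concrete, the sketch below profiles the differences between the two loop bodies shown on the slide. It is an assumption-laden illustration, not XIAO's metric: statement similarity is approximated with `difflib.SequenceMatcher`, matching is greedy and order-insensitive, and the function names and threshold defaults (`diff_profile`, `is_clone`, `stmt_threshold`, and so on) are hypothetical.

```python
from difflib import SequenceMatcher

def statement_similarity(s1, s2):
    """Character-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1, s2).ratio()

def diff_profile(a, b, stmt_threshold=0.8):
    """Count modified, inserted, deleted, and reordered statements between a and b."""
    unmatched_b = list(range(len(b)))
    pairs = []  # (index in a, index in b, similarity)
    for i, stmt in enumerate(a):
        scored = [(statement_similarity(stmt, b[j]), j) for j in unmatched_b]
        if not scored:
            continue
        score, j = max(scored, key=lambda t: t[0])
        if score >= stmt_threshold:
            pairs.append((i, j, score))
            unmatched_b.remove(j)
    b_order = [j for _, j, _ in pairs]
    # An inversion is a pair of matched statements whose relative order differs.
    inversions = sum(1 for x in range(len(b_order))
                     for y in range(x + 1, len(b_order)) if b_order[x] > b_order[y])
    return {
        "modified": sum(1 for _, _, s in pairs if s < 1.0),
        "inserted": len(unmatched_b),
        "deleted": len(a) - len(pairs),
        "inversions": inversions,
    }

def is_clone(profile, max_modified=1, max_inserted_deleted=1, max_inversions=2):
    """Accept a pair only if every kind of difference stays within its budget."""
    return (profile["modified"] <= max_modified
            and profile["inserted"] + profile["deleted"] <= max_inserted_deleted
            and profile["inversions"] <= max_inversions)

# The two loop bodies from the slide, one statement per entry.
first = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
second = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]

profile = diff_profile(first, second)
print(profile)
print(is_clone(profile), is_clone(profile, max_inversions=0))
```

With the default budgets the pair is accepted; setting `max_inversions=0` (no tolerance for disordered statements) rejects it, so the user gets exactly the clones the settings describe.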
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for the value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0-day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
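The search scenario differs from detection in that the codebases are indexed once and each query is a single snippet. The slides do not describe how the service indexes code, so the sketch below uses a plain inverted index from normalized statements to function identifiers as an assumed stand-in; `CloneIndex`, `min_coverage`, and the function identifiers are hypothetical names chosen for illustration.

```python
import re
from collections import Counter, defaultdict

def normalize(statement):
    """Strip whitespace so formatting differences do not affect lookup."""
    return re.sub(r"\s+", "", statement)

class CloneIndex:
    """Inverted index: normalized statement -> ids of functions containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, function_id, statements):
        for s in statements:
            self._postings[normalize(s)].add(function_id)

    def search(self, snippet, min_coverage=0.5):
        """Return (function id, fraction of snippet statements found), best first."""
        query = {normalize(s) for s in snippet}
        hits = Counter()
        for s in query:
            for function_id in self._postings.get(s, ()):
                hits[function_id] += 1
        ranked = [(fid, n / len(query)) for fid, n in hits.items()
                  if n / len(query) >= min_coverage]
        return sorted(ranked, key=lambda r: (-r[1], r[0]))

# Hypothetical indexed functions from three codebases.
index = CloneIndex()
index.add("productA/font.c:parse", ["len = read(buf)", "copy(dst, buf, len)", "return dst"])
index.add("productB/render.c:load", ["len = read(buf)", "copy(dst, buf, len)", "log(len)"])
index.add("productC/util.c:id", ["return x"])

# Query with the vulnerable snippet; formatting differs from the indexed code.
results = index.search(["len=read(buf)", "copy(dst,buf,len)"])
print(results)
```

Because lookup cost depends on the size of the query rather than the size of the indexed code, this shape of design is one way to get near-real-time responses; a production service would also need approximate statement matching, which this sketch omits.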
MSRC: Microsoft Security Response Center
Vulnerability investigation workflow
• Workflow: issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix
• Variants finding: manual & ad hoc investigation
  • MSRC and Team A, Team B, Team C: code snippet out, code clones back
Vulnerability investigation workflow
• Workflow: issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix
• Variants finding: automated investigation
  • Clone search service: code snippet in, code clones out
  • Completeness is the key
  • Web service API for automation
More secure Microsoft products
• Code clone search service integrated into the vulnerability investigation process of MSRC
  • Automated laborious manual effort
  • Faster response time, critical in a security context
• Real security issues proactively identified and addressed
• Example – MS security bulletin MS12-034: Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
  • Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
  • Insufficient bounds check within the font parsing subsystem of win32k.sys
  • Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
• Microsoft Security Research & Defense Blog about this bulletin:
  "However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a 'Cloned Code Detection' system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from the general manager of VSU
  • Concrete scenarios identified
  • Easy sell at the VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of the VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once, everywhere
• Finding refactoring opportunities
Summary
• Mindset change needed for the community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
MS Academic Search: "Pointer Analysis"
"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind, PASTE'01]
"During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, 'Haven't we solved this problem yet?' With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."
Source: © M. Hind
Section 4.3: Designing an Analysis for a Client's Needs
"Barbara Ryder expands on this topic: '… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.'"
Source: © M. Hind & B. Ryder
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
MS Academic Search: "Clone Detection"
Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature after n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa/
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
ICSE 2013 SEIP 49
1 Clone navigation based on source tree hierarchy
2 Pivoting of folder level statistics
3 Folder level statistics
4 Clone function list in selected folder
5 Clone function filters
6 Sorting by bug or refactoring potential
7 Tagging
1 2 3 4 5 6
7
1 Block correspondence
2 Block types
3 Block navigation
4 Copying
5 Bug filing
6 Tagging
1
2
3
4
1
6
5
How to navigate through the large number of detected clones
How to quickly review a pair of clones
Collaboration
bull Collaboration models
bull Communication
bull Champion in product teams
bull Getting engineering support
ICSE 2013 SEIP 50
Collaboration models
ICSE 2013 SEIP 51
Pull
Push
Join
Communication ndash getting connected
bull Reaching-out to practitioners
bull Understanding their business
bull Speaking practitionersrsquo languages
bull Finding out their pain pointsbull Understanding their scenarios
bull Experiencing their pain
bull Articulating their problems
ICSE 2013 SEIP 52
Communication ndash forming partnership
bull Finding and defining shared goals
bull Setting the right expectation
bull Building a roadmap
bull Forming virtual team (creating an email alias)
bull Adopting a milestone approach
bull Conducting regular sync-up
ICSE 2013 SEIP 53
Example project ndash XIAO
bull Tons of papers published in the past 10 years
bull 6 years of International Workshop on Software Clones (IWSC) since 2006
bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)
bull Duplication Redundancy and Similarity in Software (2006)
bull No code clone analysis tools in MS
bull No product offering
ICSE 2013 SEIP 54
Source httpwwwdagstuhlde12071
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Motivation
bull Copy-and-paste is a common developer behavior
bull A real tool widely adopted internally and externally
ICSE 2013 SEIP 55
Reaching out (1)
bull Demonstrating XIAO at TechFest
bull Posting XIAO at internal website
bull Active ldquosellingrdquo to various teams
bull What we gainedbull Opportunities to run XIAO on different codebases and
produce rich results
bull Feedback to improve both algorithm and system
bull Expanded network
ICSE 2013 SEIP 56
Reaching out (2)
bull What did not land well internallybull Wide interest but no concrete takers
bull Why no takersbull What exactly is the valuable proposition
bull Long way to go from code clones to bugs
bull High cost for code refactoring
bull Product prioritization
bull Lessons learnedbull Killer scenarios needed for value proposition
bull Security is a big stick
ICSE 2013 SEIP 57
Potential 0day vulnerability disclosure
ICSE 2013 SEIP 58
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC
bull Search scenario vs detection scenariobull Code snippet as input
bull Much larger scale of codebases
bull Near-real-time response
bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases
bull Deployed in used by and transferred to MSRC
bull Champion in MSRC worked with us all the waybull Providing feedback and update
bull Prompting within MSRC
ICSE 2013 SEIP 59
Microsoft Security Response Center
Vulnerability investigation workflow
ICSE 2013 SEIP 60
DesignImplementTest fix
Variants finding
Root cause investigation amp
source location
Issue reproducing
Team A
MSRC
Manual amp ad hoc investigation
Code snippet
Team B
Team C
Code clones
Vulnerability investigation workflow
ICSE 2013 SEIP 61
Clone search service
Completeness is the key Web service API for
automation
Code snippet
Code clones
Automated Investigation
Code snippet
Code clones
DesignImplementTest fix
Variants finding
Root cause investigation amp
source location
Issue reproducing
More secure Microsoft products
ICSE 2013 SEIP 62
Automated laborious manual efforts
Faster response time critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified
and addressed
Example: MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once, in all copies
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
MS Academic Search: "Pointer Analysis"
"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind PASTE'01]
"During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, "Haven't we solved this problem yet?" With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."
Source: © M. Hind
Section 4.3: Designing an Analysis for a Client's Needs
"Barbara Ryder expands on this topic: "... We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.""
Source: © M. Hind & B. Ryder
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
MS Academic Search: "Clone Detection"
Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask "When scale is up, will my solution still work?"
  • Tend to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, ...)
  • Often demands a scalable solution
• Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, ...)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA'02], Klee [Cadar et al. OSDI'08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or to tackle them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE'09, OOPSLA'11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN'09]
  • Coverity [Bessey et al. CACM'10] vs. MSRA XIAO [Dang et al. ACSAC'12] / PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tend not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, ...)
• Examples
  • Agitar [Boshernitsan et al. ISSTA'06] vs. Daikon [Ernst et al. ICSE'99]
  • Debugging user study [Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA'11]
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS'08]
    • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA'10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, ...
• Academia: educating students, ...
MSRA Software Analytics Group
• Mission: utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research topics (vertical)
  • Software Development Process
  • Software Systems
  • Software Users
• Technology pillars (horizontal)
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty...
• Data is incomplete and noisy...
• The scale of data is huge...
• We do not have all the time in the world to compute...
• The machines are not powerful enough...
• End users are "impatient"...
• Product teams are always busy...
• Product teams do not commit before seeing everything working...
• Product teams change plans and priorities...
• Product teams speak "different languages"...
• More ...
What does "getting real" mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project: XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
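The four steps named on this slide can be pictured as a small pipeline. The sketch below is an illustration only and is not taken from XIAO: the function names, the statement-splitting rule, the `min_shared` and `threshold` parameters, and the use of `difflib.SequenceMatcher` for fine matching are all assumptions. It also shows one way the pre-processing step could be run per source partition in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

def preprocess(item):
    """Step 1: normalize one function into a list of whitespace-free statements."""
    name, code = item
    statements = ["".join(s.split()) for s in code.split(";")]
    return name, [s for s in statements if s]

def coarse_match(functions, min_shared=2):
    """Step 2: cheap filter keeping pairs that share enough identical statements."""
    pairs = []
    for (a, sa), (b, sb) in combinations(functions, 2):
        if len(set(sa) & set(sb)) >= min_shared:
            pairs.append((a, sa, b, sb))
    return pairs

def fine_match(pairs):
    """Step 3: score each surviving pair on its ordered statement sequences."""
    return [(a, b, SequenceMatcher(None, sa, sb).ratio())
            for a, sa, b, sb in pairs]

def prune(scored, threshold=0.6):
    """Step 4: drop pairs whose similarity falls below the threshold."""
    return [(a, b, round(r, 2)) for a, b, r in scored if r >= threshold]

def detect_clones(partitions, threshold=0.6):
    # Each function is pre-processed independently, so partitions of the
    # source tree can be handed to separate workers.
    with ThreadPoolExecutor() as pool:
        functions = [f for part in partitions
                     for f in pool.map(preprocess, part)]
    return prune(fine_match(coarse_match(functions)), threshold)

# Two hypothetical partitions of a codebase.
partitions = [
    [("f1", "a++; b++; c = foo(a, b); d = bar(a, b, c);"),
     ("f2", "total = 0; print(total);")],
    [("f3", "c = foo(a, b); a++; b++; d = bar(a, b, c); e++;")],
]
clones = detect_clones(partitions)
```

The point of the coarse step in this sketch is that the expensive sequence comparison runs only on pairs that already share statements, which is what keeps the later steps affordable as the codebase grows.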
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Inserted/deleted/modified statements
  • Balance between code structure and disordered statements
for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}
for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
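To make the tuning knobs concrete, the two loop bodies shown on this slide can be compared under different settings. This is a minimal sketch and not XIAO's actual metric: the functions `statement_similarity` and `snippet_similarity`, the `stmt_threshold` and `allow_reorder` parameters, and the greedy matching strategy are assumptions introduced for illustration.

```python
from difflib import SequenceMatcher

def statement_similarity(s1, s2):
    """Character-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1, s2).ratio()

def snippet_similarity(a, b, stmt_threshold=0.8, allow_reorder=True):
    """Fraction of the longer snippet's statements that find a partner.

    allow_reorder=True tolerates disordered statements and near-identical
    (modified) statements; allow_reorder=False counts only identical
    statements that appear in the same relative order.
    """
    if not a or not b:
        return 0.0
    if allow_reorder:
        # Greedy one-to-one matching that ignores statement order.
        unused = list(b)
        matched = 0
        for s in a:
            best = max(unused, key=lambda t: statement_similarity(s, t),
                       default=None)
            if best is not None and statement_similarity(s, best) >= stmt_threshold:
                unused.remove(best)
                matched += 1
    else:
        blocks = SequenceMatcher(None, a, b).get_matching_blocks()
        matched = sum(block.size for block in blocks)
    return matched / max(len(a), len(b))

# Loop bodies of the two snippets from the slide.
left = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
right = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]

loose = snippet_similarity(left, right, stmt_threshold=0.8, allow_reorder=True)
exact = snippet_similarity(left, right, stmt_threshold=1.0, allow_reorder=True)
strict = snippet_similarity(left, right, allow_reorder=False)
```

Under these assumptions the loose setting pairs the modified statement `e = a + c` with `e = a + d` and ignores the reordering, the exact setting rejects the modified statement, and the strict setting also penalizes the moved `c = foo(a, b)`, so the same pair of snippets scores progressively lower as the settings tighten.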
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
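The navigation features listed for this slide (folder-level statistics, a function list per folder, filtering, and sorting by potential) amount to simple aggregations over the detected clones. The sketch below is an assumption-laden illustration rather than the XIAO user interface: the file paths, the numeric "potential" scores, and both function names are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

def folder_statistics(clone_functions, depth=1):
    """Count cloned functions per folder, truncated to `depth` path components."""
    counts = Counter()
    for path, _score in clone_functions:
        parts = PurePosixPath(path).parent.parts[:depth]
        counts["/".join(parts) or "."] += 1
    return counts

def functions_in_folder(clone_functions, folder, min_score=0.0):
    """List cloned functions under `folder`, highest potential first."""
    prefix = folder.rstrip("/") + "/"
    hits = [(p, s) for p, s in clone_functions
            if p.startswith(prefix) and s >= min_score]
    return sorted(hits, key=lambda item: -item[1])

# Hypothetical clone records: (file path, bug-or-refactoring potential score).
clones = [
    ("core/font/parse.c", 0.9),
    ("core/draw/line.c", 0.2),
    ("app/render/font.c", 0.8),
    ("core/font/cache.c", 0.4),
]
stats = folder_statistics(clones)
top = functions_in_folder(clones, "core", min_score=0.3)
```

Raising `depth` pivots the same statistics to deeper folders, and `min_score` plays the role of a filter, so a reviewer would start from the folders and functions most likely to matter.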
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication: getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication: forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project: XIAO
• Tons of papers published in the past 10 years
  • 6 years of International Workshop on Software Clones (IWSC) since 2006
  • Dagstuhl Seminars
    • Software Clone Management towards Industrial Application (2012)
    • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
23
ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind
ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]
24
Section 43 Designing an Analysis for a Clientrsquos Needs
ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is
to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo
Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements
for (i = 0; i < n; i++) {
  a++;
  b++;
  c = foo(a, b);
  d = bar(a, b, c);
  e = a + c;
}
for (i = 0; i < n; i++) {
  c = foo(a, b);
  a++;
  b++;
  d = bar(a, b, c);
  e = a + d;
  e++;
}
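One way to picture the "balance between code structure and disordered statements" knob is a score that blends an order-sensitive comparison with an order-insensitive one. The sketch below is an assumption made for illustration, not the metric used by XIAO: the function names, the `order_weight` parameter, and the choice of `difflib.SequenceMatcher` and multiset overlap are all hypothetical. The two statement lists are the loop bodies from the slide.

```python
from collections import Counter
from difflib import SequenceMatcher

# Loop bodies of the two snippets on the slide, one statement per entry.
SNIPPET_A = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
SNIPPET_B = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]


def ordered_similarity(a, b):
    """Order-sensitive score: statements must match in sequence to count."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def unordered_similarity(a, b):
    """Order-insensitive score: Dice coefficient over the multisets of statements."""
    if not a and not b:
        return 1.0
    common = sum((Counter(a) & Counter(b)).values())
    return 2 * common / (len(a) + len(b))


def tunable_similarity(a, b, order_weight=0.5):
    """Blend the two scores; order_weight=1.0 penalizes reordering the most."""
    return (order_weight * ordered_similarity(a, b)
            + (1 - order_weight) * unordered_similarity(a, b))
```

For the slide's pair, the moved `c = foo(a, b)` statement lowers the order-sensitive score more than the order-insensitive one, so raising `order_weight` makes the pair look less similar. A threshold on the blended score would then decide whether the pair is reported.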
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and update
  • Promoting within MSRC
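The search scenario (a code snippet as input, a pre-built index, a fast answer) can be sketched with an inverted index over token n-grams. This is a minimal illustration under stated assumptions, not the design of the deployed service: the tokenizer, the `CloneIndex` class, the `min_coverage` parameter, and the file names and code in the example are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed tokenizer: identifiers, integer literals, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN.findall(code)


def shingles(tokens, n=3):
    """Return the set of consecutive n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


class CloneIndex:
    """Inverted index from token n-grams to the snippets that contain them."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> ids of indexed snippets

    def add(self, snippet_id, code):
        for gram in shingles(tokenize(code), self.n):
            self.postings[gram].add(snippet_id)

    def search(self, code, min_coverage=0.6):
        """Return (id, coverage) pairs, best first, where coverage is the
        fraction of the query's n-grams found in the indexed snippet."""
        grams = shingles(tokenize(code), self.n)
        if not grams:
            return []
        hits = defaultdict(int)
        for gram in grams:
            for snippet_id in self.postings.get(gram, ()):
                hits[snippet_id] += 1
        results = [(sid, count / len(grams)) for sid, count in hits.items()
                   if count / len(grams) >= min_coverage]
        return sorted(results, key=lambda item: (-item[1], item[0]))


# Hypothetical codebases and query, invented for this example.
index = CloneIndex()
index.add("product_a/parse.c", "if (len > size) return; memcpy(dst, src, len);")
index.add("product_b/render.c", "if (n > size) return; memcpy(dst, src, n);")
index.add("product_c/util.c", "x = y + z;")
QUERY = "memcpy(dst, src, len);"
```

Lowering `min_coverage` trades precision for completeness: with the default the query matches only the exact copy, while `index.search(QUERY, min_coverage=0.5)` also returns the renamed variant. Because lookups touch only the postings for the query's n-grams, response time depends on the query rather than on rescanning the indexed code.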
Vulnerability investigation workflow
[Figure: MSRC workflow stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Variants finding is a manual & ad hoc investigation: MSRC sends a code snippet to Team A, Team B, and Team C, which report code clones back.]
Vulnerability investigation workflow
[Figure: the same workflow stages (issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix), with variants finding done as an automated investigation: a code snippet goes to the clone search service, which returns code clones. Callouts: completeness is the key; Web service API for automation.]
More secure Microsoft products
Automated laborious manual efforts
Faster response time, critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
Three publicly disclosed and seven privately reported vulnerabilities involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
Section 4.3 Designing an Analysis for a Client’s Needs
Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.”
Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder
MS Academic Search: “Clone Detection”
Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field already grows mature with n years of efforts on intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tend to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or address one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes, then isolated complex data structures, then real-world classes as focused by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
  • Cost: e.g., human cost (inspecting false positives)
  • Benefit: e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: Device driver verification
  • MSR SAGE: Security testing of binaries [Godefroid et al. NDSS’08]
  • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research topics (vertical): Software Development Process, Software Systems, Software Users
• Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real: real data, real problems, real users, real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
MS Academic Search ldquoClone Detectionrdquo
Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on
intermediate steps
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
26
Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004
MSRAXIAO
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012
httppatterninsightcom
httpwwwblackducksoftwarecom
httpresearchmicrosoftcomen-usgroupssa
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users
Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  ¤ High tunability  ¤ High scalability
  ¤ High compatibility  ¤ High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
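The four step names above can be illustrated with a minimal Python sketch. Only the step names come from the slide; the identifier normalization, the shared-statement index used for coarse matching, the use of `difflib.SequenceMatcher` for fine matching, the thresholds, and every function name are assumptions made for illustration, not XIAO's actual algorithm.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

KEYWORDS = {"for", "while", "if", "else", "return"}


def preprocess(source):
    """Step 1: turn source text into a list of normalized statements.

    Identifiers become the placeholder ID so renamed variables still match
    (an assumed normalization, chosen only for this sketch).
    """
    statements = []
    for line in source.splitlines():
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", line)
        if not tokens:
            continue
        statements.append(tuple(
            "ID" if re.match(r"[A-Za-z_]", t) and t not in KEYWORDS else t
            for t in tokens
        ))
    return statements


def coarse_match(functions, min_shared=2):
    """Step 2: cheaply pick candidate pairs that share enough statements."""
    index = defaultdict(set)      # normalized statement -> function names
    for name, statements in functions.items():
        for statement in set(statements):
            index[statement].add(name)
    shared = defaultdict(int)     # (name_a, name_b) -> shared statement count
    for names in index.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                shared[(first, second)] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, candidates):
    """Step 3: score each candidate pair with an order-aware sequence ratio."""
    scored = []
    for first, second in candidates:
        matcher = SequenceMatcher(None, functions[first], functions[second],
                                  autojunk=False)
        scored.append((first, second, matcher.ratio()))
    return scored


def prune(scored, threshold=0.6):
    """Step 4: drop pairs whose similarity falls below the threshold."""
    return [item for item in scored if item[2] >= threshold]


def detect_clones(sources, threshold=0.6):
    # Step 1 handles each function independently, so it could run per
    # source partition in parallel; it is sequential here for brevity.
    functions = {name: preprocess(text) for name, text in sources.items()}
    return prune(fine_match(functions, coarse_match(functions)), threshold)
```

Calling `detect_clones` with a dictionary that maps hypothetical function names to their source text returns `(name, name, similarity)` triples for the pairs that survive pruning. The coarse step exists so that the more expensive fine comparison runs only on pairs that already share statements.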
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
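To make the three tuning knobs concrete, here is a small Python sketch that compares the two loops from the slide. It is an illustration under assumptions, not XIAO's metric: the parameter names `stmt_threshold`, `max_diff_ratio`, and `max_disordered`, their default values, and the way they are combined are all hypothetical.

```python
import re
from difflib import SequenceMatcher

FIRST = ["for (i = 0; i < n; i++)", "a++;", "b++;",
         "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
SECOND = ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
          "d = bar(a, b, c);", "e = a + d;", "e++;"]


def tokens(statement):
    return re.findall(r"\w+|[^\w\s]", statement)


def statement_similarity(first, second):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, tokens(first), tokens(second),
                           autojunk=False).ratio()


def matched_in_order(a, b, stmt_threshold):
    """Longest common subsequence; statements match if similar enough."""
    rows, cols = len(a), len(b)
    best = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            hit = statement_similarity(a[i], b[j]) >= stmt_threshold
            best[i + 1][j + 1] = max(best[i][j] + hit,
                                     best[i][j + 1], best[i + 1][j])
    return best[rows][cols]


def matched_any_order(a, b, stmt_threshold):
    """Greedy one-to-one matching that ignores statement order."""
    unused = list(b)
    count = 0
    for statement in a:
        for candidate in unused:
            if statement_similarity(statement, candidate) >= stmt_threshold:
                unused.remove(candidate)
                count += 1
                break
    return count


def compare(a, b, stmt_threshold=0.8, max_diff_ratio=0.2, max_disordered=2):
    """Return (is_clone, diff_ratio) under the three assumed knobs."""
    total = len(a) + len(b)
    if total == 0:
        return True, 0.0
    ordered = matched_in_order(a, b, stmt_threshold)
    unordered = matched_any_order(a, b, stmt_threshold)
    # Credit at most max_disordered extra matches that are out of order.
    matched = ordered + min(max(unordered - ordered, 0), max_disordered)
    diff_ratio = 1 - 2 * matched / total
    return diff_ratio <= max_diff_ratio, diff_ratio
```

With the defaults, `compare(FIRST, SECOND)` accepts the pair: the moved `c = foo(a, b);` is tolerated as one disordered statement and `e = a + d;` counts as a modified version of `e = a + c;`. Setting `max_disordered=0` or `stmt_threshold=1.0` makes the same pair fall outside the allowed difference, which is the kind of behavior the slide title describes.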
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
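The search scenario (a code snippet as input, a pre-built index over many codebases, a fast answer) can be sketched as a toy inverted index in Python. This is only an illustration of the idea: the class name `CloneSearchIndex`, the whitespace-only normalization, and the fixed-size window of consecutive lines are assumptions, and the sketch does exact matching on normalized lines rather than the tunable near-duplicate matching described for XIAO.

```python
from collections import Counter, defaultdict


def normalized_lines(text):
    """Drop blank lines and all whitespace so formatting changes do not matter."""
    return ["".join(line.split()) for line in text.splitlines() if line.strip()]


class CloneSearchIndex:
    """Toy inverted index from runs of consecutive lines to the files containing them."""

    def __init__(self, window=3):
        self.window = window
        self.postings = defaultdict(set)   # shingle -> {(path, position)}

    def _shingles(self, text):
        lines = normalized_lines(text)
        for start in range(len(lines) - self.window + 1):
            yield start, tuple(lines[start:start + self.window])

    def add_file(self, path, text):
        """Index every window of consecutive lines in one file (done offline)."""
        for start, shingle in self._shingles(text):
            self.postings[shingle].add((path, start))

    def search(self, snippet):
        """Return (path, hit_count) pairs, best match first."""
        hits = Counter()
        for _, shingle in self._shingles(snippet):
            for path, _position in self.postings.get(shingle, ()):
                hits[path] += 1
        return hits.most_common()
```

Indexing happens once per codebase, so a query only has to look up the windows of the input snippet. That split between offline indexing and online lookup is one plausible way to reach near-real-time response over a large body of code.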
Vulnerability investigation workflow
[Diagram. Workflow stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Labels: MSRC, Team A, Team B, Team C, Code snippet, Code clones, Manual & ad hoc investigation.]
Vulnerability investigation workflow
[Diagram. Same workflow stages, with labels: Clone search service, Code snippet, Code clones, Automated investigation.]
Completeness is the key
Web service API for automation
More secure Microsoft products
Automated laborious manual efforts
Faster response time, critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa
Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis
Scalability
• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], KLEE [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tend to make assumptions to simplify problems, or address one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes, then isolated complex data structures, then real-world classes, the focus of our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Mindset Changing is Needed for Our Community
bullNeed to get out of comfort zone
bullNeed to value (and pursue) ldquorealnessrdquo
bullNeed to aim for ultimate tasks
bullNeed to value (and pursue) tech readiness
Example Dimensions of Tech Readiness
bull Scalability
bullComplexity
bullApplicability
bullUsability (human in the loop)
bullCost-Benefit Analysis
ScalabilitybullAcademia
bull Rarely ask ldquoWhen scale is up will my solution still workrdquo
bull Tend to focus on small or toy scale problems
bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution
bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable
(performance maintenance hellip)
bull Academia tend to value sophistication gt simplicity
bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]
httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
  • Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users
Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing
Connection to practice
MSR 2012 39
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
ICSE 2013 SEIP 42
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development
  • Early adoption
  • Technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process
  1. Pre-processing
  2. Coarse Matching
  3. Fine Matching
  4. Pruning
• Easily parallelizable based on source code partition
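The four steps named on this slide can be illustrated with a small sketch. This is not XIAO's actual algorithm, which the slides do not spell out; it is a minimal illustration under stated assumptions. The normalization rules, the `min_shared`, `threshold`, and `min_statements` parameters, and all function names are hypothetical, and fine matching is delegated to Python's standard `difflib`.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source: str) -> list[str]:
    """Step 1 (assumed): normalize each non-empty line into a statement token."""
    statements = []
    for line in source.splitlines():
        line = re.sub(r"[A-Za-z_]\w*", "ID", line)  # identifiers and keywords -> ID
        line = re.sub(r"\d+", "NUM", line)          # numeric literals -> NUM
        line = re.sub(r"\s+", "", line)             # drop all whitespace
        if line:
            statements.append(line)
    return statements


def coarse_match(functions: dict[str, list[str]], min_shared: int = 2) -> set[tuple[str, str]]:
    """Step 2 (assumed): cheap candidate pairs sharing enough distinct statements."""
    index = defaultdict(set)
    for name, stmts in functions.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


def fine_match(a: list[str], b: list[str]) -> float:
    """Step 3 (assumed): sequence similarity over normalized statements."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def detect_clones(sources: dict[str, str], threshold: float = 0.6,
                  min_statements: int = 3) -> dict[tuple[str, str], float]:
    # Pre-processing is independent per function, so it could run per
    # source partition in parallel; a plain loop keeps the sketch simple.
    functions = {name: preprocess(src) for name, src in sources.items()}
    result = {}
    for first, second in coarse_match(functions):
        score = fine_match(functions[first], functions[second])
        # Step 4 (assumed): prune weak or trivially short matches.
        too_short = min(len(functions[first]), len(functions[second])) < min_statements
        if score >= threshold and not too_short:
            result[(first, second)] = score
    return result
```

Calling `detect_clones` with a dictionary that maps function names to source text returns the surviving pairs with their similarity scores. The coarse step exists only to avoid running the more expensive fine matching on every pair.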
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Inserted/deleted/modified statements
  • Balance between code structure and disordered statements
Snippet 1:
    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }
Snippet 2:
    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
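To make the "balance between code structure and disordered statements" concrete, here is a hedged sketch of one way such a tunable score could be built. It is not XIAO's metric: the `order_weight` and `threshold` parameters and the function names are assumptions made for illustration, statements are compared by exact equality (a fuzzy per-statement similarity is left out), and the order-sensitive count relies on the matching blocks that Python's `difflib` happens to find. The two statement lists are the snippets from the slide.

```python
from collections import Counter
from difflib import SequenceMatcher

# The two snippets from the slide, one statement per entry.
SNIPPET_A = [
    "for (i = 0; i < n; i++)",
    "a++",
    "b++",
    "c = foo(a, b)",
    "d = bar(a, b, c)",
    "e = a + c",
]
SNIPPET_B = [
    "for (i = 0; i < n; i++)",
    "c = foo(a, b)",
    "a++",
    "b++",
    "d = bar(a, b, c)",
    "e = a + d",
    "e++",
]


def ordered_matches(a: list[str], b: list[str]) -> int:
    """Statements matched in the same relative order (difflib matching blocks)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def unordered_matches(a: list[str], b: list[str]) -> int:
    """Statements matched regardless of position (multiset intersection)."""
    return sum((Counter(a) & Counter(b)).values())


def similarity(a: list[str], b: list[str], order_weight: float = 0.5) -> float:
    """Blend order-sensitive and order-insensitive matches into one score in [0, 1].

    order_weight=1.0 rewards preserved code structure only;
    order_weight=0.0 fully tolerates reordered statements.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    blended = (order_weight * ordered_matches(a, b)
               + (1.0 - order_weight) * unordered_matches(a, b))
    return blended / longest


def is_clone(a: list[str], b: list[str], threshold: float = 0.7,
             order_weight: float = 0.5) -> bool:
    """Report a clone pair when the blended score reaches the threshold."""
    return similarity(a, b, order_weight) >= threshold
```

Under these assumptions the slide's two snippets score 5/7 when reordering is fully tolerated (`order_weight=0.0`) and 4/7 when only order-preserving matches count (`order_weight=1.0`), so the same pair is reported or suppressed depending on how the knobs are set.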
Explorability
How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
[Timeline figure; labels: Initial vulnerability reported in product A; Patch release of product B; Potential 0-day attack; Security bulletin released; Similar vulnerability found in product B by attackers]
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
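The search scenario differs from batch detection in that the input is a single code snippet and the answer must come back quickly, which suggests building an index ahead of time. The sketch below shows one generic way to do that with an inverted index over short windows of normalized statements. It is an illustration only and not the service described in the slides: the `CloneIndex` class, the window size `k`, the `min_hits` parameter, and the file locations in the usage note are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def shingles(code: str, k: int = 2) -> list[tuple[str, ...]]:
    """Split code into windows of k consecutive whitespace-normalized statements."""
    statements = [re.sub(r"\s+", "", line) for line in code.splitlines() if line.strip()]
    return [tuple(statements[i:i + k]) for i in range(len(statements) - k + 1)]


class CloneIndex:
    """Inverted index from statement windows to the locations containing them."""

    def __init__(self, k: int = 2) -> None:
        self.k = k
        self.postings: dict[tuple[str, ...], set[str]] = defaultdict(set)

    def add(self, location: str, code: str) -> None:
        """Index one unit of code (for example a function) under a location label."""
        for window in shingles(code, self.k):
            self.postings[window].add(location)

    def search(self, snippet: str, min_hits: int = 1) -> list[tuple[str, int]]:
        """Return (location, shared-window count) pairs, most hits first."""
        hits: Counter[str] = Counter()
        for window in set(shingles(snippet, self.k)):
            for location in self.postings.get(window, ()):
                hits[location] += 1
        return [(loc, count) for loc, count in hits.most_common() if count >= min_hits]
```

A caller would `add` every function from every codebase once, then call `search` with the vulnerable snippet. Because indexing happens up front, each query costs only a handful of dictionary lookups, which is the property a near-real-time search scenario needs.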
Vulnerability investigation workflow
[Workflow figure; step labels: Design/Implement/Test fix; Variants finding; Root cause investigation & source location; Issue reproducing]
• Before: manual & ad hoc investigation (code snippet in, code clones out) involving MSRC, Team A, Team B, and Team C
• After: automated investigation through the clone search service (code snippet in, code clones out)
  • Completeness is the key
  • Web service API for automation
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and seven privately reported vulnerabilities involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Scalability
• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
  • Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756
Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to tackle them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725
Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) rather than a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real-world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and its usability evaluation in a single paper
• Real-world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
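The cost and benefit dimensions listed on this slide can be combined into a single back-of-the-envelope figure. The sketch below is only an illustration: the function name, the parameters, and the idea of expressing both sides in engineer-minutes are assumptions made here, not something taken from the talk.

```python
def net_benefit(reports, precision, minutes_per_report, minutes_saved_per_true_bug):
    """Net engineer-minutes gained by triaging a tool's reports (hypothetical model).

    Cost: every report, true or false, must be inspected by a human.
    Benefit: only true positives pay back, weighted by an assumed
    severity-dependent saving per real bug.
    """
    true_bugs = reports * precision
    inspection_cost = reports * minutes_per_report
    return true_bugs * minutes_saved_per_true_bug - inspection_cost


# Same tool and same severity weighting, but different precision.
precise = net_benefit(100, 0.20, 10, 120)
noisy = net_benefit(100, 0.05, 10, 120)
```

Under these made-up numbers the tool pays off at 20% precision and costs more than it saves at 5%, which is one way to see why the human cost of inspecting false positives appears on the slide next to bug severity.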
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded
May 2009
Group members
12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users
Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing
Connection to practice
MSR 2012 39
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process
  • Pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
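The slide names the four steps but not what each one does, so the sketch below fills them in with deliberately simple stand-ins. Everything inside each function (statement splitting, shared-statement filtering, `difflib` scoring, the 0.6 threshold) is an assumption chosen for illustration and should not be read as XIAO's actual algorithm.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1 (assumed): split each body into whitespace-free statements."""
    return {
        name: [re.sub(r"\s+", "", stmt) for stmt in body.split(";") if stmt.strip()]
        for name, body in functions.items()
    }


def coarse_match(normalized):
    """Step 2 (assumed): cheap filter that pairs functions sharing any statement."""
    owners = defaultdict(set)
    for name, stmts in normalized.items():
        for stmt in stmts:
            owners[stmt].add(name)
    candidates = set()
    for names in owners.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates


def fine_match(normalized, candidates):
    """Step 3 (assumed): score each candidate pair by statement-sequence similarity."""
    return {
        (a, b): SequenceMatcher(None, normalized[a], normalized[b]).ratio()
        for a, b in candidates
    }


def prune(scores, threshold):
    """Step 4 (assumed): drop pairs below the similarity threshold."""
    return {pair: score for pair, score in scores.items() if score >= threshold}


def find_clones(functions, threshold=0.6):
    normalized = preprocess(functions)
    return prune(fine_match(normalized, coarse_match(normalized)), threshold)


# Hypothetical input: function name -> source text.
functions = {
    "f": "a++; b++; c = foo(a, b);",
    "g": "a++; b++; c = foo(a, b); d = 1;",
    "h": "x = 1;",
}
clones = find_clones(functions)
```

In this toy version the pre-processing step looks at one function at a time and the fine-matching step at one pair at a time, which hints at why a pipeline of this shape could be split across partitions of the source tree; the parallel execution itself is not shown.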
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
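One way to picture the “balance between code structure and disordered statements” is to score the two snippets above twice, once respecting statement order and once ignoring it, and then blend the two scores with a tunable weight. The functions and the `order_weight` parameter below are hypothetical and are not the metric used by XIAO; they only show how such a knob could behave.

```python
from collections import Counter
from difflib import SequenceMatcher


def ordered_similarity(a, b):
    """Order-sensitive score: reordered statements are penalized."""
    return SequenceMatcher(None, a, b).ratio()


def unordered_similarity(a, b):
    """Order-insensitive score: statements are compared as a multiset."""
    shared = sum((Counter(a) & Counter(b)).values())
    return 2 * shared / (len(a) + len(b))


def tuned_similarity(a, b, order_weight):
    """Blend both scores; order_weight=1.0 cares only about structure."""
    return (order_weight * ordered_similarity(a, b)
            + (1 - order_weight) * unordered_similarity(a, b))


# The two snippets from the slide, one statement per list entry.
left = ["for (i = 0; i < n; i++)", "a++", "b++",
        "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
right = ["for (i = 0; i < n; i++)", "c = foo(a, b)", "a++", "b++",
         "d = bar(a, b, c)", "e = a + d", "e++"]

strict = tuned_similarity(left, right, order_weight=1.0)
relaxed = tuned_similarity(left, right, order_weight=0.0)
```

Because `c = foo(a, b)` moved above the increments in the second snippet, the order-sensitive score comes out lower than the order-insensitive one, so lowering `order_weight` makes this pair look more like a clone.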
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
[Figure: timeline]
1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0-day attack
5. Patch release of product B
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
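The search scenario differs from batch detection in that the input is a single snippet and the answer has to come back quickly, which usually means indexing the codebases ahead of time. The sketch below illustrates that idea with an inverted index over token 3-grams; the class, the tokenizer, the file names, and the overlap threshold are all hypothetical and say nothing about how the MSRC service is actually built.

```python
import re
from collections import defaultdict


def shingles(code, k=3):
    """Set of k-token windows; tokens are identifiers, numbers, or single symbols."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Toy inverted index: shingle -> locations that contain it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        for gram in shingles(code, self.k):
            self.postings[gram].add(location)

    def search(self, snippet, min_overlap=0.5):
        """Return (location, fraction of the snippet's shingles found there)."""
        grams = shingles(snippet, self.k)
        hits = defaultdict(int)
        for gram in grams:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        results = [(loc, n / len(grams)) for loc, n in hits.items()
                   if n / len(grams) >= min_overlap]
        return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical codebases; the second file is a renamed copy of the first.
index = CloneIndex()
index.add("product_a/parse.c", "if (len > size) return; memcpy(dst, src, len);")
index.add("product_b/parse.c", "if (n > size) return; memcpy(dst, src, n);")
index.add("product_c/log.c", 'printf("hello");')

matches = index.search("memcpy(dst, src, len);")
```

Indexing is paid once per codebase, while each query only touches the postings of its own shingles, which is the property that makes a near-real-time response plausible at large scale. A real service would also need to handle renamed identifiers more robustly than this exact-token match does.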
Microsoft Security Response Center
Vulnerability investigation workflow
[Figure: workflow stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Variants finding relies on manual & ad hoc investigation: MSRC passes a code snippet to Team A, Team B, and Team C, which report code clones.]
Vulnerability investigation workflow
[Figure: the same workflow stages (issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix), with variants finding done by automated investigation: the clone search service takes a code snippet and returns code clones. Callouts: completeness is the key; Web service API for automation.]
More secure Microsoft products
Automated laborious manual efforts
Faster response time, critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin
ComplexitybullAcademia
bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)
bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry
bullReal-worldbull Often has high complexity violating these assumptions
bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data
structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]
httpdlacmorgcitationcfmid=2048083
httpdlacmorgcitationcfmid=1595725
Applicabilitybull Academia
bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution
bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution
bull Real-worldbull Need a comprehensive solution that would work generally (at least
not compromising too much other situations)
bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]
bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight
bull Industry adoption of open source tools
httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf
httpresearchmicrosoftcomjump175199
Usabilitybull Academia
bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE
bull too much to include both the approachtool itself and usabilityits evaluation in a single paper
bull Real-worldbull Often has human in the loop (familiar IDE integration social effect
lack of expertisewillingness to write specshellip)
bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]
bull Debugging user study [ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258
httpdlacmorgcitationcfmid=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
ICSE 2013 SEIP 49
1 Clone navigation based on source tree hierarchy
2 Pivoting of folder level statistics
3 Folder level statistics
4 Clone function list in selected folder
5 Clone function filters
6 Sorting by bug or refactoring potential
7 Tagging
1 2 3 4 5 6
7
1 Block correspondence
2 Block types
3 Block navigation
4 Copying
5 Bug filing
6 Tagging
1
2
3
4
1
6
5
How to navigate through the large number of detected clones
How to quickly review a pair of clones
Collaboration
bull Collaboration models
bull Communication
bull Champion in product teams
bull Getting engineering support
ICSE 2013 SEIP 50
Collaboration models
ICSE 2013 SEIP 51
Pull
Push
Join
Communication ndash getting connected
bull Reaching-out to practitioners
bull Understanding their business
bull Speaking practitionersrsquo languages
bull Finding out their pain pointsbull Understanding their scenarios
bull Experiencing their pain
bull Articulating their problems
ICSE 2013 SEIP 52
Communication ndash forming partnership
bull Finding and defining shared goals
bull Setting the right expectation
bull Building a roadmap
bull Forming virtual team (creating an email alias)
bull Adopting a milestone approach
bull Conducting regular sync-up
ICSE 2013 SEIP 53
Example project ndash XIAO
bull Tons of papers published in the past 10 years
bull 6 years of International Workshop on Software Clones (IWSC) since 2006
bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)
bull Duplication Redundancy and Similarity in Software (2006)
bull No code clone analysis tools in MS
bull No product offering
ICSE 2013 SEIP 54
Source httpwwwdagstuhlde12071
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Motivation
bull Copy-and-paste is a common developer behavior
bull A real tool widely adopted internally and externally
ICSE 2013 SEIP 55
Reaching out (1)
bull Demonstrating XIAO at TechFest
bull Posting XIAO at internal website
bull Active ldquosellingrdquo to various teams
bull What we gainedbull Opportunities to run XIAO on different codebases and
produce rich results
bull Feedback to improve both algorithm and system
bull Expanded network
ICSE 2013 SEIP 56
Reaching out (2)
bull What did not land well internallybull Wide interest but no concrete takers
bull Why no takersbull What exactly is the valuable proposition
bull Long way to go from code clones to bugs
bull High cost for code refactoring
bull Product prioritization
bull Lessons learnedbull Killer scenarios needed for value proposition
bull Security is a big stick
ICSE 2013 SEIP 57
Potential 0day vulnerability disclosure
ICSE 2013 SEIP 58
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC
bull Search scenario vs detection scenariobull Code snippet as input
bull Much larger scale of codebases
bull Near-real-time response
bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases
bull Deployed in used by and transferred to MSRC
bull Champion in MSRC worked with us all the waybull Providing feedback and update
bull Prompting within MSRC
ICSE 2013 SEIP 59
Microsoft Security Response Center
Vulnerability investigation workflow
ICSE 2013 SEIP 60
DesignImplementTest fix
Variants finding
Root cause investigation amp
source location
Issue reproducing
Team A
MSRC
Manual amp ad hoc investigation
Code snippet
Team B
Team C
Code clones
Vulnerability investigation workflow
ICSE 2013 SEIP 61
Clone search service
Completeness is the key Web service API for
automation
Code snippet
Code clones
Automated Investigation
Code snippet
Code clones
DesignImplementTest fix
Variants finding
Root cause investigation amp
source location
Issue reproducing
More secure Microsoft products
ICSE 2013 SEIP 62
Automated laborious manual efforts
Faster response time critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified
and addressed
Example ndash MS security bulletin MS12-034Combined security update for Microsoft Office Windows NET Framework and Silverlight published Tuesday May 08 2012
3 publicly disclosed vulnerabilities and seven privately reported involved Specifically one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32ksys
Cloned copy in gdiplusdll ogldll (office) Silverlight and Windows Journal viewer
Microsoft Security Research amp Defense Blog about this bulletin
ldquoHowever we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base To that end we have been working with Microsoft Research to develop a ldquoCloned Code Detectionrdquo system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034rdquo
ICSE 2013 SEIP 63httpblogstechnetcombsrdarchive20120508ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surfaceaspx
Transfer to Visual Studio (1)
bull Unsuccessful effortsbull Out-Of-Band (OOB) releasebull Power Tool
bull Two reorgs in Visual Studio
bull Lessons learnedbull No integration story felt like a ldquoseparaterdquo toolbull Not on the release path of VS
bull Accumulated assetsbull Solidified algorithm and systembull Trusted partners
bull One program manager in VSbull MSRA Innovation Engineering Group
ICSE 2013 SEIP 64
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching for similar snippets to fix a bug once
Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Applicability
• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199
Usability
• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin&Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445
Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin&Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445
Cost-Benefit Analysis
• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real-world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: Device driver verification
    • MSR SAGE: Security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah&Pugh ISSTA’10]
http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
Research Topics: Software Development Process, Software Systems, Software Users
Technology Pillars: Information Visualization, Analysis Algorithms, Large-scale Computing
Diagram labels: Vertical, Horizontal
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools
Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …
What does “getting real” mean?
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  Prototype development, Early adoption, Technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
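To make “token-based” concrete, the following is a minimal sketch of how a statement might be reduced to a normalized token sequence so that renamed variables still match. The regular expression, the keyword subset, and the `ID`/`NUM` placeholders are illustrative assumptions, not XIAO's actual tokenizer.

```python
import re

# Assumed token pattern: identifiers, integers, a few two-character operators, then any other symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[<>=!]=|\S")
KEYWORDS = {"for", "if", "else", "while", "return"}  # illustrative subset only


def normalize(statement: str) -> list[str]:
    """Replace identifiers with ID and integer literals with NUM; keep keywords and symbols."""
    tokens = []
    for tok in TOKEN_RE.findall(statement):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


# Two statements that differ only in names normalize to the same token sequence.
print(normalize("c = foo(a, b);") == normalize("z = bar(x, y);"))  # True
```

Normalizing names away is what lets a detector treat copied-and-renamed code as a clone; how aggressively to normalize is a design choice this sketch does not try to settle.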
Scalability
• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
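The four step names can be read as a filter-then-refine pipeline. The sketch below is one hypothetical reading, not the XIAO implementation: what each step does, the thresholds, and the use of Python's `difflib.SequenceMatcher` for fine matching are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source: str) -> list[str]:
    """Step 1: one whitespace-free string per non-empty line (a stand-in for real parsing)."""
    return ["".join(line.split()) for line in source.splitlines() if line.strip()]


def coarse_match(a: list[str], b: list[str], min_shared: int = 2) -> bool:
    """Step 2: cheap filter that keeps pairs sharing enough identical statements."""
    return len(set(a) & set(b)) >= min_shared


def fine_match(a: list[str], b: list[str]) -> float:
    """Step 3: order-aware similarity of the two statement sequences, in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def detect(functions: dict[str, str], threshold: float = 0.6, min_statements: int = 3):
    """Run the four steps over every pair of functions and return the surviving clone pairs."""
    statements = {name: preprocess(src) for name, src in functions.items()}
    clones = []
    for x, y in combinations(sorted(statements), 2):
        if not coarse_match(statements[x], statements[y]):
            continue
        score = fine_match(statements[x], statements[y])
        # Step 4: prune weak matches and functions too short to be interesting.
        shortest = min(len(statements[x]), len(statements[y]))
        if score >= threshold and shortest >= min_statements:
            clones.append((x, y, score))
    return clones
```

Because each pair is scored independently, the pair loop could in principle be split by source partition and handed to separate workers, which is one way to read the “easily parallelizable” bullet.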
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i ++) {
    a ++;
    b ++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i ++) {
    c = foo(a, b);
    a ++;
    b ++;
    d = bar(a, b, c);
    e = a + d;
    e ++;
}
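The two snippets above can be used to illustrate what a tunable, statement-level comparison might report. This is a minimal sketch under stated assumptions: the greedy order-insensitive pairing, the `min_statement_similarity` knob, and the use of `difflib.SequenceMatcher` are hypothetical choices, not the metric defined in the XIAO paper.

```python
from difflib import SequenceMatcher


def statement_similarity(s: str, t: str) -> float:
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s.split(), t.split(), autojunk=False).ratio()


def compare(a: list[str], b: list[str], min_statement_similarity: float = 0.6) -> dict[str, int]:
    """Greedily pair statements regardless of order and count each kind of difference."""
    remaining = list(b)
    identical = modified = 0
    for s in a:
        if not remaining:
            break
        best = max(remaining, key=lambda t: statement_similarity(s, t))
        score = statement_similarity(s, best)
        if score < min_statement_similarity:
            continue  # no counterpart in b: s counts as deleted
        remaining.remove(best)
        if score == 1.0:
            identical += 1
        else:
            modified += 1
    matched = identical + modified
    return {
        "identical": identical,
        "modified": modified,
        "deleted": len(a) - matched,
        "inserted": len(b) - matched,
    }


# The slide's two snippets, one statement per entry, tokens separated by spaces.
first = ["for ( i = 0 ; i < n ; i ++ )", "a ++", "b ++",
         "c = foo ( a , b )", "d = bar ( a , b , c )", "e = a + c"]
second = ["for ( i = 0 ; i < n ; i ++ )", "c = foo ( a , b )", "a ++", "b ++",
          "d = bar ( a , b , c )", "e = a + d", "e ++"]

print(compare(first, second))
# {'identical': 5, 'modified': 1, 'deleted': 0, 'inserted': 1}
print(compare(first, second, min_statement_similarity=0.9))
# {'identical': 5, 'modified': 0, 'deleted': 1, 'inserted': 2}
```

Raising the threshold turns the near-match `e = a + c` / `e = a + d` from a modification into a deletion plus an insertion, which is the kind of effect a user-facing tuning knob would expose. Ignoring statement order entirely, as this sketch does, sits at one extreme of the structure-versus-disorder balance the slide mentions.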
Explorability
How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Initial vulnerability reported in product A
Security bulletin released
Similar vulnerability found in product B by attackers
Potential 0day attack
Patch release of product B
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and update
  • Promoting within MSRC
MSRC: Microsoft Security Response Center
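The search scenario differs from batch detection in that a code snippet is the query and the codebases are indexed ahead of time. The sketch below shows one simple way such a lookup could work, using an inverted index over whitespace-normalized statements. The `CloneIndex` class, its methods, and the `productA`/`productB` identifiers are hypothetical; nothing here reflects the design of the service deployed at MSRC.

```python
from collections import defaultdict


class CloneIndex:
    """Toy inverted index from normalized statements to the functions containing them."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _statements(code: str) -> set[str]:
        # Whitespace-insensitive statements, one per non-empty line (a simplification).
        return {"".join(line.split()) for line in code.splitlines() if line.strip()}

    def add(self, function_id: str, code: str) -> None:
        for statement in self._statements(code):
            self._postings[statement].add(function_id)

    def search(self, snippet: str, min_overlap: float = 0.5) -> list[tuple[str, float]]:
        """Return (function_id, fraction of snippet statements found), best match first."""
        query = self._statements(snippet)
        hits: dict[str, int] = defaultdict(int)
        for statement in query:
            for function_id in self._postings.get(statement, ()):
                hits[function_id] += 1
        ranked = [(fid, n / len(query)) for fid, n in hits.items() if n / len(query) >= min_overlap]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = CloneIndex()
index.add("productA/parse", "len = read();\nbuf[len] = 0;\ncopy(buf);")
index.add("productB/parse", "len = read();\nbuf[len] = 0;\nlog(len);\ncopy(buf);")
index.add("productC/main", "return 0;")
print(index.search("len = read();\nbuf[len] = 0;"))
# [('productA/parse', 1.0), ('productB/parse', 1.0)]
```

Indexing once and answering each query from the postings lists is what makes a near-real-time response plausible at large scale; an exact-statement index like this one would miss modified statements, which a real service would need to handle.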
Vulnerability investigation workflow
Workflow steps: Issue reproducing, Root cause investigation & source location, Variants finding, Design/Implement/Test fix
Manual & ad hoc investigation
Diagram labels: MSRC, Code snippet, Team A, Team B, Team C, Code clones
Vulnerability investigation workflow
Workflow steps: Issue reproducing, Root cause investigation & source location, Variants finding, Design/Implement/Test fix
Automated investigation
Clone search service
• Completeness is the key
• Web service API for automation
Diagram labels: Code snippet, Code clones
More secure Microsoft products
Automated laborious manual efforts
Faster response time critical in security context
Are Automated Debugging [Research] Techniques Actually Helping Programmers
bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers
ldquo
rdquo[ParninampOrso ISSTArsquo11]
httpdlacmorgcitationcfmid=2001445
Cost-Benefit Analysisbull Academia
bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)
bull Real-worldbull Consider many dimensions of measurement
bull Cost eg human cost (inspecting false positives)
bull Benefit eg bug severity
bull Killer apps eg bull MSR SLAM Device driver verification
bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]
bull PatternInsightMSRA XIAO Known-bug detection
bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]
httpresearchmicrosoftcomen-usprojectsslam
httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738
Industry-Academia Collaboration
• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research topics (vertical): Software Development Process, Software Systems, Software Users
• Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al., ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al., ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
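The four steps named on the Scalability slide can be sketched as a small pipeline. This is a toy illustration, not XIAO's implementation: the identifier normalization, the shared-statement index used for coarse matching, the Jaccard score used for fine matching, and the function names are all assumptions made for the example. Only the pre-processing step is run per partition here, as one simple way to show how partitioning the source code allows parallel work.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def preprocess(partition):
    """Step 1: split each function body into statements and replace every
    identifier with the placeholder ID so that renamed copies look alike."""
    normalized = {}
    for name, body in partition.items():
        statements = [s.strip() for s in body.split(";") if s.strip()]
        normalized[name] = [
            re.sub(r"[A-Za-z_]\w*", "ID", s).replace(" ", "") for s in statements
        ]
    return normalized


def coarse_match(functions):
    """Step 2: cheap candidate generation. Two functions become a candidate
    pair if they share at least one normalized statement."""
    index = defaultdict(set)
    for name, statements in functions.items():
        for s in statements:
            index[s].add(name)
    candidates = set()
    for names in index.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates


def fine_match(a, b):
    """Step 3: a more precise score, here the Jaccard overlap of statements."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def find_clones(partitions, threshold=0.5):
    # Pre-processing is independent per partition, so it can run in parallel.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(preprocess, partitions))
    functions = {}
    for part in parts:
        functions.update(part)
    scored = {
        (x, y): fine_match(functions[x], functions[y])
        for x, y in coarse_match(functions)
    }
    # Step 4: pruning drops candidate pairs below the similarity threshold.
    return {pair: score for pair, score in scored.items() if score >= threshold}


partitions = [
    {"f": "x = read(); y = x + 1; write(y)",
     "g": "a = read(); b = a + 1; write(b)"},
    {"h": "return 0"},
]
print(find_clones(partitions))
```

The point of the coarse step is that the expensive comparison only runs on pairs that already share something, which is what keeps the fine step affordable on large codebases.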
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
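A minimal sketch of what a tunable similarity metric could look like, using the two loop bodies from the slide. It is not XIAO's metric: the token-level statement similarity (via difflib), the greedy unordered matching, and the parameter names stmt_threshold and order_weight are assumptions introduced for illustration. The two knobs correspond loosely to "statement similarity" and "balance between code structure and disordered statements".

```python
import re
from difflib import SequenceMatcher


def tokenize(statement):
    """Split a statement into identifier/number tokens and single symbols."""
    return re.findall(r"\w+|[^\w\s]", statement)


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, tokenize(s1), tokenize(s2)).ratio()


def snippet_similarity(a, b, stmt_threshold=0.8, order_weight=0.5):
    """Similarity of two statement lists.

    stmt_threshold: how alike two statements must be to count as a match.
    order_weight:   1.0 counts only matches that keep statement order,
                    0.0 ignores order entirely.
    """
    total = len(a) + len(b)
    if total == 0:
        return 1.0

    def same(s1, s2):
        return statement_similarity(s1, s2) >= stmt_threshold

    # Order-insensitive count: greedy one-to-one matching of statements.
    unused = list(b)
    unordered = 0
    for s in a:
        for t in unused:
            if same(s, t):
                unused.remove(t)
                unordered += 1
                break

    # Order-sensitive count: longest common subsequence under `same`.
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, s in enumerate(a, 1):
        for j, t in enumerate(b, 1):
            if same(s, t):
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    ordered = lcs[len(a)][len(b)]

    blended = order_weight * ordered + (1 - order_weight) * unordered
    return 2 * blended / total


first = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
second = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]

print(snippet_similarity(first, second, order_weight=0.0))
print(snippet_similarity(first, second, order_weight=1.0))
print(snippet_similarity(first, second, stmt_threshold=1.0))
```

Lowering order_weight forgives the moved statement, and lowering stmt_threshold lets the modified last assignment count as a match, so each knob controls one kind of syntactic difference.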
Explorability
How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
Pull
Push
Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
[Figure: timeline: initial vulnerability reported in product A; security bulletin released; similar vulnerability found in product B by attackers; potential 0day attack; patch release of product B]
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
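The difference between the search scenario and the detection scenario can be sketched with a small index. This is an assumption-laden toy, not the service transferred to MSRC: the class name CloneSearchIndex, the whitespace-only normalization, the min_overlap parameter, and the function identifiers in the example are all hypothetical. It shows the general idea that codebases are indexed once, after which a snippet query is a set of cheap lookups rather than a full re-analysis.

```python
from collections import Counter, defaultdict


class CloneSearchIndex:
    """Inverted index from normalized statements to the functions containing them."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _normalize(code):
        # One statement per line; whitespace is removed so formatting is ignored.
        return {"".join(line.split()) for line in code.splitlines() if line.strip()}

    def add(self, function_id, code):
        """Index one function. Done once, ahead of any query."""
        for statement in self._normalize(code):
            self._postings[statement].add(function_id)

    def search(self, snippet, min_overlap=0.6):
        """Return (function_id, overlap) pairs covering at least min_overlap
        of the query's statements, best match first."""
        query = self._normalize(snippet)
        if not query:
            return []
        hits = Counter()
        for statement in query:
            for function_id in self._postings.get(statement, ()):
                hits[function_id] += 1
        ranked = [(fid, n / len(query)) for fid, n in hits.items()]
        ranked = [(fid, score) for fid, score in ranked if score >= min_overlap]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = CloneSearchIndex()
index.add("libA:parse_font", "len = read_len(buf)\ncopy(dst, buf, len)\nreturn dst")
index.add("libB:load_glyph", "len = read_len(buf)\ncopy(dst, buf, len)\nlog(len)\nreturn dst")
index.add("libC:unrelated", "x = 1\nreturn x")

vulnerable_snippet = "len = read_len(buf)\ncopy(dst,  buf,  len)"
print(index.search(vulnerable_snippet))
```

In a security setting missing a copy is costly, so a real service would likely favor recall (a low min_overlap and fuzzier statement matching) and leave the final judgment to the investigator.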
Vulnerability investigation workflow
[Figure: workflow stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Variants finding is a manual & ad hoc investigation, with code snippets and code clones exchanged between MSRC and Teams A, B, and C]
Vulnerability investigation workflow
[Figure: the same workflow stages, with variants finding done by automated investigation: a code snippet goes to the clone search service, which returns code clones]
Completeness is the key
Web service API for automation
More secure Microsoft products
Code clone search service integrated into vulnerability investigation process of MSRC
Automated laborious manual efforts
Faster response time, critical in security context
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
Three publicly disclosed and seven privately reported vulnerabilities involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching for similar snippets to fix a bug once
Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Industry Academia Collaboration
bullAcademia (research recognitions eg papers) vs Industry (company revenues)
bullAcademia (research innovations) vs Industry (likely involving engineering efforts)
bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)
bull Industry problems infrastructures data evaluation testbeds hellip
bull Academia educating students hellip
MSRA Software Analytics Group
Mission
Utilize data-driven approach to help create highly
performing user friendly and efficiently developed
and operated software and services
Founded
May 2009
Group members
12
httpresearchmicrosoftcomen-usgroupssa
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Software Analytics
Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services
Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf
Research topics amp technology pillars
Microsoft Confidential
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Research Topics
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
[Timeline figure. Labels: Initial vulnerability reported in product A; Patch release of product B; Potential 0day attack; Security bulletin released; Similar vulnerability found in product B by attackers]
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
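The search scenario (a code snippet as input, near-real-time answers over a pre-built index) can be sketched as follows. This toy Python index is an assumption for illustration only: the class name `CloneSearchIndex`, the normalization, the coverage score, and the file names in the example are hypothetical and say nothing about how the deployed service is built.

```python
import re
from collections import Counter, defaultdict


def normalize(line):
    """Re-space tokens so formatting differences do not block a match."""
    return " ".join(re.findall(r"\w+|[^\w\s]", line))


class CloneSearchIndex:
    """Toy inverted index from normalized statements to code locations."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, location, source):
        """Index every non-empty statement of one function (done offline)."""
        for line in source.splitlines():
            key = normalize(line)
            if key:
                self._postings[key].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best match first.

        Coverage is the share of the query's statements found at a location.
        """
        query = {normalize(line) for line in snippet.splitlines()} - {""}
        hits = Counter()
        for key in query:
            for location in self._postings.get(key, ()):
                hits[location] += 1
        results = [(loc, n / len(query)) for loc, n in hits.items()
                   if n / len(query) >= min_coverage]
        return sorted(results, key=lambda r: (-r[1], r[0]))


if __name__ == "__main__":
    index = CloneSearchIndex()
    # Made-up locations and code, standing in for several indexed codebases.
    index.add("productA/font.c:parse",
              "len = read(buf);\ncopy(dst, buf, len);\nreturn dst;")
    index.add("productB/glyph.c:load",
              "len=read( buf );\ncopy(dst, buf, len);\nlog(len);")
    index.add("productC/util.c:sum", "total = a + b;\nreturn total;")
    print(index.search("len = read(buf);\ncopy(dst, buf, len);"))
```

Building the index once and answering each query with dictionary lookups is what keeps the response time short; a lower `min_coverage` trades precision for the completeness that the security scenario calls for.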
Vulnerability investigation workflow
[Workflow figure, manual & ad hoc investigation. Stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Other labels: MSRC; Team A; Team B; Team C; Code snippet; Code clones]
Vulnerability investigation workflow
[Workflow figure, automated investigation. Stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Other labels: Clone search service; Code snippet; Code clones; Completeness is the key; Web service API for automation]
More secure Microsoft products
• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once, everywhere
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
MSRA Software Analytics Group
Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
Research topics & technology pillars
• Research topics (vertical): Software Development Process; Software Systems; Software Users
• Technology pillars (horizontal): Information Visualization; Analysis Algorithms; Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real: real data, real problems, real users, real tools
Creating real impact
Code Clone Analysis [Dang et al., ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al., ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
ICSE 2013 SEIP 49
1 Clone navigation based on source tree hierarchy
2 Pivoting of folder level statistics
3 Folder level statistics
4 Clone function list in selected folder
5 Clone function filters
6 Sorting by bug or refactoring potential
7 Tagging
1 2 3 4 5 6
7
1 Block correspondence
2 Block types
3 Block navigation
4 Copying
5 Bug filing
6 Tagging
1
2
3
4
1
6
5
How to navigate through the large number of detected clones
How to quickly review a pair of clones
Collaboration
bull Collaboration models
bull Communication
bull Champion in product teams
bull Getting engineering support
ICSE 2013 SEIP 50
Collaboration models
ICSE 2013 SEIP 51
Pull
Push
Join
Communication ndash getting connected
bull Reaching-out to practitioners
bull Understanding their business
bull Speaking practitionersrsquo languages
bull Finding out their pain pointsbull Understanding their scenarios
bull Experiencing their pain
bull Articulating their problems
ICSE 2013 SEIP 52
Communication ndash forming partnership
bull Finding and defining shared goals
bull Setting the right expectation
bull Building a roadmap
bull Forming virtual team (creating an email alias)
bull Adopting a milestone approach
bull Conducting regular sync-up
ICSE 2013 SEIP 53
Example project ndash XIAO
bull Tons of papers published in the past 10 years
bull 6 years of International Workshop on Software Clones (IWSC) since 2006
bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)
bull Duplication Redundancy and Similarity in Software (2006)
bull No code clone analysis tools in MS
bull No product offering
ICSE 2013 SEIP 54
Source httpwwwdagstuhlde12071
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Motivation
bull Copy-and-paste is a common developer behavior
bull A real tool widely adopted internally and externally
ICSE 2013 SEIP 55
Reaching out (1)
bull Demonstrating XIAO at TechFest
bull Posting XIAO at internal website
bull Active ldquosellingrdquo to various teams
bull What we gainedbull Opportunities to run XIAO on different codebases and
produce rich results
bull Feedback to improve both algorithm and system
bull Expanded network
ICSE 2013 SEIP 56
Reaching out (2)
bull What did not land well internallybull Wide interest but no concrete takers
bull Why no takersbull What exactly is the valuable proposition
bull Long way to go from code clones to bugs
bull High cost for code refactoring
bull Product prioritization
bull Lessons learnedbull Killer scenarios needed for value proposition
bull Security is a big stick
ICSE 2013 SEIP 57
Potential 0day vulnerability disclosure
ICSE 2013 SEIP 58
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC
bull Search scenario vs detection scenariobull Code snippet as input
bull Much larger scale of codebases
bull Near-real-time response
bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases
bull Deployed in used by and transferred to MSRC
bull Champion in MSRC worked with us all the waybull Providing feedback and update
bull Prompting within MSRC
ICSE 2013 SEIP 59
Microsoft Security Response Center
Vulnerability investigation workflow
ICSE 2013 SEIP 60
DesignImplementTest fix
Variants finding
Root cause investigation amp
source location
Issue reproducing
Team A
MSRC
Manual amp ad hoc investigation
Code snippet
Team B
Team C
Code clones
Vulnerability investigation workflow
ICSE 2013 SEIP 61
Clone search service
Completeness is the key Web service API for
automation
Code snippet
Code clones
Automated Investigation
Code snippet
Code clones
DesignImplementTest fix
Variants finding
Root cause investigation amp
source location
Issue reproducing
More secure Microsoft products
ICSE 2013 SEIP 62
Automated laborious manual efforts
Faster response time critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified
and addressed
Example ndash MS security bulletin MS12-034Combined security update for Microsoft Office Windows NET Framework and Silverlight published Tuesday May 08 2012
3 publicly disclosed vulnerabilities and seven privately reported involved Specifically one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32ksys
Cloned copy in gdiplusdll ogldll (office) Silverlight and Windows Journal viewer
Microsoft Security Research amp Defense Blog about this bulletin
ldquoHowever we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base To that end we have been working with Microsoft Research to develop a ldquoCloned Code Detectionrdquo system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034rdquo
ICSE 2013 SEIP 63httpblogstechnetcombsrdarchive20120508ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surfaceaspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community
• Searching for similar snippets so that a bug is fixed once
• Finding refactoring opportunities

Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Research topics & technology pillars
• Research topics (vertical): Software Development Process, Software Systems, Software Users
• Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing

Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact
Code Clone Analysis [Dang et al., ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
• http://research.microsoft.com/jump/175199
StackMine [Han et al., ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
• http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice

Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile

Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability
• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition

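The four step names on the slide can be turned into a small pipeline to show where partition-based parallelism fits. This is only a sketch under stated assumptions, not XIAO's implementation: the slide gives the step names but not their contents, so the statement-set overlap used for coarse and fine matching, the thresholds, the `detect_clones` function, and the function names `f`, `g`, and `h` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def preprocess(function_source):
    """Step 1 (assumed): normalize a function body into whitespace-free statements."""
    return ["".join(line.split()) for line in function_source.splitlines() if line.strip()]


def coarse_match(functions, min_shared=2):
    """Step 2 (assumed): cheap filter that keeps pairs sharing enough identical statements."""
    keys = {name: set(stmts) for name, stmts in functions.items()}
    return [(x, y) for x, y in combinations(sorted(functions), 2)
            if len(keys[x] & keys[y]) >= min_shared]


def fine_match(functions, pair):
    """Step 3 (assumed): score a candidate pair by Jaccard overlap of its statements."""
    x, y = (set(functions[name]) for name in pair)
    return pair, len(x & y) / len(x | y)


def prune(scored, threshold):
    """Step 4 (assumed): drop pairs scoring below the reporting threshold."""
    return [(pair, score) for pair, score in scored if score >= threshold]


def detect_clones(sources, threshold=0.5):
    names = sorted(sources)
    # Each function (or source partition) is processed independently, so the
    # work can be spread over workers. Threads stand in here for the processes
    # or machines a production system would use.
    with ThreadPoolExecutor() as pool:
        functions = dict(zip(names, pool.map(preprocess, (sources[n] for n in names))))
        candidates = coarse_match(functions)
        scored = list(pool.map(lambda pair: fine_match(functions, pair), candidates))
    return prune(scored, threshold)


# Hypothetical input: three tiny "functions" keyed by name.
SOURCES = {
    "f": "a++\nb++\nc = foo(a, b)\n",
    "g": "c = foo(a, b)\na++\nb++\ne++\n",
    "h": "return 0\n",
}

print(detect_clones(SOURCES))
```

The point of the split is cost: the coarse step is cheap and discards most pairs, so the more expensive fine step runs only on the surviving candidates.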
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements
Snippet 1:
    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }
Snippet 2:
    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }

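To make the tuning idea concrete, the two snippets from the slide can be compared with a toy metric. The sketch below is an assumption for illustration, not the metric defined in the XIAO paper: it scores statements by token overlap using Python's `difflib`, pairs them regardless of order, and exposes one hypothetical knob, `min_statement_similarity`.

```python
import difflib
import re


def tokenize(statement):
    """Split one statement into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", statement)


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, between 0 and 1."""
    return difflib.SequenceMatcher(None, tokenize(s1), tokenize(s2)).ratio()


def snippet_similarity(a, b, min_statement_similarity=0.8):
    """Share of statements that find a similar enough partner, ignoring order."""
    if not a and not b:
        return 1.0
    unmatched = list(b)
    matched = 0
    for stmt in a:
        # Greedily pick the closest statement that has not been paired yet.
        best = max(unmatched, key=lambda other: statement_similarity(stmt, other),
                   default=None)
        if best is not None and statement_similarity(stmt, best) >= min_statement_similarity:
            matched += 1
            unmatched.remove(best)
    return 2 * matched / (len(a) + len(b))


# The two snippets from the slide, one statement per entry.
SNIPPET_1 = ["for (i = 0; i < n; i++)", "a++", "b++",
             "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
SNIPPET_2 = ["for (i = 0; i < n; i++)", "c = foo(a, b)", "a++", "b++",
             "d = bar(a, b, c)", "e = a + d", "e++"]

for threshold in (0.8, 0.9):
    print(threshold, round(snippet_similarity(SNIPPET_1, SNIPPET_2, threshold), 3))
```

At a threshold of 0.8 the modified pair `e = a + c` / `e = a + d` still counts as a match; at 0.9 it does not, so the overall score drops. The reordered statements match either way because this toy metric ignores order entirely, whereas the slide describes a tunable balance between code structure and disordered statements.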
Explorability
How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models
• Pull
• Push
• Join

Communication – getting connected
• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure
[Timeline diagram. Labels: Initial vulnerability reported in product A; Patch release of product B; Potential 0day attack; Security bulletin released; Similar vulnerability found in product B by attackers]

Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC

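The search scenario differs from detection in that the codebases are indexed once and each query is a single snippet, which is what makes a near-real-time response feasible. A minimal sketch of that idea follows, assuming a simple inverted index over normalized statements. The slides do not describe the service's internals, so the `CloneIndex` class, its methods, the `min_coverage` parameter, and the product and function names are all hypothetical.

```python
from collections import defaultdict


def normalize(statement):
    """Remove whitespace so formatting differences do not block a match."""
    return "".join(statement.split())


class CloneIndex:
    """Hypothetical inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, codebase, function, statements):
        """Index one function once, ahead of any query."""
        for statement in statements:
            self._postings[normalize(statement)].add((codebase, function))

    def search(self, snippet, min_coverage=0.6):
        """Return locations holding at least min_coverage of the snippet's statements."""
        query = {normalize(s) for s in snippet}
        hits = defaultdict(int)
        for statement in query:
            for location in self._postings.get(statement, ()):
                hits[location] += 1
        results = [(location, count / len(query)) for location, count in hits.items()
                   if count / len(query) >= min_coverage]
        # Best coverage first; ties are broken by location for stable output.
        return sorted(results, key=lambda item: (-item[1], item[0]))


# Hypothetical codebases and functions.
index = CloneIndex()
index.add("product_a", "parse_record",
          ["len = read(buf)", "copy(dst, buf, len)", "return dst"])
index.add("product_b", "load_table",
          ["len = read(buf)", "copy(dst, buf, len)", "log(len)"])
index.add("product_c", "render", ["draw(x, y)"])

RESULTS = index.search(["len = read(buf)", "copy(dst, buf, len)", "return dst"])
for location, coverage in RESULTS:
    print(location, round(coverage, 2))
```

A query touches only the posting lists of its own statements, so its cost depends on the snippet and its matches rather than on the total size of the indexed code.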
Vulnerability investigation workflow
[Diagram. Workflow stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Variants finding is a manual & ad hoc investigation, with code snippets and code clones exchanged between MSRC and Team A, Team B, and Team C]

Vulnerability investigation workflow
Clone search service
• Completeness is the key
• Web service API for automation
[Diagram. Same workflow stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Variants finding is now an automated investigation, with code snippets and code clones exchanged with the clone search service]

Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Research topics amp technology pillars
Microsoft Confidential
Software Development
Process
Software Systems
Software Users
Information Visualization
Analysis Algorithms
Large-scale Computing
Research Topics Technology Pillars
Vertical
Horizontal
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
ICSE 2013 SEIP 49
1 Clone navigation based on source tree hierarchy
2 Pivoting of folder level statistics
3 Folder level statistics
4 Clone function list in selected folder
5 Clone function filters
6 Sorting by bug or refactoring potential
7 Tagging
1 2 3 4 5 6
7
1 Block correspondence
2 Block types
3 Block navigation
4 Copying
5 Bug filing
6 Tagging
1
2
3
4
1
6
5
How to navigate through the large number of detected clones
How to quickly review a pair of clones
Collaboration
bull Collaboration models
bull Communication
bull Champion in product teams
bull Getting engineering support
ICSE 2013 SEIP 50
Collaboration models
ICSE 2013 SEIP 51
Pull
Push
Join
Communication ndash getting connected
bull Reaching-out to practitioners
bull Understanding their business
bull Speaking practitionersrsquo languages
bull Finding out their pain pointsbull Understanding their scenarios
bull Experiencing their pain
bull Articulating their problems
ICSE 2013 SEIP 52
Communication ndash forming partnership
bull Finding and defining shared goals
bull Setting the right expectation
bull Building a roadmap
bull Forming virtual team (creating an email alias)
bull Adopting a milestone approach
bull Conducting regular sync-up
ICSE 2013 SEIP 53
Example project ndash XIAO
bull Tons of papers published in the past 10 years
bull 6 years of International Workshop on Software Clones (IWSC) since 2006
bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)
bull Duplication Redundancy and Similarity in Software (2006)
bull No code clone analysis tools in MS
bull No product offering
ICSE 2013 SEIP 54
Source httpwwwdagstuhlde12071
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Motivation
bull Copy-and-paste is a common developer behavior
bull A real tool widely adopted internally and externally
ICSE 2013 SEIP 55
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
[Figure: timeline] Initial vulnerability reported in product A → Security bulletin released → Similar vulnerability found in product B by attackers → Potential 0-day attack → Patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
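To make the search scenario concrete, here is a minimal sketch of a snippet-in, locations-out lookup. It is not XIAO's implementation or the MSRC service: the class `CloneSearchIndex`, the `min_coverage` parameter, and the exact-statement matching are all assumptions chosen for illustration. The idea shown is only that indexing once makes each later query cheap, which is what a near-real-time search over large codebases needs.

```python
from collections import Counter, defaultdict


def normalize(code):
    """Strip all whitespace from each non-empty line so formatting does not matter."""
    return ["".join(line.split()) for line in code.splitlines() if line.strip()]


class CloneSearchIndex:
    """Toy inverted index (hypothetical): normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, codebase, function, code):
        """Index every statement of one function under its (codebase, function) location."""
        for statement in normalize(code):
            self._postings[statement].add((codebase, function))

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best match first.

        Coverage is the fraction of distinct query statements found at a location.
        """
        query = set(normalize(snippet))
        if not query:
            return []
        hits = Counter()
        for statement in query:
            for location in self._postings.get(statement, ()):
                hits[location] += 1
        results = [
            (location, count / len(query))
            for location, count in hits.items()
            if count / len(query) >= min_coverage
        ]
        return sorted(results, key=lambda item: (-item[1], item[0]))
```

A real service would need fuzzy statement matching and a far more compact index; this sketch only shows why a code snippet can serve as the query.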
Vulnerability investigation workflow
[Figure: workflow diagram]
• Steps: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix
• Manual & ad hoc investigation (MSRC with Team A, Team B, Team C): code snippet → code clones
Vulnerability investigation workflow
[Figure: workflow diagram]
• Steps: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix
• Automated investigation with the clone search service: code snippet → code clones
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in a security context
• Code clone search service integrated into the vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Example: MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved; specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
• Microsoft Security Research & Defense Blog about this bulletin:
  "However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a 'Cloned Code Detection' system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
Source: http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching similar snippets for fixing a bug once
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Research topics & technology pillars
• Research topics (vertical): Software Development Process, Software Systems, Software Users
• Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools
Creating real impact
Code Clone Analysis [Dang et al., ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al., ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty...
• Data is incomplete and noisy...
• The scale of data is huge...
• We do not have all the time in the world to compute...
• The machines are not powerful enough...
• End users are "impatient"...
• Product teams are always busy...
• Product teams do not commit before seeing everything working...
• Product teams change plans and priorities...
• Product teams speak "different languages"...
• More...
What does "getting real" mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile
Example project: XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development → Early adoption → Technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
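The four steps can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not XIAO's algorithm: the slide names the steps but not their internals, so the whitespace normalization, the shared-statement candidate filter, the Dice-style score, and the function names (`preprocess`, `coarse_match`, `fine_match`, `prune`, `detect_clones`) are all hypothetical choices. What it shows is the division of labor: a cheap coarse step narrows the candidates so that the more expensive fine step runs on few pairs.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations


def preprocess(source):
    """Step 1: split a function body into whitespace-normalized statements."""
    return [re.sub(r"\s+", "", line) for line in source.splitlines() if line.strip()]


def coarse_match(functions):
    """Step 2: cheaply collect candidate pairs that share at least one statement."""
    index = defaultdict(set)  # statement -> names of functions containing it
    for name, statements in functions.items():
        for statement in statements:
            index[statement].add(name)
    candidates = set()
    for names in index.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates


def fine_match(first, second):
    """Step 3: score a candidate pair by the share of statements in common."""
    shared = sum((Counter(first) & Counter(second)).values())
    return 2 * shared / (len(first) + len(second))


def prune(scored_pairs, threshold):
    """Step 4: drop pairs whose score falls below the threshold."""
    return sorted(pair for pair in scored_pairs if pair[2] >= threshold)


def detect_clones(sources, threshold=0.8):
    """Run the four steps over a dict mapping function names to source text."""
    functions = {name: preprocess(text) for name, text in sources.items()}
    scored = [
        (a, b, fine_match(functions[a], functions[b]))
        for a, b in coarse_match(functions)
    ]
    return prune(scored, threshold)
```

In this sketch `preprocess` handles each function independently, so partitions of a source tree could be pre-processed by separate workers before the matching steps combine them.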
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
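A minimal sketch of how such tuning knobs might behave on the loop bodies of the two snippets. This is not XIAO's actual metric: the parameters `stmt_threshold` (how alike two statements must be to count as the same) and `order_weight` (how much reordering is penalized), and the way the two scores are blended, are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def statements_match(first, second, stmt_threshold):
    """Treat two statements as the same when their text similarity is high enough."""
    return SequenceMatcher(None, first, second).ratio() >= stmt_threshold


def ordered_matches(a, b, stmt_threshold):
    """Longest common subsequence length: matches that keep their relative order."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if statements_match(x, y, stmt_threshold):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def unordered_matches(a, b, stmt_threshold):
    """Greedy one-to-one matching that ignores statement order."""
    unused = list(b)
    count = 0
    for x in a:
        for y in unused:
            if statements_match(x, y, stmt_threshold):
                unused.remove(y)
                count += 1
                break
    return count


def similarity(a, b, stmt_threshold=1.0, order_weight=0.5):
    """Blend order-sensitive and order-insensitive matching into a score in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    ordered = ordered_matches(a, b, stmt_threshold) / longest
    unordered = unordered_matches(a, b, stmt_threshold) / longest
    return order_weight * ordered + (1 - order_weight) * unordered


# Loop bodies of the two snippets, one statement per entry.
first = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
second = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]
```

Lowering `stmt_threshold` from 1.0 to 0.8 lets the modified statement `e = a + d` count as a match for `e = a + c`, and lowering `order_weight` stops penalizing the moved `c = foo(a, b)`; either change raises the score for this pair.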
Explorability
How to navigate through the large number of detected clones:
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones:
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
Connection to practice
MSR 2012 39
bull Software Analytics is naturally tied with software development practice
bull Getting real
RealData
RealProblems
RealUsers
RealTools
Creating real impact
Code Clone Analysis [Dang et al ACSACrsquo12]
bull Detecting near-duplicated code
bull Released with Visual Studio 2012
StackMine [Han et al ICSErsquo12]
bull Performance debugging in the large
via mining millions of stack traces
bull Helping improve Windows performance
httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models
• Pull
• Push
• Join

Communication: getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication: forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project: XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Timeline:
1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0day attack
5. Patch release of product B
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
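The search scenario (code snippet in, candidate locations out, near-real-time) suggests a prebuilt index rather than pairwise comparison at query time. The following is a minimal sketch of that idea only, not the service's actual design, which the slides do not describe. The class name `CloneIndex`, the statement-level inverted index, the `min_hits` cutoff, and the file paths are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalize(line):
    # Re-join tokens with single spaces so whitespace differences vanish.
    return " ".join(re.findall(r"[A-Za-z_]\w*|\d+|\S", line))


class CloneIndex:
    """Hypothetical inverted index: normalized source line -> locations."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_file(self, path, source):
        # Index every non-blank line once, ahead of any query.
        for line_no, line in enumerate(source.splitlines(), 1):
            key = normalize(line)
            if key:
                self.postings[key].add((path, line_no))

    def search(self, snippet, min_hits=2):
        # Count, per file, how many snippet lines occur in it; this is only
        # candidate retrieval, and a finer comparison would follow.
        hits = Counter()
        for line in snippet.splitlines():
            key = normalize(line)
            if not key:
                continue
            for path in {p for p, _ in self.postings.get(key, ())}:
                hits[path] += 1
        ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
        return [(path, n) for path, n in ranked if n >= min_hits]


index = CloneIndex()
index.add_file("product_a/font.c",
               "len = read_len(buf);\nif (len > MAX) return;\ncopy(dst, buf, len);\n")
index.add_file("product_b/glyph.c",
               "len=read_len( buf );\ncopy(dst,buf,len);\n")
index.add_file("product_c/misc.c", "x = 1;\n")

print(index.search("len = read_len(buf);\ncopy(dst, buf, len);"))
```

Because the index is built ahead of time, a query touches only the posting lists for the snippet's own lines, which is what makes a fast response plausible at large scale. Exact line matching is a simplification here; tolerating modified statements would need a similarity-aware lookup.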
Vulnerability investigation workflow
[Figure: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. For variants finding, MSRC sends a code snippet to Team A, Team B, and Team C and gets code clones back through manual & ad hoc investigation.]

Vulnerability investigation workflow
[Figure: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. For variants finding, a code snippet is sent to the clone search service, which returns code clones through automated investigation.]
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed
Example: MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a 'Cloned Code Detection' system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets to fix a bug once
• Finding refactoring opportunities

Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools

Creating real impact
Code Clone Analysis [Dang et al., ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al., ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…

What does "getting real" mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice

Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile
Example project: XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development → Early adoption → Technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process
  • Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
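The slide names the four steps but not their internals, so the sketch below fills each step with a deliberately simple stand-in. Shared identical statements for coarse matching, difflib sequence similarity for fine matching, and a minimum-length rule for pruning are all assumptions, as are the function names and thresholds. Only pre-processing is run per partition here, to show where the source-partition parallelism could enter.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(partition):
    # Step 1: normalize each function body into a list of statements.
    # Splitting on ';' is a crude stand-in for a real lexer.
    out = {}
    for name, body in partition.items():
        stmts = [" ".join(re.findall(r"[A-Za-z_]\w*|\d+|\S", s))
                 for s in body.split(";")]
        out[name] = [s for s in stmts if s]
    return out


def coarse_match(functions, min_shared=2):
    # Step 2: cheap candidate pairs = functions sharing identical statements.
    owners = defaultdict(set)
    for name, stmts in functions.items():
        for s in set(stmts):
            owners[s].add(name)
    shared = defaultdict(int)
    for names in owners.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, n in shared.items() if n >= min_shared]


def fine_match(functions, candidates, min_similarity=0.5):
    # Step 3: score only the candidates on their full statement sequences.
    scored = {}
    for a, b in candidates:
        ratio = SequenceMatcher(None, functions[a], functions[b]).ratio()
        if ratio >= min_similarity:
            scored[(a, b)] = ratio
    return scored


def prune(functions, scored, min_statements=3):
    # Step 4: drop clone pairs too short to be worth a reviewer's time.
    return {pair: r for pair, r in scored.items()
            if min(len(functions[pair[0]]), len(functions[pair[1]])) >= min_statements}


def detect_clones(partitions):
    # Partitions are independent in step 1, so they can be mapped in parallel.
    functions = {}
    with ThreadPoolExecutor() as pool:
        for part in pool.map(preprocess, partitions):
            functions.update(part)
    candidates = coarse_match(functions)
    return prune(functions, fine_match(functions, candidates))


partitions = [
    {"f": "a++; b++; c = foo(a, b); d = bar(a, b, c); e = a + c;"},
    {"g": "c = foo(a, b); a++; b++; d = bar(a, b, c); e = a + d; e++;",
     "h": "x = 1; y = 2;"},
]
print(detect_clones(partitions))
```

The ordering is the point of the design: the inexpensive coarse step shrinks the set of pairs before the costlier fine comparison runs, and pruning keeps the final list reviewable.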
httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx
Experience sharing
bull Getting-real mindset
bull Technical readiness
bull Collaboration
ICSE 2013 SEIP 42
Real world is not that prettyhellip
bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything
workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip
ICSE 2013 SEIP 43
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication - getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication - forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project - XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0-day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
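The search scenario on this slide (a code snippet as input, a pre-built index, a fast answer) can be sketched with a toy inverted index. This is an illustration under stated assumptions, not the service that was deployed: the class `CloneIndex`, the `min_coverage` parameter, and the file locations are all hypothetical, and statements are compared by exact text after whitespace normalization.

```python
import re
from collections import Counter, defaultdict


def statements(source):
    """Split source text into whitespace-normalized statements."""
    parts = re.split(r"[;{}\n]", source)
    return [re.sub(r"\s+", " ", p).strip() for p in parts if p.strip()]


class CloneIndex:
    """Toy inverted index: statement text -> locations that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, location, source):
        """Index one function (identified by a location string)."""
        for stmt in set(statements(source)):
            self._postings[stmt].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best coverage first.

        Coverage is the fraction of the query's distinct statements
        found at that location.
        """
        query = set(statements(snippet))
        if not query:
            return []
        hits = Counter()
        for stmt in query:
            for location in self._postings.get(stmt, ()):
                hits[location] += 1
        ranked = [(loc, n / len(query)) for loc, n in hits.items()
                  if n / len(query) >= min_coverage]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Made-up locations and code, for illustration only.
index = CloneIndex()
index.add("productA/font.c:parse", "len = read(buf); copy(dst, buf, len); check(dst);")
index.add("productB/render.c:load", "len = read(buf); copy(dst, buf, len); draw(dst);")
index.add("productC/io.c:open", "fd = open(path); check(fd);")

results = index.search("len = read(buf); copy(dst, buf, len);")

if __name__ == "__main__":
    for location, coverage in results:
        print(location, coverage)
```

The index is built once and each query only touches the posting lists for its own statements, which is one plausible way to keep response time low as the indexed codebase grows.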
Vulnerability investigation workflow
Design/Implement/Test fix
Variants finding
Root cause investigation & source location
Issue reproducing
Team A
MSRC
Manual & ad hoc investigation
Code snippet
Team B
Team C
Code clones
Vulnerability investigation workflow
Clone search service
Completeness is the key
Web service API for automation
Automated investigation
Code snippet
Code clones
Design/Implement/Test fix
Variants finding
Root cause investigation & source location
Issue reproducing
More secure Microsoft products
Automated laborious manual efforts
Faster response time, critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified and addressed
Example - MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
3 publicly disclosed vulnerabilities and seven privately reported vulnerabilities involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching for similar snippets to fix a bug once
Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration
Real world is not that pretty...
• Data is incomplete and noisy...
• The scale of data is huge...
• We do not have all the time in the world to compute...
• The machines are not powerful enough...
• End users are "impatient"...
• Product teams are always busy...
• Product teams do not commit before seeing everything working...
• Product teams change plans and priorities...
• Product teams speak "different languages"...
• More...
What does "getting real" mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile
Example project - XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Scalability
• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
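The four steps named on this slide can be sketched as a small pipeline. The slides give only the step names, so everything inside each step here is an assumption made for illustration: statements are compared as normalized text, coarse matching counts shared statements, fine matching uses Python's difflib, and pruning drops very short functions. The function names and thresholds are hypothetical.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Step 1: split into statements and normalize whitespace."""
    parts = re.split(r"[;{}\n]", source)
    return [re.sub(r"\s+", " ", p).strip() for p in parts if p.strip()]


def coarse_match(functions, min_shared=2):
    """Step 2: cheap filter; keep pairs sharing at least min_shared statements."""
    index = defaultdict(set)  # statement -> names of functions containing it
    for name, stmts in functions.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, candidates, min_similarity=0.6):
    """Step 3: order-aware comparison, run only on coarse candidates."""
    results = []
    for a, b in candidates:
        matcher = SequenceMatcher(None, functions[a], functions[b], autojunk=False)
        score = matcher.ratio()
        if score >= min_similarity:
            results.append((a, b, score))
    return results


def prune(functions, clones, min_statements=3):
    """Step 4: drop clone pairs where either side is too short to matter."""
    return [(a, b, score) for a, b, score in clones
            if min(len(functions[a]), len(functions[b])) >= min_statements]


def detect_clones(sources):
    """Run the four steps in order over a {name: source} mapping."""
    functions = {name: preprocess(src) for name, src in sources.items()}
    candidates = coarse_match(functions)
    return prune(functions, fine_match(functions, candidates))


SOURCES = {
    "f": "a++; b++; c = foo(a, b); d = bar(a, b, c);",
    "g": "c = foo(a, b); a++; b++; d = bar(a, b, c); e++;",
    "h": "return 0;",
}
clones = detect_clones(SOURCES)

if __name__ == "__main__":
    for a, b, score in clones:
        print(a, b, round(score, 3))
```

In this sketch pre-processing works on one function at a time and fine matching on one candidate pair at a time, so the input could be split by source partition and handed to separate workers, which is consistent with the parallelization point on the slide.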
What does ldquogetting realrdquo mean
ICSE 2013 SEIP 44
Making real impact
Building real technologies
Solving real problems
Software engineering is naturally tied with software development practice
Technical readiness
bull Assumptions
bull Scalability
bull Complexity
bull Usability
bull Cost-Benefit Analysis
bull Walking last mile
ICSE 2013 SEIP 45
Example project ndash XIAO
bull Token-based code clone analysis technique
bull Characteristics
bull Technology transfersbull Three-year journey fromVisual Studio 2012
bull Code clone search service within Microsoft
bull research to impact
ICSE 2013 SEIP 46
curren High tunability curren High scalability
curren High compatibility curren High explorability
Prototype development
Early adoptionTechnology
transfer
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Scalability
bull Four-step analysis process
bull Easily parallelizable based on source code partition
ICSE 2013 SEIP 47
Pre-processingCoarse
Matching
Fine MatchingPruning
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
What you tune is what you get
MSR 2012 48
bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code
snippets
bull Tunable at fine granularitybull Statement similarity
bull of inserteddeletedmodified statements
bull Balance between code structure and disordered statements
for (i = 0 i lt n i ++)
a ++
b ++
c = foo(a b)
d = bar(a b c)
e = a + c
for (i = 0 i lt n i ++)
c = foo(a b)
a ++
b ++
d = bar(a b c)
e = a + d
e ++
Explorability
ICSE 2013 SEIP 49
1 Clone navigation based on source tree hierarchy
2 Pivoting of folder level statistics
3 Folder level statistics
4 Clone function list in selected folder
5 Clone function filters
6 Sorting by bug or refactoring potential
7 Tagging
1 2 3 4 5 6
7
1 Block correspondence
2 Block types
3 Block navigation
4 Copying
5 Bug filing
6 Tagging
1
2
3
4
1
6
5
How to navigate through the large number of detected clones
How to quickly review a pair of clones
Collaboration
bull Collaboration models
bull Communication
bull Champion in product teams
bull Getting engineering support
ICSE 2013 SEIP 50
Collaboration models
ICSE 2013 SEIP 51
Pull
Push
Join
Communication ndash getting connected
bull Reaching-out to practitioners
bull Understanding their business
bull Speaking practitionersrsquo languages
bull Finding out their pain pointsbull Understanding their scenarios
bull Experiencing their pain
bull Articulating their problems
ICSE 2013 SEIP 52
Communication ndash forming partnership
bull Finding and defining shared goals
bull Setting the right expectation
bull Building a roadmap
bull Forming virtual team (creating an email alias)
bull Adopting a milestone approach
bull Conducting regular sync-up
ICSE 2013 SEIP 53
Example project ndash XIAO
bull Tons of papers published in the past 10 years
bull 6 years of International Workshop on Software Clones (IWSC) since 2006
bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)
bull Duplication Redundancy and Similarity in Software (2006)
bull No code clone analysis tools in MS
bull No product offering
ICSE 2013 SEIP 54
Source httpwwwdagstuhlde12071
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
Initial vulnerability reported in product A
Patch release of product B
Potential 0day attack
Security bulletin released
Similar vulnerability found in product B by attackers
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
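The search scenario (a code snippet as input, answered quickly against a pre-built index) can be illustrated with a toy inverted index. This is a hedged sketch only: the slides do not describe how the clone search service is indexed, so the token-shingle scheme, the `CloneIndex` class, and the `min_shared` parameter are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, k=4):
    """Return the set of k-token windows in a code string."""
    tokens = TOKEN.findall(code)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Toy inverted index: token shingle -> names of functions containing it.

    Hypothetical design for illustration; not the actual service.
    """

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, name, code):
        """Index one function body under the given name."""
        for sh in shingles(code, self.k):
            self.postings[sh].add(name)

    def search(self, snippet, min_shared=2):
        """Rank indexed functions by the number of shingles shared with snippet."""
        hits = Counter()
        for sh in shingles(snippet, self.k):
            for name in self.postings.get(sh, ()):
                hits[name] += 1
        return [(n, c) for n, c in hits.most_common() if c >= min_shared]
```

Because the index is built ahead of time, a query only touches the postings for the snippet's own shingles, which is one plausible way to get near-real-time answers over a large codebase.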
Vulnerability investigation workflow
[Diagram: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. For variants finding, MSRC sends a code snippet to Team A, Team B, and Team C; their manual & ad hoc investigation returns code clones.]
Vulnerability investigation workflow
[Diagram: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. Variants finding is now an automated investigation: a code snippet goes to the clone search service, which returns code clones.]
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in a security context
• Code clone search service integrated into the vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed and seven privately reported vulnerabilities involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching for similar snippets so that a bug is fixed once, everywhere
• Finding refactoring opportunities
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile
Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  Prototype development → Early adoption → Technology transfer
Scalability
• Four-step analysis process
  Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
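The four-step process and its partition-based parallelism can be sketched as a small pipeline. The slides name the steps but not their internals, so everything inside each step here is an assumption made for illustration: the token-set overlap filter, the `difflib` comparison, the thresholds, and all function names are hypothetical. For brevity the sketch only finds clones within a partition, and it uses threads where a real CPU-bound run would more likely use processes or machines.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def preprocess(functions):
    """Step 1: tokenize each (name, code) pair."""
    return [(name, TOKEN.findall(code)) for name, code in functions]


def coarse_match(tokenized, min_overlap=0.5):
    """Step 2: cheap filter; keep pairs whose token sets overlap enough."""
    pairs = []
    for (n1, t1), (n2, t2) in combinations(tokenized, 2):
        s1, s2 = set(t1), set(t2)
        if s1 and s2 and len(s1 & s2) / len(s1 | s2) >= min_overlap:
            pairs.append(((n1, t1), (n2, t2)))
    return pairs


def fine_match(pairs, threshold=0.8):
    """Step 3: costlier ordered comparison of the surviving candidates."""
    result = []
    for (n1, t1), (n2, t2) in pairs:
        score = SequenceMatcher(None, t1, t2, autojunk=False).ratio()
        if score >= threshold:
            result.append((n1, n2, round(score, 2), min(len(t1), len(t2))))
    return result


def prune(clones, min_tokens=8):
    """Step 4: drop clones too short to be worth a reviewer's time."""
    return [(a, b, s) for a, b, s, size in clones if size >= min_tokens]


def analyze_partition(functions):
    """Run all four steps on one source partition."""
    return prune(fine_match(coarse_match(preprocess(functions))))


def analyze(partitions, workers=4):
    """Analyze each source partition concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(analyze_partition, partitions))
    return [clone for part in results for clone in part]
```

The coarse step exists so that the expensive fine comparison runs only on plausible candidates; partitions share no state, which is what makes the work easy to spread out.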
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
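The three knobs on this slide can be made concrete with a small sketch. It is an illustration under stated assumptions, not XIAO's metric: statement similarity is approximated with `difflib`, and the parameter names `stmt_threshold`, `max_diff`, and `allow_reorder` are hypothetical stand-ins for statement similarity, the share of inserted/deleted/modified statements, and the structure-versus-disorder balance.

```python
import re
from difflib import SequenceMatcher

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def stmt_sim(s1, s2):
    """Similarity of two statements as a token-sequence ratio in [0, 1]."""
    t1, t2 = TOKEN.findall(s1), TOKEN.findall(s2)
    return SequenceMatcher(None, t1, t2, autojunk=False).ratio()


def matched_in_order(a, b, stmt_threshold):
    """Longest common subsequence length, using fuzzy statement equality."""
    rows, cols = len(a), len(b)
    lcs = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if stmt_sim(a[i], b[j]) >= stmt_threshold:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return lcs[rows][cols]


def matched_any_order(a, b, stmt_threshold):
    """Greedy one-to-one statement matching that ignores order."""
    unused = list(b)
    count = 0
    for s in a:
        for t in unused:
            if stmt_sim(s, t) >= stmt_threshold:
                unused.remove(t)
                count += 1
                break
    return count


def is_clone(a, b, stmt_threshold=0.8, max_diff=0.3, allow_reorder=True):
    """Decide whether statement lists a and b are clones under the knobs."""
    match = matched_any_order if allow_reorder else matched_in_order
    matched = match(a, b, stmt_threshold)
    # Share of statements that were inserted, deleted, or modified.
    diff = 1 - matched / max(len(a), len(b))
    return diff <= max_diff
```

Applied to the two loop bodies on the slide with these default values, the sketch accepts the pair when reordering is tolerated (5 of 6 statements match) and rejects it when statement order must be preserved (4 of 6 match), which is the kind of trade-off the last bullet refers to.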
What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements
Snippet 1:
    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }
Snippet 2:
    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
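As a rough illustration of what "tunable" can mean here, the sketch below scores two snippets with two knobs: a statement-similarity threshold and a weight for how strongly reordered statements are penalized. It is not XIAO's algorithm. The function names, the parameters `stmt_threshold` and `order_weight`, and the greedy matching are assumptions made for this example; the inputs are the two loop bodies from the slide.

```python
from bisect import bisect_left
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1.split(), s2.split()).ratio()


def _longest_increasing_run(indices):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in indices:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def snippet_similarity(a, b, stmt_threshold=0.75, order_weight=0.5):
    """Hypothetical tunable clone score in [0, 1] (an assumption, not XIAO's metric)."""
    unused = list(range(len(b)))
    matched = []  # index in b matched to each statement of a, in a's order
    for stmt in a:
        if not unused:
            break
        # Greedy choice: the most similar statement of b that is still unmatched.
        best = max(unused, key=lambda j: statement_similarity(stmt, b[j]))
        if statement_similarity(stmt, b[best]) >= stmt_threshold:
            matched.append(best)
            unused.remove(best)
    if not matched:
        return 0.0
    # Share of statements on both sides that found a partner.
    coverage = 2 * len(matched) / (len(a) + len(b))
    # Share of matched statements that keep their relative order.
    in_order = _longest_increasing_run(matched) / len(matched)
    return coverage * ((1 - order_weight) + order_weight * in_order)


SNIPPET_1 = ["a ++", "b ++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
SNIPPET_2 = ["c = foo(a, b)", "a ++", "b ++", "d = bar(a, b, c)", "e = a + d", "e ++"]

for threshold, weight in [(0.75, 0.0), (0.75, 1.0), (0.9, 1.0)]:
    score = snippet_similarity(SNIPPET_1, SNIPPET_2, threshold, weight)
    print(f"stmt_threshold={threshold} order_weight={weight} score={score:.2f}")
```

On these inputs, raising `order_weight` lowers the score because `c = foo(a, b)` has moved, and raising `stmt_threshold` to 0.9 lowers it further because `e = a + c` and `e = a + d` no longer count as the same statement.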
Explorability
How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication - getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication - forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project - XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0day vulnerability disclosure
[Timeline figure; labels as extracted]
• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
MSRC: Microsoft Security Response Center
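To make the search scenario concrete, here is a minimal sketch of a snippet-in, locations-out lookup over a prebuilt index. It is a toy index that matches statements exactly after whitespace normalization, and it is not the service deployed at MSRC. `CloneIndex`, `add_file`, `search`, `min_coverage`, and the product and file names are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalize(stmt):
    """Drop all whitespace so formatting differences do not block a match."""
    return re.sub(r"\s+", "", stmt)


class CloneIndex:
    """Toy inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_file(self, codebase, path, source):
        # Indexing is done ahead of time, once per file.
        for line in source.splitlines():
            key = normalize(line)
            if key:
                self._postings[key].add((codebase, path))

    def search(self, snippet, min_coverage=0.6):
        """Return (location, coverage) pairs, best coverage first."""
        keys = {normalize(line) for line in snippet.splitlines()} - {""}
        hits = Counter()
        for key in keys:
            for location in self._postings.get(key, ()):
                hits[location] += 1
        results = [
            (location, count / len(keys))
            for location, count in hits.items()
            if count / len(keys) >= min_coverage
        ]
        return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical codebases; none of these names come from the talk.
index = CloneIndex()
index.add_file("product_a", "font.c", "if (len > max)\n    return;\ncopy(dst, src, len);\n")
index.add_file("product_b", "render.c", "if (len > max)\n  return;\nlog(len);\ncopy(dst, src, len);\n")
index.add_file("product_c", "util.c", "return;\n")

QUERY = "if (len > max)\n  return;\ncopy(dst, src, len);"
results = index.search(QUERY)
for (codebase, path), coverage in results:
    print(f"{codebase}/{path}: {coverage:.0%} of query statements found")
```

Because the index is built in advance, a query only touches the postings for its own statements, which is one plausible way to keep lookups fast as the indexed code grows. A practical service would also need near-miss matching, which this sketch leaves out.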
Vulnerability investigation workflow
[Workflow diagram] Steps: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix
Diagram labels: MSRC; Team A; Team B; Team C; Manual & ad hoc investigation; Code snippet; Code clones
Vulnerability investigation workflow
[Workflow diagram] Steps: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix
Diagram labels: Clone search service; Automated investigation; Code snippet; Code clones
• Completeness is the key
• Web service API for automation
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Explorability
How to navigate through the large number of detected clones (screenshot callouts):
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones (screenshot callouts):
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support
Collaboration models
• Pull
• Push
• Join
Communication - getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems
Communication - forming partnership
• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups
Example project - XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199
Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally
Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network
Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick
Potential 0-day vulnerability disclosure
Figure labels:
• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0-day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers
Tech transfer to MSRC
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
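The search scenario above (a code snippet as input, matching clones as output, answered quickly from a pre-built index) can be illustrated with a small inverted index over token shingles. This is a minimal sketch under stated assumptions, not XIAO's actual algorithm: the tokenizer, the shingle size `k`, the `min_score` threshold, the names `CloneIndex`, `add`, and `search`, and the sample function ids are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: identifiers, numbers, and single symbols; whitespace is ignored.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split code into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def shingles(tokens, k):
    """Return the set of all k-token windows; empty if there are fewer than k tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Hypothetical inverted index: shingle -> ids of functions containing it."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, func_id, code):
        """Index one function body under the given id."""
        for sh in shingles(tokenize(code), self.k):
            self.postings[sh].add(func_id)

    def search(self, snippet, min_score=0.5):
        """Return (func_id, score) pairs, best first.

        score is the fraction of the snippet's shingles that occur in the function.
        """
        query = shingles(tokenize(snippet), self.k)
        if not query:
            return []
        hits = defaultdict(int)
        for sh in query:
            for func_id in self.postings.get(sh, ()):
                hits[func_id] += 1
        scored = [(fid, n / len(query)) for fid, n in hits.items()]
        return sorted((p for p in scored if p[1] >= min_score),
                      key=lambda p: (-p[1], p[0]))


# Toy codebases (made-up ids and code, for illustration only).
index = CloneIndex(k=4)
index.add("product_a/parse", "if (idx < len) { buf[idx] = read(src); }")
index.add("product_b/parse", "if (idx < len) { buf[idx] = read(src); } done = 1;")
index.add("product_c/sum", "for (i = 0; i < n; i++) total += a[i];")

results = index.search("if (idx < len) { buf[idx] = read(src); }")
for func_id, score in results:
    print(func_id, round(score, 2))
```

In this sketch the index is built once and each query only touches the postings of its own shingles, which is what keeps lookups fast as the indexed code grows. Lowering `min_score` returns more candidates at the cost of more false positives.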
Microsoft Security Response Center (MSRC)
Vulnerability investigation workflow
Workflow stages: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix
Figure labels: MSRC; Team A, Team B, Team C; Code snippet; Code clones; Manual & ad hoc investigation
Vulnerability investigation workflow
• Clone search service
  • Completeness is the key
  • Web service API for automation
• Automated investigation: code snippet in, code clones out
Workflow stages: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Example - MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
Source: http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
• Searching similar snippets for fixing bug once
• Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
Example project ndash XIAO
bull Tons of papers published in the past 10 years
bull 6 years of International Workshop on Software Clones (IWSC) since 2006
bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)
bull Duplication Redundancy and Similarity in Software (2006)
bull No code clone analysis tools in MS
bull No product offering
ICSE 2013 SEIP 54
Source httpwwwdagstuhlde12071
Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012
httpresearchmicrosoftcomjump175199
Motivation
bull Copy-and-paste is a common developer behavior
bull A real tool widely adopted internally and externally
ICSE 2013 SEIP 55
Reaching out (1)
bull Demonstrating XIAO at TechFest
bull Posting XIAO at internal website
bull Active ldquosellingrdquo to various teams
bull What we gainedbull Opportunities to run XIAO on different codebases and
produce rich results
bull Feedback to improve both algorithm and system
bull Expanded network
ICSE 2013 SEIP 56
Reaching out (2)
bull What did not land well internallybull Wide interest but no concrete takers
bull Why no takersbull What exactly is the valuable proposition
bull Long way to go from code clones to bugs
bull High cost for code refactoring
bull Product prioritization
bull Lessons learnedbull Killer scenarios needed for value proposition
bull Security is a big stick
ICSE 2013 SEIP 57
Potential 0-day vulnerability disclosure
[Figure: timeline diagram. Labels: initial vulnerability reported in product A; security bulletin released; similar vulnerability found in product B by attackers; potential 0-day attack; patch release of product B.]
Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
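The search scenario above (a code snippet as input, a pre-built index over large codebases, a fast lookup) can be illustrated with a small sketch. This is not XIAO's algorithm, which the slides do not describe; it is a generic inverted index over token k-grams, assumed here only to show why indexing ahead of time makes near-real-time search possible. The class name `CloneIndex`, the fragment identifiers, the window size, and the score threshold are all hypothetical.

```python
import re
from collections import defaultdict

# Illustrative tokenizer: identifiers, numbers, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, k=3):
    """Return the set of k-token windows in `code`; whitespace is ignored."""
    tokens = TOKEN_RE.findall(code)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Hypothetical inverted index: token k-gram -> ids of fragments containing it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, fragment_id, code):
        """Index one code fragment (for example, a function body)."""
        for s in shingles(code, self.k):
            self.postings[s].add(fragment_id)

    def search(self, snippet, min_score=0.6):
        """Rank fragments by the fraction of the snippet's k-grams they contain."""
        query = shingles(snippet, self.k)
        if not query:
            return []
        hits = defaultdict(int)
        for s in query:
            for fragment_id in self.postings.get(s, ()):
                hits[fragment_id] += 1
        scored = [(fid, count / len(query)) for fid, count in hits.items()]
        matches = [(fid, score) for fid, score in scored if score >= min_score]
        return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


# Made-up fragments standing in for indexed codebases.
index = CloneIndex()
index.add("productA/font.c:parse", "if (len > 0) { memcpy(dst, src, len); }")
index.add("productB/glyph.c:load", "if (len > 0) {  memcpy(dst, src, len);  }")
index.add("productC/util.c:sum", "for (i = 0; i < n; i++) total += a[i];")
results = index.search("memcpy(dst, src, len);")
```

Indexing cost is paid once per codebase; each query touches only the posting lists for the snippet's own k-grams, which is the property a snippet-driven search needs at the scale the slide mentions.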
Vulnerability investigation workflow
[Figure: workflow diagram. Stage labels: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Actors: MSRC, Team A, Team B, Team C. Other labels: manual & ad hoc investigation; code snippet; code clones.]
Vulnerability investigation workflow
[Figure: workflow diagram with the clone search service. Stage labels: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Other labels: clone search service; automated investigation; code snippet; code clones.]
Completeness is the key
Web service API for automation
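The automated "variants finding" step can be sketched as a small routine that sends one snippet to a search function and groups the returned clones by owning team. The slides do not show the actual web service API, so everything here is an assumption made for illustration: the `Hit` record, the `find_variants` function, the codebase-to-team mapping, and the toy search that stands in for the real service.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical search result: where a clone of the snippet was found."""
    codebase: str
    path: str
    line: int


SearchFn = Callable[[str], Iterable[Hit]]


def find_variants(snippet: str, search: SearchFn,
                  owners: Dict[str, str]) -> Dict[str, List[Hit]]:
    """Group clone hits for `snippet` by the team that owns each codebase."""
    by_team: Dict[str, List[Hit]] = defaultdict(list)
    for hit in search(snippet):
        by_team[owners.get(hit.codebase, "unassigned")].append(hit)
    return {team: sorted(hits) for team, hits in sorted(by_team.items())}


def make_toy_search(corpus: Dict[str, Dict[str, str]]) -> SearchFn:
    """Stand-in for the search service: whitespace-insensitive line matching."""
    def normalize(text: str) -> str:
        return "".join(text.split())

    def search(snippet: str) -> List[Hit]:
        needle = normalize(snippet)
        hits = []
        for codebase, files in corpus.items():
            for path, source in files.items():
                for number, line in enumerate(source.splitlines(), start=1):
                    if needle and needle in normalize(line):
                        hits.append(Hit(codebase, path, number))
        return hits

    return search


# Made-up codebases and owners, for illustration only.
corpus = {
    "codebase_a": {"font.c": "int n = read();\nmemcpy(dst, src, n);"},
    "codebase_b": {"glyph.c": "memcpy( dst, src, n );"},
    "codebase_c": {"log.c": "puts(msg);"},
}
owners = {"codebase_a": "Team A", "codebase_b": "Team B", "codebase_c": "Team C"}
report = find_variants("memcpy(dst, src, n);", make_toy_search(corpus), owners)
```

Passing the search function in as a parameter keeps the grouping logic independent of how the service is reached, so the same routine would work against a real endpoint.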
More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and 7 privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx
Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group
Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners
Benefiting developer community
Searching similar snippets for fixing bug once
Finding refactoring opportunity
Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration
More secure Microsoft products
Automated laborious manual efforts
Faster response time, critical in security context
Code clone search service integrated into vulnerability investigation process of MSRC
Real security issues proactively identified and addressed
Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.
Insufficient bounds check within the font parsing subsystem of win32k.sys