Source Code Clone Search (Iman Keivanloo, PhD Seminar)
Source Code Clone Search and Detection (SeClone is a real-time, Internet-scale clone search and detection engine).
Internet-scale Source Code Search and Analysis Framework
Iman Keivanloo
Advisor: Dr. Juergen Rilling
PhD Seminar, Computer Science and Software Engineering Department, November 17, 2011
2
Agenda
• Research Context
• Major questions & answers
• Next step
• Conclusion
• Time Table
3
Research Context
Internet-Scale Code Search
“is searching the Internet for source code to help solve a software development problem” [Gallardo, SUITE’09]
4
How to search for Source Code?
• Free-form Query:
– “how to write into file in Java”
• Structural Query:
– “select col1 from table1 where col1=‘%write’”
[Keivanloo, ICSM’10][Keivanloo, SUITE’11]
5
Research Focus
Suggested simplified query: select each line that has
(1) a method call statement on the trigger method.
...
CSVReadFile csvData=new CSVReadFile("input.csv");
myWindow.trigger(csvData);
OutputStream o=new OutputStream();
...

...
Event e=new Event(50);
e.trigger();
e.update();
...

...
Listener res=new Listener();
res.trigger("warm-up");
res.close();
...

...
Window r=new Window();
long timestamp=System.Now();
System.out.println("Start reasoning...");
XMLStream xmldata=new XMLStream(io);
r.trigger(xmldata);
OutputStream o=new OutputStream();
r.flush(o);
...

...
Window var=new Window();
XMLReadFile r=new XMLReadFile("k.xml");
OutputStream o=new OutputStream();
var.trigger(r);
var.flush(o);
...
Gapped clone
Unordered core
The pattern is similar, but it uses XMLStream instead of XMLReadFile as the input.
This match is acceptable, even though the order is different from the 1:1 match.
Internet-Scale Structural Code Search Engine
Step 1: Input [the simplified structural query]
This line looks like a match; however, it uses .CSV instead of .XML. We can now use our clone search engine to find other code fragments similar to this one.
Real-time Clone Search Engine
Step 2: Input [the fragment selected in the first step and its target line (red)]
...
Window myWindow=new Window();
CSVReadFile csvData=new CSVReadFile("...
myWindow.trigger(csvData);
OutputStream o=new OutputStream();
myWindow.flush(o);
myWindow.close();
...
The ideal expected answer:
XMLReadFile inFile=new XMLReadFile("kb.xml");
Window myWindow=new Window();
myWindow.trigger(inFile);
OutputStream result=new OutputStream();
myWindow.flush(result);
Similar Fragment Search
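The simplified query on this slide (select each line that has a method call on the trigger method) can be illustrated with a minimal sketch. It assumes a regular expression is an acceptable stand-in for real parsing; the class name TriggerQuery and the method matchingLines are hypothetical and are not part of the search engine described in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TriggerQuery {
    /** Returns the 0-based indexes of lines that call methodName on some receiver. */
    static List<Integer> matchingLines(List<String> lines, String methodName) {
        // receiver '.' methodName '(' ; a regular expression stands in for real parsing here.
        Pattern call = Pattern.compile(
                "\\b[A-Za-z_]\\w*\\s*\\.\\s*" + Pattern.quote(methodName) + "\\s*\\(");
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (call.matcher(lines.get(i)).find()) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> fragment = List.of(
                "Event e=new Event(50);",
                "e.trigger();",
                "e.update();");
        System.out.println(matchingLines(fragment, "trigger"));
    }
}
```

A structural search engine would answer such a query from an index of parsed code facts rather than by scanning text; the sketch only shows what the query selects.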
6
Research Challenge
7
The Web Search Challenge
8
But Often Still Fail to Deliver the Expected Results After 10 Years of Research
9
No Ambiguity!
10
Early Conclusion
Source Code Search is similar to Web Search
1. Search techniques = ?
2. Ambiguity resolution techniques = Code Analysis
[Diagram labels: Search; Analysis (Ambiguity resolution)]
12
Research Approach Overview
Internet-scale Source Code Search and Analysis Framework
• Analysis: Semantic Web-based Code Analysis
• Search: Code Clone Search
Definitions & Requirements
Search
14
Clone (Source Code Clone)
• Similar code fragments
• Type 1: Identical except whitespaces …
• Type 2: Identical except variable names ...
• Type 3: Identical except a few missing…
• Type 4: Similar functionality
[Roy, C. K., Cordy, J. R., & Koschke, R. (2009). Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 2009.]
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
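To make the first two clone types concrete, here is a minimal sketch of how two fragments might be compared after normalization. It is an illustration under stated assumptions, not the technique from the cited survey or from SeClone: the tokenizer is a simple regular expression, the keyword list is deliberately short, and the names CloneNormalizer, normalize, ID and LIT are hypothetical. With abstractNames set to false only whitespace differences disappear (Type 1); with true, identifiers and string literals are also replaced by placeholders (Type 2).

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CloneNormalizer {
    // Deliberately short keyword list; an assumption made for this sketch only.
    private static final Set<String> KEYWORDS =
            Set.of("for", "if", "else", "while", "return", "new", "int", "long");

    // A token is a string literal, an identifier or keyword, a number, or any other single character.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]*\"|[A-Za-z_][A-Za-z0-9_]*|\\d+|\\S");

    /** Re-joins tokens with single spaces; optionally abstracts identifiers and literals. */
    static String normalize(String code, boolean abstractNames) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String token = m.group();
            if (abstractNames) {
                if (token.startsWith("\"")) {
                    token = "LIT";
                } else if (Character.isJavaIdentifierStart(token.charAt(0))
                        && !KEYWORDS.contains(token)) {
                    token = "ID";
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = "for (Attribute attribute:es1.getAttributes()) System.out.println(\"Test\");";
        String b = "for (JAttribute attribute:formType.getAttributes()) System.out.println(\"Hi\");";
        System.out.println(normalize(a, false).equals(normalize(b, false))); // Type-1 comparison
        System.out.println(normalize(a, true).equals(normalize(b, true)));   // Type-2 comparison
    }
}
```

Note that this sketch abstracts every identifier, including method names. A search-oriented normalizer may prefer to keep some names intact.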
16
Clone Search
Query
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hello!");
Code Database
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hi!");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");
for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
for (Attribute attribute:es1.getAttributes()) System.out.println("Test");
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("The end");
17
Clone Search
Query
Answer
18
Internet-scale Clone Search
Query
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hello!");
[Figure: database fragments shown around the query]
for (Attribute attribute:exampleSet.getAttributes()) System.out.println("Hi!");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
19
Internet-scale Real-time Clone Search
Requirements:
• Millions LOC (~300 MLOC)
• 100 milliseconds
• Precision
• Recall
• Type-1, 2, 3…
for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");
for (JAttribute attribute:formType.getAttributes()) System.out.println("Test");
for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
for (Attribute attribute:es1.getAttributes()) System.out.println("Test");
Research Question #1
Is it actually possible? Real-time answer (faster than 100 ms)
26
Our Initial Analysis
• SeClone: An Internet-scale Real-time Clone Search Engine
[Figure labels: Search; Analysis; Phase 1; Phase 2]
[Keivanloo, ICPC’11]
27
Inside SeClone
Phase 1: Pattern Matching
• Syntactical pattern matching
28
Inside SeClone
Phase 2
• Information Retrieval & Clustering algorithm
1 for (Attribute attribute:exampleSet.getAttributes()) System.out.println("The end");
2 for (Attribute attribute:es1.getAttributes()) System.out.println("Test");
3 for (AttributeEntity theAttributeEntity:aTableEntity.ge… System.out.println("Hello!");
4 for (JAttribute attribute:formType.getAttributes()) {System.out.println("Test");
5 for (IAttribute att:source.getAttributes()) {System.out.println("Please do not read me");
Phase 1: Pattern Matching
Phase 2: Semantic Matching
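The ranked list on this slide suggests that Phase 2 orders the candidates returned by Phase 1 by how close they are to the query. The slides do not say which information retrieval model or clustering algorithm is used, so the sketch below swaps in plain Jaccard similarity over token sets as a stand-in. The class CandidateRanker and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CandidateRanker {
    /** Splits code into a set of word-like tokens (identifiers, keywords, words in literals). */
    static Set<String> tokens(String code) {
        Set<String> out = new HashSet<>(Arrays.asList(code.split("[^A-Za-z0-9_]+")));
        out.remove("");
        return out;
    }

    /** Jaccard similarity: shared tokens divided by all distinct tokens of both fragments. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    /** Returns the candidates sorted from most to least similar to the query. */
    static List<String> rank(String query, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((String c) -> jaccard(query, c)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        String query = "for (Attribute attribute:exampleSet.getAttributes()) System.out.println(\"Hello!\");";
        List<String> candidates = List.of(
                "for (IAttribute att:source.getAttributes()) {System.out.println(\"Please do not read me\");",
                "for (Attribute attribute:es1.getAttributes()) System.out.println(\"Test\");");
        rank(query, candidates).forEach(System.out::println);
    }
}
```

Any similarity measure with the same interface could replace jaccard here; the point is only that ranking happens after, and separately from, the pattern-matching phase.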
Research Question #2 (The Dilemma)
How to distribute the 100 milliseconds between phases?
[Figure: time budget axis from 0 to 100, split between Pattern Matching and Semantic Matching]
[Keivanloo, WCRE’11]
30
Our Further Analysis [WCRE’11]
Requirements:
• 100 milliseconds
• Millions LOC
• Precision
• Recall
• Type-1, 2, 3…
[Figure labels: The Dilemma (Pattern Matching vs. Semantic Matching, 0 to 100); Constraints; SeClone [ICPC’11]; Data Characteristics; O(p * log n)]
31
Source Code Characteristics
32
Analysis of the Data Characteristics: Dataset preparation
• Name: IJaDataset
– Comprehensive (inter-project)
• To avoid project-specific results
– ~18,000 projects
– 1,500,000 unique Java classes
• No duplicate, empty, or buggy files
– ~300 MLOC
• Online at http://aseg.cs.concordia.ca/seclone
33
Analysis of the Data Characteristics: Granularity Effect
• Three Level Similarity (TLS): Set of similar three-line fragments
• First Level Similarity (FLS): single-line patterns
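The two granularities can be sketched as two groupings of the same code: one keyed by single lines (FLS) and one keyed by windows of three consecutive lines (TLS). This is a minimal illustration that assumes the lines are already normalized and uses the raw text as the group key; GranularityIndex, build and averageGroupSize are hypothetical names, not SeClone's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GranularityIndex {
    /** Groups fragment start positions by a key built from `window` consecutive lines. */
    static Map<String, List<Integer>> build(List<String> lines, int window) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i + window <= lines.size(); i++) {
            String key = String.join("\n", lines.subList(i, i + window));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        return groups;
    }

    /** Average number of fragments per group; 0.0 for an empty index. */
    static double averageGroupSize(Map<String, List<Integer>> groups) {
        return groups.values().stream().mapToInt(List::size).average().orElse(0.0);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a", "b", "c", "a", "b", "d");
        Map<String, List<Integer>> fls = build(lines, 1); // single-line patterns
        Map<String, List<Integer>> tls = build(lines, 3); // three-line fragments
        System.out.println("FLS groups: " + fls.size() + ", average size " + averageGroupSize(fls));
        System.out.println("TLS groups: " + tls.size() + ", average size " + averageGroupSize(tls));
    }
}
```

Because a query fragment maps to exactly one key, the number of candidates to inspect equals the size of one group, which is why group size matters for response time.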
34
Analysis of the Data Characteristics: Clone frequency
• How many code fragments are analyzed for each query?
• Answer: 3 (Average)
35
Analysis of the Data Characteristics: Clone frequency
• Observation result:
– TLS distributes the candidates into 3.9 times more groups
– Its group size is 6 times smaller than FLS
36
Analysis of the Data Characteristics: Clone frequency
• Conclusion:
– The TLS heuristic is practical for real-time clone search, as long as the outliers are handled properly
– Why?
• (1) each TLS group has 2.37 members on average
• (2) it distributes candidates into small groups
• (3) for each query, only one group must be evaluated
37
What Does an Outlier Look Like?
• Outlier Definition: patterns with more than 2,000 occurrences
• Observation result:
– Only ~1,000 patterns out of 30M
– ~0.01% of patterns
– Mostly insignificant code patterns
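The outlier definition above translates directly into a filter over pattern occurrence counts. The sketch below assumes the counts are already available as a map from pattern to number of occurrences; OutlierFilter and outliers are hypothetical names, and the threshold is passed in so the 2,000 from the slide is not hard-coded.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class OutlierFilter {
    /** Returns the patterns that occur more than `threshold` times. */
    static Set<String> outliers(Map<String, Integer> occurrences, int threshold) {
        return occurrences.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Made-up counts for illustration; they are not measurements from the dataset.
        Map<String, Integer> counts = Map.of("}", 50000, "x.trigger(y);", 12, "return;", 2000);
        System.out.println(outliers(counts, 2000));
    }
}
```

How the outliers are then handled (skipped, capped, or indexed separately) is a separate design decision that the sketch does not cover.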
38
Analysis of the Data Characteristics: Sampling efficiency
• Can sampling be used to reduce the amount of data being analyzed?
• Answer: Yes (e.g., a 33% sample contains 91% of the popular patterns)
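A coverage figure of this kind can be computed as the share of popular patterns that still appear in the sample. The following sketch assumes the set of popular patterns has been determined on the full dataset beforehand; SamplingCoverage and coverage are hypothetical names.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SamplingCoverage {
    /** Fraction of the popular patterns that occur at least once in the sample. */
    static double coverage(Set<String> popular, Collection<String> sample) {
        if (popular.isEmpty()) {
            return 0.0;
        }
        Set<String> seen = new HashSet<>(sample);
        seen.retainAll(popular);
        return (double) seen.size() / popular.size();
    }

    public static void main(String[] args) {
        // Toy data for illustration only.
        Set<String> popular = Set.of("a", "b", "c", "d");
        List<String> sample = List.of("a", "a", "c", "x");
        System.out.println(coverage(popular, sample));
    }
}
```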
39
Analysis of the Data Characteristics: Indexing
• Can 32-bit hash keys (versus MD5) be used without affecting index quality?
abc 123 abc 123 aXc 456 aXc 123
• Answer: Yes (0.002% error rate)
– Only 10 cases where three distinct strings share the same key
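The check behind this answer can be sketched as counting how many 32-bit keys are shared by more than one distinct pattern. The hash function used in the study is not named on the slide, so the sketch assumes Java's built-in String.hashCode() as a stand-in 32-bit hash; HashKeyCheck and collidingKeys are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HashKeyCheck {
    /** Number of 32-bit keys that are shared by two or more distinct patterns. */
    static long collidingKeys(List<String> patterns) {
        Map<Integer, Set<String>> byKey = new HashMap<>();
        for (String p : patterns) {
            // String.hashCode() is an assumed stand-in for the 32-bit hash used in the study.
            byKey.computeIfAbsent(p.hashCode(), k -> new HashSet<>()).add(p);
        }
        return byKey.values().stream().filter(s -> s.size() > 1).count();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known String.hashCode() collision.
        System.out.println(collidingKeys(List.of("Aa", "BB", "abc", "abc")));
    }
}
```

Dividing the result by the number of distinct keys gives an error rate comparable in spirit to the one reported above.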
40
Are Method Names Reliable?
• Input data: Koders 1-year query log
– ~10M records
• Observation purpose:
– Importance of method names
• Observation result:
– 98% success rate vs. 69%
• Result interpretation:
– Method names in this context are a reliable source of information
– They must be preserved to increase precision
41
Source Code Search Framework
42
Internet-scale Real-time Code Clone Search via Multi-level Indexing
– Internet-scale & speed
• 32-bit hash values
– Type-3 clones
• Multi-level indexing
– Customized for Internet-scale code search
• Special transformation rule
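The slide does not spell out how the index levels interact, so the following is only one possible reading, written as a sketch: a coarse level keyed by the 32-bit hash of three consecutive lines answers exact matches, and a fine level keyed by single-line hashes is used as a fallback to accept fragments where one of the three lines differs, which gives a simple form of Type-3 tolerance. The class TwoLevelLookup, the two-out-of-three rule, and the use of String.hashCode() are all assumptions. Equal hashes are treated as equal lines, so hash collisions are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class TwoLevelLookup {
    private final Map<Integer, List<Integer>> threeLine = new HashMap<>(); // coarse level
    private final Map<Integer, List<Integer>> oneLine = new HashMap<>();   // fine level
    private final int size;

    TwoLevelLookup(List<String> lines) {
        size = lines.size();
        for (int i = 0; i < size; i++) {
            oneLine.computeIfAbsent(lines.get(i).hashCode(), k -> new ArrayList<>()).add(i);
            if (i + 3 <= size) {
                threeLine.computeIfAbsent(key(lines.subList(i, i + 3)), k -> new ArrayList<>()).add(i);
            }
        }
    }

    private static int key(List<String> window) {
        return String.join("\n", window).hashCode();
    }

    /** Start positions matching the three-line query exactly, else with at most one differing line. */
    List<Integer> search(List<String> query) {
        if (query.size() != 3) {
            throw new IllegalArgumentException("query must have exactly three lines");
        }
        List<Integer> exact = threeLine.get(key(query));
        if (exact != null) {
            return exact;
        }
        // Fallback: each matching line votes for the start position it would imply.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int offset = 0; offset < 3; offset++) {
            for (int pos : oneLine.getOrDefault(query.get(offset).hashCode(), List.of())) {
                votes.merge(pos - offset, 1, Integer::sum);
            }
        }
        TreeSet<Integer> hits = new TreeSet<>();
        votes.forEach((start, n) -> {
            if (n >= 2 && start >= 0 && start + 3 <= size) {
                hits.add(start);
            }
        });
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        TwoLevelLookup index = new TwoLevelLookup(List.of(
                "Window w=new Window();", "w.trigger(x);", "w.flush(o);", "w.close();"));
        System.out.println(index.search(List.of("Window w=new Window();", "log();", "w.flush(o);")));
    }
}
```

The exact level keeps the common case to a single hash lookup, while the fallback only runs when that lookup misses.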
43
Response Time (Pattern Matching) [WCRE’11]
• Regular queries
– 25 microseconds
• 99.99% of queries
– 900 microseconds
44
Conclusion
45
Answer to Research Question #1
Is Internet-scale real-time code search possible?
YES
Answer to Research Question #2
The Dilemma: How to distribute the 100 milliseconds between phases?
• Pattern Matching: 1 millisecond
• Semantic Matching: 99 milliseconds
Research Opportunity
Analysis (the 99 milliseconds left for semantic matching)
48
Summary
Step 1
• Studied characteristics of source code on the Internet
– Unique pattern distribution (sampling application)
– Pattern frequencies (multi-level search)
– 32-bit hashing strength (code pattern)
– Outlier patterns
– Method name importance
Step 2
• Designed an Internet-scale clone search
– Customized for code search (precision)
– Fine granularity
– Multi-level indexing approach (Type-3 clone)
– Microsecond-range response time (up to 10 times faster)
49
Publications
Code Clone Search and Detection (http://aseg.cs.concordia.ca/seclone/)
• Iman Keivanloo, Juergen Rilling, Philippe Charland. Internet-scale Real-time Code Clone Search via Multi-level Indexing. 18th Working Conference on Reverse Engineering (WCRE 2011), Lero, Limerick, Ireland.
• Iman Keivanloo, Juergen Rilling, Philippe Charland. SeClone – A Hybrid Approach to Internet-Scale Real-Time Code Clone Search. 19th IEEE International Conference on Program Comprehension (ICPC 2011), Kingston, Ontario, Canada.
Source Code Sharing using Linked Data (secold.org)
• Iman Keivanloo, Chris Forbes, Juergen Rilling, Philippe Charland. Towards Sharing Source Code Facts Using Linked Data. ICSE Workshop on Search-Driven Development: Users, Infrastructure, Tools and Evaluation (SUITE), 2011.
Source Code Search (http://aseg.cs.concordia.ca/codesearch)
• Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. Semantic Web-based Source Code Search. 6th International Workshop on Semantic Web Enabled Software Engineering (SWESE 2010), June 2010, San Francisco, USA.
• Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, Juergen Rilling. SE-CodeSearch: A Scalable Semantic Web-based Source Code Search Infrastructure. 26th IEEE International Conference on Software Maintenance (ICSM), Early Research Achievements (ERA) Track, Sept. 12-18, Timișoara, Romania.
50
Questions?
Thank you for your kind attention