Value-Based Program Characterization and Its Application to Software Plagiarism Detection
Embedded Lab. Park Yeongseong
ICSE 2011
Yoon-Chan Jhi, Xinran Wang, Sencun Zhu, Peng Liu, Dinghao Wu (Penn State University)
Xiaoqi Jia (State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences)
◦ Introduction
◦ State of the art
◦ Core values
◦ Design
◦ Experiment
◦ Discussion
◦ Conclusion
◦ Q&A
Contents
Identifying the same or similar code is very important
Previous works
◦ Static source code comparison – C1
◦ Static executable code comparison – C2
◦ Dynamic control flow based methods – C3
◦ Dynamic API based methods – C4
Introduction
Three highly desired requirements
◦ R1 – Resiliency
◦ R2 – Ability to directly work on binary executables
◦ R3 – Platform independence
However, no existing class satisfies all three requirements. Requirements each class fails to satisfy:
◦ Static source code comparison – C1: R1, R2
◦ Static executable code comparison – C2: R1
◦ Dynamic control flow based methods – C3: R1, R3
◦ Dynamic API based methods – C4: R3
Introduction
Introduce a new approach
◦ Core-values
5 optimization options (-O0 ~ -O3, -Os)
3 compilers (GCC, TCC, WCC)
Obfuscators: KlassMaster, Thicket, Loco/Diablo
Introduction
Code Obfuscation Techniques
◦ Data obfuscation, control obfuscation, layout obfuscation and preventive transformations
◦ Indirect branches, control-flow flattening, function-pointer aliasing
Static Analysis Based Plagiarism Detection
◦ String-based
◦ AST-based
◦ Token-based
◦ PDG-based
◦ Birthmark-based
State of the art
Dynamic Analysis Based Plagiarism Detection
◦ Whole program path based (WPP)
◦ Sequence of API function calls birthmark (EXESEQ)
◦ Frequency of API function calls birthmark (EXEFREQ)
◦ System call based birthmark
State of the art
Runtime values
◦ The output operands of the machine instructions executed
Core values
◦ Constructed from runtime values
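Eliminate non-core values (v: a runtime value of program P; I: the input of P; P': a semantics-preserving transformation of P)
◦ If v is not derived from I, v is not a core-value of P
◦ If v is not in the set of runtime values of P', v is not a core-value of P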
Core values
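The two elimination rules on this slide can be read as a filter over a recorded trace. The sketch below is only an illustration under stated assumptions: `RuntimeValue`, its `tainted` flag, and `variant_values` (the runtime values of a second, semantically equivalent build of the same program) are hypothetical names, not part of the authors' tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeValue:
    value: int     # output operand written by an executed instruction
    tainted: bool  # hypothetical flag: True if the value was derived from the input


def eliminate_non_core(trace, variant_values):
    """Keep values that are input-derived AND also observed in the variant build."""
    return [rv.value for rv in trace
            if rv.tainted and rv.value in variant_values]


# Toy trace: 4096 is not input-derived, 13 does not appear in the variant build.
trace = [RuntimeValue(7, True), RuntimeValue(4096, False),
         RuntimeValue(42, True), RuntimeValue(13, True)]
variant_values = {7, 42, 99}

core = eliminate_non_core(trace, variant_values)
print(core)
```

Whatever survives both checks is only a candidate core value; the sketch does not claim to reproduce the taint tracking used in the original work.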
Core values
Not all values associated with the execution of a program are core-values
◦ Value-updating instruction
◦ Related to the program’s semantics
Design - Value Sequence Extraction
To refine value sequences
◦ Sequential refinement – reduction rate 16%~34%
◦ Optimization-based refinement – 5 optimization options
◦ Address removal – exclude pointer values
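A rough sketch of how the three refinement steps could be chained. The slide does not define the steps in detail, so each function is an assumption: `collapse_repeats` is a simplified stand-in for sequential refinement, `keep_common_values` assumes optimization-based refinement keeps only values seen in every other build, and `address_ranges` is a hypothetical list of memory regions used to recognise pointer values.

```python
def collapse_repeats(seq):
    """Stand-in for sequential refinement (assumed): drop immediate repeats."""
    out = []
    for v in seq:
        if not out or out[-1] != v:
            out.append(v)
    return out


def drop_addresses(seq, address_ranges):
    """Address removal: exclude values that fall inside known memory regions."""
    return [v for v in seq
            if not any(lo <= v < hi for lo, hi in address_ranges)]


def keep_common_values(seq, other_builds):
    """Optimization-based refinement (assumed): keep values seen in every other build."""
    if not other_builds:
        return list(seq)
    common = set.intersection(*(set(s) for s in other_builds))
    return [v for v in seq if v in common]


raw = [5, 5, 0x7FFD1000, 12, 12, 31, 8]
address_ranges = [(0x7FFD0000, 0x7FFE0000)]  # hypothetical stack region
other_builds = [[5, 12, 8, 40], [12, 5, 8]]  # sequences from other optimization levels

refined = keep_common_values(
    drop_addresses(collapse_repeats(raw), address_ranges), other_builds)
print(refined)
```

The order of the steps here is a design choice of the sketch, not something the slide specifies.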
Design - Value Sequence Refinement and Similarity Metric
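The slide names a similarity metric without spelling it out. As an illustration only, the sketch below assumes a score based on the longest common subsequence (LCS) of two refined value sequences, normalised by the shorter sequence so that it falls between 0 and 1. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(plaintiff, suspect):
    """Assumed score: |LCS| divided by the length of the shorter sequence."""
    if not plaintiff or not suspect:
        return 0.0
    return lcs_length(plaintiff, suspect) / min(len(plaintiff), len(suspect))


print(similarity([5, 12, 8, 31], [5, 99, 12, 8]))  # shares the subsequence 5, 12, 8
print(similarity([1, 2, 3], [4, 5, 6]))            # nothing in common
```

Under this assumed definition, identical sequences score 1.0 and sequences with no shared values score 0.0, which matches the range of the scores reported in the experiment slides.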
Design - Overview
◦ Intel Quad-Core 2.00 GHz CPU
◦ 4GB RAM
◦ Linux machine
◦ QEMU 0.9.1
Questions
1. Resilient
2. False accusation
3. Credible
Experiment
Obfuscation techniques
◦ SandMark, KlassMaster: Java bytecode obfuscators
Test application: JLex
◦ Lexical analyzer
Experiment - Obfuscation tool (resiliency)
Test applications
◦ 5 individual XML parsers: expat, libxml2, Parsifal, rxp, xercesc
Experiment - Similar Programs (false accusation)
Test applications
◦ bzip2, gzip, oggenc, 9 of 11 programs
Result
◦ Similarity scores between 0 and 0.27
◦ zip and gzip similarity scores are 1.0 (same compression algorithm: deflate)
◦ zip and bzip2 similarity scores are 0.01 to 0.03 (different compression algorithm: block sorting)
Experiment - Different Programs (credible)
◦ Introduces a novel approach to dynamic characterization of executable programs.
◦ The value-based method successfully discriminates 34 plagiarisms obfuscated by SandMark, KlassMaster, and Thicket.
Conclusion
Q&A