osspolice-identifying open-source license violation and 1-day … · 2017-10-31 · greendao...
TRANSCRIPT
OSSPolice - Identifying Open-Source License Violation and 1-day Security Risk at Large Scale
Ruian Duan, Ashish Bijlani, Meng Xu, Taesoo Kim, Wenke Lee
ACM CCS 2017
Background
• Open Source Software (OSS) is gaining popularity, e.g. GitHub reported 20M users and 57M repos
• Mobile app market grows fast with over 2M apps on Play Store
• Developers reuse OSS as is for lots of benefits
• Legal risks and security risks arise
Risks in OSS use
• OSS licenses have constraints (e.g. GNU GPL requires derivative works to be open source)
• 1-day vulnerabilities in stale OSS versions are exploited by hackers
For now, GNU GPL is an enforceable contract, says US federal judge!
Artifex Slaps Palm with PDF Reader Copyright Suit
Equifax blames open-source software for its record-breaking security breach
Community Health Systems Breach Possible due to Heartbleed Vulnerability
Goal
• Design a tool, OSSPolice, to analyze Android apps for open-source license violation and 1-day security risk by detecting reuse of OSS and their versions at large scale
• Requirements
  • Accurate detection for hundreds of thousands of OSS
  • Accurate version pinpointing
  • Efficient resource usage
  • Fast search to support vetting a large number of Android apps
Overview and challenges
• Feature selection
  • Source vs binary: automatically building source code is hard, due to dependencies, various build configs, etc.
• Compare App against OSS
  • Fused app binaries: multiple OSS can be linked or compiled into a single file
  • Partial builds and internal code clones: not all OSS features are built into libraries, and OSS reuses other OSS
• Identify OSS versions
  • Cross-match of unique version features: fused app binaries and internal code clones can confuse the provenance of unique features
Source vs binary
• C/C++ OSS are built into stripped native shared libraries (so files)
• Java OSS are built into obfuscated Dalvik executables (dex files)
[Figure: C/C++ pipeline from source code (Foo.c with void foo() { w = "hello" … }, Bar.c with static bar() { w = "world" }) to a shared library (.text, .dynsym, .rodata, .symtab, .debug_info) to a stripped shared library (.text, .dynsym, .rodata)]
[Figure: Java pipeline from source code (package edu.gatech; class Foo { bar() { println("hello world") }; }) to Dalvik bytecode (.class edu/gatech/Foo, .method bar, const-string v1, "Hello World", invoke-virtual {v0, v1}, println) to obfuscated Dalvik bytecode (.class a, .method a, with the same const-string and invoke-virtual instructions)]
Feature selection
• C/C++ OSS vs so files
  • String literal
    • Clang-based lexer for OSS and .rodata for libraries
  • Exported function
    • Clang-based parser for OSS and .dynsym for libraries
• Java OSS vs dex files
  • String constant
  • Normalized class
    • Captures interaction with framework
  • Function centroid
    • Captures intra-procedural control flow
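As a rough illustration of the string-literal feature on the binary side, the sketch below pulls NUL-terminated printable strings out of a byte blob standing in for a library's .rodata section. The function name `extract_strings`, the `min_len` cutoff, and the sample bytes are assumptions made for this example; the slides do not describe how OSSPolice parses the section.

```python
import re

def extract_strings(section, min_len=4):
    """Return NUL-terminated printable ASCII runs of at least min_len bytes.

    `section` is the raw content of a read-only data section (assumed to have
    been located already, e.g. with an ELF parser).
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}(?=\x00)" % min_len)
    return {m.group().decode("ascii") for m in pattern.finditer(section)}

# Hypothetical section content: two string literals, some binary noise,
# and one string that is too short to be a useful feature.
rodata = b"hello\x00\x01\x02png read failed\x00ab\x00"
features = extract_strings(rodata)
```

The resulting set can then be compared with string literals collected from OSS sources.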
Fused app binaries
• An app uses multiple OSS
  • (OSS ∩ APP) / OSS
  • (APP ∩ OSS) / APP
• Iterating over 𝑁 OSS has 𝑂(𝑁) time complexity
  • Flag all OSS being used at the same time
  • Index OSS and their versions!
[Figure: example app edu.gatech.example containing MuPDF, OpenCV, OpenSSL, OkHttp, MoPub, and Log4j]
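A small sketch of why the choice of denominator matters for a fused binary: dividing the shared features by the OSS feature count stays high for every library the app embeds, while dividing by the app's feature count shrinks as more libraries are fused in. The feature names and the helper `ratio` are hypothetical.

```python
def ratio(shared_with, base):
    """Fraction of `base` features that also appear in `shared_with`."""
    return len(base & shared_with) / len(base) if base else 0.0

# Hypothetical feature sets for two OSS projects.
oss = {
    "mupdf": {"pdf_open", "pdf_read", "fz_new"},
    "openssl": {"SSL_new", "SSL_read"},
}
# A fused app binary: both libraries plus the app's own code.
app = oss["mupdf"] | oss["openssl"] | {"app_main"}

by_oss = {name: ratio(app, feats) for name, feats in oss.items()}
by_app = {name: ratio(feats, app) for name, feats in oss.items()}
```

Here `by_oss` is 1.0 for both libraries, whereas `by_app` drops to 0.5 for MuPDF and lower for OpenSSL even though both are fully present.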
Flat indexing and matching
• Indexing: maps features to OSS
• Matching: look up the feature -> OSS mapping to identify OSS reuse
• Flat indexing blows up the table to 90G after indexing 7K OSS
  • Indexing multiple versions of OSS further adds to the problem
  • Given 𝑁 OSS with 𝐹 features and 𝑉 versions, 𝑂(𝑁𝐹𝑉) space complexity
[Figure: flat index mapping feature1, feature2, and feature3 to MuPDF and OpenCV, matched against edu.gatech.example]
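A flat index of this kind can be sketched as an inverted map from each feature to the OSS that contain it. The code below is a toy version under stated assumptions: the feature names are invented, and the 0.5 acceptance threshold is a placeholder, not a value from the talk.

```python
from collections import defaultdict

def build_flat_index(oss_features):
    """Map every feature to the set of OSS names that contain it."""
    index = defaultdict(set)
    for name, features in oss_features.items():
        for feature in features:
            index[feature].add(name)
    return index

def match(index, oss_features, app_features, threshold=0.5):
    """Look up each app feature once and report OSS above the match ratio."""
    hits = defaultdict(int)
    for feature in set(app_features):
        for name in index.get(feature, ()):
            hits[name] += 1
    ratios = {name: count / len(oss_features[name]) for name, count in hits.items()}
    return {name: r for name, r in ratios.items() if r >= threshold}

oss_features = {
    "MuPDF": {"pdf_open", "pdf_read"},
    "OpenCV": {"cv_imread", "cv_resize", "cv_blur", "cv_warp"},
}
index = build_flat_index(oss_features)
detected = match(index, oss_features, {"pdf_open", "pdf_read", "cv_imread", "main"})
```

Every (feature, OSS, version) combination becomes its own entry in such a table, which is where the 𝑂(𝑁𝐹𝑉) growth comes from.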
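Partial builds and internal code clones
[Figure: source trees of MuPDF and OpenCV at repo, dir, and file level, with LibJPEG and LibPNG embedded as third-party code. Directories shown include source, thirdparty, 3rdparty, modules/core, pdf, fitz, test, and src; files shown include pdf-lex.c, opengl.cpp, test-io.cpp, test-dev.cpp, jpeglib.h, png.c, and pngtest.c]
Internal code clones confuse third-party code with core code and require a high match ratio to filter
Partial builds (e.g. examples, tests) cause the match ratio to be low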
Hierarchical indexing and matching
• Hierarchical Indexing
  • Records source hierarchy to track internal clones
  • Uses Simhash algorithm to generate ids for non-leaf nodes for deduplication
  • Records unique features across versions via separate lists
• Hierarchical Matching
  • NormScore (TF-IDF based) to promote unique parts when computing the matching ratio of a node
  • Allows partial builds by skipping nodes with low ratio
  • Drops internal code clones by skipping nodes likely to be third-party
[Figure: hierarchical index mapping feature1 to feature3 onto file1 to file3, then dir 1 to dir 5, then MuPDF, OpenCV, and LibPNG, matched against edu.gatech.example]
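To make the Simhash step concrete, here is a minimal sketch of computing a Simhash id for a non-leaf node from the features beneath it. The 64-bit width and the use of MD5 as the per-feature hash are assumptions for illustration; the slides only say that Simhash generates ids for non-leaf nodes.

```python
import hashlib

def simhash(features, bits=64):
    """Combine per-feature hashes into one fingerprint, bit by bit."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set in the fingerprint when most features set it.
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Two copies of the same (hypothetical) directory get the same id,
# regardless of the order in which their features are visited.
dir_a = simhash(["png_read", "png_write", "png_error"])
dir_b = simhash(["png_error", "png_read", "png_write"])
```

Identical subtrees therefore collapse to one id, and similar subtrees tend to have ids with a small Hamming distance, which is what makes the scheme useful for deduplicating cloned directories.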
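Cross-match of unique version features
[Figure: unique features (version strings 1.5.0, 1.6.0, and 1.2.46, plus foo_string and int bar_func()) mapped to MuPDF V1.5 and V1.6 and to LibPNG V1.2.46 and V1.6.0; edu.gatech.example contains MuPDF V1.6 and LibPNG V1.2.46]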
Context-based filtering
• Leverage context information in hierarchical indexing table
• Use NormScore to assign different weights to features
[Figure: MuPDF V1.6 (pdf.c: 1.6.0, int pdf_read()) and LibPNG V1.6.0 (png.c: 1.6.0, int png_read()); edu.gatech.example contains MuPDF V1.6 and LibPNG V1.2.46]
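One way to picture context-based filtering: a version-unique feature only counts as evidence when enough of the other features from the same source file are also present in the app. The sketch below is an assumption-laden toy, not the OSSPolice algorithm; the index layout, the 0.75 cutoff, and names such as `png_new_api` and `png_old_api` are hypothetical, and feature weighting (NormScore) is left out.

```python
def pinpoint_versions(app_features, index, min_context=0.75):
    """Count unique-feature hits per version, gated by file-level context.

    index: {(oss, version): {file: (context_features, unique_features)}}
    """
    evidence = {}
    for version, files in index.items():
        for context, unique in files.values():
            if not context:
                continue
            context_ratio = len(context & app_features) / len(context)
            hits = unique & app_features
            if context_ratio >= min_context and hits:
                evidence[version] = evidence.get(version, 0) + len(hits)
    return evidence

index = {
    ("MuPDF", "1.6"): {"pdf.c": ({"pdf_read", "pdf_open"}, {"1.6.0"})},
    ("LibPNG", "1.6.0"): {"png.c": ({"png_read", "png_new_api"}, {"1.6.0"})},
    ("LibPNG", "1.2.46"): {"png.c": ({"png_read", "png_old_api"}, {"1.2.46"})},
}
app = {"1.6.0", "1.2.46", "pdf_read", "pdf_open", "png_read", "png_old_api"}
versions = pinpoint_versions(app, index)
```

The string "1.6.0" appears in the app, but only MuPDF's file context is matched well enough for it to count, so LibPNG 1.6.0 is not reported.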
Evaluation
• F-Droid Apps
  • 4,469 apps, 579 with native libraries
  • 295 C/C++ OSS uses, 7,055 Java OSS uses
• BAT: internal code clones
• LibScout: partial builds (code removal)
[Figure: C/C++ OSS Evaluation Results, OSSPolice vs BAT: Precision (%), Recall (%), Version Precision (%)]
[Figure: Java OSS Evaluation Results, OSSPolice vs LibScout: Precision (%), Recall (%), Version Precision (%)]
Chart annotations: 55 matches, 478 matches, 295 matches
Dataset
• C/C++ OSS from GitHub
  • 3,119 popular repos and 60,450 OSS versions
  • 29% repos are GPL/AGPL
  • 11% repos are vulnerable with 5,611 severe CVEs (CVSS ≥ 4.0)
• Java OSS from Maven and JCenter
  • 4,777 popular artifacts, 77,308 artifact versions
  • 2.3% artifacts are GPL/AGPL
  • 1.7% artifacts are vulnerable with 452 severe CVE ids
• Android Apps from Google Play
  • 1.6M apps, 515,812 with native libraries
Performance and Scalability
• Indexing
  • 60,450 C/C++ repos and 77,308 Java repos
  • Time cost is 1000s vs. 40s on average
  • Memory grows sublinearly to 30GB and 9GB, respectively
• Matching
  • Sampled 10,000 Google Play apps
  • 80% of dex and so files finish within 100s and 200s, respectively
[Figure: memory usage (GB) vs. number of indexed repos (thousands), for C/C++ and Java]
Popular libraries
• Long-tailed distribution of OSS uses
[Figure: Top 10 detected Java OSS excluding Android and Google OSS; # uses aggregated by types: Utils, Network, Social, Image, Codec]
[Figure: Top 10 detected C/C++ OSS; # uses aggregated by types: Codec, Game, Font, Network, Audio, Viewer]
Legal Risks
• More than 40K potential GPL violators
  • More violators using C/C++ than Java, and encoding libraries dominate
[Figure: Top 5 offended Java OSS (iTextPDF, Java Connector, GreenDAO Generator, Proguard, Weka-Dev); # uses aggregated by types: Codec, Utils, Compiler]
[Figure: Top 5 offended C/C++ OSS (MuPDF, FFmpeg, PJSIP, VLC and X264, BZRTP); # uses aggregated by types: Codec, Communication]
Legal Risks
• Why violating GPL/AGPL?
  • MuPDF and iTextPDF are used due to lack of free alternatives
• OSS developers' responses
  • MuPDF got new customers :)
  • FFmpeg and VideoLAN have interest, but FFmpeg cannot enforce :)
  • PJSIP not interested due to NDA, iText did not reply :(
• Awareness of OSS licensing terms
  • None of the app developers provided source code yet :(
Security Risks
• More than 100K apps using vulnerable OSS versions
  • More apps using vulnerable C/C++ OSS than Java
[Figure: Top 6 C/C++ and 4 Java vulnerable OSS]
1,244 LibPNG and 4,919 OpenSSL uses are not detected by App Security Improvement Program (ASIP)
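Once a version has been pinpointed, flagging a 1-day risk comes down to comparing it with the first release that fixes a known vulnerability. The sketch below assumes purely numeric dotted versions and takes the first patched version as an input; the version numbers in the example are placeholders, not real advisory data, and schemes with letter suffixes would need their own parser.

```python
def parse_version(text, width=4):
    """Turn '1.2.46' into (1, 2, 46, 0) so that '1.6' and '1.6.0' compare equal."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))

def uses_vulnerable_version(detected, first_patched):
    """True when the detected version predates the first patched release."""
    return parse_version(detected) < parse_version(first_patched)

# Placeholder versions for illustration only.
stale = uses_vulnerable_version("1.2.46", "1.6.0")
current = uses_vulnerable_version("1.6", "1.6.0")
```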
Security Risks
• Which versions of OSS do app developers choose?
  • Both vulnerable and patched OSS are being used
• What causes the update of OSS versions?
  • ASIP mitigates vulnerable OSS usage, but it still remains a problem
[Figure: Timeline of OSS usage for the top 10K apps, 300K app versions. Panels: MoPub, OpenSSL, OkHttp, FFmpeg; x-axis: Date (2013-05-12 to 2016-08-24); series: # Vuln. Usage, # Patched Usage; markers: ASIP Notification, ASIP Deadline]
Discussion
• Checking license compliance requires manual efforts
• Obfuscation and optimization
  • String encryption in dex files
  • Function hiding in so files
• Version pinpointing
  • Not all versions can be uniquely identified
• More programming languages (e.g. JS, Python) and platforms (e.g. iOS)
Conclusion
• OSSPolice: an accurate and scalable tool to identify license violations and 1-day security risks
  • Hierarchical indexing and matching scheme
  • Context-based unique feature filtering
• A large scale measurement
  • 1.6M free Google Play Store apps
  • 40K cases of potential GPL/AGPL violations and 100K apps using vulnerable OSS
• Interesting insights
  • Developers violate GPL/AGPL due to lack of free alternatives
  • App developers use vulnerable OSS versions despite efforts from Google