analysis and clustering of model clones: an automotive ...we apply our model clone detection tool,...

Analysis and Clustering of Model Clones:An Automotive Industrial Experience

Manar H. Alalfi, James R. Cordy, Thomas R. DeanSchool of Computing, Queen’s University, Kingston, Canada

Email: {alalfi, cordy, dean}@cs.queensu.ca

Abstract—In this paper we present our early experienceanalyzing subsystem similarity in industrial automotive models.We apply our model clone detection tool, SIMONE, to identifyidentical and near-miss Simulink subsystem clones and clusterthem into classes based on clone size and similarity threshold.We then analyze clone detection results using graph visualizationsgenerated by the SIMGraph, a SIMONE extension, to identifysubsystem patterns. SIMGraph provides us and our industrialpartners with new interesting and useful insights that improvesour understanding of the analyzed models and suggests betterways to maintain them.

I. INTRODUCTION

In todays automotive industry, models are widely used togenerate production software code. A modern automobile mayhave 100 million lines or more of production software sourcecode on board, and up to 80% of the code deployed onup to 100 embedded control units can be generated frommodels specified using domain-specific formalisms such asMatlab/Simulink [1]. The size and complexity of software inthe embedded domain, especially in the automotive softwaresystems, is expanding rapidly, while the innovation cyclelength is decreasing with high cost pressure and large numbersof product-line variants. Consequently, software developmentin this domain has adopted a highly reuse-oriented approach,where general purpose domain specific libraries with elementssuch as PID-controllers are being reused in the manufactureof many new software components. Dealing with this com-plexity and the high frequency of software reusability requiressophisticated software tools to manage the massive amountsof information used by engineers in software developmentprojects.

This rapid growth has led to challenges that are already wellknown from classic programming languages. In particular,the presence of copied program elements or “clones”, whichcan affect productivity and software maintenance, is alsomanifested in models. Thus the identification of common orsimilar elements in different parts of the software is importantto the model-based development process.

To address this need, we have developed a method andtoolset called SIMONE [2] that uses clone detection for theanalysis and formalization of subsystem similarity in industrialmodels. In this paper we present our first experience in theanalysis and pattern extraction of subsystem patterns in a setof production Simulink automotive models, using visualizationtechniques to provide insight. Our method is intended to assist

reuse in model development in a number of ways: standardsand consistency analysis or enforcement in model mainte-nance; failure and change propagation in model maintenance;and verification and test optimization in model testing. Weenvisage this work as a fundamental enabler for the futureof higher level modeling, model transformations and meta-modeling frameworks, customized to specific domains.

II. APPROACH

This paper describes our first experience in applying ouranalysis method in an empirical study of an set of actualproduction models from our industrial partners at GeneralMotors, using pattern mining and clone detection technologiesto discover a catalog of repeated subsystem patterns. Wehave automated much of this first step using subsystem cloneclasses from our tool SIMONE, a near-miss clone detectorfor Simulink models [2], from which we derive a first ap-proximation of the pattern set. Our plan is to organize thesediscovered patterns into a taxonomy with the goal of coveringall of the subsystem patterns in the models. In this paperwe introduce SIMGraph, a graph visualization extension forSIMONE results, which helps us to visualize and understandSimulink subsystem clones and patterns in a more intuitive andunderstandable way. In the following sections we will discussin more detail the phases of our analysis, and our case studyanalyzing a set of industrial automotive Simulink models.

III. CLONE IDENTIFICATION

Our analysis consists of three phases. In the first, “discov-ery” phase of our analysis, our primary goal is the discoveryand identification of common subsystem patterns in an exam-ple production model set obtained from our industrial partnersat General Motors.

For that purpose, we have used our previously-developedmethod for leveraging text-based code clone detectors to findnear-miss clones in graphical models, and have demonstratedSIMONE [2], an implementation for Simulink models basedon the NICAD clone detector [3]. In that work, we outlinedthe challenges of using a parser- and text-based method ongraphical models, described our solutions using filtering andsorting of the textual representation, and compared our resultsto a state-of-the-art graph-based method, showing that ournear-miss detection can find meaningful subsystem clones thatgraph-based methods can miss. Our approach generalizes to

978-1-4799-3752-3/14 c© 2014 IEEE CSMR-WCRE 2014, Antwerp, BelgiumIndustry Track

Accepted for publication by IEEE. c© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

375

GM Fuel System ModelsSubsystem similarity overview

FAFR models (red)

FCBR models (green)

FDBR models (blue)

Large subsystems

Midsize subsystem

Small subsystems

Many subsystems unique - not similar to any others

in these models

Fig. 1. Subsystem Similarity in Fuel System Models

other modeling languages, including UML-based ones, withonly customization of the filtering and sorting algorithms.

IV. SUBSYSTEM CLONE CLUSTERING

In the second, “clustering” phase, we aim to organize andgeneralize the identified concrete subsystem clone classes intoa formal taxonomy of generic patterns that can cover themodels of the example model set. SIMONE generates an initialclustering of similar subsystems into model patterns basedon the NICAD algorithm for clustering near-miss clones into“clone classes”, using a dynamic clustering based on the sizeand percentage difference between identified subsystem clonepairs. In this clustering, a database of clone pairs orderedby size is generated after which the clustering is formedwithout any post processing. Clone classes are then formedby selecting the largest remaining clone as an “exemplar” ordistinguished representative of its clone class, and gatheringall clone pairs within a given difference threshold (30% in thisexperiment) into the cluster.

V. SUBSYSTEM CLONE VISUALIZATION

The NICAD clone detector generates clone reports in twotextual formats, XML and HTML. It generates two kinds ofreports, one that presents cloned subsystems as pairs, andanother that presents them as clusters (“clone classes”) asdescribed above. While these textual reports are useful foranalyzing small similarities in source code, they are not theright format for understanding large-scale similarity in graphi-cal models. Thus we have developed SIMGraph, an extensionto SIMONE, which presents clone results as graphs. We usedsource transformation technology to transform SIMONE’sNICAD textual reports into graph files that can be imported byGephi [4], an open-source tool for visualizing and analyzinglarge network graphs of complex systems. Gephi has manyinteresting features, including the ability to present dynamicand hierarchical graphs, as well as interactive filtering andclustering features.

GM Fuel System ModelsRemove unique subsystems Connecting lines represent subsystem similarity -

thick lines, 90-100% similarthin lines, 70-80% similar

Fig. 2. Subsystem Similarity with Unique Subsystems Removed

A. Subsystem Clone Representation

To create a more intuitive representation of SIMONE resultswe generated graphs using the following mappings:

• Subsystem Model: Each model is represented by a colour,and its subsystems are coloured to indicate the model theybelong to.

• Subsystem Size: Each subsystem of a model is repre-sented by a circle whose size is proportional to the sizeof the subsystem, larger subsystems represented by largercircles.

• Subsystem Similarity: Subsystem clones are representedby a line joining the subsystem circles, the thicknessof which represents the level of similarity betweenthem, ranging from 90-100% (thickest lines) to 70-80%(thinnest).

B. Levels of Clone Abstraction

We generated two types of Gephi graphs to represent theresults at two levels of abstraction:

• Subsystem Clone Pairs: Subsystem clone results as linesconnecting similar subsystems (Figures 1 and 2).

• Subsystem Clone Clusters: Subsystem clone resultsgrouped into categories based on clustering and catego-rization algorithms (Figures 3 and 4).

VI. CASE STUDY

In this section we present our analysis of a productionindustrial system obtained from our industrial partners at GM,the Fuel System Simulink models. Applying SIMONE withoutany normalization other than sorting and filtering to thesemodels extracted 1,091 subsystems from which it identified245 subsystem clones, initially categorized into 35 clusters.We refined these results and clustering by running SIMONEusing a blind renaming normalization, which yields modelclones that are similar in structure, ignoring name differencesin embedded model elements. The number of clones identified

376

GM Fuel System ModelsRearrange to cluster similar subsystems Clusters reveal groups

of similar subsystems -called “clone classes”

FCBR Determine Closed Loop Fuel Pump Duty Cycle(2 similar copies)

FCBR Controller Scheduling

(2 similar copies, 75% similar copy

in FAFR)

FAFR RAM Diagnostic (several similar copies, ROM Diagnostic, FFPM Controller Watchdog Diagnostic, FDBR Fuel

Senor High Diagnostic, ...)

Identical copies of GMPT Operating System Tasks in

FAFR, FCBR, FDBR

FCBR Fuel Pressure Device Control Request,

75% similar to FCBR Fuel Pump Device Control

Request

Fig. 3. Subsystem Clone Pairs Clustered by Similarity

with similarity 70%-100% then increases to 1,147 clonesclustered into 30 clone classes.

Using SIMGraph, we generated a graph to represent sub-system clone results, which intuitively presents the level ofsimilarity between the identified subsystem clones as well assubsystem sizes. The analyzed Fuel System consists of threemain models, FAFR, FCBR and FDBR. When generatingthe graph, we set the colour attribute to be different foreach of these models, thus subsystems of FAFR are colouredin red, FCBR in green, and FDBR in blue. Subsystems ofthese models are represented as circles, where the circle sizecorresponds to the subsystem size (lines of text in the internalSimulink representation of the subsystem).

Figure 1 shows the resulting graph. Subsystems that have asimilarity relation are connected, and unique subsystems arenot connected. Figure 2 shows the graph after we filter outthese unique subsystems, since they are not of interest froma clone detection prospective. For example, the large greensubsystems on the right of Figure 1 disappear in Figure 2 sincethey are unique. We can also observe the level of similaritybetween subsystem clone pairs, represented by the thicknessof the lines connecting them. The thicker the connecting linethe more similar the subsystem pairs are.

While Figures 1 and 2 show the cloning relation at thelevel of subsystem clone pairs, a more interesting view, theclustering view, better reveals the similarity between groups ofsubsystems (Figures 3 and 4). Figure 3 is a layout of the graphof Figure 2 clustered using the Fruchterman Rheingold clus-tering algorithm to better reveal groups of similar subsystems.Figure 4 on the other hand shows the NICAD clone classes,a closure of the clone pair relations into cliques representingrepeated subsystem patterns. This better reveals the relativesize and distribution of subsystem clone classes.

GM Fuel System ModelsInfer common subsystem patterns

Patterns characterize common repeated similar subsystem

paradigms

Large groups of small to mid-sized similar subsystems

across models

Small groups of relatively large

similar subsystems both within and across models

Small differences could be worrisome - in this case Input Cct Diagnostics in

FAFR, FCBR

Anonymous subsystems handled - here Fuel Pump Duty Cycle subsystems

in FCBR similar to unnamed subsystems in

FDBR

Fig. 4. NICAD Clone Classes: Closure into Subsystem Patterns

VII. ANALYSIS INSIGHTS

SIMgraph arranges that the names and file paths of the sub-systems are stored as attributes of the nodes of the generatedgraphs. Thus the identities of the subsystems and models canbe explored in Gephi simply by clicking on the nodes of thegraph. Using this technique, we can gain deeper insight aboutwhat we are seeing.

• Inferring common subsystem patterns: The NICAD clus-tering shown in Figure 4 provides us with an initialinsight about model pattern characterization, since similarsubsystems have been partitioned into separate categories,with a distinguished element serving as the exemplarof the group. These categories will form the basis ofour pattern formalization in building a subsystem patterntaxonomy for the GM models.

• Identifying similarities across models: Visualization theclone relationships helps us to identify not only similarsubsystems within models, but also across models. This ismade possible using the colouring scheme to distinguishthe different models. For example, on the left of Figure 3we can see that there are two similar copies of DetermineClosed Loop Fuel Pump Duty Cycle in the FCBR model,since the two circles have the same green colour. On theright of the same Figure we can see three medium-sizedcircles of different colours connected to each other by athick line, which reveals that there are identical copies ofa medium sized subsystem, the GMPT Operating SystemTasks, used in all three models, FAFR, FCBR and FDBR.

• Handling anonymous subsystems: This an importantfeature of SIMONE, by which anonymous (unnamed)Simulink subsystems are identified and grouped withother similar subsystems. This allows engineers to iden-tify these subsystems as instances of other known namedsubsystems, and possibly name or refactor them to docu-ment the relation. For example, on the right of Figure 4 ananonymous subsystem in the FDBR model is identified

377

as similar to the Fuel Pump Duty Cycle subsystems ofFCBR.

• Identifying potential maintenance problems: The graphvisualization can also help engineers identify potentialmaintenance problems. For example, similar subsystemswith the same name across models connected by a thinline indicate that even though the subsystems have thesame name they are only 70% similar. The bottom leftof Figure 4 shows an example on this case, where thethin connections between the Cct Diagnostics subsystemsof FAFR and FCBR indicate that they are not nearly asimilar as might be expected.

VIII. PROPOSED USAGE SCENARIOS

Potential general usage scenarios for our subsystem patternanalysis include: the automated discovery of common idioms,both global to the company and local to a project or do-main, model optimization, potential bug discovery (inconsis-tent changes to instances of a pattern), functional and securitycompliance analysis, modeling standards enforcement, modellibrary creation and reuse, and reuse of implementation anddeployment processes and alternatives.

At a meeting to present our tools and results to GM engi-neers, they showed a keen interest in our approach, and theyfound the visual presentation useful as a first-level overviewof the results our tools uncovered. While not intended foreveryday model clone management (our SimNav interface [5]serves that purpose), team leaders found this overview usefulin assessing the overall situation in a system. The meeting alsohelped our GM industrial partners to identify a more specificlist of applications tailored to their own goals:

• Refining / adjusting / changing modeling guidelines toeither prohibit common problem constructs, or encourageother constructs.

• Identifying new candidate constructs for addition to theGM block library. The block library not only providesprimitives (native Simulink blocks), but also commonsubsystem constructs for easy re-use (e.g., basic firstorder lag filters).

• A set of common model patterns provides designers witha choice of alternatives. These alternatives can be ex-plored by designers to identify which take best advantageof MATLAB’s optimizations.

IX. RELATED WORK IN CLONE PATTERN VISUALIZATION

Dang al el. [6] report their experience at Microsoft re-search using their XIAO code clone detection approach ina real development setting. Like us, the authors present theimportance of presenting clone results to the engineers inan understandable and intuitive way, as well as the need fornear-miss clone detection techniques with sorting and filteringcapabilities, which will help engineers to identify clones ofinterest and eliminate any irrelevant clones.

Ader and Kim [7] argue the importance of using graphs tohelp understand code clone results in the context of parametersand relationships, such as relation degree, size, and colour.They have developed their tool SoftGUESS for that purpose.

Yoshimura and Mibe [8] present their experience in visualiz-ing code clones in an industrial setting, analyzing an enterprisebusiness application at Hitachi. The authors emphasized theimportance of tool support for large-scale systems and asuitable visualization of clone results to help stakeholdersunderstand their systems.

All of the above approaches present support for using visu-alization to understand code clones, while our tool, SIMGraph,is specifically aimed at near-miss model clones.

X. CONCLUSIONS AND FUTURE WORK

In this paper, we presented an initial analysis of productionautomotive models using our SIMONE model clone detectiontool and its SIMGraph visualization extension. We presenteda case study and some insights resulting from it, identifyingpotential usage scenarios with the help of the initial feedbackreceived from our industrial partners. As future work, we planto refine the initial clustering observed by our tools and to usepattern matching techniques adapted from source code analysisresearch to encode the taxonomy elements and organize theminto a partitioning engine. The pattern set will then be back-validated by applying it to the analysis of the original examplemodel set, and then applied to a larger set of industrial modelsto test completeness.

ACKNOWLEDGEMENTS

This work is supported by NSERC as part of the NECSISAutomotive Partnership with General Motors, IBM Canadaand Malina Software Corp.

REFERENCES

[1] M. Jungmann, R. Otterbach, and M. Beine, “Development ofsafety-critical software using automatic code generation,” in SAEWorld Congress, 2004.

[2] M. H. Alalfi, J. R. Cordy, T. R. Dean, M. Stephan, andA. Stevenson, “Models are code too: Near-miss clone detectionfor simulink models,” in ICSM, 2012, pp. 295–304.

[3] C. K. Roy and J. R. Cordy, “Nicad: Accurate detection of near-miss intentional clones using flexible pretty-printing and codenormalization,” in ICPC, 2008, pp. 172–181.

[4] B. Hauptmann, V. Bauer, and M. Junker, “Using edge bundleviews for clone visualization,” in IWSC, 2012, pp. 86–87.

[5] J. R. Cordy, “Submodel pattern extraction for Simulink models,”pp. 7–10, 2013.

[6] Y. Dang, S. Ge, R. Huang, and D. Zhang, “Code clone detectionexperience at microsoft,” in IWSC, 2011, pp. 63–64.

[7] E. Adar and M. Kim, “Softguess: Visualization and explorationof code clones in context,” in ICSE, 2007, pp. 762–766.

[8] K. Yoshimura and R. Mibe, “Visualizing code clone outbreak:An industrial case study,” in IWSC, 2012, pp. 96–97.

378

analysis and clustering of model clones: an automotive ...we apply our model clone detection tool,...

Documents