conference review presented by: ivan matosevic

CGO 2006: The Fourth International Symposium on Code Generation and Optimization

New York, March 26-29, 2006

Conference Review

Presented by: Ivan Matosevic

Outline

Conference overview

Brief summaries of sessions

Keynote speeches

Best paper

Conference Overview

Primary focus: back-end compilation techniques

• Static analysis and optimization

• Profiling

• Run-time techniques

8 sessions, 29 papers

Dominating topics: multicores, dynamic compilation

Overview of Sessions

1. Dynamic Optimization

2. Object-Oriented Code Generation and Optimization

3. Phase Detection and Profiling

4. Tiled and Multicore Compilation

5. Static Code Generation and Optimization Issues

6. SIMD Compilation

7. Optimization Space Exploration

8. Security and Reliability

Session 1: Dynamic Optimization

Kim Hazelwood (University of Virginia), Robert Cohn (Intel), A Cross-Architectural Interface for Code Cache Manipulation

• Pin dynamic instrumentation system with code cache

• The paper describes an API for various operations with the code cache (callbacks, lookups, statistics, etc.)

Derek Bruening, Vladimir Kiriansky, Tim Garnett, Sanjeev Banerji (Determina Corporation), Thread-Shared Software Code Caches

• Problem: sharing a code cache across multiple threads

• Authors propose a fine-grained locking scheme

• Evaluation using DynamoRIO

Session 1: Dynamic Optimization

Keith Cooper, Anshuman Dasgupta (Rice Univ.), Tailoring Graph-coloring Register Allocation For Runtime Compilation

• Problem: register allocation in JIT compilers

• Authors propose a novel lightweight graph-colouring technique

Weifeng Zhang, Brad Calder, Dean Tullsen (UC San Diego), A Self Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

• Extension of the Trident event-driven dynamic optimization framework (previously proposed by the same authors)

• Dynamic insertion of prefetching instructions based on run-time analysis

Session 2: Object-Oriented Code Generation and Optimization

Suresh Srinivas, Yun Wang, Miaobo Chen, Qi Zhang, Eric Lin, Valery Ushakov, Yoav Zach, Shalom Goldenberg (Intel Corporation), Java JNI Bridge: An MRTE Framework for Mixed Native ISA Execution

• Uses a dynamic translator to execute native (JNI) code compiled for one ISA on a Java platform running on a different ISA

Kris Venstermans, Lieven Eeckhout, Koen De Bosschere (Ghent University), Space-Efficient 64-bit Java Objects through Selective Typed Virtual Addressing

• Use address bits on a 64-bit architecture to encode object type in order to save memory

• Objects of the same type allocated in a contiguous (virtual) region
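The idea of recovering an object's type from its address can be sketched as follows. This is a minimal illustration, assuming a hypothetical layout in which the upper 16 bits of a 64-bit virtual address hold a type ID and the lower 48 bits hold the offset within that type's region. The bit widths and function names are assumptions made for this example and are not taken from the paper.

```python
TYPE_BITS = 16                     # assumed width of the type field (hypothetical)
OFFSET_BITS = 64 - TYPE_BITS       # remaining bits address within the type's region
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_address(type_id: int, offset: int) -> int:
    """Build a 64-bit address whose high bits identify the object type."""
    if not 0 <= type_id < (1 << TYPE_BITS):
        raise ValueError("type_id out of range")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset out of range")
    return (type_id << OFFSET_BITS) | offset


def type_of(address: int) -> int:
    """Recover the type from the address alone, with no per-object header."""
    return address >> OFFSET_BITS


def offset_of(address: int) -> int:
    """Recover the position of the object inside its type's region."""
    return address & OFFSET_MASK
```

Because all objects of one type share the same high address bits, no type field needs to be stored in each object, which is where the memory saving would come from under these assumptions.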

Session 2: Object-Oriented Code Generation and Optimization

Daryl Maier, Pramod Ramarao, Mark Stoodley, Vijay Sundaresan (IBM Canada), Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler

• The IBM TestaRossa JIT compiler

• This paper focuses on code patching and profiling in a multi-threaded environment with a lot of class loading/unloading

Lixin Su, Mikko H Lipasti (University of Wisconsin Madison), Dynamic Class Hierarchy Mutation

• Run-time reassignment of an object from one derived class to another by changing its virtual table

• Offers opportunity for optimizations based on specialization
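As a rough analogue of class mutation, Python allows an object's class to be reassigned at run time through `__class__`, which switches the methods that subsequent calls dispatch to. The sketch below uses hypothetical `Rectangle` and `Square` classes and a hypothetical `mutate` helper; it only illustrates the general idea and is not the mechanism described in the paper.

```python
class Rectangle:
    def __init__(self, width: float, height: float) -> None:
        self.width = width
        self.height = height

    def area(self) -> float:
        return self.width * self.height      # general code path


class Square(Rectangle):
    """Specialized subclass: only valid while width == height."""

    def area(self) -> float:
        return self.width ** 2               # specialized code path


def mutate(obj: Rectangle) -> None:
    """Reassign obj to the class that matches its current state."""
    obj.__class__ = Square if obj.width == obj.height else Rectangle


shape = Rectangle(2, 2)
mutate(shape)          # state matches the specialization, so dispatch to Square
shape.width = 3
mutate(shape)          # state changed, so fall back to the general class
```

The specialized method can assume the invariant that triggered the mutation, which is the kind of optimization opportunity the bullet above refers to.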

Session 3: Phase Detection and Profiling

Priya Nagpurkar (UCSB), Michael Hind (IBM), Chandra Krintz (UCSB), Peter Sweeney, V.T. Rajan (IBM), Online Phase Detection Algorithms

• Detecting phase behaviour in virtual machines

• Track dynamic program parameters (methods invoked, branch directions…) over time and apply a similarity model
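One simple way to picture "track parameters over time and apply a similarity model" is sketched below. It is an assumption-laden illustration rather than the authors' algorithm: the trace is cut into fixed-size windows, each window is reduced to the set of elements seen (for example, method IDs), and a phase change is reported when the Jaccard similarity to the previous window drops below a threshold. The window policy, the similarity measure, and all names are hypothetical choices.

```python
from typing import Hashable, Iterable, List, Optional, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def detect_phase_changes(
    trace: Iterable[Hashable], window: int, threshold: float
) -> List[int]:
    """Return indices of windows that differ markedly from their predecessor."""
    items = list(trace)
    changes: List[int] = []
    previous: Optional[Set[Hashable]] = None
    for start in range(0, len(items) - window + 1, window):
        current = set(items[start:start + window])
        if previous is not None and jaccard(previous, current) < threshold:
            changes.append(start // window)
        previous = current
    return changes


# Hypothetical trace of invoked methods: two windows of a/b, then two of c/d.
trace = ["a", "b"] * 4 + ["c", "d"] * 4
print(detect_phase_changes(trace, window=4, threshold=0.5))  # [2]
```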

Jeremy Lau, Erez Perelman, Brad Calder (UC San Diego), Selecting Software Phase Markers with Code Structure Analysis

• Portions of code whose execution correlates with phase changes

• Procedure calls and returns, loop boundaries

• Profile-based hierarchical loop-call graph

Session 3: Phase Detection and Profiling

Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara), Profiling over Adaptive Ranges

• Voted best paper – details later

Hyesoon Kim, Muhammad Aater Suleman, Onur Mutlu, Yale N. Patt (UT-Austin), 2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set

• Predicts whether the prediction accuracy of each branch will vary across input sets

• Heuristic approach used to derive representative profiling results from a single input set

Session 4: Tiled and Multicore Compilation

David Wentzlaff, Anant Agarwal (MIT), Constructing Virtual Architectures on a Tiled Processor

• Map components of a superscalar architecture (Pentium III) onto a parallel tiled architecture (Raw) using dynamic translation

• In a way, uses Raw as a coarse-grain FPGA

Aaron Smith (UT-Austin), J. Burrill (UMass at Amherst), J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, K. S. McKinley (UT-Austin), Compiling for EDGE Architectures

• TRIPS EDGE (Explicit Data Graph Execution) architecture

• This paper focuses on compilation of standard C and FORTRAN benchmarks

Session 4: Tiled and Multicore Compilation

Shih-wei Liao, Zhaohui Du, Gansha Wu, Guei-Yuan Lueh (Intel), Data and Computation Transformations for Brook Streaming Applications on Multiprocessors

• Parallel compiler for the Brook streaming language

• An extension of C that enables specifying data parallelism

Michael L. Chu, Scott A. Mahlke (University of Michigan), Compiler-directed Object Partitioning for Multicluster Processors

• Partitioning of data in clustered architectures such as Raw

• I didn’t really understand what programming model the authors have in mind.

Session 5: Static Code Generation and Optimization Issues

Two papers about the HP-UX Itanium compiler:

• Dhruva R. Chakrabarti, Shin-Ming Liu (Hewlett-Packard), Inline Analysis: Beyond Selection Heuristics

  • Cross-module techniques for selection of inlined call sites and the choice of specialized function versions

• Robert Hundt, Dhruva R. Chakrabarti, Sandya S. Mannarswamy (Hewlett-Packard), Practical Structure Layout Optimization and Advice

  • Data layout and placement on the heap to improve locality

  • Structure splitting, structure peeling, dead field removal, and field reordering

Session 5: Static Code Generation and Optimization Issues

Chris Lupo, Kent Wilken (University of California, Davis), Post Register Allocation Spill Code Optimization

• Authors propose a profile-based algorithm for placement of save/restore instructions handling spilled variables in function calls

• Implemented as a part of GCC

Seung Woo Son, Guangyu Chen, Mahmut Kandemir (Pennsylvania State University), A Compiler-Guided Approach for Reducing Disk Power Consumption by Exploiting Disk Access Locality

• Goal: restructure code so that disk idle periods are lengthened

• The approach targets array-based programs: disk layout of array data exposed to the compiler

Session 6: SIMD Compilation

Jianhui Li, Qi Zhang, Shu Xu, Bo Huang (Intel China Software Center), Optimizing Dynamic Binary Translation for SIMD Instructions

• Algorithms for dynamic binary translation of SIMD instructions in general-purpose architectures (such as MMX in x86)

• Evaluation using IA-32 binaries on Itanium 2

Dorit Nuzman (IBM), Richard Henderson (Red Hat), Multi-Platform Auto-Vectorization

• Implementation of automatic vectorizer for GCC 4.0

Session 7: Optimization Space Exploration

Felix Agakov, Edwin Bonilla, John Cavazos, Bjoern Franke, Grigori Fursin, Michael O'Boyle, Marc Toussaint, John Thomson, Chris Williams (U. of Edinburgh), Using Machine Learning to Focus Iterative Optimization

• Predictive modelling used to search the optimization space

• Targets embedded platforms – AMD Au1500 and Texas Instruments C6713

Prasad Kulkarni, David Whalley, Gary Tyson (Florida State University), Jack Davidson (University of Virginia), Exhaustive Optimization Phase Order Space Exploration

• Exhaustive search of the phase order space (15 phases) using aggressive pruning; takes time on the order of minutes to hours

• Targets StrongARM SA-100
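The pruning that makes an exhaustive search tractable can be sketched as a breadth-first enumeration of distinct code instances. In this minimal illustration, a phase that leaves the code unchanged is skipped, and an instance already reached through a different phase order is not expanded again. The `explore` function and the toy phases are hypothetical stand-ins and do not reproduce the authors' implementation.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Set

Phase = Callable[[Hashable], Hashable]


def explore(initial: Hashable, phases: Iterable[Phase]) -> Set[Hashable]:
    """Enumerate every distinct code instance reachable by any phase order."""
    phase_list = list(phases)
    seen = {initial}
    queue = deque([initial])
    while queue:
        code = queue.popleft()
        for phase in phase_list:
            result = phase(code)
            if result == code:     # phase had no effect here: prune
                continue
            if result in seen:     # same instance reached by another order: prune
                continue
            seen.add(result)
            queue.append(result)
    return seen


# Toy stand-in: "code" is a set of removable items; each phase removes one kind.
toy_phases = [
    lambda code: code - {"dead store"},
    lambda code: code - {"redundant copy"},
    lambda code: code - {"useless jump"},
]
start = frozenset({"dead store", "redundant copy", "useless jump"})
instances = explore(start, toy_phases)
print(len(instances))  # 8
```

In the toy case the search visits 8 distinct instances, although many more phase sequences lead to them; the search cost follows the number of distinct instances rather than the number of orderings.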

Session 7: Optimization Space Exploration

Zhelong Pan, Rudolf Eigenmann (Purdue University), Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning

• Problem: find the optimal combination of the 38 GCC -O3 options, targeting Pentium 4 and SPARC II

• Proposes a heuristic algorithm that provides a quality solution in time on the order of several hours

Session 8: Security and Reliability

Edson Borin (UNICAMP), Cheng Wang, Youfeng Wu (Intel), Guido Araujo (UNICAMP), Software-Based Transparent and Comprehensive Control-Flow Error Detection

• Addresses the problem of soft (transient) errors that cause branches to incorrect instructions

• Implemented in SW as a part of a dynamic binary translator

Tao Zhang, Xiaotong Zhuang, Santosh Pande (Georgia Tech), Compiler Optimizations to Reduce Security Overheads

• Optimizations that specifically target techniques that implement software protection with minimal HW support

Session 8: Security and Reliability

Susanta Nanda, Wei Li, Tzi-cker Chiueh (State University of NY at Stony Brook), BIRD: Binary Interpretation using Runtime Disassembly

• Goal: framework for automatic detection of vulnerabilities such as buffer overflows when the source code is not available

• Static and dynamic disassembly and instrumentation – targets Windows x86 applications

Keynote Speeches

Wei Li, Principal Engineer, Intel: "Parallel Programming 2.0"

Kevin Stoodley, Fellow and CTO of Compilation Technology, IBM: "Productivity and Performance: Future Directions in Compilers"

Wei Li: Parallel Programming 2.0

Major technological change:

• Moore’s Law continues to increase transistor counts

• However: power, memory latency, limits to ILP are setting an effective performance ceiling

General trend towards thread-level on-chip parallelism

• SMT

• Chip multiprocessors

“Parallel Programming 2.0” refers to the advent of multicores

A very optimistic future vision:

Key issue – where will the parallelism come from?

Parallel programming needs to become more mainstream

• Consumer vs. HPC/server/database

• Inclusion into education at more elementary level

• New tools for greater ease of programming

Intel’s parallel programming tools

• http://www.intel.com/software

K. Stoodley: "Productivity and Performance: Future Directions in Compilers"

Limits to traditional static compilation

Overview of IBM compiler technology

• Testarossa JIT compiler, Toronto Portable Optimizer, Tobey backend

Challenges at present and near future

• Software abstraction complexity – forces the scope of compilation to higher levels

• Maintaining high performance backwards compatibility increasingly difficult

K. Stoodley: "Productivity and Performance: Future Directions in Compilers"

Future: convergence/combination of dynamic and static compilation technologies

[Diagram: overview of IBM compilation technology. Labels: xlc, xlC, xlf; front ends; W-Code; Toronto Portable Optimizer (TPO); Profile-Directed Feedback (PDF); TOBEY backend; static machine code; class, class, jar; J9 Execution Engine (Java + others); Testarossa JIT; dynamic machine code; CPO; binary translation]

Best Paper

Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara): Profiling over Adaptive Ranges

Profiling over Adaptive Ranges

Problem: how to count specific events efficiently and accurately?

• Code segments executed

• Memory regions accessed

• IP addresses of routed packets

In all cases, impossible to maintain separate counters for the entire range of values

• Each basic block, memory address, IP address…

Trade-off: Precision vs. Efficiency

[Figure: two panels, "Unlimited counters" and "Uniform ranges"]

Profiling with uniform ranges fails to distinguish hot code

Higher Precision for Hot Regions

Good trade-off with limited resources:

• High precision for hot regions

• Low precision for colder ones, but this affects the accuracy less

Challenge: how to determine what exactly to count with what precision?

Solution: Adaptive Profiling

Start with one counter; split counters as they become hot:
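The splitting step can be sketched as a tree of range counters. This is a minimal illustration under stated assumptions: ranges are split into two halves, the split threshold is a fixed constant, and a node keeps its own count when it splits while its children start from zero. The class name `RangeNode` and these policies are hypothetical simplifications, not the heuristics from the paper.

```python
from typing import Iterator, List


class RangeNode:
    """Counter for the half-open value range [lo, hi)."""

    def __init__(self, lo: int, hi: int) -> None:
        self.lo, self.hi = lo, hi
        self.count = 0
        self.children: List["RangeNode"] = []

    def update(self, value: int, split_threshold: int) -> None:
        """Count one event at `value`, splitting the covering leaf if it is hot."""
        node = self
        while node.children:                     # descend to the finest range
            mid = node.children[0].hi
            node = node.children[0] if value < mid else node.children[1]
        node.count += 1
        if node.count > split_threshold and node.hi - node.lo > 1:
            mid = (node.lo + node.hi) // 2       # assumed: binary split
            node.children = [RangeNode(node.lo, mid), RangeNode(mid, node.hi)]

    def leaves(self) -> Iterator["RangeNode"]:
        """Yield the current finest-grained counters, left to right."""
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


root = RangeNode(0, 16)
for _ in range(100):
    root.update(5, split_threshold=4)
print([(n.lo, n.hi, n.count) for n in root.leaves()])
```

With every event hitting the value 5, the tree refines itself until a dedicated counter for [5, 6) exists, while the cold ranges remain coarse.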

Counter Merging

Problem: what if program behaviour changes after the initialization phase?

Counter Merging

Solution: perform counter merging along with splitting

Counter Merging

Counters of merged child nodes added to the parent

Counter Merging

Problem: how to identify nodes for merging?

• By definition, these are the nodes that are not updated frequently

Solution: periodic batched merge operations

• Tree depth grows at a logarithmic rate, so merges can be done at exponentially increasing intervals
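A batched merge pass might look like the sketch below. It assumes a simplified criterion: sibling leaves are folded into their parent when their combined count is below a fixed merge threshold, and the pass runs whenever the number of processed events reaches a power of two. The `Node` class, `merge_cold`, `is_merge_point`, and the threshold policy are hypothetical and stand in for the paper's actual heuristics.

```python
from typing import List, Optional


class Node:
    """Range counter with optional child counters for sub-ranges."""

    def __init__(self, lo: int, hi: int, count: int = 0,
                 children: Optional[List["Node"]] = None) -> None:
        self.lo, self.hi = lo, hi
        self.count = count
        self.children = children or []


def merge_cold(node: Node, merge_threshold: int) -> None:
    """Bottom-up pass: fold cold sibling leaves into their parent."""
    for child in node.children:
        merge_cold(child, merge_threshold)
    if node.children and all(not c.children for c in node.children):
        cold = sum(c.count for c in node.children)
        if cold < merge_threshold:
            node.count += cold        # children's counts are added to the parent
            node.children = []


def is_merge_point(n_events: int) -> bool:
    """True at 1, 2, 4, 8, ...: merge at exponentially growing intervals."""
    return n_events > 0 and n_events & (n_events - 1) == 0


# Hypothetical tree: a cold subtree under [0, 8) next to a hot leaf [8, 16).
cold_parent = Node(0, 8, 5, [Node(0, 4, 1), Node(4, 8, 2)])
hot_leaf = Node(8, 16, 40)
root = Node(0, 16, 5, [cold_parent, hot_leaf])
merge_cold(root, merge_threshold=10)
```

After the pass, the two cold leaves under [0, 8) have been absorbed into their parent, whose count becomes 8, while the hot leaf [8, 16) keeps its own counter. No events are lost, because merged counts are added to the parent.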

Additional Contributions

Heuristics for splitting and merging

Theoretical analysis of accuracy guarantees

Proposal for hardware implementation

Experimental evaluation

• Memory requirements

• Average and worst-case errors on benchmarks

• Performance of HW implementation

• Accuracies on the order of 98.0-99.8% with only 8-64K of memory

Conclusions

Highly interesting program

• My short presentation certainly doesn’t do justice to most of the mentioned works!

Readings to perhaps consider for future CARG:

• D. Wentzlaff, A. Agarwal, Constructing Virtual Architectures on a Tiled Processor

• A. Smith et al., Compiling for EDGE Architectures

• F. Agakov et al., Using Machine Learning to Focus Iterative Optimization

• (Highly subjective!)
