CPA-CSA-T: ATAC: Enhancing Multicore Programmability through All-to-All Computing



PROJECT SUMMARY

The computing world has made a generational shift to multicore as a way of addressing the Moore's gap, which is the growing disparity between the performance offered by sequential processors and the scaling expectations set by Moore's law. Two- or four-core multicores are commonplace today, with scaling to thousands of cores expected by the middle of the next decade. Unfortunately, because even two- and four-core multicores (let alone thousand-core multicores) are extremely hard to program, multicore's widespread acceptance is threatened [60]. The multicore programming challenge is a serious issue and requires us to think about bold new approaches to architecture, programming, and software.

The ATAC project is based on one simple idea: a low-latency, low-energy broadcast mechanism from any core to all other cores can yield a big step forward in multicore programmability. The broadcast mechanism is enabled by CMOS-integrated chip-level optical interconnection using WDM (wavelength-division multiplexing) with multiple add/drop points. The optical interconnect augments a traditional electrical mesh interconnect in a tiled multicore processor. We believe that although point-to-point electrical interconnect can deliver performance that is competitive with on-chip optical interconnect, it does not solve the programmability issue. Thus, in ATAC, the optical broadcast capability does not replace the basic electrical interconnect; it simply augments it for programmability. Previous work of the co-PIs of this proposal has demonstrated that an on-chip optical broadcast network integrated with a standard CMOS process is feasible to build, and that exposing the broadcast mechanism to programmers through a software API can yield significant programmability gains.

Accordingly, to drastically ease programming for multicores, we propose the ATAC computer architecture, which augments an on-chip mesh network with an on-chip optical broadcast network. Such a network enables blazingly fast broadcast communication that will allow programmers to take full advantage of the multicore opportunity, even as multicores scale to thousands of cores. Although this capability has the potential to greatly speed up existing algorithms, its biggest appeal lies in its ability to facilitate new, easy-to-use programming models. An efficient broadcast mechanism allows for novel, distributed coherent-shared-memory architectures as well.

We have assembled a cross-disciplinary team with expertise in computer architecture, programming languages and compilers, VLSI design, and integrated microphotonics. Our team is led by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT Microphotonics Center (MPC), in collaboration with the MIT Microsystems Technology Laboratories (MTL), Sandia Labs, and Intel Corp., to create a new computing platform that has the potential to significantly simplify multicore programming.

The ATAC project proposes to design and build a prototype computer system using the ATAC approach. This includes the ATAC computer architecture design, a detailed computer system simulator, a compiler system, a runtime system, a programming model and associated APIs that leverage the broadcast capability of the underlying ATAC multicore, and architectural models and interfaces for the optical interconnect technology. The optical interconnect design and fabrication is supported by DARPA's EPIC (Electrical and Photonic Integrated Circuits) program. The multicore simulator will be developed in collaboration with Intel. The simulator is based on the Pin dynamic instrumentation infrastructure and will enable us to simulate thousands of cores on our existing compute server farm of over 100 processors.

The proposed research will make five fundamental contributions. The first contribution is increased programmability for multicores due to ATAC’s seamless integration of the efficient broadcast facility enabled by an on-chip optical network along with the electrical mesh network. The second contribution is the development of Application Programming Interfaces (APIs) and programming languages that allow algorithms to best take advantage of ATAC’s on-chip communication infrastructure while minimizing the burden on the programmer. The third contribution is the development of appropriate metrics that assess programmer productivity when developing parallel applications on multicore architectures. Fourth, this research will implement a multicore simulator that can simulate thousands of cores, and can run on current multicore hardware. Finally, this research will assess how well multicores scale to thousands of cores, especially in light of performance and energy scalability.

The broader impacts of this work will include easing the multicore programming burden for future multicore systems, thereby removing the fundamental constraint to mass adoption of multicore. It will also create the “killer app” for on-chip optical interconnect, thereby bringing this technology into the mainstream. The project will also train undergraduate and graduate students in system building and multicore software and hardware technologies. As with our previous Raw and Alewife projects, we will involve several undergraduate students in our research, thereby helping to train the next generation of multicore researchers and programmers.


Results from Prior NSF Support

Project ATAC's team includes Profs. Anant Agarwal (Computer Science and Artificial Intelligence Lab, MIT), Saman Amarasinghe (Computer Science and Artificial Intelligence Lab, MIT), and Lionel Kimerling (Microphotonics Center, MIT).

Anant Agarwal Prof. Agarwal’s research group focuses on computer architecture, compilers, VLSI and applications. We refer to current and previous NSF grants EIA-9810173 and EIA-0071841, “Baring It All to Software: The MIT Raw Machine”, as the Raw project, and MIP-9012773, “Automatic Management of Locality in a Scalable Cache-Coherent Multiprocessor: The MIT Alewife Machine,” as the Alewife project.

Figure 1: Die photo of the Raw processor (left), and photo of the Raw prototype computer system (right).

The Raw project [2,4,5,6,10] designed and implemented [7] the Raw processor (a single-chip tiled multicore), a distributed ILP and stream compiler [9], runtime and operating systems, tools, and other system software. The Raw processor, implemented in IBM's 0.18 µm SA-27E ASIC process, occupies an 18.2 mm x 18.2 mm die and runs at 420 MHz. The Raw chip was operational in October 2002. The prototype system is fully operational and many researchers from MIT as well as other institutions such as USC/ISI, Lincoln Labs, and Lockheed Martin have used it. We also built a Raw Fabric multiprocessor system consisting of 4 Raw chips (for a total of 64 tiles).

The Raw effort pioneered the tiled architecture concept. It also conceptualized the notion of a scalar operand network, or SON [8]. The Raw effort developed the notion of the on-chip distributed direct cache. The project created a characterization of on-chip networks called Astro [8], which facilitates comparison of different tiled and conventional superscalar processors such as Trips, Scale, ILDP and others.

The effort also led to many fundamental discoveries in compiler techniques for orchestrating ILP and streams including algorithms for distributed ILP (DILP), control localization, space-time instruction scheduling [9], software-serial ordering, modulo unrolling and equivalence class unification. The effort also produced early work on transactional software methods such as SUDS (Software Undo System) [12]. For more details please refer to the Raw publications site available at www.cag.csail.mit.edu/raw, or Agarwal’s retrospective on Raw and tiled processors given as a keynote at the 2007 Micro 40 conference (the talk is available from the conference web site).

The Raw effort developed several applications for multicores. It also created a new metric for the versatility of processors and distributed an associated benchmark suite called VersaBench (http://cag.csail.mit.edu/versabench/ or google versabench).

The Raw project impacted the community in several ways. First, the project produced several dozen research papers. The project also graduated several postdocs and PhD, MS, and BS students, many of whom went on to become professors at other universities (e.g., Matt Frank at UIUC, Rajeev Barua at U. Maryland, Andras Moritz at UMass Amherst, Michael Taylor at UC San Diego).

The tiled multicore technology developed by the Raw project was also commercialized by a venture-funded startup called Tilera. Tilera has made commercially available a 64-tile multicore chip called the Tile Processor. The Tile Processor was also chosen by the NRO for US space-based applications.

The Alewife project [14] was also funded by NSF and was conducted at MIT in the early 90’s. The goal of the Alewife experiment was to discover and to evaluate mechanisms for automatic locality management in scalable multiprocessors. The MIT Alewife machine became operational on May 7, 1994.


Like the Raw project, Alewife involved a major system building effort. A 32-node machine was in regular use until 1998. The machines and simulators have also been used in graduate-level courses and a summer course for industry participants at MIT.

Alewife pioneered the integration of message passing and shared memory into a single coherent interface, and a flexible, software-extended, shared-memory system called LimitLESS directories. Alewife’s Sparcle processor [15] was an early demonstration of a multithreaded microprocessor. It created the concept of coarse-grain multithreading (CGMT). Several Sparcle mechanisms — including trap vector spreading for fast exception handling, rapid context switching, and user-level address space identifiers for fast, user-level messages — influenced SPARC V9.

The Alewife project produced dozens of publications. Alewife and Virtual Wires papers and pictures are available through the web sites www.cag.csail.mit.edu/alewife, www.cag.csail.mit.edu/multiscale, and www.cag.csail.mit.edu/vwires.

Saman Amarasinghe Professor Amarasinghe’s current work on the StreamIt project was supported by an NSF NGS award (0305453) “StreamIt: A Language and a Compiler for Streaming Applications” and by an NSF ITR award (0325297) “A Language, Compilers and Tools for the Streaming Application Domain”.

StreamIt [34,35,36,37,38,39,33,13] aims to ease the burden of programming multicore architectures by developing high-level programming idioms, compiler technologies, and runtime systems. The StreamIt project has two goals: to improve programmer productivity for the streaming class of applications, and to obtain high performance, portability, and scalability for StreamIt programs.

Improving programmer productivity: In StreamIt, the programmer builds an application by connecting components together into a stream graph, where nodes represent filters that carry out the computation, and edges represent FIFO communication channels between filters. As a result, the parallelism and communication topology of the application are exposed, empowering the compiler to perform many stream-aware optimizations that elude other languages.

In StreamIt, all of the processing is encapsulated hierarchically into single-input, single-output streams with well-defined modular interfaces. This facilitates development and boosts programmer productivity, as components can be debugged and verified in isolation.

High performance, portability, and scalability: StreamIt attempts to expose the common properties of all multicore architectures in the language while abstracting away the differences between architectures. Common properties such as multiple flows of control and multiple local memories are directly exposed in the language, making the abstraction boundary between the language and the architecture that the compiler has to bridge as narrow as possible. Thus, unlike existing imperative languages, where compilers must perform heroic and often impossible tasks, the StreamIt compiler is able to achieve respectable performance with relative ease.

StreamIt also hides processor-specific properties such as the number of cores, the communication primitives and topology, and the computational strength and memory layout within the cores. In the StreamIt compiler, we are developing the algorithms needed to effectively take advantage of these properties without losing portability or performance.

Lionel Kimerling Prof. Kimerling's research [16,17,18,19,22,23,24,25,26,27,28,29] has focused on microphotonic integration on the silicon platform for over 20 years. Among the group's achievements are: 1) a monolithically integrated MOSFET driver and Si:Er LED emitting at 1550 nm (operating at room temperature); 2) the first low-loss silicon channel waveguides; 3) the first omnidirectional dielectric stack reflector; 4) the first Ge-on-Si photodetector; 5) the first silicon disk, ring, and racetrack resonators and integrated silicon bus add/drop filters; 6) discovery of a strong Franz-Keldysh effect in Ge and its application to integrated SiGe optical modulators; 7) the smallest waveguide-integrated Ge-on-Si photodetectors, exhibiting 100% quantum efficiency, and record low-loss silicon channel waveguides (0.35 dB/cm); 8) demonstration of a full CMOS process flow for a monolithically integrated Si/Ge-on-Si waveguide/modulator/ring resonator/photodetector circuit; and 9) the first monolithically integrated silicon optical RF channelizer.

In terms of NSF sponsored research, Kimerling established and led the Microphotonics IRG in the MIT MRSEC from 1997-2002. The team studied the silicon platform for HIC photonic materials and devices. They created the first photonic crystal device to operate at the wavelength of 1550nm, a waveguide-integrated, photonic crystal add/drop filter. That structure continues to hold the record for the smallest modal volume of a photonic device.


Strong Atom-Photon Interaction for Microphotonic Devices (2000-2002) We observed the first enhancement of 1550 nm emission from Er2O3 in a Si/SiO2 microcavity; we observed the first evidence of THz Rabi splitting from a matched cavity structure of the same materials; and we created continuously tunable (1200-1600 nm with bias < 12 V) MEMS microcavity devices using double resonant structures.

Agglomeration of Ultra-thin Silicon-On-Insulator Films: Understanding Dewetting in Crystalline Thin Films (2004-2006) We developed a 5-step surface-energy-driven dewetting model for SOI agglomeration based on capillary film edge stability and the generalized Rayleigh instability. Our surface-energy-driven model explained all of the key experimental observations in the existing literature as well as our own new experimental results. For the first time, we observed highly anisotropic dewetting behavior that was very sensitive to the edge orientation of a patterned mesa. We also demonstrated the effectiveness of a dielectric edge coverage technique for stabilizing patterned SOI structures against dewetting.

Introduction

The trend in computer architecture for the foreseeable future is clear: microprocessor designers are using copious silicon resources to integrate more and more processor cores onto a chip. In fact, within the next ten years, general-purpose multicore processors will likely contain 1,000 cores or more. While this path towards ever-increasing parallelism will theoretically enable massive performance, it is unlikely that application developers will be able to harness this potential unless drastic improvements to programming are made [60]. Current approaches to multicore programming are barely manageable for multicores with two or four cores, and they certainly will not scale to massive amounts of parallelism. A new architectural mechanism---fast on-chip broadcast enabled by novel optical technology---will revolutionize the programmability of future multicore processors.

While parallel programming used to be somewhat of a black art reserved for the handful of rocket scientists who programmed supercomputers and clusters, multicore's imminent rule will require most programmers to implement parallel applications. However, by incorporating powerful hardware and architectural mechanisms, such as a fast broadcast, and empowering programmers with the right interfaces to the underlying architecture via APIs and language facilities, all programmers will be able to efficiently construct programs that exploit multicore's power.

The broadcast primitive, whereby one core in a parallel computer system communicates some data to every other core (or some subset of the cores), is powerful and straightforward to use. Parallel algorithms often use broadcast to achieve synchronization and to communicate sentinel or data values. The popular SMP (symmetric multiprocessing) computational model can also use a cheap, scalable broadcast to scale beyond a small handful of cores. However, broadcast operations can be expensive. Even a state-of-the-art electrical mesh network for a future multicore with 1000 cores would require hundreds of cycles to broadcast a single value to all cores. Such a latency is large enough that broadcast must still be used judiciously, or not at all, by programmers. In fact, programmers often implement otherwise straightforward algorithms in complicated ways to work around performance bottlenecks such as slow broadcasts. As an example, MPI has a broadcast feature, but it is rarely used for this reason. With an essentially "free" broadcast, however, programming parallel systems would be hugely simplified, as programmers could use broadcast freely.
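To make the pattern concrete, the sketch below uses standard MPI (as a stand-in for any message-passing layer) to show the broadcast idiom described above: one rank produces a new bound or sentinel each iteration and every other rank receives it. The compute_new_bound and do_local_work functions are placeholders invented for illustration; on today's interconnects the repeated MPI_Bcast calls are exactly the cost that programmers contort their code to avoid.

```c
/* Sketch of the broadcast idiom described above, written against standard
 * MPI. compute_new_bound() and do_local_work() are placeholders, not real
 * APIs. On current machines each MPI_Bcast is expensive enough that this
 * straightforward structure is often avoided; a cheap hardware broadcast
 * would let it be written this way.
 */
#include <mpi.h>
#include <stdio.h>

static double compute_new_bound(int round) { return 100.0 / (round + 1); }
static void   do_local_work(double bound)  { (void)bound; /* per-core work */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double bound = 0.0;
    for (int round = 0; round < 8; round++) {
        if (rank == 0)
            bound = compute_new_bound(round);   /* root updates the value */
        /* every rank receives the value the root just produced */
        MPI_Bcast(&bound, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        do_local_work(bound);
    }
    if (rank == 0) printf("final bound = %g\n", bound);
    MPI_Finalize();
    return 0;
}
```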

Why do this research now? There are two major reasons. First, multicores have recently become mainstream and are facing a parallel programming crisis [60], so bold architectural changes are warranted. Second, recent breakthroughs in CMOS integration of nanophotonic components [25] provide the enabling technology to make the broadcast mechanism viable. Our photonic technology uses planar lightguide circuits (PLCs) with wavelength division multiplexing (WDM). CMOS silicon offers all of the information-capacity advantages of fiber and the precision planar processing of PLCs, with the additional advantage of dense integration on a platform that is compatible with electronic integrated circuits. The basis of this dense integration capability (~10⁶ photonic devices per cm²) is high index contrast. While fiber and PLCs typically utilize a core/cladding refractive index ratio of <0.01, the Si/SiO2 ratio is ~2.33. This design paradigm, named high index contrast (HIC), provides strong confinement of light in small volumes, such as on-chip planar waveguides. Conventional optical devices utilize low index contrast that is compatible with silicon circuits neither in size nor in terms of CMOS processes. The layer thickness and dimensions of silicon waveguides, 200 x 500 nm, are similar in size to upper-level metal interconnects, with much higher information-carrying capacity and no electromagnetic interference (EMI).


MIT's microphotonics group (led by co-PI Prof. Kimerling) has designed, implemented, and demonstrated HIC devices within a CMOS process flow. Their fabrication efforts are funded under DARPA's EPIC program.

Our collaborative effort for this cross-disciplinary proposal focuses on designing computer architectures and programming models for the novel broadcast interconnect enabled by the on-chip optical technology, researching the degree to which an efficient broadcast mechanism eases multicore programming, partitioning function between the optical and electrical-digital domains, and designing interfaces and models for the optical components to facilitate their incorporation into computer systems.

ATAC creates a fundamental shift in multicore computing by utilizing an electrical mesh for short-range inter-core communication and a broadcast optical network optimized for global communication. Our early results indicate that the ATAC approach will simplify multicore programming significantly, that it is scalable to 4000 cores per chip, and that it can also ease the off-chip memory bandwidth bottleneck by extending the optical network off-chip.

Overview of the ATAC Approach

As displayed in Figure 2, the proposed ATAC microprocessor is constructed as a 2-D tiled layout of computing cores, each containing data and instruction caches, communication hardware, and computational resources. The cores are interconnected via an electrical mesh network for near-neighbor communication, as well as an optical broadcast network for global communication. The optical broadcast network can be thought of as a global bus whereby every core can communicate with every other core in a few cycles. However, unlike a standard electrical bus, the optical broadcast network is a global communication channel that scales to thousands of cores. Indeed, the ATAC architecture is being designed to scale to four thousand cores or more.

At a high level, ATAC overlays a standard multicore processor with an on-chip 2-D mesh network (e.g., Raw, Trips) with an additional optical broadcast network. Some applications require significant near-neighbor communication due to the spatial locality inherent in the algorithm. However, many algorithms, such as search, are more easily coded if they can use global communication to broadcast the current best value. SMP architectures also benefit from an efficient broadcast because they can quickly invalidate copies of data that are cached in multiple caches. The ATAC approach is to blend the best of the electrical mesh and optical broadcast networks in a way that yields both good performance and programming ease.

Fast global communication will become increasingly important as multicores scale to hundreds or thousands of cores. We estimate that a future multicore with 1,000 cores will take at least 100 cycles to perform a global broadcast operation. ATAC, however, will be able to perform such an operation on a 1000-core chip in 10 cycles or less. Given that many applications rely heavily on global communication, such benefits will allow for performance improvements of over 40x for some applications, as seen in our preliminary performance results.
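A back-of-the-envelope estimate illustrates the gap. Assuming roughly two cycles per router hop (our assumption, not a measured figure), a corner-to-corner traversal of a 32 x 32 mesh alone costs on the order of 100 cycles before any contention, whereas the ANET's roughly 3 ns end-to-end latency (described later) is only a few cycles at 1 GHz, consistent with the 10-cycles-or-less figure above:

```latex
% Worst-case hop count on a k x k mesh is 2(k-1); ~2 cycles per hop assumed.
\[
  k = 32 \;\Rightarrow\; \text{hops}_{\max} = 2(k-1) = 62,
  \qquad t_{\text{mesh}} \approx 62 \times 2 \approx 124 \text{ cycles (before contention)},
\]
\[
  t_{\text{ANET}} \approx 3\,\text{ns} \times 1\,\text{GHz} = 3 \text{ cycles}.
\]
```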

Perhaps more importantly, fast global communication will become essential to enable programmers to manage the arduous task of programming hundreds or thousands of cores. Programmers have typically used global communication operations sparingly, as such operations often impose a significant performance penalty. Accordingly, parallel programmers often decompose otherwise straightforward algorithms into much more complicated forms to minimize the amount of global communication necessary to implement the algorithm. Furthermore, programmers have to account for potentially widely varying communication latencies, depending on the distance between the sending and receiving cores. Not surprisingly, this all means that getting good performance on standard multicore systems can be incredibly challenging and time-consuming. On ATAC, programmers will be able to broadcast values at will, or use the underlying electrical mesh, without worrying about the typical performance impact of imprudent use of global communication operations.

Figure 2: High-level view of the ATAC architecture, consisting of a 2-D array of tiles interconnected by both an electrical mesh network and an optical broadcast network. A conceptual view of the broadcast network is shown here; the practical implementation using a set of rings is described later.

The ATAC broadcast-and-select network optimizes power efficiency by delivering data to multiple destinations with little extra power consumption (the source modulator is the primary power consumer and is, to first order, independent of the number of receivers). The use of wavelength division multiplexing minimizes interconnect contention. ATAC is scalable to 4000 cores per chip and also to multi-chip and multi-box architectures. The ATAC solution is also monolithically integrated into standard CMOS microprocessor fabrication processes, thereby improving its performance/cost benefit.

Related Research

Parallel programming is a challenging problem [41]. In order for existing sequential codes to take advantage of multicores, programming tools and novel architectures are needed to ease the transition from sequential codes to multicore codes [42]. Much academic and industry research has attempted to ease the effort required to execute codes on parallel computers, including parallel programming APIs, domain-specific languages, automatically parallelizing compilers, parallel performance tools, and incremental architectural enhancements, but with little impact. We believe bold architectural approaches are needed to address this pressing issue.

There are many extensions to sequential programming languages that provide parallel capabilities. Examples include threads [43], OpenMP [44], MapReduce [45], and MPI [46]. These methods work well for some applications, but none has shown itself capable of tackling all forms of parallelism. Many of the methods, such as threads, do not scale beyond a few cores. Some researchers also believe that these extensions make it easy for the programmer to introduce errors; Lee [47] shares this view. MPI is difficult to program, and it squanders the advantage of multicore with its high operation overhead. All-to-all broadcast using optical interconnect addresses both the scalability and the programmability issues.

Domain specific languages also attempt to address multicore programmability. StreamIt [48] and Brook [49] are programming languages primarily focused on signal processing and stream processing.

Parallelism can also be extracted from sequential codes. This allows programmers to realize performance improvements on parallel machines without modifying their sources. Typically, however, these gains are modest and unlikely to scale to more than 10 or 20 cores. Examples of distributed ILP (DILP) compiler efforts include Mahlke's work [50], GREMIO [51], and our own effort on RawCC [9].

In order to ease multicore programming, we have seen the advent of hardware support for easier programming models. An example of such additional hardware is transactional memory [52, 53]. Transactional memory systems allow multiple threads to access shared memory inside of a transaction. If multiple threads access the same piece of data, the system rolls back transactions such that only one thread modifies the shared data at a time. It is conjectured that this is easier to program and less error-prone than threaded programs with locks, but this needs to be investigated further.

Different architectures attempt to solve the problem of organizing and connecting parallel resources on a single chip. One way to do this is via processors designed to exploit instruction-level parallelism; the Itanium processor [54] and the Multiflow work [55] are examples. Streaming processors have attempted to solve this problem by optimizing for applications with little temporal locality; the prototypical stream processor is the Imagine processor [56]. Another organization of parallel resources is to build an SMP on a chip; the Piranha project [57] is an example of an SMP on a chip for commercial workloads. Finally, there are processors that support multiple types of parallelism, such as Trips [58] and Raw, which can support stream processing, thread-level parallelism, and ILP.

Although other methods of using photonics for on-chip interconnect are being developed, their limited gains over electrical interconnect rarely justify their added cost and complexity. Free-space optics has attracted significant interest because it offers flexibility in terms of hybrid components; the downsides are its limited CMOS compatibility and the difficulty of fabricating reliable optical components. Another approach replaces the electrical bus with an optical bus. Unfortunately, contention still limits the optical bus. We believe that our approach, using WDM, CMOS process compatibility, and broadcast, uniquely leverages the strengths of photonics for the specific goal of enhancing multicore programmability in an area where electrical interconnect falls short, thereby justifying its use.


Research Questions

The aim of the ATAC project is to create a multicore computer system that can scale to thousands of cores, both in terms of performance and ease of programming. ATAC attempts to achieve this goal by integrating an optical broadcast network into a tiled multicore processor with an electrical mesh network. ATAC will also provide the programmer with high-level APIs to efficiently and effectively exploit ATAC's hardware resources. The research questions for the ATAC project center around how to best achieve and balance the two goals of performance and programmability. More specifically, our research will attempt to answer the following questions:

- What is the best way to interface an optical broadcast network to a basic tiled multicore processor?

- What is the right balance in the power budget between energy used in the electrical mesh network and energy used in the optical broadcast network? Furthermore, what is the right balance between the power consumption of such communication networks, the computational part of each core, and the on-chip and off-chip memory resources?

- What is the best API to provide programmers to take advantage of both the optical broadcast network and the electrical mesh? Should the APIs allow the programmer to observe or control the spatial location where a particular piece of computation is run, or should the API handle this behind the scenes? Are there high-level language constructs that would help programmer productivity?

- What is the best way to measure ease of programming and programmer productivity? This includes a comparison of performance and programming effort for the baseline tiled multicore architecture versus the architecture with the optical network.

- What is the degree to which the broadcast network can make programming easier for a given level of performance? Or conversely, what is the degree to which performance can be increased with a broadcast network for a given level of programming effort? This result will provide the justification to take the next step of actually building a physical prototype of the ATAC processor.

- We will also study the extent to which a pipelined broadcast can be implemented in software on a traditional electrical network, and assess the performance achievable (a sketch of one such software broadcast, built from point-to-point messages, appears after this list).

- Which application classes will best take advantage of ATAC's broadcast network?

- What is the best way to simulate 1000 cores at reasonable speed and with sufficient flexibility?
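As a concrete starting point for the software-broadcast question above, the following sketch implements a binomial-tree broadcast out of point-to-point messages, using MPI as a stand-in for the electrical mesh's send/receive primitives; tree_bcast is our own illustrative helper, not an existing library routine. It completes in about log2(N) message rounds, but on a physical mesh each round still pays multi-hop wire latency, which is precisely what the proposed study would quantify against the ANET.

```c
/* Sketch of a software broadcast built from point-to-point messages
 * (binomial tree, ~log2(N) rounds), using MPI as a stand-in for the
 * electrical mesh's send/receive primitives. On a physical mesh each
 * round additionally pays multi-hop wire latency.
 */
#include <mpi.h>
#include <stdio.h>

static void tree_bcast(int *value, int root, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int rel = (rank - root + size) % size;   /* rank relative to the root */

    /* Receive phase: wait for the value from the parent in the tree. */
    int mask = 1;
    while (mask < size) {
        if (rel & mask) {
            int src = (rel - mask + root) % size;
            MPI_Recv(value, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Send phase: forward the value to each child. */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < size) {
            int dst = (rel + mask + root) % size;
            MPI_Send(value, 1, MPI_INT, dst, 0, comm);
        }
        mask >>= 1;
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int v = (rank == 0) ? 42 : -1;
    tree_bcast(&v, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, v);
    MPI_Finalize();
    return 0;
}
```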

The ATAC Approach

The ATAC approach incorporates an optical broadcast network into a tiled multicore processor architecture. The optical network is enabled by recent advances in electronic-photonic integration using standard CMOS. ATAC also seamlessly integrates an electrical mesh interconnect for high-bandwidth near-neighbor communication. Programmers will interact with ATAC via high-level APIs that leverage the system's underlying resources. The following sections discuss details of the ATAC architecture, the optical network, and the software infrastructure.

The Basic ATAC Architecture

Figure 3: ATAC architecture for a 4096-core microprocessor chip.


The ATAC processor architecture is a tiled multicore architecture combining the best of current scalable electrical interconnects with cutting-edge on-chip optical communication networks. The tiled layout uses a 2-D array of simple processing cores, each containing a multiple-instruction-issue in-order RISC pipeline with an FPU and local memories. Each tile contains an L1 cache and a portion of the distributed L2 and L3 caches. The L1 and L2 caches are SRAMs while the L3 cache is embedded DRAM. Chip resources are divided approximately evenly between computation, communication, and memory (one-third to each) which has been shown to be near-optimal for multicore processors [32].

One of the important and appealing aspects of our design is that we use a modest clock speed of 1 GHz for the processors, the electrical network, and the optical network. Although optical networks can be clocked at much higher speeds, the power consumption at the endpoint transducers and interface circuitry can be prohibitive. Since our optical network is used only for broadcast, it has modest bandwidth requirements and we do not need to resort to clock speeds much higher than the base processor frequency.

Each core is connected to its four nearest neighbors using point-to-point electrical connections. Together, these electrical links form a mesh network (the "ENET" indicated in Figure 3) capable of transferring data between any two cores using multiple hops. On top of this state-of-the-art electrical substrate, ATAC adds an integrated photonic communication network to improve the performance and efficiency of operations that are costly on the electrical mesh. These operations include broadcast/multicast communication, as well as point-to-point communication between cores that are physically far apart.

The heart of the on-chip optical interconnect is the all-to-all network (or "ANET"). The ANET provides a low-latency, contention-free connection between a set of optical endpoints, as depicted for 64 tiles in the center of Figure 3. This highly interconnected topology is achieved using a set of optical waveguides that visit every endpoint and loop around on themselves to form continuous rings. Further, as illustrated on the right side of Figure 3, each sending endpoint can place data onto a waveguide using an optical modulator (shown as a yellow circle on each of the waveguides) and receive data from the other endpoints using optical filters and photodetectors (shown as a red circle). Because the data waveguide forms a loop, a signal sent from any endpoint quickly reaches all of the other endpoints. Thus, every transmission on the ANET has the potential to be a fast, efficient broadcast. To avoid having all of these broadcasts interfere with each other, the ANET uses wavelength division multiplexing (WDM). The processor cores in the ATAC architecture have a 32-bit word size, making it desirable for them to be able to send a 32-bit word on each clock cycle. This is accomplished using a set of parallel waveguides, where each waveguide carries one bit. In a baseline ATAC processor, there would be 32 waveguides, each transmitting data at the same frequency as the processor core. If chip real estate for optical components is limited (as might be the case when scaling up to thousands of cores), serialization can be used to decrease the number of waveguides. In other words, we can reduce the number of waveguides to 16, 8, or even 4, and serialize a 32-bit word using multiple sub-word transfers.
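The serialization trade-off is simple to quantify: with W parallel data waveguides, one 32-bit word takes

```latex
\[
  t_{\text{word}} = \left\lceil \tfrac{32}{W} \right\rceil \text{ cycles:}
  \qquad W = 32 \Rightarrow 1,\quad W = 16 \Rightarrow 2,\quad W = 8 \Rightarrow 4,\quad W = 4 \Rightarrow 8 .
\]
```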

In addition to the primary data waveguides, there are several other special-purpose waveguides. First, there is an optical “power supply” waveguide that provides the light source for the modulators. Second, there is a clock waveguide which sources use to send the clock along with the data. Third, there is a backwards flow-control waveguide that is used to throttle back a sender when a receiver is overwhelmed. Finally, we are exploring the use of several “metadata” waveguides that are used to indicate a message type (e.g., cache read, cache write, barrier, ping and raw data) or a message tag (for disambiguating multiple message streams from the same sender).

In the WDM design, all the modulators on a given sender are tuned to transmit at a unique wavelength. To receive data from any sender at any time, each receiving endpoint must contain sets of filters trimmed to the wavelengths of each of the other endpoints’ modulators. Each set of filters then feeds into a separate FIFO (First-In, First-Out buffer), allowing the data from each sender to be buffered separately. This saves the processor core from the extra step of examining each message to determine the sender and find the message it needs. Since every receiver is not necessarily interested in every message sent on the network, special-purpose hardware is used to pre-screen messages and forward only messages of interest to the FIFOs. This avoids the extra energy associated with buffering and handling unwanted data and allows the FIFOs to be kept smaller. It also frees the processing core from having to sort through messages using software. Messages can be screened based on sender (i.e., wavelength) or by the metadata transmitted with each message. This novel buffering and filtering scheme is an area of active research for this project.
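The following sketch is a software model of the receive-side screening and per-sender buffering just described. The struct fields, the filter rule, and the FIFO depth are illustrative assumptions only; they are not the proposed hardware design.

```c
/* Software model of the receive-side screening and per-sender buffering
 * described above. Field names, the filter rule, and FIFO sizes are
 * illustrative assumptions, not the proposed hardware design.
 */
#include <stdbool.h>
#include <stdint.h>

#define N_SENDERS 64          /* one wavelength (sender) per core in a region */
#define FIFO_DEPTH 16

typedef enum { MSG_CACHE_READ, MSG_CACHE_WRITE, MSG_BARRIER,
               MSG_PING, MSG_RAW_DATA } msg_type_t;

typedef struct {
    uint8_t    sender;        /* implied by the wavelength it arrived on */
    msg_type_t type;          /* carried on the metadata waveguides */
    uint16_t   tag;           /* disambiguates streams from one sender */
    uint32_t   payload;       /* one 32-bit word from the data waveguides */
} msg_t;

typedef struct {              /* one FIFO per sender wavelength */
    msg_t buf[FIFO_DEPTH];
    int   head, tail, count;
} fifo_t;

typedef struct {
    fifo_t   per_sender[N_SENDERS];
    uint64_t sender_mask;     /* bit i set: accept messages from sender i */
    uint32_t type_mask;       /* bit t set: accept messages of type t */
} rx_screen_t;

/* Hardware analogue: decide, before buffering, whether this core cares. */
static bool screen_accepts(const rx_screen_t *rx, const msg_t *m) {
    return (rx->sender_mask & (1ull << m->sender)) &&
           (rx->type_mask   & (1u   << m->type));
}

/* Called once per arriving word; drops uninteresting traffic so the core
 * never has to sort through it in software. Returns false if dropped. */
static bool rx_deliver(rx_screen_t *rx, const msg_t *m) {
    if (!screen_accepts(rx, m)) return false;
    fifo_t *f = &rx->per_sender[m->sender];
    if (f->count == FIFO_DEPTH) return false;   /* would assert flow control */
    f->buf[f->tail] = *m;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}
```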


The design of a single ANET optical link scales to approximately 64 endpoints. Therefore a 64-core chip (feasible using a 90 nm or 65 nm CMOS process) could simply place a single core at each ANET endpoint. Scaling beyond this point requires some number of cores to share a single optical endpoint. The set of cores sharing one endpoint is referred to as a "region." For chip designs requiring only a small number of cores in each region, electrical circuits can be used to negotiate access and distribute incoming data, preserving the illusion that all cores are optically connected. As regions grow larger, it is preferable to use a ring-of-rings optical architecture, as illustrated on the left side of Figure 3. In this two-level hierarchical design, there is a top-level ANET (ANET1) that connects together multiple regional ANETs (ANET0). On the ANET0, there is a single core in each region; the cores connected together by each ANET0 then form a region on the ANET1. Our analysis indicates that this two-level design will scale to 4096 nodes at the 11 nm CMOS technology point.

As described in more detail in a later section, the ANET0 and ANET1 networks are connected by an interface tile using conventional electronics. Our design allows for a pair of values to be transmitted simultaneously between the two levels. Thus, at any given instant two broadcasts can be happening simultaneously over all 4096 cores. However, 64 simultaneous broadcasts can be happening within each 64-core regional network or ANET0.

Seamless connections to external DRAM or I/O devices are made by replacing some processing cores with memory or I/O controllers. Processing cores access off-chip resources by sending messages to these gateways. Memory cores receive requests on the ATAC network, interpret the requests electrically, and then send messages to DRAM using a separate waveguide that goes off-chip. We clock this waveguide at 2 GHz. Data is transmitted on it using 64 different wavelengths, sending 64 bits at a time. Replacing 4 cores in each 64-core region with memory controllers yields a memory-bandwidth-to-computation ratio of 1 byte/FLOP (assuming a 2 GHz clock for the off-chip waveguides). A 4096-core chip would require a reasonable 256 memory connections (supplying 4 TB/s of bandwidth) to achieve the same ratio, as the following estimate shows.
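The bandwidth arithmetic behind these figures, assuming one bit per wavelength per off-chip waveguide clock and the 1 GFLOPS-per-core peak implied by Table 1:

```latex
% Per memory controller: 64 wavelengths x 1 bit x 2 GHz off-chip clock.
\[
  B_{\text{ctrl}} = 64 \times 2\,\text{Gb/s} = 128\,\text{Gb/s} = 16\,\text{GB/s},
  \qquad
  B_{\text{chip}} = 256 \times 16\,\text{GB/s} = 4\,\text{TB/s},
\]
\[
  \frac{B_{\text{chip}}}{4096\ \text{cores} \times 1\,\text{GFLOPS}}
  = \frac{4\,\text{TB/s}}{4\,\text{TFLOPS}} = 1\ \text{byte/FLOP}.
\]
```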

However, because the massive on-chip bandwidth of the combined electrical and optical networks encourages communication-centric rather than memory-centric computing, the traditional rule-of-thumb ratio of 1 byte/FLOP is excessive. Communication-centric computing allows processes to exchange values directly rather than storing them in memory as an intermediary. In addition, ATAC’s efficient broadcasts act as DRAM bandwidth multipliers, allowing each value fetched from DRAM to be received by multiple cores. Together, these effects can lower memory bandwidth requirements significantly.

The area required to implement the computational portion (including L1 and L2 caches) of 4096 cores is approximately 400 mm². Using an additional 200 mm² to implement an L3 cache in embedded DRAM will allow for over 8 GB of on-chip memory. Because communication-centric computing reduces the pressure on all levels of the memory hierarchy, this amount should be sufficient for most application domains. If additional on-chip capacity is desired, 3D integration can be used to stack regular DRAM above each tile, allowing even more total "on-chip" memory.

The ANET Optical Network Architecture

A broadcast-and-select approach enables ATAC communications in a simple optical network. One of the principal philosophies we have taken is to use electronics where electronics is best and optics where optics is best. All switching and routing of data is performed in the electrical domain, where switching circuitry is readily implemented. Doing so eliminates the significant limitations associated with tuning in the optical domain. Further, we show that despite the broadcast nature of this network, it is highly efficient. This is because most of the power is consumed in driving the optical modulators, not in the optical power required for transmission. We refer to this broadcast-and-select optical network as the All-to-All Network, or ANET.

The operation of an ANET region is as follows. Each ANET region contains 64 cores. Each core is assigned a wavelength channel to transmit on and enough receive elements to read all of the data being transmitted by every other core within the region. Data is transmitted at the chip clock frequency, f_c = 1 GHz, so as to avoid costly serialization and deserialization steps. In order to transmit a 4-byte word within a single clock cycle, 32 ANET communication lines are required. An additional 8 lines are added for address information and parity checks, bringing the total number of lines within an ANET region to M = 40. The number of connected cores in a standard ANET region is N = 64.


A more detailed design of the optical network is shown in Figure 4, which also summarizes our estimates of the optical power consumed by the network. For an ANET with 64 cores and f_c = 1 GHz, the transmit bandwidth is TBW = 2.6 Tb/s and the receive bandwidth is RBW = 161 Tb/s. This simply reflects that, because of the broadcast capability, there is a 64x multiplier on the send bandwidth. Our analysis shows that the latency of the ANET network is approximately 3 ns; approximately 0.5 ns is due to optical delay and the rest is electrical. The total ANET power consumption, as shown in Figure 4, is 2.6 W, of which only 0.6 W is consumed on-chip. Comparisons of ANET performance with that of an electrical mesh are given in Table 1.

Figure 4: ANET network with memory interface, and layout detail for optical power coupling to the chip and distribution within the ATAC network. TBW = M x N x f_c ≈ 2.6 Tb/s; RBW = M x N x (N-1) x f_c ≈ 161 Tb/s. The power consumption of a 64-core photonic ANET is 2.6 W, with only 0.6 W consumed on-chip.
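Evaluating the caption's formulas with M = 40 lines, N = 64 cores, and f_c = 1 GHz reproduces the quoted bandwidths (2.56 Tb/s rounds to the 2.6 Tb/s cited above):

```latex
\[
  \mathrm{TBW} = M N f_c = 40 \times 64 \times 10^{9}\,\text{b/s} \approx 2.6\ \text{Tb/s},
\]
\[
  \mathrm{RBW} = M N (N-1) f_c = 40 \times 64 \times 63 \times 10^{9}\,\text{b/s} \approx 161\ \text{Tb/s}.
\]
```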

Scaling to 4096 cores The All-to-All photonic network can be scaled to at least 4096 cores with a hierarchical structure. At 4096 cores, there are 32 ANET0 networks and a single ANET1 network. Each regional ANET0 network connects 128 cores using 64 optical nodes (i.e., two cores share a node). The ANET0 and ANET1 connections are made via optical-electrical-optical (OEO) conversion on an ANET0-ANET1 interface tile, which has the function of routing the data. The aggregate performance of the photonic network is TBW = 80 Tb/s (transmit bandwidth) and RBW = 5000 Tb/s, i.e., 5 Pb/s (receive bandwidth). Off-chip memory and I/O connections will be made optically as well. The power consumption of the on-chip photonic network is approximately 26 W. Off-chip memory and chip-to-chip communications are expected to add only 10-20% to this power budget, due to the substantially lower off-chip bandwidth requirements and the point-to-point nature of memory connections. So power consumption is not a major concern. The area constraints of a 2 cm x 2 cm chip are a bit more stringent: in order to fit the required 5.4 million devices, each device can take up no more than 75 µm².

Interfaces between Digital and Optical Components

A key area of innovation in this project is the interface between traditional digital logic and the novel integrated photonic devices. This includes both the low-level details of how a digital signal is translated into an optical signal as well as the higher-level question of how the massive quantity of data received from the optical network should be screened, sorted, and buffered before it is consumed by the processor.

Figure 4 (detail): ANET resources and power budget (estimates for the 11 nm node).
- ANET resources: wavelengths = 64; data waveguides = 40; modulators N_M = 2560; receivers N_R = 161,280; clock frequency f_c = 1 GHz.
- Optical losses (L = 7 dB): backplane coupling 1 dB; waveguide propagation losses 2 dB (regional ANET: 1.5-6 cm at 0.33-1.3 dB/cm; global ANET: 10-20 cm at 0.1-0.2 dB/cm); modulator drop loss 1 dB; filter drop loss 1 dB; filter set thru loss 1 dB; power supply splitter losses 1 dB.
- Off-chip optical power (P_Optical = 0.2 W): P_Optical = 10^(L/10) x Q_g x f_c x N_R / R_d = 0.2 W, with charge to flip a gate Q_g = 0.25 fC and detector responsivity R_d = 1.1 A/W.
- On-chip electrical power (P_On-chip = 0.6 W): modulator power P_M = P_Mod-driver + P_Mod = 234 µW; receiver power P_R = 0.14 µW; P_Electrical = N_M x P_M + N_R x P_R = 0.6 W.
- Optical power supply (P_Optical-Supply = 2 W): laser P_L = 1.5 W for 0.2 W optical power in fiber (JDSU DFB data sheet CQF 935.708-19050); TEC P_TEC ~ 0.5 W (estimated, but depends on Tcase); total optical power supply = P_L + P_TEC = 2 W.
- Total "wallplug" ANET power: P_ANET = P_On-chip + P_Off-chip = 0.6 W + 2 W = 2.6 W.

Figure 5 shows the path a bit takes as it is transmitted through the optical network. First, a signal stored in a flip-flop is used to activate a modulator driver (an analog electrical circuit) which, in turn, controls the optical filter/modulator component. The filter/modulator couples light of a specific wavelength from the wideband source waveguide, modulates it, and transfers it to the data waveguide. When the light passes a receiver filter, part of it is drawn off and fed into a photodetector. The photodetector outputs a small electrical signal, which is then converted back to a digital bit.

Properly designing the optical components, modulator driver and circuits to receive the signal from the photodetector requires a close collaboration between the architecture and photonics groups. Since these optical components are used only to transmit digital signals, their functional requirements may be significantly different from those needed to transmit analog signals. In addition, the physical characteristics of the optical devices (e.g., size, capacitance, and manufacturing variation) greatly influence the design of the electrical interface circuits.

Our proposed design uses a novel technique to transition from an optical signal to a digital electrical signal. Photons extracted from a data waveguide by a receiver filter are routed to a photodetector whose output is directly connected to a digital logic gate. The key to making this work is ensuring that the photodetector output is sufficient to charge the input capacitance of a digital gate to the required switching voltage. Traditionally, a transimpedance amplifier (TIA) is used as the interface to convert the tiny photodetector output current to digital voltage levels. However, TIAs are sensitive analog circuits that are difficult to design and consume large amounts of die area (500 µm²) and power (1 mW) per receiver. In addition, placing these sensitive analog circuits in close proximity to noisy digital circuits decreases reliability. By using high-efficiency filters and small-footprint, waveguide-integrated photodetectors, and by carefully managing optical power levels, TIAs can be omitted in future process generations. As digital circuits shrink, input capacitances decrease and this technique becomes feasible. Our research indicates that TIA-free designs will be practical for chips based on 22 nm and smaller process technologies.
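This receiver design is what the Figure 4 power budget quantifies: each received bit must deliver the gate-flipping charge Q_g through a detector of responsivity R_d after L = 7 dB of optical loss, summed over all N_R = 161,280 receivers:

```latex
\[
  P_{\text{Optical}}
  = 10^{L/10}\,\frac{Q_g\, f_c\, N_R}{R_d}
  = 10^{0.7} \times \frac{(0.25\,\text{fC})(1\,\text{GHz})(161{,}280)}{1.1\,\text{A/W}}
  \approx 0.2\ \text{W}.
\]
```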

Programming and Compilation for ATAC

Given the multicore trend, all computer systems will be parallel systems; all programs will be parallel programs. Yet, it is still unclear how programmers will make the shift from sequential programming to efficient parallel programming. Accordingly, a key goal of the ATAC project is to enable programmers to easily and efficiently harness the system's computational power by establishing a clear, high-level programming interface that makes use of the broadcast capability in the short term, and a language based on broadcast in the longer term.

Our programming and compilation effort will have two facets. The first will develop an API that exposes broadcast-related primitives directly to the user. We will use the experience with this API to design a language (in the spirit of Thinking Machines' C*) in which the basic broadcast primitive drives the language design. We have experience designing APIs and languages for novel architectures; for example, in the Raw effort, we designed an API for streaming called libStream, followed by the development of the StreamIt language [34].

The ATAC programming model and APIs will allow users to easily write programs employing global operations such as broadcast, scatter/gather, and reduction via the optical broadcast network. Likewise, programmers will be able to use the electrical mesh network through the same high-level programming interface. In fact, the interface will be able to transparently choose which network is used depending on the type of communication (e.g., a single point-to-point message vs. a chip-wide broadcast operation).
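To make the shape of such an interface concrete, here is a hypothetical C sketch. Every name and signature below (atac_bcast, atac_scatter, and so on) is a placeholder invented for illustration, not a defined ATAC API; the point is only that the collectives map naturally onto the ANET while point-to-point calls map onto the ENET.

```c
/* Hypothetical sketch of the kind of API described above. All names and
 * signatures are placeholders for illustration; they are not a defined
 * ATAC interface. The runtime would choose the optical broadcast network
 * or the electrical mesh per call, as the text describes.
 */
#include <stddef.h>

typedef int atac_core_t;                 /* logical core id */

typedef enum { ATAC_SUM, ATAC_MIN, ATAC_MAX } atac_op_t;

/* One sender, all (or a subset of) receivers; intended for the ANET. */
int atac_bcast(void *buf, size_t bytes, atac_core_t root);

/* Root distributes distinct chunks to / collects chunks from every core. */
int atac_scatter(const void *send, void *recv, size_t bytes_per_core,
                 atac_core_t root);
int atac_gather(const void *send, void *recv, size_t bytes_per_core,
                atac_core_t root);

/* Combine one value per core with op; result delivered to every core. */
int atac_reduce_all(const void *send, void *recv, size_t bytes,
                    atac_op_t op);

/* Point-to-point messages; intended for the electrical mesh (ENET). */
int atac_send(atac_core_t dest, const void *buf, size_t bytes);
int atac_recv(atac_core_t src, void *buf, size_t bytes);
```

A search kernel, for instance, could simply call atac_bcast() each iteration to publish its current best bound, relying on the runtime to route the call over the optical network.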

The second facet will provide ways of implementing existing architectural models that are known to be reasonably easy to program but difficult to implement, such as large-scale coherent snooping caches along with a shared-memory programming model, or the message passing interface (MPI) along with multicast.

We will also experiment with hybrid approaches in which the programming model supports both shared memory and message passing modes, both built upon the underlying on-chip networks and in-core caches. The shared memory model will build upon novel cache coherence algorithms that use the broadcast network to efficiently communicate shared cache state. The message passing model will build upon existing message passing models such as MCAPI and MPI, with an emphasis on ease of programming and the addition of a set of broadcast primitives.

Figure 5: Path of a single bit from the digital domain, through the optical network, and back into digital form. Our design does not require the traditional TIA on the receive side.
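As one illustrative possibility (not the coherence protocol this project will design), a write to a shared line could be announced with a single ANET broadcast that every other core's screening hardware turns into a local invalidation. The toy model below simulates that behavior in software; the loop inside bcast_invalidate stands in for the one-shot optical broadcast.

```c
/* Toy software model of broadcast-based invalidation: one illustrative
 * possibility, not the coherence protocol this project will design.
 * The loop over cores stands in for a single ANET broadcast that every
 * core's screening hardware would receive within a few cycles.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define N_CORES 64

typedef struct {          /* one cached copy of a single shared line */
    uint64_t addr;
    uint32_t data;
    bool     valid;
} cache_line_t;

static cache_line_t cache[N_CORES];

/* Writer core announces the address it is about to modify; every other
 * core drops its stale copy. On ATAC this would be one optical broadcast
 * rather than N-1 directory messages or mesh traversals. */
static void bcast_invalidate(int writer, uint64_t addr) {
    for (int c = 0; c < N_CORES; c++) {
        if (c != writer && cache[c].valid && cache[c].addr == addr)
            cache[c].valid = false;
    }
}

static void store(int core, uint64_t addr, uint32_t value) {
    bcast_invalidate(core, addr);
    cache[core] = (cache_line_t){ .addr = addr, .data = value, .valid = true };
}

int main(void) {
    for (int c = 0; c < N_CORES; c++)            /* everyone caches addr 0x100 */
        cache[c] = (cache_line_t){ .addr = 0x100, .data = 7, .valid = true };
    store(3, 0x100, 9);                          /* core 3 writes the line */
    int still_valid = 0;
    for (int c = 0; c < N_CORES; c++) still_valid += cache[c].valid;
    printf("copies still valid after the write: %d (expect 1)\n", still_valid);
    return 0;
}
```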

Measuring Programming Ease

Given that programming efficiency is such a key driver of the ATAC project, developing appropriate metrics for programming efficiency is important. As such, this project will develop metrics that assess "programming efficiency" and "ease of programming" by taking into account the difficulty of implementing a variety of algorithms with the ATAC programming model as well as baseline programming models such as a standard tiled multicore processor (e.g., Raw) or an MPI-based parallel system.

There are several possible ways of measuring the difficulty of implementing algorithms. One measure that has been used in the past is “lines of code”. Our implementations of applications in both the ATAC style and existing styles will directly yield the lines of code metric.

However, studies have shown [40] that Lines of Code is not a particularly good measure of programming ease. This is due to the fact that not all lines of code require the same amount of effort to write. For example, code that involves communication or synchronization between different processors is generally more difficult to write than a simple computational loop. “Lines of code” also does not capture the effort that the programmer had to expend deciding how to partition and spatially distribute an application. To measure the true benefit to ATAC programmers, more sophisticated measures will need to be created. These measures include a comparison of the actual time spent developing a given application in both programming models.

Our team has previous experience with experiments designed to measure programmer productivity. A study was conducted using the StreamIt programming language [33] in which different programmers were given a variety of application development and debugging problems, and the time it took them to reach a solution was measured. Similar studies might be a good way to more accurately measure the programming advantages of an ATAC system. One problem with this approach is that when programmers are given a problem and asked to program it in both programming models, the second model tends to appear easier, since the programmer is already familiar with the problem the second time around.

A particularly appealing approach that normalizes out this "learning" bias is to divide the programmers into two subgroups. The two subgroups are asked to implement the application in the two programming models in opposite orders, which tends to even out the learning bias. Programming time measured in this manner thus tends to be a good measure of programming ease.

Proof of Concept, Early Results and Current Status

To assess the performance characteristics of an ATAC multicore processor, we compare its performance and programmability to a leading-edge processor based on an electrical mesh network design similar to the MIT Raw processor. In general, the performance of the electrical processor is expected to be only slightly lower than the ATAC system if programming effort is not an issue. However, for applications and programming models that benefit from a fast broadcast capability, ATAC will yield a performance benefit. We estimate performance using an analytical model and also measure programming ease using lines of code.

| Metric                                              | 64-core Mesh | 64-core ATAC | 4096-core Mesh | 4096-core ATAC |
| Theoretical Peak Performance                        | 64 GFLOPS    | 64 GFLOPS    | 4 TFLOPS       | 4 TFLOPS       |
| Actual Performance                                  | 7.3 GFLOPS   | 38 GFLOPS    | 0.22 TFLOPS    | 2.3 TFLOPS     |
| Chip Power                                          | 24 W         | 22.7 W       | 140 W          | 155 W          |
| Total System Power (CPU + DRAM + Optical Supply)    | 40 W         | 28 W         | 225 W          | 232 W          |
| Total System Actual Power Efficiency                | 0.2 GFLOPS/W | 1.4 GFLOPS/W | 1.0 GFLOPS/W   | 9.9 GFLOPS/W   |

Table 1: Comparison of performance, power, and efficiency of ATAC and electrical-mesh processors. Results are presented for 64-core and 4096-core chips.

The ATAC processor is essentially the same as the baseline processor with the addition of an optical network to optimize global communications. Both processors have the same number of cores running at the same frequency and therefore have the same theoretical peak performance. However, the theoretical peak is unachievable on most applications, particularly for those applications that require large amounts of communication or coordination between the cores. A better way to compare performance is to measure useful operations performed while executing an actual application. Dividing the number of operations performed by the total time required to complete the application yields the effective performance. The effective performance numbers shown in Table 1 are based on a study of an abstracted coherent shared memory application (described in more detail in Applications Performance section below).
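For illustration, the effective numbers in Table 1 correspond to the following definitions (a worked check of the table values, not additional data):

\[ \text{effective performance} = \frac{\text{useful operations}}{\text{total execution time}}, \qquad \text{utilization} = \frac{\text{effective performance}}{\text{peak performance}} \]

so the 64-core mesh achieves roughly 7.3/64 ≈ 11% of peak, the 64-core ATAC about 38/64 ≈ 59%, and at 4096 cores the gap widens to roughly 0.22/4 ≈ 6% versus 2.3/4 ≈ 58%.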

The actual performance of the ATAC processor is better than the baseline processor due to the increased efficiency of the global operations necessary to implement coherent shared memory. The processing cores in ATAC spend less time waiting for global communication operations to complete and therefore spend a greater fraction of their time doing useful work. Note that the difference between ATAC and the baseline is greater with a larger number of cores due to the distance-based communication costs on the electrical mesh. Even though the peak power consumption of the ATAC processor can be somewhat higher (due to the addition of the optical network), the actual power efficiency of an ATAC system is substantially better. This is due to two factors: 1) less energy is wasted waiting for global communication operations and 2) the availability of the broadcast allows a value fetched from DRAM to be sent to all the cores as opposed to having all the cores access DRAM for that value. This greatly reduces the number of DRAM accesses.

Applications Performance ATAC particularly helps applications with a great deal of global communication or irregular communication patterns, because electrical networks scale poorly in terms of communication bandwidth and coordination predictability.

ATAC performs well on global operations because of its highly efficient broadcast, which is an order of magnitude faster than on a mesh-based multicore. Mesh-based multicores with 64 or more cores do not perform global operations efficiently because they have large, non-uniform core-to-core communication latencies, and they exhibit significant contention during broadcast operations. ATAC has neither of these problems. Broadcast communication latency is not distance-dependent; it is about 3 ns regardless of the endpoints. The network's WDM nature enables contention-free broadcasts within a region, allowing many simultaneous broadcast operations. Furthermore, unlike typical mesh multicores, processor cores can consume messages immediately when they arrive; ATAC's novel WDM and buffering scheme obviates message sorting and reassembly by the receiving core.
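To put the distance independence in perspective, consider the 4096-core case arranged as a 64×64 mesh. Assuming, purely for illustration, one hop per nanosecond (a 1 GHz router clock with single-cycle hops; these are our assumptions, not figures from this proposal), a corner-to-corner message traverses up to 126 hops:

\[ t_{\text{mesh}} \approx 126\ \text{hops} \times 1\,\text{ns/hop} \approx 126\,\text{ns} \qquad \text{vs.} \qquad t_{\text{ATAC}} \approx 3\,\text{ns} \]

which illustrates why a flat 3 ns broadcast latency is more than an order of magnitude faster at this scale, before even accounting for contention.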

These features of ATAC translate into significant performance improvements. Eight applications were analyzed by estimating their running time on both a mesh-based multicore and ATAC for both 64 cores and 4096 cores. As seen in Figure 6, application performance improves by up to a factor of 80.

[Figure 6: log-scale chart, “Application Studies: ATAC Speedups vs. Electrical Mesh,” showing speedup (0.1× to 100×) for vector add, vector scaling, jacobi, n-body, snooping shared memory, and matrix multiply, with separate bars for 64 cores and 4096 cores.]

Figure 6: Comparison of performance between ATAC and electrical mesh for selected applications.

Application speedups are calculated using analytical models based on application characteristics. These models include estimates of time spent in three categories: normal computation, core-to-core communication, and memory accesses. For all applications except Snooping Shared Memory, the models calculate the exact number of operations in each category required to execute the application on a specific problem size. The estimated costs of each operation are then summed to calculate the total run time. Care was taken to accurately model the operation of the communication networks, including any end-point contention. A range of problem sizes was examined, and representative examples were chosen for the results shown.
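As an illustration of this style of analytical model (a minimal sketch with invented operation counts and per-operation costs, not the actual model used for the results above):

#include <cstdint>
#include <cstdio>

// Per-operation cost estimates in nanoseconds (illustrative values only).
struct CostModel {
    double compute_ns;   // average cost of one arithmetic operation
    double comm_ns;      // average cost of one core-to-core message
    double memory_ns;    // average cost of one off-chip memory access
};

// Operation counts extracted from an application for a given problem size.
struct OpCounts {
    uint64_t compute_ops;
    uint64_t comm_ops;
    uint64_t memory_ops;
};

// Total estimated run time: sum of (count x cost) over the three categories.
static double estimate_runtime_ns(const OpCounts& n, const CostModel& c) {
    return n.compute_ops * c.compute_ns
         + n.comm_ops    * c.comm_ns
         + n.memory_ops  * c.memory_ns;
}

int main() {
    OpCounts counts{1'000'000'000, 5'000'000, 2'000'000};
    CostModel mesh{1.0, 126.0, 100.0};  // distance-dependent communication cost
    CostModel atac{1.0, 3.0, 100.0};    // flat broadcast latency
    double t_mesh = estimate_runtime_ns(counts, mesh);
    double t_atac = estimate_runtime_ns(counts, atac);
    std::printf("estimated speedup = %.2fx\n", t_mesh / t_atac);
    return 0;
}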

The Snooping Shared Memory application represents an abstract application that performs computation and makes memory accesses that sometimes result in cache coherence operations. The model includes probabilities that each instruction of an application will induce different types of coherence traffic. Modeled operations include reads and writes to a local region of cores as well as the entire chip. The time required to resolve these operations is added to the time required to execute the instructions.
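A sketch of how such per-instruction probabilities could be folded into the runtime model above (the probabilities and costs below are placeholders for illustration, not values from the study):

#include <cstdio>

// Probability that a single instruction triggers each kind of coherence
// traffic, with the modeled cost of resolving it (placeholder values).
struct CoherenceModel {
    double p_local_read,   cost_local_read_ns;    // read resolved within a region
    double p_local_write,  cost_local_write_ns;   // write invalidating a region
    double p_global_read,  cost_global_read_ns;   // read requiring chip-wide traffic
    double p_global_write, cost_global_write_ns;  // write requiring chip-wide traffic
};

// Expected coherence time added per executed instruction.
static double expected_coherence_ns(const CoherenceModel& m) {
    return m.p_local_read   * m.cost_local_read_ns
         + m.p_local_write  * m.cost_local_write_ns
         + m.p_global_read  * m.cost_global_read_ns
         + m.p_global_write * m.cost_global_write_ns;
}

int main() {
    CoherenceModel m{0.02, 10.0, 0.01, 20.0, 0.005, 126.0, 0.002, 252.0};
    double per_instr_ns = 1.0 + expected_coherence_ns(m);  // 1 ns base cost per instruction
    std::printf("modeled ns per instruction = %.3f\n", per_instr_ns);
    return 0;
}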

Ease of Programming The primary goal of the ATAC project is to develop ways in which high performance can be achieved on ATAC with only modest programmer effort.

“Lines of code” is one possible measure of programmer effort that has been used in the past, and we present some early results based on that metric. Although it is not the best measure of programming ease, it was easier to gather for our initial results than the more sophisticated methods described in our proposed research, such as programming time. Lines of code measures the amount of code a programmer must write to implement a particular application. Figure 7 compares ATAC with a mesh-based multicore architecture for four applications: vector addition, jacobi relaxation, leader election, and barrier synchronization. Even for these relatively simple applications, the ATAC implementations require less code than the same programs written for a mesh.

Proposed Research and Experimental ATAC System
The ATAC project proposes to build a prototype computer system, including a detailed system simulator, a compiler system, a runtime system, a programming model, and associated APIs. These efforts will be driven by models of the optical components that we will develop.

ATAC Architecture The ATAC chip architecture will be developed, including the ATAC network hierarchy, the mesh network, and the processor-to-network interface, with support for broadcast, external memory, and I/O interfaces. This effort will also define the coding and clocking transmission scheme for the optical network. We will validate our assumption that message flow control and receiver-side filtering and buffering for the all-to-all broadcast network can be built with reasonable complexity, area, speed, and energy. Selected portions of the processor-to-network interface will be implemented in Verilog, synthesized, and prototyped in FPGAs to verify their feasibility.

Optical Interconnect Interfaces and Component Models As research progresses on the novel integrated photonic components, we will be developing and refining models of these components. This includes characteristics such as switching speed, propagation delay, propagation loss, insertion loss, energy consumption, and physical size. Based on these models, we will be designing the interfaces between the digital electronics of the core processors and the optical components of the broadcast interconnect. This will include some analog electrical circuitry as well as additional digital logic to pre-filter, sort, and buffer data moving between the processor pipeline and the optical network. As we gain additional information about the optical components and refine the architectural design, we will update the simulator’s functional and performance models to reflect these changes. Using these models, we will be able to accurately estimate the bit-error-rate, latency, on and off chip range, and footprint of the optical network and the performance, energy consumption, and physical size of an ATAC processor.
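For example, the component models might be captured as parameter records that the simulator's timing and energy models consume (a sketch; the fields mirror the characteristics listed above, and the values shown are placeholders rather than measured device data):

#include <cstdio>

// Illustrative parameter record for one class of photonic component.
struct PhotonicComponentModel {
    const char* name;
    double switching_time_ps;    // modulator/filter switching speed
    double propagation_delay_ps; // delay through the component
    double propagation_loss_dB;  // loss per pass or per unit length
    double insertion_loss_dB;    // loss added at each add/drop point
    double energy_per_bit_fJ;    // energy consumed per transmitted bit
    double footprint_um2;        // physical area on the die
};

// Placeholder entries; real numbers would come from the device research effort.
static const PhotonicComponentModel kModels[] = {
    {"ring_modulator", 50.0,  1.0, 0.1, 0.5, 100.0, 400.0},
    {"waveguide_mm",    0.0, 10.0, 0.3, 0.0,   0.0,   0.0},
    {"photodetector",  30.0,  1.0, 0.0, 0.1,  50.0, 100.0},
};

int main() {
    for (const auto& m : kModels)
        std::printf("%-16s %6.1f fJ/bit  %6.1f ps delay\n",
                    m.name, m.energy_per_bit_fJ, m.propagation_delay_ps);
    return 0;
}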

[Figure 7: bar chart, “Lines of Code Comparison,” showing lines of code (0 to 90) for ATAC versus a mesh implementation of vector add, jacobi relaxation, leader election, and barrier.]

Figure 7: Lines of code required to implement four algorithms on ATAC vs. a mesh-based multicore.

Pin-Based ATAC Simulator We will use simulation to evaluate and refine the ATAC architecture and its broadcast network using parallel applications. In collaboration with Robert Cohn's Pin group at Intel, we are using the Intel Pin dynamic binary instrumentation infrastructure to develop a massively parallel simulator for fast simulation of generic multicore systems with thousands of cores. The Pin-based simulator can be used to develop multicore applications, compilers, and operating systems, and to rapidly prototype and evaluate multicore architectural mechanisms such as the ATAC broadcast network. The ATAC simulator will incorporate mechanisms for energy modeling at both the chip level and the system level. We have already successfully created an early version of a multicore Pin simulator.

Pin allows one to insert extra code at specific points in the program at run time; the specific points and the code to be inserted at each point can be specified in a separate executable called a “Pintool.” Additionally, Pin allows function calls in the application to be replaced by calls to functions defined within the Pintool. The inserted/replaced code can be used to simulate new features: it can modify processor state, change program behavior or use a performance model to adjust a simulated clock. The simulator uses these features of Pin to implement architectural mechanisms and to model performance.
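For instance, the canonical instruction-counting Pintool from the Pin documentation illustrates the instrumentation pattern the simulator builds on (shown here only as a familiar example of the Pin API, not as ATAC simulator code):

#include "pin.H"
#include <iostream>

static UINT64 icount = 0;

// Analysis routine: the extra code inserted before every instruction.
VOID docount() { icount++; }

// Instrumentation routine: called by Pin for each instruction it encounters,
// deciding where the analysis code is inserted.
VOID Instruction(INS ins, VOID* v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Instructions executed: " << icount << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;        // parse Pin command-line options
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                        // never returns
    return 0;
}

A simulator Pintool uses the same hooks, but its inserted calls update per-core simulated clocks and replace application-level communication calls, as described next.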

The Pin-based simulator has been designed to take advantage of the parallelism of host architectures such as multicores or clusters. It models each core within the simulated system as a separate kernel thread, independently schedulable by the OS (Figure 8). The OS maps the threads to the hardware, enabling the simulator to exploit the available parallelism. Cores (threads) communicate using calls to a simple API that represents the intrinsic capabilities of the simulated architecture, e.g., broadcast or point-to-point message passing. The simulator replaces API calls within the application with calls to functions defined within the Pintool that implement the corresponding functionality and update the simulation clocks of the appropriate cores using a model of the communication cost. The implementation of the API functions within the simulator depends on what communication mechanisms are available on the host architecture. For example, inter-core communication may use buffers in shared memory for threads on a single machine, or sockets over Ethernet for threads running on different machines in a cluster.
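One way the host-dependent communication backends could be factored (a hypothetical sketch; the interface and class names are invented for illustration and a socket backend is omitted):

#include <cstdio>
#include <deque>
#include <mutex>
#include <vector>

// Abstract transport selected at simulator start-up based on the host platform
// (shared memory within one machine, sockets across a cluster).
struct Transport {
    virtual ~Transport() = default;
    virtual void send(int dest_core, const std::vector<char>& msg) = 0;
    virtual bool try_receive(int self_core, std::vector<char>& msg) = 0;
};

// Minimal shared-memory backend: one locked mailbox per simulated core.
class SharedMemoryTransport : public Transport {
public:
    explicit SharedMemoryTransport(int num_cores)
        : mailboxes_(num_cores), locks_(num_cores) {}
    void send(int dest_core, const std::vector<char>& msg) override {
        std::lock_guard<std::mutex> g(locks_[dest_core]);
        mailboxes_[dest_core].push_back(msg);
    }
    bool try_receive(int self_core, std::vector<char>& msg) override {
        std::lock_guard<std::mutex> g(locks_[self_core]);
        if (mailboxes_[self_core].empty()) return false;
        msg = mailboxes_[self_core].front();
        mailboxes_[self_core].pop_front();
        return true;
    }
private:
    std::vector<std::deque<std::vector<char>>> mailboxes_;
    std::vector<std::mutex> locks_;
};

// A broadcast is modeled as a send to every other core; the Pintool would also
// advance each destination core's simulated clock by the modeled latency.
void broadcast(Transport& t, int self, int num_cores, const std::vector<char>& msg) {
    for (int core = 0; core < num_cores; ++core)
        if (core != self) t.send(core, msg);
}

int main() {
    SharedMemoryTransport t(4);
    std::vector<char> msg = {'h', 'i'};
    broadcast(t, /*self=*/0, /*num_cores=*/4, msg);
    std::vector<char> out;
    bool got = t.try_receive(/*self_core=*/2, out);
    std::printf("core 2 received %zu bytes (%d)\n", out.size(), (int)got);
    return 0;
}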

We have chosen to base our multicore simulator on Pin because it offers several advantages over creating our own simulator from scratch, as we did in the Raw project. First, the Pin infrastructure is reliable: it is mature, robust, and well supported. Second, it is high performance: it natively executes application code on the host hardware rather than interpreting it. Third, using Pin shortens our simulator toolchain development time: it allows us to use existing tools for compiling multicore applications (gcc, binutils, etc.) instead of developing them ourselves. Its major drawback is that it only allows us to model x86 processors. We believe this is not a significant issue because in future massive multicores the specifics of the core and ISA become secondary to global communications and the memory and I/O systems.

ATAC API/Language Constructs As the ATAC project heavily emphasizes ease of programming, a significant part of the project will involve the development of programming APIs and high-level language constructs. The goal of this development is to enable programmers to quickly implement reasonably complex parallel algorithms on ATAC using straightforward implementations and relatively minimal effort, while still achieving excellent performance. This goal will be achieved in part by having the API and language constructs do the “heavy lifting”: they will handle message management and shared-memory coherence protocols, and will automatically choose between the optical broadcast network and the electrical mesh network depending on traffic patterns and current network congestion.
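To make the intent concrete, a user-level fragment might look like the following (the primitives atac_core_id, atac_broadcast, and atac_receive are names we invented for illustration; the real API is part of the proposed research, and the stubs exist only so the sketch compiles):

#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical ATAC primitives. Stub implementations are provided only so this
// sketch runs; a real system would map these onto the optical broadcast network
// and the electrical mesh.
static double g_mailbox = 0.0;
static int  atac_core_id() { return 0; }
static void atac_broadcast(const void* buf, std::size_t len) { std::memcpy(&g_mailbox, buf, len); }
static void atac_receive(void* buf, std::size_t len)         { std::memcpy(buf, &g_mailbox, len); }

// Distribute a value fetched from DRAM by one core to every other core,
// instead of having every core perform its own DRAM access.
void share_parameter(double* param) {
    if (atac_core_id() == 0) {
        atac_broadcast(param, sizeof(*param));   // core 0 reads and broadcasts
    } else {
        atac_receive(param, sizeof(*param));     // all other cores consume it
    }
}

int main() {
    double p = 3.14;
    share_parameter(&p);
    std::printf("parameter = %f\n", p);
    return 0;
}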

The ATAC Compiler The ATAC compiler will compile programs written using the aforementioned high-level language constructs into assembly code suitable for the ATAC hardware. It will generate the low-level code needed to send and receive on the optical broadcast network and the electrical mesh network. It will also incorporate profile information gathered through an application profiling mechanism into the compilation process.

Figure 8: Mapping of simulated cores to threads and physical cores.


Applications The ATAC project will include a significant application development effort to help develop and test our new API. The applications to be implemented will include standard benchmark suites (e.g., SPEC, SPLASH [59]), stream-based multimedia codes (e.g., MediaBench II, video encode/decode), and scientific codes (e.g., FFT, N-body simulation). These applications will also be used to assess programmer productivity and architectural performance.

Programmer Productivity We will assess the programmer productivity benefits of ATAC by comparing it to traditional mesh-style multicores. The ATAC ease of use study will use the following three metrics to quantify programming effort:

1. Lines of code, and lines of communication code.
2. Programming time through a user study. Users will code up several simple benchmarks in C for a traditional mesh architecture, and in the ATAC API for the ATAC architecture. Time to first result and time to achieve a given level of performance will be measured.
3. Programming gap. This metric will compare the variance in performance between a quick implementation and an optimized implementation of a benchmark; easy-to-program architectures will show a smaller gap between the two (one way this could be quantified is sketched below).
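For example, the programming gap for a benchmark could be defined as the ratio of optimized to naive performance (this particular formalization is our illustration, not a definition taken from the study design):

\[ \text{gap} = \frac{P_{\text{optimized}}}{P_{\text{naive}}} \]

where \(P\) denotes achieved performance; a gap close to 1 indicates that a straightforward implementation already delivers most of the attainable performance, which is the property an easy-to-program architecture should exhibit.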

Broader Impact of Proposed Research
The ATAC project seeks to improve the programmability of large-scale multicore processors through a combination of architectural features and programming language development. This work has the potential to change the way multicore processors are designed and manufactured by demonstrating the value of integrated on-chip photonic devices. This will spur additional research in the area of on-chip photonics and speed the adoption of this innovative technology.

By making high-performance multicore programming easier, it also has the potential to solve an important challenge facing the computer industry today. Although processor manufacturers have realized that their future products must be multicore, no one has a practical solution that allows the average programmer to harness the power of these processors. Without further advances in this area, only specially trained expert programmers will be able to write high-performance software. Thus, either application performance will cease to improve or software development costs will skyrocket as more highly trained programmers are required. By simplifying programming, the ATAC architecture will allow the computer industry to continue to produce high-performance software using the existing base of programmers, even though the underlying hardware will be somewhat more sophisticated.

Besides the potential long-term benefits to the computer industry and thereby society at large, this project will have a more immediate impact on the community of multicore researchers and educators. The results of our research will be published in mainstream journals and conferences, inviting acceptance or criticism from a large community of researchers and industry experts. In addition, we plan to make much of the infrastructure developed in our project publicly available for use by others, including our new API and languages as well as our multicore simulator in open-source form. Our simulator infrastructure will need to be flexible enough to model a variety of multicore architectures to allow comparisons between the ATAC design and other approaches. Therefore, it will be useful to many other multicore researchers, who will be able to modify it for their own experiments. We hope that our simulator infrastructure will become the de facto standard used by researchers across the world to create their own multicore simulators.

As with all of our previous projects, both graduate and undergraduate education will be an integral part of the ATAC project. Graduate students form the core of our research team, working closely with each other as well as with faculty investigators, postdoctoral researchers, and industry experts. In addition to graduate students, we always include a number of undergraduates participating in MIT's UROP (Undergraduate Research Opportunities) program. These graduate and undergraduate students will be directly involved in the proposed work, learning the techniques and challenges of multicore system building and programming. Beyond directly involving students in research, this project will influence a larger group of students through graduate-level courses taught by the PIs at MIT. These courses typically include hot research topics and reflect the current research of the PIs, exposing students to cutting-edge ideas and tools, such as the massive multicore simulator. This project will thereby help train the next generation of multicore researchers, engineers, and programmers.


REFERENCES

1. Communications Technology Roadmap, The Microphotonics Center, Massachusetts Institute of Technology, 2005. http://mph-roadmap.mit.edu/

2. Anant Agarwal, “Raw Computation,” Scientific American, vol. 281, no. 2, pp. 44-47, August 1999. http://cag.csail.mit.edu/raw/documents/RAW_Computation_SciAm.pdf

3. Anant Agarwal, “Limits on Interconnection Network Performance,” IEEE Transactions on Parallel and Distributed Systems, October 1991.

4. Michael Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Jae-Wook Lee, Paul Johnson, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal, “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs,” IEEE Micro, April 2002. http://cag.csail.mit.edu/raw/documents/ieee-micro-2002.pdf

5. Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal, “Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams,” Proc. of the 31st International Symposium on Computer Architecture (ISCA’04), pp. 2-13, June 2004. http://cag.csail.mit.edu/raw/documents/raw_isca_2004.pdf

6. Elliot Waingold, Michael Taylor, Devabhaktuni Srikrishna, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, and Anant Agarwal, “Baring it all to software: Raw machines,” IEEE Computer, pp. 86-93, September 1997. http://cag.csail.mit.edu/raw/documents/Waingold-Computer-1997.pdf

7. Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul Johnson, Walter Lee, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Saman Amarasinghe, and Anant Agarwal, “A 16-issue multiple-program-counter microprocessor with point-to-point scalar operand network,” Proc. of the IEEE International Solid-State Circuits Conference, February 2003. http://cag.csail.mit.edu/raw/documents/isscc_2003_paper.pdf

8. Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, and Anant Agarwal, “Scalar Operand Networks: Design, Implementation, Analysis,” IEEE Transactions on Parallel and Distributed Systems (Special Issue on On-chip Networks), February 2005. Also available as MIT/CSAIL Technical Report MIT-CSAIL-TR-2004-038, June 2004. http://cag.csail.mit.edu/raw/documents/son-tm.pdf

9. Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe, “Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine,” Proc. of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), October 4-7, 1998. http://cag.csail.mit.edu/raw/documents/Lee-ASPLOS-1998.pdf

10. Theodoros Konstantakopoulos, Jonathan Eastep, James Psota, and Anant Agarwal, “Energy Scalability of On-Chip Interconnection Networks in Multicore Architectures,” MIT CSAIL Technical Report, November 2007. http://cag.csail.mit.edu/raw/documents/Konstantakopoulos-Energy-2007.pdf

11. Rajeev Barua, Walter Lee, Saman Amarasinghe, Anant Agarwal, “Compiler Support for Scalable and Efficient Memory Systems,” IEEE Transactions on Computers (Special Issue on Advances in High Performance Memory Systems), vol. 50 no. 11, pp. 1234-47, Nov 2001. http://cag.csail.mit.edu/raw/documents/Barua-Computer01.pdf

12. Matthew Frank, C. Andras Moritz, Benjamin Greenwald, Saman Amarasinghe, and Anant Agarwal, “SUDS: Primitive Mechanisms for Memory Dependence Speculation,” MIT/LCS Technical Memo LCS-TM-591, January 6, 1999.


13. Michael Gordon, William Thies, and Saman Amarasinghe, “Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs,” Proc. of the Twelfth International Conference on Architectural Support for Programming Languages and Operating Systems, October, 2006. http://cag.lcs.mit.edu/commit/papers/06/gordon-asplos06.pdf

14. Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, Ken Mackenzie, and Donald Yeung, “The MIT Alewife Machine: Architecture and Performance,” Proc. of the 22nd International Symposium on Computer Architecture (ISCA’95), pp. 2-13, June 1995.

15. Anant Agarwal, John Kubiatowicz, David Kranz, Beng-Hong Lim, Donald Yeung, Godfrey D'Souza, and Mike Parkin, “Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors,” IEEE Micro, pages 48-61, June 1993.

16. V. Nguyen, T. Montalbo, C. Manolatou, A. Agarwal, C.-Y. Hong, J. Yasaitis, L. C. Kimerling, and J. Michel, "Silicon-based highly-efficient fiber-to-waveguide coupler for high index contrast systems," Applied Physics Letters, vol. 88, pp. 081112, 2006.

17. K. K. Lee, D. R. Lim, D. Pan, C. Hoepfner, W.-Y. Oh, K. Wada, L. C. Kimerling, K. P. Yap, and M. T. Doan, "Mode transformer for miniaturized optical circuits," Optics Letters, vol. 30, pp. 498, 2005.

18. S. Akiyama, M. A. Popovic, P. T. Rakich, K. Wada, J. Michel, H. A. Haus, E. P. Ippen, and L. C. Kimerling, "Air trench bends and splitters for dense optical integration in low index contrast," Journal of Lightwave Technology, vol. 23, pp. 2271, 2005.

19. Shoji Akiyama, Miloš Popović, Kazumi Wada, Jürgen Michel, Lionel C. Kimerling, and Hermann A. Haus, “Air Trench Waveguide Bends for Dense Integration for Low and Middle Index Contrast Integrated Optics”, Journal of Lightwave Technology 23, 2271 (2005).

20. M. R. Watts and H. A. Haus, “Integrated mode-evolution-based polarization rotators,” Optics Letters, vol. 30, no. 2, pp. 138-140, 2005.

21. M. R. Watts, H. A. Haus, and E. P. Ippen, "Integrated mode-evolution-based polarization splitters," Optics Letters, vol. 30, pp. 967-969, 2005.

22. L.C. Kimerling, “Silicon Microphotonics,” in Interconnect Technology and Design for Gigascale Integration, J. Davis and J.D. Miendl, Eds. (Kluwer Academic Publishers, Boston, 2003) p. 383.

23. L. C. Kimerling, L. Dal Negro, S. Saini, Y. Yi, D. Ahn, S. Akiyama, D. Cannon, J. Liu, J. G. Sandland, D. Sparacin, J. Michel, K. Wada, and M. R. Watts, “Monolithic Silicon Microphotonics,” in Silicon Photonics, L. Pavesi and D. J. Lockwood, Eds., Springer-Verlag, Berlin, pp. 89-119, 2004.

24. S. Saini, J. Michel, and L. C. Kimerling, “Index Contrast Scaling for Optical Amplifiers,” Journal of Lightwave Technology, 21 (10), 2368 (2003).

25. L. C. Kimerling, D. Ahn, A. B. Apsel, M. Beals, D. Carothers, Y.-K. Chen, T. Conway, D. M. Gill, M. Grove, C.-Y. Hong, M. Lipson, J. Liu, J. Michel, D. Pan, S. S. Patel, A. T. Pomerene, M. Rasras, D. K. Sparacin, K.-Y. Tu, A. E. White, and C. W. Wong, “Electronic-photonic integrated circuits on the CMOS platform”, Silicon Photonics, Joel A. Kubby and Graham T. Reed, eds. Proc. of SPIE Vol. 6125, 612502 (2006).

26. I. Dosunmu, D. D. Cannon, M. K. Emsley, L. C. Kimerling, and M. S. Unlu, "High-speed resonant cavity enhanced Ge photodetectors on reflecting Si substrates for 1550-nm operation," IEEE Photonics Technology Letters, vol. 17, pp. 175, 2005.

27. J. Liu, J. Michel, W. Giziewicz, D. Pan, K. Wada, D.D. Cannon, S. Jongthammanurak, D.T. Danielson, L.C. Kimerling, J. Chen, F.O. Ilday, F.X. Kaertner, J. Yasaitis, “High-performance, tensile-strained Ge p-i-n photodetectors on a Si Platform” Applied Physics Letters, 87 (10), p. 103501-1-3, 2005.

28. D.K. Sparacin, S.J. Spector, L.C. Kimerling, “Silicon Waveguide Sidewall Smoothing by Wet Chemical Oxidation” IEEE Journal of Lightwave Technology, 23 (8), p. 2455-2461 (2005).


29. Samerkhae Jongthammanurak, Jifeng Liu, Kazumi Wada, Douglas D. Cannon , David T. Danielson, Dong Pan, Lionel C. Kimerling, and Jurgen Michel, “Large Electro-optic Effect in Tensile Strained Ge-on-Si Films”, Applied Physics Letters 89(16) 161115 (2006).

30. Q. Xu, B. Schmidt, S. Pradhan, M. Lipson, “Micrometer-scale silicon electro-optic modulator”, Nature, 435 (7040), p. 325-327, 2005.

31. D.K. Sparacin, J.P. Lock, C. Hong, K.K. Gleason, L.C. Kimerling, J. Michel, “Trimming of Microring Resonators Using Photo-Oxidation of a Plasma-Polymerized Organosilane Cladding Material” Optics Letters, 30 (17), p. 2251-2253, 2005.

32. C. A. Moritz, D. Yeung, and A. Agarwal, “SimpleFit: A Framework for Analyzing Design Trade-offs in Raw Architectures,” IEEE Transactions on Parallel and Distributed Systems, 12, p. 730 (2001). http://cag.csail.mit.edu/raw/documents/Moritz-SimpleFit-TPDS-2001.pdf

33. Kimberly Kuo, Rodric Rabbah, and Saman Amarasinghe, “A Productive Programming Environment for Stream Computing,” Second Workshop on Productivity and Performance in High-End Computing, San Francisco, February 13, 2005, pp. 35-44. http://cag.lcs.mit.edu/commit/papers/05/kkuo-sdt.pdf

34. William Thies, Michal Karczmarek, and Saman Amarasinghe, “StreamIt: A Language for Streaming Applications,” Proc. of the International Conference of Compiler Construction, 2002.

35. Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, and Saman Amarasinghe, “A Stream Compiler for Communication-Exposed Architectures,” Proc. of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), 2002.

36. Andrew A. Lamb, William Thies, and Saman Amarasinghe, “Linear Analysis and Optimization of Stream Programs,” Proc. of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, San Diego, California, June, 2003.

37. Janis Sermulins, William Thies, Rodric Rabbah, and Saman Amarasinghe, “Cache Aware Optimization of Stream Programs,” Proc. of the 2005 Conference on Languages, Compilers, and Tools for Embedded Systems, Chicago, Illinois, June 2005.

38. William Thies, Michal Karczmarek, Janis Sermulins, Rodric Rabbah, and Saman Amarasinghe, “Teleport Messaging for Distributed Stream Programs,” Proc. of the ACM SIGPLAN 2005 Symposium on Principles and Practice of Parallel Programming, Chicago, Illinois, June, 2005.

39. Matthew Drake, Henry Hoffman, Rodric Rabbah, and Saman Amarasinghe, “MPEG-2 Decoding in a Stream Programming Language,” Proc. of the 20th IEEE International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, April, 2006.

40. Lorin Hochstein, Jeff Carver, Forrest Shull, Sima Asgari, Victor Basili, Jeffrey K. Hollingsworth, and Marvin V. Zelkowitz, “Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers,” Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2005.

41. H. Sutter and J. Larus, “Software and the Concurrency Revolution,” ACM Queue, vol. 3, no. 7, 2005, pp. 54–62.

42. M. Creeger, “Multicore CPUs for the Masses,” ACM Queue, vol. 3, no. 7, 2005, pp. 63–64.

43. D. Butenhof, Programming with POSIX Threads, 1997, Addison Wesley Professional.

44. OpenMP Application Program Interface Specification, OpenMP Architecture Review Board, 1997; http://www.openmp.org/

45. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), Usenix Assoc., 2004, pp. 137–150.


46. MPI: A Message-Passing Interface Standard, Message Passing Interface Forum, 1994; http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html.

47. Edward A. Lee, “The Problem with Threads,” IEEE Computer, vol. 39, no. 5, pp. 33-42, May 2006.

48. W. Thies, M. Karczmarek, M. Gordon, D. Maze, J. Wong, H. Ho, M. Brown, and S. Amarasinghe, “StreamIt: A Compiler for Streaming Applications,” MIT-LCS Technical Memo TM-622, Cambridge, MA. December 2001.

49. I. Buck, 2004. Brook specification v.0.2. Tech. Rep. CSTR 2003-04 10/31/03 12/5/03, Stanford University.

50. M. L. Chu and S. A. Mahlke, “Compiler-directed data partitioning for multicluster processors,” Proc. of the Fourth International Symposium on Code Generation and Optimization (CGO 2006), March 2006.

51. G. Ottoni and D. I. August, “Global Multi-Threaded Instruction Scheduling,” Proc. of the 40th International Symposium on Microarchitecture (MICRO-40), December 2007.

52. L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, “Transactional Memory Coherence and Consistency,” Proc. of the 31st International Symposium on Computer Architecture (ISCA’04). p102, June 2004.

53. R. Rajwar, M. Herlihy, and K. Lai, “Virtualizing Transactional Memory,” Proc. of the 32nd International Symposium on Computer Architecture (ISCA’05). pp. 494—505, June 2005.

54. IA-64 Application Instruction Set Architecture Guide, Revision 1.0, 1999.

55. R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman, “A VLIW architecture for a trace scheduling compiler,” Proc. of the Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-II), pp. 180-192, 1987.

56. J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das, “Evaluating the Imagine Stream Architecture,” Proc. of the 31st International Symposium on Computer Architecture (ISCA’04), June 2004.

57. L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese, “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” Proc. of the 27th International Symposium on Computer Architecture (ISCA-2000), June 2000.

58. K. Sankaralingam, R. Nagarajan, P. Gratz, R. Desikan, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, W. Yoder, R. McDonald, S.W. Keckler, and D.C. Burger, "The Distributed Microarchitecture of the TRIPS Prototype Processor," Proc. of the 39th International Symposium on Microarchitecture (MICRO-39), December, 2006.

59. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta, “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. of the 22nd International Symposium on Computer Architecture, pp 24-36, June 1995.

60. Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick, “The Landscape of Parallel Computing Research: A View from Berkeley,” University of California, Berkeley, Technical Report No. UCB/EECS-2006-183, December 18, 2006.