

Page 1: 1 Silicon Photonics in Post Moore’s Law Era: Technological ...lightwave.ee.columbia.edu/Publications/Wen2016_PMES.pdf · Silicon Photonics in Post Moore’s Law Era: Technological


Silicon Photonics in Post Moore’s Law Era: Technological and Architectural Implications

Ke Wen*, Sébastien Rumley, Payman Samadi, Christine P. Chen, Keren Bergman Department of Electrical Engineering, Columbia University, New York, *[email protected]

1 A RIPPLE EFFECT FROM CACHE REDUCTION

Moore’s law is ending as we enter the last years of shrinking transistors. Chip designers will thus have to use the available transistors more effectively. A few signs of this trend are already visible. As shown in Figure 1, the area devoted to cache on a CPU chip is decreasing, both in terms of (a) MB per GFLOPS and (b) normalized chip area, defined as cache size (MB) × feature size (nm²) / die size (mm²). In particular, a sharp fall (Fig. 1a) is clear as the industry enters the many-core era (around 2013). Interestingly, this cache cliff coincides with the time when Moore’s Law was said to be dead in the economic sense: since 2013, the number of transistors bought per dollar has stagnated [1]. The fact that chipmakers are willingly trading cache area for more FLOPS, along with the rise of data-centric throughput computing [2], calls for significantly higher off-chip memory bandwidth. Fig. 1c shows this trend: the sharp increase of the off-chip memory bandwidth matches the cache cliff of Fig. 1a. This increase, however, is still not enough to balance the FLOPS increase, as the bytes-per-flop ratio continues to drift away from the ideal point.

There is more to this grim description. The memory bandwidth increase is also rapidly stressing the pin count limit of the processor package. For example, KNL requires 3647 pins in the socket, plus 1024 pins in the interposer for each of the eight on-package memory stacks. The pin density of a standard chip package, however, cannot scale indefinitely. The ever-increasing bandwidth demand thus requires a more efficient chip I/O technology for processors beyond Moore’s Law.
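The pin-count pressure and the bytes-per-flop balance discussed above reduce to simple arithmetic. The Python sketch below uses the KNL pin figures quoted in the text. The function names, the reading of "feature size (nm²)" as the square of the feature size, and the sample bandwidth and FLOPS values are illustrative assumptions, not data taken from Fig. 1.

```python
# Sketch of the Section 1 arithmetic. Pin counts are the KNL figures quoted
# in the text; all other inputs are illustrative assumptions.

def total_package_pins(socket_pins: int, pins_per_stack: int, stacks: int) -> int:
    """Socket pins plus interposer pins for all on-package memory stacks."""
    return socket_pins + pins_per_stack * stacks


def bytes_per_flop(mem_bandwidth_gb_s: float, peak_gflops: float) -> float:
    """Off-chip memory bandwidth (GB/s) divided by peak compute (GFLOPS)."""
    return mem_bandwidth_gb_s / peak_gflops


def normalized_cache_area(cache_mb: float, feature_nm: float, die_mm2: float) -> float:
    """Fig. 1b metric, assuming 'feature size (nm^2)' means feature size squared."""
    return cache_mb * feature_nm ** 2 / die_mm2


knl_pins = total_package_pins(socket_pins=3647, pins_per_stack=1024, stacks=8)
print(knl_pins)  # 11839

# Hypothetical processor: 400 GB/s of memory bandwidth, 2000 GFLOPS peak.
print(bytes_per_flop(400.0, 2000.0))  # 0.2
```

Under these figures, the eight memory stacks alone account for more than twice the pins of the socket itself.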

2 SILICON PHOTONICS FOR CHIP I/O

Compatible with CMOS lithography fabrication, silicon photonics (SiP) has become one of the leading solutions to the aforementioned chip I/O issue. An example of “extending the power of silicon to new arenas” [3], SiP leverages the transparency of silicon to light of 1.2~5 μm wavelength for high-speed transmission. Each SiP waveguide can support terabit/s bandwidth, orders of magnitude higher than what can be achieved with conventional electrical I/O. For example, while an 8-channel (4-layer) High Bandwidth Memory (HBM) cube requires a 1024-bit bus for 100 GB/s, a single SiP waveguide can provide the same bandwidth with 32 wavelengths, each at 25 Gb/s.
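The comparison between the 1024-bit HBM bus and a single wavelength-multiplexed waveguide can be checked with a few lines of Python. This is a minimal sketch using only the numbers quoted above; the function names are hypothetical.

```python
# WDM link arithmetic from the text; function names are illustrative.

def wdm_bandwidth_gbit_s(wavelengths: int, rate_gbit_s: float) -> float:
    """Aggregate bandwidth of one waveguide carrying several wavelengths."""
    return wavelengths * rate_gbit_s


def per_wire_rate_gbit_s(bus_bandwidth_gbyte_s: float, bus_width_bits: int) -> float:
    """Data rate each wire of a parallel electrical bus must sustain."""
    return bus_bandwidth_gbyte_s * 8 / bus_width_bits


sip = wdm_bandwidth_gbit_s(wavelengths=32, rate_gbit_s=25.0)
print(sip, "Gb/s =", sip / 8, "GB/s")      # 800.0 Gb/s = 100.0 GB/s
print(per_wire_rate_gbit_s(100.0, 1024))   # 0.78125
```

One waveguide thus replaces 1024 electrical wires that each run at under 1 Gb/s.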

Silicon photonics is compatible with the silicon interposers used to carry processor and memory chips, forming a high-bandwidth chip-to-chip interconnect on package (Fig. 1d). Components such as waveguides, modulators, photodetectors and switches can be fabricated directly on the silicon interposer at low cost. The SiP switch, controllable by the processor, can provide flexible and transparent connection between any memory stack and any processor interface. A SiP interposer fabricated by PECST of Japan was reported to achieve a bandwidth density of 6.6 Tb/s/cm² [4].

Fig. 1. (a) Cache size normalized by GFLOPS; (b) normalized cache area; (c) off-chip bandwidth; (d) SiP interposer based architecture; (e) SiP-enabled high-capacity HBM-based node; (f) alleviating hotspot; (g) optimizing core-memory affinity; (h) Flexfly network. [Figure not reproduced. Plot axes: cache size (MB) per GFLOPS; normalized cache area, cache size (MB) × feature size (nm²) / die area (mm²); off-chip memory bandwidth (GB/s) and memory bandwidth per GFLOPS (bytes/flop). Processors plotted: Xeon X5670 (Tianhe), IBM Power7, IBM BQC (Sequoia), AMD Opteron 6274 (Titan), Intel Xeon E5-2692 v2 (Tianhe-2), KNC/Intel Xeon Phi, KNL (Cori), SW26010 (Sunway TaihuLight). Panel (e) annotations: 6 HBM cubes per interposer; total capacity 192 GB; total bandwidth 6 TB/s. Panel (h) annotations: insertion of multiple low-radix SiP switches; fixed Dragonfly versus one of many Flexfly instances.]

Another important aspect of SiP is extending high-bandwidth I/O off package, enabled by efficient coupling between waveguides and fibers. This is a much-needed capability, as the interposer area (about 700 mm²) will limit the capacity of on-package memory (OPM). The current solution is to pair the fast OPM with slow, off-package DRAM. Such a small-fast, large-slow dichotomy may significantly complicate application programming and memory management. The distance-independent transmission of photonics can solve this problem, enabling a uniform, high-capacity HBM architecture as shown in Fig. 1e. With 1 Tb/s bandwidth per fiber, four fibers can supply the 256 GB/s bandwidth needed by an HBM2 cube. With 24 fibers per coupling assembly and four such assemblies, an interposer hosting processors can connect to a total of 24 HBM cubes, accounting for 192 GB of memory capacity and 6 TB/s of aggregate bandwidth. SiP technologies can thus enable a flat, easy-to-manage memory hierarchy.
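The fiber budget for the off-package HBM architecture follows directly from the quoted figures. The sketch below reproduces it in Python. The 8 GB per-cube capacity is an assumption inferred from the stated totals (192 GB over 24 cubes), and the constant names are hypothetical.

```python
# Off-package HBM budget from the text. The 8 GB per-cube capacity is an
# assumption inferred from the quoted totals (192 GB over 24 cubes).

FIBER_TBIT_S = 1.0          # bandwidth per fiber (from the text)
FIBERS_PER_CUBE = 4         # fibers allotted to each HBM2 cube (from the text)
FIBERS_PER_ASSEMBLY = 24    # fibers per coupling assembly (from the text)
ASSEMBLIES = 4              # coupling assemblies (from the text)
CUBE_GBYTE_S = 256          # HBM2 cube bandwidth (from the text)
CUBE_CAPACITY_GB = 8        # assumed, see comment above

cubes = FIBERS_PER_ASSEMBLY * ASSEMBLIES // FIBERS_PER_CUBE
fiber_supply_gbyte_s = FIBERS_PER_CUBE * FIBER_TBIT_S * 1000 / 8

print(cubes)                                 # 24
print(fiber_supply_gbyte_s >= CUBE_GBYTE_S)  # True
print(cubes * CUBE_CAPACITY_GB)              # 192
print(cubes * CUBE_GBYTE_S)                  # 6144
```

The aggregate of 6144 GB/s corresponds to the roughly 6 TB/s quoted above.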

3 ARCHITECTURAL IMPLICATIONS

3.1 Node Level: Optimizing Memory Locality

The benefit of silicon photonics is not limited to sheer bandwidth growth. As mentioned earlier, the reconfigurable SiP switch can connect any memory cube to any processor interface. This functionality can help deliver memory data precisely to the consumer cores without traversing the network on chip (NoC), effectively mitigating the NUMA problem of the many-core era [5] [6]. As shown in Fig. 1f, a reconfiguration of the SiP switch can reduce the NoC hop count from 10 (dashed yellow, as in the native connection) to 1 (solid yellow). This hop decrease immediately translates into a few tens of nanoseconds less latency and a significant drop in energy dissipation. Routing high-speed memory data out of the NoC plane may also free NoC bandwidth for more core-to-core communication, which matters as the “MPI everywhere” trend (assigning an MPI process to each core) emerges [7]. SiP waveguides with an ultra-low loss of 1.2 dB/m have been demonstrated [8], meaning nearly distance-independent energy consumption at chip scale, compared with 25 pJ per 64 bits per mm when moving data electrically on chip [2].
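To make the contrast between distance-dependent electrical transport and nearly distance-independent optical propagation concrete, the following Python sketch evaluates both quoted figures over a chip-scale distance. The 20 mm distance and the function names are illustrative assumptions. Only waveguide propagation loss is modeled; laser, modulator and receiver energy are not.

```python
# Electrical vs. optical data movement at chip scale, using the figures
# quoted in the text. The 20 mm distance is an illustrative assumption.

ELECTRICAL_PJ_PER_64BIT_MM = 25.0   # on-chip electrical transport [2]
WAVEGUIDE_LOSS_DB_PER_M = 1.2       # ultra-low-loss waveguide [8]


def electrical_energy_pj(distance_mm: float, bits: int = 64) -> float:
    """Energy to move `bits` electrically over `distance_mm`; linear in distance."""
    return ELECTRICAL_PJ_PER_64BIT_MM * (bits / 64) * distance_mm


def optical_power_remaining(distance_mm: float) -> float:
    """Fraction of optical power left after propagation loss alone."""
    loss_db = WAVEGUIDE_LOSS_DB_PER_M * distance_mm / 1000.0
    return 10 ** (-loss_db / 10)


d = 20.0  # mm, assumed chip-scale distance
print(electrical_energy_pj(d))                # 500.0
print(round(optical_power_remaining(d), 4))   # 0.9945
```

Over 20 mm the waveguide loses about half a percent of the optical power, while the electrical cost grows linearly to 500 pJ per 64-bit word.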

Another possibility is to use the SiP switch to alleviate the hotspot effect on the NoC when hotspot memory access happens (Fig. 1g). In this scenario, the SiP switch can use time-division multiplexing (TDM) to select the memory interface into which the data stream from the hotspot memory is injected, thus distributing the traffic across different NoC sections [6].
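The TDM selection described above can be pictured as a schedule that assigns each time slot of the hotspot stream to a different processor interface. The Python sketch below uses a plain round-robin policy, which is an assumption made for illustration; the text does not specify the scheduling policy, and the function name is hypothetical.

```python
# Round-robin TDM sketch: in each time slot the switch connects the hotspot
# memory to a different processor interface. This scheduling policy is an
# assumption for illustration; the text does not specify one.
from collections import Counter
from itertools import cycle


def tdm_schedule(interfaces: list[str], slots: int) -> list[str]:
    """Return the interface selected in each of `slots` consecutive time slots."""
    selector = cycle(interfaces)
    return [next(selector) for _ in range(slots)]


interfaces = ["intf 1", "intf 2", "intf 3", "intf 4"]  # labels as in Fig. 1
schedule = tdm_schedule(interfaces, slots=8)
print(schedule[:4])
print(Counter(schedule))  # each interface carries 2 of the 8 slots
```

With four interfaces, each NoC section receives a quarter of the hotspot traffic instead of one section receiving all of it.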

3.2 System Level: Flexible Topology

SiP switching can also be used to form flexible system-level topologies [9]. The need for a flexible topology stems from the diverse spectrum of applications that run on a supercomputer. The clear differences in their communication characteristics, in terms of neighboring relationships, traffic volume, etc., make it very difficult to find a “best-for-all” topology. SiP switching, in contrast, is capable of dynamically “rewiring” the connections among a set of electronic endpoints. These endpoints can be either compute nodes or electrical routers. The benefit is directing bandwidth to where it is needed without over-provisioning it [10]. Recently, a reconfigurable Dragonfly architecture utilizing small-radix SiP switches has been demonstrated [11]. The architecture, called Flexfly, is capable of concentrating the fully connected group-to-group links of Dragonfly into, for example, a thick ring-like topology (Fig. 1h). It is shown to help applications like GTC achieve a 1.8x speedup over conventional adaptive UGAL routing.
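The idea of concentrating all-to-all group links into a thick ring can be illustrated by counting links. The Python sketch below keeps each group's number of global links fixed and moves them onto its two ring neighbours. It is an illustration under simplifying assumptions (an odd number of groups, one link per group pair in the baseline) and not the Flexfly algorithm of [11]; real instances are constrained by the radix of the SiP switches. All function names are hypothetical.

```python
# Illustrative sketch only: redistribute each group's global links from an
# all-to-all Dragonfly pattern into a "thick ring". Names are hypothetical,
# and real Flexfly instances are constrained by the SiP switch radix.

def dragonfly_links(groups: int) -> dict[tuple[int, int], int]:
    """One global link between every pair of groups."""
    return {(a, b): 1 for a in range(groups) for b in range(a + 1, groups)}


def thick_ring_links(groups: int) -> dict[tuple[int, int], int]:
    """Same per-group link budget, concentrated on the two ring neighbours."""
    if groups < 3 or groups % 2 == 0:
        raise ValueError("sketch assumes an odd number of groups >= 3")
    width = (groups - 1) // 2  # links to each neighbour
    links = {}
    for a in range(groups):
        b = (a + 1) % groups
        links[(min(a, b), max(a, b))] = width
    return links


def links_per_group(links: dict[tuple[int, int], int], groups: int) -> list[int]:
    """Number of global links terminating at each group."""
    return [sum(n for (a, b), n in links.items() if g in (a, b)) for g in range(groups)]


G = 5
print(links_per_group(dragonfly_links(G), G))    # [4, 4, 4, 4, 4]
print(links_per_group(thick_ring_links(G), G))   # [4, 4, 4, 4, 4]
print(thick_ring_links(G))
```

Both topologies use the same ten global links for five groups, but the ring doubles the bandwidth between neighbouring groups, which favours applications with nearest-neighbour traffic.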

4 PHOTONIC-ELECTRONIC INTEGRATION

There are three methods for integrating SiP and electronic devices: front-end, back-end and hybrid integration.

In front-end integration, electronic and photonic devices are formed on the same layer. The advantage is that nanophotonics can piggyback on the same masks. However, the challenge remains to guide light with sufficient isolation, in particular with enough separation between the waveguide and the silicon substrate. While CMOS-SOI uses a thin buried oxide (BOX) of 200 nm, photonics SOI requires a BOX of 1 μm. Approaches that utilize a thicker BOX have been proposed [12, 13]. This may, however, reduce the heat dissipation capability of the electronics [14]. Methods that do not modify the standard CMOS process have thus been proposed [14-17]; these locally remove the underlying Si substrate to mitigate losses.

Back-end integration is another monolithic method [18, 19]. It allows deposition of sufficient isolation oxide on top of the existing CMOS-SOI. However, this method introduces additional back-end steps and thus extra cost. The back-end method may also face a stricter thermal budget in order to prevent damage to the CMOS electronics. As a result, engineers have to consider other materials for the photonic layer. To date, silicon nitride [18], amorphous silicon [20] and laser-annealed polysilicon [18] have proven to be feasible materials.

The hybrid integration method forms photonic and electronic circuits on separate chips and joins them through flip-chip bonding. As such, the photonic and electronic chips can each be optimized using different process flows. To date, hybrid integration is the choice of most SiP research and development groups. Signaling speeds of 25 Gb/s [4] and 50 Gb/s [21] have recently been demonstrated using flip-chip bonding.

5 CONCLUSION

The end of Moore’s Law comes at a time when efficient allocation of transistor real estate has become imperative for computing. The resulting cache reduction and the rise of data-centric throughput computing call for efficient off-chip, off-package data movement. Silicon photonics could be one of the promising solutions the computing world is looking for to continue performance growth. However, industry-level electronic-photonic integration and system co-design are yet to be realized, and manufacturing costs remain to be reduced. The new architectural implications of silicon photonics at both node level and system level also require further investigation into how to enable new dimensions of performance improvement.


ACKNOWLEDGMENTS

This work was supported by the U.S. Department of Energy Lawrence Berkeley National Laboratory under subcontract 7257488 and Sandia National Laboratories under contract PO 1319001.

REFERENCES

1. After Moore's Law. The Economist Technology Quarterly 2016; Available from: http://www.economist.com/technology-quarterly/2016-03-12/after-moores-law.

2. W.J. Dally. The end of denial architecture and the rise of throughput computing. in Keynote speech at Design Automation Conference. 2010.

3. Intel. Expanding Moore's Law, Fall 2002 Update. 2002; Available from: http://www.cc.gatech.edu/computing/nano/documents/Intel - Expanding Moore's Law.pdf.

4. Y. Arakawa, T. Nakamura, Y. Urino, and T. Fujita, Silicon photonics for next generation system integration platform. IEEE Communications Magazine, 2013. 51(3): p. 72-77.

5. D. Unat, T. Nguyen, W. Zhang, M.N. Farooqi, B. Bastem, G. Michelogiannakis, A. Almgren, and J. Shalf. TiDA: High-Level Programming Abstractions for Data Locality Management. in International Conference on High Performance Computing. 2016. Springer.

6. K. Wen, H. Guan, D.M. Calhoun, D. Donofrio, J. Shalf, and K. Bergman, Reconfigurable Silicon Photonic Memory Interconnect in the Many-Core Era, in IEEE High Performance Extreme Computing Conference (HPEC). 2016: Waltham, MA.

7. W. Gropp. MPI at Exascale: Challenges for Data Structures and Algorithms. in European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting. 2009. Springer.

8. J.F. Bauters, M.L. Davenport, M.J.R. Heck, J.K. Doylend, A. Chen, A.W. Fang, and J.E. Bowers, Silicon on ultra-low-loss waveguide photonic integration platform. Optics Express, 2013. 21(1): p. 544-555.

9. S. Rumley, D. Nikolova, R. Hendry, Q. Li, D. Calhoun, and K. Bergman, Silicon photonics for exascale systems. Journal of Lightwave Technology, 2015. 33(3): p. 547-562.

10. K. Wen, D. Calhoun, S. Rumley, X. Zhu, Y. Liu, L.W. Luo, R. Ding, T.B. Jones, M. Hochberg, and M. Lipson. Reuse distance based circuit replacement in silicon photonic interconnection networks for HPC. in 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects. 2014. IEEE.

11. K. Wen, P. Samadi, S. Rumley, C.P. Chen, Y. Shen, M. Bahadori, J. Wilke, and K. Bergman, Flexfly: Enabling a Reconfigurable Dragonfly Through Silicon Photonics, in The International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 2016: Salt Lake City, Utah.

12. S.K. Selvaraja, P. Jaenen, W. Bogaerts, D. Van Thourhout, P. Dumon, and R. Baets, Fabrication of photonic wire and crystal circuits in silicon-on-insulator using 193-nm optical lithography. Journal of Lightwave Technology, 2009. 27(18): p. 4076-4083.

13. Y. Vlasov, W.M.J. Green, and F. Xia, High-throughput silicon nanophotonic wavelength-insensitive switch for on-chip optical networks. Nature Photonics, 2008. 2(4): p. 242-246.

14. J.S. Orcutt, R.J. Ram, and V. Stojanović. Integration of silicon photonics into electronic processes. in SPIE OPTO. 2013. International Society for Optics and Photonics.

15. M. Georgas, B.R. Moss, C. Sun, J. Shainline, J.S. Orcutt, M. Wade, Y.H. Chen, K. Nammari, J.C. Leu, and A. Srinivasan. A monolithically-integrated optical transmitter and receiver in a zero-change 45nm SOI process. in 2014 Symposium on VLSI Circuits Digest of Technical Papers. 2014. IEEE.

16. J.S. Orcutt, A. Khilo, C.W. Holzwarth, M.A. Popović, H. Li, J. Sun, T. Bonifield, R. Hollingsworth, F.X. Kärtner, and H.I. Smith, Nanophotonic integration in state-of-the-art CMOS foundries. Optics Express, 2011. 19(3): p. 2335-2346.

17. J.S. Orcutt, B. Moss, C. Sun, J. Leu, M. Georgas, J. Shainline, E. Zgraggen, H. Li, J. Sun, and M. Weaver, Open foundry platform for high-performance electronic-photonic integration. Optics Express, 2012. 20(11): p. 12222-12232.

18. Y.H.D. Lee and M. Lipson, Back-end deposited silicon photonics for monolithic integration on CMOS. IEEE Journal of Selected Topics in Quantum Electronics, 2013. 19(2): p. 409-415.

19. I.A. Young, E. Mohammed, J.T.S. Liao, A.M. Kern, S. Palermo, B.A. Block, M.R. Reshotko, and P.L.D. Chang, Optical I/O technology for tera-scale computing. IEEE Journal of Solid-State Circuits, 2010. 45(1): p. 235-248.

20. K. Furuya, K. Nakanishi, R. Takei, E. Omoda, M. Suzuki, M. Okano, T. Kamei, M. Mori, and Y. Sakakibara, Nanometer-scale thickness control of amorphous silicon using isotropic wet-etching and low loss wire waveguide fabrication with the etched material. Applied Physics Letters, 2012. 100(25): p. 251108.

21. G. Denoyer, C. Cole, A. Santipo, R. Russo, C. Robinson, L. Li, Y. Zhou, B. Park, F. Boeuf, and S. Crémer, Hybrid silicon photonic circuits and transceiver for 50 Gb/s NRZ transmission over single-mode fiber. Journal of Lightwave Technology, 2015. 33(6): p. 1247-1254.