a scalable high-precision and high-throughput architecture ... · 30th ieee international...
TRANSCRIPT
Improving Emulation of Quantum Algorithms using Space-Efficient Hardware Architectures
Naveed Mahmud, and Esam El-Araby
University of Kansas (KU)
30th IEEE International Conference onApplication-specific Systems, Architectures and Processors
July 15-17, 2019Cornell Tech, New York
ASAP 2019
2 ASAP 2019 – July 16th, 2019
Outline
Introduction and MotivationRelated Work and BackgroundProposed WorkExperimental ResultsConclusions and Future Work
3 ASAP 2019 – July 16th, 2019
Introduction and Motivation
Why Quantum Computing? Efficient quantum algorithms Solving NP-hard problems Speedup over classical
Need for Quantum Emulation Difficulty of maintenance & control High-cost of access
E.g., academic hourly rate of $1,250 up to 499 annual hours
Verification and benchmarking Analysis of quantum algorithms Improving classical computing
paradigms
Emulation using FPGAs Greater speedup vs. SW Dynamic (reconfigurable) vs. fixed
architectures Exploiting parallelism Limitation → Scalability
source: https://learning.acm.org/techtalks/qiskit
source: https://learning.acm.org/techtalks/qiskit
4 ASAP 2019 – July 16th, 2019
Introduction and Motivation
Why Quantum Computing? Efficient quantum algorithms Solving NP-hard problems Speedup over classical
Need for Quantum Emulation Difficulty of maintenance & control High-cost of access
E.g., academic hourly rate of $1,250 up to 499 annual hours
Verification and benchmarking Analysis of quantum algorithms Improving classical computing
paradigms
Emulation using FPGAs Greater speedup vs. SW Dynamic (reconfigurable) vs. fixed
architectures Exploiting parallelism Limitation → Scalability
Google’s 72-qubit “Bristlecone” Intel’s 49-qubit “Tangle Lake” IBM-Q 50-qubit computer
D-Wave 2000QIonQ’s 79-qubit computerRigetti’s 16-qubit ASPEN-4
source: https://learning.acm.org/techtalks/qiskit
5 ASAP 2019 – July 16th, 2019
Outline
Introduction and MotivationRelated Work and BackgroundProposed WorkExperimental ResultsConclusions and Future Work
7 ASAP 2019 – July 16th, 2019
Related Work (Parallel SW Simulators)
Villalonga, et al., “Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation,” May 2019 Simulation of 7x7 and 11x11 random quantum circuits (RQCs) of depth 42 and 26 respectively. Summit supercomputer (ORNL, USA) with 4550 nodes 1.6 TB of non-volatile memory per node Power consumption of 7.3 MW
Li et al., “Quantum Supremacy Circuit Simulation on Sunway TaihuLight,” Aug. 2018 Simulation of 49-qubit random quantum circuits of depth of 55 Sunway supercomputer (NSC, China) with 131,072 nodes (32,768 CPUs) 1 PB total main memory
J. Chen, et al., “Classical Simulation of Intermediate-Size Quantum Circuits,” May 2018 Simulation of up to 144-qubit random quantum circuits of depth 27 Supercomputing cluster (Alibaba Group, China) with 131,072 nodes 8 GB memory per node
De Raedt et al., “Massively parallel quantum computer simulator eleven years later,” May 2018 Simulation of Shor’s algorithm using 48-qubits Various supercomputing platforms: IBM Blue Gene/Q (decommissioned), JURECA (Germany), K computer (Japan), Sunway TaihuLight (China) Up to 16-128 GB memory/node utilized
T. Jones, et al., “QuEST and High Performance Simulation of Quantum Computers,” May 2018 Simulation of random quantum circuits up to 38 qubits ARCUS supercomputer (ARCHER, UK) with 2048 nodes Up to 256 GB memory per node
List of quantum SW simulators
https://quantiki.org/wiki/list-qc-simulators
8 ASAP 2019 – July 16th, 2019
Related Work (FPGA Emulators)
J. Pilch, and J. Dlugopolski, “An FPGA-based real quantum computer emulator” December 2018 Results for up to 2-qubit Deutsch’s algorithm Details of precision used not presented Limited scalability
A. Silva, and O.G. Zabaleta, “FPGA quantum computing emulator using high level design tools,” August 2017 Results for up to 6-qubit QFT Details of precision used not presented No approach to improve scalability
Y.H. Lee, M. Khalil-Hani, and M.N. Marsono, “An FPGA-based quantum computing emulation framework based on serial-parallel architecture,” March 2016 Results of 5-qubit QFT and 7-qubit Grover’s reported Up to 24-bit fixed-point precision No optimizations to make designs scalable
A.U. Khalid, Z. Zilic, and K. Radecka, “FPGA emulation of quantum circuits,” October 2004 3-qubit QFT and Grover’s search implemented Fixed-point precision (16 bits) Low operating frequency
M. Fujishima, “FPGA-based high-speed emulator of quantum computing,” December 2003 Logic quantum processor that abstracts quantum circuit operations into binary logic Coefficients of qubit states modeled as binary, not complex No resource utilization reported
9 ASAP 2019 – July 16th, 2019
Background (Quantum Computing)
Qubits Physical implementations
Electron (spin) Nucleus (spin through NMR) Photon (polarization encoding) Josephson junction (superconducting qubits)
Theoretical representation Bloch sphere
» Basis states | ⟩0 , | ⟩1» Pure states | ⟩𝜓𝜓
Vector of complex coefficients
Superposition Linear sum of distinct basis states Converts to classical logic when measured Applies to state with n-qubits
Entanglement Strong correlation between qubits Entangled state cannot be factored Tensor (Kronecker) product representation
N = 2n basis states, where, n is number of qubits
| ⟩𝜓𝜓 = 𝛼𝛼| ⟩0 + 𝛽𝛽| ⟩1 ≡𝛼𝛼𝛽𝛽 , and
𝑝𝑝 𝜓𝜓 → | ⟩0 = 𝛼𝛼 2 ; 𝑝𝑝 𝜓𝜓 → | ⟩1 = 𝛽𝛽 2
| ⟩𝑞𝑞𝑞𝑞𝑞𝑞𝑞𝑞𝑞 = | ⟩𝑞𝑞𝑞 ⊗ | ⟩𝑞𝑞𝑞 ⊗ | ⟩𝑞𝑞𝑞| ⟩𝜓𝜓 = 𝛼𝛼1𝛼𝛼2𝛼𝛼3| ⟩000 + 𝛼𝛼1𝛼𝛼2𝛽𝛽3|0 ⟩01 + 𝛼𝛼1𝛽𝛽2𝛼𝛼3| ⟩010 +
… + 𝛽𝛽1 𝛽𝛽2𝛽𝛽3|1 ⟩11NMR ≡ Nuclear Magnetic Resonance
11 ASAP 2019 – July 16th, 2019
Background (Quantum Fourier Transform)
Quantum Fourier Transform (QFT) Fundamental quantum algorithm Quantum equivalent of classical Discrete
Fourier Transform (DFT) Quadratic speedup over DFT Input
Coefficients of a quantum superimposed state
Output Coefficients of transformed superimposed
state
𝑈𝑈𝑄𝑄𝑄𝑄𝑄𝑄 ��𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜 =1𝑁𝑁�𝑘𝑘=0
𝑁𝑁−1
�𝑞𝑞=0
𝑁𝑁−1
𝑓𝑓 𝑞𝑞𝑞𝑞𝑞𝑞 𝑒𝑒 �2𝜋𝜋𝜋𝜋(𝑞𝑞𝑘𝑘𝑁𝑁 ⟩|𝒌𝒌 = �𝑘𝑘=0
𝑁𝑁−1
𝑪𝑪𝒌𝒌𝒐𝒐𝒐𝒐𝒐𝒐 ⟩|𝒌𝒌
𝑈𝑈𝑄𝑄𝑄𝑄𝑄𝑄 =1𝑁𝑁
1 1 1 … 11 𝜔𝜔𝑛𝑛 𝜔𝜔𝑛𝑛2 … 𝜔𝜔𝑛𝑛𝑁𝑁−1
1 𝜔𝜔𝑛𝑛2 𝜔𝜔𝑛𝑛4 … 𝜔𝜔𝑛𝑛2 𝑁𝑁−1
1 𝜔𝜔𝑛𝑛3 𝜔𝜔𝑛𝑛6 … 𝜔𝜔𝑛𝑛3 𝑁𝑁−1
⋮ ⋮ ⋮ ⋱ ⋮1 𝜔𝜔𝑛𝑛𝑁𝑁−1 𝜔𝜔𝑛𝑛
2(𝑁𝑁−1) … 𝜔𝜔𝑛𝑛(𝑁𝑁−1)(𝑁𝑁−1)
QFT matrix
𝜔𝜔 = 𝑒𝑒2𝜋𝜋𝜋𝜋𝑁𝑁 , 𝑁𝑁 = 𝑞𝑛𝑛, and 𝑛𝑛 = number of qubits
��𝜓𝜓𝜋𝜋𝑛𝑛 =1𝑁𝑁�𝑞𝑞=0
𝑁𝑁−1
𝑓𝑓 𝑞𝑞𝑞𝑞𝑞𝑞 ⟩|𝒒𝒒 = �𝑞𝑞=0
𝑁𝑁−1
𝑪𝑪𝒒𝒒𝒊𝒊𝒊𝒊 ⟩|𝒒𝒒
QFT
outψinψ UQFT
12 ASAP 2019 – July 16th, 2019
Outline
Introduction and MotivationBackground and Related WorkProposed Work
Emulation Approaches Hardware Architectures
Experimental WorkConclusions and Future Work
13 ASAP 2019 – July 16th, 2019
Emulation Approaches – Direct circuit/gate approach
QFT Emulation
𝑇𝑇2𝑇𝑇𝑞 𝑇𝑇3 𝑇𝑇4 𝑇𝑇5
5-qubit QFT algorithm/circuit
Hardware node models for 5-qubit QFT
Modeling the quantum circuit as a series of transformations
QFT ≡ Quantum Fourier Transform 𝑇𝑇1 = 𝐻𝐻⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼𝑇𝑇𝑞 = 𝐶𝐶𝐶𝐶2 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼𝑇𝑇𝑞 = 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 . 𝐶𝐶𝐶𝐶3 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 .𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼𝑇𝑇4 = 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 . 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 .
𝐶𝐶𝐶𝐶4 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 . 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 .𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼𝑇𝑇5 = 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆 . 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 .
𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 . 𝐶𝐶𝐶𝐶5 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 .𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆⊗ 𝐼𝐼 ⊗ 𝐼𝐼 . 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝐼𝐼 ⊗ 𝑆𝑆𝑆𝑆
⋮
⋯Node1 Node2 Node3
Advantages Modular Generalized framework
Disadvantages High resource utilization Longer latency Poor scalability
14 ASAP 2019 – July 16th, 2019
Emulation Approaches – CMAC approach
Methodology Vector-matrix multiplication
Complex multiply-and-accumulate (CMAC) Input quantum state vector, | ⟩𝝍𝝍𝒊𝒊𝒊𝒊
Algorithm reduced to a unitary matrix, 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨 lookup / dynamic generation / stream
Output quantum state vector, | ⟩𝝍𝝍𝒐𝒐𝒐𝒐𝒐𝒐
Advantages Generalized approach for any quantum algorithm Independent of circuit depth Lower resource utilization Lower latency Higher scalability Parallelizable hardware architectures
Disadvantages Calculating 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨 could be challenging, but doable
��𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜 = 𝑈𝑈𝐴𝐴𝐴𝐴𝐴𝐴 . ��𝜓𝜓𝜋𝜋𝑛𝑛
15 ASAP 2019 – July 16th, 2019
Emulation Approaches – CMAC approach optimizations
Lookup Computation elements (𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨) stored in memory Scalability limited by available memory Speed limited by memory bandwidth
Dynamic generation Additional hardware units required for
generating 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨
Improved memory utilization Improved scalability Speed limited by complexity of algorithm
generating 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨
Stream No hardware required for 𝑼𝑼𝑨𝑨𝑨𝑨𝑨𝑨
Improved memory utilization Improved scalability Speed limited by I/O bandwidth
��𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜 = 𝑈𝑈𝐴𝐴𝐴𝐴𝐴𝐴 . ��𝜓𝜓𝜋𝜋𝑛𝑛
16 ASAP 2019 – July 16th, 2019
Outline
Introduction and MotivationBackground and Related WorkProposed Work
Emulation Approaches Hardware Architectures
Experimental WorkConclusions and Future Work
17 ASAP 2019 – July 16th, 2019
Hardware Architectures – CMAC approach
Complex Multiply-and-Accumulate (CMAC) unit Implements vector-matrix multiplication on hardware Complex valued inputs Single/double-precision floating-point
𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 𝑖𝑖 = �𝑗𝑗=0
𝑁𝑁−1
𝑹𝑹𝒓𝒓𝒓𝒓𝒓𝒓𝒓𝒓(𝒊𝒊, 𝒋𝒋)
𝜓𝜓𝑜𝑜𝑜𝑜𝑜𝑜𝜋𝜋𝑖𝑖𝑟𝑟𝑖𝑖 𝑖𝑖 = �
𝑗𝑗=0
𝑁𝑁−1
𝑹𝑹𝒊𝒊𝒊𝒊𝒓𝒓𝒊𝒊(𝒊𝒊, 𝒋𝒋)
where, 𝑖𝑖 = 0, 1, 𝑞, … , 𝑁𝑁 − 1
𝑹𝑹𝒓𝒓𝒓𝒓𝒓𝒓𝒓𝒓 𝒊𝒊, 𝒋𝒋 = 𝜓𝜓𝜋𝜋𝑛𝑛𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 𝑗𝑗 × 𝑼𝑼𝒓𝒓𝒓𝒓𝒓𝒓𝒓𝒓 𝒊𝒊, 𝒋𝒋 − 𝜓𝜓𝜋𝜋𝑛𝑛𝜋𝜋𝑖𝑖𝑟𝑟𝑖𝑖 𝑗𝑗 × 𝑼𝑼𝒊𝒊𝒊𝒊𝒓𝒓𝒊𝒊 𝒊𝒊, 𝒋𝒋
𝑹𝑹𝒊𝒊𝒊𝒊𝒓𝒓𝒊𝒊 𝒊𝒊, 𝒋𝒋 = 𝜓𝜓𝜋𝜋𝑛𝑛𝜋𝜋𝑖𝑖𝑟𝑟𝑖𝑖 𝑗𝑗 × 𝑼𝑼𝒓𝒓𝒓𝒓𝒓𝒓𝒓𝒓 𝒊𝒊, 𝒋𝒋 + 𝜓𝜓𝜋𝜋𝑛𝑛𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 𝑗𝑗 × 𝑼𝑼𝒊𝒊𝒊𝒊𝒓𝒓𝒊𝒊 𝒊𝒊, 𝒋𝒋
18 ASAP 2019 – July 16th, 2019
Hardware Architectures – CMAC approach
Single CMAC Fully optimized for area One CMAC instance 𝑵𝑵 cycles to store input state vector elements 𝑵𝑵𝟐𝟐 cycles to compute for all algorithm matrix elements
N-concurrent CMAC Fully optimized for speed 𝑵𝑵 parallel CMAC instances 𝑵𝑵 cycles to store input state vector elements 𝑵𝑵 cycles to compute for all algorithm matrix elements
Dual-sequential CMAC Two CMAC instances connected serially Storage overlapped with computations Improved execution time
Single CMAC
N-concurrent CMAC
Dual sequential CMACN = 2n basis states, and n = number of qubits
19 ASAP 2019 – July 16th, 2019
Hardware Architectures – CMAC approach
CMAC Architecture
Complexity
Space (𝑶𝑶𝒔𝒔𝒔𝒔𝒓𝒓𝒔𝒔𝒓𝒓) Time (𝑶𝑶𝒐𝒐𝒊𝒊𝒊𝒊𝒓𝒓)
Single 𝑂𝑂(1) 𝑂𝑂(𝑁𝑁2)
N-concurrent 𝑂𝑂(𝑁𝑁) 𝑂𝑂(𝑁𝑁)
Dual sequential 𝑂𝑂(1) 𝑂𝑂(𝑁𝑁2)
Space and Time complexities of proposed architectures
Complexity analysis of CMAC Architectures Single CMAC𝑶𝑶𝒐𝒐𝒊𝒊𝒊𝒊𝒓𝒓 = 𝑨𝑨𝟏𝟏 + 𝑵𝑵 + 𝑵𝑵𝟐𝟐 × 𝑻𝑻𝒔𝒔𝒓𝒓𝒐𝒐𝒔𝒔𝒌𝒌 = 𝑶𝑶(𝑵𝑵𝟐𝟐)𝑶𝑶𝒔𝒔𝒔𝒔𝒓𝒓𝒔𝒔𝒓𝒓 = 𝟏𝟏 × 𝑪𝑪𝑪𝑪𝑨𝑨𝑪𝑪 = 𝑶𝑶(𝟏𝟏)
N-concurrent CMACs𝑶𝑶𝒐𝒐𝒊𝒊𝒊𝒊𝒓𝒓 = 𝑨𝑨𝟐𝟐 + 𝟐𝟐𝑵𝑵 × 𝑻𝑻𝒔𝒔𝒓𝒓𝒐𝒐𝒔𝒔𝒌𝒌 = 𝑶𝑶(𝑵𝑵)𝑶𝑶𝒔𝒔𝒔𝒔𝒓𝒓𝒔𝒔𝒓𝒓 = 𝑵𝑵 × 𝑪𝑪𝑪𝑪𝑨𝑨𝑪𝑪𝒔𝒔 = 𝑶𝑶(𝑵𝑵)
Dual-sequential CMACs𝑶𝑶𝒐𝒐𝒊𝒊𝒊𝒊𝒓𝒓 = 𝑨𝑨𝟑𝟑 + 𝑵𝑵𝟐𝟐 × 𝑻𝑻𝒔𝒔𝒓𝒓𝒐𝒐𝒔𝒔𝒌𝒌 = 𝑶𝑶(𝑵𝑵𝟐𝟐)𝑶𝑶𝒔𝒔𝒔𝒔𝒓𝒓𝒔𝒔𝒓𝒓 = 𝟐𝟐 × 𝑪𝑪𝑪𝑪𝑨𝑨𝑪𝑪𝒔𝒔 = 𝑶𝑶(𝟏𝟏)
𝑨𝑨𝟏𝟏,𝑨𝑨𝟐𝟐,𝑨𝑨𝟑𝟑 ≡ initial pipeline latencies𝑻𝑻𝒔𝒔𝒓𝒓𝒐𝒐𝒔𝒔𝒌𝒌 ≡ system clock period
20 ASAP 2019 – July 16th, 2019
Outline
Introduction and MotivationBackground and Related WorkProposed WorkExperimental WorkConclusions and Future Work
21 ASAP 2019 – July 16th, 2019
Experimental Setup
Testbed Platform High-performance reconfigurable
computing (HPRC) system from DirectStream
Single compute node to warehouse scale multi-node deployments
OS-less, FPGA-only (Arria 10) architecture On-chip memory (OCM) On-board memory (OBM) Removes interconnection bottlenecks Reduces resource contention and energy use
Highly productive development environment Parallel High-Level Language C++-to-HW (previously Carte-C) compiler Smooth learning curve + fast development time
DirectStream (DS8) system
23 ASAP 2019 – July 16th, 2019
Experimental Results
QFT emulation using on-chip memory (OCM) architectures
Number of qubits
On-chip resource* utilization (%) Emulation time (sec)**ALMs BRAMs DSPs
2 10.30 8.04 1.05 1.4E-6
3 10.24 8.12 1.05 1.15E-6
4 10.24 8.11 1.05 2.01E-6
5 10.27 8.18 1.05 5.37E-6
6 10.26 8.55 1.05 1.87E-5
7 10.26 10.25 1.05 7.17E-5
8 10.29 16.73 1.05 3.19E-4
9 10.31 41.28 1.05 0.0013
Single CMAC (lookup)
*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Operating frequency: 233 MHz
ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block
24 ASAP 2019 – July 16th, 2019
Experimental Results
Number of qubits
On-chip resource* utilization (%) Emulation time (sec)**ALMs BRAMs DSPs
2 10.70 8.04 1.05 6.78E-7
3 10.74 8.12 1.05 7.64E-7
4 11.53 8.11 1.05 9.36E-7
5 17.10 8.18 1.05 1.28E-6
6 24.5 8.55 1.05 1.97E-6
7 39.5 10.25 1.05 3.34E-6
8 74.88 16.73 1.05 6.09E-6
N-concurrent CMAC (lookup)
*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Operating frequency: 233 MHz
QFT emulation using on-chip memory (OCM) architectures
ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block
25 ASAP 2019 – July 16th, 2019
Experimental Results
Number of qubits
On-chip resource* utilization (%) Emulation time (sec)**ALMs BRAMs DSPs
2 12.39 8.55 2.11 7.55E-7
3 12.34 8.55 2.11 9.61E-7
4 12.36 8.63 2.11 1.79E-6
5 12.43 8.70 2.11 5.08E-6
6 12.38 8.99 2.11 1.83E-5
7 12.39 10.69 2.11 7.1E-5
8 12.37 17.18 2.11 0.0003
9 12.37 43.54 2.11 0.0011
Dual sequential CMAC (lookup)
*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Operating frequency: 233 MHz
QFT emulation using on-chip memory (OCM) architectures
ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block
26 ASAP 2019 – July 16th, 2019
Experimental Results
Comparison of execution times for on-chip memory (OCM) architectures
27 ASAP 2019 – July 16th, 2019
Experimental Results
QFT emulation using on-board memory (OBM) architectures
Qubits
On-chip resource* utilization (%)
OBM** Utilization (bytes) Emulation time
(sec)***ALMs BRAM DSPs SRAM SDRAM
2 10.71 8.44 1.05 32 128 1.7E-63 10.71 8.44 1.05 64 512 2.0E-64 10.71 8.44 1.05 128 2K 3.9E-65 10.71 8.44 1.05 256 8K 1.1E-56 10.71 8.44 1.05 512 32K 3.9E-57 10.71 8.44 1.05 1K 128K 0.000158 10.71 8.44 1.05 2K 512K 0.000619 10.71 8.44 1.05 4K 2M 0.00241
10 10.71 8.44 1.05 8K 8M 0.0096311 10.71 8.44 1.05 16K 32M 0.0385112 10.71 8.44 1.05 32K 128M 0.1539913 10.71 8.44 1.05 64K 512M 0.6158614 10.71 8.44 1.05 128K 2G 2.3632415 10.71 8.44 1.05 256K 8G 9.85316 10.71 8.44 1.05 512K 32G 39.4209
*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Total on-board memory: 4 parallel SRAM banks of 8MB each and 2 parallel SDRAM banks of 32GB each***Operating frequency: 233 MHz
Qubits
On-chip resource* utilization (%)
OBM** Utilization (bytes) Emulation time
(sec)***ALMs BRAM DSPs SRAM SDRAM
2 12 8.63 2.11 32 128 7.55E-73 12 8.63 2.11 64 512 9.61E-74 12 8.63 2.11 128 2K 1.79E-65 12 8.63 2.11 256 8K 5.08E-66 12 8.63 2.11 512 32K 1.83E-57 12 8.63 2.11 1K 128K 7.10E-58 12 8.63 2.11 2K 512K 0.000289 12 8.63 2.11 4K 2M 0.0011310 12 8.63 2.11 8K 8M 0.0045111 12 8.63 2.11 16K 32M 0.01800212 12 8.63 2.11 32K 128M 0.07200613 12 8.63 2.11 64K 512M 0.288802114 12 8.63 2.11 128K 2G 1.15208315 12 8.63 2.11 256K 8G 4.60832916 12 8.63 2.11 512K 32G 18.4331
Single CMAC (lookup) Dual-sequential CMAC (lookup)
ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block
28 ASAP 2019 – July 16th, 2019
Experimental Results
Comparison of execution times for on-board memory (OBM) architectures
30 ASAP 2019 – July 16th, 2019
Experimental Results
QFT emulation using on-board memory (OBM) architectures
*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Total on-board memory: 4 parallel SRAM banks of 8MB each and 2 parallel SDRAM banks of 32GB each***Operating frequency: 233 MHz† Results projected using regression
Qubits
On-chip resource* utilization (%)
OBM** Utilization (bytes) Emulation time
(sec)***ALMs BRAMs DSPs SDRAM
2 11 8 1 32 2.3E-64 11 8 1 128 3.4E-66 11 8 1 512 2.0E-58 11 8 1 2K 2.8E-410 11 8 1 8K 4.5E-312 11 8 1 32K 7.2E-214 11 8 1 128K 1.15E016 11 8 1 512K 1.84E+118 11 8 1 2M 2.95E+220 11 8 1 8M 4.72E+322 11 8 1 32M 7.5E+4†24 11 8 1 128M 1.2E+6†26 11 8 1 512M 1.93E+7†28 11 8 1 2G 3.09E+8†30 11 8 1 8G 4.95E+9†32 11 8 1 32G 7.92E+10†
Dual-sequential CMAC (stream)Dual-sequential CMAC (dynamic generation)
ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block
Qubits
On-chip resource* utilization (%)
OBM** Utilization (bytes) Emulation time
(sec)***ALMs BRAMs DSPs SDRAM
2 13.16 9.58 3.23 32 1.99E-64 13.16 9.58 3.23 128 3.02E-66 13.16 9.58 3.23 512 1.95E-58 13.16 9.58 3.23 2K 0.000310 13.16 9.58 3.23 8K 0.004512 13.16 9.58 3.23 32K 0.072014 13.16 9.58 3.23 128K 1.152116 13.16 9.58 3.23 512K 18.43318 13.16 9.58 3.23 2M 294.9320 13.16 9.58 3.23 8M 4718.93422 13.16 9.58 3.23 32M 18876†24 13.16 9.58 3.23 128M 302012†26 13.16 9.58 3.23 512M 4832188†28 13.16 9.58 3.23 2G 7.73E+7 †30 13.16 9.58 3.23 8G 1.23E+9 †32 13.16 9.58 3.23 32G 1.979E+10 †
31 ASAP 2019 – July 16th, 2019
Experimental Results
Number of pixels Qubits
On-chip resource* utilization (%)
OBM** Utilization (bytes) Emulation time
(sec)***ALMs BRAMs DSPs SDRAM
16x16 8 14 9 2 4K 6.73E-632x32 10 14 9 2 16K 1.66E-564x64 12 14 9 2 64K 5.62E-5
128x128 14 14 9 2 256K 0.0002256x256 16 14 9 2 1M 0.0008512x512 18 14 9 2 4M 0.0034
1024x1024 20 14 9 2 16M 0.01352048x2048 22 14 9 2 64M 0.0540
4Kx4K 24 14 9 2 256M 0.21608Kx8K 26 14 9 2 1G 0.8641
16Kx16K 28 14 9 2 4G 3.456332Kx32K 30 14 9 2 16G 13.825
QubitsOn-chip resource*
utilization (%)OBM** Utilization
(bytes) Emulation time (sec)***
ALMs BRAMs DSPs SDRAM2 11 8 1 32 2.3E-64 11 8 1 128 3.4E-66 11 8 1 512 2.0E-58 11 8 1 2K 2.8E-410 11 8 1 8K 4.5E-312 11 8 1 32K 7.2E-214 11 8 1 128K 1.15E016 11 8 1 512K 1.84E+118 11 8 1 2M 2.95E+220 11 8 1 8M 4.72E+322 11 8 1 32M 7.5E+4†24 11 8 1 128M 1.2E+6†26 11 8 1 512M 1.93E+7†28 11 8 1 2G 3.09E+8†30 11 8 1 8G 4.95E+9†32 11 8 1 32G 7.92E+10†
Grover’s search using OBM architecturesSingle CMAC (stream)
Quantum Haar transform (QHT) using OBM architecturesKernel approach
*Total on-chip resources: NALM=427,000, NBRAM=2,713, NDSP=1,518**Total on-board memory: 4 parallel SRAM banks of 8MB each and 2 parallel SDRAM banks of 32GB each***Operating frequency: 233 MHz† Results projected using regression
ALM ≡ Adaptive Logic ModulesBRAM ≡ Block Random Access MemoryDSP ≡ Digital Signal Processing block
32 ASAP 2019 – July 16th, 2019
Experimental Results
Comparison with related work (FPGA emulation)Reported Work Algorithm Number of qubits Precision Operating
frequency (MHz)Emulation time (sec)
Fujishima (2003) Shor’s factoring - - 80 10
Khalid et al (2004)QFT 3 16-bit fixed pt.
82.161E-9
Grover’s search 3 16-bit fixed pt. 84E-9
Aminian et al (2008) QFT 3 16-bit fixed pt. 131.3 46E-9
Lee et al (2016)QFT 5 24-bit fixed pt. 90 219E-9
Grover’s search 7 24-bit fixed pt. 85 96.8E-9
Silva and Zabaleta(2017) QFT 4 32-bit floating pt. - 4E-6
Pilch and Dlugopolski(2018) Deutsch 2 - - -
Proposed work
QFT 20
32-bit floating pt. 233
18.4
QHT 30 13.8
Grover’s search 20 18.4
33 ASAP 2019 – July 16th, 2019
Conclusions
Supremacy of Quantum Computing Need for Quantum Emulation Emulation using FPGAs
Proposed Approaches and Methods Direct circuit/gate approach CMAC approach + lookup / dynamic generation / stream Kernel approach
Case studies Quantum Fourier Transform (QFT) Multi-dimensional Quantum Haar Transform (QHT) Single-pattern / multi-pattern Grover’s search algorithm
Testbed Platform State-of-the-art HPRC system from DirectStream C++ to hardware compiler
34 ASAP 2019 – July 16th, 2019
Future Work
More algorithms Integer factoring using Shor’s algorithm Dimension reduction using QHT Image pattern recognition using QHT and Grover’s search
Design Optimizations Combining / sharing resources, e.g., multipliers
Accuracy trade-off study Fixed-point vs. floating-point implementations for higher scalability
Quantum error correction (QEC)
Power efficiency
ASAP 2019 – July 16th, 2019