### WHAT ALGORITHMS SHOULD WE STUDY WITH 100 QUBITS AND 1M LOGICAL GATES?

### **Alexandru Paler**

Aalto University, Helsinki, Finland Quantum Software and Algorithms Group

alexandru.paler@aalto.fi

### **Motivation: QC needs error-correction**



#### Physical (raw) qubits

- not well behaved
- faulty affected by environmental noise and manufacturing inconsistencies
- solitary (not many) on a device



#### **Error-corrected qubits**

- controlling the risks
- not faulty or controlled failure rates
- difficult to achieve due to lack of hardware qubits, not scalable classical software etc.













# **A Brief Introduction to Surface Codes**

Sufface Code distance three 0 0 1 0 9 date 2. o - data gulit
o - synd × gulit
□ - synd & gulit 4 synd X 4 synd 17 gutouts

Surface Parforming CNOTS Code Braiding plague #5 A enjoyced. distance three Two play C not enforced poor defects. 9 data 2. data gulit synd x gulit 4 synd X Support 4 synd two logical gub. 131 I - synd & gulit 17 gutouts

Of operators 团 PX -> 6 for more details arxiv. 1208.0928

Lattice Swerry Of operators 0 · 0 = 22 团 -> Q Q 6 = Marge 1 2 → 1+2 for more details arxiv. 1208.0928 Split 1+2 -11

formore details 4022 INT IAL t INT Split Marge C INT 1XI Merge 7 C

for more details 4022 Now the Circuits Compiled? INT Map 6 a . .... Ð . ... IRI t Lagar 1 ALC . INT INT INT INT Split Marge C Ł drouds INT IXI erral correction Merge AA On-C one mere operation

Spricetime Idence of consists of dxdxd time HW cutes. CAL. 100 gehts ~ 200 patches 1 Million Jates ~ 106. 402 => 10° vderme

Spacetime I dame of a Conputation. consists of dxdxd HW cutes. 100 gents ~ 200 patches 1 Million jates ~ 106. #2 => 10° vderme

oficel Error Rete prot fative 1/15 deme t faling. Noise Model : myle geht two sunt 10p meaniement 10p? 2

Pth QECC QECC. Norths dos Log Doch d=3 theshold prob. pth Play ~ (P) d+1 2 Usual Jalues: p~0.1%. 7th~ 1%. Incuare d by 2 -> 10x luves Pig d=17 min. reg. for 103

# **Scalable (Machine Learning) Decoders**

## **Faster Decoding by Pipelining and Parallelisation**

Goal: Real-time scalable decoding

Method:

- 1. Pipelining and Parallelism
- 2. Solve simple matches before full decoding
- 3. On-the-fly reweighting to account for error correlations
- 4. Do not sacrifice decoding performance for speed Helper functionalities:
- a. Automatic computation of logical error rates
- b. Exhaustive test of single and double error correction
- c. Prepare, benchmark and validate 64 core real time operation
- d. Prepare for different code variants



### **Correlated analysis of errors**

Systematic to analyse and decompose correlated errors.

Circuit level simulation - Gates have a list of possible errors



## **Correlated analysis of errors**

Systematic to analyse and decompose correlated errors.

Circuit level simulation - Gates have a list of possible errors

Errors generate detection events: 1 (connect to boundary), 2, 2+



2+ Detection events from a single error have to be decomposed

Treat 2+ events like hyperedges: decompose into known edges



An edge is generated by errors, the probability of an edge is determined by the prob of the errors



# **Correlated analysis of errors**

An edge is generated by errors,

the probability of an edge is determined by the prob of the errors



Errors are generated by gates.

Assume independent gate errors.



p\_corr = sum of probs from gate errors.

Update edge probability (after prematching)

p\_new = p\_old + p\_corr



## **Results: pipelinening and correlated decoding**

Fast decoding with comparable results to arXiv:1310.0863v1 d3@10-5 < 10-7 d9@10-3 < 10-7

Hardware agnostic correlated analysis for reweighting edges

Prematching to support reweighting and fastmatch

Pipelined correlated minimum weight perfect matching of the surface code Alexandru Paler<sup>1,2</sup> and Austin G. Fowler<sup>3</sup> <sup>1</sup>Administration procession and a structure of the surface of the



# **ML Decoders: Introduction and Motivation**

Optimal Decoding of QECC is a hard problem [1]

Belief propagation (BP) - one of the best-known classical decoding algorithms



Limitations of previous NN based decoding approaches:

- Different NN architectures for different code types
- Retain for each code distance
- there is a GNN decoder [4], but it does not work like we want it

### Why ML Decoders?

ML Decoding has linear time (although the scaling of the models with code distance is not known)



iOlius, A. D., Fuentes, P., Orús, R., Crespo, P. M., & Martinez, J. E. (2023). Decoding algorithms for surface codes. *arXiv preprint arXiv:2307.14989*.

## Why ML Decoders?



ML Decoding has linear time (although the scaling of the models with code distance is not known)

### What the goal is:



What the state of the art is:

Threshold



iOlius, A. D., Fuentes, P., Orús, R., Crespo, P. M., & Martinez, J. E. (2023). Decoding algorithms for surface codes. arXiv preprint arXiv:2307.14989.

### <u>The Graph Neural Network (GNN) Decoder</u> <u>Learning BP to Satisfy Constraints</u>





red: input vertices in GNN blue: output green: node state messages are sent along the edges

### <u>Graph Neural Network (GNN) Decoder</u> <u>The Sudoku analogy - Learning BP</u>





Green = to fill =

errors / data qubits



 $m_{22}$ 

 $m_{31}$ 

1112

ma

mar

Tanner graph for surface code of distance 3: **RED** vertices are check nodes, **GREEN** vertices are data nodes

### **GNN as replacement of MWPM or BPOSD for Surface code**

GNN vs MWPM 3 muper Bant-set ▲ 5 miigri 7 mwpm • 9 miljen \* 11 magan = gnri d3 0.1 = gen dő - gnn d7 Logical Error Rate (LER) 0.05 en neg = gnn df1 - gnn dt1-13 0.01 0.005 0.09 0.15 0.18 0.18 0.19 0.20 0.05 0.06 80.0 0.10 0.11 0.13 0.14 0.17 0.07 0.12 0.21 Physical Error Rate (PER)

### **GNN as replacement of BP2-OSD for IBM's BB code**



# **Very Fast Compilers (for Lattice Surgery)**



Our Challenge: Logical Computations at scale 100s to 1000s of logical qubits

- Start with a lattice of NN connected qubits that can operate a Surface Code Cycle
- This lattice is partitioned into tiles.
- A tile can hold a **patch**, which encodes a logical qubit in a planar code
- Patches have different kinds of **boundaries** that are used to perform multibody measurements
- Unused lattice can be used as **routing** to carry out measurements among patches with no shared boundary



arXiv:2302.02459 (quant-ph)

(Submitted on 5 Feb 2023)

#### A High Performance Compiler for Very Large Scale Surface Code Computations

George Watkins, Hoang Minh Nguyen, Varun Seshadri, Keelan Watkins, Steven Pearce, Hoi-Keen Lau, Alexandru Peler



# **LS Compiler Architecture**

A pluggable pipeline in decoupled stages, with options and text-based intermediate representations



Pre-processing is decoupled from routing on the lattice thanks to an intermediate representation of Lattice Surgery Instructions and a Layout Specification





# **Scalable Reinforcement Learning**

 $\bigcirc$ 

 $\bigcirc$ 



Space-time volume of braided circuit

Babbush, Ryan, et al. "Encoding electronic spectra in quantum circuits with linear T complexity." *Physical Review X* 8.4 (2018): 041015.



### Space-time volume of lattice surgery (LS) circuit

Paler, Alexandru, and Austin G. Fowler. "OpenSurgery for topological assemblies." 2020 IEEE Globecom Workshops (GC Wkshps. IEEE, 2020.

How do we optimize this? Maybe we can discover optimization algorithms...Let's use RL....

## **RL Scalability**

Training the agent takes around 41 hours on a single compute node with 32 CPUs and 2 GPUs. We run multiple environments in parallel on the CPUs during the sampling phase and train the agent distributed on both GPUs. The implementation of the algorithm could directly take advantage of larger compute nodes to speed up training time.

As AI applications demand greater compute power, efficiency may be improved via better chip design. The Nature paper [1] by Google researchers, published in June 2020, was advertised as a chip-design breakthrough using Machine Learning (ML). It addressed a challenging problem to optimize locations of circuit components on a chip and described applications to five TPU chip blocks, implying that no better methods were available at the time in academia or industry. The paper generalized the claims beyond chip design to suggest that Reinforcement Learning (RL) outperforms state of the art in combinatorial optimization.

### arXiv > quant-ph > arXiv:2311.18588

#### Quantum Physics

(Submitted on 30 Nov 2023)

#### Optimizing ZX-Diagrams with Deep Reinforcement Learning

#### Maximilian Nagele, Florian Marquardt

ZX-diagrams are a powerful graphical language for the description of quantum processes with applications in fundamental quan describe. These rules can be exploited to optimize the structure of ZX-diagrams for a range of applications. However, finding an problem and show that a trained reinforcement learning agent can significantly outperform other optimization techniques like a

#### 31 X1V > cs > arXiv.2306.09633

#### Computer Science > Machine Learning

(Submitted on 16 Avo 2023 or 11, last revised 29 Sep 2023 this version, v73

#### The False Dawn: Reevaluating Google's Reinforcement Learning for Chip Macro Placement

#### igor L. Markov

Reinforcement learning (RL) for physical design of calcon chips in a Google 2021 Nature paper stored controlency due to poorly documented claims that raised eyebrow demonstrated that Google RL lags behind (0 human designers, (i) a well-known algorithm Gimalated Annealing), and (ii) generally-available commercial software, while and reporting. Before publishing, Google rebuffed internal allegations of haud, We note policy implications and conclusions for chip design.



### Fast & scalable RL is needed

# **RL Introduction**

Reinforcement Learning: trial and error approach iteratively

*Goal:* finding the optimal policy that solves a specific problem

Quantum circuit optimization: reduce cost of the circuit

Cost can be many things:

- Number of gates and their specific cost (e.g two-qubit gates are more expensive)
- Depth of the circuit
- Number of qubits
- How well does the circuit map to a qubit layout

### Our goal: use RL to reduce the cost of the circuit

Problem: rewrite-rule based optimisation tends to increase the depth before it reaches optimum

Difficult optimization landscape





### **Quantum Circuit Optimization Landscape**



3. And exploring too much might kill you...

### **Example of a RL Epoch**





|              |       |       | •     |       |       |       |       |       |       |       |       | *     |        |          |
|--------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|----------|
| circuit      | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | 11    | 12    | 13     | 14       |
| degree       | 6     | 6     | 6     | 6     | 6     | 6     | 6     | 6     | 6     | 6     | 6     | 6     | 6      | 4        |
| scount       | 8     | 12    | 16    | 20    | 18    | 16    | 14    | 12    | 10    | 8     | 6     | 4     | 2      | 2        |
| mcount       | 3     | 3     | 3     | 3     | 3     | 3     | 3     | 3     | 3     | 3     | 3     | 3     | 3      | 1        |
| depth        | 6     | 8     | 10    | 12    | 12    | 12    | 10    | 9     | 7     | 5     | 5     | 5     | 5      | 3        |
| cost         | 4.15  | 5.08  | 6.00  | 6.92  | 6,46  | 6.00  | 5.54  | 5.08  | 4.62  | 4.15  | 3.69  | 3.23  | 2.77   | 1.23     |
| avg degree   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5    | 1        |
| maxavgdegree | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5   | 1.5    | 1.5      |
| maxdepth     | 6     | 8     | 10    | 12    | 12    | 12    | 12    | 12    | 12    | 12    | 12    | 12    | 12     | 12       |
| maxcost      | 4.15  | 5.08  | 6.00  | 6.92  | 6.92  | 6.92  | 6.92  | 6.92  | 6.92  | 6.92  | 6.92  | 6.92  | 6.92   | 6.92     |
| reward       | 1.250 | 1.250 | 1.250 | 1.250 | 1.321 | 1.404 | 1.627 | 1.893 | 2.610 | 4.766 | 6.105 | 8.117 | 11.334 | 1296.000 |
|              |       |       |       |       |       |       |       |       |       |       |       |       |        |          |

### **Example of a RL Epoch**



# **Very Large Scale Circuit Optimizer**

## **Motivation**

No software can handle gate optimization in randomly chosen circuit locations for circuits with *millions (billions?)* of gates!

| Optimizer     | Time       |
|---------------|------------|
| Cirq 1.2.0    | > 20 hours |
| Tket 1.21.0   | ~ 1 min    |
| PostgreSQL 14 | ?          |

Benchmarked state-of-the-art optimizers with circuits of 1 million templates.



TABLE IV. Resources required for quantum simulation of a planar Hubbard model with periodic boundary conditions and spin, as in Eq. (56). The dimension of the system indicates how many sites (spatial orbitals) are on each side of the square model. The number of system qubits is thus twice the number of spation orbitals. The number of logical ancillae is computed as Eq. (6 Finally, the number of T gates is computed using Eq. (63), whil assumes that  $\alpha/t = 4$  and  $\Delta E = t/100$ . The first three proble sizes in the table are near the classically intractable regime.

Example of practical circuit sizes

| Dimension      | Spin<br>orbitals | Logical<br>ancilla | Total<br>logical | T count              |  |
|----------------|------------------|--------------------|------------------|----------------------|--|
| 6×6            | 72               | 33                 | 105              | $9.3 \times 10^{3}$  |  |
| 8×8            | 128              | 33                 | 161              | $2.9 \times 10^{8}$  |  |
| $10 \times 10$ | 200              | 36                 | 236              | $7.1 \times 10^{8}$  |  |
| $20 \times 20$ | 800              | 42                 | 842              | $1.2 \times 10^{10}$ |  |

Encoding Electronic Spectra in Guantum Circuits with Linear T. Complexity

Ryan Babbuch, Graig Gahrey, Dowlink M. Berry, Nation Matter, Jarmi McClean, Namersha Rake, April: Powley, and Teachus Mexim. Phys. Neur. 8, 04128 – 1946-5440, 21 October 2128

Why **random**? circuit optimisation is a combinatorial (not sequential) problem. In-memory optimizers **are slow** for random memory access! Databases **are faster**.

# <u>Methods</u>

We consider four types of gate templates:

• Single-qubit gate cancellations

• Two-qubit gate cancellations

• Base changes



• Commutations



H

## **Results: Random Synthetic Circuits**

100000 Generating Synthetic Benchmark Circuits 10000 Start from empty circuit - identity on all 1. 1000 Time (seconds) qubits 100 2. For nr in range (LARGE NUMBER) Select random qubit(s) a. 10 Insert pairs of cancelling gates b. Hadamard gates i. 1 0  $\bigcirc$ **TKET** - blocked **DB** - blocked ii. CNOTS 0.1 e.g. LARGE NUMBER = 1 million (see next slides) 10000 100000 1000 1000000 **CNOT** count

considered the fastest

optimizer (written in Rust)

- Our tool is faster than |tket>。  $_{\circ}$ 
  - for more than 10k gates
  - speed-up increases with circuit size

### **Results: Multi-threaded performance**



Our benchmark circuit contains 1 million templates of either Type-1 or Type-2

- 2 million gates when using type-1
- 5 million gates when using type-2

# Conclusion: Executing algorithms/circuits of 100 qubits and 1M gates requires more work

### 1. Decoders

- a. Non-ML Decoders can be sped up by pipelining and parallelization https://arxiv.org/abs/2205.09828
- b. GNN Decoders seem to be learning the messages and algorithms of a message passing

### 2. Large scale compilation and optimization

- a. (Reinforcement) Learning of optimization algorithms has many bottlenecks
- b. Engineering Reward Functions seems to speed/improve RL <u>https://arxiv.org/abs/2311.12498</u>
- c. Compression of RL states with autoencoders https://arxiv.org/abs/2303.03280
- d. Parallelization and Fast Random Access of circuit optimization
  - i. ensures correctness of rewrites performed in parallel on the circuit
  - ii. supports analysis of circuits (e.g. average number of CNOT gates between pairs of neighbouring T gates)



