# Computer Arithmetic in High Performance Reconfigurable Computing

George A. Constantinides<sup>1</sup>

Imperial College London

January 15, 2012

<sup>1</sup>Thanks to: Mr Juan Jerez

George A. Constantinides Computer Arithmetic in High Performance Reconfigurable Comp

- Returning silicon to computation: some options, and convergence
- Reconfigurable computing: the FPGA
- Compilation driven by accuracy specifications: opening windows
- Computer arithmetic: numerical and hardware considerations
- A case study: accelerating the Lanczos algorithm (for LE solution)

# Our Research Group

- Circuits and Systems Group
- One of 5 research groups in EEE at Imperial
- In numbers:
  - 9 academic staff
  - c.20 research staff
  - c.40 PhD students
- Analogue electronics: medical devices
- Digital electronics: programmable devices, tools and applications
- My support from EPSRC, EC, Altera, Xilinx, ARM, Imagination, ESA and Mathworks.



















#### Computer Architecture is Changing



- All about silicon efficiency! Cache or computation?
- Fixed-functionality hardware too expensive what's the *right* way to introduce programmability?
- GPU: Hundreds of cores. Explicit memory management. Singles not doubles. Simple task parallelism.
- FPGA: Tens of thousands of 'cores'. Explicit memory management. Right-sized datapath. Arbitrary structure.

# What is a Field-Programmable Gate Array?



- Regular stucture, so at technology leading edge.
- Convergence of architectures?

Recently popularised by Berkeley: 'quick' emulation of manycore architectures.



### The FPGA as a GPP!



- Select an architecture for your algorithm.
- High speed (e.g. 20x LP), low power (e.g. 8x MC).

- Hardware represents numerical data as bit strings.
- A bit string of length *n* can represent at most 2<sup>*n*</sup> distinct values.
- Representations vary in how these strings are mapped onto reals: f : {0,1}<sup>n</sup> → ℝ.
  - Floating-point (s-m-e),
  - Fixed-point (s-m,2c,1c),
  - LNS (s-e),
  - RNS (m,m,...), etc.
- We have always cared about using 'enough' precision. Now we should care about using 'just enough' precision.
  - Lower precision  $\Rightarrow$  smaller area units, less bandwidth  $\Rightarrow$  more units, more transfer  $\Rightarrow$  faster.
  - Remember: It's all about silicon efficiency!

#### Hardware Implications



(from http://www.cise.ufl.edu/~mssz/CompOrg/CDAarith.html) (array/Wallace/Dadda)

Table: Resource utilization and latency for a single adder on a Virtex7XT 1140 using Xilinx Core Generator floating point v5.0

|        | Registers | LUTs | Latency |
|--------|-----------|------|---------|
| double | 1046      | 911  | 14      |
| float  | 557       | 477  | 11      |
| FX53   | 53        | 53   | 1       |
| FX24   | 24        | 24   | 1       |

- $\sim 20x$  resource savings
- $\sim 10x$  latency savings

#### The Central Problem

 $\max_{\substack{s,p\\ subject to : \forall i. Pre(i) \to Out(s, p, i) \in Accept(i).}$ 

- Here *s* denotes circuit structural parameters, and *p* denotes circuit precisions. Note plural we are in a parallel environment, so we should take advantage!
- Select degree of parallelism, circuit structure, number representation(s) and precisions to maximise performance on a given architecture.

(1)

#### The Lanczos Algorithm



Latency (cycles) given by

$$\lceil \frac{N}{P} \rceil + l_A * \lceil \log_2(N) \rceil + 5l_M + l_A + l_{SQ} + l_D + 2 + 2l_{red}$$
$$l_{red} := \lceil \frac{N}{P} \rceil + l_A * \lceil \log_2(P) \rceil + l_A + l_A * \lceil \log_2(l_A) \rceil - 1$$

#### Fixed point? Peak bounds required on...

- q<sub>i</sub>
- A
- Aq<sub>i</sub>
- $\alpha_i$
- $\beta_i q_{i-1}$
- $Aq_i \beta_{i-1}q_{i-1}$
- α<sub>i</sub>q<sub>i</sub>
- *q*<sub>*i*+1</sub>
- $q_{i+1}^T q_{i+1}$
- β<sub>i</sub>
- $\frac{1}{\beta_i}$

### Preconditioning

Instead of solving

$$Ax = b$$

solve

$$M^{-\frac{1}{2}}AM^{-\frac{1}{2}}y = M^{-\frac{1}{2}}b$$

$$\widehat{A}y = \widehat{b}$$
(2)
(3)

The solution to the original problem can be recovered via

$$x := M^{-\frac{1}{2}}y$$

We propose a positive diagonal preconditioner whose elements are the absolute sum of the elements in each row of A, i.e.

$$M_{kk} = \sum_{j=1}^{N} |A_{kj}|,$$

#### Theorem

Given preconditioner M, the symmetric Lanczos algorithm applied to  $\widehat{A} := M^{-\frac{1}{2}}AM^{-\frac{1}{2}}$ , for any symmetric matrix A, has intermediate variables with the following bounds:

- $q_i \in (-1,1)$
- $\widehat{A} \in (-1,1)$
- $\widehat{A}q_i \in (-1,1)$
- $\alpha_i \in (-1, 1)$
- $\beta_i q_{i-1} \in (-1, 1)$
- $\alpha_i q_i \in (-1, 1)$

- $\widehat{A}q_i \beta_{i-1}q_{i-1} \in (-2,2)$
- $q_{i+1} \in (-1,1)$
- $q_{i+1}^T q_{i+1} \in (-1,1)$
- $\beta_i \in (\epsilon, 1)$
- $\frac{1}{\beta_i} \in (0, 1/\epsilon)$

where  $\epsilon$  is determined by termination condition.

 $\min_{P,k} \text{ latency }$ 

subject to

$$\mathbb{P}(e_k \le \eta) > 90\%$$
(4)  
  $R(P,k) \le \mathsf{FPGA}_{\mathsf{area}}$ (5)

Find minimum number of bits k such that (4) is satisfied

Find maximum parallelism P such that (5) remains satisfied



#### Performance (N=229) - Many problems



- Architecture is not what it was.
- Microelectronics design, NA, HPC, programming language theory, all have strong contributing roles to play.
- Need to devise suitable specification format, and compilers, for heterogeneous architectures and look at the implications for those architectures.
- Let's talk!

#### Some Adverts...



George A. Constantinides

Computer Arithmetic in High Performance Reconfigurable Comp