Matrix Inversion

Explicit matrix inversion (analytic method) of a full matrix is a computationally intensive method. If the inversion is encountered, one should consider converting this problem into an easy decomposition problem which will result in analytic simplicity and computational convenience.

Decomposition methods are generally viewed as the preferred methods for matrix inversion because they scale well for large matrix dimensions while the complexity of the analytic method increases dramatically as the matrix dimensions grow. However, for small matrices, the analytic method, which can exploit a significant amount of parallelism, outperforms the decomposition methods. Also note that Cholesky and LU decompositions work only with positive definite and nonsingular diagonally dominant square matrices, respectively. QR decomposition, on the other hand, is more general and can be applied to any matrix.

Design space exploration for matrix inversion architectures can be divided into two parts: inflection point analysis and architectural design alternatives analysis. We also compared our results with the previously published papers.
Inflection Point Analysis:We present different execution results, serial and parallel, for different bit widths and matrix dimensions to answer: at what matrix size does an inflection point occur and how does varying bit width and degree of parallelism change the inflection point? The comparisons for sequential and parallel executions of matrix decomposition and inversion methods are shown in Figure below (a, b, c, d) with different bit widths: 16, 32 and 64. Square, spade and triangle represent QR, LU and Cholesky methods respectively. Solid, dashed and smaller dashed lines represent 64, 32 and 16 bits of bit widths respectively. The balloons denote the inflection points between these methods for the different bit widths.


Different design space exploration: inflection point analyses, of our tool. Top row: decomposition methods only. Bottom row: matrix inversion.

Architectural Design Alternatives: These analyses are shown for matrix inversion for different bit widths and matrix sizes. We present area results in terms of slices and performance results in terms of throughput. Throughput is calculated by dividing the maximum clock frequency (MHz) by the number of clock cycles to perform matrix inversion. All designs are written in Verilog and synthesized using Xilinx ISE 9.2. Resource utilization and design frequency are post place and route values obtained using a Virtex 4 SX35 FPGA. Both mode 1 (non-optimized) and mode 2 (optimized) results are shown for general matrix inversion in Figure below (a) to show the improvement in the results with the optimization feature. It is shown that area and throughput increase up to the optimal number of resources as the number of resources increase. However, adding more than the optimal number of resources decreases throughput while still increasing area. Mode 2 of GUSTO finds the optimal number of resources which maximizes the throughput while minimizing area where the application specific architecture provides an average of 59% decrease in area and 3X increase in throughput over Mode 1's general purpose (non optimized) design.


Different design space exploration examples, specifically area and throughput results of different bit width and matrix dimensions, of our tool.

Bit width of the data is another important input for the matrix inversion. The precision of the results is directly dependent on the number of bits used. The usage of a high number of bits results in high precision at a cost of higher area and lower throughput. We present 3 different bit widths, 19, 26 and 32 bits in (b) for these three different matrix inversion architectures. We also present three different matrix dimension, 4 × 4, 6 × 6 and 8 × 8, implementation results in (c) showing how the area and performance results scale with matrix dimension.

Comparisons: We provide a comparison between our results and previously published implementations for 4 × 4 matrices in Table below. We present all of our implementations with bit width 20 as this is the largest bit width value used in the related works. Though it is difficult to make direct comparisons between our designs and those of the related works (because we used fixed point arithmetic instead of floating point arithmetic and fully used FPGA resources (like DSP48s) instead of LUTs), we observe that our results are comparable. The main advantages of our implementation are that it provides the designer the ability to study the tradeoffs between architectures with different design parameters and provides a means to find an optimal design.

Ref [1] Ref [1] GUSTO
Impl A
GUSTO
Impl B
GUSTO
Impl C
Ref [2] Ref [3] GUSTO GUSTO GUSTO
Method Analytic Analytic Analytic Analytic Analytic QR QR QR Cholesky LU
Bit width 16 20 20 20 20 12 20 20 20 20
Data type floating floating fixed fixed fixed fixed floating fixed fixed fixed
Device type Virtex 4 Virtex 4 Virtex 4 Virtex 4 Virtex 4 Virtex 2 Virtex 4 Virtex 4 Virtex 4 Virtex 4
Slices 1561 2094 702 1400 2808 4400 9117 3584 2719 3682
DSP48s 0 0 4 8 16 NR 22 12 12 12
BRAMs NR NR 0 0 0 NR NR 1 1 1
Throughput 1.04 0.83 0.38 0.72 1.3 0.28 0.12 0.26 0.33 0.25

Table Comparisons between our results and previously published papers. NR denotes not reported.

[1] J. Eilert, D. Wu, D. Liu, "Efficient Complex Matrix Inversion for MIMO Software Defined Radio", IEEE International Symposium on Circuits and Systems. (2007) 2610 - 2613.
[2] F. Edman, V. Owall, "A Scalable Pipelined Complex Valued Matrix Inversion Architecture", IEEE International Symposium on Circuits and Systems. (2005) 4489 - 4492.
[3] M. Karkooti, J.R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm", Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers (2005) 1625 - 162.