We begin by tuning the algorithm using conventional techniques. These yield modest performance improvements by making cache usage more efficient and tightening the inner-loop code.
One novel technique uses the z-buffer operations of the Intel i860 architecture to perform 4 independent computations in parallel. This achieves a five-fold speedup over the optimized code (six-fold over the original). We also describe a related technique that could be used by processors that have 64-bit integer operations, but no z-buffer.
Another new technique uses floating-point multiplies and adds in place of the standard algorithm's integer additions and maximum operations. This gains more than a three-fold speedup on the IBM POWER2 processor. This method does not give answers identical to those of the original program, but experimental evidence shows that the inaccuracies are small and do not affect which strings are chosen as good matches by the algorithm.
The protein matching problem requires comparing a query protein q against a database D of reference proteins r[k]. Each protein is represented as a sequence of characters over a 23-character alphabet of amino acids. (The related gene matching problem involves a 16-character alphabet of the subsets of the four nucleotides.) Let |p| be the length of protein p, |D| be the number of proteins in the database, and ||D|| be the size of the database, that is, the sum of the lengths of its proteins. A typical size for a problem of interest today might be |q| = 1000, |D| = 10,000, and ||D|| = 200,000,000 (the reference molecules are often longer than the query strings).
Smith and Waterman, and Gotoh, developed a dynamic programming algorithm that finds the optimal matching between two strings. The algorithm finds regions where the two proteins share genetic material. The score for a shared region is the cumulative sum of the values (given by a table) of matching the amino acids in the corresponding positions. The matching regions may be separated by gaps of non-matching material in one or both of the strings. Since a long gap can be introduced by a single mutation event, the algorithm assigns a penalty of the form U+kV for a gap of length k, where U is typically larger than V. This level of generality has been found to be sufficient to provide accurate answers.
Orthogonal to the problem of speeding up any particular algorithm is the challenge of finding better algorithms. Due to the importance of the sequence comparison problem, there is considerable interest in approximate string-matching algorithms that give satisfactory answers with less computation. Such programs include FLASH, FASTA, and BLAST. Although these algorithms show promising results, the Smith-Waterman algorithm is still of great use when the most accurate answers are needed.
The Smith-Waterman algorithm has an O(|q| ||D||) running time. For the typical problem size mentioned above, the computation takes many hours on the most powerful workstations; however, it is easy to take advantage of parallel supercomputers or workstation clusters, since the query protein can be scored against each of the |D| reference proteins independently.
This paper presents techniques for parallelizing the Smith-Waterman algorithm at the very lowest level of the computational hierarchy -- the bits within a computer word. We demonstrate a speedup of more than a factor of five using one of the techniques. While theorists may scoff that this is only a constant factor improvement, to a computational scientist, it can make the difference between certain experiments being impractical or practical.
In order for speedup numbers to be meaningful, care must be taken that comparisons are made with the "best sequential code" [Bailey]. Consequently, the next section details the conventional performance tuning of the algorithm. Section 3 presents two microparallel techniques and their application to protein matching. One uses the z-buffer instructions of the Intel i860, the processor chip used by the Intel Paragon; the other, the 64-bit integer instructions of the DEC Alpha and PowerPC 620 chips. Section 4 presents another technique, based on interpreting ADD and MAX operations as floating-point multiply and add instructions. The concluding section briefly touches upon load-balancing and scalability issues.
The Smith-Waterman algorithm is a dynamic programming algorithm related to the problem of computing the minimum edit distance between two strings. Three scores are computed for each prefix of the query protein q matched with each prefix of a reference protein r: nogap gives the score of matching the two prefixes, q_gap gives the score of a match ending in a gap in the query protein, and r_gap gives the score of a match ending in a gap in the reference protein. The penalty of starting a gap is given by the constant U; the penalty for extending a gap by one more position is V. The value of matching a character from each protein is given by the matrix Value. The final score of the two proteins is simply the largest nogap score computed for any pair of their prefixes.
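Written as recurrences (our reconstruction from the description above and the code of figure 2, with j indexing positions of q, k indexing positions of r, and U and V entering as additive penalties), the scores are:

    nogap[j,k] = MAX( 0,
                      nogap[j-1,k-1] + Value[q[j], r[k]],
                      q_gap[j,k],
                      r_gap[j,k] )
    q_gap[j,k] = V + MAX( q_gap[j-1,k], nogap[j-1,k] + U )
    r_gap[j,k] = V + MAX( r_gap[j,k-1], nogap[j,k-1] + U )

and the final score is the maximum of nogap[j,k] over all j and k. Note that a gap of length g contributes U + gV in total, matching the penalty form given earlier.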
Figure 1 shows the iteration space graph of the computation, with outer iterations (on i) along the x-axis, inner iterations (on j) along the y-axis, and middle iterations (on k) along the z-axis. There are dependence arcs into a point <i,j,k> from the points before it (<i,j,k-1>), below it (<i,j-1,k>), and diagonally below and before it (<i,j-1,k-1>). Notice the absence of left-right dependences. To perform the computation, each node of the graph must be evaluated. However, it is not necessary that the nodes be evaluated in exactly the order specified in figure 1. Any evaluation order that respects the dependences (no node is evaluated before any node that points to it) is legitimate. Since no arcs go from left to right (or vice versa), yz-planes can be executed in parallel.
As mentioned earlier, the three-dimensional arrays of figure 1 are unnecessary. Each iteration of the inner loop uses scores (of r_gap and nogap) computed in the preceding iteration of the inner loop (on j), and scores (of q_gap and nogap) computed in the preceding iteration of the next outer loop (on k). Only these intermediate scores need be maintained. Thus, space proportional to |q| is sufficient to compute score[i].
The values of nogap and q_gap computed in one iteration of the k loop are used in the next. Unless the cache is big enough to hold all these scores (as well as the Value matrix), there will be a cost associated with bringing them back into cache. Since nogap and q_gap are accessed sequentially, the cost of a cache miss will be amortized over the length of a cache line, so the cost will not usually be large.
Cache miss penalties can be avoided entirely by blocking the computation. First, the middle loop is stripmined, that is, broken into two nested loops. Then, the two innermost loops are interchanged, as sketched below. The resulting computation proceeds through the yz-plane of the iteration space graph in broad vertical swaths. Notice that this order of evaluation respects the dependences of the graph. The optimal width of the swaths, swathsize, is machine-dependent, and should be chosen to be as large as possible (to reduce the loop overhead) while still getting adequate cache reuse.
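In outline, the blocked loop structure might look as follows in C (a sketch only; process_cell is a hypothetical stand-in for the score updates, which appear in full in figure 2):

    /* Sketch of blocking: the middle loop (on k) is stripmined into a
     * kk/k pair, and the k loop is interchanged inside the j loop, so
     * the computation proceeds in vertical swaths of width swathsize. */
    static int min_int(int a, int b) { return a < b ? a : b; }

    static void blocked(int qlen, int rlen, int swathsize,
                        void (*process_cell)(int j, int k))
    {
        for (int kk = 0; kk < rlen; kk += swathsize) {   /* stripmined k loop */
            int width = min_int(swathsize, rlen - kk);
            for (int j = 0; j < qlen; j++)               /* interchanged loops */
                for (int k = 0; k < width; k++)
                    process_cell(j, kk + k);             /* one cell of the yz-plane */
        }
    }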
In the blocked program, the arrays nogap_horz and q_gap_horz are used to store scores computed in the inner loop that are needed at the next level up, while separate arrays nogap_vert and r_gap_vert are needed to store scores at the vertical swath boundaries. Thus, this approach requires additional space proportional to swathsize.
The Value matrix is doubly subscripted. After the loop interchange described above, the first subscript (q[j]) doesn't change in the inner loop, so this doubly subscripted array can be replaced by a singly subscripted one. However, even a single subscript requires an indirect reference. This indirect reference can also be eliminated. Outside the two innermost loops, a sequence of costs is computed for each possible character of the query protein; each sequence gives the cost of matching that character against each character in a segment of the reference protein. Outside the innermost loop, the appropriate sequence for the current character of q is chosen. In the innermost loop, the values of this sequence are read sequentially.
The resulting code is shown in figure 2. The operations have been reordered slightly to avoid register copying. The inner loop entails 5 ADD's, 6 MAX's, 3 loads, and 2 stores.
for i from 0 to |D|-1 do
    for j from 0 to |q|-1 do
        r_gap_vert[j], nogap_vert[j] := U, 0
    for kk from 0 to |D[i]|-1 by swathsize do
        this_swath := min(swathsize, |D[i]|-kk) - 1
        for a in alphabet do
            for k from 0 to this_swath do
                ValueDi[a][k] := Value[a, D[i,k+kk]]
        for k from 0 to this_swath do
            q_gap_horz[k], nogap_horz[k] := U, 0
        temp := 0
        for j from 0 to |q|-1 do
            nogap := nogap_vert[j]
            r_gap := r_gap_vert[j]
            ValueqjDi := ValueDi[q[j]]
            for k from 0 to this_swath do
                r_gap := ADD( V, MAX( r_gap, ADD( nogap, U ) ) )
                nogap := MAX( ADD( temp, ValueqjDi[k] ), ZERO )
                temp := nogap_horz[k]
                q_gap := ADD( V, MAX( q_gap_horz[k], ADD( temp, U ) ) )
                nogap := MAX( nogap, MAX( q_gap, r_gap ) )
                score := MAX( score, nogap )
                nogap_horz[k], q_gap_horz[k] := nogap, q_gap
            temp := nogap_vert[j]
            nogap_vert[j] := nogap
            r_gap_vert[j] := r_gap

Figure 2. Optimized pseudocode.
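For concreteness, the innermost loop of figure 2 might be rendered in C roughly as follows (a sketch with our own scaffolding; the paper's actual source is not reproduced here):

    /* One pass of the innermost loop of figure 2.  The caller owns
     * nogap, r_gap, temp, and score across iterations of the j loop. */
    #define ADD(a, b) ((a) + (b))
    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    static void inner(int this_swath, int U, int V, const int *ValueqjDi,
                      int *nogap_horz, int *q_gap_horz,
                      int *nogap, int *r_gap, int *temp, int *score)
    {
        for (int k = 0; k <= this_swath; k++) {
            *r_gap = ADD(V, MAX(*r_gap, ADD(*nogap, U)));
            *nogap = MAX(ADD(*temp, ValueqjDi[k]), 0);
            *temp  = nogap_horz[k];                    /* diagonal for next k */
            int q_gap = ADD(V, MAX(q_gap_horz[k], ADD(*temp, U)));
            *nogap = MAX(*nogap, MAX(q_gap, *r_gap));
            *score = MAX(*score, *nogap);
            nogap_horz[k] = *nogap;
            q_gap_horz[k] = q_gap;
        }
    }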
We implemented the above pseudocode in C and measured the performance
on a single node of an Intel Paragon (that is, an i860 processor) and
on an IBM SP2 node (i.e. a RISC System/6000 model 590 or "Power2"
processor). The results are shown in figure
3. The times reported are computed by taking the running time of
the programs and dividing by the product of the length of the query
string and the length of the reference string. Thus, they represent
the time to compute one iteration of the inner loop, plus all
amortized overhead. The reference string always had length 1000, while
we varied the length |q| of the query string as shown. There was some
timing variation on the i860 (a maximum of 20 percent, but typically
much less), so the times given are the average of three runs.
                        Intel i860       |      IBM Power2
q-String Length      500   2000   5000   |   500   2000   5000
-----------------------------------------+----------------------
Original Code      954ns    988   1077   |   280    265    264
Optimized Code       886    862    859   |   260    250    250
Speedup             7.6%    15%    25%   |  7.7%   6.0%   5.6%

Figure 3. Result of blocking and inner-loop improvements. Times are
nanoseconds for the entire run divided by the product of the string
lengths.
The times show that the original code on the i860 suffered a 13%
performance degradation for the longest string length, presumably
because of cache misses. The Power2, which has a larger cache, showed
no degradation for these string lengths. Even when the data fits in
cache, the optimized code runs faster than the original,
and its performance actually improves on longer
strings, since the overhead of constructing the localized Value table
is amortized over more iterations. Thus, the optimizations were
successful, although the speed improvements are not particularly
dramatic.
Modern pipelined and superscalar processors execute several instructions concurrently. If an instruction needs to use a result computed by the previous instruction, there may be a pipeline stall or functional unit interlock. Good compilers attempt to choose a sequence of instructions that reduces the number of such delays, but they are limited by the dependences of the source code. To realize the full power of the processor, it is sometimes necessary to restructure the source code to provide more independent operations.
Because of the readily available parallelism of the protein matching problem, this is easy to accomplish; see figure 4. The outermost loop was changed to have a stride of two, and the innermost loop was changed to process two proteins simultaneously. In terms of the Iteration Space Graph, the new code executes two yz-planes simultaneously. In the language of compiler optimizations, the outer loop was stripmined, one of the resulting loops was interchanged all the way to the innermost position, and it was subsequently unrolled. These optimizations are all legal since they respect the data dependences.
We also note that even if there were only one reference protein, it would still be possible, though more complicated, to find instruction-level parallelism. The example of Carter, Ferrante, and Hummel has the same iteration space graph as one yz-plane (except for the shape of the boundary), and the techniques of that paper could be used.
for i from 0 to |D|-1 by 2 do
    ...
    for kk from 0 to |D[i]|-1 by swathsize do
        /* |D[i]| must equal |D[i+1]| */
        ...
        for j from 0 to |q|-1 do
            ...
            for k from 0 to min(swathsize,|D[i]|-kk)-1 do
                r_gap0 := ADD( V, MAX( r_gap0, ADD( nogap0, U ) ) )
                r_gap1 := ADD( V, MAX( r_gap1, ADD( nogap1, U ) ) )
                nogap0 := ...
                ...

Figure 4. Code with more independent inner-loop operations.
We implemented the above code and ran it on the i860 processor and the
Sparc processor of a Sun Sparc 20 workstation. In both cases, the
program ran slightly slower than the optimized code of the preceding
section. However, the instruction-level parallelism resulted in a
modest improvement on the Power2 architecture, which has two
fixed-point functional units (as well as two floating point
units). The results are shown in figure 5. The
IBM xlc compiler used compare and branch instructions to implement most of
the MAX operations, although it used the hardware's "monus"
(difference or zero) instruction for several. Branch instructions
reduce the effectiveness of the multiple functional units, since all
units must be restarted when there is a mis-predicted branch. We
believe that greater speedups would have resulted if the compiler had
made more liberal use of the monus instructions.
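To illustrate, a branch-free MAX built from a monus-style "difference or zero" can be written in C as follows (a sketch, not the compiler's actual output; the mask trick assumes two's-complement integers and an arithmetic right shift):

    /* monus(a,b) = a - b if a > b, else 0; computed without a branch. */
    static int monus(int a, int b) {
        int d = a - b;
        return d & ~(d >> 31);    /* mask is all-ones when d >= 0, else 0 */
    }

    /* MAX(a,b) = monus(a,b) + b, again with no conditional branch. */
    static int max_branchfree(int a, int b) {
        return monus(a, b) + b;
    }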
This section presents a microparallel implementation of the MIN and ADD operations. As with the other forms of parallelism, the operations being executed in parallel correspond to independent reference strings, and these strings should be chosen to be roughly the same length.
Certain z-buffer instructions of the Intel Paragon's i860 processors operate on four 16-bit fields of an 8-byte doubleword simultaneously. The fzchks instruction performs 4 MIN operations in parallel. The i860 also has a doubleword add (fiadd.dd) that performs an integer addition on the four 16-bit fields supplied in even-odd floating-point register pairs. Although there is no z-buffer MAX instruction, we can represent the value of a constant C (StartGap, ExtendGap, or any entry in the Value matrix) by the 16-bit integer BIAS - C, where BIAS = 0xFF9C. With this representation, the fzchks instruction acts as a MAX instruction on the biased encoding. The particular choice of the BIAS constant allows us to represent values in the range -100 < C < 65436, which is appropriate since the MAX's with 0 in the Smith-Waterman code keep the negative numbers close to zero. However, the ADD instruction must take the biasing into account: ADD(x,y) = x + y - BIAS, as the sketch below shows.
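The following C fragment sketches the biased representation on a single 16-bit field (our illustration; on the i860, four such fields are processed at once in a floating-point register pair):

    #include <stdint.h>

    #define BIAS 0xFF9Cu              /* a score C is stored as BIAS - C */

    static uint16_t enc(int c)      { return (uint16_t)(BIAS - c); }
    static int      dec(uint16_t x) { return (int)BIAS - (int)x; }

    /* MIN on encodings (what fzchks computes per field) is MAX on
     * scores, because the encoding reverses the ordering. */
    static uint16_t max_biased(uint16_t x, uint16_t y) {
        return x < y ? x : y;
    }

    /* ADD must remove one copy of the bias: if x = BIAS - a and
     * y = BIAS - b, then x + y - BIAS = BIAS - (a + b). */
    static uint16_t add_biased(uint16_t x, uint16_t y) {
        return (uint16_t)(x + y - BIAS);
    }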
The Paragon's compiler does not produce the needed fiadd.dd, fisub.dd, and fzchks instructions. To get around this problem, we implemented the Smith-Waterman algorithm in C using +, -, and *, respectively, for adding, subtracting, and computing the min of the 8-byte encodings, which were declared as "double". Then a program, CHANGE, modified the assembly code produced by the compiler, converting the floating-point add, subtract, and multiply instructions to fiadd.dd, fisub.dd, and fzchks instructions.
The final change needed to implement the microparallel algorithm is to bias-encode and pack the appropriate groups of four values from the ValueDi array into the 8-byte doublewords. Our earlier optimizations, which removed the indirect addressing from the inner loop, have the added benefit that only 23 packed numbers (one for each possible character of the query string) need to be computed per 1 x |q| x 4 column of the iteration space graph.
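Packing might look as follows in C (a sketch; the field order within the doubleword is our assumption):

    #include <stdint.h>

    /* Pack four bias-encoded 16-bit scores, one per reference string
     * in the bundle, into a single 8-byte quantity. */
    static uint64_t pack4(uint16_t s0, uint16_t s1, uint16_t s2, uint16_t s3) {
        return  (uint64_t)s0
             | ((uint64_t)s1 << 16)
             | ((uint64_t)s2 << 32)
             | ((uint64_t)s3 << 48);
    }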
We implemented 4-way microparallel code and obtained the results of figure 6.
                            Intel i860
q-String Length           500   2000   5000
---------------------------------------------
4-way microparallel     186ns    188    168
Speedup over optimized   4.76   4.59   5.11
Speedup over original    5.13   5.26   6.41

Figure 6. Performance of 4-way microparallel z-buffer code. Times are
nanoseconds divided by the product of the string lengths, and are
further divided by 4 (the number of strings processed in parallel).
The performance improvement is quite dramatic! The "superlinear speedup"
of more than a factor of four for four-way microparallelism is
probably a consequence of the elimination of the conditional
branches used in computing the MAX's in the original and optimized codes.
Even though the z-buffer operations are slower than integer operations,
conditional branches are slower still.
We wish to emphasize that even though a little hacking of the assembly code was needed, the performance gains are not the result of careful assembly code tuning. None of the authors of this paper had prior familiarity with the i860's architecture. We suspect that someone who really understood the i860 could get even more improvements by using the "dual instruction" mode which allows a floating point or z-buffer operation and a fixed point operation to be executed concurrently.
Figure 7 shows how the microparallel ADD and MAX operations can be implemented with ordinary 64-bit integer instructions, for processors that lack a z-buffer. Four 15-bit values are packed into a 64-bit word, each 16-bit field topped by a separator bit:

ADD(A,B):  ADD := A + B                      /* ith field: Ai + Bi */

MAX(A,B):  t1 := A or x8000800080008000      /* puts a 1 in the separator bit of each field of A */
           t2 := t1 - B                      /* ith field: 2^15 + Ai - Bi */
           t3 := t2 and x8000800080008000    /* separator bit is 1 iff Ai >= Bi */
           t4 := t3 shiftright 15            /* move each separator bit to its field's low-order bit */
           t5 := t3 - t4                     /* ith field: 7FFF if Ai >= Bi, else 0 */
           t6 := t2 and t5                   /* ith field: MONUS(Ai,Bi) */
           MAX := t6 + B                     /* ith field: if Ai >= Bi then Ai else Bi */

Figure 7. 64-bit microparallel ADD and MAX operations.
Depending on the length of the proteins being matched and on the table entries, it is possible that fewer than 15 bits are required. This method can easily be modified to process five 11-bit or six 9-bit numbers in parallel.
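On a processor with 64-bit integer operations, figure 7 can be transcribed directly into C (a sketch under the assumption of four 15-bit values per word, each topped by a separator bit):

    #include <stdint.h>

    #define SEP 0x8000800080008000ULL   /* the separator bit of each 16-bit field */

    /* Four parallel ADDs; the 15-bit fields must not overflow. */
    static uint64_t add4(uint64_t a, uint64_t b) {
        return a + b;
    }

    /* Four parallel MAXs, following figure 7 step by step. */
    static uint64_t max4(uint64_t a, uint64_t b) {
        uint64_t t1 = a | SEP;     /* set each field's separator bit */
        uint64_t t2 = t1 - b;      /* ith field: 2^15 + Ai - Bi */
        uint64_t t3 = t2 & SEP;    /* separator is 1 iff Ai >= Bi */
        uint64_t t4 = t3 >> 15;    /* move separator to the low-order bit */
        uint64_t t5 = t3 - t4;     /* 0x7FFF where Ai >= Bi, else 0 */
        uint64_t t6 = t2 & t5;     /* MONUS(Ai, Bi) in each field */
        return t6 + b;             /* Ai where Ai >= Bi, else Bi */
    }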
The two algebraic structures S1 = (R, +, 0, MAX) and S2 = (R+, *, 1, +) are both semirings, where R is the set of real numbers and R+ is the positive reals. Given a real number BASE, the function encode(A) = BASE**A maps numbers from S1 into those of S2. The inverse mapping is decode(X) = log(X)/log(BASE). These transformations preserve the first semiring operation exactly, that is, encode(A + B) = encode(A) * encode(B). The second operation is preserved only approximately: encode(A) + encode(B) is close to encode(MAX(A,B)) when BASE is large or the operands are far apart, which is the source of the small inaccuracies mentioned in the introduction.
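A small C program illustrates the encoding (BASE = 2 is our illustrative choice, workable only while BASE**score stays within floating-point range):

    #include <math.h>
    #include <stdio.h>

    static const double BASE = 2.0;   /* illustrative; not necessarily the value used in our experiments */

    static double encode(double a) { return pow(BASE, a); }
    static double decode(double x) { return log(x) / log(BASE); }

    int main(void) {
        /* ADD becomes multiply, and is preserved exactly: prints 10. */
        printf("%g\n", decode(encode(3.0) * encode(7.0)));
        /* MAX becomes floating-point add, preserved approximately:
         * prints about 7.09 rather than 7; the error shrinks as BASE
         * or the difference between the operands grows. */
        printf("%g\n", decode(encode(3.0) + encode(7.0)));
        return 0;
    }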
The performance advantages of the floating-point formulation of the algorithm are that the pipeline-stopping conditional branches are eliminated, that the integer instruction unit can be devoted to loads and stores while the floating-point unit does the multiplies and adds, and that fused multiply-add instructions can implement a MAX and an ADD in a single instruction. (This last feature could be exploited more fully by restructuring the inner loop to perform 5 multiply-adds and one multiply; we did not perform this experiment.) Figure 10 reports the results of experiments on the Power2. Since this processor has two floating-point units, and each uses a two-stage pipeline, the instruction-level parallelism described earlier is of particular importance. The figure shows the result of matching 1 to 4 reference strings to a query string in parallel.
                              IBM Power2
IL Parallelization        1-way   2-way   3-way   4-way
--------------------------------------------------------
Floating-Point Code       170ns      90      77      78
Speedup over IL-parallel   1.53    2.72       -       -
Speedup over Original      1.59     3.0    3.51    3.46

Figure 10. Performance of the floating-point algorithm. Times are
nanoseconds divided by the product of the string lengths (which are
all 1000 long), and are further divided by the number of strings
processed in parallel.
Once again, the performance is impressive. The clock on the Power2 is about
15ns, so each matching step of the algorithm -- involving 5 ADD's, 6 MAX's,
4 loads, 3 stores, and assorted bookkeeping operations -- takes only about
5 cycles!
Where two or more strings are bundled together to obtain either instruction-level parallelism or microparallelism, the shorter reference strings of the bundle must be padded with non-matching characters; this keeps the answer correct. To minimize wasted cycles, the strings bundled together should be chosen to be of nearly equal length. Again, this is not difficult to do. (It would, however, prove extremely challenging to an automated program transformation tool!)
Introducing low-level parallelism can have a detrimental effect on the load-balancing problem, since there will be fewer bundles than there were reference strings. This effect should be minor, and will diminish as the number of reference strings in the data base grows.
A final concern: whenever the performance of individual processors is improved, the demand on the network that provides data to the processors is increased. While this might lessen the overall performance of multicomputers on some problems, the protein matching problem has a very high computation-to-communication ratio. For instance, with strings of length 1000, a bandwidth of 16 KB/sec is sufficient to keep up with the fastest computation speed of the Power2, even assuming that all reference and query strings are communicated and that no data compression is used. Thus, the factor of 3 to 5 speed improvements demonstrated in this paper should translate directly into comparable improvements on parallel computers.