Strategies for Optimization
This section describes various strategies for improving the performace of applications that use TACO to perform linear and tensor algebra computations.
Selecting the Right Tensor Format
TACO supports storing tensors in a wide range of formats, including many commonly used ones like dense arrays, compressed sparse row (CSR), and compressed sparse fiber (CSF). Using the right formats to store a sparse computation's operands and result can not only reduce the amount of memory needed to perform the computation but also improve its performance. In particular, by selecting formats that accurately describe the sparsity and structure of the operands, TACO can generate code under the hood that exploits these properties of the data to avoid redundantly computing with zero elements and thus speed up a computation.
As previously explained, TACO uses a novel scheme that describes different tensor storage formats by specifying whether each dimension is sparse or dense. A dense dimension indicates to TACO that most if not all slices of the tensor along that dimension contain at least one nonzero element. So if every element in a matrix is nonzero, we can make that explicit by storing the matrix in a format where both dimensions are dense, which indicates that every row is nonempty and that every column in each row stores a nonzero element:
pytaco.format([pytaco.dense, pytaco.dense]) # a.k.a. a dense array
A sparse dimension, on the other hand, indicates to TACO that most slices of the tensor along that dimension contain only zeros. So if relatively few rows of a matrix is nonempty and if relatively few columns in each nonempty row store nonzero elements, we can also make that explicit by storing the matrix in a format where both dimensions are sparse:
pytaco.format([pytaco.compressed, pytaco.compressed]) # a.k.a. a DCSR matrix
Storing a tensor dimension as a sparse dimension incurs overhead that is proportional to the number of nonempty slices along that dimension, so only do so if most slices are actually empty. Otherwise, it is more appropriate to store the dimension as a dense dimension.
It is easy to define custom formats for storing tensors with complex sparsity structures. For example, let's say we have a three-dimensional tensor that has no empty slice along the dimension, and let's say that each row in a slice is either entirely empty (i.e., for all and some fixed , ) or entirely full (i.e., for all and some fixed , ). Following the same scheme as before, we can define a tensor format that stores dimension 2 (i.e., the dimension) as a dense dimension, stores dimension 0 (i.e., the dimension) of each slice along dimension 2 as a sparse dimension, and stores dimension 1 (i.e., the dimension) of each nonempty row as a dense dimension also:
pytaco.format([pytaco.dense, pytaco.compressed, pytaco.dense], [2, 0, 1])
Using the format above, we can then efficiently store without explicitly storing any zero element.
As a rough rule of thumb, using formats that accurately describe the sparsity and structure of the operands lets TACO minimize memory traffic incurred to load tensors from memory as well as minimize redundant work done to perform a computation, which boosts performance. This is particularly the case when only one operand is sparse and the computation does not involve adding elements of multiple operands. This is not a hard and fast rule though. In particular, computing with multiple sparse operands might prevent TACO from applying some optimizations like parallelization that might otherwise be possible if some of those operands were stored in dense formats. Depending on how sparse your data actually is, this may or may not negatively impact performance.
The most reliable way to determine what are the best formats for storing tensors in your application is to just try out many different formats and see what works. Fortunately, as the examples above demonstrate, this is simple to do with TACO.
By default, TACO performs all computations using a single thread. The maximum
number of threads that TACO may use to perform computations can be adjusted by
pytaco.set_num_threads function. The example below, for
instance, tells TACO that up to four threads may be used to execute any
subsequent computation in parallel if possible:
In general, the maximum number of threads for performing computations should not be set greater than the number of available processor cores. And depending on the specific computation and characteristics of the data, setting the maximum number of threads to be less than the number of processor cores may actually yield better performance. As the example above demonstrates, TACO makes it easy to try out different numbers of threads and see what works best for your application.
Setting the maximum number of available threads to be greater than one does not guarantee that all computations will be executed in parallel. In particular, TACO will not execute a computation in parallel if
- it multiplies two or more sparse operands (e.g. sparse vector multiplication) or adds a sparse operand to any other operand (e.g., adding two DCSR matrices), unless the outermost dimensions of the sparse operands are stored as dense dimensions (e.g., adding two CSR matrices, which can be parallelized if the result is stored in a dense matrix);
- the first dimension it has to iterate over is one that is supposed to reduce over (e.g., multiplying a CSC matrix by a vector, which requires iterating over the column dimension of the matrix before the row dimension even though the column dimension is reduced over); or
- it stores the result in a sparse tensor format.
If TACO does not seem to be executing a computation in parallel, using different formats to store the operands and result may help.
By default, when performing computations in parallel, TACO will assign the same
number of coordinates along a particular dimension to be processed by each
thread. For instance, when adding two 1000-by-1000 matrices using two threads,
TACO will have each thread compute exactly 500 rows of the result. This would
be inefficient though if, for instance, all the nonzeros are stored in the
first 500 rows of the operands, since one thread would end up doing all the
work while the other thread does nothing. In cases like this, an alternative
parallelization strategy can be specified by calling the
In contrast to the default parallelization strategy, the dynamic strategy will have each thread first compute just one row of the result. Whenever a thread finishes computing a row, TACO will assign another row for that thread to compute, and this process is repeated until all 1000 rows have been computed. In this way, work is guaranteed to be evenly distributed between the two threads regardless of the sparsity structures of the operands.
Using a dynamic strategy for parallel execution will incur some overhead though, since work is assigned to threads at runtime. This overhead can be reduced by increasing the chunk size, which is the amount of additional work that is assigned to a thread whenever it completes its previously assigned work. The example below, for instance, tells TACO to assign ten additional rows of the result (instead of just one) for a thread to compute whenever it has completed the previous ten rows:
Since dynamic parallelization strategies incur additional overhead, whether or not using them improves the performance of a computation will depend on how evenly spread out the nonzero elements in the tensor operands are. If each matrix contains roughly the same number of nonzeros in every row, for instance, then using a dynamic strategy will likely not more evenly distribute work between threads. In that case, the default static schedule would likely yield better performance.
TACO supports efficiently computing complicated tensor algebra expressions
involving many discrete operations in a single shot. Let's say, for instance,
that we would like to (element-wise) add two vectors
c and compute
the cosine of each element in the sum. We can, of course, simply compute the
addition and the cosine of the sum in separate statements:
t[i] = b[i] + c[i] a[i] = pt.cos(t[i])
The program above will first invoke TACO to add
c, store the result
into a temporary vector
t, and then invoke TACO again to compute the cosine
of every element in
t. Performing the computation this way though not only
requires additional memory for storing
t but also requires accessing the
memory subsystem to first write
t to memory and then load
t back from
memory, which is inefficient if the vectors are large and cannot be stored in
cache. Instead, we can compute the addition and the cosine of the sum in a
a[i] = pt.cos(b[i] + c[i])
For the program above, TACO will automatically generate code that, for every
i, immediately computes the cosine of
b[i] + c[i] as soon as the sum is
computed. TACO thus avoids storing the sum of
c in a temporary
vector, thereby increasing the performance of the computation.
Fusing computations can improve performance if it does not require intermediate
results to be recomputed multiple times, as is the case with the previous
example. Let's say, however, that we would like to multiply a matrix
B by a
c and then multiply another matrix
A by the result of the first
multiplication. As before, we can express both operations in a single
y[i] = A[i,j] * B[j,k] * x[k]
In this case though, computing both operations in one shot would require that
the multiplication of
x be redundantly recomputed for every
(non-empty) row of
A, thus reducing performance. By contrast, computing the
two matrix-vector multiplications in separate statement ensures that the result
of the first matrix-vector multiplication does not have to be redundantly
computed, thereby minimizing the amount of work needed to perform the
t[j] = B[j,k] * c[k] y[i] = A[i,j] * t[j]