Guide to Benchmarking
The performance of Python applications that use TACO can be measured using
Python's built-in time.perf_counter
function with minimal changes to the
applications. As an example, we can benchmark the performance of the
scientific computing application shown here as
follows:
import pytaco as pt
from pytaco import compressed, dense
import numpy as np
import time
csr = pt.format([dense, compressed])
dv = pt.format([dense])
A = pt.read("pwtk.mtx", csr)
x = pt.from_array(np.random.uniform(size=A.shape[1]))
z = pt.from_array(np.random.uniform(size=A.shape[0]))
y = pt.tensor([A.shape[0]], dv)
i, j = pt.get_index_vars(2)
y[i] = A[i, j] * x[j] + z[i]
# Tell TACO to generate code to perform the SpMV computation
y.compile()
# Benchmark the actual SpMV computation
start = time.perf_counter()
y.compute()
end = time.perf_counter()
print("Execution time: {0} seconds".format(end - start))
To accurately measure TACO's computational performance, only the time it takes to actually perform a computation should be measured. The time TACO spends generating code under the hood for that computation should not be included, since this overhead can vary greatly and can often be amortized in practice. By default, however, TACO only generates and compiles the code needed for a computation immediately before the computation is actually performed. As the example above demonstrates, manually calling the result tensor's compile method tells TACO to generate that code before benchmarking starts, letting us measure only the performance of the computation itself.
Warning
pytaco.evaluate and pytaco.einsum should not be used to benchmark TACO's computational performance, since timing those functions will include the time it takes to generate code for performing the computation.
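As a rough sketch of this anti-pattern, timing a call to pytaco.evaluate directly would look like the following. The index expression string below is assumed to describe the same SpMV computation as the example above; the exact syntax is not the point, only that code generation happens inside the timed region:
# Anti-pattern sketch: the timed region includes code generation as
# well as the computation itself, so it overstates the compute cost.
start = time.perf_counter()
y = pt.evaluate("y(i) = A(i,j) * x(j) + z(i)", A, x, z)
end = time.perf_counter()
print("Time including code generation: {0} seconds".format(end - start))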
The time it takes to construct the initial operand tensors should also not be measured, since again this overhead can often be amortized in practice. By default, pytaco.read and the functions for converting NumPy arrays and SciPy matrices to TACO tensors return fully constructed tensors. If you add nonzero elements to an operand tensor by invoking its insert method, however, then its pack method must also be explicitly invoked before any benchmarking is done:
import pytaco as pt
from pytaco import compressed, dense
import numpy as np
import random
import time
csr = pt.format([dense, compressed])
dv = pt.format([dense])
A = pt.read("pwtk.mtx", csr)
x = pt.tensor([A.shape[1]], dv)
z = pt.tensor([A.shape[0]], dv)
y = pt.tensor([A.shape[0]], dv)
# Insert random values into x and z and pack them into dense arrays
for k in range(A.shape[1]):
    x.insert([k], random.random())
x.pack()
for k in range(A.shape[0]):
    z.insert([k], random.random())
z.pack()
i, j = pt.get_index_vars(2)
y[i] = A[i, j] * x[j] + z[i]
y.compile()
start = time.perf_counter()
y.compute()
end = time.perf_counter()
print("Execution time: {0} seconds".format(end - start))
TACO avoids regenerating code for the same computation, however, as long as the computation is redefined with the same index variables and the same operand and result tensors. Thus, if your application executes the same computation many times in a loop and the computation runs on sufficiently large data sets, TACO will naturally amortize the overhead of generating code for the computation. In such scenarios, it is acceptable to include the initial code generation overhead in the performance measurement:
import pytaco as pt
from pytaco import compressed, dense
import numpy as np
import random
import time
csr = pt.format([dense, compressed])
dv = pt.format([dense])
A = pt.read("pwtk.mtx", csr)
x = pt.tensor([A.shape[1]], dv)
z = pt.tensor([A.shape[0]], dv)
y = pt.tensor([A.shape[0]], dv)
for k in range(A.shape[1]):
    x.insert([k], random.random())
x.pack()
for k in range(A.shape[0]):
    z.insert([k], random.random())
z.pack()
i, j = pt.get_index_vars(2)
# Benchmark the iterative SpMV computation, including overhead for
# generating code in the first iteration to perform the computation
start = time.perf_counter()
for k in range(1000):
    y[i] = A[i, j] * x[j] + z[i]
    y.evaluate()
    x[i] = y[i]
    x.evaluate()
end = time.perf_counter()
print("Execution time: {0} seconds".format(end - start))
Warning
In order to avoid regenerating code for a computation, the computation must be redefined with the exact same index variable objects and the exact same tensor objects for operands and result. In the example above, every loop iteration redefines the computations of y and x using the same tensor and index variable objects constructed outside the loop, so TACO only generates code to compute y and x in the first iteration. If the index variables were instead constructed inside the loop, TACO would regenerate code to compute y and x in every iteration, and the compilation overhead would not be amortized.
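As a sketch of that anti-pattern, a loop structured like the following would regenerate code on every iteration, since pt.get_index_vars returns new index variable objects each time it is called:
# Anti-pattern sketch: fresh index variable objects in every iteration
# force TACO to treat each redefinition as a new computation.
for k in range(1000):
    i, j = pt.get_index_vars(2)   # new objects each iteration
    y[i] = A[i, j] * x[j] + z[i]  # same expression, but code is regenerated
    y.evaluate()
    x[i] = y[i]
    x.evaluate()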
Note
As a rough rule of thumb, if a computation takes on the order of seconds or more in total to perform across all invocations with identical operands and result (and is always redefined with identical index variables), then it is acceptable to include the overhead associated with generating code for performing the computation in performance measurements.
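One way to check whether this rule of thumb applies is to time the first invocation, which pays the code generation cost, separately from the remaining invocations. The following sketch reuses the tensors and index variables from the previous example:
# Time the first iteration (which includes code generation) separately
# from the remaining iterations (which reuse the generated code).
start = time.perf_counter()
y[i] = A[i, j] * x[j] + z[i]
y.evaluate()
first = time.perf_counter() - start
start = time.perf_counter()
for k in range(999):
    y[i] = A[i, j] * x[j] + z[i]
    y.evaluate()
steady = time.perf_counter() - start
# If the time for the remaining iterations dwarfs the first one, the
# code generation overhead is effectively amortized.
print("First iteration: {0} seconds".format(first))
print("Remaining 999 iterations: {0} seconds".format(steady))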