TACO is built on research that is described in the following publications, which include conference papers and preprints (denoted by ) as well as master’s and PhD theses (denoted by ). Simply click on the title of any paper or thesis below to see more information, download a copy, or watch the accompanying talk.

(*) denotes co-first authors.

2022

##
**Compilation of Dynamic Sparse Tensor Algebra**
Stephen Chou and Saman Amarasinghe (OOPSLA 2022)

Download Paper Download SlidesAbstract

Many applications, from social network graph analytics to control flow analysis, compute on sparse data that evolves over the course of program execution. Such data can be represented as dynamic sparse tensors and efficiently stored in formats (data layouts) that utilize pointer-based data structures like block linked lists, binary search trees, B-trees, and C-trees among others. These specialized formats support fast in-place modification and are thus better suited than traditional, array-based data structures like CSR for storing dynamic sparse tensors. However, different dynamic sparse tensor formats have distinct benefits and drawbacks, and performing different computations on tensors that are stored in different formats can require vastly dissimilar code that are not straightforward to correctly implement and optimize.

This paper shows how a compiler can generate efficient code to compute tensor algebra operations on dynamic sparse tensors that may be stored in a wide range of disparate formats. We propose a language for precisely specifying recursive, pointer-based data structures, and we show how this language can express many different dynamic data structures, including all the ones named above as well as many more. We then describe how, given high-level specifications of such dynamic data structures, a compiler can emit code to efficiently access and compute on dynamic sparse tensors that are stored in the aforementioned data structures.

We evaluate our technique and find it generates efficient dynamic sparse tensor algebra kernels that have performance comparable to, if not better than, state-of-the-art libraries and frameworks such as PAM, Aspen, STINGER, and Terrace. At the same time, our technique supports a wider range of tensor algebra operations—such as those that simultaneously compute with static and dynamic sparse tensors—than Aspen, STINGER, and Terrace, while also achieving significantly better performance than PAM for those same operations.BibTex

@article{chou:2022:dynamic, author = {Chou, Stephen and Amarasinghe, Saman}, title = {Compilation of Dynamic Sparse Tensor Algebra}, year = {2022}, issue_date = {October 2022}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {6}, number = {OOPSLA2}, url = {https://doi.org/10.1145/3563338}, doi = {10.1145/3563338}, journal = {Proc. ACM Program. Lang.}, month = {oct}, articleno = {175}, numpages = {30}, keywords = {dynamic sparse tensors, sparse tensor algebra, sparse tensor formats, pointer-based data structures, sparse tensor algebra compilation, node schema language} }

##
**Format Abstractions for the Compilation of Sparse Tensor Algebra**
Stephen Chou (PhD Thesis)

Download ThesisAbstract

Tensors are commonly used to represent data in many domains, including data analytics, machine learning, science, and engineering. Many highly-optimized libraries and compilers have been developed for efficiently computing on dense tensors. However, existing libraries and compilers are limited in their ability to support real-world applications that work with sparse tensors, which contain mostly zeros. In particular, there exist countless specialized formats for storing sparse tensors in memory, each suited to specific types of applications and data. Since different formats often use very different data structures to store nonzeros though, computing with sparse tensors that are stored in different formats can require vastly dissimilar code that are all difficult to implement by hand and non-trivial to generate automatically. Existing libraries and compilers must therefore limit the set of computations and formats that they directly support, sacrificing usability and performance as a result.

In this dissertation, I describe how to build a compiler that supports efficiently computing on sparse tensors that may be stored in a wide variety of formats. I first show how many commonly-used sparse tensor formats---from array-based formats like CSR, COO, and DIA to formats that store nonzeros using pointer-based data structures like linked lists, BSTs, and C-trees---can all be expressed as compositions of per-dimension formats. I further show how such per-dimension formats can be precisely defined by implementing a common set of abstractions that capture how their underlying data structures store nonzeros in memory and that capture how these data structures can be efficiently accessed or constructed. I then demonstrate how, with such specifications of per-dimension formats at hand, a compiler can generate code to efficiently compute on tensors that are stored in any of the aforementioned---and countless other---formats. We have implemented our technique in the TACO sparse tensor algebra compiler, which is the first compiler to generate code that computes any basic tensor algebra expression with sparse tensors that may be stored in arbitrary formats. Our technique generates code that has performance competitive with, if not better than, equivalent code in hand-optimized libraries and frameworks.BibTex

@PhDThesis{chou:2022:phd-thesis, title = "Format Abstractions for the Compilation of Sparse Tensor Algebra", author = "Stephen Chou", month = "Aug", year = "2022", url = "http://tensor-compiler.org/files/chou-phd-thesis-taco-formats.pdf", type = "Ph.D. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

##
**Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model**
Peter Ahrens, Fredrik Kjolstad, and Saman Amarasinghe (PLDI 2022)

Download PaperAbstract

While loop reordering and fusion can make big impacts on the constant-factor performance of dense tensor programs, the effects on sparse tensor programs are asymptotic, often leading to orders of magnitude performance differences in practice. Sparse tensors also introduce a choice of compressed storage formats that can have asymptotic effects. Research into sparse tensor compilers has led to simplified languages that express these tradeoffs, but the user is expected to provide a schedule that makes the decisions. This is challenging because schedulers must anticipate the interaction between sparse formats, loop structure, potential sparsity patterns, and the compiler itself. Automating this decision making process stands to finally make sparse tensor compilers accessible to end users.

We present, to the best of our knowledge, the first automatic asymptotic scheduler for sparse tensor programs. We provide an approach to abstractly represent the asymptotic cost of schedules and to choose between them. We narrow down the search space to a manageably small Pareto frontier of asymptotically non-dominating kernels. We test our approach by compiling these kernels with the TACO sparse tensor compiler and comparing them with those generated with the default TACO schedules. Our results show that our approach reduces the scheduling space by orders of magnitude and that the generated kernels perform asymptotically better than those generated using the default schedules.BibTex

@inproceedings{ahrens:2022:autoscheduling, author = {Ahrens, Peter and Kjolstad, Fredrik and Amarasinghe, Saman}, title = {Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model}, year = {2022}, isbn = {9781450392655}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3519939.3523442}, doi = {10.1145/3519939.3523442}, booktitle = {Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation}, pages = {269–285}, numpages = {17}, keywords = {Asymptotic Analysis, Automatic Scheduling, Compilers, Conjunctive Query Containment, Query Optimization, Sparse Tensors}, location = {San Diego, CA, USA}, series = {PLDI 2022} }

##
**DISTAL: The Distributed Tensor Algebra Compiler**
Rohan Yadav, Alex Aiken, and Fredrik Kjolstad (PLDI 2022)

Download PaperAbstract

We introduce DISTAL, a compiler for dense tensor algebra that targets modern distributed and heterogeneous systems. DISTAL lets users independently describe how tensors and computation map onto target machines through separate format and scheduling languages. The combination of choices for data and computation distribution creates a large design space that includes many algorithms from both the past (e.g., Cannon’s algorithm) and the present (e.g., COSMA). DISTAL compiles a tensor algebra domain specific language to a distributed task-based runtime system and supports nodes with multi-core CPUs and multiple GPUs. Code generated by is competitive with optimized codes for matrix multiply on 256 nodes of the Lassen supercomputer and outperforms existing systems by between 1.8x to 3.7x (with a 45.7x outlier) on higher order tensor operations.

BibTex

@inproceedings{yadav:2022:distal, author = {Yadav, Rohan and Aiken, Alex and Kjolstad, Fredrik}, title = {DISTAL: The Distributed Tensor Algebra Compiler}, year = {2022}, isbn = {9781450392655}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3519939.3523437}, doi = {10.1145/3519939.3523437}, booktitle = {Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation}, pages = {286–300}, numpages = {15}, keywords = {Compilers, Distributed Systems, High Performance Computing}, location = {San Diego, CA, USA}, series = {PLDI 2022} }

##
**Unified Compilation for Lossless Compression and Sparse Computing**
Daniel Donenfeld, Stephen Chou, and Saman Amarasinghe (CGO 2022)

Download PaperAbstract

This paper shows how to extend sparse tensor algebra compilers to support lossless compression techniques, including variants of run-length encoding and Lempel-Ziv compression. We develop new abstractions to represent losslessly compressed data as a generalized form of sparse tensors, with repetitions of values (which are compressed out in storage) represented by non-scalar, dynamic fill values. We then show how a compiler can use these abstractions to emit efficient code that computes on losslessly compressed data. By unifying lossless compression with sparse tensor algebra, our technique is able to generate code that computes with both losslessly compressed data and sparse data, as well as generate code that computes directly on compressed data without needing to first decompress it.

Our evaluation shows our technique generates efficient image and video processing kernels that compute on losslessly compressed data. We find that the generated kernels are up to 16. 3× faster than equivalent dense kernels generated by TACO, a tensor algebra compiler, and up to 16. 1× faster than OpenCV, a widely used image processing library.BibTex

@inproceedings{donenfeld:2022:compression, author={Donenfeld, Daniel and Chou, Stephen and Amarasinghe, Saman}, booktitle={2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)}, title={Unified Compilation for Lossless Compression and Sparse Computing}, year={2022}, pages={205-216}, doi={10.1109/CGO53902.2022.9741282} }Video

2021

##
**Compilation of Sparse Array Programming Models**
Rawn Henry*, Olivia Hsu*, Rohan Yadav, Stephen Chou, Kunle Olukotun, Saman Amarasinghe, and Fredrik Kjolstad (OOPSLA 2021)

Download PaperAbstract

This paper shows how to compile sparse array programming languages. A sparse array programming language is an array programming language that supports element-wise application, reduction, and broadcasting of arbitrary functions over dense and sparse arrays with any fill value. Such a language has great expressive power and can express sparse and dense linear and tensor algebra, functions over images, exclusion and inclusion filters, and even graph algorithms.

Our compiler strategy generalizes prior work in the literature on sparse tensor algebra compilation to support any function applied to sparse arrays, instead of only addition and multiplication. To achieve this, we generalize the notion of sparse iteration spaces beyond intersections and unions. These iteration spaces are automatically derived by considering how algebraic properties annotated onto functions interact with the fill values of the arrays. We then show how to compile these iteration spaces to efficient code.

When compared with two widely-used Python sparse array packages, our evaluation shows that we generate built-in sparse array library features with a performance of 1.4× to 53.7× when measured against PyData/Sparse for user-defined functions and between 0.98× and 5.53× when measured against SciPy/Sparse for sparse array slicing. Our technique outperforms PyData/Sparse by 6.58× to 70.3×, and (where applicable) performs between 0.96× and 28.9× that of a dense NumPy implementation, on end-to-end sparse array applications. We also implement graph linear algebra kernels in our system with a performance of between 0.56× and 3.50× compared to that of the hand-optimized SuiteSparse:GraphBLAS library.BibTex

@article{henry_hsu:2021:array, author = {Henry, Rawn and Hsu, Olivia and Yadav, Rohan and Chou, Stephen and Olukotun, Kunle and Amarasinghe, Saman and Kjolstad, Fredrik}, title = {Compilation of Sparse Array Programming Models}, year = {2021}, issue_date = {October 2021}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {5}, number = {OOPSLA}, url = {https://doi.org/10.1145/3485505}, doi = {10.1145/3485505}, journal = {Proc. ACM Program. Lang.}, month = {oct}, articleno = {128}, numpages = {29}, keywords = {Sparse Array Programming, Compilation, Sparse Arrays} }Video

2020

##
**A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra**
Ryan Senanayake, Changwan Hong, Ziheng Wang, Amalee Wilson, Stephen Chou, Shoaib Kamil, Saman Amarasinghe, and Fredrik Kjolstad (OOPSLA 2020)

Download PaperAbstract

We address the problem of optimizing sparse tensor algebra in a compiler and show how to define standard loop transformations---split, collapse, and reorder---on sparse iteration spaces. The key idea is to track the transformation functions that map the original iteration space to derived iteration spaces. These functions are needed by the code generator to emit code that maps coordinates between iteration spaces at runtime, since the coordinates in the sparse data structures remain in the original iteration space. We further demonstrate that derived iteration spaces can tile both the universe of coordinates and the subset of nonzero coordinates: the former is analogous to tiling dense iteration spaces, while the latter tiles sparse iteration spaces into statically load-balanced blocks of nonzeros. Tiling the space of nonzeros lets the generated code efficiently exploit heterogeneous compute resources such as threads, vector units, and GPUs.

We implement these concepts by extending the sparse iteration theory implementation in the TACO system. The associated scheduling API can be used by performance engineers or it can be the target of an automatic scheduling system. We outline one heuristic autoscheduling system, but other systems are possible. Using the scheduling API, we show how to optimize mixed sparse-dense tensor algebra expressions on CPUs and GPUs. Our results show that the sparse transformations are sufficient to generate code with competitive performance to hand-optimized implementations from the literature, while generalizing to all of the tensor algebra.BibTex

@article{senanayake:2020:scheduling, author = {Senanayake, Ryan and Hong, Changwan and Wang, Ziheng and Wilson, Amalee and Chou, Stephen and Kamil, Shoaib and Amarasinghe, Saman and Kjolstad, Fredrik}, title = {A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra}, year = {2020}, issue_date = {November 2020}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {4}, number = {OOPSLA}, url = {https://doi.org/10.1145/3428226}, doi = {10.1145/3428226}, journal = {Proc. ACM Program. Lang.}, month = nov, articleno = {158}, numpages = {30}, keywords = {Sparse Tensor Algebra, Sparse Iteration Spaces, Optimizing Transformations} }Video

##
**Sparse Tensor Transpositions**
Suzanne Mueller, Peter Ahrens, Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe (SPAA 2020)

Download PaperAbstract

We present a new algorithm for transposing sparse tensors called Quesadilla. The algorithm converts the sparse tensor data structure to a list of coordinates and sorts it with a fast multi-pass radix algorithm that exploits knowledge of the requested transposition and the tensors input partial coordinate ordering to provably minimize the number of parallel partial sorting passes. We evaluate both a serial and a parallel implementation of Quesadilla on a set of 19 tensors from the FROSTT collection, a set of tensors taken from scientific and data analytic applications. We compare Quesadilla and a generalization, Top-2-sadilla to several state of the art approaches, including the tensor transposition routine used in the SPLATT tensor factorization library. In serial tests, Quesadilla was the best strategy for 60% of all tensor and transposition combinations and improved over SPLATT by at least 19% in half of the combinations. In parallel tests, at least one of Quesadilla or Top-2-sadilla was the best strategy for 52% of all tensor and transposition combinations.

BibTex

@inproceedings{mueller:2020:transposition, author = {Mueller, Suzanne and Ahrens, Peter and Chou, Stephen and Kjolstad, Fredrik and Amarasinghe, Saman}, title = {Sparse Tensor Transpositions}, year = {2020}, isbn = {9781450369350}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3350755.3400245}, doi = {10.1145/3350755.3400245}, booktitle = {Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures}, pages = {559–561}, numpages = {3}, keywords = {COO, sorting, radix sort, transposition, sparse tensors}, location = {Virtual Event, USA}, series = {SPAA '20} }Video

##
**Automatic Generation of Efficient Sparse Tensor Format Conversion Routines**
Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe (PLDI 2020)

Download Paper Download SlidesAbstract

This paper shows how to generate code that efficiently converts sparse tensors between disparate storage formats (data layouts) such as CSR, DIA, ELL, and many others. We decompose sparse tensor conversion into three logical phases: coordinate remapping, analysis, and assembly. We then develop a language that precisely describes how different formats group together and order a tensor’s nonzeros in memory. This lets a compiler emit code that performs complex remappings of nonzeros when converting between formats. We also develop a query language that can extract statistics about sparse tensors, and we show how to emit efficient analysis code that computes such queries. Finally, we define an abstract interface that captures how data structures for storing a tensor can be efficiently assembled given specific statistics about the tensor. Disparate formats can implement this common interface, thus letting a compiler emit optimized sparse tensor conversion code for arbitrary combinations of many formats without hard-coding for any specific combination.

Our evaluation shows that the technique generates sparse tensor conversion routines with performance between 1.00 and 2.01× that of hand-optimized versions in SPARSKIT and Intel MKL, two popular sparse linear algebra libraries. And by emitting code that avoids materializing temporaries, which both libraries need for many combinations of source and target formats, our technique outperforms those libraries by 1.78 to 4.01× for CSC/COO to DIA/ELL conversion.BibTex

@inproceedings{chou:2020:conversion, author = {Chou, Stephen and Kjolstad, Fredrik and Amarasinghe, Saman}, title = {Automatic Generation of Efficient Sparse Tensor Format Conversion Routines}, year = {2020}, isbn = {9781450376136}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3385412.3385963}, doi = {10.1145/3385412.3385963}, booktitle = {Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation}, pages = {823–838}, numpages = {16}, keywords = {coordinate remapping notation, attribute query language, sparse tensor formats, sparse tensor conversion, sparse tensor assembly, sparse tensor algebra}, location = {London, UK}, series = {PLDI 2020} }Video

##
**A Framework for Computing on Sparse Tensors based on Operator Properties**
Rawn Henry (MEng Thesis)

Download ThesisAbstract

Tensor operations have been traditionally limited to addition and multiplication operations. For operations of sparse tensors, these semantics were extended to account for the fact that tensors usually omit zero values. However, there are many operators with a rich semantics of operator properties that can be used in dense and sparse tensor computations.

This work addresses the problem of generating code for computing on a mix of sparseand dense tensors based on the properties of the operators on those tensors. I introduce the concept of a fill value to each tensor so that the data can be sparse on non-zeros. I show how to reason about the operator properties, along with the fillvalues of the input tensors in order to construct an IR describing how to iterate overthese tensors. I show how we can take advantage of the operator properties to perform useful optimizations for both iterating over tensors and performing reductions. Lastly, I show how a user can leverage set notation to directly describe to a compiler how it should iterate over sparse tensors.

The ideas discussed in this work have been prototyped in the open-source TACOsystem. The API used makes operator properties and tensor fill values have to beexplicitly provided by the user. However, it makes the TACO system much more flexible. I show how the primitives exposed in this work allows one to efficiently performseveral graph algorithms by drawing on the literature about GraphBLAS. In the evaluation section, we benchmark this system against the SuiteSparse implementation ofGraphBLAS on a variety of graph algorithms to demonstrate its performance.BibTex

@MastersThesis{henry:2020:meng-thesis, title = "A Framework for Computing on Sparse Tensors based on Operator Properties", author = "Rawn Henry", month = "May", year = "2020", url = "http://tensor-compiler.org/files/henry-meng-thesis-taco-array.pdf", type = "M.Eng. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

##
**Automatic Optimization of Sparse Tensor Algebra Programs**
Ziheng Wang (MEng Thesis)

Download ThesisAbstract

In this thesis, I attempt to give some guidance on how to automatically optimize programs using a domain-specific-language (DSLs) compiler that exposes a set of scheduling commands. These DSLs have proliferated as of late, including Halide, TACO, Tiramisu and TVM, to name a few. The scheduling commands allow succinct expression of the programmer’s desire to perform certain loop transformations,such as strip-mining, tiling, collapsing and parallelization, which the compiler proceeds to carry out. I explore if we can automatically generate schedules with good performance.

The main difficulty in searching for good schedules is the astronomical number of valid schedules for a particular program. I describe a system which generates a list of candidate schedules through a set of modular stages. Different optimization decisions are made at each stage, to trim down the number of schedules considered. I argue that certain sequences of scheduling commands are equivalent in their effect in partitioning the iteration space, and introduce heuristics that limit the number of permutations of variables. I implement these ideas for the open-source TACO system. I demonstrate several orders of magnitude reduction in the effective schedule search space. I also show that for most of the problems considered, we can generate schedules better than or equal to hand-tuned schedules in terms of performance.BibTex

@MastersThesis{wang:2020:meng-thesis, title = "Automatic Optimization of Sparse Tensor Algebra Programs", author = "Ziheng Wang", month = "May", year = "2020", url = "http://tensor-compiler.org/files/wang-meng-thesis-taco-autoscheduling.pdf", type = "M.Eng. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

##
**A Unified Iteration Space Transformation Framework for Sparse and Dense Tensor Algebra**
Ryan Senanayake (MEng Thesis)

Download ThesisAbstract

This work addresses the problem of optimizing mixed sparse and dense tensor algebra in a compiler. I show that standard loop transformations, such as strip-mining, tiling, collapsing, parallelization and vectorization, can be applied to irregular loops over sparse iteration spaces. I also show how these transformations can be applied to the contiguous value arrays of sparse tensor data structures, which I call their position spaces, to unlock load-balanced tiling and parallelism.

These concepts have been prototyped in the open-source TACO system, where they are exposed as a scheduling API similar to the Halide domain-specific language for dense computations. Using this scheduling API, I show how to optimize mixed sparse/dense tensor algebra expressions, how to generate load-balanced code by scheduling sparse tensor algebra in position space, and how to generate sparse tensor algebra GPU code. As shown in the evaluation, these transformations allow us to generate code that is competitive with many hand-optimized implementations from the literature.BibTex

@MastersThesis{senanayake:2020:meng-thesis, title = "A Unified Iteration Space Transformation Framework for Sparse and Dense Tensor Algebra", author = "Ryan Senanayake", month = "Feb", year = "2020", url = "http://tensor-compiler.org/files/senanayake-meng-thesis-taco-scheduling.pdf", type = "M.Eng. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

##
**Sparse Tensor Transpositions in the Tensor Algebra Compiler**
Suzanne Mueller (MEng Thesis)

Download ThesisAbstract

The current state of the art for transposing sparse tensors involves converting the sparse tensor into a list of coordinates, sorting the list of coordinates and finally packing the list of coordinates into the desired sparse tensor format. This thesis explores the potential for faster methodologies. Its main contributions are an algorithm that exploits partial sortedness to minimize sorting passes and an implementation that demonstrates that this transposition algorithm is competitive with state of the art. In particular the algorithm takes advantage of the ordering that already exists to apply selective sorting passes and thereby reduce the amount of work that needsto be done to reorder the tensor.

BibTex

@MastersThesis{mueller:2020:meng-thesis, title = "Sparse Tensor Transpositions in the Tensor Algebra Compiler", author = "Suzanne Mueller", month = "Feb", year = "2020", url = "http://tensor-compiler.org/files/mueller-meng-thesis-taco-transposition.pdf", type = "M.Eng. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

##
**Sparse Tensor Algebra Compilation**
Fredrik Kjolstad (PhD Thesis)

Download ThesisAbstract

This dissertation shows how to compile any sparse tensor algebra expression to CPU and GPU code that matches the performance of hand-optimized implementations. A tensor algebra expression is sparse if at least one of its tensor operands is sparse, and a tensor is sparse if most of its values are zero. If a tensor is sparse, then we can store its nonzero values in a compressed data structure, and omit the zeros. Indeed, as the matrices and tensors in many important applications are extremely sparse, compressed data structures provide the only practical means to store them. A sparse tensor algebra expression may contain any number of operations, which must be compiled to fused sparse loops that compute the entire expression simultaneously. It is not viable to support only binary expressions, because their composition may result in worse asymptotic complexity than the fused implementation. I present compiler techniques to generate fused loops that coiterate over any number of tensors stored in different types of data structures. By design, these loops avoid computing values known to be zero due to the algebraic properties of their operators.

Sparse tensor algebra compilation is made possible by a sparse iteration theory that formulates sparse iteration spaces as set expressions of the coordinates of nonzero values. By ordering iteration space dimensions hierarchically, the compiler recursively generates loops that coiterate over tensor data structures one dimension at a time. By organizing per-dimension coiteration into regions based on algebraic operator properties, it removes operations that will result in zero. And by transforming the sparse iteration spaces, it optimizes the generated code. The result is the first sparse iteration compiler, called the Tensor Algebra Compiler (taco). Taco can compile any tensor algebra expressions, with tensors stored in different types of sparse and dense data structures, to code that matches the performance of hand-optimized implementations on CPUs and GPUs.BibTex

@PhDThesis{kjolstad:2020:phd-thesis, title = "Sparse Tensor Algebra Compilation", author = "Fredrik Kjolstad", month = "Feb", year = "2020", url = "http://tensor-compiler.org/files/kjolstad-phd-thesis-taco-compiler.pdf", type = "Ph.D. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

2019

##
**A Tensor Algebra Compiler Library Interface and Runtime**
Patricio Noyola (MEng Thesis)

Download ThesisAbstract

Tensor algebra is a powerful tool for computing on multidimensional data and has applications in many fields. The number of possible tensor operations is infinite, so it is impossible to manually implement all of them to operate on different tensor dimensions. The Tensor Algebra Compiler (taco) introduced a compiler approach to automatically generate kernels for any compound tensor algebra operation on any input tensor formats.

In this thesis, we present a new API for the taco library. The API removes the need to call compiler methods with the introduction of a delayed execution framework. Additionally, the API introduces multiple important tensor algebra features previously unavailable in taco. Finally, we propose extensions to taco’s code generation algorithm to automatically generate tensor API methods for any tensor format.

The proposed API makes taco code cleaner, shorter and more elegant. Furthermore, the extensions to its code generation algorithm make the API scalable to new formats and operations.BibTex

@MastersThesis{noyola:2019:meng-thesis, title = "A Tensor Algebra Compiler Library Interface and Runtime", author = "Patricio Noyola", month = "May", year = "2019", url = "http://tensor-compiler.org/files/noyola-meng-thesis-taco-interface.pdf", type = "M.Eng. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

##
**SuperTaco: Taco Tensor Algebra Kernels on Distributed Systems Using Legion**
Sachin Dilip Shinde (MEng Thesis)

Download ThesisAbstract

Tensor algebra is a powerful language for expressing computation on multidimensional data. While many tensor datasets are sparse, most tensor algebra libraries have limited support for handling sparsity. The Tensor Algebra Compiler (Taco) has introduced a taxonomy for sparse tensor formats that has allowed them to compile sparse tensor algebra expressions to performant C code, but they have not taken advantage of distributed systems.

This work provides a code generation technique for creating Legion programs that distribute the computation of Taco tensor algebra kernels across distributed systems, and a scheduling language for controlling how this distributed computation is structured. This technique is implemented in the form of a command-line tool called SuperTaco. We perform a strong scaling analysis for the SpMV and TTM kernels under a row blocking distribution schedule, and find speedups of 9-10x when using 20 cores on a single node. For multi-node systems using 20 cores per node, SpMV achieves a 33.3x speedup at 160 cores and TTM achieves a 42.0x speedup at 140 cores.BibTex

@MastersThesis{shinde:2019:meng-thesis, title = "SuperTaco: Taco Tensor Algebra Kernels on Distributed Systems Using Legion", author = "Sachin Dilip Shinde", month = "Feb", year = "2019", url = "http://tensor-compiler.org/files/shinde-meng-thesis-taco-distributed.pdf", type = "M.Eng. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

##
**Tensor Algebra Compilation with Workspaces**
Fredrik Kjolstad, Peter Ahrens, Shoaib Kamil, and Saman Amarasinghe (CGO 2019)

Download PaperAbstract

This paper shows how to extend sparse tensor algebra compilers to introduce temporary tensors called workspaces to avoid inefficient sparse data structures accesses. We develop an intermediate representation (IR) for tensor operations called concrete index notation that specifies when sub-computations occur and where they are stored. We then describe the workspace transformation in this IR, how to programmatically invoke it, and how the IR is compiled to sparse code. Finally, we show how the transformation can be used to optimize sparse tensor kernels, including sparse matrix multiplication, sparse tensor addition, and the matricized tensor times Khatri-Rao product (MTTKRP).

Our results show that the workspace transformation brings the performance of these kernels on par with hand-optimized implementations. For example, we improve the performance of MTTKRP with dense output by up to 35\%, and enable generating sparse matrix multiplication and MTTKRP with sparse output, neither of which were supported by prior tensor algebra compilers. This paper shows how to optimize sparse tensor algebraic expressions by introducing temporary tensors, called workspaces, into the resulting loop nests. We develop a new intermediate language for tensor operations called concrete index notation that extends tensor index notation. Concrete index notation expresses when and where sub-computations occur and what tensor they are stored into. We then describe the workspace optimization in this language, and how to compile it to sparse code by building on prior work in the literature.BibTex

@article{kjolstad:2019:workspaces, author = {Kjolstad, Fredrik and Ahrens, Peter and Kamil, Shoaib and Amarasinghe, Saman}, title = {Tensor Algebra Compilation with Workspaces}, booktitle = {Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization}, series = {CGO 2019}, year = {2019}, isbn = {978-1-7281-1436-1}, location = {Washington, DC, USA}, pages = {180--192}, numpages = {13}, url = {http://dl.acm.org/citation.cfm?id=3314872.3314894}, acmid = {3314894}, publisher = {IEEE Press}, address = {Piscataway, NJ, USA}, keywords = {code optimization, concrete index notation, sparse tensor algebra, temporaries, workspaces}, }

2018

##
**Format Abstraction for Sparse Tensor Algebra Compilers**
Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe (OOPSLA 2018)

Download Paper Download SlidesAbstract

This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts). We develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins. We then describe six implementations of the interface that compose to form the dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants thereof. With these implementations at hand, our code generator can generate code to compute any tensor algebra expression on any combination of the aforementioned formats.

To demonstrate our technique, we have implemented it in the taco tensor algebra compiler. Our modular code generator design makes it simple to add support for new tensor formats, and the performance of the generated code is competitive with hand-optimized implementations. Furthermore, by extending taco to support a wider range of formats specialized for different application and data characteristics, we can improve end-user application performance. For example, if input data is provided in the COO format, our technique allows computing a single matrix-vector multiplication directly with the data in COO, which is up to 3.6× faster than by first converting the data to CSR.BibTex

@article{chou:2018:formats, author = {Chou, Stephen and Kjolstad, Fredrik and Amarasinghe, Saman}, title = {Format Abstraction for Sparse Tensor Algebra Compilers}, journal = {Proc. ACM Program. Lang.}, issue_date = {November 2018}, volume = {2}, number = {OOPSLA}, month = oct, year = {2018}, issn = {2475-1421}, pages = {123:1--123:30}, articleno = {123}, numpages = {30}, url = {http://doi.acm.org/10.1145/3276493}, doi = {10.1145/3276493}, acmid = {3276493}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {modular code generation, sparse tensor algebra compilation, tensor formats}, }Video

##
**Unified Sparse Formats for Tensor Algebra Compilers**
Stephen Chou (SM Thesis)

Download ThesisAbstract

Tensor algebra is a powerful tool for computing on multidimensional data and has appli-cations in many fields. Practical applications often deal with tensors that are sparse, and there exists a wide variety of formats for storing such tensors, each suited to specific typesof applications and data. Examples of sparse tensor storage formats include COO, CSR, CSC, DCSR, BCSR, CSF, CSB, ELL, DIA, and hash maps.

In this thesis, we propose a levelized hierarchical abstraction that represents these seemingly disparate formats and countless others, and that hides the details of each format behind a common interface. We show that this tensor representation facilitates automatic generation of efficient compute kernels for tensor algebra expressions with any combination of formats. This is accomplished with a code generation algorithm that generates code level by level, guided by the capabilities and properties of the levels.

The performance of tensor algebra kernels generated using our technique is competitive with that of equivalent hand-implemented kernels in existing sparse linear and tensor algebra libraries. Furthermore, our technique can generate many more kernels for many more formats than exist in libraries or are supported by existing compiler techniques.BibTex

@MastersThesis{chou:2018:sm-thesis, title = "Unified Sparse Formats for Tensor Algebra Compilers", author = "Stephen Chou", month = "Feb", year = "2018", url = "http://tensor-compiler.org/files/chou-sm-thesis-taco-formats.pdf", type = "S.M. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }

2017

##
**The Tensor Algebra Compiler**
Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe (OOPSLA 2017)

Download Paper Download SlidesAbstract

Tensor algebra is a powerful tool with applications in machine learning, data analytics, engineering and the physical sciences. Tensors are often sparse and compound operations must frequently be computed in a single kernel for performance and to save memory. Programmers are left to write kernels for every operation of interest, with different mixes of dense and sparse tensors in different formats. The combinations are infinite, which makes it impossible to manually implement and optimize them all. This paper introduces the first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors. The technique is implemented in a C++ library called taco. Its performance is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.

BibTex

@article{kjolstad:2017:taco, author = {Kjolstad, Fredrik and Kamil, Shoaib and Chou, Stephen and Lugato, David and Amarasinghe, Saman}, title = {The Tensor Algebra Compiler}, journal = {Proc. ACM Program. Lang.}, issue_date = {October 2017}, volume = {1}, number = {OOPSLA}, month = oct, year = {2017}, issn = {2475-1421}, pages = {77:1--77:29}, articleno = {77}, numpages = {29}, url = {http://doi.acm.org/10.1145/3133901}, doi = {10.1145/3133901}, acmid = {3133901}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {code generation, iteration graphs, linear algebra, merge lattices, parallelism, performance, sparse data structures, tensor algebra, tensors} }Video

##
**taco: A Tool to Generate Tensor Algebra Kernels**
Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe (ASE 2017)

Download PaperAbstract

Tensor algebra is an important computational abstraction that is increasingly used in data analytics, machine learning, engineering, and the physical sciences. However, the number of tensor expressions is unbounded, which makes it hard to develop and optimize libraries. Furthermore, the tensors are often sparse (most components are zero), which means the code has to traverse compressed formats. To support programmers we have developed taco, a code generation tool that generates dense, sparse, and mixed kernels from tensor algebra expressions. This paper describes the taco web and command-line tools and discusses the benefits of a code generator over a traditional library. See also the demo video at tensor-compiler.org/ase2017.

BibTex

@inproceedings{kjolstad:2017:tools, author={Kjolstad, Fredrik and Chou, Stephen and Lugato, David and Kamil, Shoaib and Amarasinghe, Saman}, booktitle={2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)}, title={taco: A Tool to Generate Tensor Algebra Kernels}, year={2017}, pages={943-948}, keywords={data analysis;learning (artificial intelligence);mathematics computing;program compilers;software libraries;tensors;code generation tool;code generator;command-line tools;compressed formats;computational abstraction;data analytics;dense kernels;machine learning;mixed kernels;physical sciences;sparse kernels;taco web;tensor algebra expressions;tensor algebra kernels;tensor expressions;Indexes;Kernel;Libraries;Linear algebra;Tensile stress;Tools;Tensor algebra;compiler;linear algebra;sparse}, doi={10.1109/ASE.2017.8115709}, month={Oct} }Video

2016

##
**An Investigation of Sparse Tensor Formats for Tensor Libraries**
Parker Allen Tew (MEng Thesis)

Download ThesisAbstract

Tensors provide a generalized structure to store arbitrary indexable data, which is applicable in fields such as chemometrics, physics simulations, signal processing and lies at the heart of machine learning. Many naturally occurring tensors are considered sparse as they contain mostly zero values. As with sparse matrices, various techniques can be employed to more efficiently store and compute on these sparse tensors.

This work explores several sparse tensor formats while ultimately evaluating two implementations; one based on explicitly storing coordinates and one that compresses these coordinates. The two formats, Coordinate and CSF2, were evaluated by comparing their execution time of tensor-matrix products and the MTTKRP operation on several datasets. We find that the Coordinate format is superior for uniformly distributed sparse tensors or when used in computation that emits a sparse tensor via a mode dependent operation. In all other considered cases for large sparse tensors, the storage savings of the compressed format provide the best results.BibTex

@MastersThesis{parker:2016:meng-thesis, title = "An Investigation of Sparse Tensor Formats for Tensor Libraries", author = "Parker Allen Tew", month = "Jun", year = "2016", url = "http://tensor-compiler.org/files/tew-meng-thesis-sparse.pdf", type = "M.Eng. Thesis", address = "Cambridge, MA", school = "Massachusetts Institute of Technology", }