Algorithm 1026: Concurrent Alternating Least Squares for Multiple Simultaneous Canonical Polyadic Decompositions

Publikation: Bidrag til tidsskriftTidsskriftartikelForskningfagfællebedømt

Standard

Algorithm 1026 : Concurrent Alternating Least Squares for Multiple Simultaneous Canonical Polyadic Decompositions. / Psarras, Christos; Karlsson, Lars; Bro, Rasmus; Bientinesi, Paolo.

I: ACM Transactions on Mathematical Software, Bind 48, Nr. 3, 34, 2022.

Publikation: Bidrag til tidsskriftTidsskriftartikelForskningfagfællebedømt

Harvard

Psarras, C, Karlsson, L, Bro, R & Bientinesi, P 2022, 'Algorithm 1026: Concurrent Alternating Least Squares for Multiple Simultaneous Canonical Polyadic Decompositions', ACM Transactions on Mathematical Software, bind 48, nr. 3, 34. https://doi.org/10.1145/3519383

APA

Psarras, C., Karlsson, L., Bro, R., & Bientinesi, P. (2022). Algorithm 1026: Concurrent Alternating Least Squares for Multiple Simultaneous Canonical Polyadic Decompositions. ACM Transactions on Mathematical Software, 48(3), [34]. https://doi.org/10.1145/3519383

Vancouver

Psarras C, Karlsson L, Bro R, Bientinesi P. Algorithm 1026: Concurrent Alternating Least Squares for Multiple Simultaneous Canonical Polyadic Decompositions. ACM Transactions on Mathematical Software. 2022;48(3). 34. https://doi.org/10.1145/3519383

Author

Psarras, Christos ; Karlsson, Lars ; Bro, Rasmus ; Bientinesi, Paolo. / Algorithm 1026 : Concurrent Alternating Least Squares for Multiple Simultaneous Canonical Polyadic Decompositions. I: ACM Transactions on Mathematical Software. 2022 ; Bind 48, Nr. 3.

Bibtex

@article{52ce4451ad7248818674c234d7f5faca,
title = "Algorithm 1026: Concurrent Alternating Least Squares for Multiple Simultaneous Canonical Polyadic Decompositions",
abstract = "Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this article, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time, the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to MATLAB, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity. ",
keywords = "CP, decomposition, high-performance, PARAFAC, Tensor",
author = "Christos Psarras and Lars Karlsson and Rasmus Bro and Paolo Bientinesi",
note = "Funding Information: This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—333849990/ GRK2379 (IRTG Modern Inverse Problems). Publisher Copyright: {\textcopyright} 2022 Association for Computing Machinery.",
year = "2022",
doi = "10.1145/3519383",
language = "English",
volume = "48",
journal = "ACM Transactions on Mathematical Software",
issn = "0098-3500",
publisher = "Association for Computing Machinery, Inc.",
number = "3",

}

RIS

TY - JOUR

T1 - Algorithm 1026

T2 - Concurrent Alternating Least Squares for Multiple Simultaneous Canonical Polyadic Decompositions

AU - Psarras, Christos

AU - Karlsson, Lars

AU - Bro, Rasmus

AU - Bientinesi, Paolo

N1 - Funding Information: This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—333849990/ GRK2379 (IRTG Modern Inverse Problems). Publisher Copyright: © 2022 Association for Computing Machinery.

PY - 2022

Y1 - 2022

N2 - Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this article, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time, the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to MATLAB, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.

AB - Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this article, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time, the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to MATLAB, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.

KW - CP

KW - decomposition

KW - high-performance

KW - PARAFAC

KW - Tensor

U2 - 10.1145/3519383

DO - 10.1145/3519383

M3 - Journal article

AN - SCOPUS:85140090648

VL - 48

JO - ACM Transactions on Mathematical Software

JF - ACM Transactions on Mathematical Software

SN - 0098-3500

IS - 3

M1 - 34

ER -

ID: 327671834