Loop fission and fusion

In computer science, loop fission (or loop distribution) is a compiler optimization in which a loop is broken into multiple loops over the same index range, each taking only a part of the original loop's body.[1][2] The goal is to break a large loop body into smaller ones to achieve better utilization of locality of reference. This optimization is most effective on multi-core processors that can split a task into multiple tasks for each processor.

Conversely, loop fusion (or loop jamming) is a compiler optimization and loop transformation which replaces multiple loops with a single one.[3][2] Loop fusion does not always improve run-time speed. On some architectures, two loops may actually perform better than one loop because, for example, there is increased data locality within each loop. One of the main benefits of loop fusion is that it allows temporary allocations to be avoided, which can lead to large performance gains in numerical computing languages such as Julia when doing elementwise operations on arrays (however, Julia's loop fusion is not technically a compiler optimization, but a syntactic guarantee of the language).[4]

Other benefits of loop fusion are that it avoids the overhead of the loop control structures, and that it allows the loop body to be parallelized by the processor[5] by taking advantage of instruction-level parallelism. This is possible when there are no data dependencies between the bodies of the two loops (in contrast to the other main benefit of loop fusion described above, which only presents itself when there are data dependencies that require an intermediate allocation to store the results). If loop fusion is able to remove redundant allocations, performance increases can be large.[4] Otherwise, there is a more complex trade-off between data locality, instruction-level parallelism, and loop overhead (branching, incrementing, etc.) that may make loop fusion, loop fission, or neither the preferable optimization.
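As a minimal sketch of the allocation-removal case (the array names, sizes, and arithmetic here are illustrative, not taken from any particular source), fusing two loops that communicate through a temporary array lets the intermediate value live in a scalar instead of in memory:

float a[1000], b[1000];

// Before fusion: the intermediate results are written to a full temporary
// array t, which costs memory traffic (and possibly an allocation).
float t[1000];
for (int i = 0; i < 1000; i++) t[i] = a[i] + 1.0f;
for (int i = 0; i < 1000; i++) b[i] = t[i] * 2.0f;

// After fusion: the intermediate value can be kept in a scalar, which the
// compiler will typically place in a register, so the array t is no longer
// needed at all.
for (int i = 0; i < 1000; i++) {
    float tmp = a[i] + 1.0f;
    b[i] = tmp * 2.0f;
}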

Fission

Example in C

int i, a[100], b[100];
for (i = 0; i < 100; i++) {
    a[i] = 1;
    b[i] = 2;
}

is equivalent to:

int i, a[100], b[100];
for (i = 0; i < 100; i++) {
    a[i] = 1;
}
for (i = 0; i < 100; i++) {
    b[i] = 2;
}
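As noted above, one benefit of fission on multi-core processors is that the resulting loops can run concurrently. A minimal sketch of this, using C++ threads (std::thread is used purely for illustration and is not part of the original C example):

#include <thread>

int a[100], b[100];

int main() {
    // The two fissioned loops write to disjoint arrays, so they can run on
    // separate cores with no synchronization between them.
    std::thread t1([] { for (int i = 0; i < 100; i++) a[i] = 1; });
    std::thread t2([] { for (int i = 0; i < 100; i++) b[i] = 2; });
    t1.join();
    t2.join();
    return 0;
}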

Fusion

Example in C++ and MATLAB

Consider the following MATLAB code:

x = 0:999;      % Create an array of numbers from 0 to 999 (range is inclusive)
y = sin(x) + 4; % Take the sine of x (element-wise) and add 4 to each element

The same syntax can be achieved in C++ by using function and operator overloading:

#include <cmath>
#include <cassert>
#include <memory>
#include <iostream>

class Array {
    size_t length;
    std::unique_ptr<float[]> data;

    // Internal constructor that produces an uninitialized array
    Array(size_t n) : length(n), data(new float[n]) {}

public:
    // Factory method to produce an array over an integer range (the upper
    // bound is exclusive, unlike MATLAB's ranges).
    static Array Range(size_t start, size_t end) {
        assert(end > start);
        size_t length = end - start;
        Array a(length);
        for (size_t i = 0; i < length; ++i) {
            a[i] = start + i;
        }
        return a;
    }

    // Basic array operations
    size_t size() const { return length; }
    float &operator[](size_t i) { return data[i]; }
    const float &operator[](size_t i) const { return data[i]; }

    // Declare an overloaded addition operator as a free friend function (this
    // syntax defines operator+ as a free function that is a friend of this
    // class, despite it appearing as a member function declaration).
    friend Array operator+(const Array &a, float b) {
        Array c(a.size());
        for (size_t i = 0; i < a.size(); ++i) {
            c[i] = a[i] + b;
        }
        return c;
    }

    // Similarly, we can define an overload for the sin() function. In practice,
    // it would be unwieldy to define all possible overloaded math operations as
    // friends inside the class like this, but this is just an example.
    friend Array sin(const Array &a) {
        Array b(a.size());
        for (size_t i = 0; i < a.size(); ++i) {
            b[i] = std::sin(a[i]);
        }
        return b;
    }
};

int main(int argc, char *argv[]) {
    // Here, we perform the same computation as the MATLAB example
    auto x = Array::Range(0, 1000);
    auto y = sin(x) + 4;

    // Print the result out - just to make sure the optimizer doesn't remove
    // everything (if it's smart enough to do so).
    std::cout << "The result is:" << std::endl;
    for (size_t i = 0; i < y.size(); ++i) {
        std::cout << y[i] << std::endl;
    }

    return 0;
}

However, the above example unnecessarily allocates a temporary array for the result of sin(x). A more efficient implementation would allocate a single array for y, and compute y in a single loop. To optimize this, a C++ compiler would need to:

  1. Inline the sin and operator+ function calls.
  2. Fuse the loops into a single loop.
  3. Remove the unused stores into the temporary arrays (a register or stack variable can be used instead).
  4. Remove the unused allocation and free.

All of these steps are individually possible. Even step four is possible, despite the fact that functions like malloc and free have global side effects, since some compilers hardcode symbols such as malloc and free so that they can remove unused allocations from the code.[6] However, as of clang 12.0.0 and gcc 11.1, this loop fusion and redundant allocation removal does not occur, even at the highest optimization level.[7][8]
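For comparison, the fully fused code that such an optimization would ideally be equivalent to might look like the following sketch. It bypasses the Array abstraction entirely (std::vector is used here only for brevity): there is a single output allocation and a single loop in which the range generation, the sine, and the addition all take place.

#include <cmath>
#include <vector>

int main() {
    std::vector<float> y(1000);
    for (size_t i = 0; i < y.size(); ++i) {
        // Range generation, sin() and "+ 4" fused into one loop body; no
        // temporary array for the result of sin(x) is ever created.
        y[i] = std::sin(static_cast<float>(i)) + 4.0f;
    }
    return 0;
}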

Some languages specifically targeted at numerical computing, such as Julia, may have the concept of loop fusion built into the language at a high level, where the compiler notices adjacent elementwise operations and fuses them into a single loop.[9] Currently, to achieve the same syntax in general-purpose languages like C++, the sin and operator+ functions must pessimistically allocate arrays to store their results, since they do not know in what context they will be called. This issue can be avoided in C++ by using a different syntax that does not rely on the compiler to remove unnecessary temporary allocations (e.g., using functions and overloads for in-place operations, such as operator+= or std::transform).
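As a sketch of that alternative style, reusing the Array class from the example above (the explicit in-place loop below plays the role that an overloaded operator+= would, and is illustrative rather than part of the original example), only the array returned by sin() is allocated and the addition is then applied in place:

// Assuming the Array class defined above is in scope:
auto y = sin(Array::Range(0, 1000));  // one allocation, for the result of sin()
for (size_t i = 0; i < y.size(); ++i) {
    y[i] += 4.0f;                     // in-place update; no second temporary
}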

References

  1. Y. N. Srikant; Priti Shankar (3 October 2018). The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition. CRC Press. ISBN 978-1-4200-4383-9.
  2. Kennedy, Ken; Allen, Randy (2001). Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann. ISBN 1-55860-286-0.
  3. Steven Muchnick; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation. Morgan Kaufmann. ISBN 978-1-55860-320-2. "loop fusion".
  4. Johnson, Steven G. (21 January 2017). "More Dots: Syntactic Loop Fusion in Julia". julialang.org. Retrieved 25 June 2021.
  5. "Loop Fusion". Intel. Retrieved 25 June 2021.
  6. Godbolt, Matt. "Compiler Explorer - C++ (x86-64 clang 12.0.0)". godbolt.org. Retrieved 25 June 2021.
  7. Godbolt, Matt. "Compiler Explorer - C++ (x86-64 clang 12.0.0)". godbolt.org. Retrieved 25 June 2021.
  8. Godbolt, Matt. "Compiler Explorer - C++ (x86-64 gcc 11.1)". godbolt.org. Retrieved 25 June 2021.
  9. "Functions · The Julia Language". docs.julialang.org. Retrieved 25 June 2021.