ENH: Parallelize CSR (sparse matrix) methods #19717
Conversation
Benchmarks

I'm attaching the ASV benchmark comparison of this PR against the main branch, using the bigger benchmarks I mentioned (6b78eac), running on an M1 MacBook Pro (6 + 2 cores). All are single repetition because I couldn't get ASV to do repeats, hence noisy (ignore the noisy ~1 ms runs). But the pattern is clear: the bigger benchmarks are significantly faster.

Random matrix multiply:
Sort elements:
Large arithmetic ops:
Raw output from compare-sparse.Arithmetic.txt
CI errors seem to be…
Thanks @alugowski, this looks very interesting! The first question that came to mind for me regarding benchmarks/performance is how linear the scaling is with the requested number of threads. E.g., use an array of size 64k x 32, and plot time taken by parallelized operations as a function of the number of threads. The closer to linear the scaling the better. Did you try anything like that already?
I can't reproduce it either, but from the CI log it looks to me like the solution only slightly exceeds what was requested. The CircleCI job shows a build error in…
Re the API for parallelism: your use of… I don't have immediate ideas for further operations; probably best to complete the work on this first (or at least until we have a strategy for getting this merged).
I'll write some code for a speedup benchmark.
That'll take some thought; there is no error, just "killed". My only guess is a compile-unit-size issue, which relates to:
I need to clarify: I agree further non-CSR ops would be worth discussing later. I meant that we may have to decide what to turn off in this PR because of library size issues, which is why I have those three groups:
I'm hoping someone has ideas on how to keep the library size from blowing up. My best guess is that having both the parallel and sequential code paths doubles the generated code size, but it's unclear why that would be, as both call the same lambdas. Perhaps the compiler inlines too much.
Speedup plots:

Of course the sparsity pattern matters a lot with any sparse matrix op. A Poisson 2D grid is often a difficult one; see the plot. Part of the reason for the worse speedup there is that the ops are in general significantly faster: the random matmul takes 9.8 s, while the Poisson 2D matmul takes only 0.13 s.

The binops are difficult to decide on. In those plots I re-enabled the parallel version of the "is_canonical" method; without that the speedup barely touches 1.5x at these sizes. That's still a nice speedup, but is it worth it? Note that parallel binops start to shine at much larger sizes. Hence the difficult tradeoff decision: is the extra complexity worth it for only large matrices? I think if only plus/minus were parallelized that might be ok, but I'd also be happy with sequential binops.

Attaching the script I used. Command for the first plot:
Yes, we're not ready to revisit OpenMP, I'm afraid. And yes, TBB & co are a bit too heavy for us.
A note about large thread counts p, like 32 or 64: those sizes can uncover some non-obvious sequencing. A big one is allocators; the standard library's allocator is thread safe, but it can become a hotspot at large thread counts. This PR creates as few splits as possible, so it should suffer the least. Unfortunately, any solutions (like arena allocators or other tricks) live in libraries like TBB.
I know I'm a broken record about library size, but even on the…
I've simplified the PR based on feedback:
The parallelism utilities are broken down as:
The PR looks big mostly because of the signature change in matmul, which propagates through several layers. The "changes" in BSR simply restore the old pre-parallel CSR matmul_maxnnz that was reused by BSR.
Thanks for adding a lot of detail here @alugowski. Looks like you're right about the scaling and memory bandwidth indeed.
These changes make sense I think, and should make it easier to get this merged, since the wins are clearer without much of a downside (beyond a bit more code complexity, of course). Looks like there's a crash in…
@alugowski, one other question: did you consider fork-safety of the threadpool? The pocketfft threadpool implementation handles this. To be more precise, this is needed because fork() clones only the calling thread, so a child process can inherit a pool whose worker threads no longer exist (and whose locks may still be held).
`exclusive_scan` was helpful to compute `Cp` in several places, but with no parallel binops that's no longer true.
Good question. There is no longer a thread pool, and no mutexes or other fork-unfriendly constructs. Since the worker threads are started per operation and joined before it returns, a fork can never observe them. The original PR had an optional thread pool, but it had to be manually enabled. All threading logic is in this one function:

scipy/scipy/sparse/sparsetools/util_par_algorithms.h, lines 51 to 88 in ff667f9
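For illustration, a condensed sketch of this spawn-and-join shape (not the exact `util_par_algorithms.h` code; the names here are made up) looks like:

```c++
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: split [first, last) into one contiguous chunk per
// thread, run `chunk` on each, and join before returning. Nothing (threads,
// locks, pools) outlives the call, so a later fork() never sees live workers.
template <class RandIt, class Chunk>
void parallel_chunk_for_sketch(unsigned num_threads, RandIt first, RandIt last,
                               Chunk chunk) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (num_threads <= 1 || n == 0) {
        chunk(first, last);                       // sequential path: no threads at all
        return;
    }
    const std::size_t per_thread = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    RandIt begin = first;
    while (begin != last) {
        const std::size_t len = std::min<std::size_t>(per_thread, last - begin);
        RandIt end = begin + static_cast<std::ptrdiff_t>(len);
        threads.emplace_back(chunk, begin, end);  // dedicated thread per chunk
        begin = end;
    }
    for (auto& t : threads) t.join();             // all workers gone before we return
}
```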
The last two test failures were true head-scratchers; apologies it took so long to figure them out. The MSVC failure was caused by my use of… The gcc8 workflow failure is even weirder. I spent a lot of time trying to figure out what's different about GCC 8, and why that one failing test didn't use any sparse methods at all. But the issue wasn't the compiler; it was the pytest fixture in… Now CircleCI is complaining about a missing file.
Merging main will fix this; the file was recently moved to…
If the C++ parallel implementation has a wait policy, it could end up stalling when switching between different parallel contexts: OpenMathLib/OpenBLAS#3187 and OpenMathLib/OpenBLAS#3187 (comment). In that issue, the example switches between OpenBLAS and OpenMP parallel contexts, and a threadpool's wait policy causes the program to slow down.
The linked comment says the issue is that OpenMP uses active waiting in its own thread pool, meaning OpenMP worker threads spin while waiting for work. This crowds out any other workload, and there's nothing to be done about it besides using OpenMP yourself. The parallel code in this PR does not actively wait, and neither did the thread pool that was originally in the PR.
```c++
 */
template <class ExecPolicy, class RandIt, class UnaryFunc>
util_par::internal::enable_if_threaded<ExecPolicy, void>
for_each(ExecPolicy &&policy, RandIt first, RandIt last, UnaryFunc f) {
```
You should just call this `util_par::for_each` to avoid confusion.
```c++
    util_par::internal::parallel_chunk_for(
            std::forward<ExecPolicy>(policy),
            first, last,
            std::for_each<RandIt, UnaryFunc>, f);
```
Taking pointers to standard library templates like this is not allowed; you should wrap it in a lambda.
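Something like this (a sketch reusing the names from the snippet above; `util_par::internal::parallel_chunk_for` and `enable_if_threaded` are assumed to exist as in the PR) would do it:

```c++
#include <algorithm>
#include <utility>

// Wrap std::for_each in a lambda: the lambda has its own addressable type,
// so no pointer to a standard library template is ever taken.
template <class ExecPolicy, class RandIt, class UnaryFunc>
util_par::internal::enable_if_threaded<ExecPolicy, void>
for_each(ExecPolicy &&policy, RandIt first, RandIt last, UnaryFunc f) {
    util_par::internal::parallel_chunk_for(
        std::forward<ExecPolicy>(policy),
        first, last,
        [](RandIt chunk_first, RandIt chunk_last, UnaryFunc func) {
            std::for_each(chunk_first, chunk_last, func);
        },
        f);
}
```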
```c++
    } else {
        threads.emplace_back(std::thread(std::forward<Chunk>(chunk),
                                         first, loop_end,
                                         std::forward<A>(chunk_args)...));
```
If you ever move from `chunk_args`, then only one thread will see a valid argument.
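A minimal, self-contained illustration of the hazard (hypothetical names, not the PR's code): if the pack is forwarded inside the spawn loop, an rvalue argument is moved into the first thread and the remaining threads see a moved-from value.

```c++
#include <string>
#include <thread>
#include <utility>
#include <vector>

template <class F, class... A>
void run_on_n_threads(int n, F f, A&&... args) {
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        // Buggy variant:  threads.emplace_back(f, std::forward<A>(args)...);
        //   -> an rvalue arg is moved on i == 0; later threads get a moved-from value.
        // Safer: pass args as lvalues so each std::thread makes its own copy.
        threads.emplace_back(f, args...);
    }
    for (auto& t : threads) t.join();
}

int main() {
    // Every thread should see "payload"; with the buggy variant only the first would.
    run_on_n_threads(4, [](const std::string& s) { (void)s; }, std::string("payload"));
}
```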
```c++
template <class RandIt, class Chunk, typename... A>
void parallel_chunk_for(util_par::threaded &&policy,
                        RandIt first, RandIt last,
                        Chunk chunk, A&&... chunk_args) {
```
If binary size is a concern, a standard technique would be to have a type-erased version which can execute either in serial or parallel. This prevents the implementation from being inlined into the serial case and means the `std::thread` code is never inlined anywhere.

e.g. something like

```c++
using loop_fn_t = void (void*, int64_t, int64_t); // context, begin, end
void parallel_for(int64_t num_threads, int64_t num_steps, void* context, loop_fn_t * loop_fn);
```
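Fleshing that out a little (an illustrative sketch, not the PR's code): only the non-template function contains the `std::thread` machinery, and the template front-end shrinks to a one-line trampoline, so the threading code is compiled exactly once no matter how many kernels use it.

```c++
#include <algorithm>
#include <cstdint>
#include <thread>
#include <type_traits>
#include <vector>

using loop_fn_t = void (void*, int64_t, int64_t);  // context, begin, end

// Type-erased core: the only place std::thread appears.
void parallel_for(int64_t num_threads, int64_t num_steps,
                  void* context, loop_fn_t* loop_fn) {
    if (num_threads <= 1 || num_steps == 0) {
        loop_fn(context, 0, num_steps);            // serial path
        return;
    }
    const int64_t per_thread = (num_steps + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (int64_t begin = 0; begin < num_steps; begin += per_thread) {
        const int64_t end = std::min(begin + per_thread, num_steps);
        threads.emplace_back(loop_fn, context, begin, end);
    }
    for (auto& t : threads) t.join();
}

// Thin template front-end: the per-instantiation cost is just this trampoline.
// (Non-const callables assumed in this sketch.)
template <class Body>
void parallel_for(int64_t num_threads, int64_t num_steps, Body&& body) {
    using B = typename std::remove_reference<Body>::type;
    auto trampoline = [](void* ctx, int64_t begin, int64_t end) {
        (*static_cast<B*>(ctx))(begin, end);
    };
    parallel_for(num_threads, num_steps, static_cast<void*>(&body), +trampoline);
}
```

A CSR kernel would then call the template form with a lambda capturing its state, e.g. `parallel_for(nw, n_row, [&](int64_t i0, int64_t i1) { /* rows i0..i1 */ });`.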
```c++
OutputIt inclusive_scan(InputIt first, InputIt last, OutputIt result)
{
#if (!defined(_GLIBCXX_RELEASE) || _GLIBCXX_RELEASE >= 9)
    return std::inclusive_scan(first, last, result);
```
Do we know what method the standard libraries use to implement the parallel algorithms and if they might introduce any issues with fork?
```c++
            std::vector<I>(n_col, -1),  // mask
            std::vector<T>(n_col, 0));  // sums
        }, [&](I i, auto& workspace){
            auto& [mask, sums] = workspace;
```
If you change the signature to `(I first, I last)` you could iterate over the whole chunk at once, avoiding the workspace issue and allowing smarter algorithms, such as computing the per-thread nnz as you go and summing the per-thread values at the end. This would avoid having to loop over `C_row_nnz` twice.
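A sketch of what that could look like (hypothetical names; `I` is the index type, `T` the value type, and `kernel` stands in for the per-row work):

```c++
#include <vector>

// Each worker gets a whole row range: the workspace is built once per chunk,
// and the chunk's nnz is accumulated on the fly, so C_row_nnz is written once
// and only the per-thread totals need summing at the end.
template <class I, class T, class RowKernel>
I process_row_chunk(I row_first, I row_last, I n_col,
                    std::vector<I>& C_row_nnz, RowKernel kernel) {
    std::vector<I> mask(n_col, static_cast<I>(-1));   // per-chunk workspace
    std::vector<T> sums(n_col, T(0));
    I chunk_nnz = 0;
    for (I i = row_first; i < row_last; ++i) {
        C_row_nnz[i] = kernel(i, mask, sums);          // per-row work
        chunk_nnz += C_row_nnz[i];                     // accumulate as we go
    }
    return chunk_nnz;   // summed across chunks once all threads have joined
}
```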
```c++
    int num_workers = 1;
    // int num_workers = std::min(16, std::thread::hardware_concurrency());
```
First, thanks for doing this! It'd be really great to have this.

But why set the default number of threads to 1 here? In other libraries (numpy (or at least, the BLAS backend), dask, numba, etc.) the default number of threads tends to be based on the local hardware.
Not quite. NumPy, SciPy, scikit-learn are all single-threaded by default, and SciPy/scikit-learn have keywords to switch to multiple threads. The only exception is BLAS/LAPACK calls, and that's because it's not possible to control at runtime in a reasonable manner (and exact threading depends on build settings and env vars). I've been in some discussions with pandas devs as well, where the majority opinion was that single-threaded is the correct default. If you control the whole stack with your own framework (e.g., PyTorch/TF/JAX), it's a much easier story, but that doesn't apply here.
Using all threads by default seems logical at first sight, but you find soon enough that it breaks things when there's higher-level parallelism (e.g., Dask, multiprocessing).
It'd be great to have better coordination on parallelism control, but it's tough. See https://thomasjpfan.github.io/parallelism-python-libraries-design/ for a good summary of where we are at and what the options are to improve.
Apologies for the late response!
NumPy, SciPy, scikit-learn are all single-threaded by default
I think this is true at the Python-code level, but in practice when I run something like scikit-learn's dense PCA I will be using many threads by default, since matrix multiplication is multithreaded by default with the PyPI and conda numpy distributions.
The only exception is BLAS/LAPACK calls, and that's because it's not possible to control at runtime in a reasonable manner
Scikit-learn, via joblib and threadpoolctl, does control the threading of BLAS/LAPACK calls to avoid oversubscription when its higher-level parallelism is used.
I've been in some discussions with pandas devs as well, where the majority opinion was that single-threaded is the correct default.
I would be curious how this interacts with their increasing use of Arrow via pyarrow, since pyarrow's compute and IO are multithreaded by default.
Using all threads by default seems logical at first sight, but you find soon enough that it breaks things when there's higher-level parallelism (e.g., Dask, multiprocessing).
If we are requiring some users to think more about parallelism, I think it's more reasonable for those users to be the ones using higher-level parallelism. I believe so much code is highly multithreaded by default in the scientific Python ecosystem (pyarrow, polars, many things using numba, numpy's dot product, etc.) that people who want to use higher-level parallelism already need to take steps to mitigate oversubscription.
I think one of the nice things this PR does is hook into some nice tooling for higher-level parallelism. E.g. if this code is called within a joblib Parallel job, it will automatically scale the number of threads used via threadpoolctl (thanks to @alugowski's upstream improvements!). Dask also exerts control over the number of BLAS threads each job has access to, though I believe it's a different mechanism.
User expectations

I think it would be unintuitive if `a @ b` was multithreaded for dense arrays and single-threaded for sparse arrays by default.
(Closes gh-20845)
Apologies for the wall of text; I want to provide context and reasoning for the changes made here.
Most CSR sparse matrix operations have an outer loop that iterates over the matrix rows. Often each iteration of this loop is, or can easily be made to be, independent. This PR parallelizes this loop.
The PR is fully implemented and working, but some product-level questions remain; see the tradeoffs section below. I'll update the PR to remove code for the rejected options.
Choice of parallelism framework

OpenMP is explicitly off the table as per the docs, so no `omp parallel for`. Common libraries like TBB or Taskflow seem too heavy for just this one application (correct me if I'm wrong). That leaves simple threads or a thread pool.

I chose to use the interface of the C++17 parallel algorithms to make the code future-proof and easy to understand for C++ programmers. I say interface because compiler support for those algorithms is not all there at the moment. Instead, I implemented the required algorithms on top of a thread pool using C++11 threads. This is fairly small and works on all platforms. It seemed useful enough to be its own product, so I released it as poolSTL.
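To make "the interface of the C++17 parallel algorithms" concrete, here is what a row-parallel CSR kernel looks like with the standard interface (shown with `std::execution::par`; the PR supplies the same `for_each` signature backed by its own policy and C++11 threads). The `scale_rows` operation itself is just an illustrative example, not a method from the PR:

```c++
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Illustrative row-parallel CSR kernel: scale every row by a per-row factor.
void scale_rows(const std::vector<int>& Ap, std::vector<double>& Ax,
                const std::vector<double>& row_scale) {
    std::vector<int> rows(Ap.size() - 1);
    std::iota(rows.begin(), rows.end(), 0);
    std::for_each(std::execution::par, rows.begin(), rows.end(),
                  [&](int i) {
                      // Row i owns the slice [Ap[i], Ap[i+1]); no two tasks touch
                      // the same elements, so the loop is embarrassingly parallel.
                      for (int jj = Ap[i]; jj < Ap[i + 1]; ++jj)
                          Ax[jj] *= row_scale[i];
                  });
}
```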
Parallel matmul
This is the only method that requires signature changes. The core SMMP algorithm has two embarrassingly parallel phases: phase 1 calculates the size of the result matrix, then phase 2 fills it in.
The signature change is to the phase 1 implementation, `csr_matmat_maxnnz()`. Previously this method only calculated the result's nnz. A parallel phase 2 requires a pre-computed indptr (`Cp`), so the change is to have `csr_matmat_maxnnz()` return the row sizes as it computes the nnz. A cumsum of the row sizes is `Cp`. This is done in Python to handle the (rare) case where the output matrix has enough non-zeros to overflow the input matrices' index type.

A small logic change is also required. The previous code would simply count the computed values, regardless of whether those values are zero or not; over-allocation is ok. The parallel code cannot do that, as phase 2 is not allowed to emit explicit zeros. Therefore, phase 1 must also compute the values and test them against zero to return an accurate nnz.
Unfortunately this is extra work. Quick benchmarks suggest the single-threaded code is about 20-30% slower, but this is easily made up for in parallel.
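The overall two-phase shape, independent of the actual SMMP row kernels, is roughly the following (an illustrative skeleton only; `row_nnz_kernel` and `fill_row_kernel` are placeholders, and in the PR the cumsum is done in Python so the index type can be upcast if the result's nnz would overflow):

```c++
#include <cstddef>
#include <numeric>
#include <vector>

template <class I, class RowNnz, class RowFill>
void two_phase_matmat(I n_row, RowNnz row_nnz_kernel, RowFill fill_row_kernel,
                      std::vector<I>& Cp) {
    // Phase 1 (parallel over rows in the PR): per-row nnz of the result.
    std::vector<I> row_nnz(n_row);
    for (I i = 0; i < n_row; ++i)
        row_nnz[i] = row_nnz_kernel(i);

    // Cumulative sum of the row sizes gives the result indptr Cp.
    Cp.assign(static_cast<std::size_t>(n_row) + 1, 0);
    std::partial_sum(row_nnz.begin(), row_nnz.end(), Cp.begin() + 1);

    // Phase 2 (parallel over rows in the PR): row i writes only into its own
    // slice [Cp[i], Cp[i+1]), so threads never contend.
    for (I i = 0; i < n_row; ++i)
        fill_row_kernel(i, Cp[i]);
}
```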
Parallel binop
I only parallelized the canonical version.
Binop faces the same problem as matmul: the size and sparsity pattern of the output is needed for parallelism, but it's unknown. The sequential code would allocate for the worst case, then just construct the result. The parallel code uses matmul's approach: phase 1 computes `Cp` in parallel, then phase 2 fills in the column indices and data in parallel. The sequential case remains the same single pass as before.

One weird observation I can't fully explain is that the sequential case now appears to be significantly faster than before, often around 2x. The only change I can identify is that the new code breaks some accidental inter-loop data dependencies, but that doesn't seem like it should be enough for the performance boost. Happy to be corrected.
Parallel other methods
The most obvious beneficiary is `csr_sort_indices`. It is used everywhere and parallelizes very well. I also added an optimization to remove a temporary copy for the sort.

Other embarrassingly parallel methods just replace their outer `for` loops with a parallel `std::for_each`. Some methods that check the format instead use `std::all_of`.
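For instance, a format check such as "are all rows' column indices sorted?" becomes a parallel `std::all_of` over row numbers (standard policy shown for illustration; the PR uses its own policy type, and the function name here is made up):

```c++
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

bool has_sorted_indices(const std::vector<int>& Ap, const std::vector<int>& Aj) {
    std::vector<int> rows(Ap.size() - 1);
    std::iota(rows.begin(), rows.end(), 0);
    return std::all_of(std::execution::par, rows.begin(), rows.end(),
                       [&](int i) {
                           // Row i is canonical iff its column indices are non-decreasing.
                           return std::is_sorted(Aj.begin() + Ap[i],
                                                 Aj.begin() + Ap[i + 1]);
                       });
}
```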
Parallelism control
Note: methods stay sequential unless the matrix is fairly large; the current definition of "large" is 64k non-zeros. This may seem overly large, but sequential ops are fast. pnumpy also uses this threshold.

Parallelism is controlled with threadpoolctl. The PR default is single-threaded; one must raise the threadpoolctl limit to enable parallelism.
Benchmarks
The standard scipy benchmarks are too small to really exercise this PR. On my machine few take longer than a millisecond. I have a branch with much bigger versions of the standard benchmarks: 6b78eac
If folks could try on their own benchmarks or their own workflows that would be very helpful.
Tests
I configured the tests to run twice: once sequentially and once with parallelism forced on even for tiny problems. Happy to change this if it's excessive.
Product level decisions and tradeoffs
Is the worker count API ok?
Is threadpoolctl enough?
Other parallel SciPy C++ codes (fft, optimize) use explicit `workers` arguments. That would be difficult here, as none of these methods are called directly by the user, and many are operators.

Any other (or additional) options?
Thread pool vs Dedicated threads
A common thread pool started on first use of a large op (Pros: threads are reused) vs starting and joining dedicated threads for each op (Pros: smaller library size, no idle worker threads left behind).
Controlled by the `USE_POOL` constant in `scipy/sparse/sparsetools/util_par.h`.
What ops to parallelize
Library size is the main downside of this PR. Note that all the methods in `csr.h` are templates with quite a few instantiations each. The dedicated threads create about one more symbol for each method, while the thread pool requires several more (due to wrappers, vtables, etc.). I've already more than halved the `.so` size from my first attempts; I would appreciate further ideas from folks with stronger `nm` skills than mine. Also, I understand that the final release build does some symbol stripping, so the following may be an exaggeration.

To further control library size this PR includes two more switches in `util_par.h`:

- `PARALLEL_BINOP`: whether to parallelize binops.
- `PARALLEL_OTHERS`: whether to parallelize methods other than matmul, sort_elements, and binop. Currently off.

It should also be possible to parallelize just some binops (like plus and minus) and save library size on the others.
For context, rough sizes of `_sparsetools.cpython-310-darwin.so` as built by `python dev.py build` on my machine:

- `USE_POOL=0` / `PARALLEL_BINOP=0`: 4.8 MB
- `USE_POOL=0` / `PARALLEL_BINOP=1`: 8.0 MB
- `USE_POOL=1` / `PARALLEL_BINOP=0`: 5.4 MB
- `USE_POOL=1` / `PARALLEL_BINOP=1`: 13 MB

If the choice is to go with dedicated threads (no pool), then we can rip just those methods from poolSTL and drop that file entirely. The pure threads functions are <100 lines total.