matmul operator @ can freeze / hang when used with default python multiprocessing using fork context instead of spawn #15973
Comments
@bicici can you give more details? I.e. I am not sure what you mean by "openblas ovecome this". Are you using an Intel patched version of NumPy or plain NumPy?
When numpy uses openblas, as noted in joblib/joblib#138, it does not freeze. When the same numpy source (numpy 1.18.x), compiled from source without error with the same settings, picks up the Intel MKL libraries instead, it freezes. For comparison, the numpy and scipy packages shipped on Linux already come packed with openblas: several python3 versions, combined with different numpy / scipy versions (stable / testing, installed from the Debian repositories at deb.debian.org/debian), started freezing around 23 March 2020, forcing either single-processor computation or risky computation. The freeze happened at matrix sizes between 1000 x 50 and 5000 x 100, around 5000 x 50. An initial experiment of mine related to the parallel-programming freeze was timestamped by IPython at Tue Mar 24 02:51:02 2020. A task that takes about 0.05 seconds to finish in serial mode freezes in parallel mode, which suggests it is related to joblib and Python multiprocessing. Investigating further with scikit-learn's sklearn.neural_network.MLPRegressor, I traced the freezing part to an a @ b expression, which appears unnecessary there compared with using a.dot(b) in that part of the code. Note: the python test_buffer.py freeze is mentioned here: The concurrent findings of ...
may require white papers on the topics. matmul's a @ b operator and python's float(3e400) expression both freeze in the contexts mentioned. Thank you for asking. I see new issues popping up as of today, which might be related:
@bicici the issue you link in joblib notes clearly that this is an MKL bug/issue. Do you think there is anything to do with NumPy at all? Is your intention to ping MKL developers here, or is there something you expect from NumPy? I am seriously asking.
My trail currently leads me to the a @ b operator, documented and used by numpy after its introduction in Python. I blame the a @ b matmul operator for the freeze. Sparse multiplications can take significantly longer than dense ones, but I find that a less likely explanation in this case. joblib's MKL issue might be linked to the same @ operator. The a @ b operator might be critical, and numpy's behavior with MKL can be compared against openblas. My findings point to an enlarging balloon, which you might have experience with, and it looks better to needle it early (touch it with a needle :) ). I only opened a ticket about the issue; the issue remains serious. I expect gains from informing both MKL and numpy. Thank you for taking action. Parallel matmul calls are failing in the test program. Maybe numpy matrix multiplication is not supposed to be run in parallel; if so, this is deeper than the Python global interpreter lock (GIL). I checked again, and numpy.dot calls also froze in the test program when using X.dot(X.T) instead of X @ X.T.
The default Python on Linux also freezes on the following code: the stable 3.7 version of Python is currently freezing, and it should not be considered stable even though its tests pass during installation. I use tbb threads and follow the approach in dask/dask#3759 of using the 'spawn' context in multiprocessing, which seems to fix the hang.
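A minimal sketch of that 'spawn' workaround from dask/dask#3759 (the `square` helper and the sizes are illustrative, not from the original report): start worker processes with the 'spawn' start method instead of the Linux default 'fork', so children do not inherit a forked copy of the BLAS library's internal thread-pool state.

```python
# Sketch: run matmul workers under the 'spawn' start method
# instead of the default 'fork' on Linux.
import concurrent.futures
import multiprocessing

import numpy as np


def square(X):
    # Same operation the test program below exercises.
    return X @ X.T


if __name__ == '__main__':
    ctx = multiprocessing.get_context('spawn')
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=2, mp_context=ctx) as ex:
        X = np.random.randn(100, 10)
        futures = [ex.submit(square, X) for _ in range(2)]
        shapes = [f.result().shape for f in futures]
    print(shapes)  # each result is 100 x 100
```

The `mp_context` parameter of `ProcessPoolExecutor` is available since Python 3.7; 'spawn' is slower to start workers than 'fork' but avoids inheriting library state.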
The freeze / hang can happen with large matrices and in parallel settings. For instance, sklearn/neural_network/_multilayer_perceptron.py uses safe_sparse_dot, which evaluates ret = a @ b with the matmul operator @. Affected package:
```python
# sklearn/neural_network/_multilayer_perceptron.py
from ..utils.extmath import safe_sparse_dot

# sklearn/utils/extmath.py, inside safe_sparse_dot
ret = a @ b
```
Python also freezes on expressions like exp(3e400) for float('inf') when built with -Ofast, in test_buffer.py, and such freezes may be related to these operators together with -Ofast in cpython. Therefore, compiling with fewer optimization flags might also overcome the issue and prevent @ from freezing the program. The freeze occurs when matrices are larger than about 5000 x 100.
MKL inteloneapi 2021.1-beta05 freezes; openblas does not.
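To confirm which backend a given NumPy build actually linked, `numpy.show_config()` prints the detected BLAS/LAPACK libraries. A small check along those lines (the string matching is only a heuristic, not from the original report):

```python
# Capture numpy.show_config() output and look for the backend name,
# to distinguish an MKL build from an openblas build.
import contextlib
import io

import numpy as np

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()
info = buf.getvalue().lower()
print('mkl' if 'mkl' in info else
      'openblas' if 'openblas' in info else 'other')
```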
Test program:
```python
import concurrent.futures

from numpy import random, matmul


def mmtest(X, i):
    print('matmul @ call', i)
    y_hat = X @ X.T
    print('done', i)
    return y_hat


def mmtest_matmul(X, i):
    print('matmul func call', i)
    y_hat = matmul(X, X.T)
    print('done', i)
    return y_hat


def f_mpmm(X):
    executor = concurrent.futures.ProcessPoolExecutor(7)
    futures = []
    futures.append(executor.submit(mmtest, X, 0))
    futures.append(executor.submit(mmtest, X, 1))
    futures.append(executor.submit(mmtest, X, 2))
    futures.append(executor.submit(mmtest, X, 3))
    futures.append(executor.submit(mmtest_matmul, X, 4))
    futures.append(executor.submit(mmtest_matmul, X, 5))
    futures.append(executor.submit(mmtest_matmul, X, 6))
    concurrent.futures.wait(futures)
    executor.shutdown()


def f_mm(X):
    mmtest(X, 0)
    mmtest(X, 1)
    mmtest(X, 2)
    mmtest(X, 3)
    mmtest_matmul(X, 4)
    mmtest_matmul(X, 5)
    mmtest_matmul(X, 6)


def test():
    X = random.randn(5000, 100)
    y = random.randn(5000)
    print('testing serial')
    f_mm(X)
    print('testing multiprocessing')
    f_mpmm(X)


if __name__ == '__main__':
    test()
```
Test output with numpy built with Intel MKL:
testing serial
matmul @ call 0
done 0
matmul @ call 1
done 1
matmul @ call 2
done 2
matmul @ call 3
done 3
matmul func call 4
done 4
matmul func call 5
done 5
matmul func call 6
done 6
testing multiprocessing
matmul @ call 0
matmul @ call 1
matmul @ call 2
matmul @ call 3
matmul func call 4
matmul func call 5
matmul func call 6
[frozen]
Test output with numpy built with openblas:
testing serial
matmul @ call 0
done 0
matmul @ call 1
done 1
matmul @ call 2
done 2
matmul @ call 3
done 3
matmul func call 4
done 4
matmul func call 5
done 5
matmul func call 6
done 6
testing multiprocessing
matmul @ call 0
matmul @ call 1
matmul @ call 2
matmul @ call 3
matmul func call 4
matmul func call 5
matmul func call 6
done 0
done 1
done 2
done 3
done 6
done 4
done 5
Related files:
sklearn/neural_network/_multilayer_perceptron.py
sklearn/utils/extmath.py
Related issues:
"parallel processes freezing when matrices are too big"
joblib/joblib#138
"matmul operator freeze within safe_sparse_dot and bug fix"
scikit-learn/scikit-learn#16919
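Besides switching to the 'spawn' context, another mitigation hinted at by the "forcing single processor computation" remark above is to cap the BLAS thread pools before NumPy is first imported. A hedged sketch (MKL_NUM_THREADS, OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are the conventional variable names; setting them here is my suggestion, not from the test program):

```python
# Sketch: force single-threaded BLAS so a forked child never has
# to re-enter a multi-threaded MKL / openblas thread pool. The
# variables must be set before numpy is first imported.
import os

for var in ('MKL_NUM_THREADS', 'OPENBLAS_NUM_THREADS',
            'OMP_NUM_THREADS'):
    os.environ.setdefault(var, '1')

import numpy as np  # imported only after the env vars are set

X = np.random.randn(1000, 50)
print((X @ X.T).shape)
```

This trades BLAS-level parallelism for process-level parallelism, which is usually the right trade when multiplications are farmed out to a process pool anyway.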