[Solved] RuntimeError: No CUDA GPUs are available ERROR:torch.distributed.elastic.multiprocessing.api:fa

Problem description

        Today I ran into this problem: RuntimeError: No CUDA GPUs are available
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 466774) of binary: /home/visionx/anaconda3/envs/globetrotter/bin/python

        The full output is:

/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
No CUDA runtime is found, using CUDA_HOME=':/usr/local/cuda'
2024-04-19 13:10:41 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
Namespace(alpha_xm=False, augment_image=True, batch_size=128, checkpoint_dir='checkpoints', checkpoint_path='checkpoints/train_sigurdsson', checkpoint_path_load_from='checkpoints/train_sigurdsson', config_arch='config', config_data='all-lang_test-zh-en', dataset_info_path='dataset_info', dataset_path='dataset', debug=False, evaluate=False, fp16=True, image_size=224, lambda_lm_loss=0.0, lambda_orthogonality_loss=1.0, lambda_visual_loss=0.0, lambda_xlang_loss=0.0, lambda_xm_loss=1.0, language_split='training', learning_rate=0.001, local_rank=0, max_txt_seq_len=50, momentum_bn=0.1, name='train_sigurdsson', not_use_images=False, num_epochs=100, opt_level='O1', output_attentions=False, p_clobber_other_txt=0.0, p_mask=0.0, pretrained_cnn=False, print_freq=1, prob_predict_token=0.0, results_dir='results', results_path='results/train_sigurdsson', resume=True, resume_but_restart=False, resume_epoch=-1, resume_latest=False, resume_name='train_sigurdsson', runs_dir='runs', seed=0, sigurdsson=True, test=True, test_name='extract_features', test_options='val', tokenizer_path=None, tokenizer_type='huggingface', two_heads_modality=False, workers=20)
Traceback (most recent call last):
  File "main.py", line 262, in <module>
    main()
  File "main.py", line 152, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 466774) of binary: /home/visionx/anaconda3/envs/globetrotter/bin/python
Traceback (most recent call last):
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/visionx/anaconda3/envs/globetrotter/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-19_13:10:46
  host      : visionx
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 466774)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

        As you can see, the log is already quite complete. So what causes this error?
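        Before looking at the launch script, it can help to check what PyTorch itself sees in the current environment. The following is a minimal sketch, not part of the original project: the helper name cuda_report is hypothetical, and it only uses os.environ plus torch.version.cuda, torch.cuda.is_available() and torch.cuda.device_count(). If device_count comes back as 0 while the machine does have GPUs, the CUDA_VISIBLE_DEVICES value is worth a closer look.

```python
import os


def cuda_report():
    """Collect basic facts about the CUDA environment as PyTorch sees it.

    Hypothetical helper for illustration; it is not part of main.py.
    """
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        # Without PyTorch we can only report the environment variable.
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

        Running it with the same CUDA_VISIBLE_DEVICES prefix as the training command shows the environment that the training process would actually get.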

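Cause analysis and solution

        Let's first see whether anyone online has reported a similar error:

        After reading the above I was confused. What is going on here? It doesn't really seem to match my case either!

        Looking at it more carefully: the message says there is no GPU, so could it be that I started too many processes? Let's look at the bash file: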

CUDA_VISIBLE_DEVICES=5 NCCL_LL_THRESHOLD=4 python \
-W ignore \
-i \
-m torch.distributed.launch \
--master_port=9997 \
--nproc_per_node=1 \
main.py \

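        Sure enough: my machine has only 2 GPUs (indices 0 and 1), but CUDA_VISIBLE_DEVICES is set to 5 here, an index that does not exist, so no GPU can be found. Changing it to valid indices fixes the problem: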

CUDA_VISIBLE_DEVICES=0,1 NCCL_LL_THRESHOLD=4 python \
-W ignore \
-i \
-m torch.distributed.launch \
--master_port=9997 \
--nproc_per_node=1 \
main.py \

        Success!
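        To make the failure mode concrete, here is a small pure-Python model of how CUDA_VISIBLE_DEVICES is interpreted. It is a simplified sketch, assuming plain integer indices only (GPU UUIDs are not handled), and the function name visible_devices is made up for illustration. It follows the behaviour documented for CUDA: when an index in the list is invalid, only the devices listed before it remain visible.

```python
def visible_devices(env_value, physical_count):
    """Model which physical GPU indices stay visible to a CUDA program.

    env_value: the CUDA_VISIBLE_DEVICES string, or None if it is unset.
    physical_count: number of GPUs installed in the machine.
    Simplified: integer indices only, no UUID support.
    """
    if env_value is None:
        # Variable unset: every installed GPU is visible.
        return list(range(physical_count))

    visible = []
    for token in env_value.split(","):
        token = token.strip()
        # The first invalid entry hides itself and everything after it.
        if not token.isdigit() or int(token) >= physical_count:
            break
        visible.append(int(token))
    return visible


print(visible_devices("5", 2))    # [] : the failing setting on a 2-GPU machine
print(visible_devices("0,1", 2))  # [0, 1] : the corrected setting
```

        With the original value 5 on a 2-GPU machine the visible list is empty, which matches the "No CUDA GPUs are available" error raised at torch.cuda.set_device.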

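        Interestingly, as you can see, I am using distributed training here. So when training on a single machine with multiple GPUs, how many processes can run on one GPU? In other words, what is the maximum value of nproc_per_node? See my blog post on this topic for details.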
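        The sketch below does not answer how many processes one GPU can host. It only illustrates one constraint that follows from the traceback above: main.py calls torch.cuda.set_device(args.local_rank), so with that pattern every local rank needs a visible device with the same index. The function name ranks_without_device is hypothetical, and the check is pure Python so it can be run without a GPU.

```python
def ranks_without_device(nproc_per_node, visible_count):
    """Return the local ranks that would have no matching visible GPU.

    Assumes each process selects its GPU with
    torch.cuda.set_device(local_rank), as main.py does in the traceback.
    """
    return [rank for rank in range(nproc_per_node) if rank >= visible_count]


# CUDA_VISIBLE_DEVICES=0,1 with --nproc_per_node=1: rank 0 uses visible GPU 0.
print(ranks_without_device(1, 2))  # []

# CUDA_VISIBLE_DEVICES=5 on a 2-GPU machine leaves 0 visible GPUs.
print(ranks_without_device(1, 0))  # [0]
```

        Under this pattern, note that the corrected command makes two GPUs visible but launches a single process, so only the first visible GPU is selected by set_device.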

Related links

Selecting which GPU to use: CUDA_VISIBLE_DEVICES (CSDN blog). When a server has multiple GPUs, prefix the run command with CUDA_VISIBLE_DEVICES=0 to run the program on a specific GPU. The value is the GPU index on the server (0, 1, 2, 3, and so on) and determines which GPUs are visible to the program. For example, CUDA_VISIBLE_DEVICES=1 makes only GPU 1 visible, and gpu[0] in the code then refers to that GPU.
https://blog.csdn.net/lscelory/article/details/83579062

Fixing the "RuntimeError: No CUDA GPUs are available" error (Baidu Developer Center). When using PyTorch or another deep learning framework, this error usually means that the program failed to detect a CUDA-compatible GPU. The article covers possible causes and solutions.
https://developer.baidu.com/article/details/3238503

