Multi-GPU training fails with RuntimeError: The server socket has failed to listen on any local network address.
Problem description
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
Solution
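Cause: one training job is already running on one GPU, and a second training job launched on another GPU tries to use the same port (29500 in the error above), so its server socket cannot bind.
Fix: change the port number used by the second job so that it differs from the port used by the first job.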
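One way to choose a non-conflicting port is to test it before starting the second job. The sketch below is only an illustration under assumptions: the helper names port_is_free, find_free_port and pick_master_port are hypothetical, and it assumes the training script reads the port from the MASTER_PORT environment variable, as PyTorch's env:// initialization does. If the job is started with a launcher such as torchrun, the port is normally passed to the launcher instead (for example with its --master_port option).

```python
import os
import socket


def port_is_free(port: int) -> bool:
    """Return True if we can bind to the port, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("", port))
        except OSError:  # e.g. errno 98: Address already in use
            return False
        return True


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def pick_master_port(default: int = 29500) -> int:
    """Keep the default port if it is free, otherwise pick another one."""
    return default if port_is_free(default) else find_free_port()


if __name__ == "__main__":
    port = pick_master_port()
    # Assumption: the training code reads MASTER_PORT (env:// init method).
    os.environ["MASTER_PORT"] = str(port)
    print(f"MASTER_PORT={port}")
```

The check is not atomic: another process could take the port between the test and the start of training, so treat this as a convenience rather than a guarantee.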
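Alternatively, give up on multi-GPU (distributed) training and run in DP mode instead, for example by making a single GPU visible and starting the script directly: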
CUDA_VISIBLE_DEVICES=4 python train.py
Done.