[Bug fix]: predict error on multi-gpu #1082
@kawa23 Thanks for the pull request. I'm reviewing it and trying to trace the issue. A quick question about the error you reported in #1044: How come your tensor shape is 24000? I'd expect it to be 2400 since your batch size is 4. Did you happen to change I acknowledge the error you got, but I suspect that the issue might be somewhere else.
Yes. When I changed the and the I found that the And I also found that it was here, #L819 ~ #L821, that throws the error when I traced it. Maybe I should test on Thanks for your reply, and I'd appreciate it if you could find and fix it.
@kawa23 Our company has a DGX-1, and I have tested your code on 1 to 8 GPUs. With 1 to 4 GPUs, everything worked fine. But with 5, 6, or 7 GPUs, it gave me the following error: And with 8 GPUs, it gave me another error: Any ideas?
@keineahnung2345 @kawa23 I recently tried to test Mask RCNN on multiple GPUs, but the speed did not improve. Could you post your code? Thank you.
@waleedka Hi, thank you for acknowledging the error. I'm having the same issue (2 GPUs, 2 images/GPU, no config changes, so I get 2400 as you stated here, instead of 24000). What's the proper fix for this?
I get a similar problem to the one above:
Aashish-Gautam left a comment
It fixed my issue. Thanks for the help.
Not sure why this change has not already been made in model.py.
Predict problem on multi-GPU (Input to reshape is a tensor with 24000 values, but the requested shape has 48000)
I think this issue (see also #1044) is caused by the DetectionLayer output reshape (https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L820) being called before the parallel_model merge in the detection func (https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L2043).
So I think the DetectionLayer output reshape is wrong and should be changed as in this pull request. It worked for me on multi-GPU (two GPUs).
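
To make the size mismatch concrete, here is a minimal sketch in plain Python. It is not the actual model.py code. It assumes each detection row holds 6 values and that the reshape runs inside each GPU tower, before the towers are merged. The config values below (2 GPUs, 4 images per GPU, 1000 max detections) are hypothetical, picked only because they reproduce the 24000 and 48000 from the error message; the reporter's real settings are not shown in this thread.

```python
from math import prod


def can_reshape(num_values, target_shape):
    """Mimic a reshape size check: total element counts must match."""
    return num_values == prod(target_shape)


# Hypothetical configuration, chosen to reproduce the numbers in the error.
GPU_COUNT = 2
IMAGES_PER_GPU = 4
DETECTION_MAX_INSTANCES = 1000
VALUES_PER_DETECTION = 6  # assumed: box coordinates, class id, score
BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU  # 8 images across all GPUs

# Inside one GPU tower, only that tower's slice of the batch is present.
values_per_tower = IMAGES_PER_GPU * DETECTION_MAX_INSTANCES * VALUES_PER_DETECTION

# Reshaping to the full batch size asks for more values than the tower has.
full_batch_shape = (BATCH_SIZE, DETECTION_MAX_INSTANCES, VALUES_PER_DETECTION)

# Reshaping to the per-tower batch size matches what the tower holds.
per_tower_shape = (IMAGES_PER_GPU, DETECTION_MAX_INSTANCES, VALUES_PER_DETECTION)

print(values_per_tower, prod(full_batch_shape))         # tower has vs. requested
print(can_reshape(values_per_tower, full_batch_shape))  # full-batch reshape
print(can_reshape(values_per_tower, per_tower_shape))   # per-tower reshape
```

Under these assumptions the mismatch only appears with more than one GPU, since with a single GPU the per-tower batch and the full batch are the same size.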