Does batch_size=128 during training refer to the global or single-GPU batch size, and is it trained using DeepSpeed Zero3?
#13 · opened by Hipanda
Hi, thank you for your awesome work. I have a question about the training batch size in the paper: does batch_size=128 refer to the global batch size or the per-GPU batch size?
Looking forward to your reply!
They used gradient checkpointing to conserve GPU memory, so they could set batch_size=128.
I would also like to ask how to enable gradient checkpointing in ms-swift. Looking forward to replies.
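For reference, here is a minimal sketch of how gradient checkpointing is typically enabled in an ms-swift fine-tuning run. The `--gradient_checkpointing` flag and the other option names shown are my assumptions based on the ms-swift CLI; please check the ms-swift documentation for your installed version, and the model path is a placeholder.

```shell
# Hypothetical ms-swift SFT invocation (flag names are assumptions; verify
# against `swift sft --help` for your ms-swift version).
swift sft \
    --model Qwen/Qwen2-7B-Instruct \        # placeholder model ID
    --dataset your_dataset.jsonl \          # placeholder dataset path
    --gradient_checkpointing true \         # trade compute for activation memory
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 8         # 16 x 8 x num_gpus = global batch
```

With gradient checkpointing, activations are recomputed during the backward pass instead of being stored, which lowers peak GPU memory at the cost of roughly one extra forward pass, making larger per-device batch sizes feasible.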