Does batch_size=128 during training refer to the global or per-GPU batch size, and was training done with DeepSpeed ZeRO-3?

#13
by Hipanda - opened

Hi, thank you for your awesome work. I have a question about the training batch size in the paper: does batch_size=128 refer to the global batch size or the per-GPU batch size?
Looking forward to your reply!
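For reference, the usual relationship between the two is: global batch size = per-device batch size × gradient accumulation steps × number of GPUs. A minimal sketch with illustrative numbers (these specific values are an assumption, not from the paper):

```python
# How a global batch size of 128 could be assembled from per-GPU settings.
# The values below (8 GPUs, per-device batch 8, accumulation 2) are
# hypothetical examples, not the paper's actual configuration.
num_gpus = 8
per_device_batch_size = 8
gradient_accumulation_steps = 2

global_batch_size = num_gpus * per_device_batch_size * gradient_accumulation_steps
print(global_batch_size)  # 128
```

So the same reported batch_size=128 can correspond to many different per-GPU settings, which is why it matters whether the paper reports the global or the single-GPU number.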

They used gradient checkpointing to conserve GPU memory, so they could set batch_size=128.
I'd also like to ask how to enable gradient checkpointing in ms-swift. Looking forward to replies.
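In ms-swift, gradient checkpointing is typically controlled by a training argument on the `swift sft` command line. A hedged sketch (the exact flag names should be checked against your ms-swift version's documentation, and the model id, dataset name, and batch sizes below are purely illustrative):

```shell
# Sketch: fine-tuning with gradient checkpointing enabled in ms-swift.
# Assumptions: flag names follow ms-swift's training arguments;
# model/dataset ids and sizes are hypothetical placeholders.
swift sft \
  --model_type qwen2-7b-instruct \
  --dataset my-dataset \
  --gradient_checkpointing true \
  --batch_size 8 \
  --gradient_accumulation_steps 2 \
  --deepspeed default-zero3
```

Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them, which is what makes larger effective batch sizes fit on a single GPU.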