Behavior difference between GptOssExperts and Mxfp4GptOssExperts

#77
by DaleMeng - opened

According to the code at https://github.com/huggingface/transformers/blob/v4.55.0/src/transformers/models/gpt_oss/modeling_gpt_oss.py#L62, in `GptOssExperts.forward`:
there are some special operations compared to a standard SwiGLU MLP layer: the gate and up projections are clamped, and 1 is added to the up value before the element-wise multiplication with the GLU output.
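To make the special operations concrete, here is a minimal scalar sketch of that activation as I read it from the linked code. The `alpha` and `limit` defaults are assumptions for illustration, not values I have confirmed from the actual gpt-oss config:

```python
import math

def gpt_oss_glu(gate: float, up: float, alpha: float = 1.702, limit: float = 7.0) -> float:
    """Scalar sketch of the clamped SwiGLU variant described above.

    alpha/limit defaults are assumptions, not confirmed config values.
    """
    gate = min(gate, limit)                # gate is clamped from above
    up = max(-limit, min(up, limit))       # up is clamped on both sides
    glu = gate * (1.0 / (1.0 + math.exp(-alpha * gate)))  # sigmoid-gated linear unit
    return (up + 1.0) * glu                # the "up + 1" twist vs. plain SwiGLU
```

So both the clamping and the `up + 1` term change the result relative to a plain SwiGLU, which is why I wanted to check whether the quantized path does the same thing.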
At first I set some breakpoints to check how these special operations behave, but I found that during the forward pass it is actually the class Mxfp4GptOssExperts that runs.
https://github.com/huggingface/transformers/blob/v4.55.0/src/transformers/integrations/mxfp4.py#L179
According to the code above, it calls `triton_kernels.matmul_ogs`.
I am not familiar with Triton or the lower-level code, but I assume `matmul_ogs` is a generic operation for MoE computation without model-specific customization. So I just want to know whether the behavior based on `triton_kernels.matmul_ogs` here is the same as `GptOssExperts.forward`.
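For what it's worth, if the quantized path does pass the same clamped activation into the kernel as a fused epilogue (an assumption on my side; I have not read the Triton kernels), then fused and unfused orderings should be mathematically equivalent, ignoring mxfp4 quantization error. A toy scalar sketch, with all names hypothetical:

```python
import math

def clamped_swiglu(gate, up, alpha=1.702, limit=7.0):
    # same activation variant as described above (alpha/limit are assumed values)
    gate = min(gate, limit)
    up = max(-limit, min(up, limit))
    return (up + 1.0) * gate / (1.0 + math.exp(-alpha * gate))

def eager_path(x, w_gate, w_up):
    # matmul first, activation as a separate step (GptOssExperts-style)
    gate = sum(xi * wi for xi, wi in zip(x, w_gate))
    up = sum(xi * wi for xi, wi in zip(x, w_up))
    return clamped_swiglu(gate, up)

def fused_path(x, w_gate, w_up):
    # activation applied right after the accumulation loop,
    # the way a fused kernel epilogue would
    gate = up = 0.0
    for xi, wg, wu in zip(x, w_gate, w_up):
        gate += xi * wg
        up += xi * wu
    return clamped_swiglu(gate, up)
```

Both paths give the same number on the same weights, so the real question is only whether the kernel actually applies this activation at all.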
