Using a SlidingWindowLayer cache causes a crash

#5
by mdabbah - opened

I tested the kernel with gpt-oss using a variable cache implementation: a SlidingWindowLayer cache for the sliding-window layers and a static cache for the full-attention layers.

With this setup, the current implementation crashes on the first generation step, at the first attention layer (which is a sliding-window attention layer).

Arguments passed to the attention interface:

query_states.shape: torch.Size([1, 64, 1, 64])
key_states.shape: torch.Size([1, 8, 128, 64])
value_states.shape: torch.Size([1, 8, 128, 64])

sliding_window: 128
kwargs: {'position_ids': tensor([[575]], device='cuda:0'), 'output_attentions': False, 'use_cache': True}

Attached screenshots: Screenshot 2025-08-13 at 18.30.51.png, Screenshot 2025-08-13 at 18.31.02.png

The same arguments work fine with the eager implementation.
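For reference, here is a minimal, self-contained sketch of the eager-style computation for exactly these shapes (plain grouped-query attention over the dumped tensors; the bfloat16 dtype, CUDA device, and random contents are assumptions, and this is not the kernel's code). It runs cleanly, which matches what I see with the eager path, while the kernel crashes on the same shapes:

```python
import torch
import torch.nn.functional as F

# Shapes taken from the argument dump above.
bsz, num_heads, num_kv_heads, head_dim = 1, 64, 8, 64
q_len, kv_len, sliding_window = 1, 128, 128

query_states = torch.randn(bsz, num_heads, q_len, head_dim, device="cuda", dtype=torch.bfloat16)
key_states = torch.randn(bsz, num_kv_heads, kv_len, head_dim, device="cuda", dtype=torch.bfloat16)
value_states = torch.randn(bsz, num_kv_heads, kv_len, head_dim, device="cuda", dtype=torch.bfloat16)

# Expand the 8 KV heads to the 64 query heads (GQA), as the eager path does.
n_rep = num_heads // num_kv_heads
key_states = key_states.repeat_interleave(n_rep, dim=1)
value_states = value_states.repeat_interleave(n_rep, dim=1)

# Single decode step: the sliding-window cache holds exactly `sliding_window`
# key/value positions, so no extra mask is needed for this one query token.
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / head_dim**0.5
attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_output = torch.matmul(attn_weights, value_states)

print(attn_output.shape)  # torch.Size([1, 64, 1, 64])
```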
