Using a SlidingWindowLayer cache causes a crash
#5
by mdabbah - opened
I tested the kernel with gpt-oss using a variable cache implementation that uses a SlidingWindowLayer cache for the sliding-window layers and a static cache for the full-attention layers.
With this setup, the current implementation crashes on the first generation step, at the first attention layer (which is a sliding-attention layer).
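For context, here is a minimal, purely illustrative sketch (the class name is hypothetical, not the transformers or kernel implementation) of what a sliding-window layer cache hands to the attention kernel during decoding: the cached key/value length is capped at the window size even though the absolute position keeps growing, which matches the shapes reported below.

```python
import torch

class SlidingWindowLayerCacheSketch:
    """Hypothetical per-layer cache that keeps only the last `window` positions,
    mimicking what a sliding-window layer cache hands to the attention kernel."""

    def __init__(self, window: int):
        self.window = window
        self.keys = None   # [batch, kv_heads, cached_len, head_dim]
        self.values = None

    def update(self, k: torch.Tensor, v: torch.Tensor):
        if self.keys is None:
            self.keys, self.values = k, v
        else:
            self.keys = torch.cat([self.keys, k], dim=2)
            self.values = torch.cat([self.values, v], dim=2)
        # Keep at most `window` cached positions: the cached length is capped at 128
        # here even though the absolute position keeps growing (575 in this report).
        self.keys = self.keys[:, :, -self.window:]
        self.values = self.values[:, :, -self.window:]
        return self.keys, self.values

# Prefill with 575 tokens, then decode the token at position 575: the sliding
# layer exposes only the last 128 keys/values to the attention call.
cache = SlidingWindowLayerCacheSketch(window=128)
cache.update(torch.randn(1, 8, 575, 64), torch.randn(1, 8, 575, 64))   # prefill
k, v = cache.update(torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))  # decode step
print(k.shape)  # torch.Size([1, 8, 128, 64])
```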
Arguments sent to the attention interface:
query_states.shape: torch.Size([1, 64, 1, 64])
key_states.shape: torch.Size([1, 8, 128, 64])
value_states.shape: torch.Size([1, 8, 128, 64])
sliding_window: 128
kwargs: {'position_ids': tensor([[575]], device='cuda:0'), 'output_attentions': False, 'use_cache': True}
The same arguments work fine with the eager attention implementation.
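For comparison, a minimal eager-style grouped-query attention sketch (again illustrative, not the actual kernel or model code) consumes exactly the shapes from the dump above without issue, since it simply repeats the 8 KV heads to match the 64 query heads and never assumes the cached length equals the absolute position.

```python
import torch

def eager_gqa_attention(query, key, value, sliding_window=None):
    # Minimal eager attention with grouped-query head repetition, using the
    # shapes from the report: query [1, 64, 1, 64], key/value [1, 8, 128, 64].
    bsz, num_q_heads, q_len, head_dim = query.shape
    num_kv_heads = key.shape[1]
    n_rep = num_q_heads // num_kv_heads  # 64 // 8 = 8

    # Repeat KV heads so they line up with the query heads.
    key = key.repeat_interleave(n_rep, dim=1)
    value = value.repeat_interleave(n_rep, dim=1)

    attn = torch.matmul(query, key.transpose(2, 3)) / head_dim**0.5
    # A real implementation would also apply the causal / sliding-window mask here;
    # with q_len == 1 and all 128 cached slots inside the window, no masking is
    # needed for this particular decode step.
    attn = torch.softmax(attn, dim=-1)
    return torch.matmul(attn, value)

q = torch.randn(1, 64, 1, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = eager_gqa_attention(q, k, v, sliding_window=128)
print(out.shape)  # torch.Size([1, 64, 1, 64])
```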