Asterisk: Hybrid ASPP-Attention Architecture

Asterisk is a research implementation that augments SmolLM2-135M with the ASPP (Adjacency-Structured Parallel Propagation) operator alongside standard attention. Each hybrid layer fuses graph-based local reasoning (ASPP) with global attention, aiming for improved expressiveness on structured reasoning tasks.

Model Description

  • Base Model: SmolLM2-135M-Instruct
  • Architecture: Hybrid ASPP-Attention (30 hybrid layers)
  • Parameters: 171.2M (35M additional ASPP parameters)
  • Training: Supervised Fine-Tuning on Capybara dataset
  • Framework: Transformers 4.57.6, TRL 0.27.0
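
A quick way to sanity-check these counts once the model is loaded (see Quick Start below). This snippet is not part of the released code, and the "aspp" name filter assumes ASPP submodules carry "aspp" in their parameter names:

# Assumes `model` has been loaded as in the Quick Start section.
total = sum(p.numel() for p in model.parameters())
aspp = sum(p.numel() for n, p in model.named_parameters() if "aspp" in n.lower())
print(f"Total: {total / 1e6:.1f}M parameters, ASPP-related: {aspp / 1e6:.1f}M")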

Evaluation Results

Evaluated with LM-Evaluation-Harness (preliminary results with sample limits; full evaluation pending):

Task            Metric     Score    Stderr
HellaSwag       acc_norm   0.4430   ±0.0157
ARC-Easy        acc_norm   0.5450   ±0.0158
ARC-Challenge   acc_norm   0.2884   ±0.0132
PIQA            acc_norm   0.6770   ±0.0148
WinoGrande      acc        0.5210   ±0.0158

Key Innovation: The Asterisk Operator (★-operator)

The Asterisk Operator performs local parallel state evolution through point-wise transformations:

h_i^(t+1) = LayerNorm(h_i^(t) + α · φ(h_i^(t)))  [K-step iterative evolution]

This is then gated and fused with standard Llama attention outputs:

output = gate * ASPP(x) + (1-gate) * Attention(x)

Architecture

1. ASPPOperator (Point-wise Parallel Propagation)

class ASPPOperator:
    """

    Forward pass:
    1. Optional dimensionality reduction: h_t = down_proj(hidden_states)
    2. K-step evolution: h_t = h_t + α * φ(h_t)  [K times]
    3. Layer normalization after each step
    4. Optional projection back: output = up_proj(h_t)

    Parameters:
    - hidden_size: 576 (model dimension)
    - aspp_hidden_dim: 256 (internal ASPP dimension)
    - aspp_num_steps: 8 (evolution iterations)
    - aspp_dropout: 0.2
    """

Pseudocode:

function ASPP(hidden_states):
    # Optional dimensionality reduction
    if use_projection:
        h_t ← down_proj(hidden_states)
        h_t ← dropout(h_t)
    else:
        h_t ← hidden_states

    # Learnable number of steps
    k_steps ← max(1, int(sigmoid(k_logit) * num_steps))

    # K-step point-wise evolution
    for t = 1 to k_steps:
        # Point-wise update: φ(h_t) = MLP(h_t)
        h_t_next ← update_net(h_t)

        # Scaled residual connection
        h_t ← h_t + residual_scale * h_t_next
        h_t ← layer_norm(h_t)

    # Project back to original dimension
    if use_projection:
        h_t ← up_proj(h_t)
        h_t ← dropout(h_t)

    return h_t
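
For concreteness, here is a minimal PyTorch sketch of the pseudocode above. The class and attribute names (ASPPOperatorSketch, update_net, k_logit, residual_scale) and the choice of a two-layer GELU MLP for φ are assumptions based on this card, not the released implementation.

import torch
import torch.nn as nn

class ASPPOperatorSketch(nn.Module):
    # Minimal sketch of the ASPP operator described above; layer layout is assumed.
    def __init__(self, hidden_size=576, aspp_hidden_dim=256, num_steps=8, dropout=0.2):
        super().__init__()
        self.num_steps = num_steps
        self.use_projection = aspp_hidden_dim != hidden_size
        if self.use_projection:
            self.down_proj = nn.Linear(hidden_size, aspp_hidden_dim)
            self.up_proj = nn.Linear(aspp_hidden_dim, hidden_size)
        dim = aspp_hidden_dim if self.use_projection else hidden_size
        # Point-wise update φ: a small MLP applied independently at each position
        self.update_net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.layer_norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)
        self.k_logit = nn.Parameter(torch.zeros(1))             # learnable step count
        self.residual_scale = nn.Parameter(torch.tensor(0.1))   # α in the update rule

    def forward(self, hidden_states):
        h = hidden_states
        if self.use_projection:
            h = self.dropout(self.down_proj(h))
        # Learnable number of steps, clamped to at least one iteration
        k_steps = max(1, int(torch.sigmoid(self.k_logit).item() * self.num_steps))
        for _ in range(k_steps):
            # Scaled residual update followed by layer normalization
            h = self.layer_norm(h + self.residual_scale * self.update_net(h))
        if self.use_projection:
            h = self.dropout(self.up_proj(h))
        return h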

2. HybridASPPAttentionLayer

class HybridASPPAttentionLayer(LlamaDecoderLayer):
    """
    Extends LlamaDecoderLayer with parallel ASPP branch

    Architecture:
    1. Input LayerNorm
    2. Parallel branches:
       - ASPP operator for local structured reasoning
       - Standard LlamaAttention for global context
    3. Gated fusion: gate * ASPP + (1-gate) * Attention
    4. Residual connection
    5. Feed-forward MLP
    """

Pseudocode:

function HybridLayer(hidden_states, attention_mask, ...):
    residual ← hidden_states
    hidden_states ← input_layernorm(hidden_states)

    # Parallel branches
    aspp_output ← aspp_operator(hidden_states)
    attn_output ← self_attention(hidden_states, attention_mask, ...)

    # Gated fusion
    fusion_input ← concat([aspp_output, attn_output])
    gate ← sigmoid(linear(dropout(fusion_input)))
    fused_output ← gate * aspp_output + (1 - gate) * attn_output

    # Residual connection
    hidden_states ← residual + fused_output

    # MLP block
    residual ← hidden_states
    hidden_states ← post_attention_layernorm(hidden_states)
    hidden_states ← mlp(hidden_states)
    hidden_states ← residual + hidden_states

    return hidden_states
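
The fusion step in isolation can be sketched as follows; the gate projection name (gate_proj) and the per-channel sigmoid gate are assumptions consistent with the pseudocode above, not necessarily the exact released code.

import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    # Learns a sigmoid gate over the concatenated ASPP and attention outputs.
    def __init__(self, hidden_size=576, dropout=0.2):
        super().__init__()
        self.gate_proj = nn.Linear(2 * hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, aspp_output, attn_output):
        fusion_input = torch.cat([aspp_output, attn_output], dim=-1)
        gate = torch.sigmoid(self.gate_proj(self.dropout(fusion_input)))
        # Convex combination of the local (ASPP) and global (attention) branches
        return gate * aspp_output + (1 - gate) * attn_output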

3. AsteriskForCausalLM

class AsteriskForCausalLM(LlamaForCausalLM):
    """
    Main model class with custom model_type "asterisk"

    Configuration:
    - hybrid_layer_indices: None (all 30 layers are hybrid)
    - aspp_hidden_dim: 256 (reduces overfitting)
    - aspp_num_steps: 8 (learnable, actual steps ≈ 6)
    - aspp_dropout: 0.2
    """

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "path/to/Asterisk",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("path/to/Asterisk")

# Generate text
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Configuration

  • Dataset: Capybara (conversational instruction-following)
  • Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
  • Batch Size: 4 per device, gradient accumulation=4 (effective batch=16)
  • Epochs: 2
  • Scheduler: Cosine with warmup (100 steps)
  • Mixed Precision: bfloat16
  • Gradient Checkpointing: Enabled
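
The hyperparameters above map roughly onto the following TRL setup. This is a hedged sketch, not the original training script: the dataset id (trl-lib/Capybara), output_dir, and the exact trainer arguments are assumptions, and `model` is the Asterisk model from "Model Creation from Base" below.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Conversational Capybara split packaged for SFTTrainer (dataset id is an assumption)
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="asterisk-sft",          # assumed output directory
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size 16
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()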

ASPP Configuration

aspp_hidden_dim = 256      # Internal dimension (vs 576 model hidden_size)
aspp_num_steps = 8         # Max evolution steps (learnable)
aspp_dropout = 0.2         # Regularization
hybrid_layer_indices = None  # All 30 layers

Model Creation from Base

import torch

from AsteriskForCausalLM import AsteriskForCausalLM

# Create Asterisk model from SmolLM2 base
model, base_model = AsteriskForCausalLM.from_pretrained_base(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    hybrid_layer_indices=None,  # None = all layers
    aspp_hidden_dim=256,        # Internal ASPP dimension
    aspp_num_steps=8,           # K-step evolution
    aspp_dropout=0.2,           # Dropout rate
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base model parameters are transferred, ASPP parameters initialized randomly
model.load_state_dict(base_model.state_dict(), strict=False)

Theoretical Background

Universality (Theorem 2.1)

ASPP can simulate any Message-Passing Neural Network (MPNN) function on finite graphs in D steps, where D is the graph diameter.

Convergence (Theorem 2.2)

Exponential convergence to fixed points with rate c=0.76 under Lipschitz continuity.
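
In the standard contraction-mapping form (an illustrative restatement, not quoted from the theorem), this means the iterates approach the fixed point h* geometrically:

‖h^(t) - h*‖ ≤ c^t · ‖h^(0) - h*‖,   c = 0.76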

Turing Completeness

Proven via cyclic tag system simulation: ASPP can compute any Turing-computable function given sufficient depth.

Implementation Note: This implementation simplifies theoretical ASPP to point-wise evolution to reduce overfitting while maintaining iterative refinement benefits.

Files in Checkpoint

Asterisk/
├── AsteriskForCausalLM.py    # Model implementation (required for trust_remote_code)
├── config.json                # Model configuration with auto_map
├── model.safetensors          # Model weights
├── tokenizer.json             # Tokenizer
├── generation_config.json     # Generation settings
└── README.md                  # This file

Dependencies

pip install torch>=2.0.0
pip install transformers>=4.40.0
pip install trl>=0.8.0
pip install datasets>=2.14.0
pip install accelerate>=0.25.0
pip install bitsandbytes

Citations

If you use this model, please cite:

@misc{asterisk2026,
  title={Asterisk: Hybrid ASPP-Attention Architecture for Enhanced Language Modeling},
  author={NoesisLab},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/NoesisLab/Asterisk}
}
@misc{vonwerra2022trl,
  title={{TRL: Transformer Reinforcement Learning}},
  author={Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  year={2020},
  journal={GitHub repository},
  publisher={GitHub},
  howpublished={\url{https://github.com/huggingface/trl}}
}
@misc{allal2024SmolLM2,
  title={SmolLM2 - with great data, comes great performance},
  author={Allal, Loubna Ben and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
  year={2024}
}

License

This model inherits the Apache 2.0 license from SmolLM2-135M-Instruct.

Framework Versions

  • TRL: 0.27.0
  • Transformers: 4.57.6
  • PyTorch: 2.8.0+cu128
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

Acknowledgments

Built on top of SmolLM2-135M-Instruct by HuggingFace. Training framework powered by TRL.
