ECHO


ECHO (frEquenCy-aware Hierarchical encOding for variable-length signal) is a general-purpose machine signal representation learning model. It is based on Masked Autoencoders (MAE) and uses band-splitting and frequency positional encoding to handle variable-length inputs.

Performance on SIREN

Overall performance summary (DCASE anomaly detection + Fault classification):

(Performance summary figure)

Model Details

  • Model Type: AudioMAEWithBand (MAE-based Audio Encoder)
  • Hidden Size: 192
  • Number of Layers: 12
  • Number of Attention Heads: 3
  • Intermediate Size: 768 (mlp_ratio=4.0)
  • Shift Size: 16 (half of patch_size)
  • Band Width: 32
  • Total Parameters: ~5.5M
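
The ~5.5M figure is consistent with a rough back-of-the-envelope count for 12 standard transformer blocks with hidden size 192 and MLP ratio 4.0 (a minimal sketch; biases, embeddings, positional encodings, and norms are ignored, so it slightly undercounts):

# Rough encoder parameter count (assumption: standard ViT-style blocks;
# biases, embeddings, positional encodings, and norms are ignored).
d = 192                      # hidden size
mlp = int(4.0 * d)           # intermediate size (mlp_ratio = 4.0)
layers = 12

attn = 4 * d * d             # Q, K, V and output projections
ffn = 2 * d * mlp            # two feed-forward linear layers
total = layers * (attn + ffn)
print(f"~{total / 1e6:.2f}M parameters")   # ~5.31M, close to the reported ~5.5M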

Key Features

  • Band-splitting architecture: Processes audio in frequency bands for better local and global representation learning
  • Frequency position encoding: Incorporates frequency information into the model for better audio understanding
  • Efficient patch embedding: Uses sliding-window patches for temporal modeling, enabling variable-length inputs (see the sketch after this list)
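
To illustrate the band-splitting and sliding-window ideas, here is a minimal sketch only; the actual implementation lives in audioMAE_band_upgrade, and the toy spectrogram shape and the temporal patch length of 2 * shift_size are assumptions:

import torch

# Toy spectrogram: (freq_bins, time_frames); the real model produces its own
# spectrograms via preprocess_audio_to_spectrogram (see Usage below).
spec = torch.randn(128, 500)

band_width = 32              # frequency bins per band, matching "Band Width: 32"
shift_size = 16              # hop between temporal patches, matching "Shift Size: 16"
patch_len = 2 * shift_size   # assumed temporal patch length

# 1) Band-splitting: non-overlapping chunks along the frequency axis.
bands = spec.split(band_width, dim=0)            # 4 bands, each (32, 500)

# 2) Sliding-window patching: overlapping chunks along the time axis. The
#    number of patches grows with clip duration, which is what makes
#    variable-length inputs possible.
patches = [band.unfold(1, patch_len, shift_size) for band in bands]
print(len(bands), patches[0].shape)              # 4, torch.Size([32, 30, 32])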

Download

from huggingface_hub import snapshot_download

# Download the model to local directory
model_path = snapshot_download(
    repo_id="yucongzh/echo-tiny-0828",
    local_dir="./echo-tiny",
    local_dir_use_symlinks=False
)
print(f"Model downloaded to: {model_path}")

Usage

import torch
import torchaudio
import sys

# Add the model path to Python path
sys.path.append('./echo-tiny')

# Import the model architecture
from audioMAE_band_upgrade import AudioMAEWithBand

# Create model instance with your configuration
model = AudioMAEWithBand(
    spec_len=2000,
    band_width=32,
    shift_size=16,
    in_chans=1,
    embed_dim=192,
    encoder_depth=12,
    num_heads=3,
    mlp_ratio=4.0,
    freq_pos_emb_dim=192
)

# Load pre-trained weights
from safetensors.torch import load_file
state_dict = load_file('./echo-tiny/model.safetensors')  # weights from the downloaded directory
model.load_state_dict(state_dict, strict=False)

# Set to evaluation mode
model.eval()

# Example usage
audio_signal = torch.randn(1, 240000)  # 5 seconds at 48kHz
sample_rate = 48000

# Method 1: Extract features directly from audio (Recommended)
with torch.inference_mode():
    utterance_level_features, segment_level_features = model.extract_features_from_audio(audio_signal, sample_rate=sample_rate)
print(f"Utterance-level Feature shape: {utterance_level_features.shape}")
print(f"Segment-level Feature shape: {segment_level_features.shape}")

# Method 2: Use preprocessing separately, then extract features
spec = model.preprocess_audio_to_spectrogram(audio_signal, sample_rate=sample_rate)
print(f"Spectrogram shape: {spec.shape}")

# Extract features from preprocessed spectrogram
with torch.inference_mode():
    utterance_level_features, segment_level_features = model.extract_features(spec, sample_rate=sample_rate)
print(f"Utterance-level Feature shape: {utterance_level_features.shape}")
print(f"Segment-level Feature shape: {segment_level_features.shape}")

Feature Types

The ECHO model outputs two types of features:

1. Utterance-level Features

  • Shape: [N×D] (CLS tokens from all N frequency bands, each of dimension D, concatenated into a single vector)
  • Usage: Audio classification, emotion recognition, music genre classification, speaker identification
  • Characteristics: Global representation of the entire audio segment

2. Segment-level Features

  • Shape: [T, N×D] (one feature per temporal patch, concatenated across the N frequency bands)
  • Usage: Audio segmentation, event detection, temporal localization, sequence modeling
  • Characteristics: Fine-grained temporal representation with frequency band information
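
As one illustration of how the utterance-level features could be used for anomaly detection, here is a minimal sketch reusing utterance_level_features from the Usage example; the random normal_bank and the nearest-neighbor cosine scoring are assumptions, not part of this repository:

import torch
import torch.nn.functional as F

# Hypothetical anomaly scoring: keep a bank of utterance-level features from
# known-normal recordings and score a test clip by its distance to the most
# similar normal embedding (higher = more anomalous). The bank is random here,
# purely for illustration.
normal_bank = torch.randn(10, utterance_level_features.numel())   # (10, N*D)
test_feat = utterance_level_features.reshape(1, -1)               # (1, N*D)

similarity = F.cosine_similarity(test_feat, normal_bank, dim=-1)  # (10,)
anomaly_score = 1.0 - similarity.max()
print(f"Anomaly score: {anomaly_score.item():.3f}")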

Citation

If you find ECHO helpful, please consider citing our paper:

@article{echo2025,
  title={ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal},
  author={Yucong Zhang and Juan Liu and Ming Li},
  journal={arXiv preprint arXiv:2508.14689},
  year={2025},
}