ECHO
ECHO (frEquenCy-aware Hierarchical encOding for variable-length signal) is a general machine-signal representation learning model. It builds on Masked Autoencoders (MAE) and adds band-splitting and frequency positional encoding, allowing it to handle inputs of varying length.
Performance on SIREN
Overall performance is summarized across DCASE anomaly detection and fault classification tasks.
Model Details
- Model Type: AudioMAEWithBand (MAE-based Audio Encoder)
- Hidden Size: 192
- Number of Layers: 12
- Number of Attention Heads: 3
- Intermediate Size: 768 (mlp_ratio=4.0)
- Shift Size: 16 (half of patch_size)
- Band Width: 32
- Total Parameters: ~5.5M (see the back-of-envelope check below)
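As a rough, self-contained sanity check, the encoder dimensions above already account for most of the ~5.5M parameters (a back-of-envelope estimate that ignores embeddings, positional encodings, biases, and layer norms):
# Back-of-envelope parameter count from the encoder dimensions above
d, layers = 192, 12
attn = 4 * d * d            # Q, K, V and output projections per block
mlp = 2 * d * (4 * d)       # two linear layers per block with mlp_ratio = 4.0
print(f"~{(attn + mlp) * layers / 1e6:.1f}M of the ~5.5M total")  # ≈ 5.3M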
Key Features
- Band-splitting architecture: Processes audio in frequency bands for better local and global representation learning (see the sketch after this list)
- Frequency position encoding: Incorporates frequency information into the model for better audio understanding
- Efficient patch embedding: Uses sliding-window patches for temporal modeling, enabling inputs of varying time length
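A minimal, self-contained sketch of the band-splitting and sliding-window idea (not the actual AudioMAEWithBand code): the spectrogram is sliced into frequency bands of band_width=32 bins, and each band is patched along time with a window of 32 frames and a shift of 16 (the window size of 32 follows from the shift size being half the patch size; the helper names are illustrative).
import torch

def split_into_bands(spec, band_width=32):
    # spec: (batch, freq_bins, time) -> list of (batch, band_width, time) bands
    return [spec[:, f:f + band_width, :] for f in range(0, spec.shape[1], band_width)]

def sliding_time_patches(band, patch_size=32, shift_size=16):
    # Overlapping temporal patches: windows of `patch_size` frames with hop `shift_size`
    return band.unfold(2, patch_size, shift_size)

spec = torch.randn(1, 128, 500)          # e.g. 128 frequency bins, 500 frames
bands = split_into_bands(spec)           # 4 bands of 32 bins each
patches = sliding_time_patches(bands[0])
print(len(bands), patches.shape)         # 4 torch.Size([1, 32, 30, 32])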
Download
from huggingface_hub import snapshot_download
# Download the model to local directory
model_path = snapshot_download(
    repo_id="yucongzh/echo-tiny-0828",
    local_dir="./echo-tiny",
    local_dir_use_symlinks=False,
)
print(f"Model downloaded to: {model_path}")
Usage
import torch
import torchaudio
import sys
# Add the model path to Python path
sys.path.append('./echo-tiny')
# Import the model architecture
from audioMAE_band_upgrade import AudioMAEWithBand
# Create model instance with your configuration
model = AudioMAEWithBand(
    spec_len=2000,
    band_width=32,
    shift_size=16,
    in_chans=1,
    embed_dim=192,
    encoder_depth=12,
    num_heads=3,
    mlp_ratio=4.0,
    freq_pos_emb_dim=192,
)
# Load pre-trained weights
from safetensors.torch import load_file
state_dict = load_file('./echo-tiny/model.safetensors')
model.load_state_dict(state_dict, strict=False)
# Set to evaluation mode
model.eval()
# Example usage
audio_signal = torch.randn(1, 240000) # 5 seconds at 48kHz
sample_rate = 48000
# Method 1: Extract features directly from audio (Recommended)
with torch.inference_mode():
    utterance_level_features, segment_level_features = model.extract_features_from_audio(audio_signal, sample_rate=sample_rate)
print(f"Utterance-level Feature shape: {utterance_level_features.shape}")
print(f"Segment-level Feature shape: {segment_level_features.shape}")
# Method 2: Use preprocessing separately, then extract features
spec = model.preprocess_audio_to_spectrogram(audio_signal, sample_rate=sample_rate)
print(f"Spectrogram shape: {spec.shape}")
# Extract features from preprocessed spectrogram
with torch.inference_mode():
    utterance_level_features, segment_level_features = model.extract_features(spec, sample_rate=sample_rate)
print(f"Utterance-level Feature shape: {utterance_level_features.shape}")
print(f"Segment-level Feature shape: {segment_level_features.shape}")
Feature Types
The ECHO model outputs two types of features:
1. Utterance-level Features
- Shape: [NxD,] (concatenated CLS tokens from all frequency bands)
- Usage: Audio classification, emotion recognition, music genre classification, speaker identification (see the linear-probe sketch after this list)
- Characteristics: Global representation of the entire audio segment
2. Segment-level Features
- Shape: [T, NxD] (temporal features for each patch, concatenated across bands)
- Usage: Audio segmentation, event detection, temporal localization, sequence modeling
- Characteristics: Fine-grained temporal representation with frequency band information
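For downstream tasks such as fault classification, a common recipe is a linear probe on the frozen utterance-level features. The sketch below continues from the Usage example above; the probe and its four classes are hypothetical and not part of ECHO.
import torch.nn as nn

# Keep the ECHO encoder frozen; only the probe would be trained
with torch.no_grad():
    utt_feat, _ = model.extract_features_from_audio(audio_signal, sample_rate=sample_rate)

probe = nn.Linear(utt_feat.shape[-1], 4)   # hypothetical: 4 machine fault classes
logits = probe(utt_feat)                   # train the probe with cross-entropy on labeled data
print(f"Logits shape: {tuple(logits.shape)}")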
Citation
If you find ECHO helpful, please consider citing our paper:
@article{echo2025,
  title={ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal},
  author={Yucong Zhang and Juan Liu and Ming Li},
  journal={arXiv preprint arXiv:2508.14689},
  year={2025},
}