|
--- |
|
tags: |
|
- espnet |
|
- audio |
|
- language-identification |
|
language: |
|
- abk |
|
- afr |
|
- amh |
|
- ara |
|
- asm |
|
- ast |
|
- aze |
|
- azz |
|
- bak |
|
- bas |
|
- bel |
|
- ben |
|
- bod |
|
- bos |
|
- bre |
|
- bul |
|
- cat |
|
- ceb |
|
- ces |
|
- chv |
|
- ckb |
|
- cmn |
|
- cnh |
|
- cym |
|
- dan |
|
- deu |
|
- div |
|
- ell |
|
- eng |
|
- epo |
|
- est |
|
- eus |
|
- fao |
|
- fas |
|
- fil |
|
- fin |
|
- fra |
|
- frr |
|
- ful |
|
- gle |
|
- glg |
|
- glv |
|
- grn |
|
- gug |
|
- guj |
|
- hat |
|
- hau |
|
- haw |
|
- heb |
|
- hin |
|
- hrv |
|
- hsb |
|
- hun |
|
- hye |
|
- ibo |
|
- ina |
|
- ind |
|
- isl |
|
- ita |
|
- jav |
|
- jpn |
|
- kab |
|
- kam |
|
- kan |
|
- kat |
|
- kaz |
|
- kea |
|
- khk |
|
- khm |
|
- kin |
|
- kir |
|
- kmr |
|
- kor |
|
- lao |
|
- lat |
|
- lav |
|
- lin |
|
- lit |
|
- ltz |
|
- lug |
|
- luo |
|
- mal |
|
- mar |
|
- mhr |
|
- mkd |
|
- mlg |
|
- mlt |
|
- mon |
|
- mri |
|
- mrj |
|
- msa |
|
- mya |
|
- myv |
|
- nan |
|
- nbl |
|
- nep |
|
- nld |
|
- nno |
|
- nob |
|
- nor |
|
- nso |
|
- nya |
|
- oci |
|
- ori |
|
- orm |
|
- pan |
|
- pol |
|
- por |
|
- pus |
|
- ron |
|
- rus |
|
- sah |
|
- san |
|
- sco |
|
- sin |
|
- skr |
|
- slk |
|
- slv |
|
- sna |
|
- snd |
|
- som |
|
- sot |
|
- spa |
|
- sqi |
|
- srp |
|
- ssw |
|
- sun |
|
- swa |
|
- swe |
|
- tam |
|
- tat |
|
- tel |
|
- tgk |
|
- tgl |
|
- tha |
|
- tok |
|
- tos |
|
- tpi |
|
- tsn |
|
- tso |
|
- tuk |
|
- tur |
|
- uig |
|
- ukr |
|
- umb |
|
- urd |
|
- uzb |
|
- ven |
|
- vie |
|
- war |
|
- wol |
|
- xho |
|
- xty |
|
- yid |
|
- yor |
|
- yue |
|
- zul |
|
datasets: |
|
- geolid |
|
license: cc-by-4.0 |
|
--- |
|
|
|
## ESPnet2 Spoken Language Identification (LID) model |
|
|
|
### `espnet/geolid_combined_shared_trainable` |
|
|
|
This geolocation-aware language identification (LID) model was developed with the [ESPnet](https://github.com/espnet/espnet/) toolkit. It uses the pretrained [MMS-1B](https://huggingface.co/facebook/mms-1b) model as the speech encoder and [ECAPA-TDNN](https://arxiv.org/pdf/2005.07143) as the embedding extractor for robust spoken language identification.
|
|
|
The main innovations of this model are: |
|
1. Incorporating geolocation prediction as an auxiliary task during training. |
|
2. Conditioning the intermediate representations of the self-supervised learning (SSL) encoder on intermediate geolocation predictions.

This geolocation-aware strategy greatly improves robustness, especially for dialects and accented speech.
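
To make the second point concrete, the sketch below illustrates intermediate geolocation conditioning as a minimal, hypothetical PyTorch module. The class and variable names are illustrative rather than ESPnet identifiers, the hidden size of 1280 is assumed for MMS-1B, and `geo_dim=299` matches the `lang2vec_dim` value in the config further down this card; in the actual model, conditioning is applied at encoder layers 32, 36, 40, and 44.

```python
# Minimal, hypothetical sketch of geolocation-aware intermediate conditioning.
# Names (GeoConditioner, geo_head, cond_proj) are illustrative, not ESPnet identifiers.
import torch
import torch.nn as nn


class GeoConditioner(nn.Module):
    """Predict a geolocation (lang2vec 'geo') vector from one SSL layer's
    hidden states and add a projection of the prediction back into them."""

    def __init__(self, hidden_dim: int = 1280, geo_dim: int = 299):
        super().__init__()
        self.geo_head = nn.Linear(hidden_dim, geo_dim)   # auxiliary geolocation predictor
        self.cond_proj = nn.Linear(geo_dim, hidden_dim)  # maps the prediction back to hidden space

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, time, hidden_dim) from one intermediate SSL encoder layer
        pooled = hidden.mean(dim=1)                      # utterance-level summary
        geo_pred = self.geo_head(pooled)                 # target of the auxiliary lang2vec loss
        conditioned = hidden + self.cond_proj(geo_pred).unsqueeze(1)
        return conditioned, geo_pred


layer_output = torch.randn(2, 100, 1280)                 # dummy hidden states
conditioned, geo_pred = GeoConditioner()(layer_output)
print(conditioned.shape, geo_pred.shape)                 # (2, 100, 1280) and (2, 299)
```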
|
|
|
For further details on the geolocation-aware LID methodology, please refer to our paper: *Geolocation-Aware Robust Spoken Language Identification* (arXiv link to be added). |
|
|
|
### Usage Guide: How to use in ESPnet2 |
|
|
|
#### Prerequisites |
|
First, ensure you have ESPnet installed. If not, follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html). |
|
|
|
#### Quick Start |
|
Run the following commands to set up and use the pre-trained model: |
|
|
|
```bash |
|
cd espnet |
|
|
|
pip install -e . |
|
cd egs2/geolid/lid1 |
|
|
|
# Download the pre-trained exp_combined directory into egs2/geolid/lid1
|
hf download espnet/geolid_combined_shared_trainable --local-dir . --exclude "README.md" "meta.yaml" ".gitattributes" |
|
|
|
./run_combined.sh --skip_data_prep false --skip_train true |
|
``` |
|
|
|
This will download the pre-trained model and run inference. |
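
If you prefer to run the download step from Python instead of the `hf` CLI, an equivalent call with `huggingface_hub` (executed from `egs2/geolid/lid1`) looks like this:

```python
# Python equivalent of the `hf download` step above (run inside egs2/geolid/lid1).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="espnet/geolid_combined_shared_trainable",
    local_dir=".",
    ignore_patterns=["README.md", "meta.yaml", ".gitattributes"],
)
```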
|
|
|
### Train and Evaluation Datasets |
|
|
|
Training used a combined dataset that merges five domain-specific corpora, totaling 9,865 hours of speech across 157 languages.
|
|
|
| Dataset | Domain | #Langs. Train/Test | Dialect | Training Setup (Combined) | |
|
| ------------- | ----------- | ------------------ | ------- | --------------------------- | |
|
| [VoxLingua107](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/) | YouTube | 107/33 | No | Seen | |
|
| [Babel](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=31a13cefb42647e924e0d2778d341decc44c40e9) | Telephone | 25/25 | No | Seen | |
|
| [FLEURS](https://huggingface.co/datasets/google/xtreme_s) | Read speech | 102/102 | No | Seen | |
|
| [ML-SUPERB 2.0](https://huggingface.co/datasets/espnet/ml_superb_hf) | Mixed | 137/(137, 8) | Yes | Seen | |
|
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Parliament | 16/16 | No | Seen | |
|
|
|
### Results |
|
|
|
**Accuracy (%) on In-domain and Out-of-domain Test Sets** |
|
|
|
<style> |
|
.hf-model-cell { |
|
max-width: 120px; |
|
overflow-x: auto; |
|
white-space: nowrap; |
|
scrollbar-width: thin; |
|
scrollbar-color: #888 #f1f1f1; |
|
} |
|
|
|
.config-cell { |
|
max-width: 100px; |
|
overflow-x: auto; |
|
white-space: nowrap; |
|
scrollbar-width: thin; |
|
scrollbar-color: #888 #f1f1f1; |
|
} |
|
|
|
.hf-model-cell::-webkit-scrollbar, |
|
.config-cell::-webkit-scrollbar { |
|
height: 6px; |
|
} |
|
|
|
.hf-model-cell::-webkit-scrollbar-track, |
|
.config-cell::-webkit-scrollbar-track { |
|
background: #f1f1f1; |
|
border-radius: 3px; |
|
} |
|
|
|
.hf-model-cell::-webkit-scrollbar-thumb, |
|
.config-cell::-webkit-scrollbar-thumb { |
|
background: #888; |
|
border-radius: 3px; |
|
} |
|
|
|
.hf-model-cell::-webkit-scrollbar-thumb:hover, |
|
.config-cell::-webkit-scrollbar-thumb:hover { |
|
background: #555; |
|
} |
|
</style> |
|
|
|
<div style="overflow-x: auto;"> |
|
|
|
| ESPnet Recipe | Config | VoxLingua107 | Babel | FLEURS | ML-SUPERB 2.0 Dev | ML-SUPERB 2.0 Dialect | VoxPopuli | Macro Avg. |
|
| ------------------------- | ----------- | ------------ | ----- | ------ | ---------------- | -------------------- | --------- | ---------- | |
|
| <div class="hf-model-cell">[egs2/geolid/lid1](https://github.com/espnet/espnet/tree/master/egs2/geolid/lid1)</div> | <div class="config-cell">`conf/combined/mms_ecapa_upcon_32_44_it0.4_shared_trainable.yaml`</div> | 94.4 | 95.4 | 97.7 | 88.6 | 86.8 | 99.0 | 93.7 | |
|
|
|
</div> |
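
The Macro Avg. column is the unweighted mean of the six per-benchmark accuracies:

```python
# Macro average over the six test sets reported above.
accs = [94.4, 95.4, 97.7, 88.6, 86.8, 99.0]
print(f"{sum(accs) / len(accs):.2f}")  # 93.65, reported as 93.7
```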
|
|
|
For more detailed inference results, please refer to the `exp_combined/lid_mms_ecapa_upcon_32_44_it0.4_shared_trainable_raw/inference` directory in this repository. |
|
|
|
> **Note (2025-08-18):** |
|
> The corresponding GitHub recipe [egs2/geolid/lid1](https://github.com/espnet/espnet/tree/master/egs2/geolid/lid1) has not yet been merged into the ESPnet master branch. |
|
> For the latest status, see the corresponding pull request (TODO: add PR link).
|
|
|
## LID config |
|
|
|
<details><summary>expand</summary> |
|
|
|
``` |
|
config: conf/combined/mms_ecapa_upcon_32_44_it0.4_shared_trainable_dev.yaml |
|
print_config: false |
|
log_level: INFO |
|
drop_last_iter: false |
|
dry_run: false |
|
iterator_type: category |
|
valid_iterator_type: category |
|
output_dir: exp_combined/lid_mms_ecapa_upcon_32_44_it0.4_shared_trainable_dev_raw |
|
ngpu: 1 |
|
seed: 3702 |
|
num_workers: 8 |
|
num_att_plot: 0 |
|
dist_backend: nccl |
|
dist_init_method: env:// |
|
dist_world_size: null |
|
dist_rank: null |
|
local_rank: 0 |
|
dist_master_addr: null |
|
dist_master_port: null |
|
dist_launcher: null |
|
multiprocessing_distributed: false |
|
unused_parameters: true |
|
sharded_ddp: false |
|
use_deepspeed: false |
|
deepspeed_config: null |
|
gradient_as_bucket_view: true |
|
ddp_comm_hook: null |
|
cudnn_enabled: true |
|
cudnn_benchmark: true |
|
cudnn_deterministic: false |
|
use_tf32: false |
|
collect_stats: false |
|
write_collected_feats: false |
|
max_epoch: 33 |
|
patience: null |
|
val_scheduler_criterion: |
|
- valid |
|
- loss |
|
early_stopping_criterion: |
|
- valid |
|
- loss |
|
- min |
|
best_model_criterion: |
|
- - valid |
|
- accuracy |
|
- max |
|
keep_nbest_models: 2 |
|
nbest_averaging_interval: 0 |
|
grad_clip: 9999 |
|
grad_clip_type: 2.0 |
|
grad_noise: false |
|
accum_grad: 4 |
|
no_forward_run: false |
|
resume: true |
|
train_dtype: float32 |
|
use_amp: true |
|
log_interval: 100 |
|
use_matplotlib: true |
|
use_tensorboard: true |
|
create_graph_in_tensorboard: false |
|
use_wandb: false |
|
wandb_project: null |
|
wandb_id: null |
|
wandb_entity: null |
|
wandb_name: null |
|
wandb_model_log_interval: -1 |
|
detect_anomaly: false |
|
use_adapter: false |
|
adapter: lora |
|
save_strategy: all |
|
adapter_conf: {} |
|
pretrain_path: null |
|
init_param: [] |
|
ignore_init_mismatch: false |
|
freeze_param: [] |
|
num_iters_per_epoch: 2000 |
|
batch_size: 20 |
|
valid_batch_size: null |
|
batch_bins: 1440000 |
|
valid_batch_bins: null |
|
category_sample_size: 10 |
|
upsampling_factor: 0.5 |
|
category_upsampling_factor: 0.5 |
|
dataset_upsampling_factor: 0.3 |
|
dataset_scaling_factor: 1.2 |
|
max_batch_size: 6 |
|
min_batch_size: 1 |
|
train_shape_file: |
|
- exp_combined/lid_stats_16k/train/speech_shape |
|
valid_shape_file: |
|
- exp_combined/lid_stats_16k/valid/speech_shape |
|
batch_type: catpow_balance_dataset |
|
language_upsampling_factor: 0.5 |
|
valid_batch_type: null |
|
fold_length: |
|
- 120000 |
|
sort_in_batch: descending |
|
shuffle_within_batch: false |
|
sort_batch: descending |
|
multiple_iterator: false |
|
chunk_length: 500 |
|
chunk_shift_ratio: 0.5 |
|
num_cache_chunks: 1024 |
|
chunk_excluded_key_prefixes: [] |
|
chunk_default_fs: null |
|
chunk_max_abs_length: null |
|
chunk_discard_short_samples: true |
|
train_data_path_and_name_and_type: |
|
- - dump/raw/train_all_no_filter_lang/wav.scp |
|
- speech |
|
- sound |
|
- - dump/raw/train_all_no_filter_lang/utt2lang |
|
- lid_labels |
|
- text |
|
valid_data_path_and_name_and_type: |
|
- - dump/raw/dev_ml_superb2_lang/wav.scp |
|
- speech |
|
- sound |
|
- - dump/raw/dev_ml_superb2_lang/utt2lang |
|
- lid_labels |
|
- text |
|
multi_task_dataset: false |
|
allow_variable_data_keys: false |
|
max_cache_size: 0.0 |
|
max_cache_fd: 32 |
|
allow_multi_rates: false |
|
valid_max_cache_size: null |
|
exclude_weight_decay: false |
|
exclude_weight_decay_conf: {} |
|
optim: adam |
|
optim_conf: |
|
lr: 1.0e-05 |
|
betas: |
|
- 0.9 |
|
- 0.98 |
|
scheduler: tristagelr |
|
scheduler_conf: |
|
max_steps: 12500 |
|
warmup_ratio: 0.1 |
|
hold_ratio: 0.4 |
|
decay_ratio: 0.5 |
|
init_lr_scale: 0.6 |
|
final_lr_scale: 0.1 |
|
init: null |
|
use_preprocessor: true |
|
input_size: null |
|
target_duration: 3.0 |
|
lang2utt: dump/raw/train_all_no_filter_lang/lang2utt |
|
lang_num: 157 |
|
sample_rate: 16000 |
|
num_eval: 10 |
|
rir_scp: '' |
|
model: upstream_condition |
|
model_conf: |
|
lang2vec_conditioning_layers: |
|
- 32 |
|
- 36 |
|
- 40 |
|
- 44 |
|
apply_intermediate_lang2vec_loss: true |
|
apply_intermediate_lang2vec_condition: true |
|
inter_lang2vec_loss_weight: 0.4 |
|
cutoff_gradient_from_backbone: false |
|
cutoff_gradient_before_condproj: true |
|
shared_conditioning_proj: true |
|
frontend: s3prl_condition |
|
frontend_conf: |
|
frontend_conf: |
|
upstream: hf_wav2vec2_condition |
|
path_or_url: facebook/mms-1b |
|
download_dir: ./hub |
|
multilayer_feature: true |
|
specaug: null |
|
specaug_conf: {} |
|
normalize: utterance_mvn |
|
normalize_conf: |
|
norm_vars: false |
|
encoder: ecapa_tdnn |
|
encoder_conf: |
|
model_scale: 8 |
|
ndim: 512 |
|
output_size: 1536 |
|
pooling: chn_attn_stat |
|
pooling_conf: {} |
|
projector: rawnet3 |
|
projector_conf: |
|
output_size: 192 |
|
encoder_condition: identity |
|
encoder_condition_conf: {} |
|
pooling_condition: chn_attn_stat |
|
pooling_condition_conf: {} |
|
projector_condition: rawnet3 |
|
projector_condition_conf: {} |
|
preprocessor: lid |
|
preprocessor_conf: |
|
fix_duration: false |
|
sample_rate: 16000 |
|
noise_apply_prob: 0.0 |
|
noise_info: |
|
- - 1.0 |
|
- dump/raw/musan_speech.scp |
|
- - 4 |
|
- 7 |
|
- - 13 |
|
- 20 |
|
- - 1.0 |
|
- dump/raw/musan_noise.scp |
|
- - 1 |
|
- 1 |
|
- - 0 |
|
- 15 |
|
- - 1.0 |
|
- dump/raw/musan_music.scp |
|
- - 1 |
|
- 1 |
|
- - 5 |
|
- 15 |
|
rir_apply_prob: 0.0 |
|
rir_scp: dump/raw/rirs.scp |
|
use_lang2vec: true |
|
lang2vec_type: geo |
|
loss: aamsoftmax_sc_topk_lang2vec |
|
loss_conf: |
|
margin: 0.5 |
|
scale: 30 |
|
K: 3 |
|
mp: 0.06 |
|
k_top: 5 |
|
lang2vec_dim: 299 |
|
lang2vec_type: geo |
|
lang2vec_weight: 0.2 |
|
required: |
|
- output_dir |
|
version: '202506' |
|
distributed: false |
|
``` |
|
|
|
</details> |
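
To sanity-check the downloaded experiment against this card (the 157-language inventory, the conditioned layers, the MMS-1B frontend), you can load the config with PyYAML. The path below is an assumption based on the usual ESPnet layout (`config.yaml` inside the experiment directory referenced above); adjust it to wherever the file lands in your checkout.

```python
# Inspect the training configuration shipped with the experiment directory.
# The exact path is assumed from the standard ESPnet exp layout.
import yaml

with open("exp_combined/lid_mms_ecapa_upcon_32_44_it0.4_shared_trainable_raw/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["lang_num"])                                        # 157 languages
print(cfg["model_conf"]["lang2vec_conditioning_layers"])      # [32, 36, 40, 44]
print(cfg["frontend_conf"]["frontend_conf"]["path_or_url"])   # facebook/mms-1b
```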
|
|
|
|
|
|
|
### Citation |
|
|
|
```bibtex
|
@inproceedings{wang2025geolid, |
|
  author={Qingzheng Wang and Hye-jin Shim and Jiancheng Sun and Shinji Watanabe},
|
title={Geolocation-Aware Robust Spoken Language Identification}, |
|
year={2025}, |
|
  booktitle={Proceedings of ASRU},
|
} |
|
|
|
@inproceedings{watanabe2018espnet, |
|
author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai}, |
|
title={{ESPnet}: End-to-End Speech Processing Toolkit}, |
|
year={2018}, |
|
booktitle={Proceedings of Interspeech}, |
|
pages={2207--2211}, |
|
doi={10.21437/Interspeech.2018-1456}, |
|
url={http://dx.doi.org/10.21437/Interspeech.2018-1456} |
|
} |
|
``` |
|
|