ggmbr committed 6b1dc6b (verified) · 1 parent: b3a653b

Update README.md

Files changed (1): README.md (+102 −98)

README.md (updated version):
---
license: cc-by-sa-3.0
tags:
- Speaker traits
- Voice
- Speaker
language:
- en
base_model:
- microsoft/wavlm-large
datasets:
- VCTK
- VoxCeleb
---

# Timbral Embeddings extractor
This model produces embeddings that globally represent the timbral traits of a speaker's voice. These embeddings can be used in the same way as classical automatic speaker verification (ASV) embeddings:
to compare two voice signals, compute an embedding vector for each of them, then use the cosine similarity between the two embeddings as the comparison score.
The main difference with classical ASV embeddings is that, here, only the timbral traits are compared.

The model has been derived from the self-supervised pretrained model [WavLM-large](https://huggingface.co/microsoft/wavlm-large).

The next section explains how to compute these timbral embeddings.

# Usage
The following code snippet uses the file [spk_embeddings.py](https://huggingface.co/Orange/Speaker-wavLM-pro/blob/main/spk_embeddings.py)
to build the architecture of the model.
Its weights are then downloaded from this repository.
```python
from spk_embeddings import EmbeddingsModel, compute_embedding
import torch

model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-tbr")
model.eval()
```

The model produces normalized (unit-norm) vectors as embeddings.

The Python file also contains the `compute_embedding` function, which computes the timbral embedding of an audio file.
In this tutorial version, the audio file is expected to be sampled at 16 kHz.
Depending on the available memory (CPU or GPU), you may change the value of the *max_size* parameter,
which is used to truncate long audio signals.
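
If a recording is not already sampled at 16 kHz, it can be resampled before extraction. The following is a minimal sketch (not part of the original tutorial) using torchaudio; the helper name and file paths are illustrative assumptions:

```python
# Hypothetical helper: resample an arbitrary audio file to 16 kHz
# so it can be passed to compute_embedding (which expects 16 kHz input).
import torchaudio

def to_16khz(in_path, out_path):
    waveform, sr = torchaudio.load(in_path)  # load audio and its native sampling rate
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
    torchaudio.save(out_path, waveform, 16000)  # write a 16 kHz copy for the extractor
    return out_path

# e.g. emb = compute_embedding(to_16khz("my_audio_44k.wav", "my_audio_16k.wav"), model)
```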

Finally, we can compute two embeddings from two different files and compare them with a cosine similarity (since the embeddings are normalized, a simple dot product gives the cosine similarity):

```python
wav1 = "/voxceleb1_2019/test/wav/id10270/x6uYqmx31kE/00001.wav"
wav2 = "/voxceleb1_2019/test/wav/id10270/8jEAjG6SegY/00008.wav"

e1 = compute_embedding(wav1, model)
e2 = compute_embedding(wav2, model)
sim = float(torch.matmul(e1, e2.t()))

print(sim) # 0.7743815779685974
```
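
The same dot-product comparison extends to more than two recordings. Below is a small illustrative sketch (not from the original snippet) that builds a pairwise similarity matrix; the `.flatten()` call is a defensive assumption about the embedding shape:

```python
# Illustrative sketch: pairwise timbral similarity matrix for a list of files.
# `paths` reuses the two example files above; replace with your own recordings.
paths = [wav1, wav2]

# Stack one embedding per file into a (num_files, dim) matrix.
# .flatten() is a defensive assumption in case embeddings come back as (1, dim).
embs = torch.stack([compute_embedding(p, model).flatten() for p in paths])

# Embeddings are unit-norm, so the Gram matrix contains cosine similarities.
scores = embs @ embs.t()
print(scores)
```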

# Evaluations
Although it is not directly designed for this use case, this model can be evaluated on a standard ASV task. Applied to
the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it leads to an equal error rate
(EER, where a lower value denotes better identification and random prediction yields 50%) of **1.685%**
(with a decision threshold of **0.472**).
This value can be interpreted as the ability to identify speakers using only timbral cues. A discussion of this interpretation can be
found in the paper mentioned below, as well as other experiments showing correlations between these embeddings and timbral voice attributes.

Please note that the EER value can vary slightly depending on the *max_size* used to truncate long audio signals (30 seconds maximum in our case).
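
As an illustration, the reported operating point can be turned into a same-speaker / different-speaker decision by thresholding the cosine score. This is a hypothetical sketch: the 0.472 threshold was calibrated on VoxCeleb1-clean, and reusing it on other data is an assumption:

```python
# Illustrative sketch: turn a cosine similarity into a verification decision
# using the decision threshold reported on VoxCeleb1-clean (0.472).
THRESHOLD = 0.472

def same_speaker_timbre(score: float, threshold: float = THRESHOLD) -> bool:
    # Scores above the threshold are treated as "same speaker" (timbre-wise) trials.
    return score >= threshold

print(same_speaker_timbre(sim))  # True for the two id10270 files compared above
```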

# Limitations
The fine-tuning data used to produce this model (VoxCeleb, VCTK) are mostly in English, which may affect the performance on other languages.
The performance may also vary with the audio quality (recording device, background noise, ...), especially for audio conditions not covered by the training set, as no specific technique, e.g. data augmentation, was used during training to tackle this problem.

# Publication
Details about the method used to build this model have been published at Interspeech 2024 in the paper entitled
[Disentangling prosody and timbre embeddings via voice conversion](https://www.isca-archive.org/interspeech_2024/gengembre24_interspeech.pdf).

Please consider citing this paper if you use this model in your own research work.

In this paper, the model is denoted W-TBR. The other two models used in this study can also be found on Hugging Face:
- [W-PRO](https://huggingface.co/Orange/Speaker-wavLM-pro) for non-timbral embeddings
- [W-SPK](https://huggingface.co/Orange/Speaker-wavLM-id) for speaker embeddings (ASV)


### Citation
Gengembre, N., Le Blouch, O., Gendrot, C. (2024) Disentangling prosody and timbre embeddings via voice conversion. Proc. Interspeech 2024, 2765-2769, doi: 10.21437/Interspeech.2024-207

### BibTeX citation
```bibtex
@inproceedings{gengembre24_interspeech,
  title     = {Disentangling prosody and timbre embeddings via voice conversion},
  author    = {Nicolas Gengembre and Olivier {Le Blouch} and Cédric Gendrot},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {2765--2769},
  doi       = {10.21437/Interspeech.2024-207},
  issn      = {2958-1796},
}
```

# License

Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)