
Contents

  1. Abstract
  2. Demos -- Anonymization
  3. Demos -- Accent Preservation


1. Abstract

Speaker anonymization aims to conceal a speaker's identity without degrading speech quality and intelligibility. Most speaker anonymization systems disentangle the speaker representation from the original speech and achieve anonymization by averaging or modifying the speaker representation. However, for out-of-distribution speakers, the anonymized speech suffers from reduced pseudo-speaker distinctiveness, speech quality, and intelligibility. To solve this issue, we propose SALT, a Speaker Anonymization system based on Latent space Transformation. Specifically, we extract latent features with a self-supervised feature extractor, randomly sample multiple speakers and their weights, and then interpolate the latent vectors to achieve speaker anonymization. In addition, we explore an extrapolation method to further extend the diversity of pseudo speakers. Experiments on the VoicePrivacy Challenge dataset show that our system achieves a state-of-the-art distinctiveness metric while preserving speech quality and intelligibility.
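
The interpolation step described in the abstract can be illustrated with a minimal sketch. This is not the SALT implementation: the speaker pool, the three-dimensional vectors, and the helper names (`sample_weights`, `interpolate`, `pseudo_speaker`) are hypothetical, and the weights are simply normalized uniform draws. A real system would operate on features from a self-supervised encoder rather than on toy vectors.

```python
import random


def sample_weights(k, rng):
    """Draw k positive weights that sum to 1 (a convex combination)."""
    raw = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def interpolate(latents, weights):
    """Return the weighted sum of equal-length latent vectors."""
    dim = len(latents[0])
    return [sum(w * vec[i] for w, vec in zip(weights, latents)) for i in range(dim)]


def pseudo_speaker(pool, k, seed=None):
    """Pick k speakers at random and blend their latent vectors."""
    rng = random.Random(seed)
    names = rng.sample(sorted(pool), k)   # which speakers to mix
    weights = sample_weights(k, rng)      # how much of each speaker
    mixed = interpolate([pool[n] for n in names], weights)
    return names, weights, mixed


# Toy pool of hypothetical speakers; the values are illustrative only.
pool = {
    "spk_a": [1.0, 0.0, 0.0],
    "spk_b": [0.0, 1.0, 0.0],
    "spk_c": [0.0, 0.0, 1.0],
    "spk_d": [1.0, 1.0, 1.0],
}

names, weights, mixed = pseudo_speaker(pool, k=3, seed=0)
print(names, [round(w, 3) for w in weights], [round(x, 3) for x in mixed])
```

Because the weights here are positive and sum to one, the result always lies inside the region spanned by the chosen speakers. In the generic sense, extrapolation would correspond to allowing weights outside the range [0, 1]; how the paper parameterizes this is not stated on this page.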



2. Demos -- Anonymization

Corresponding to Section 4 in our paper, the anonymized samples are listed below. We compare our proposed method (SALT) with the VoicePrivacy Challenge official baseline system (B1.a) and the top-ranked system (NWPU-ASLP).

Our proposed model is denoted as [B or L]-Sx-Px, where B or L indicates the WavLM-Base or WavLM-Large encoder, Sx indicates the scale factor s = x, and Px indicates the preservation factor p = x.
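
For readers who want to group the samples by configuration, the naming scheme above can be decoded with a small helper. This is only a sketch: the function name `parse_label` and the example label `L-S1.0-P0.5` are hypothetical, and it assumes that x is written as a plain non-negative number.

```python
import re

# Pattern for labels of the form [B or L]-Sx-Px, assuming x is a plain number.
LABEL_RE = re.compile(r"^([BL])-S(\d+(?:\.\d+)?)-P(\d+(?:\.\d+)?)$")

ENCODERS = {"B": "WavLM-Base", "L": "WavLM-Large"}


def parse_label(label):
    """Split a model label into encoder, scale factor s, and preservation factor p."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized model label: {label!r}")
    encoder, scale, preservation = match.groups()
    return {
        "encoder": ENCODERS[encoder],
        "scale": float(scale),
        "preservation": float(preservation),
    }


# Hypothetical label, used for illustration only.
print(parse_label("L-S1.0-P0.5"))
```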

3. Demos -- Accent Preservation

Corresponding to Section 4.1 in our paper, the ASR results for several speech clips with pronounced accents are listed below.

We notice that Whisper Large usually recognizes accented speech better than U2++. We hypothesize that this is due to the accent robustness of the Whisper ASR model.

Words marked in red are those pronounced with a strong accent.