
Speaker diarization and speaker identification with pyannote.audio

pip install pyannote.audio
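pyannote.audio 3.x downloads its pretrained pipelines from Hugging Face, so before running the script you need to accept the user conditions on the pyannote/speaker-diarization-3.1 and pyannote/embedding model pages and create an access token. A quick sanity check that the package is installed (the version shown is just an example):

import pyannote.audio
print(pyannote.audio.__version__)  # e.g. 3.1.1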

Scenario:

An audio clip contains several speakers, and we want to separate what each person said into segments.
Given voice embeddings of some known speakers, compare each separated segment's embedding against them with cosine distance; the known speaker with the smallest distance is taken as the one speaking (see the small sketch below).
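The matching rule is just scipy's cosine distance over embedding vectors. A minimal sketch with made-up 3-dimensional vectors (real pyannote embeddings are much longer, on the order of a few hundred dimensions):

import numpy as np
from scipy.spatial.distance import cosine

known = {
    "mick": np.array([0.9, 0.1, 0.0]),  # made-up reference embedding
    "moon": np.array([0.1, 0.8, 0.2]),  # made-up reference embedding
}
segment_embedding = np.array([0.85, 0.15, 0.05])  # embedding of an unknown segment

distances = {name: cosine(segment_embedding, ref) for name, ref in known.items()}
print(min(distances, key=distances.get))  # -> mick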

# _*_ coding: utf-8 _*_
# @Time : 2024/3/16 10:47
# @Author : Michael
# @File : speaker_rec.py
# @desc :
import torch
from pyannote.audio import Model, Pipeline, Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine


def extract_speaker_embedding(pipeline, audio_file, speaker_label):
    """Return the embedding of the first segment attributed to speaker_label."""
    diarization = pipeline(audio_file)
    speaker_embedding = None
    for turn, _, label in diarization.itertracks(yield_label=True):
        if label == speaker_label:
            segment = Segment(turn.start, turn.end)
            # `inference` is the module-level Inference object created in __main__
            speaker_embedding = inference.crop(audio_file, segment)
            break
    return speaker_embedding

# For a given audio file, extract an embedding for each segment and compare it
# against the embeddings of the known speakers
def recognize_speaker(pipeline, audio_file):
    diarization = pipeline(audio_file)
    speaker_turns = []
    for turn, _, speaker_label in diarization.itertracks(yield_label=True):
        # Extract the voice embedding of this segment
        embedding = inference.crop(audio_file, turn)
        distances = {}
        for speaker, embeddings in speaker_embeddings.items():
            # Cosine distance to each known speaker's reference embeddings
            distances[speaker] = min([cosine(embedding, e) for e in embeddings])
        # Pick the speaker with the smallest distance
        recognized_speaker = min(distances, key=distances.get)
        # Record the segment together with the predicted speaker
        speaker_turns.append((turn, recognized_speaker))
    return speaker_turns

if __name__ == "__main__":
    token = "hf_***"  # Replace with your Hugging Face token

    # Load the speaker diarization pipeline
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=token,  # Accept the user conditions on the model page and get a Hugging Face token
        # cache_dir="/home/huggingface/hub/models--pyannote--speaker-diarization-3.1/"
    )

    # Load the speaker embedding model
    embed_model = Model.from_pretrained("pyannote/embedding", use_auth_token=token)
    inference = Inference(embed_model, window="whole")

    # pipeline.to(torch.device("cuda"))

    # Suppose you already have a set of audio files of different speakers, with the corresponding names
    audio_files = {
        "mick": "mick.wav",  # mick's audio
        "moon": "moon.wav",  # moon's audio
    }
    speaker_embeddings = {}
    for speaker, audio_file in audio_files.items():
        diarization = pipeline(audio_file)
        for turn, _, speaker_label in diarization.itertracks(yield_label=True):
            embedding = extract_speaker_embedding(pipeline, audio_file, speaker_label)
            # Store the reference embedding of the known speaker
            speaker_embeddings.setdefault(speaker, []).append(embedding)

    # A new audio file containing unknown speakers
    given_audio_file = "2_voice.wav"  # the first half is mick speaking, the second half is moon

    # Recognize the speakers in the given audio
    recognized_speakers = recognize_speaker(pipeline, given_audio_file)
    print("Recognized speakers in the given audio:")
    for turn, speaker in recognized_speakers:
        print(f"Speaker {speaker} spoke between {turn.start:.2f}s and {turn.end:.2f}s")

Output:

Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.8.1+cu102, yours is 2.2.1+cpu. Bad things might happen unless you revert torch to 1.x.

Recognized speakers in the given audio:
Speaker mick spoke between 0.57s and 1.67s
Speaker moon spoke between 2.47s and 2.81s
Speaker moon spoke between 3.08s and 4.47s