This post uses pyannote.audio to split a recording by speaker (speaker diarization) and then identify who each speaker is.

pyannote/pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding


pip install pyannote.audio

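Scenario:

Given audio that contains several speakers, separate the utterances of the different speakers.
The voice features (embeddings) of a few people are known in advance. For each separated segment, compute the cosine distance to each known embedding and pick the person with the smallest cosine distance as the speaker.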
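The matching rule described above can be sketched without any audio at all. The following is a minimal illustration in pure Python, assuming toy 3-dimensional vectors in place of real speaker embeddings; the helper names `cosine_distance` and `nearest_speaker` are made up for this example and are not part of pyannote.audio.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_speaker(embedding, known_embeddings):
    """Return (speaker, distance) for the enrolled speaker closest to `embedding`.

    `known_embeddings` maps a speaker name to a list of that speaker's vectors;
    a speaker's distance is the minimum over all of their vectors.
    """
    distances = {
        speaker: min(cosine_distance(embedding, e) for e in embeddings)
        for speaker, embeddings in known_embeddings.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Toy vectors standing in for real embeddings (values invented for illustration)
known = {
    "mick": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "moon": [[0.0, 1.0, 0.0]],
}
speaker, distance = nearest_speaker([0.8, 0.2, 0.0], known)
print(speaker, round(distance, 3))
```

Note that this rule always returns one of the enrolled speakers, even when the segment belongs to someone who was never enrolled.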

# -*- coding: utf-8 -*-
# @Time : 2024/3/16 10:47
# @Author : Michael
# @File : spearker_rec.py
# @desc :
import torch
from pyannote.audio import Model, Pipeline, Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine


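def extract_speaker_embedding(pipeline, audio_file, speaker_label):
    """Return the embedding of the first turn labelled `speaker_label`, or None.

    Note: this runs the diarization pipeline on `audio_file` and relies on the
    module-level `inference` object created in the main block below.
    """
    diarization = pipeline(audio_file)
    speaker_embedding = None
    for turn, _, label in diarization.itertracks(yield_label=True):
        if label == speaker_label:
            # Embed only the time range of this turn
            segment = Segment(turn.start, turn.end)
            speaker_embedding = inference.crop(audio_file, segment)
            break
    return speaker_embedding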

# For the target audio, extract an embedding per speech turn and compare it with the enrolled speakers.
# Note: relies on the module-level `inference` and `speaker_embeddings` created in the main block.
def recognize_speaker(pipeline, audio_file):
    diarization = pipeline(audio_file)
    speaker_turns = []
    for turn, _, speaker_label in diarization.itertracks(yield_label=True):
        # Extract the voice embedding of this turn
        embedding = inference.crop(audio_file, turn)
        distances = {}
        for speaker, embeddings in speaker_embeddings.items():
            # Cosine distance to the closest known embedding of this speaker
            distances[speaker] = min([cosine(embedding, e) for e in embeddings])
        # Pick the speaker with the smallest distance
        recognized_speaker = min(distances, key=distances.get)
        # Record the turn's time range together with the predicted speaker
        speaker_turns.append((turn, recognized_speaker))
    return speaker_turns

if __name__ == "__main__":
    token = "hf_***"  # Replace with your own Hugging Face token

    # Load the speaker diarization pipeline
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=token,  # Accept the user conditions on the model page and obtain a Hugging Face token first
        # cache_dir="/home/huggingface/hub/models--pyannote--speaker-diarization-3.1/"
    )

    # Load the speaker embedding (voiceprint) model
    embed_model = Model.from_pretrained("pyannote/embedding", use_auth_token=token)
    inference = Inference(embed_model, window="whole")

    # pipeline.to(torch.device("cuda"))

    # Assume we have one audio file per known speaker, keyed by the person's name
    audio_files = {
        "mick": "mick.wav",  # mick's voice
        "moon": "moon.wav",  # moon's voice
    }
    speaker_embeddings = {}
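    for speaker, audio_file in audio_files.items():
        diarization = pipeline(audio_file)
        # extract_speaker_embedding() re-runs diarization internally, so call it once per
        # detected label instead of once per turn (which only appended duplicate embeddings)
        for speaker_label in diarization.labels():
            embedding = extract_speaker_embedding(pipeline, audio_file, speaker_label)
            # Store the voiceprint of the known speaker
            speaker_embeddings.setdefault(speaker, []).append(embedding)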

    # Given a new audio file whose speakers are unknown
    given_audio_file = "2_voice.wav"  # mick speaks in the first half, moon in the second half

    # Identify the speakers in the given audio
    recognized_speakers = recognize_speaker(pipeline, given_audio_file)
    print("Recognized speakers in the given audio:")
    for turn, speaker in recognized_speakers:
        print(f"Speaker {speaker} spoke from {turn.start:.2f}s to {turn.end:.2f}s")

Output:

Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.8.1+cu102, yours is 2.2.1+cpu. Bad things might happen unless you revert torch to 1.x.

Recognized speakers in the given audio:
Speaker mick spoke from 0.57s to 1.67s
Speaker moon spoke from 2.47s to 2.81s
Speaker moon spoke from 3.08s to 4.47s
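In the result above, moon appears as two consecutive turns separated by a short pause. If one entry per continuous stretch of speech is preferred, the turns can be merged in a post-processing step. The following is only a sketch under stated assumptions: turns are represented as plain `(start, end, speaker)` tuples rather than pyannote objects, and both the helper `merge_adjacent_turns` and the `max_gap` threshold of 0.5 seconds are hypothetical choices made for this illustration.

```python
def merge_adjacent_turns(turns, max_gap=0.5):
    """Merge consecutive turns by the same speaker when the pause between them is short.

    `turns` is a list of (start, end, speaker) tuples sorted by start time.
    `max_gap` is the longest pause, in seconds, that still counts as one turn.
    """
    merged = []
    for start, end, speaker in turns:
        if merged:
            prev_start, prev_end, prev_speaker = merged[-1]
            if speaker == prev_speaker and start - prev_end <= max_gap:
                # Extend the previous turn instead of adding a new one
                merged[-1] = (prev_start, max(prev_end, end), prev_speaker)
                continue
        merged.append((start, end, speaker))
    return merged


# Time ranges taken from the output shown above
turns = [
    (0.57, 1.67, "mick"),
    (2.47, 2.81, "moon"),
    (3.08, 4.47, "moon"),
]
for start, end, speaker in merge_adjacent_turns(turns):
    print(f"Speaker {speaker} spoke from {start:.2f}s to {end:.2f}s")
```

With the list returned by `recognize_speaker`, the tuples could be built as `(turn.start, turn.end, speaker)` for each entry.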