🎙️ Speaker Identifier

Finds unique speakers across multiple Hugging Face audio datasets. Each dataset is assumed to contain a single speaker (e.g. an audiobook). The app fetches audio directly via the datasets-server API (no full parquet download), downloads clips in parallel, then embeds them across 8 parallel workers.
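As a rough illustration of the fetch step, the sketch below builds a request URL for the datasets-server `/rows` endpoint and runs one request per dataset in a thread pool. The helper names `rows_url` and `fetch_all`, the default `config`/`split` values, and the worker count are assumptions made for this example, not the app's actual code; the `fetch` callable is passed in so the sketch stays independent of any particular HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

ROWS_ENDPOINT = "https://datasets-server.huggingface.co/rows"


def rows_url(dataset, config="default", split="train", offset=0, length=3):
    """Build a /rows URL that asks for a small slice of one dataset.

    The config and split defaults are assumptions; real datasets may
    use other names.
    """
    query = urlencode(
        {
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "length": length,
        }
    )
    return f"{ROWS_ENDPOINT}?{query}"


def fetch_all(datasets, fetch, max_workers=8):
    """Call fetch(url) once per dataset, concurrently.

    `fetch` is any callable taking a URL (hypothetical here); results
    come back in the same order as `datasets`.
    """
    urls = [rows_url(name) for name in datasets]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

In practice `fetch` would perform an HTTP GET (adding an `Authorization: Bearer <token>` header for private repos) and parse the JSON response; only a few rows are requested, so the full parquet files are never downloaded.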

Model: microsoft/wavlm-base-plus-sv. Its speaker embeddings are language-agnostic, so it works for audio in any language.

How to use

1. Paste dataset repo IDs (one per line, owner/name format) into the left box.
2. Adjust parameters if needed (defaults work well for audiobooks):
   - Samples per book: audio chunks to average per dataset. More = more robust, slower. 3 is usually enough.
   - Audio length (sec): seconds of each chunk to use for embedding. 5 sec is sufficient for a clear voice.
   - Same-speaker threshold: cosine similarity cutoff. Raise if too many books merge into one speaker; lower if one person gets split across IDs.
   - HF Token: only needed for private repos.
3. Click Identify Speakers. Downloads run in parallel, then all clips are embedded in one batch. Expect ~20–40 sec for 35 books.

Output columns

Column	Meaning
`dataset`	Repo name (short)
`speaker_id`	Cluster label — same ID = same voice
`books_with_speaker`	How many books share this speaker
`intra_sim`	Avg cosine similarity within cluster (1.0 = only one book; lower = cluster is less tight)
`closest_match`	Most similar other book and similarity score
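The last two columns can be read as simple functions of the per-book embeddings. The sketch below shows one plausible way to compute them: `intra_sim` as the mean pairwise cosine similarity inside a cluster, and `closest_match` as the most similar other book. Whether the app averages pairwise similarities or compares against a cluster centroid is not stated, so treat the definitions and function names here as assumptions.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def intra_sim(members, embeddings):
    """Mean pairwise cosine similarity within one cluster.

    A cluster with a single book has no pairs, so it reports 1.0.
    """
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0
    total = sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)


def closest_match(name, embeddings):
    """Return (other_book, similarity) for the most similar other book.

    Returns None when there is no other book to compare against.
    """
    scored = [
        (cosine(embeddings[name], vec), other)
        for other, vec in embeddings.items()
        if other != name
    ]
    if not scored:
        return None
    score, other = max(scored)
    return other, score
```

A `closest_match` score just below the threshold for a book in a different cluster is a hint that the two books may share a narrator and the threshold is slightly too strict.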

Tip: Sort by speaker_id to see all books by the same narrator grouped together.

The Errors / Timing box shows per-dataset timing and batch embed stats — useful for diagnosing slow datasets.

Results — sorted by speaker_id
