Speaker diarization is the process of partitioning an audio recording by speaker — determining "who spoke when" — so a transcript can be split into labeled turns instead of one undifferentiated block of text. It's what turns a raw transcript into a readable conversation.
How does speaker diarization work?
Diarization typically has three stages:
- Segmentation. The audio is divided into short segments, with boundaries placed where the speaker appears to change.
- Embedding. Each segment is turned into a voice embedding — a numerical fingerprint of the voice's characteristics.
- Clustering. Segments with similar embeddings are grouped, so all of "Speaker A" lands together, separate from "Speaker B."
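The clustering stage can be illustrated with a minimal sketch. It assumes each segment already has an embedding (here, made-up 3-dimensional vectors; real voice embeddings have far more dimensions) and uses a simple greedy rule with a hypothetical similarity threshold of 0.8. Production systems often use other clustering methods, so treat this as an illustration of the idea rather than how any particular tool does it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_segments(embeddings, threshold=0.8):
    """Greedy clustering: attach each segment to the most similar existing
    speaker centroid, or start a new speaker if none is similar enough.
    The threshold is an illustrative assumption, not a recommended value."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No centroid is close enough: treat this as a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean embedding for the matched speaker.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels

# Toy embeddings: segments 0 and 2 sound alike, as do segments 1 and 3.
segments = [
    [0.90, 0.10, 0.00],
    [0.10, 0.90, 0.10],
    [0.85, 0.15, 0.05],
    [0.05, 0.95, 0.00],
]
print(cluster_segments(segments))  # [0, 1, 0, 1]
```

Note that the labels are arbitrary indices: the clustering only says which segments belong together, not who the speakers are.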
The output is a timeline of speaker turns (Speaker 0 from 0:00–0:12, Speaker 1 from 0:12–0:20, and so on) that gets aligned with the transcript.
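As a rough sketch of that alignment step, the snippet below merges consecutive same-speaker segments into turns and then labels each transcript word by the turn containing the word's midpoint. The tuple layouts, the function names, and the midpoint rule are all assumptions made for illustration; real transcription systems may align words and turns differently.

```python
def merge_turns(segments):
    """Collapse consecutive (start, end, speaker) segments that share a
    speaker into a single turn."""
    turns = []
    for start, end, speaker in segments:
        if turns and turns[-1][2] == speaker:
            turns[-1] = (turns[-1][0], end, speaker)
        else:
            turns.append((start, end, speaker))
    return turns

def assign_words(words, turns):
    """Label each (text, start, end) word with the speaker whose turn
    contains the word's midpoint, or None if no turn covers it."""
    labeled = []
    for text, w_start, w_end in words:
        mid = (w_start + w_end) / 2
        speaker = next(
            (s for t_start, t_end, s in turns if t_start <= mid < t_end),
            None,
        )
        labeled.append((speaker, text))
    return labeled

# Times are in seconds; speakers are the arbitrary indices from clustering.
segments = [(0.0, 6.0, 0), (6.0, 12.0, 0), (12.0, 20.0, 1)]
turns = merge_turns(segments)
words = [("Hello", 0.5, 0.9), ("everyone", 1.0, 1.6), ("Thanks", 12.2, 12.7)]

print(turns)                       # [(0.0, 12.0, 0), (12.0, 20.0, 1)]
print(assign_words(words, turns))  # [(0, 'Hello'), (0, 'everyone'), (1, 'Thanks')]
```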
Diarization vs. speaker recognition: what's the difference?
These two are often confused:
| Term | Question it answers | Needs to know you in advance? |
|---|---|---|
| Diarization | "How many speakers, and when did each talk?" | No |
| Speaker recognition | "Which named person is this?" | Yes — a known voiceprint |
Diarization tells you there are three distinct speakers; recognition tells you they're Maya, David, and you. The best note-taking tools combine both: diarize automatically, then let you label a speaker once so recognition names them on every future recording.
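One way to picture the hand-off from diarization to recognition is the sketch below. It assumes diarization has produced one average embedding per anonymous speaker, and that a few people have stored voiceprints from earlier labeling. The vectors, the names `name_speakers` and `voiceprints`, and the 0.75 threshold are hypothetical; this is not a description of any specific product's matching logic.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def name_speakers(centroids, voiceprints, threshold=0.75):
    """Map each anonymous speaker index to the best-matching enrolled name.
    Speakers with no sufficiently similar voiceprint keep a generic label."""
    names = {}
    for label, centroid in centroids.items():
        best_name, best_sim = None, threshold
        for name, voiceprint in voiceprints.items():
            sim = cosine(centroid, voiceprint)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        names[label] = best_name or f"Speaker {label}"
    return names

# Output of diarization: three anonymous speakers (toy embeddings).
centroids = {0: [0.9, 0.1, 0.0], 1: [0.1, 0.9, 0.1], 2: [0.0, 0.1, 0.95]}
# Voiceprints saved when the user labeled speakers in earlier recordings.
voiceprints = {"Maya": [0.88, 0.12, 0.02], "David": [0.08, 0.92, 0.05]}

print(name_speakers(centroids, voiceprints))
# {0: 'Maya', 1: 'David', 2: 'Speaker 2'}
```

The third speaker stays "Speaker 2" because nobody with a matching voiceprint has been labeled yet, which is exactly the case diarization alone can still handle.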
Why does diarization matter?
Without it, a meeting transcript is an unreadable wall of text and you can't tell who committed to what. With it:
- Action items get attributed to the right person.
- You can ask "what did Priya say about the budget?" and get a real answer.
- Recurring participants become a searchable directory of voices.
How accurate is it?
Diarization quality is usually measured by diarization error rate (DER): the total time that is attributed to the wrong speaker, missed entirely, or falsely labeled as speech, divided by the total time someone is actually speaking. Lower is better. Clean two-person audio usually diarizes well; accuracy drops with overlapping speech, many speakers, or poor microphones. Letting the system learn voices over time improves how speakers are named on later recordings.
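To make the metric concrete: DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below computes it over equal-length frames, under simplifying assumptions: one speaker per frame (no overlap), `None` for silence, and a brute-force search for the best mapping between hypothesis and reference labels, which is only practical for a handful of speakers. Standard scoring tools also handle overlap and often apply a forgiveness collar around boundaries, which this toy version omits.

```python
from itertools import permutations

def der(reference, hypothesis):
    """Frame-level diarization error rate for non-overlapping speech.
    Each list holds one label per frame; None marks silence."""
    pairs = list(zip(reference, hypothesis))
    ref_speech = sum(r is not None for r, _ in pairs)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")

    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    ref_ids = sorted({r for r in reference if r is not None})
    hyp_ids = sorted({h for h in hypothesis if h is not None})
    # Hypothesis labels are arbitrary, so try every one-to-one mapping onto
    # reference labels (padding with None lets a label stay unmatched) and
    # keep the mapping that yields the least speaker confusion.
    targets = ref_ids + [None] * len(hyp_ids)
    confusion = min(
        sum(r is not None and h is not None and mapping[h] != r
            for r, h in pairs)
        for mapping in (dict(zip(hyp_ids, perm))
                        for perm in permutations(targets, len(hyp_ids)))
    )
    return (missed + false_alarm + confusion) / ref_speech

# Ten frames: one confused frame, one false alarm, one missed frame.
reference  = ["A", "A", "A", "A", "B", "B", "B", None, "B", "B"]
hypothesis = [0,   0,   0,   1,   1,   1,   1,   1,    None, 1]

print(round(der(reference, hypothesis), 3))  # 0.333
```

Here three of the nine reference speech frames are in error, giving a DER of about 33%. Because false alarms count in the numerator but not the denominator, DER can exceed 100% on very noisy output.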
How Remindr uses diarization
Remindr diarizes every meeting, then matches voices to people you've labeled — so transcripts and summaries name participants instead of saying "Speaker 2." Learn how that plays out in real meetings on the meetings use-case page, or see the full picture in how AI meeting notes work.