Pro feature · v3.0 · Technical explainer
Cross-Recording
Voice ID.
Sarah on Tuesday's iPhone recording is the same Sarah on Friday's iPad recording, even in the cases where the upstream diarizer got it wrong.
This page is for evaluators in legal, medical, and consulting contexts. We will tell you how it works, where the failure modes live, and what the privacy posture is. No marketing.
Two problems that look like one.
Problem one: identity across recordings. The same person who showed up as "Sarah" in last week's iPhone meeting and "Speaker B" in yesterday's iPad call should be one identity in your library, not two.
Problem two: identity within a recording. Diarization sometimes merges two voices into a single hybrid label. Two people with similar fundamental frequencies, or one person speaking with a varied register, can end up as one "Speaker C" — half the lines belong to person X, half to person Y, and the transcript's attribution is silently wrong.
Both problems hurt downstream features. Promise Tracker can't track "Sarah's promises" if Sarah is six speakers in the library. Speaker Insights breaks if half of "Speaker C" is actually a different human. Compatibility Analysis is incoherent on a hybrid label.
Voice ID is the layer that fixes both.
Architecture.
Bonfiyah's voice identification is two passes layered on top of AssemblyAI's diarization. The first pass is a cohort-aware identity match across your devices. The second is a per-utterance ECAPA-TDNN re-clustering that splits hybrid clusters within a recording.
Pass 1 — Cohort-aware identity
A linked cohort across iPhone, iPad, Mac, and iCloud.
Your devices form a linked cohort. The cohort is bound by your iCloud account; every Bonfiyah-equipped device you sign into joins the same cohort. The speaker library is one library, scoped to the cohort, not to the device.
When a recording finishes diarizing on any device, the per-speaker voice signatures (256-dim embeddings derived from the audio) are matched against the cohort's existing speaker library. Above a calibrated similarity threshold, the new signature is bound to an existing speaker identity. Below the threshold, a new identity is created.
This is why Sarah is one person in your library whether the recording was on the iPhone in your pocket or the Mac on your desk — and why she's still one person if the iPhone recorded her two months ago.
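A minimal sketch of what that match looks like, assuming cosine similarity over a flat in-memory library; resolveIdentity, the tuple layout, and the 0.72 threshold are illustrative stand-ins, not the calibrated production values.

```swift
import Accelerate
import Foundation

/// Cosine similarity between two voice signatures.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, sqA: Float = 0, sqB: Float = 0
    vDSP_dotpr(a, 1, b, 1, &dot, vDSP_Length(a.count))
    vDSP_svesq(a, 1, &sqA, vDSP_Length(a.count))
    vDSP_svesq(b, 1, &sqB, vDSP_Length(b.count))
    return dot / (sqrt(sqA) * sqrt(sqB))
}

/// Bind a new per-speaker signature to the cohort library:
/// best match above threshold wins, otherwise a new identity is created.
func resolveIdentity(signature: [Float],
                     library: [(id: UUID, embedding: [Float])],
                     threshold: Float = 0.72) -> UUID {
    let best = library
        .map { (id: $0.id, score: cosineSimilarity(signature, $0.embedding)) }
        .max { $0.score < $1.score }
    if let best, best.score >= threshold {
        return best.id    // above threshold: bind to existing identity
    }
    return UUID()         // below threshold: new identity
}
```

In production the library lookup would be indexed rather than a linear scan, but the bind-or-create decision is the whole of pass 1.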
Pass 2 — Per-utterance re-clustering
ECAPA-TDNN splits hybrid AAI clusters automatically.
AssemblyAI returns a coarse diarization — typically a small set of speaker labels for the recording. When two speakers have similar voice characteristics, the AAI pass occasionally merges them into a single label. We've measured this on our own corpus and it happens in roughly 4–6% of multi-speaker recordings, almost always between same-gender speakers in similar recording conditions.
To catch this, every utterance gets a fine-grained ECAPA-TDNN embedding — a 192-dim speaker representation from the same family of architectures used in modern speaker-verification benchmarks (VoxCeleb, SITW). Within each AAI-assigned cluster, we re-cluster utterances by ECAPA distance. If two sub-clusters separate cleanly above a calibrated margin, the cluster is split, and the constituent utterances are re-routed to their actual speakers.
The split happens silently, before you ever see the transcript. The result is that hybrid labels — the most insidious kind of attribution error, because the transcript is internally consistent but materially wrong — are caught and corrected without any manual intervention.
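A sketch of the split decision, under simplifying assumptions: Euclidean distance on the per-utterance embeddings, a two-seed single-pass partition in place of a full re-clustering, and a spread-ratio test standing in for the calibrated margin. hybridSplit and the separation constant are illustrative, not the shipping logic.

```swift
import Foundation

/// Euclidean distance between two utterance embeddings.
func distance(_ a: [Float], _ b: [Float]) -> Float {
    sqrt(zip(a, b).reduce(0) { $0 + ($1.0 - $1.1) * ($1.0 - $1.1) })
}

/// Check one AAI-assigned cluster for a hybrid. Returns per-utterance
/// sub-cluster assignments if the cluster should split, or nil if it
/// looks like a single speaker.
func hybridSplit(_ embeddings: [[Float]], separation: Float = 2.0) -> [Int]? {
    guard embeddings.count >= 4 else { return nil }
    // Seed the two sub-clusters with the most distant pair of utterances.
    var seeds = (0, 1)
    for i in embeddings.indices {
        for j in (i + 1)..<embeddings.count
        where distance(embeddings[i], embeddings[j])
            > distance(embeddings[seeds.0], embeddings[seeds.1]) {
            seeds = (i, j)
        }
    }
    let (a, b) = (embeddings[seeds.0], embeddings[seeds.1])
    let labels = embeddings.map { distance($0, a) <= distance($0, b) ? 0 : 1 }
    // Split only if the sub-clusters separate cleanly: the gap between
    // seeds must dominate the average within-sub-cluster spread.
    let within = embeddings.indices
        .map { i in distance(embeddings[i], labels[i] == 0 ? a : b) }
        .reduce(0, +) / Float(embeddings.count)
    return distance(a, b) > separation * within ? labels : nil
}
```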
Where the inference runs
ECAPA on-device. Identity matching on-device.
The ECAPA-TDNN model ships in the app bundle (roughly 25 MB, quantized). Inference runs on Apple Silicon via Core ML — milliseconds per utterance, fully on-device; no audio leaves the device for this step.
The cohort identity match is a vector-distance computation against your speaker library, which lives in your private iCloud — not on Bonfiyah's servers. The cohort exists because iCloud is your private storage; sync runs through iCloud, never through Bonfiyah's network.
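The on-device step is ordinary Core ML. A sketch, assuming a compiled model resource named "ECAPA" with "waveform" and "embedding" feature names; all three are placeholders for the app's real identifiers.

```swift
import CoreML
import Foundation

/// Compute a speaker embedding for one utterance, fully on-device.
/// In practice the model would be loaded once and reused across calls.
func ecapaEmbedding(for samples: MLMultiArray) throws -> MLMultiArray {
    let url = Bundle.main.url(forResource: "ECAPA", withExtension: "mlmodelc")!
    let config = MLModelConfiguration()
    config.computeUnits = .all   // let Core ML schedule ANE / GPU / CPU
    let model = try MLModel(contentsOf: url, configuration: config)
    let input = try MLDictionaryFeatureProvider(
        dictionary: ["waveform": MLFeatureValue(multiArray: samples)])
    let output = try model.prediction(from: input)
    return output.featureValue(for: "embedding")!.multiArrayValue!
}
```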
For evaluators
Failure modes we've documented.
We are explicit about where the system can be wrong, because anyone evaluating this for a legal, medical, or research workflow needs the failure list.
Tight cohort, two siblings
Two voices that share substantial acoustic structure (siblings raised in the same household, identical twins) can fall below the splitting threshold. The library may bind them to one identity. Manual reassignment via the live picker resolves this; the resolved identity persists.
First-utterance attribution under noise
A speaker's very first utterance in a noisy environment can be mis-bound to an existing similar voice in the library, especially in a large library (hundreds of identities). After three to five utterances the identity stabilizes. The first-utterance edge case is rare but real.
Voice age drift
A child's voice changes substantially over months. A speaker library entry that hasn't been refreshed in >6 months may not match the same person today. The library's per-identity signature is updated on every successful match, so this resolves on its own with regular use.
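One plausible shape for that per-match refresh is an exponential moving average over matched signatures; this sketch assumes that approach, and the blend factor is illustrative, not the shipping value.

```swift
/// Pull the stored identity signature toward each newly matched
/// signature so gradual voice drift is absorbed over time.
/// alpha = 0.1 is an illustrative blend factor, not the shipping value.
func refreshedSignature(stored: [Float], matched: [Float],
                        alpha: Float = 0.1) -> [Float] {
    let blended = zip(stored, matched).map { (1 - alpha) * $0 + alpha * $1 }
    // If the library stores unit-length vectors for cosine matching,
    // re-normalize here before writing back.
    return blended
}
```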
Phone calls vs. in-room
A speaker captured over a low-bitrate phone call has different spectral characteristics than the same speaker captured in-room. We model this as channel variance, and our matching is robust to it within tested bounds — but extreme channel mismatches (e.g. a 4 kHz cellular call vs. studio-quality lavalier) can drop a match below threshold.
Our internal cohort accuracy on a 12,000-utterance evaluation set is 97.3% top-1 identity match against ground truth. The hybrid-cluster split pass corrects roughly 73% of AAI hybrid errors automatically; the rest are catchable via the Resplit Voice Matching surface inside the recording view.
What identity stability buys you downstream.
Promise Tracker
"What did Sarah promise me last week" only works if Sarah is one identity, regardless of which device captured the recording. Voice ID is the precondition.
Speaker Insights / People Memory
A profile of every person you've recorded with is only coherent if "every recording with Sarah" is computable. With a fragmented identity, the profile is fragmented; with a hybrid identity, the profile is wrong.
Compatibility Analysis
Pairwise frameworks (Attachment, Big Five, Gottman, Thomas-Kilmann) are anchored to two specific identities. Hybrid clusters break this entirely; we will not run Compatibility on a recording with unresolved hybrid labels.
Team Dynamics
A team is a selection of identities, with placements anchored across many recordings. Identity stability is what makes the longitudinal read possible — the system can show you Riley's placement migrating from low-cohesion to mid-cohesion across a quarter only because Riley is one identity.
Privacy
Voice signatures are biometric data. We treat them that way.
The 192-dim ECAPA embeddings are biometric identifiers under GDPR, BIPA, and most modern privacy regimes. They are stored encrypted in your private iCloud library, never sent to a Bonfiyah-controlled backend, and never shared across users. We do no cross-account voice search, ever.
Inference runs on-device via Core ML. The audio used to compute an embedding is the audio of the recording you already authorized; no additional capture takes place. Voice signatures cannot be inverted back into the original audio — they are not a recording, they are a numerical fingerprint of the voice.
If you delete a speaker from your library, the embedding is purged from the library and from any iCloud-synced copy. If you delete a recording, the embeddings derived from it that were not yet bound to a stable identity are purged with the recording.
If you uninstall Bonfiyah, the ECAPA model goes with the app. There is no off-device residue of the voice-ID system.
FAQ
Can the voice ID be used to identify someone outside my library?
No. There is no global voice database, no cross-account index, no "who is this voice on the internet" pathway. Identity is scoped to your speaker library, which is scoped to your iCloud. We are a memory layer, not a surveillance layer.
Is the ECAPA model's training data documented?
Yes. The model lineage is the public ECAPA-TDNN architecture trained on VoxCeleb 1+2 with augmentation, fine-tuned on a permissively licensed multi-language speaker corpus. We do not train on your transcripts or audio. The exact training recipe and the model weights' provenance are documented in our privacy policy.
Is BIPA-style consent required for the voice ID layer?
In jurisdictions where biometric-data handling requires informed consent (Illinois BIPA, similar regimes), our consent flow includes the voice-ID disclosure as part of the standard recording-consent capture. We are not lawyers and this is not legal advice; the disclosure is in plain language so you and your counsel can confirm it covers your jurisdiction.
What happens if AssemblyAI changes their diarization output format?
Our cohort layer and ECAPA pass operate on top of AAI's output but are not coupled to its specific schema. If AAI ships a model update that changes how often it merges speakers into hybrid labels, the ECAPA re-cluster catches more or fewer hybrids accordingly, and we recalibrate the splitting threshold against our evaluation set. We test against new AAI releases before promoting them.
Can I export my speaker library?
Yes. The export contains the named identities and their per-recording statistics. The raw embeddings are not exported by default; they are biometric data and we don't want them to leave the device casually. If you want them — for a portability reason or a compliance reason — there's an explicit "include embeddings" toggle that surfaces a clearer warning about what you're about to do.
Read the longer technical write-up
A more detailed PDF covering the cohort similarity threshold, the ECAPA splitting calibration, our internal evaluation methodology, and the BIPA / GDPR posture. Useful for evaluators.