MLX or CoreML? Both, and here is the table
A 600M-parameter local model beats Whisper Large v3 on a Mac. The backend choice is per-model, decided by latency and memory and GPU contention, not ideology.
10 Jun 2026 · 4 min
The framing people want is a winner. MLX or CoreML, pick one, build everything on it. I built an on-device speech stack for Apple silicon and the honest answer is both, chosen per model, because the two backends are good at different things and the differences are large enough to measure.
Start with the headline that surprised me most: a 600M-parameter ASR model running locally beats Whisper Large v3 on the same Mac. Local is not the compromise option anymore. It is, for this workload, the better option.
The table
LibriSpeech test-clean, n=200, word error rate and realtime factor:
| Model | Backend | WER | Realtime |
|---|---|---|---|
| Qwen3-ASR 1.7B, 8-bit | MLX | 1.52% | 30.5x |
| WhisperKit Large-v3 Turbo | CoreML / ANE | 1.71% | - |
| Parakeet TDT 0.6B | CoreML | 2.37% | 117x |
| Omnilingual 300M, 4-bit | MLX | 4.26% | 222x |
The smallest model, Omnilingual at 384MB, runs at 222x realtime and still lands under 5% WER. Parakeet gives you 117x at well under 3%. The accuracy leader, Qwen3-ASR, costs you speed but is still 30x faster than the audio is long. There is no single best row. There is a best row for "most accurate," a best row for "fastest," and a best row for "smallest," and they are different rows. That is the whole point of keeping both backends on the table.
Latency is not just throughput
Throughput hides a second axis. Voice activity detection runs on every chunk of audio, constantly, so its per-chunk latency matters more than its raw speed. Silero VAD on CoreML, on the Neural Engine, runs at roughly 0.27ms per chunk. The same model on MLX runs at roughly 2.1ms. That is 7.7x lower latency on CoreML, and the reason is not just the number: running VAD on the ANE leaves the GPU completely free for the ASR model that comes after it. On a device where everything shares one chip, getting a hot-path model off the GPU is worth as much as the latency win itself.
So VAD goes on CoreML and ANE. Not on principle. Because it is 7.7x faster there and it stops fighting the transcriber for the GPU.
The trap: embeddings are not portable
Here is the one that cost me a day. WeSpeaker speaker embeddings are not interchangeable across backends. MLX lays tensors out as NHWC, CoreML as NCHW, and the stats-pooling layer in the model flattens those two layouts differently. The embeddings come out structurally different. I measured cross-backend cosine similarity between embeddings of the same speaker at around 0.15, which is to say no usable similarity at all. If you enroll a speaker on MLX and then compare against an embedding computed on CoreML, you get garbage. So the rule is absolute: enroll and compare on the same backend, every time. The vectors are only meaningful relative to other vectors from the same layout.
A war story about masking
My first CoreML export of the ASR encoder was technically working and producing nonsense. WER 24.88%. The decoder kept emitting end-of-text right after the first period, as if every sentence was the last one.
The cause was attention masking. The export ran unmasked global self-attention over a mel spectrogram that had been zero-padded to a fixed length. With no mask, the padding was not ignored; it attended into the real frames and contaminated the tokens near the end, and the decoder read that contamination as "the audio is over." The fix was to rebuild the encoder with 100-frame chunks and block-attention windows, so attention stays inside real audio and never sees the padding. WER dropped from 24.88% to 1.52%. And because block attention is cheaper than global attention over the full padded sequence, encoder time also fell, from 113ms to 24ms. The correctness fix was also the performance fix, which does not happen often enough to expect it but is lovely when it does.
One more thing in that vein: pre-compile your Metal shader library. If you let it JIT-compile the shaders on first use, you eat that cost at runtime. Pre-compiling was worth about 5x on the affected path. It is a build-step change, not a code change, and it is free speed.
What local-first actually buys you
The reason any of this is mine to decide is that the whole stack runs on the device. There is no remote endpoint quietly making the backend choice for me. So I get to put VAD on the ANE, the speaker model wherever I enrolled it, and the ASR model on whichever backend wins for the accuracy and memory I need that day. Local-first does not mean a slower, sadder version of the cloud. It means you own the backend choice, per model, and you make it with a table in front of you instead of an ideology behind you.
- Estimate speaker count with the Laplacian eigengapDiarization needs to guess how many speakers are in a clip. The eigengap of the graph Laplacian gives you k without asking for it up front.Snippet
- whisper_scheduleA recording goes in, a speaker-labelled transcript comes out.Lab
- Chunk audio with VAD before you transcribeFeeding long silent audio to a transcriber wastes time and money. Split on speech first using webrtcvad, then send only the chunks that contain voice.Snippet
- Voice that never leaves the deviceTranscription, diarisation, and speech running entirely on Apple silicon, and why keeping voice local is a product decision before it is a technical one.Musing
- speech-swiftAn on-device speech stack for Apple Silicon, in Swift.Tool