Consumer Electronics · Product Development · Hardware + Software

Mira: The Smart Speaker That Doesn't Spy on You

A privacy-first smart speaker built from scratch: custom hardware, local LLM inference, beamforming microphone array, and native Home Assistant integration. No cloud, no data collection, no compromise on audio quality.

  • Wake to reply: <1.2s
  • Local LLM inference: 15 tok/s
  • Voice range: 7m+
  • Cloud dependency: Zero
  • On-device LLM: 7B parameters

1. The Problem

Every smart speaker on the market is a surveillance device that happens to play music. Amazon Echo sends your voice to Amazon's servers. Google Home sends it to Google. Apple HomePod sends it to Apple. They all require an internet connection to function. They all require an account. They all retain your audio data, and the opt-out mechanisms are buried, incomplete, or quietly reversed in terms of service updates.

For millions of privacy-conscious users, the choice has been binary: accept the surveillance or go without voice control entirely. That is an industry failure, plain and simple.

For the home automation community, the problem is even worse. Home Assistant users have spent years building local-first smart homes, keeping their data on their own hardware, avoiding cloud dependencies. But voice control still required phoning home. Every “turn off the kitchen lights” went through Amazon or Google before it reached the light switch three metres away.

We built Mira because we wanted a smart speaker that we'd put in our own homes. One that works entirely locally, integrates natively with Home Assistant, and sounds good enough that you'd choose it over a HomePod for music.

2. The Challenge

Running a smart speaker locally sounds simple until you try it. The big players have thousands of engineers and billions of dollars of GPU infrastructure dedicated to speech processing. We had a small team and an ARM chip.

The core technical challenge: fit an entire voice pipeline (wake word detection, speech-to-text, language understanding, intent extraction, text-to-speech) onto an embedded processor with enough headroom to run it all in real time, while simultaneously playing audio through a speaker that doesn't sound like a tin can.

The RK3588 gave us the compute. Eight ARM cores, a 6 TOPS neural processing unit, a Mali GPU capable of INT4 inference, and dedicated DSP cores for audio processing. Plenty of silicon, but the question was whether we could orchestrate it all within the latency budget. Users don't notice a reply that takes 500ms. They notice 3 seconds.

Audio quality matters too. Most DIY smart speakers sound terrible because their builders treat audio as an afterthought, a commodity speaker glued into a 3D-printed box. We wanted Mira to sound good enough that people would use it as their primary speaker, not a voice controller they tolerate.

3. The Voice Pipeline

The end-to-end voice pipeline runs six stages from microphone to speaker, all on-device, with total latency from wake word to spoken reply under 1.2 seconds.

  • Mic Array: 7× MEMS capsules, circular config. Far-field capture, 7m+ range. Always on.
  • DSP Pipeline: beamforming, AEC, noise cancellation. RK3588 DSP cores. ~5ms.
  • Wake Word: "Hey Mira" detection, VAD gating. Custom CNN, 98.6% accuracy. ~20ms.
  • Speech-to-Text: Whisper-based, on-device inference. Quantised model, multi-accent. ~300ms.
  • LLM Reasoning: 7B model, intent + entity extraction. llama.cpp, INT4, 15 tok/s. ~600ms.
  • Text-to-Speech: neural TTS, natural intonation. Piper TTS, 22kHz output. ~200ms.

Wake word → spoken reply: <1.2 seconds
Figure 1. Six-stage voice processing pipeline. All processing runs on the RK3588, wake word to spoken reply in under 1.2 seconds.
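
The per-stage figures above sum comfortably under the 1.2-second target (capture itself is continuous, so only the processing stages count). A quick sanity check:

```rust
fn main() {
    // Approximate per-stage latencies from the pipeline figure, in milliseconds.
    let stages = [("dsp", 5u32), ("wake word", 20), ("stt", 300), ("llm", 600), ("tts", 200)];
    let total: u32 = stages.iter().map(|&(_, ms)| ms).sum();
    println!("total pipeline latency: {} ms", total); // 1125 ms
    assert!(total < 1200); // under the 1.2s budget
}
```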

Wake Word Detection

The wake word engine runs continuously on the DSP cores, listening for “Hey Mira” without waking the main processor. A custom lightweight CNN trained on 50,000+ positive and negative samples achieves 98.6% detection accuracy with a false activation rate below 0.1%. On trigger, the engine hands off to the main pipeline and captures a 300ms audio buffer before the wake word for context.
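
The pre-roll capture can be sketched as a ring buffer that always holds the most recent audio, handed to the STT stage on trigger. A minimal sketch, assuming 16kHz mono capture; `PreRollBuffer` and its layout are illustrative, not the actual firmware types:

```rust
// Illustrative pre-roll buffer: retains the most recent samples so the
// audio preceding the wake word can be replayed into the STT stage.
struct PreRollBuffer {
    samples: Vec<i16>,
    write_pos: usize,
    filled: bool,
}

impl PreRollBuffer {
    // At 16 kHz mono, 300 ms of pre-roll is 4800 samples.
    fn new(capacity: usize) -> Self {
        PreRollBuffer { samples: vec![0; capacity], write_pos: 0, filled: false }
    }

    fn push(&mut self, sample: i16) {
        self.samples[self.write_pos] = sample;
        self.write_pos = (self.write_pos + 1) % self.samples.len();
        if self.write_pos == 0 {
            self.filled = true;
        }
    }

    // Oldest-to-newest snapshot, taken when the wake word fires.
    fn snapshot(&self) -> Vec<i16> {
        if !self.filled {
            return self.samples[..self.write_pos].to_vec();
        }
        let mut out = Vec::with_capacity(self.samples.len());
        out.extend_from_slice(&self.samples[self.write_pos..]);
        out.extend_from_slice(&self.samples[..self.write_pos]);
        out
    }
}

fn main() {
    let mut buf = PreRollBuffer::new(4);
    for s in [1i16, 2, 3, 4, 5, 6] {
        buf.push(s);
    }
    // Only the most recent 4 samples survive, oldest first.
    assert_eq!(buf.snapshot(), vec![3, 4, 5, 6]);
}
```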

Speech-to-Text

We run a quantised Whisper variant optimised for the RK3588's NPU. The model handles Australian English, British English, and American English accents natively, with graceful degradation on other accents. Transcription latency averages 300ms for a typical command, competitive with cloud STT services that have the advantage of data centre GPUs.

LLM Reasoning

Mira's reasoning runs on a 7B-parameter language model via llama.cpp with INT4 quantisation on the Mali GPU. The model is fine-tuned for home automation intent extraction, parsing natural language commands into structured actions: entity (which device), domain (lights, climate, media), action (turn on, set, play), and parameters (brightness, temperature, playlist). Inference runs at ~15 tokens/second, fast enough for conversational response times.
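
The structured action can be sketched as a plain record that renders to a Home Assistant service call. The field names and schema here are illustrative assumptions, not the fine-tune's actual output format:

```rust
// Hypothetical shape of the LLM's structured output.
#[derive(Debug, PartialEq)]
struct Intent {
    domain: String,                   // "climate", "light", "media_player", ...
    action: String,                   // "set_temperature", "turn_on", ...
    entity: String,                   // Home Assistant entity_id
    parameter: Option<(String, f64)>, // e.g. ("temperature", 22.0)
}

// Render the intent as a Home Assistant service call string.
fn to_service_call(i: &Intent) -> String {
    match &i.parameter {
        Some((name, value)) => format!(
            "{}.{}(entity: {}, {}: {})",
            i.domain, i.action, i.entity, name, value
        ),
        None => format!("{}.{}(entity: {})", i.domain, i.action, i.entity),
    }
}

fn main() {
    let intent = Intent {
        domain: "climate".into(),
        action: "set_temperature".into(),
        entity: "climate.bedroom".into(),
        parameter: Some(("temperature".into(), 22.0)),
    };
    assert_eq!(
        to_service_call(&intent),
        "climate.set_temperature(entity: climate.bedroom, temperature: 22)"
    );
}
```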

Text-to-Speech

Mira speaks using Piper TTS, a neural text-to-speech engine that produces natural-sounding speech at 22kHz. We fine-tuned the voice model for a neutral Australian accent with warm tonal characteristics. The TTS output is mixed with any currently-playing audio through the DSP, with automatic ducking so Mira's reply doesn't compete with the music.
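
The ducking behaviour can be modelled as a one-pole gain smoother updated per audio block. The floor and coefficients below are assumptions for illustration, not the shipped DSP tuning:

```rust
// One-pole smoothed ducking gain. The -12 dB floor (~0.25 linear) and
// the attack/release coefficients are illustrative assumptions.
fn duck_gain(current: f32, tts_active: bool) -> f32 {
    let target = if tts_active { 0.25 } else { 1.0 };
    let coeff = if tts_active { 0.20 } else { 0.05 }; // fast attack, slow release
    current + coeff * (target - current)
}

fn main() {
    let mut gain = 1.0;
    for _ in 0..20 {
        gain = duck_gain(gain, true); // Mira starts speaking: music falls to the floor
    }
    assert!(gain > 0.25 && gain < 0.3);
    for _ in 0..200 {
        gain = duck_gain(gain, false); // reply finished: music releases back to unity
    }
    assert!(gain > 0.99);
}
```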

4. The Microphone Array

Seven MEMS microphone capsules arranged in a circular pattern with 60° spacing. The centre microphone provides a reference signal; the outer six enable 360° beamforming and spatial noise rejection.

7× MEMS · 60° spacing · 40mm radius

  • Far-field range: 7+ metres. Reliable wake word detection in noisy environments.
  • Beam steering: 360° azimuth. Dynamically tracks the speaker's position in the room.
  • Noise rejection: -18dB. Spatial filtering suppresses off-axis interference.
  • Echo cancellation: full-duplex. Barge-in support: speak while Mira is playing audio.

Figure 2. MEMS microphone array configuration. 7 capsules, 40mm radius, 60° spacing. Beamforming enables 7m+ range with -18dB noise rejection.

The beamforming algorithm runs on the RK3588's dedicated DSP cores with sub-5ms latency. It steers the beam toward the speaker's position, tracking them as they move around the room, while suppressing off-axis noise by up to 18dB. Mira can reliably capture voice commands from across a noisy living room while music is playing from its own speaker.
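
The alignment step at the heart of beamforming can be shown with a minimal delay-and-sum sketch using the geometry above (40mm radius, far-field plane-wave assumption). The production pipeline adds adaptive weighting and runs on the DSP cores, so treat this as an illustration only:

```rust
const SPEED_OF_SOUND: f32 = 343.0; // m/s
const RADIUS: f32 = 0.040;         // 40mm capsule radius

// Far-field arrival-time offset (seconds) for a capsule at `mic_angle`,
// given a plane wave arriving from `steer_angle` (both radians).
// Capsules facing the source receive the wavefront early (negative offset).
fn capsule_offset(mic_angle: f32, steer_angle: f32) -> f32 {
    -RADIUS * (mic_angle - steer_angle).cos() / SPEED_OF_SOUND
}

// Shift each channel by an integer-sample delay and average. Signals from
// the steered direction add coherently; off-axis noise does not.
fn delay_and_sum(channels: &[Vec<f32>], delays: &[i32]) -> Vec<f32> {
    let len = channels[0].len();
    let mut out = vec![0.0; len];
    for (ch, &d) in channels.iter().zip(delays) {
        for n in 0..len {
            let src = n as i32 - d;
            if src >= 0 && (src as usize) < len {
                out[n] += ch[src as usize];
            }
        }
    }
    for v in &mut out {
        *v /= channels.len() as f32;
    }
    out
}

fn main() {
    // A capsule directly facing the source hears it early.
    assert!(capsule_offset(0.0, 0.0) < 0.0);

    // Two channels carrying the same impulse, the second lagging one sample.
    let channels = vec![vec![0.0, 0.0, 1.0, 0.0, 0.0], vec![0.0, 0.0, 0.0, 1.0, 0.0]];
    let aligned = delay_and_sum(&channels, &[0, -1]); // advance the lagging channel
    assert_eq!(aligned[2], 1.0); // coherent sum at the impulse position

    // Worst-case inter-capsule offset is 2R/c ≈ 233µs: under 4 samples at 16kHz.
    assert!(2.0 * RADIUS / SPEED_OF_SOUND < 0.000_25);
}
```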

Full-duplex acoustic echo cancellation enables barge-in: you can interrupt Mira mid-sentence without waiting for it to finish speaking. The AEC reference signal is tapped directly from the amplifier output, giving the cancellation algorithm a clean reference of exactly what's coming out of the speaker.
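
The core of such a canceller is a textbook adaptive filter: estimate the speaker-to-mic echo path from the amplifier reference and subtract the predicted echo. This single-channel NLMS sketch shows that idea, not the production full-duplex implementation:

```rust
// Single-channel normalised LMS (NLMS) echo canceller sketch.
struct Nlms {
    weights: Vec<f32>, // adaptive estimate of the echo path
    history: Vec<f32>, // recent reference samples, newest first
    mu: f32,           // adaptation step size
}

impl Nlms {
    fn new(taps: usize, mu: f32) -> Self {
        Nlms { weights: vec![0.0; taps], history: vec![0.0; taps], mu }
    }

    // `reference` is the amplifier output; `mic` is the capsule signal.
    // Returns the echo-cancelled residual for one sample.
    fn process(&mut self, reference: f32, mic: f32) -> f32 {
        self.history.rotate_right(1);
        self.history[0] = reference;
        let echo_estimate: f32 =
            self.weights.iter().zip(&self.history).map(|(w, x)| w * x).sum();
        let residual = mic - echo_estimate;
        let energy: f32 = self.history.iter().map(|x| x * x).sum::<f32>() + 1e-6;
        for (w, x) in self.weights.iter_mut().zip(&self.history) {
            *w += self.mu * residual * x / energy; // normalised gradient step
        }
        residual
    }
}

fn main() {
    // Simulated echo path: the mic hears the reference attenuated to 50%.
    let mut aec = Nlms::new(4, 0.5);
    let mut residual = 0.0;
    for n in 0..500 {
        let reference = (0.11 * n as f32).sin() + 0.5 * (0.37 * n as f32).sin();
        residual = aec.process(reference, 0.5 * reference);
    }
    // After convergence the echo is largely removed from the residual.
    assert!(residual.abs() < 0.05);
}
```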

5. Hardware Architecture

Mira is a ground-up hardware design. Every component was selected against performance, power, thermal, and acoustic requirements; nothing was lifted from a reference design, and nothing was compromised.

Compute

  • RK3588 SoC: 8-core ARM (4× A76 + 4× A55)
  • 6 TOPS NPU: dedicated neural accelerator
  • ARM Mali GPU: INT4 LLM inference via llama.cpp
  • 8GB LPDDR5: model + OS + audio buffers

Audio Input

  • 7× MEMS microphones: circular array, matched sensitivity
  • Beamforming DSP: spatial filtering, noise rejection
  • Echo cancellation: full-duplex AEC, barge-in support

Audio Output

  • TI TAS5805M: Class-D amplifier, 23W
  • Full-range driver: custom-tuned, neodymium motor
  • Passive radiator: extended bass, ported enclosure

Connectivity

  • Wi-Fi 6: 802.11ax, HA integration
  • Bluetooth 5.2: audio streaming, device pairing
  • USB-C: debug, firmware update, aux audio

Figure 3. Hardware architecture: compute, audio input, audio output, and connectivity subsystems.

6. Acoustic Design

Audio quality is where most smart speakers cut corners. A £50 Echo Dot uses a single 1.6-inch driver in a plastic shell. It sounds like a voice terminal that can also play background noise that vaguely resembles music.

Mira's acoustic design was led by our industrial designer with input from an acoustic engineer. The 52mm full-range driver uses a neodymium magnet motor and a paper/fibreglass composite cone for natural midrange reproduction. Two 65mm passive radiators in an opposed configuration extend bass response down to 55Hz while cancelling mechanical vibration: no buzz, no rattle, no resonance.

The ported aluminium enclosure is tuned to 62Hz, reinforcing the passive radiator output. The result is a speaker that produces genuine bass from a compact form factor, without the fake bass boost that consumer electronics companies typically apply.

The TI TAS5805M amplifier provides 23W of clean Class-D power with integrated DSP. We use the DSP for room-adaptive EQ correction, dynamic range compression (so voice replies and music are audibly balanced), and a brick-wall limiter that prevents distortion at high volumes.
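
A brick-wall limiter in its simplest form scales each block so its peak never crosses the ceiling. Real limiters add lookahead and release smoothing, so this sketch shows only the core idea:

```rust
// Per-block peak limiter: scale the block so no sample exceeds the
// ceiling, avoiding hard-clip distortion. Returns the gain applied.
fn limit_block(block: &mut [f32], ceiling: f32) -> f32 {
    let peak = block.iter().fold(0.0_f32, |p, &s| p.max(s.abs()));
    let gain = if peak > ceiling { ceiling / peak } else { 1.0 };
    for s in block.iter_mut() {
        *s *= gain;
    }
    gain
}

fn main() {
    let mut hot = [0.5, -1.6, 0.8];
    let gain = limit_block(&mut hot, 0.8);
    assert_eq!(gain, 0.5);
    assert_eq!(hot, [0.25, -0.8, 0.4]);

    let mut quiet = [0.1, -0.2];
    assert_eq!(limit_block(&mut quiet, 0.8), 1.0); // already under the ceiling
}
```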

The enclosure itself is machined from 6061 aluminium, not injection moulded plastic. Aluminium is acoustically dead (no resonant frequencies in the audible range), thermally conductive (it acts as a heatsink for the amplifier and SoC), and feels premium in the hand. Available in anodised black and natural silver.

An acoustic mesh covers the driver and microphone array, acoustically transparent, dust-resistant, and removable for cleaning. The mesh design was iterated through six prototypes to find the balance between acoustic transparency and physical protection.

TTS Output (22kHz PCM) → DSP Processing (EQ, limiter, crossover) → TAS5805M Amp (Class-D, 23W) → Full-range Driver (52mm, neodymium) → Passive Radiators (dual 65mm)

  • Amplifier: TI TAS5805M Class-D. 23W peak, integrated DSP, thermal management.
  • Driver: 52mm full-range. Neodymium magnet, paper/fibreglass cone, custom surround.
  • Passive radiator: 65mm × 2, dual opposed for vibration cancellation, extended bass to 55Hz.
  • Enclosure: ported, 380mL. Machined aluminium chassis, tuned to a 62Hz port frequency.
  • Frequency response: 55Hz – 20kHz, ±3dB, DSP-corrected room response curve.
  • Max SPL: 86dB @ 1m. Clean output, no audible distortion at rated power.

Figure 4. Audio output signal path and acoustic specifications.

7. Home Assistant Integration

We built Mira for Home Assistant from day one. Native, local, zero-latency integration over MQTT and the Home Assistant WebSocket API.

When you say “Hey Mira, turn off the kitchen lights,” the LLM parses the intent, identifies the entity in Home Assistant, and sends the command directly over the local network. The light turns off before you finish lowering your hand. No cloud round-trip. No server processing. No lag.

Mira understands HA's entity model natively: lights, switches, climate, media players, locks, covers, scenes, automations, scripts. The LLM maps natural language to HA service calls with full parameter support: "set the bedroom to 22 degrees" becomes climate.set_temperature(entity: climate.bedroom, temperature: 22).

Users configure Mira through a companion web app served locally on the device. The app includes a visual pipeline editor for creating custom voice commands and automation triggers, no coding required. Power users can also configure via YAML, because this is Home Assistant and that's how things are done.

Mira (voice command parsed by LLM) → MQTT + WebSocket → Home Assistant (local network, zero cloud)

  • Lights: "Turn off the kitchen lights"
  • Climate: "Set the bedroom to 22 degrees"
  • Media: "Play jazz in the living room"
  • Locks: "Lock the front door"
  • Scenes: "Activate movie night"
  • Automations: "Run the morning routine"

Figure 5. Home Assistant integration: MQTT + WebSocket over local network. Entity control across lights, climate, media, locks, scenes, and automations.

8. Privacy: The Whole Point

Privacy is the reason Mira exists. Every architectural decision was made through the lens of “does data need to leave this device?” The answer is always no.

No audio is ever transmitted anywhere. No text transcripts are stored beyond the current session. No usage analytics are collected. No accounts are required. Firmware updates never phone home unless you explicitly check for them. The device doesn't even need an internet connection to function, only a local network connection to Home Assistant.

The firmware is built on Yocto Linux and the entire software stack is open source. You can audit every line of code running on the device. You can build the firmware from source and flash it yourself. You can verify that nothing is being exfiltrated with a packet capture on your router. We don't ask for trust. We make trust unnecessary.

A physical hardware mute switch disconnects the microphone array at the circuit level: electrically broken, not software-muted. When the switch is engaged, an orange LED confirms the mics are dead. No firmware exploit can override a physical switch.

Privacy & Control            | Mira                    | Alexa / Google / Siri
Voice processing             | 100% on-device          | Cloud (Amazon/Google servers)
Data sent to cloud           | Nothing. Ever.          | All audio after wake word
Internet required            | No (fully offline)      | Yes (non-functional without)
Account required             | No                      | Yes (Google/Amazon/Apple)
Voice data used for training | No                      | Yes (opt-out, sometimes)
Third-party skills/actions   | Not needed (native HA)  | Required for most functions
Platform lock-in             | None                    | Deep ecosystem lock
Open source firmware         | Yes (Yocto-based)       | No
Table 1. Privacy and control comparison — Mira vs. Alexa, Google Home, and Siri.

9. Industrial Design

The enclosure is machined from 6061-T6 aluminium billet. Cylindrical form factor, 95mm diameter, 120mm height. The proportions were driven by acoustics: the internal volume needed for the ported enclosure design dictated the minimum dimensions, and we optimised the form around that constraint.

Available in two finishes: Type III anodised black and natural brushed silver. The anodising provides a hard, scratch-resistant surface that also serves as an electrical insulator for the aluminium chassis.

The base is weighted with a steel disc for stability. Mira doesn't slide when you press the top-mounted controls. A silicone ring on the bottom prevents surface scratching and provides acoustic isolation from the table or shelf.

Top-mounted capacitive touch controls handle volume, play/pause, and mute. An LED ring around the top indicates status: blue when listening, green when processing, white when playing audio, orange when muted. Brightness is ambient-adaptive and can be fully disabled for bedrooms.
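
The status mapping is simple enough to state as a table in code; the ambient-brightness curve below is an illustrative assumption, since the actual lux mapping isn't specified:

```rust
#[derive(Clone, Copy)]
enum Status { Listening, Processing, Playing, Muted }

// Status → LED ring colour, as described above.
fn ring_colour(s: Status) -> &'static str {
    match s {
        Status::Listening => "blue",
        Status::Processing => "green",
        Status::Playing => "white",
        Status::Muted => "orange",
    }
}

// Ambient-adaptive brightness: an assumed mapping of 0-500 lux onto
// 10-100% output, with a hard off for bedroom mode.
fn ring_brightness(ambient_lux: f32, enabled: bool) -> f32 {
    if !enabled {
        return 0.0;
    }
    0.1 + 0.9 * (ambient_lux / 500.0).clamp(0.0, 1.0)
}

fn main() {
    assert_eq!(ring_colour(Status::Muted), "orange");
    assert!((ring_brightness(500.0, true) - 1.0).abs() < 1e-6);
    assert_eq!(ring_brightness(1000.0, false), 0.0); // fully disabled for bedrooms
}
```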

10. Software Platform

Mira runs a custom Yocto Linux distribution built for the RK3588 platform. The OS is minimal, with no desktop environment and no unnecessary services. Boot to ready in under 8 seconds.

The voice pipeline is orchestrated by a Rust-based daemon that manages the audio capture pipeline, model inference scheduling, Home Assistant communication, and media playback. Rust was chosen for its memory safety guarantees and zero-cost abstractions, critical for a real-time audio system where a garbage collection pause is an audible glitch.
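
A stage-per-thread structure of this kind can be sketched with standard-library channels; the stage names and message types here are stand-ins, not the actual daemon's API:

```rust
use std::sync::mpsc;
use std::thread;

// Push audio frames through a stand-in STT stage on its own thread and
// collect the transcripts it emits. A bounded channel (mpsc::sync_channel)
// would add back-pressure so a slow stage never glitches the audio thread.
fn run_pipeline(frames: Vec<Vec<i16>>) -> Vec<String> {
    let (audio_tx, audio_rx) = mpsc::channel::<Vec<i16>>();
    let (text_tx, text_rx) = mpsc::channel::<String>();

    // Stand-in STT stage: consumes audio frames, emits transcripts.
    let stt = thread::spawn(move || {
        for frame in audio_rx {
            text_tx
                .send(format!("transcript of {} samples", frame.len()))
                .unwrap();
        }
        // text_tx dropped here, closing the downstream channel.
    });

    for frame in frames {
        audio_tx.send(frame).unwrap();
    }
    drop(audio_tx); // end of utterance: shuts the stage down cleanly

    let transcripts = text_rx.iter().collect();
    stt.join().unwrap();
    transcripts
}

fn main() {
    // One 30ms frame at 16kHz.
    let out = run_pipeline(vec![vec![0i16; 480]]);
    assert_eq!(out, vec!["transcript of 480 samples".to_string()]);
}
```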

OTA firmware updates are delivered as differential images, typically 20-50MB rather than the full 2GB rootfs. Updates are downloaded in the background and applied on next reboot with automatic rollback on failure. The device maintains two rootfs partitions (A/B) so a failed update never bricks the device.
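
The rollback logic amounts to a small state machine in the bootloader. This sketch uses assumed field names, since the actual scheme isn't documented here:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Slot { A, B }

// Boot metadata for an A/B scheme; field names are assumptions.
struct BootState {
    active: Slot,        // slot the latest update was written to
    tries_remaining: u8, // trial boots allowed before giving up on it
    healthy: bool,       // set once the new firmware marks itself good
}

fn select_slot(state: &mut BootState) -> Slot {
    if state.healthy {
        return state.active; // normal boot
    }
    if state.tries_remaining > 0 {
        state.tries_remaining -= 1; // trial boot of the new image
        return state.active;
    }
    // New image never reported healthy: fall back to the known-good slot.
    state.active = match state.active { Slot::A => Slot::B, Slot::B => Slot::A };
    state.healthy = true;
    state.active
}

fn main() {
    // Fresh update written to slot B, two trial boots allowed.
    let mut state = BootState { active: Slot::B, tries_remaining: 2, healthy: false };
    assert_eq!(select_slot(&mut state), Slot::B); // first trial boot
    assert_eq!(select_slot(&mut state), Slot::B); // second trial boot
    assert_eq!(select_slot(&mut state), Slot::A); // rollback: never marked healthy
    assert_eq!(select_slot(&mut state), Slot::A); // stays on the good slot
}
```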

The companion web app (React, served locally) provides device configuration, Home Assistant entity management, the visual pipeline editor, equaliser settings, and firmware management. It's accessible from any browser on the local network. No app store, no mobile app to install.

11. Where We Are

Mira is a fully functional product, from PCB layout to packaging. The hardware is finalised, the firmware is stable, and early units are in testing with home automation power users and accessibility-focused organisations who need reliable voice control without cloud dependencies.

Response latency sits under 1.2 seconds from wake word to spoken reply, competitive with cloud-based assistants on a local network. The beamforming array reliably captures voice commands from 7+ metres in a noisy room, tested with TV audio, kitchen appliances, and a two-year-old. (The two-year-old was the hardest noise source to suppress.)

The Home Assistant community has been particularly enthusiastic. Mira is the first voice-controlled device that integrates natively without requiring cloud bridges, third-party skills, custom firmware hacks, or the Wyoming protocol workarounds that current local voice solutions require. It works the way a smart speaker should have always worked.

We're currently working on multi-room audio (synchronised playback across multiple Mira units), speaker identification (different responses for different household members based on voice profiles, processed locally), and a Spotify Connect integration for users who want streaming alongside local media.

Technology Stack

Hardware

  • RK3588 SoC (custom PCB)
  • 7× MEMS mic array
  • TI TAS5805M amplifier
  • 6061-T6 aluminium enclosure

Firmware

  • Yocto Linux (custom distro)
  • Rust (voice pipeline daemon)
  • llama.cpp (INT4 quantised)
  • Piper TTS (neural voice)

AI / Audio

  • Custom wake word CNN
  • Whisper STT (quantised)
  • 7B LLM (fine-tuned)
  • Beamforming DSP pipeline

Integration

  • MQTT (Home Assistant)
  • HA WebSocket API
  • React companion app
  • A/B OTA updates

Local AI can transform consumer hardware. Let's talk about your project.
