University of Pittsburgh

Local Earbud Speech Processing System for Privacy-Preserving Wake Word Activation

The system uses the earphone speakers themselves as microphones. Incoming audio is band-pass filtered to 50–1000 Hz, normalized, and divided into 20 ms frames; frames whose energy spikes above a dynamic threshold are treated as candidate speech. The primary user’s voice is distinguished from ambient sound by its enhanced low-frequency (0–1000 Hz) energy, which comes from bone conduction, and detection is then refined by an SVM classifier that analyzes harmonic continuity. Once speech is confirmed, the missing high-frequency components (2–8 kHz) of the wake-up word are reconstructed from a pre-recorded template in three steps: time alignment (syllable stretching based on pitch), frequency alignment (formant envelope matching), and energy scaling. All processing runs locally on a low-cost ESP32-based dongle with an ES8388 codec and a BLE radio. Only authenticated wake-up commands are let through, which preserves privacy and minimizes power consumption.
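The framing and thresholding stage described above can be sketched as follows. This is a minimal illustration, not the dongle firmware: the FFT-masking band-pass filter, the median-plus-MAD threshold rule, the factor `k`, and all function names are assumptions chosen for clarity. The summary states only the 50–1000 Hz band, the normalization step, the 20 ms frames, and the use of a dynamic threshold.

```python
import numpy as np

FRAME_MS = 20               # frame length stated in the summary
BAND_HZ = (50.0, 1000.0)    # pass band stated in the summary


def bandpass_fft(signal, sample_rate, low_hz, high_hz):
    """Crude band-pass (assumed method): zero FFT bins outside the band."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


def frame_energies(signal, sample_rate, frame_ms=FRAME_MS):
    """Split into non-overlapping frames; return mean-square energy per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)


def detect_speech_frames(signal, sample_rate, k=5.0):
    """Flag frames whose energy exceeds a dynamic threshold.

    Assumption: threshold = median + k * MAD of the frame energies.
    The source does not say how its dynamic threshold is computed.
    """
    filtered = bandpass_fft(np.asarray(signal, dtype=float), sample_rate, *BAND_HZ)
    peak = np.max(np.abs(filtered))
    if peak > 0:
        filtered = filtered / peak          # peak normalization
    energy = frame_energies(filtered, sample_rate)
    median = np.median(energy)
    mad = np.median(np.abs(energy - median))
    threshold = median + k * mad
    return energy > threshold, energy, threshold
```

Calling `detect_speech_frames(samples, 16000)` on a mono recording returns a boolean mask with one entry per 20 ms frame, along with the frame energies and the threshold used. A robust statistic such as the median is used here so that a short burst of speech does not raise its own threshold; a running noise-floor estimate would be another reasonable choice.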

Description

This approach exploits the dual air- and bone-conduction pathways to create a distinctive acoustic signature, achieving a low false acceptance rate (FAR ≈ 1.4%) and robust wake-word recognition (≈90% when stationary, ≈84% when mobile). Because it avoids cloud processing, voice enrollment, and extra sensors, the fully on-device architecture, which costs under $9, makes hands-free activation accessible on standard earphones. The template-based high-frequency reconstruction outperforms traditional harmonic methods, reproducing formants accurately without sacrificing latency (≈200 ms) or battery life (≈19.3 h). This combination of precise speaker discrimination, privacy protection, and cost-effective design sets it apart.
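One simple way to picture the bone-conduction signature is as the share of a signal's energy that lies at or below 1000 Hz: the wearer's own speech should score high, while ambient sound should score lower. The sketch below computes that share. The function name and the idea of comparing the ratio against a cutoff are illustrative assumptions; the summary does not specify the actual feature, and it states that an SVM classifier refines the decision.

```python
import numpy as np


def low_band_energy_ratio(signal, sample_rate, cutoff_hz=1000.0):
    """Fraction of spectral energy at or below cutoff_hz (hypothetical feature)."""
    power = np.abs(np.fft.rfft(np.asarray(signal, dtype=float))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = power.sum()
    if total == 0:
        return 0.0                      # silent input: no energy in any band
    return float(power[freqs <= cutoff_hz].sum() / total)
```

A segment dominated by low-frequency energy yields a ratio near 1, and a segment dominated by energy above 1 kHz yields a ratio near 0. Any decision cutoff applied to this ratio would have to be tuned on real recordings.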

Applications

- Hands-free voice assistant activation
- Privacy-preserving voice command authentication
- Noise-robust wearable voice interfaces
- Low-cost bone conduction microphones
- Wireless earphone voice recognition

Advantages

- Hands-free wake-word activation on standard earphones
- Robust speaker discrimination via dual air/bone conduction detection
- High recognition accuracy (~90% stationary, ~84% mobile) with low false-rejection and false-acceptance rates
- Local low-latency processing (~200 ms) safeguards privacy by avoiding cloud transmission
- Low power consumption (~212 mW) enabling ~19 hours of continuous use
- Cost-effective ($8.30) dongle implementation compatible with wired and wireless earphones
- No specialized sensors required, ensuring broad compatibility across earphone types

IP Status

Patent Pending