The system uses the earphone speakers themselves as microphones. Incoming audio is band-pass filtered to 50–1000 Hz, normalized, and divided into 20 ms frames, and speech is flagged when frame energy spikes above a dynamic threshold. The primary user’s voice is distinguished from ambient sound by its enhanced low-frequency (0–1000 Hz) energy, which arrives through bone conduction, and detection is then refined by an SVM classifier that analyzes harmonic continuity. Once speech is confirmed, the missing high-frequency components (2–8 kHz) of the wake-up word are reconstructed from a pre-recorded template in three steps: time alignment (syllable stretching based on pitch), frequency alignment (formant envelope matching), and energy scaling. All processing runs locally on a low-cost ESP32-based dongle with an ES8388 codec and a BLE radio; only authenticated wake-up commands are passed on, which preserves privacy and minimizes power consumption.
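The first two stages of this pipeline (frame-energy detection and the low-frequency energy cue) can be illustrated with a minimal Python sketch. It is not the system's actual implementation. The FFT-mask band-pass, the median-plus-k-times-MAD rule standing in for the "dynamic threshold", and the names `bandpass_fft`, `detect_speech_frames`, and `low_band_ratio` are all assumptions made for illustration; only the 50–1000 Hz band, the normalization step, the 20 ms frame length, and the 0–1000 Hz low band come from the description above.

```python
import numpy as np

def bandpass_fft(x, fs, lo=50.0, hi=1000.0):
    """Crude band-pass (assumption): zero all FFT bins outside [lo, hi] Hz."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def frame_energies(x, fs, frame_ms=20):
    """Mean-square energy of consecutive, non-overlapping frames."""
    n = int(fs * frame_ms / 1000)
    n_frames = len(x) // n
    frames = x[: n_frames * n].reshape(n_frames, n)
    return np.mean(frames ** 2, axis=1)

def detect_speech_frames(x, fs, k=8.0):
    """Flag frames whose energy exceeds a data-dependent threshold.

    The threshold rule (median + k * median absolute deviation) is an
    illustrative stand-in for the system's dynamic threshold.
    """
    y = bandpass_fft(np.asarray(x, dtype=float), fs)
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak                      # peak normalization
    energy = frame_energies(y, fs)
    med = np.median(energy)
    mad = np.median(np.abs(energy - med))
    threshold = med + k * mad
    return energy > threshold, energy, threshold

def low_band_ratio(x, fs, split_hz=1000.0):
    """Fraction of spectral energy at or below split_hz (hypothetical cue
    for the bone-conduction low-frequency boost)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    total = power.sum()
    return float(power[freqs <= split_hz].sum() / total) if total > 0 else 0.0
```

In a sketch like this, frames flagged by `detect_speech_frames` would be candidates for the user's speech, and a high `low_band_ratio` on those frames would hint at a bone-conducted voice rather than an ambient sound. The SVM refinement and the template-based reconstruction are not covered here.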
Description
This approach exploits the dual air- and bone-conduction pathways to create a distinctive acoustic signature, achieving a low false acceptance rate (FAR ≈ 1.4%) and robust wake-word recognition (≈90% stationary, ≈84% mobile). Because it needs no cloud processing, voice enrollment, or extra sensors, the fully on-device architecture, which costs under $9, makes hands-free activation accessible on standard earphones. The template-based high-frequency reconstruction outperforms traditional harmonic methods, reproducing formants accurately without sacrificing latency (≈200 ms) or battery life (≈19.3 h). This combination of precise speaker discrimination, privacy protection, and cost-effective design sets the system apart.
Applications
- Hands-free voice assistant activation
- Privacy-preserving voice command authentication
- Noise-robust wearable voice interfaces
- Low-cost bone conduction microphones
- Wireless earphone voice recognition
Advantages
- Hands-free wake-word activation on standard earphones
- Robust speaker discrimination via dual air/bone conduction detection
- High recognition accuracy (~90% stationary, ~84% mobile) with low false rejection and false acceptance rates
- Local low-latency processing (~200 ms) safeguards privacy by avoiding cloud transmission
- Low power consumption (~212 mW) enabling ~19 hours of continuous use
- Cost-effective ($8.30) dongle implementation compatible with wired and wireless earphones
- No specialized sensors required, ensuring broad compatibility across earphone types
IP Status
Patent Pending