Wake word detection may be possible on an ESP32, however there is significant engineering work required: collect training data (i.e. MANY thousands of voice recordings), process the clips into a trained model, then shrink the model into software capable of being deployed onto low-CPU edge devices to run continuously. It is a classic cost / complexity / quality trade-off, just as has recently been demonstrated with TTS and STT (high quality needs big hardware). So: it may be possible, but not for some time, and small CPUs are likely to limit the quality (i.e. accuracy, giving both false and missed triggers).

Using an ESP32 with a mic array and streaming voice to a larger device (an Intel NUC, perhaps?) running the voice models works for STT where you're using push-to-talk (only record when PTT is active), but could make a real mess of your network if attempted for wake word detection, as it needs to run continuously, 24x7.

FOSS projects like Mozilla Common Voice are collecting voice samples to help open projects train models, but I'm not sure what Nabu Casa is planning. I'd expect a project to ask for donated wake word training data (e.g. recordings of lots of different voices saying "Hey NAME_GOES_HERE").

Mycroft.ai built an open-source wake word detection system that works on an RPi 3, however this took (from memory) about two years, and the result is specific to their wake word. I spent some time donating voice samples to Mycroft, but sadly they ran out of money (on things like beating a patent troll) and their second hardware crowd-funding attempt failed (I personally lost several hundred pounds). It is no coincidence that Michael Hansen was employed for a while by Mycroft before creating Rhasspy and joining Nabu Casa.

Willow uses the absolute latest ESP-SR framework with its Audio Front End (AFE) framework. We place the AFE between the dual-mic I2S hardware and everything downstream, so that all audio fed to wake word detection, on-device recognition, and audio streaming to the inference server has already been cleaned up for acoustically challenging environments (acoustic echo, noise, etc.). Additionally, the ESP BOX enclosure has been acoustically engineered by Espressif, with tuned microphone cavities, etc. Because of this functionality, ESP-SR has actually been tested and certified by Amazon themselves (I see the irony) for use as an Alexa platform device.

Wake word detection is instant, as in imperceptible, and the VAD timeout is currently set to 100 ms. We have a multitude of ESP-SR and ESP-DSP tuning parameters for all of these features.

Also, while it is the same engine, we use the Alexa, Hi ESP, and Hi Lexin wake words, which have been trained and recorded by Espressif and professional audio engineers on 20,000 speech samples across 500 individual speakers (a mix of genders, including 100 children) at distances of 1-3 m. Looking at this process (which is pretty much the industry standard for commercial-grade wake word implementations), wake word training is, in fact, very involved. We will be using this process to train "Hi Willow" and other wake words as it makes sense.

Espressif publishes metrics for wake activation, false wake, resource utilization, etc. From those benchmarks, WakeNet activation is 94-98% reliable depending on environmental conditions, all while minimizing false wakes. We have reliable wake word detection across all of our supported wake words, and clean speech for speech recognition, from at least 25-30 ft away (even without line of sight to the device: around corners, etc.) in acoustically challenging environments (acoustic echo, noise, etc.).
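To put the "real mess of your network" concern above in rough numbers, here is a back-of-the-envelope sketch. It assumes a single channel of 16 kHz / 16-bit PCM, a common capture format for wake word engines; the exact format any given device would stream is an assumption on my part.

```c
#include <stdio.h>

/* Back-of-the-envelope bandwidth for streaming raw microphone audio
 * continuously, 24x7, as naive network-based wake word detection would
 * require. 16 kHz / 16-bit mono is assumed; real devices may differ. */
int main(void)
{
    const double sample_rate_hz   = 16000.0; /* assumed capture rate */
    const double bytes_per_sample = 2.0;     /* 16-bit PCM           */
    const double channels         = 1.0;     /* mono                 */

    double bytes_per_sec  = sample_rate_hz * bytes_per_sample * channels;
    double kbit_per_sec   = bytes_per_sec * 8.0 / 1000.0;
    double gbytes_per_day = bytes_per_sec * 86400.0 / 1e9;

    /* Prints: 256 kbit/s sustained, ~2.8 GB/day, per device. */
    printf("Per device: %.0f kbit/s sustained, ~%.1f GB/day\n",
           kbit_per_sec, gbytes_per_day);
    return 0;
}
```

Per device that is modest, but multiplied across a house full of always-on satellites it adds up, which is why on-device wake detection is attractive.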
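To make the AFE placement described above concrete, here is a minimal sketch of the capture -> AFE -> WakeNet loop using Espressif's public esp-sr API. The handle and config names follow the esp-sr headers, but exact fields vary between esp-sr releases, and the I2S capture is stubbed out; treat this as an illustration of where the AFE sits in the pipeline, not Willow's actual source.

```c
/* Sketch of the ESP-SR AFE pipeline as described above:
 * dual I2S mics -> AFE (AEC, noise suppression, VAD) -> WakeNet. */
#include "esp_afe_sr_iface.h"
#include "esp_afe_sr_models.h"

static int16_t *read_i2s_frame(int samples); /* stub: dual-mic I2S capture */

void wake_task(void)
{
    /* AFE handle with AEC, VAD, and WakeNet enabled in the config. */
    esp_afe_sr_iface_t *afe = &ESP_AFE_SR_HANDLE;
    afe_config_t cfg = AFE_CONFIG_DEFAULT();
    cfg.aec_init     = true;  /* acoustic echo cancellation */
    cfg.vad_init     = true;  /* voice activity detection   */
    cfg.wakenet_init = true;  /* WakeNet wake word engine   */
    esp_afe_sr_data_t *data = afe->create_from_config(&cfg);

    int chunk = afe->get_feed_chunksize(data);
    for (;;) {
        /* Feed raw multi-channel I2S audio into the AFE... */
        afe->feed(data, read_i2s_frame(chunk));

        /* ...and fetch cleaned single-channel audio plus wake state. */
        afe_fetch_result_t *res = afe->fetch(data);
        if (res && res->wakeup_state == WAKENET_DETECTED) {
            /* Wake word hit: res->data is processed audio, suitable
             * for on-device recognition or streaming to the server. */
        }
    }
}
```

Everything downstream only ever sees the AFE's output, which is why wake detection, local recognition, and the server stream all benefit from the same cleanup.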
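On the 100 ms VAD timeout mentioned above: a hypothetical sketch of how an end-of-speech gate can be layered on a VAD's per-frame speech/silence decision. The type and helper names here are illustrative, not Willow's.

```c
#include <stdbool.h>
#include <stdint.h>

#define VAD_TIMEOUT_MS 100 /* Willow's stated current default */

typedef struct {
    bool    speaking;      /* has any speech been seen yet?   */
    int64_t last_voice_ms; /* timestamp of last speech frame  */
} vad_gate_t;

/* Returns true while the utterance is still open; once speech has
 * started, VAD_TIMEOUT_MS of continuous silence closes the stream. */
bool vad_gate_update(vad_gate_t *g, bool voice_now, int64_t now_ms)
{
    if (voice_now) {
        g->speaking = true;
        g->last_voice_ms = now_ms;
    }
    return !(g->speaking && (now_ms - g->last_voice_ms) > VAD_TIMEOUT_MS);
}
```

A short timeout like 100 ms makes the device feel snappy at the cost of cutting off slow speakers sooner, which is exactly the kind of thing the ESP-SR/ESP-DSP tuning parameters let you trade off.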