Happy New Year! I am kicking off with a fun idea (there has been no actual work on it yet), which I will code-name the F-Word Detector for now. Short for “using a locally-trained machine learning model to detect and monitor the use of profanity in working environments” 😎.
We are all human beings, prone to occasional lapses when expressing our frustration with a project, the task we are working on, the cold morning coffee, or anything else that has stepped in the way of our perfect day. Yet we also want to be better individuals and avoid letting these lapses become chronic, especially in common areas like the office.
This is how I came up with the idea (still, only an idea) of creating a detector application that would monitor (more on this below) conversations around the office and count the usage of the F-word. When the count reaches a certain threshold, something would happen as a notice, like a funny meme GIF being sent around on Slack. Remember Not Hotdog from the HBO show Silicon Valley?
Ideally, a machine learning model should be able to distinguish between speech that contains the F-word and speech that doesn’t. There are two big challenges though. One is purely the privacy aspect: I bet no one would like a machine recording their speech, and even less so sending it to servers overseas to be analysed. The other is the cost of having such a thing run all the time. Most cloud infrastructure providers offer cloud-based speech-to-text APIs that charge per request. I can easily imagine such a fun experiment ending up as a money trap, costing thousands of dollars in usage fees.
So, our goal can be achieved only by training and running a machine learning model ourselves to classify speech. Moreover, the app must store recorded audio only temporarily, and avoid using raw audio during either the training or testing periods. This, together with open-sourcing the project, is a key ingredient in assuring participants that nothing they say will ever get stored, or potentially used against them (who said GDPR?).
On paper, achieving this is not nearly as difficult as it sounds. We need an app that does the following:
- Constantly listen to sounds coming from the device’s microphone
- Turn the sounds between any two periods of silence into spectrogram images
- During the training phase, half of the spectrograms will be labeled as containing the F-word, and the other half as not containing it.
- Having the two distinct data sets, we can train a convolutional neural network (CNN) to distinguish between them
- Having the trained model, the app will repeat the first two steps and count +1 every time the ML model indicates an occurrence of the F-word with confidence above a certain threshold.
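The first two steps above — slicing the microphone stream on silence and turning each slice into a spectrogram image — can be sketched in a few lines of NumPy. Everything here (frame sizes, the silence threshold) is a hypothetical starting point for illustration, not a tuned implementation:

```python
import numpy as np

def split_on_silence(samples, frame_len=1024, threshold=0.01):
    """Split a mono waveform into segments bounded by silent frames.
    A frame counts as 'silent' when its RMS energy falls below
    `threshold` (a placeholder value; tune for your microphone)."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    loud = np.sqrt((frames ** 2).mean(axis=1)) >= threshold
    segments, start = [], None
    for i, is_loud in enumerate(loud):
        if is_loud and start is None:
            start = i                      # a non-silent run begins
        elif not is_loud and start is not None:
            segments.append(samples[start * frame_len : i * frame_len])
            start = None                   # the run ended in silence
    if start is not None:
        segments.append(samples[start * frame_len :])
    return segments

def spectrogram(segment, frame_len=256, hop=128):
    """Log-magnitude spectrogram: one FFT column per hop."""
    window = np.hanning(frame_len)
    cols = []
    for start in range(0, len(segment) - frame_len + 1, hop):
        frame = segment[start : start + frame_len] * window
        cols.append(np.abs(np.fft.rfft(frame)))
    return np.log1p(np.array(cols).T)      # shape: (freq_bins, time_steps)
```

Each resulting 2-D array is effectively a small grayscale image, which is exactly what gets labeled and fed to the CNN in the later steps.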
Turning audio into a bitmap and using a CNN, which is especially effective at classifying bitmap data, is a tried and proven trick. There are quite a few experiments that use CNNs in combination with spectrogram images to classify different genres of music. It is important to point out once again that our app will eventually distinguish between two classes only (F-word or not). In contrast, our model would be very unsuitable for something like full speech-to-text recognition.
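To make the CNN-on-spectrograms idea concrete, here is a toy forward pass in plain NumPy: one convolution layer with ReLU, max pooling, and a sigmoid output giving the probability of the F-word class. The weights here are placeholders; a real version would train them (e.g. with Keras) on the two labeled spectrogram sets:

```python
import numpy as np

def conv2d_relu(img, kernels):
    """'Valid' 2-D convolution of a single-channel image with a bank
    of kernels, followed by ReLU."""
    kh, kw = kernels.shape[1:]
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((len(kernels), h, w))
    for k, kern in enumerate(kernels):
        for i in range(h):
            for j in range(w):
                out[k, i, j] = (img[i:i + kh, j:j + kw] * kern).sum()
    return np.maximum(out, 0.0)

def max_pool(maps, size=2):
    """Non-overlapping max pooling over each feature map."""
    c, h, w = maps.shape
    h2, w2 = h // size, w // size
    return maps[:, :h2 * size, :w2 * size].reshape(
        c, h2, size, w2, size).max(axis=(2, 4))

def predict(spec, kernels, w_out, b_out):
    """Probability (0..1) that a spectrogram belongs to the F-word class."""
    features = max_pool(conv2d_relu(spec, kernels)).ravel()
    return 1.0 / (1.0 + np.exp(-(features @ w_out + b_out)))  # sigmoid
```

The counting step from the plan then reduces to `predict(...) > threshold` for each spectrogram the app produces.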
The next step in my journey is to try to implement a working version of the app, as well as to find people willing to participate. I will try to document my progress on the blog.
Further reading:
- How HBO’s Silicon Valley built “Not Hotdog” with mobile TensorFlow, Keras & React Native
- Recommending music on Spotify with deep learning – Sander Dieleman
- Audio Classification: A Convolutional Neural Network Approach