Last week, I shared the above statement on LinkedIn but largely left it unexplained. I felt the need to elaborate a little. After all, it seems funny for someone spending so much time using, building, and benefiting from machine learning automation to say something like that.
It’s all Statistics #
I am a scientist at heart. I believe in explainable and observable phenomena. That’s why I find Statistics so fundamental to modern life - it helps interpret large quantities of raw data and predict the occurrence of outcomes with a given level of certainty. Ever wanted to know the chance of getting a Heads or Tails when tossing a coin? Toss the coin enough times, record the outcomes, and divide the number of Heads and the number of Tails, each in turn, by the total number of tosses. If you toss the coin long enough, the share of each outcome will be pretty close to 50% (even though it may not be exactly 50%).
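The experiment can be sketched in a few lines of Python. This is only an illustration, assuming a simulated coin backed by the standard `random` module; the function name `estimate_coin_probabilities` and the fixed seed are my own choices, not part of any library.

```python
import random

def estimate_coin_probabilities(num_tosses, p_heads=0.5, seed=42):
    """Toss a simulated coin and return the observed shares of (Heads, Tails)."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    heads = sum(1 for _ in range(num_tosses) if rng.random() < p_heads)
    tails = num_tosses - heads
    return heads / num_tosses, tails / num_tosses

# The more tosses we record, the closer both shares tend to get to 0.5.
for n in (10, 1_000, 100_000):
    heads_share, tails_share = estimate_coin_probabilities(n)
    print(f"{n:>7} tosses: heads={heads_share:.4f}, tails={tails_share:.4f}")
```

With only ten tosses, the shares can be far from 50%; with a hundred thousand, they should land within a fraction of a percent.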
Interestingly enough, we can detect outliers and anomalies in the same way. Say, the coin was slightly rigged to produce Tails more often than Heads. Well, we’ll experiment with multiple coins, find the average (mean) of the probabilities across all coins, and compare the outcomes of each coin with that average. Our rigged coin will stand out as the outlier.
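Here is a rough sketch of that comparison, under a few assumptions of mine: twenty simulated coins, the last of which lands Heads only 40% of the time, and a cut-off of three standard deviations from the group mean. The helper names are hypothetical.

```python
import random
import statistics

def heads_frequency(p_heads, num_tosses, rng):
    """Toss one simulated coin num_tosses times and return its share of Heads."""
    heads = sum(1 for _ in range(num_tosses) if rng.random() < p_heads)
    return heads / num_tosses

def find_outlier_coins(frequencies, threshold=3.0):
    """Return the indices of coins whose Heads share is far from the group mean."""
    mean = statistics.mean(frequencies)
    spread = statistics.stdev(frequencies)
    if spread == 0:
        return []  # all coins behaved identically, nothing stands out
    return [i for i, f in enumerate(frequencies)
            if abs(f - mean) / spread > threshold]

rng = random.Random(7)
true_p_heads = [0.5] * 19 + [0.4]  # assumption: coin 19 is the rigged one
frequencies = [heads_frequency(p, 10_000, rng) for p in true_p_heads]
outliers = find_outlier_coins(frequencies)
print(outliers)
```

With these settings, the rigged coin should be the only index reported. One caveat: a single outlier also inflates the standard deviation, so with only a handful of coins it can never exceed the threshold. A median-based measure is more robust in that case.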
My example above was trivial, but in the same way, people way smarter than me make important decisions - by collecting and aggregating large quantities of raw data into probabilistic outcomes.
ML is just that - a form of applied Statistics. It’s a reverse plot of sorts, where we give the computer a bunch of raw data and let it derive a generalized representation of the phenomenon, consisting of a single mathematical function. Our coin-toss example would look like this.
First, we feed the machine with a bunch of raw data:
1 0 1
1 0 0
1 0 0
1 0 1
1 0 0
1 0 0
1 0 0
1 0 0
1 0 1
...
The first 1 and 0 in each row are the features - the Heads and Tails, respectively. Note that they are always the same in our case, but in complex problems, they usually vary greatly (and, of course, can be many more). The third column represents the observed outcome in each try. The machine learning algorithm’s goal is to develop a formula (function) that simulates the phenomenon of tossing a coin.
That’s all there is. Beyond all the computational brute-forcing and mathematical complexity, a machine learning model ends up being a mathematical function that describes the input data. Pretty much like we do in Statistics, but this time, the computer “discovers” the method of analysis by lots of trial and error.
In simpler terms, by feeding that model (or function) a 1 and a 0, we expect it to give something like this back to us:
# The probabilities of either Heads or Tails occurring in a coin toss.
[0.499999997653, 0.500000002347]
The model’s answers might be slightly different each time, but given a level of rounding and tolerance, they’ll likely always approach 50%.
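To make the "trial and error" part concrete, here is a minimal sketch of such a model. It is not how any particular library does it. I am assuming a tiny logistic regression trained with plain gradient descent, simulated rows in the same three-column layout as above, and that an outcome of 1 means Heads. The function names and the arbitrary starting weights are mine.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, learning_rate=1.0, epochs=200):
    """Nudge the weights step by step until the predictions match the outcomes."""
    weights = [1.5, -1.5]  # arbitrary first guess (an assumption)
    for _ in range(epochs):
        gradient = [0.0, 0.0]
        for *features, outcome in rows:
            predicted = sigmoid(sum(w * x for w, x in zip(weights, features)))
            error = predicted - outcome
            for j, x in enumerate(features):
                gradient[j] += error * x / len(rows)
        # Move each weight a little in the direction that reduces the error.
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

def predict(weights, features):
    """Return [P(Heads), P(Tails)] for the given features."""
    p_heads = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return [p_heads, 1.0 - p_heads]

rng = random.Random(0)
rows = [(1, 0, rng.randint(0, 1)) for _ in range(1_000)]  # simulated fair coin
weights = train(rows)
probabilities = predict(weights, [1, 0])
print(probabilities)
```

The learned function does nothing more than reproduce the share of Heads seen in the training rows, which is exactly the point: the model is a compact description of the data it was fed.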
From coin-toss prediction to GANs and LLMs #
In one way or another, every single development in the field has been based on the same idea. Indeed, the raw training data and the corresponding models are way more complex, but the principle is the same. Whether you generate an image with Midjourney or your school homework with ChatGPT, it’s the same function at play behind the scenes, dictating the probabilities of the next pixel having a specific color or a text character occurring after a particular text combination.
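A toy version of "the probability of a character occurring after a particular text combination" can be built from nothing but counting. To be clear, this is a stand-in for illustration: real LLMs work on tokens and use neural networks, not lookup tables. The tiny corpus and the function names below are assumptions of mine.

```python
from collections import Counter, defaultdict

def train_next_char_model(text, context_size=2):
    """Count which character follows each run of context_size characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - context_size):
        context = text[i:i + context_size]
        counts[context][text[i + context_size]] += 1
    return counts

def next_char_probabilities(counts, context):
    """Turn the raw follower counts for a context into probabilities."""
    followers = counts.get(context)
    if not followers:
        return {}  # this context never appeared in the training text
    total = sum(followers.values())
    return {char: n / total for char, n in followers.items()}

corpus = "the tower is the tallest thing there"  # hypothetical toy corpus
model = train_next_char_model(corpus)
th_followers = next_char_probabilities(model, "th")
print(th_followers)  # {'e': 0.75, 'i': 0.25}
```

In this corpus, "th" is followed by "e" three times and by "i" once, so the "model" predicts "e" with 75% probability. Scale the corpus and the function up by many orders of magnitude, and the principle stays the same.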
GANs and LLMs are much more approachable to ordinary human beings, but to speak about the computer “understanding” and “feeling” its inputs is quite a stretch. What passes for “understanding” is once again based on pure Statistics. Ask ChatGPT to write an essay about the Eiffel Tower, and it will do it. The prompt gives it enough input to match against its enormous set of learned parameters. The rest is, well, Statistics.
The safest and, probably, the most productive way to use generative AI is to not use it as generative AI. Instead, use it to explain, convert, or modify.
But how do they seem to generate so many new ideas and content out of a simple prompt?
Are those ideas new, really? In fact, is anything we as humans do truly new and genuine? I’d argue that every “new” idea in existence has been the result of years or centuries of evolution. The only new thing about it was the fact that someone believed in it and fought for it to become reality.
In our history as a species, we have collected so much information that it is unfathomable for us to connect all the dots all at once. To do that would require keeping our brains analyzing all the time, and that simply wasn’t possible, given the fact that we were chased by wild beasts or by one another for most of history.
I truly believe that we have already found the answers to many of our questions. They are just flying somewhere out there across the neurons, in the big black holes of our knowledge. We are still not mentally capable of being good co-workers and parents, keeping the food from burning on the stove and thinking about those things at the same time.
But you know who is good at brute-forcing their way in, given the right direction? That’s right, computers!
Computers can help find the solution, but we still need to define the problem #
This is precisely the conclusion I wanted to reach when I started writing this blog post. Every single machine learning model out there, regardless of its complexity, has been trained by humans, with data collected and interpreted by humans. It contains traces of every single fear, bias, or limiting belief we have. Thus, at their very core, machine learning algorithms and models are biased too.
Is this bad? Not necessarily - we just have to acknowledge that those models, like all things made by the human hand and mind, have limitations. Like all human inventions before them, they can help us move a step forward and, sometimes, two steps back. Say, a large language model helps scientists discover a new cancer treatment that could also cause a new global pandemic. The machine can present the human with different options and the probabilities of each one happening. The one who should ultimately be responsible for pushing the button is the same one typing the prompt - the human.
ML is a great tool, but it remains a tool #
And, like with all tools, we should understand its inner workings enough to say when it works and when it doesn’t. This is my big concern with deep learning models: no one understands how they work anymore, so LLMs often produce plausible-sounding but totally unreasonable logic. And the more we try to generalize their “understanding,” the worse the problem will get.
My preferred way of working with ML is producing tiny, dedicated models that solve a single problem well enough to save me and my clients time for other things. Alerting and anomaly detection is a good example. While it would never work well enough to replace human judgment, anomaly detection saves a ton of money on frequent manual checks and inspections. One still does the checks, only less often.
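As an idea of how small such a dedicated model can be, here is a sketch of a baseline alert. It is not taken from any real project: the class name, the sensor readings, and the three-standard-deviation tolerance are all assumptions for the sake of the example.

```python
import statistics

class BaselineAlert:
    """Learn a normal range from past readings and flag readings outside it."""

    def __init__(self, history, tolerance=3.0):
        self.mean = statistics.mean(history)
        self.spread = statistics.stdev(history)
        self.tolerance = tolerance  # allowed distance, in standard deviations

    def is_anomaly(self, reading):
        return abs(reading - self.mean) > self.tolerance * self.spread

# Hypothetical readings from a period known to be normal.
history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
alert = BaselineAlert(history)

flagged = [r for r in [20.0, 20.4, 27.5, 19.6] if alert.is_anomaly(r)]
print(flagged)  # [27.5]
```

A human still decides what to do about the flagged reading. The model only narrows down where to look.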
The same applies to GANs and LLMs. It’s solving these narrow problems where I think ML will bring the most benefits. The rest is a toy for the masses, which will ultimately get too expensive to give away for free.