We’re excited to share some of the theory behind our machine learning technology. We recently published a detailed technical report on arXiv that’s freely readable by the public.
In a nutshell, it explains how we can:
• build effective learning models from very small datasets
• generate explanations for decisions/predictions
• provide auditable, provable links between training data and predictions
• determine data feature importance
• train “targetless” models
• generate surprisingly useful synthetic data
• intelligently fill in gaps in data
…and so much more.
The paper is highly technical and assumes the reader has a strong background in computer science, machine learning, information theory, and statistics. Honestly, many of us don’t even understand it all! So we’ve written this blog post to discuss our algorithms at a high level, and explain why they’re so useful for understandable machine learning.
We believe strongly that humanity deserves full understanding and explanation of machine learning decisions. Our technology, in development for years for the defense sector, provides a significant improvement over commercially available systems. We’re glad to see more organizations thinking about interpretability: The Wall Street Journal reported that the majority of executives are very concerned about how machine learning systems use their data to make business decisions. Diveplane’s technology provides exactly the explanations they need, with a direct, provable, causal connection between training data and the decisions made.
Dr. Hazard’s efforts started with a venerable technique that’s taught in every introductory machine learning class: “k-nearest neighbors” (or “kNN”). kNNs work just like you’d think: to make a decision, the model simply looks at the “k” most similar situations in its training data. To decide what price is best for selling a given stock, the system might look at the five most similar stock sales in its training data, then average those sale prices to produce the answer. The model is little more than the data, some basic statistical calculations, and a few parameters defining similarity. It’s simple, clear, and very repeatable, and thus very explainable.
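To make the stock example concrete, here is a minimal sketch of plain textbook kNN in Python. It is not Diveplane’s improved version, and the feature choices, the sample sales, and the name `knn_predict` are all made up for illustration. Note that the function returns the neighbors along with the prediction, which is what makes the answer traceable back to specific training cases.

```python
import math


def knn_predict(training_cases, query, k=5):
    """Predict by averaging the targets of the k training cases nearest to `query`.

    Each training case is a (features, target) pair. The neighbors are returned
    with the prediction so the answer can be traced back to training data.
    """
    # Rank every training case by Euclidean distance to the query features.
    # (A real system would scale features first; this sketch does not.)
    ranked = sorted(training_cases, key=lambda case: math.dist(case[0], query))
    neighbors = ranked[:k]
    prediction = sum(target for _, target in neighbors) / len(neighbors)
    return prediction, neighbors


# Hypothetical past stock sales: ((shares sold, previous close), sale price).
past_sales = [
    ((100, 50.0), 50.5),
    ((120, 50.2), 50.4),
    ((90, 49.8), 50.1),
    ((110, 50.1), 50.6),
    ((105, 49.9), 50.2),
    ((5000, 12.0), 11.8),  # a very different sale that should be ignored
]

price, evidence = knn_predict(past_sales, query=(108, 50.0), k=5)
print(f"Suggested price: {price:.2f}")
for features, target in evidence:
    print(f"  based on sale {features} at {target}")
```

The “few parameters defining similarity” here are just `k` and the choice of Euclidean distance; swapping in a different distance function changes which cases count as similar.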
While kNNs are still taught in university, and they’re known to be clearly explainable, they’re rarely used for serious applications due to a host of drawbacks and scalability problems. Until now, that is, because we’ve overcome most of these drawbacks. We’ve made kNNs fast enough to be useful for a large class of executive-level problems. We can’t handle natural language processing or images (yet!), but we have support for hundreds of features and near-infinite training cases, so our system can already handle almost everything else.
(The technical report doesn’t reveal all those kNN improvements and efficiency-gaining techniques — we love sharing our ideas and improving the state of the art, but we do need to keep our competitive advantage.)
Information Theory + Machine Learning
Our next big reveal is that we’ve deeply intertwined machine learning with information theory. Information theory is best known as the mathematics behind data compression and information entropy. (When your password is called “strong” or your password generator says it has 88 bits of entropy, that’s information theory.) This combination yields tremendous advantages. As an example, you might want your ML system to tell you which training data was most relevant to its last decision. Imagine if you could choose, mathematically, to use the most interesting data points. Or the least surprising ones. This has an incredible impact on the classic exploit-versus-explore tradeoff in AI.
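For readers who haven’t met these quantities before, the standard textbook definitions fit in a few lines of Python. This is a generic sketch of surprisal and entropy, not code from our platform, and the function names are ours for illustration. The password helper assumes every character is drawn uniformly and independently, which is how password generators usually report entropy.

```python
import math


def surprisal_bits(probability):
    """Surprisal (self-information) of an event with this probability, in bits."""
    return -math.log2(probability)


def entropy_bits(probabilities):
    """Shannon entropy: the expected surprisal over a whole distribution."""
    return sum(p * surprisal_bits(p) for p in probabilities if p > 0)


def random_password_entropy_bits(alphabet_size, length):
    """Entropy of a password whose characters are drawn uniformly at random."""
    return length * math.log2(alphabet_size)


# A fair coin flip carries exactly one bit of surprisal per outcome.
print(surprisal_bits(0.5))
# Rare events are more surprising than common ones.
print(surprisal_bits(0.01) > surprisal_bits(0.9))
# Entropy of a 14-character password over 94 printable characters.
print(random_password_entropy_bits(94, 14))
```

The key intuition is that unlikely events have high surprisal and likely events have low surprisal, which is what lets a system rank data points from “least surprising” to “most interesting.”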
As a more concrete example, consider a Diveplane model trained to play chess. If you lower the desired information surprisal, the ML system will “play it safe”, and not make surprising moves – though it might still beat you with common tactics! And if you were to increase the surprisal, the model would behave like a far more unconventional player. Imagine how this “conventional-or-unconventional” surprisal option might be valuable in exploring new gene therapies. Or automated system control? Vehicle navigation?
What if you used this aspect of our system to create new training data based on previous training data? You could create surprising new data sets from data sets that contained only the smallest seeds of something interesting, and then magnify those interesting aspects. Now consider the impact this would have on reinforcement learning, where obtaining and training on the hard-to-create edge cases can be costly, or even dangerous (e.g., car crash scenarios for self-driving cars). We’re honestly just scratching the surface of what’s possible.
Here are a few of the techniques discussed in the paper:
Making Decisions with Conviction
Diveplane’s algorithms can measure how much conviction we have in any prediction made by the machine learning system. (We call it the Hazard Surprisal Ratio, or Hazard Ratio, much to Dr. Hazard’s dismay.) The technical report goes into extensive detail on how we measure the Hazard Ratio in various ways, such as conviction of familiarity or conviction of prediction. It can generally be thought of as the ratio between how surprising a given datapoint is and how surprising the rest of the model’s datapoints are. These conviction measures are incredibly powerful.
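As a rough illustration of that ratio, here is a toy sketch in Python. It is not the estimator from the technical report. It makes two simplifying assumptions of our own: surprisal is estimated from simple value frequencies, and the ratio is oriented as average surprisal divided by the datapoint’s surprisal, so that familiar datapoints score above 1 and unusual ones score below 1. The name `conviction_scores` is hypothetical.

```python
import math
from collections import Counter


def conviction_scores(observations):
    """Toy conviction: average surprisal of all cases / surprisal of each value.

    Assumption: a value's probability is just its observed frequency.
    """
    counts = Counter(observations)
    total = len(observations)
    # Surprisal of each distinct value, in bits.
    surprisal = {value: -math.log2(count / total) for value, count in counts.items()}
    # Average surprisal across every observation in the data set.
    expected = sum(surprisal[value] for value in observations) / total
    # A value that is never surprising (surprisal 0) gets infinite conviction.
    return {
        value: (expected / s if s > 0 else math.inf)
        for value, s in surprisal.items()
    }


scores = conviction_scores(["approve", "approve", "approve", "deny"])
for value, score in scores.items():
    print(value, round(score, 2))
```

In this toy data set the common value scores above 1 and the rare value scores below 1, which matches the intuition that conviction drops as a datapoint becomes more surprising relative to its peers.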
For example, when a model provides you with prediction conviction alongside each prediction it makes for you, it’s telling you “how sure it is that it’s right.” This also includes residuals, which are error bounds around the prediction. You might use that information to determine which decisions can be automated (high conviction), versus which should be flagged for human review first (low conviction).
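The automate-versus-review split described above could be wired up along these lines. This is only a sketch: the dictionary layout, the `triage` function, and the cutoff of 1.0 are assumptions for illustration, and a real deployment would pick its threshold based on the cost of a wrong automated decision.

```python
def triage(predictions, min_conviction=1.0):
    """Split predictions into ones to automate and ones to send for human review."""
    automate, review = [], []
    for prediction in predictions:
        if prediction["conviction"] >= min_conviction:
            automate.append(prediction)
        else:
            review.append(prediction)
    return automate, review


# Hypothetical loan decisions, each with the conviction the model reported.
decisions = [
    {"applicant": "A-101", "decision": "approve", "conviction": 2.4},
    {"applicant": "A-102", "decision": "deny", "conviction": 0.3},
    {"applicant": "A-103", "decision": "approve", "conviction": 1.1},
]

automate, review = triage(decisions)
print("Automated:", [d["applicant"] for d in automate])
print("Needs review:", [d["applicant"] for d in review])
```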
Conviction also helps with fielding machine learning models very quickly. Traditional ML models will start giving answers as soon as they have a few training data points. That also means their early answers are probably wrong, yet they’ll deliver them with 100% certainty, the only way they know how. Diveplane’s system will provide high conviction answers in areas where it has sufficient training, and low conviction answers in areas where it doesn’t, so you can start using the system very quickly and get initial results right away. Conviction can also guide you towards which predictions would benefit most from additional training data.
Impute: Filling in Missing Data
Sometimes, we must make predictions based on incomplete data. The data might be incomplete because of accidentally deleted data, purposefully deleted data, or data that was never collected, such as from faulty sensors. For example, the telemetry data from a race car might briefly be missing the steering wheel angle, or a residential loan decision might be missing the applicant’s zip code. Many ML techniques cannot train from an incomplete record, but by analyzing all the available data, our platform can make reasonable assumptions about the missing data and keep an audit trail of what data was imputed. And with tunable surprisal, the missing data can be filled in with “vanilla” conventional data, or reasonable-yet-surprising data.
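Here is a minimal sketch of the general idea, using ordinary nearest-neighbor imputation rather than our platform’s actual method. Missing values are marked with `None`, filled with the average from the most similar complete records, and every filled value is logged in an audit trail. The function name, record layout, and telemetry numbers are all hypothetical. Averaging the neighbors corresponds to the “vanilla” end of the spectrum; this sketch does not attempt tunable surprisal.

```python
import math


def impute(records, k=3):
    """Fill None values from the k most similar complete records.

    Returns (filled_records, audit_trail). Similarity uses only the fields
    that are present in the incomplete record.
    """
    complete = [r for r in records if None not in r]
    if not complete:
        raise ValueError("need at least one complete record to impute from")
    filled, audit = [], []
    for row, record in enumerate(records):
        missing = [i for i, value in enumerate(record) if value is None]
        if not missing:
            filled.append(list(record))
            continue
        known = [i for i in range(len(record)) if i not in missing]

        def distance(other):
            return math.dist([record[i] for i in known], [other[i] for i in known])

        neighbors = sorted(complete, key=distance)[:k]
        new_record = list(record)
        for i in missing:
            new_record[i] = sum(n[i] for n in neighbors) / len(neighbors)
            # Log what was filled in and which records it came from.
            audit.append({"row": row, "field": i, "value": new_record[i],
                          "sources": neighbors})
        filled.append(new_record)
    return filled, audit


# Hypothetical race car telemetry: (speed in km/h, steering wheel angle in degrees).
telemetry = [(200.0, 2.0), (150.0, 20.0), (148.0, 22.0), (149.0, None)]
filled, audit = impute(telemetry, k=2)
print(filled[3])
print(audit)
```

Keeping the source records in each audit entry means anyone reviewing a later prediction can see exactly which values were measured and which were inferred.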
Data synthesis techniques are used to generate novel training data from a training data set. That synthetic data might be used in combination with the original data, to fill out an insufficient training data set; some deep learning models require significant amounts of training data that isn’t always available. Synthetic data can also be used to hide private/personal data, by generating an entirely new “fake” data set and then discarding the original data. Assuming the synthetic data reflects reasonable “theoretical” individuals (which can be ensured with our techniques), a model trained on synthetic data should still give valid results and predictions.
Our combination of machine learning and information theory can be extraordinarily effective for data synthesis. For example, during synthesis we can increase or decrease the surprisal, and thus generate data that is more or less surprising than the original data. We can also create synthetic datasets that have the same high-level properties as the original data, such as maintaining the same statistical distribution across data fields. It’s quite a valuable result in its own right, and we’re looking into productizing data synthesis as a separate tool.
Deeper Forms of Explanation
In addition to conviction as an explanation mechanism, the paper discusses our many other useful explanation and audit techniques: archetype and counterfactual cases; whether features are outside local model ranges; local-to-whole-model conviction ratios; “less similar” cases; feature residuals; local model complexity; fractal dimensionality; and more. We take our Understandable AI™ seriously around here!
The paper ends with a discussion of our next exciting project — building better Reinforcement Learning. Reinforcement Learning (or RL) is one of the major types of machine learning (along with supervised and unsupervised learning). It’s often thought of as agents acting in an environment, trying to find better solutions by testing and learning and searching. It’s also the hottest field in AI: Microsoft’s CTO recently called RL one of their two top strategic priorities.
A key part of reinforcement learning is the tradeoff between exploration (of unknown territory) versus exploitation (of current knowledge). As you might imagine, our combination of information theory and machine learning techniques lends itself especially well to adjusting that balance.
Reinforcement learning often involves an AI making mistakes as it tries and fails, and those failures can be quite expensive. Because our data synthesis techniques are fully interpretable back to the relevant data, humans can understand what the AI is trying to test and learn, and the human can filter out unhelpful approaches. That leads to reinforcement learning systems that reach their goals faster and more cheaply than ever before.
When we’re ready, we’ll share more about reinforcement learning opportunities, why interpretability matters so much to reinforcement learning, and Diveplane’s advantages in this area.
We’re so excited to share some of the magic behind the curtain, and to lay out our research directions. In future posts, we’ll provide more detail on these platform features. And of course, we’re always hiring smart programmers and data scientists to help develop the future of AI.
Again, here’s the link to the paper.