What if Google’s What-If Tool doesn’t explain why TensorFlow makes its decisions?


Toolkits that attempt to explain deep learning AI models might offer interesting rationalizations, but they fail to reveal why the black box actually made its decision.

Google’s acclaimed What-If Tool (WIT) enables users to probe decision models and visualize bias. Because Google’s TensorFlow is far and away the most popular machine learning tool in the world, this was a critical advancement for the adoption of AI.

At Diveplane, we’ve dedicated ourselves to keeping the humanity in AI and building understandable AI systems, so we were thrilled at the news. The notion of erasing bias in decision making by handing decisions to AI is enticing. However, asking Siri to play a song you like is one thing; a computer making a hiring decision, or driving a car, is another entirely. We need to trust the result. And highly regulated industries, like financial services and healthcare, must be able to explain their decision-making processes with great clarity.

Opening AI’s black box is at the top of the to-do list for anyone and everyone in the field: scientists at the Explainable AI workshops at the leading academic conference (AAAI); analysts from PwC, McKinsey, and Fortune.com; and well-known academics like Thomas Dietterich and Cynthia Rudin debating on Twitter.

In an effort to assuage concerns and build trust, each major cloud provider independently published tools or papers intended to root out bias within the last year [1] [2] [3] [4] [5]. While the efforts are to be applauded, the fundamental issue is not addressed: the deep learning models used by nearly all of today’s AI are black boxes, by design. Attempts to explain predictions from their opaque decision models after the fact are rationalizations. And of course, just as in everyday life, rationalizations from deep learning models are not necessarily true or useful explanations.

[Image: Google sign. What-If is not necessarily what is. Photo by Arthur Osipyan on Unsplash]

The What-If Tool works by building a surrogate model from the original source data used to build a TensorFlow model. Google then layers on a clever web interface that visualizes the data classifications, so you can inspect the data and see how certain data points resemble one another. The tool implies that we are inspecting our deep learning model, and that we can use it to identify, and therefore remove, bias to improve the fairness or accuracy of our results. This sounds great. But to see how much we are really learning about the influences on the TensorFlow model, we ran an experiment.
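For readers who want to try this kind of probing themselves, here is a minimal sketch of how a trained TensorFlow Estimator can be loaded into the What-If Tool inside a Jupyter notebook. The names estimator, feature_spec, and test_examples are placeholders for your own trained model, its feature specification, and a list of tf.train.Example protos; the label vocabulary is an assumption borrowed from the census income demo.

```python
# Minimal sketch: open a trained TF Estimator in the What-If Tool notebook
# widget. estimator, feature_spec, and test_examples are assumed to come
# from your own training code.
from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

config_builder = (
    WitConfigBuilder(test_examples)              # tf.train.Example protos to probe
    .set_estimator_and_feature_spec(estimator, feature_spec)
    .set_label_vocab(["Under 50K", "Over 50K"])  # assumed class names
)
WitWidget(config_builder, height=800)            # renders the interactive tool
```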

We wanted to know: Do similar and counterfactual points influence the model? And if so, by how much?

  • First, we used the same UCI census income data Google used to demo the tool on their website, and we carefully replicated the model on our internal servers.
  • Then we selected a sample person at random from the data; the TensorFlow model uses that person’s census attributes to predict whether they earn more than $50,000 a year.
  • Next, we asked WIT to show us the 100 cases most similar to our sample case, since we would expect these to be the cases that most heavily influenced TensorFlow’s classification.
  • We then changed the label on those 100 similar cases to “not making over $50,000.” If these cases are truly influential, changing them should affect how TensorFlow classifies our sample case.
  • Finally, we re-ran the prediction. Not only did the classification of our sample case rarely change, but the probability the model assigned to it often did not change significantly either. (A rough code sketch of this experiment follows this list.)
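The sketch below is an approximate, stand-alone version of that experiment, not our internal setup: a generic scikit-learn classifier and plain L2 nearest neighbors stand in for the original TensorFlow model and WIT’s similarity view, and the file name adult.data, the column list, and the choice of sample index are assumptions based on the standard UCI census income files.

```python
# Approximate re-creation of the label-flipping experiment with scikit-learn
# stand-ins for the TensorFlow model and the What-If Tool's similarity view.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Standard UCI census income ("adult") columns; the local file name is assumed.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "sex",
           "capital_gain", "capital_loss", "hours_per_week",
           "native_country", "income"]

df = pd.read_csv("adult.data", names=COLUMNS, skipinitialspace=True)
y = (df.pop("income") == ">50K").astype(int)   # 1 = earns over $50K
X = pd.get_dummies(df)                         # one-hot encode categoricals

clf = HistGradientBoostingClassifier().fit(X, y)

# Approximate WIT's "most similar" view with L2 nearest neighbors in
# standardized feature space.
scaler = StandardScaler().fit(X)
nn = NearestNeighbors(n_neighbors=101).fit(scaler.transform(X))
sample_idx = 0                                 # stand-in for the random sample case
_, idx = nn.kneighbors(scaler.transform(X.iloc[[sample_idx]]))
neighbors = idx[0][1:]                         # 100 nearest, minus the case itself

# Flip the labels of the 100 most-similar cases and retrain.
y_flipped = y.copy()
y_flipped.iloc[neighbors] = 0                  # force "not over $50,000"
clf_flipped = HistGradientBoostingClassifier().fit(X, y_flipped)

before = clf.predict_proba(X.iloc[[sample_idx]])[0, 1]
after = clf_flipped.predict_proba(X.iloc[[sample_idx]])[0, 1]
print(f"P(>50K) for the sample case: before {before:.3f}, after {after:.3f}")
```

The nearest-neighbor step is only a stand-in for WIT’s similarity view, but the test is the same: flip the labels of the supposedly most influential points, retrain, and see whether the sample prediction actually moves.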


What does it mean?
First, this experiment shows that the explanations we see in Google’s What-If Tool are not necessarily indicative of what is happening inside the machine learning model. The data set contains just under 49,000 instances (an instance being the data representing an individual person). We trained some small decision trees on the data, reaching roughly 85% accuracy with a holdout of one third of the data, and found that the number of instances in most of the leaf nodes of those trees ranged from the tens to the thousands. This suggested that flipping the labels of 50 to 200 instances should be enough to be noticed by an accurate model. We wanted to make sure the magnitude of the change was large enough to be detected, rather than simply probing for overfitting.
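Here is a minimal sketch of that sanity check, reusing the encoded features X and labels y from the earlier sketch; the depth limit and random seed are arbitrary assumptions.

```python
# Sanity check: how many training instances does each leaf of a small
# decision tree rest on? Reuses X and y from the previous sketch.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)      # hold out one third of the data

tree = DecisionTreeClassifier(max_depth=8).fit(X_train, y_train)
print(f"holdout accuracy: {tree.score(X_test, y_test):.3f}")

leaf_ids = tree.apply(X_train)                  # leaf index for each training row
leaf_sizes = np.bincount(leaf_ids)
leaf_sizes = leaf_sizes[leaf_sizes > 0]         # keep only populated leaves
print(f"leaf sizes range from {leaf_sizes.min()} to {leaf_sizes.max()}")
```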

We flipped the yes/no answer for over-$50,000 income on each of the most-similar cases identified through the What-If Tool and retrained the model. Many times, this did not have a significant impact on the result or its probabilities. We then tried flipping randomly selected instances and compared the results. Out of 30 runs flipping 50 instances, randomly selected points had a greater impact on the model than What-If’s results 3 times! Across other runs flipping 100 and 500 instances, using both the L1 and L2 distance options, we often saw randomly chosen points influence the model more than the What-If Tool’s rationalizations in 2-10% of cases.
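A hedged sketch of that comparison follows: flip 50 randomly chosen labels, retrain, measure how far the sample case’s predicted probability moves, and compare against the shift produced by flipping the 50 most-similar points. It reuses X, y, neighbors, sample_idx, and before from the sketches above, and retrains the model 31 times, so it takes a while.

```python
# Compare the influence of flipping "most similar" points vs. random points.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

def shift_after_flipping(indices):
    """Retrain with the given labels forced to 0 and return how far the
    sample case's P(>50K) moves from its original value."""
    y_mod = y.copy()
    y_mod.iloc[indices] = 0
    model = HistGradientBoostingClassifier().fit(X, y_mod)
    return abs(model.predict_proba(X.iloc[[sample_idx]])[0, 1] - before)

similar_shift = shift_after_flipping(neighbors[:50])   # 50 most-similar cases

rng = np.random.default_rng(0)
random_wins = 0
for _ in range(30):
    random_idx = rng.choice(len(y), size=50, replace=False)
    if shift_after_flipping(random_idx) > similar_shift:
        random_wins += 1
print(f"random flips moved the prediction more in {random_wins} of 30 runs")
```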

This means that if you used WIT to try to improve your own model, or to remove bias, there is a small chance your edits would have no statistical impact. Given that at least one study found humans lie an average of 2 to 3 times every ten minutes in a conversation where they are trying to seem competent, shouldn’t we expect better from machines? This data set involves anonymized, sensitive personal data, so the tool is guiding people incorrectly on exactly the sort of problem most in need of explanation and careful introspection: the sort where getting the wrong 10% can make a big difference, or where randomly changing some data can have a larger effect on a decision than a deliberate change.


Where do we go from here?
If Geoffrey Hinton, a founding father of deep learning, says AI needs to start over before it can go forward, then the answer to understandable machine learning is not going to come from doing more of the same and hoping for a different result. And that includes open-sourcing fairness toolkits that, at best, rationalize a black-box decision model and, at worst, foster the dangerous and false assumption that negative bias has been removed.

The promising news is that when we changed these same similar cases in Diveplane’s machine learning platform, the classification of the sample case changed, something we rarely saw using the What-If Tool. We’ve posted a technical paper detailing our breakthroughs in machine learning, and we will follow up with a blog post explaining how we address this issue.