By Sarah Gillespie

Published January 31, 2022

What are trust scores?

The trust scores approach mixes the topology of points in a multidimensional space with calculus to be effective at detecting the accuracy of a unique machine learning prediction. Each unique prediction has its own unique trust score. This concept was introduced in the To Trust Or Not To Trust A Classifier, presented at Neural Information Processing Systems in 2018. The trust score is a remarkably flexible ratio with a purpose of describing how much a user should trust an algorithm’s classification of a specific point. Creating a trust score has two steps: first, remove the outlying points, and second, calculate the ratio of the distance from the testing sample to the nearest classes similar and nearest class different from the testing sample. The trust score for a testing sample is likely to agree with the point’s Bayes-optimal classifier. This means the trust score is a clue to decide if an algorithm’s prediction about a test point is correct, rather than a Type I or II Error.


Code to calculate trust scores yourself

While the initial trust score code was done in Python 2.7, it was updated to be usable in Python 3. I have adapted the given Python code to work in R and R Studio by using the reticulate package, a comprehensive set of tools for interoperability between Python and R.

Step 1: Gather materials.

Download the entire repository. It is important to have the trustscores.py anf trustscore_evaluation.py in the same folder as the .Rmd file to import the trust score functions that are in that Python file.

Step 2: Open R Studio and run the following code chunks.

The code in this tutorial creates trust scores based on penguin species type predictions with data from the Palmer Penguins package. I tried to make this code as generalized as possible to easily swap in different data.

If the code is in R, then write the code chunk name with “r” aznd if the code is in Python then write the code chunk as “python”. The code type is detailed in the first line of each individual code chunk.

# R code chunk
# Load the R libraries. Do install.packages() as needed.
library(palmerpenguins) # bring in data about penguin species
library(dplyr) # tools for data wrangling in R
library(reticulate) # provides a comprehensive set of tools for interoperability 
                    # between Python and R.
# R code chunk
# configure Python
reticulate::py_config() # Double check that reticulate is actually using a 
                        # new conda environment.
reticulate::py_install("sklearn", pip = TRUE) # force install with pip. sklearn
                                              # wasn't coming up via anaconda.
reticulate::py_install("matplotlib")
reticulate::py_install("keras")
reticulate::py_install("pandas")

# setting up the Python environment and bringing in the required Python packages 
# is important.
# Running this code will likely take 5 to 10 minutes if these Python libraries
# are not already installed.

# Common trouble shooting:
# if missing a package, try adding its name in an additional argument of 
# py_install. if py_install isn't working then try adding the pip = TRUE
# argument to try installing the library through pip rather than anaconda.
# Python code chunk

# import required Python packages
import numpy as np
import trustscore # have trustscore.py file in the same folder as 
                  # the .Rmd file to import it
import trustscore_evaluation  # haave trustscore_evaluation.py in the 
                              # same folder as the .Rmd file to import it
import numpy as np
import matplotlib.pyplot as plt
import keras
import pandas as pd

There might be some scary errors about “dlerror: cudart64_110.dll not found”. It’s just a warning and ignorable.

Step 3: Import data and do any necessary data wrangling to clean the data.

# R code chunk

# data
penguins_df <- penguins %>%
    dplyr::filter(complete.cases(.))
# remove all lines with missing values so the model can fit the data
 
penguins_data <- penguins_df %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# target: species
penguins_target <- penguins_df %>%
  dplyr::select(species)

Step 4: Bring the data into Python.

# Python code chunk
# Import penguins from R into Python
penguins_data = r.penguins_data

penguins_target = r.penguins_target

Step 5: Encode catagorical variables to be numbers rather than words.

# Python code chunk

# Python function: encode each penguin species to be a number.
# purpose: make it easier for the model to assign a species prediction and its trust score

dictionary = {}
current_index = 0

def encode(value):
    if value in dictionary:
      return dictionary[value]
    else:
      global current_index
      
      new_index = current_index
      dictionary[value] = new_index
      current_index = current_index + 1
      
      return new_index

X_penguins = penguins_data

y_penguins = list(map(encode, penguins_target.values.ravel()))
# at this point y_penguins is a list. We will need to make it a numpy array later.

# dictionary coordinating each penguin species connected to its assigned number
print(dictionary)

Step 6: Run specific machine learning model.

Options: from sklearn.svm import LinearSVC from sklearn.ensemble import RandomForestClassifier from sklearn.linear_model import LogisticRegression

Logistic Regression
# Python code chunk

# import the model
from sklearn.linear_model import LogisticRegression

# Train logistic regression on digits.
model = LogisticRegression()

# fit the model
model.fit(X_penguins, y_penguins)

# Implement the model / create outputs on testing set.
y_pred = model.predict(X_penguins)

You may get a “failed to converge” and “max number of iterations reached” warning. This warning describes the data not fitting well to the assigned model rather than a tech issue. This warning can safely be ignored.

Step 7: Compute the trust scores model

# Python code chunk

# Initialize trust score.
trust_model = trustscore.TrustScore()

# convert the inputs into numpy arrays
X_penguins = X_penguins.to_numpy()
y_penguins = np.array(y_penguins)

# fit the trust scores model
trust_model.fit(X_penguins, y_penguins)

# the model will be fitted. It produces no output.

Step 8: Calculate unique trust scores

# Python code chunk

# Compute trusts score from the above model, given (unlabeled) testing examples and (hard) model predictions.
trust_score = trust_model.get_score(X_penguins, y_pred)

# prints the trust scores for each point in the inputted data set
print(trust_score)

Step 9: calculate a single trust score

given x value
# Python code chunk

# GIVEN X VALUE
# pick a specific X value from the given data
X_value_given = np.array([X_penguins[1]])

print("X_value_given:" + str(X_value_given))

y_pred_single = np.array([y_pred[1]])

# Compute trusts score from the above model, given (unlabeled) testing examples and (hard) model predictions.
trust_score_specific = trust_model.get_score(X_value_given, y_pred_single)

# prints the trust scores for each point in the inputted data set
print(trust_score_specific)
theoretical x value
# Python code chunk

# THEORETICAL X VALUE
# make up your own X value to generate a trust score
# put whatever values in X_value_theoretical to find the predicted trust score
# based on the earlier fitted model

X_value_theoretical = np.array([[  34.1,   23.0,  123. , 3210. ]])

print("X_value:" + str(X_value_theoretical))

y_pred_single = np.array([y_pred[1]])

# Compute trusts score from the above model, given (unlabeled) testing examples and (hard) model predictions.
trust_score_specific = trust_model.get_score(X_value_theoretical, y_pred_single)

print(trust_score_specific)

The output from an array [[ 34.1 23. 123. 3210. ]] is a trust score of 1.060239741504233. The array that produces this score is not in the original data set. There is no penguin with these specific wing and height dimensions. Finding the trust score for this collection of made up numbers is a way to see how distinct the imaginary penguin species classification is compared to the next most likely species classification.


How to interpret the scores?

If trust score > 1, then the different class is geographically closer to the testing sample than its predicted class. This is a red flag for the testing sample’s categorization being incorrect.


Further Steps:

Try other model options:

from sklearn.svm import LinearSVC

from sklearn.ensemble import RandomForestClassifier

Additionally, consider alternate scores to consider the trustability of the model’s output. One intuitive but less rigoriously researched scoring system is confidence and credibility scores.


References

The code for the To Trust Or Not To Trust A Classifier paper is located on Github and has been updated for Python 3.