# Quality Analysis

The Quality Analysis is a series of tests designed to check whether a model can handle real-life scenarios and noisy data. You can think of it as stress testing your machine learning model. The Quality Analysis outputs a Quality Score based on the model's performance on each test. The individual test scores are also available in the Quality Report if you wish to dig into the finer details of the Analysis. The Quality Analysis also outputs the Performance Score, a measure of your model's prediction performance.

## Quality Score

Relying on a single test or metric to assess quality would require one magic number that captures every risk associated with using a given model. Since no such number exists, Snitch performs multiple tests to explore different aspects of the quality of your machine learning model. Each test that Snitch performs gets its own score between 0 and 100. The test scores are aggregated into three intermediate scores: the Feature Contribution Score, the Random Noise Robustness Score, and the Extreme Noise Robustness Score. The Quality Score is a weighted average of these three intermediate scores.

This layered approach lets you assess the overall quality of your models with confidence, while still being able to dig deeper into the details of the analysis. You can even compare your models' quality at the layer of your choice!
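As a minimal sketch, the aggregation described above can be expressed as a weighted average, using the 45% / 45% / 10% weights stated in the sections below. The function name is illustrative, not part of Snitch's API:

```python
# Hedged sketch: aggregating the three intermediate scores into the
# Quality Score with the weights documented for each validation
# (45% / 45% / 10%). Not Snitch's actual implementation.

def quality_score(feature_contribution: float,
                  random_noise_robustness: float,
                  extreme_noise_robustness: float) -> float:
    """Each input is an intermediate score between 0 and 100."""
    return (0.45 * feature_contribution
            + 0.45 * random_noise_robustness
            + 0.10 * extreme_noise_robustness)

# Example: strong contribution and random-noise scores, weaker
# extreme-noise score.
print(quality_score(80, 90, 70))
```

Because the weights sum to 1, the Quality Score stays on the same 0–100 scale as the intermediate scores.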

#### Feature Contribution Validation

The Quality Analysis produces the Feature Contribution Score to check whether your model's predictions are fairly distributed among the input variables. The Feature Contribution Score accounts for 45% of the Quality Score.

The contribution of each feature to the model's predictions is estimated by the Shapley additive explanation method (Lundberg and Lee 2017) or the LIME method (Ribeiro, Singh, and Guestrin 2016), depending on their computational cost for the model (the LIME method is preferred when the Shapley method is expected to be too costly). The Feature Contribution Score is based on the inequality in the input features' contributions to the predictions, measured by the Gini coefficient and by whether a small cluster of input features explains an undue share of the predictions. The more unequal the distribution of the feature contributions, the lower the score.
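To make the Gini coefficient concrete, here is a small sketch that measures inequality over a set of contribution magnitudes. The mapping from the Gini coefficient to a 0–100 score at the end is an illustrative assumption; Snitch's exact formula is not documented here:

```python
# Hedged sketch: Gini coefficient of feature-contribution magnitudes.
# 0 means perfectly equal contributions; values near 1 mean a few
# features dominate the predictions.

def gini(contributions):
    values = sorted(abs(c) for c in contributions)
    n = len(values)
    total = sum(values)
    if total == 0:
        return 0.0
    # Standard form: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def contribution_score(contributions) -> float:
    # Assumed mapping: more inequality -> lower score.
    return 100 * (1 - gini(contributions))

print(gini([1, 1, 1, 1]))   # equal contributions -> 0.0
print(gini([0, 0, 0, 10]))  # one dominant feature -> 0.75
```

With four features, a single dominant feature yields a Gini coefficient of 0.75, pushing the (assumed) score down accordingly.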

#### Random Noise Validation

The Quality Analysis produces the Random Noise Robustness Score to check whether your model is robust to the introduction of noisy data. The Score accounts for 45% of the Quality Score.

The Random Noise Validation creates synthetic perturbed examples based on the training observations. The model's drop in performance is measured both when noise is introduced across all input variables at once and when it is introduced for each input variable individually. The resulting Random Noise Robustness Score is inversely proportional to the model's performance drops.
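The idea can be sketched as follows. The Gaussian noise, its scale, and the mapping from performance drop to score are all illustrative assumptions, not Snitch's documented internals:

```python
# Hedged sketch of the random-noise validation idea: perturb the
# training inputs with random noise, re-score the model, and derive a
# robustness score from the relative performance drop.
import random

def perturb(rows, noise_scale=0.1, seed=0):
    """Add zero-mean Gaussian noise to every numeric input value."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0, noise_scale) for x in row] for row in rows]

def robustness_score(clean_performance, noisy_performance):
    """Assumed mapping: score falls with the relative performance drop."""
    drop = max(clean_performance - noisy_performance, 0) / clean_performance
    return 100 * (1 - drop)

# A 10% relative performance drop yields a score of about 90.
print(robustness_score(0.90, 0.81))
```

Running the same measurement per input variable, as described above, reveals which features make the model fragile.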

#### Extreme Noise Validation

The Quality Analysis produces the Extreme Noise Robustness Score to check whether your model is robust to the introduction of worst-case noisy data. The Score accounts for 10% of the Quality Score.

Note that the Extreme Noise Validation is currently only supported for TensorFlow/Keras models.

The Extreme Noise Validation creates synthetic perturbed examples based on the training observations. Unlike the Random Noise Validation, the direction of the induced noise is targeted towards the greatest decrease in the model's performance. We use the fast gradient sign method developed by Goodfellow, Shlens, and Szegedy (2014) to generate these synthetic examples. The model's drop in performance is measured both when noise is introduced across all input variables at once and when it is introduced for each input variable individually. The resulting Extreme Noise Robustness Score is inversely proportional to the model's performance drops.
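The fast gradient sign method perturbs each input in the direction of the sign of the loss gradient. The toy model below makes this concrete with a linear regression whose gradient is known in closed form; Snitch applies the same idea to TensorFlow/Keras models via automatic differentiation, so this sketch is for illustration only:

```python
# Hedged sketch of the fast gradient sign method (Goodfellow et al.
# 2014) on a toy linear model: prediction = w . x, loss = (pred - y)^2.
# The gradient of the loss w.r.t. x_i is 2 * (pred - y) * w_i.

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_example(x, y, weights, epsilon=0.1):
    """Shift each input by epsilon in the direction that increases the loss."""
    prediction = sum(w * xi for w, xi in zip(weights, x))
    grad = [2 * (prediction - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

x = [1.0, 2.0]
w = [0.5, -0.5]
# prediction = 0.5 - 1.0 = -0.5 against target y = 0
x_adv = fgsm_example(x, y=0.0, weights=w, epsilon=0.1)
print(x_adv)  # each coordinate moved by epsilon towards higher loss
```

Because every coordinate moves by exactly epsilon, the perturbation is small and bounded, yet chosen to hurt the model as much as a single gradient step allows.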

## Performance Score

The Performance Score is a generic metric of your model's prediction performance. It ranges from 0 to 100 and works for both regression and classification, allowing you to compare all your models together. The Performance Score is given by the F1 score expressed as a percentage for classification models, and by 100 minus the mean absolute percentage error (MAPE) for regression models (negative values are clipped to 0).

$Score_{classification} = 100 \times F1$

$Score_{regression} = \max(100 - MAPE, 0)$
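A minimal sketch of these two definitions, assuming a binary F1 for simplicity (the function names are illustrative, not Snitch's API):

```python
# Hedged sketch of the Performance Score formulas above.

def classification_score(precision: float, recall: float) -> float:
    """F1 score expressed as a percentage."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * f1

def regression_score(y_true, y_pred) -> float:
    """100 minus the mean absolute percentage error, clipped at 0."""
    mape = 100 * sum(abs((t - p) / t)
                     for t, p in zip(y_true, y_pred)) / len(y_true)
    return max(100 - mape, 0)

print(classification_score(0.8, 0.8))          # F1 = 0.8 -> score 80
print(regression_score([100, 200], [90, 220]))  # MAPE = 10% -> score 90
```

The clipping at 0 matters for regression: a model whose MAPE exceeds 100% still gets a score of 0 rather than a negative value, keeping both scores on the same 0–100 scale.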