“…for a better data science activity”
On 26 January 2021, Data Science Milan organized a web meetup with Michael Munn, who talked about the book “Machine Learning Design Patterns”, of which he is a co-author. The talk covered three topics: Rebalancing, Useful Overfitting and Explainable Predictions.
“Machine Learning Design Patterns”, by Michael Munn, ML Solutions Engineer at Google
Rebalancing is a typical concern in classification tasks. The number of examples belonging to each class is known as the class distribution. In many classification problems the data set does not contain the same number of observations for each class, so the class distribution is skewed. Typical examples are churn prediction, fraud detection, anomaly detection and spam detection.
Most machine learning models used for classification assume a roughly equal number of samples per class. When that assumption does not hold, predictive performance suffers, especially for the minority class, which is usually the most interesting and important one to predict.
What are the strategies to face an imbalanced data set?
First of all, choose the right performance metric. Accuracy, the number of correct predictions divided by the total number of predictions, is not a good choice for an imbalanced data set, because a trivial model that always predicts the majority class already reaches a high accuracy. It is good practice to look at the confusion matrix, a table that breaks predictions down into correct ones (on the diagonal) and the different types of incorrect ones. The goal is to achieve both high precision, a measure of a classifier’s exactness, and high recall, a measure of its completeness. For imbalanced data sets, the F1 score or the Area Under the ROC Curve (AUC) is a more suitable metric.
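To make the point about accuracy concrete, here is a minimal sketch in plain Python. The data is made up for illustration (95 legitimate transactions and 5 frauds), and the helper names `confusion_counts` and `binary_report` are hypothetical, not taken from the talk or from any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), the four cells of a binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def binary_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive (minority) class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 95 legitimate transactions (0) followed by 5 frauds (1).
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class.
always_majority = [0] * 100

# A hypothetical model: 6 false alarms, 4 of the 5 frauds caught.
y_model = [0] * 89 + [1] * 6 + [1] * 4 + [0] * 1

baseline_report = binary_report(y_true, always_majority)
model_report = binary_report(y_true, y_model)
print(baseline_report)
print(model_report)
```

In this toy example the trivial baseline has the higher accuracy (0.95 against 0.93), but its recall and F1 score are zero because it never finds a fraud, while the second model reaches a recall of 0.8.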
Try resampling the data set. There are two main approaches.
-Over-sampling: randomly sample (with replacement) the minority class until it reaches the size of the majority class. There is also SMOTE (Synthetic Minority Over-sampling Technique), an over-sampling method that creates synthetic samples from the minority class instead of copying existing ones.
-Down-sampling: randomly subset the majority class until it matches the size of the minority class.
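The two random resampling approaches can be sketched with the Python standard library alone. The function names and the toy data set (90 majority rows, 10 minority rows) are assumptions for illustration, and SMOTE is not covered here, since it needs an interpolation step between neighbouring minority samples.

```python
import random
from collections import Counter


def split_by_class(rows, labels):
    """Group rows by their class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Keep every row and add random copies (with replacement) of the
    smaller classes until each class matches the largest one."""
    rng = random.Random(seed)
    by_class = split_by_class(rows, labels)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_downsample(rows, labels, seed=0):
    """Randomly subset every class (without replacement) down to the
    size of the smallest one."""
    rng = random.Random(seed)
    by_class = split_by_class(rows, labels)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data: 90 majority rows (class 0) and 10 minority rows (class 1).
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10

over_rows, over_labels = random_oversample(rows, labels)
down_rows, down_labels = random_downsample(rows, labels)
print(Counter(over_labels))
print(Counter(down_labels))
```

Over-sampling leaves 90 rows per class and down-sampling leaves 10 rows per class, so the trade-off is between duplicated minority rows and discarded majority rows.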
Try weighted classes. You can use the same algorithms in a cost-sensitive form, where misclassifying one class is penalized more heavily than misclassifying the other. These penalties push the model to pay more attention to the minority class.
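As a rough illustration of class weighting, the sketch below derives weights that are inversely proportional to class frequency and plugs them into a weighted log loss. The inverse-frequency rule is one common heuristic and an assumption here, not something prescribed in the talk; the function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_log_loss(y_true, p_positive, weights):
    """Binary cross-entropy where each example is scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for t, p in zip(y_true, p_positive):
        w = weights[t]
        total += -w * (math.log(p) if t == 1 else math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Toy labels: 90 majority examples (0) and 10 minority examples (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)

# A lazy model that gives every example a 10% chance of being positive.
lazy_probabilities = [0.1] * len(labels)
plain_loss = weighted_log_loss(labels, lazy_probabilities, {0: 1.0, 1: 1.0})
weighted_loss = weighted_log_loss(labels, lazy_probabilities, weights)
print(plain_loss, weighted_loss)
```

With these weights a mistake on a minority example costs nine times as much as a mistake on a majority example, so the lazy model that mostly ignores the minority class is penalized much more heavily under the weighted loss than under the plain one.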
The goal of a machine learning model is to generalize, that is, to make reliable predictions on new, unseen data. Given a data set with a non-linear pattern, you underfit when you try to fit it with a linear model, which is too simple to capture the pattern. On the other hand, you overfit when you fit a high-degree polynomial that hits every single point perfectly but fails to predict additional observations. To detect these situations it is good practice to split the data set into a train set and a test set. You fit the model on the train set and use the test set for an unbiased evaluation of the model on data it has never seen. You can then compare the training and test error curves: a good fit is reached when both curves are close to each other at a low error.
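The following sketch reproduces this situation on a small scale, assuming a made-up noisy sine data set. A straight line plays the underfitting model and a degree-9 Lagrange polynomial through all ten training points plays the overfitting one; both helpers are written from scratch for illustration, and the test set is simply a second set of noisy points placed between the training points.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial of degree len(xs) - 1 through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)


def noisy_sine(x):
    """Made-up ground truth: a sine wave plus Gaussian noise."""
    return math.sin(x) + rng.gauss(0.0, 0.3)


train_x = [float(i) for i in range(10)]
train_y = [noisy_sine(x) for x in train_x]
test_x = [i + 0.5 for i in range(9)]  # held-out points between the training points
test_y = [noisy_sine(x) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

errors = {
    "line": (mse(line, train_x, train_y), mse(line, test_x, test_y)),
    "polynomial": (mse(poly, train_x, train_y), mse(poly, test_x, test_y)),
}
for name, (train_err, test_err) in errors.items():
    print(f"{name}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The polynomial reaches a training error of zero because it passes through every training point, noise included, yet its error on the held-out points is not zero. The line shows the opposite symptom: its error is already clearly above zero on the training set.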
Physics-based models and dynamical systems (for instance when working on satellite images) often rely on partial differential equations (PDEs) that have no closed-form solution and must be approximated numerically. A machine learning model can replace this numerical approximation and save computational time. In these situations overfitting can be useful, provided two conditions are met: there is no noise, so the labels are accurate for all instances, and the complete data set is available. In this case, overfitting amounts to interpolating the data set.
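A very reduced sketch of the idea follows. The function `expensive_solver` is a hypothetical stand-in for a costly, deterministic numerical solver, and a plain lookup table with linear interpolation stands in for the machine learning model; in practice the surrogate would more likely be a neural network. What the example shares with the pattern is that the labels are noise-free and the whole input range is covered, so memorizing the training data is exactly what is wanted.

```python
import math


def expensive_solver(x):
    """Hypothetical stand-in for a costly, deterministic physics computation."""
    return math.exp(-x) * math.cos(x)


class LookupInterpolator:
    """Memorizes noise-free solver outputs on a grid and interpolates linearly."""

    def __init__(self, func, lo, hi, n_points):
        self.lo = lo
        self.step = (hi - lo) / (n_points - 1)
        self.values = [func(lo + i * self.step) for i in range(n_points)]

    def predict(self, x):
        pos = (x - self.lo) / self.step
        # Clamp the cell index so the right-hand edge of the range stays valid.
        i = min(max(int(pos), 0), len(self.values) - 2)
        frac = pos - i
        return self.values[i] + frac * (self.values[i + 1] - self.values[i])


def max_abs_error(model, points):
    """Largest absolute gap between the surrogate and the solver."""
    return max(abs(model.predict(x) - expensive_solver(x)) for x in points)


# "Training set": the complete input range [0, 5] on a grid with step 0.01.
surrogate = LookupInterpolator(expensive_solver, lo=0.0, hi=5.0, n_points=501)

grid_points = [i * 0.01 for i in range(501)]
between_points = [i * 0.01 + 0.005 for i in range(500)]

grid_error = max_abs_error(surrogate, grid_points)
between_error = max_abs_error(surrogate, between_points)
print(grid_error, between_error)
```

The surrogate reproduces the memorized points essentially exactly and stays close to the solver in between, because there is no noise to be fooled by and no region of the input range left uncovered.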
Explainable AI is the process of understanding and communicating how or why a machine learning model makes certain predictions. It is becoming more and more necessary because the most powerful models, such as deep neural networks, tend to be very hard to explain. Model understanding matters to many stakeholders: engineers can improve performance and build better models, customers gain trust in what the model is doing, and regulators are supported in their compliance and reporting activities.
Understanding a model’s behaviour is crucial for many tasks, such as explaining why an individual data point received a given prediction, debugging odd model behaviour and presenting the gist of the model to stakeholders.
Explainability methods can be split into intrinsic and post-hoc methods. Intrinsic methods refer to simple models, such as decision trees and sparse linear models, whose structure lends itself to understanding the results. Post-hoc methods analyse the model after training. They can provide understanding at a local level, looking at the contributions of the inputs to an individual prediction, or at a global level, looking at the average contribution of each input. In both cases you can apply a model-specific method, which by definition works only for certain kinds of models, or a model-agnostic method, which works for any model because it treats the model as a black box. Which technique you use depends a lot on your use case, and may also depend on your data type.
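As one possible example of a post-hoc, model-agnostic, global method, the sketch below implements a basic permutation importance: shuffle one input column at a time and measure how much the error grows. This specific technique is chosen here for illustration and was not necessarily the one shown in the talk; `black_box_model` and the data are made up.

```python
import random


def black_box_model(row):
    """Hypothetical trained model. The explainer only calls it and never
    looks inside. It happens to ignore the second feature, row[1]."""
    return 3.0 * row[0]


def mse(model, rows, targets):
    """Mean squared error of the model on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, seed=0):
    """For each feature, the increase in error after shuffling its column."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        importances.append(mse(model, permuted, targets) - baseline)
    return importances


# Made-up data: two features, and a target that depends only on the first.
data_rng = random.Random(42)
rows = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
targets = [3.0 * r[0] for r in rows]

importances = permutation_importance(black_box_model, rows, targets)
print(importances)
```

Shuffling the first feature increases the error, while shuffling the second changes nothing, which matches the fact that the model ignores it. Because the method only needs predictions, the same function would work unchanged for any other model that maps a row to a number.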
Valliappa Lakshmanan, Sara Robinson, Michael Munn, “Machine Learning Design Patterns”, O’Reilly, 2020.
Written by Claudio G. Giancaterino