Introduction by example¶
rexmex is a recommender system evaluation metric library. First, it provides a comprehensive collection of metrics for the evaluation of recommender systems. Second, it includes a variety of classes and methods for reporting and plotting performance results. The implemented metrics cover both well-established measures and newly proposed ones from data mining conferences and prominent journals.
Overview¶
We briefly overview the fundamental concepts and features of rexmex through simple examples.
Design philosophy¶
rexmex is designed with the assumption that end users might want to use the evaluation metrics and utility functions without the metric sets and score cards. Because of this, the evaluation metrics and utility functions (e.g. binarisation and normalisation) can be used independently from the rest of the library.
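For example, a single metric function can be imported and called on plain arrays without touching any other part of the library. A minimal sketch (the labels and scores below are made up for illustration):

import numpy as np

from rexmex.metrics.classification import pr_auc_score

# Illustrative ground truth labels and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.8])

# The metric function is self-contained: no metric set or score card needed.
print(pr_auc_score(y_true, y_score))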
Synthetic toy datasets¶
rexmex is designed with the assumption that the predictions and the ground truth are stored in a pandas DataFrame. In our example we assume that this DataFrame has at least two columns, y_true and y_score. The first contains the ground truth labels/ratings, while the second contains the predictions. Each row in the DataFrame is a single user-item style pairing of a source and a target with a ground truth value and a prediction. Additional columns represent groupings of the model predictions. Our library provides synthetic data which can be used for testing. The following lines import a dataset and print the head of the table.
from rexmex.dataset import DatasetReader
reader = DatasetReader()
scores = reader.read_dataset()
print(scores.head())
   source_id  target_id  source_group  target_group  y_true   y_score
0          0        322             0             0       1  0.244729
1          0        435             0             0       1  0.245294
2          0       2839             0             1       1  0.182597
3          0       3348             0             2       1  0.798757
4          0       5805             0             3       1  0.569672
Let us overview the structure of the scores DataFrame used in our example before we look at the core functionality of the library. First of all, we observe that it is unindexed, has 6 columns, and each row is a prediction. The first two columns, source_id and target_id, correspond to the user and item identifiers. The next two columns, source_group and target_group, help with the calculation of group performance metrics. Finally, y_true is a vector of ground truth values and y_score represents the predicted probabilities.
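These structural claims are easy to check with standard pandas calls; a small sketch using the scores table loaded above:

# The table has one row per prediction and exactly 6 columns.
print(scores.shape)
# Column names, in order.
print(list(scores.columns))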
Evaluation metrics¶
The generic design of rexmex places each metric in an appropriate namespace. For example, pr_auc_score lives in the rexmex.metrics.classification namespace, because it is a classification metric. Functions in the same namespace share the same signature. This specific function takes a target vector and a prediction vector (we use the toy dataset) and outputs the precision-recall area under the curve value as a float.
from rexmex.metrics.classification import pr_auc_score
pr_auc_value = pr_auc_score(scores["y_true"], scores["y_score"])
print("{:.3f}".format(pr_auc_value))
0.919
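Because the functions in this namespace share the (y_true, y_score) signature, one metric can be swapped for another without changing the calling code. A hedged sketch, assuming that roc_auc_score is also exported by rexmex.metrics.classification:

# roc_auc_score is assumed to live in the same namespace as pr_auc_score.
from rexmex.metrics.classification import pr_auc_score, roc_auc_score

# Both functions are called with the identical (y_true, y_score) signature.
for metric in (pr_auc_score, roc_auc_score):
    print(metric.__name__, metric(scores["y_true"], scores["y_score"]))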
Metric sets¶
A MetricSet() is a base class which inherits from dict; it stores the names of the evaluation metrics as keys and the corresponding evaluation metric functions as values. Each of these functions should have the same signature. There are specialised MetricSet() variants which inherit from the base class, such as the ClassificationMetricSet(). The following example prints the classification metrics stored in this metric set.
from rexmex.metricset import ClassificationMetricSet
metric_set = ClassificationMetricSet()
metric_set.print_metrics()
{'negative_likelihood_ratio', 'false_discovery_rate', 'matthews_correlation_coefficient', 'markedness', 'hit_rate', 'roc_auc', 'specificity', 'negative_predictive_value', 'informedness', 'diagnostic_odds_ratio', 'threat', 'false_omission_rate', 'balanced_accuracy', 'average_precision', 'false_positive_rate', 'recall', 'miss_rate', 'precision', 'fowlkes_mallows_index', 'positive_likelihood_ratio', 'accuracy', 'positive_predictive_value', 'true_positive_rate', 'critical_success_index', 'sensitivity', 'f1', 'pr_auc', 'selectivity', 'true_negative_rate', 'false_negative_rate', 'fall_out', 'prevalence_threshold'}
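Because a metric set is a dict, ordinary dict operations apply to it. A sketch of registering a custom metric under a new key; the fraction_positive name and function are illustrative and not part of rexmex:

def fraction_positive(y_true, y_score):
    # Illustrative custom metric with the shared (y_true, y_score) signature.
    return float((y_score > 0.5).mean())

metric_set["fraction_positive"] = fraction_positive  # plain dict assignment
print("fraction_positive" in metric_set)  # True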
Metric sets also allow the filtering of metrics which are interesting for a specific application. In our case we will only keep 3 of the metrics: roc_auc, pr_auc and accuracy.
metric_set.filter_metrics(["roc_auc", "pr_auc", "accuracy"])
metric_set.print_metrics()
{'roc_auc', 'pr_auc', 'accuracy'}
Score cards¶
Score cards allow the calculation of performance metrics for a whole metric set with ease. Let us create a score card and reuse the filtered metric set to calculate the performance metrics on the toy example. The ScoreCard() constructor takes the metric_set instance and the generate_report method uses the scores from earlier. The result is a DataFrame of the scores.
from rexmex.scorecard import ScoreCard
score_card = ScoreCard(metric_set)
report = score_card.generate_report(scores)
print(report)
    roc_auc  accuracy    pr_auc
0  0.792799  0.684188  0.900368
The score cards allow the advanced reporting of the performance metrics. We could also group on the source_group and target_group keys and get specific subgroup performances, just like this:
report = score_card.generate_report(scores, ["source_group", "target_group"])
print(report)
                               roc_auc  accuracy    pr_auc
source_group target_group
0            0             0  0.580047  0.575397  0.912304
             1             0  0.574348  0.576501  0.921771
             2             0  0.561425  0.574823  0.916881
             3             0  0.578944  0.567548  0.924734
             4             0  0.578380  0.575534  0.921595
             5             0  0.565587  0.553327  0.927879
1            1             0  0.634120  0.655391  0.891537
             2             0  0.623156  0.622572  0.892552
             3             0  0.625783  0.632247  0.891662
             4             0  0.637060  0.633891  0.893618
             5             0  0.592871  0.603951  0.887618
2            2             0  0.858091  0.854871  0.930708
             3             0  0.834505  0.834299  0.909151
             4             0  0.837995  0.838511  0.909978
             5             0  0.834062  0.832095  0.908878
3            3             0  0.835858  0.843683  0.920581
             4             0  0.822551  0.820680  0.901642
             5             0  0.817555  0.816832  0.903191
4            4             0  0.921859  0.921260  0.947434
             5             0  0.938959  0.938816  0.958132
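Since the report is an ordinary pandas DataFrame, standard pandas tooling can be used to post-process it; for example (the file name below is an arbitrary illustration):

# Persist the grouped report for later analysis; the path is illustrative.
report.to_csv("subgroup_performance.csv")
# Locate the subgroup with the highest ROC AUC.
print(report["roc_auc"].idxmax())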
Utility functions¶
A core idea of rexmex is the use of wrapper functions to help with recurring data manipulation. Our utility functions can be used to wrap the metrics when the predictions need to be transformed. For example, the y_score values in our dataset are not binary, so most classification metrics are not directly meaningful on them. Wrapping a classification metric in the binarize function ensures that there is a binarization step before the metric is calculated. Let us take a look at this example snippet:
from rexmex.metrics.classification import accuracy_score
from rexmex.utils import binarize
new_accuracy_score = binarize(accuracy_score)
accuracy_value = new_accuracy_score(scores.y_true, scores.y_score)
print("{:.3f}".format(accuracy_value))
0.684
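To illustrate the wrapper pattern itself, here is a minimal sketch of what such a wrapper could look like. The threshold_binarize name, the threshold parameter and the fixed 0.5 cutoff are illustrative assumptions, not the actual rexmex implementation:

def threshold_binarize(metric, threshold=0.5):
    # Hypothetical wrapper, not rexmex internals: binarize y_score with a
    # fixed cutoff before applying the wrapped metric.
    def wrapped(y_true, y_score):
        y_binary = (y_score >= threshold).astype(int)
        return metric(y_true, y_binary)
    return wrapped

thresholded_accuracy = threshold_binarize(accuracy_score)
print("{:.3f}".format(thresholded_accuracy(scores.y_true, scores.y_score)))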