Introduction by example

rexmex is a recommender system evaluation metric library. First, it provides a comprehensive collection of metrics for the evaluation of recommender systems. Second, it includes a variety of classes and methods for reporting and plotting the performance results. The implemented metrics cover both well-known measures and newly proposed ones from data mining conferences and prominent journals.

Overview


We briefly overview the fundamental concepts and features of rexmex through simple examples, covering the design philosophy, the synthetic toy datasets, the evaluation metrics, the metric sets, the score cards, and the utility functions.

Design philosophy

rexmex is designed with the assumption that end users might want to use the evaluation metrics and utility functions without using the metric sets and score cards. Because of this, the evaluation metrics and utility functions (e.g. binarization and normalization) can be used independently from the rest of the library.
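
For instance, a single metric function can be imported on its own and applied to plain array-like inputs, with no metric set or score card involved. The snippet below is a minimal sketch with made-up toy vectors; it assumes that pr_auc_score (introduced later in this guide) accepts any array-like ground truth and score vectors.

from rexmex.metrics.classification import pr_auc_score

# Hand-made toy vectors, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_score = [0.9, 0.3, 0.8, 0.4, 0.2, 0.7]

print(pr_auc_score(y_true, y_score))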

Synthetic toy datasets

rexmex is designed with the assumption that the predictions and the ground truth are stored in a pandas DataFrame. In our example we assume that this DataFrame has at least the two columns y_true and y_score. The first one contains the ground truth labels/ratings, while the second one contains the predictions. Each row in the DataFrame is a single user-item style pairing of a source and a target with its ground truth and prediction. Additional columns represent groupings of the model predictions. Our library provides synthetic data which can be used for testing the library. The following lines import a dataset and print the head of the table.

from rexmex.dataset import DatasetReader

reader = DatasetReader()
scores = reader.read_dataset()

print(scores.head())
   source_id  target_id  source_group  target_group  y_true   y_score
0          0        322             0             0       1  0.244729
1          0        435             0             0       1  0.245294
2          0       2839             0             1       1  0.182597
3          0       3348             0             2       1  0.798757
4          0       5805             0             3       1  0.569672

Let us overview the structure of the scores DataFrame used in our example before we look at the core functionalities of the library. First of all, we observe that it is unindexed, has 6 columns, and that each row is a prediction. The first two columns, source_id and target_id, correspond to the user and item identifiers. The next two columns, source_group and target_group, help with the calculation of group performance metrics. Finally, y_true is a vector of ground truth values and y_score contains the predicted probabilities.
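
The synthetic reader is only a convenience: any DataFrame that follows this layout can be evaluated in exactly the same way. The sketch below builds a tiny scores table by hand, with made-up values, just to illustrate the expected columns.

import pandas as pd

# A hand-crafted DataFrame with the same columns as the synthetic dataset.
toy_scores = pd.DataFrame({
    "source_id": [0, 0, 1, 1],
    "target_id": [10, 11, 10, 12],
    "source_group": [0, 0, 1, 1],
    "target_group": [0, 1, 0, 1],
    "y_true": [1, 0, 1, 0],
    "y_score": [0.91, 0.35, 0.62, 0.18],
})

print(toy_scores.head())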

Evaluation metrics

The generic design of rexmex places each metric in the appropriate namespace. For example, pr_auc_score lives in the rexmex.metrics.classification namespace because it is a classification metric. Functions in the same namespace share the same signature. This specific function takes a target vector and a prediction vector (we use the toy dataset) and outputs the precision-recall area under the curve value as a float.

from rexmex.metrics.classification import pr_auc_score

pr_auc_value = pr_auc_score(scores["y_true"], scores["y_score"])
print("{:.3f}".format(pr_auc_value))
0.919
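
Because functions in the same namespace share this (y_true, y_score) signature, switching to another metric only changes the import. The sketch below assumes that a roc_auc_score function is exposed in the same namespace (the metric appears as roc_auc in the metric set listing below); its value should agree with the roc_auc column reported by the score card later on.

from rexmex.metrics.classification import roc_auc_score  # assumed to sit next to pr_auc_score

# Same call signature: ground truth first, predicted scores second.
roc_auc_value = roc_auc_score(scores["y_true"], scores["y_score"])
print("{:.3f}".format(roc_auc_value))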

Metric sets

A MetricSet() is a base class which inherits from dict; it stores the names of the evaluation metrics as keys and the corresponding evaluation metric functions as values. Each of these functions should have the same signature. There are specialized MetricSet() variants which inherit from the base class, such as the ClassificationMetricSet(). The following example prints the classification metrics stored in this metric set.

from rexmex.metricset import ClassificationMetricSet

metric_set = ClassificationMetricSet()
metric_set.print_metrics()
{'negative_likelihood_ratio', 'false_discovery_rate', 'matthews_correlation_coefficient', 'markedness', 'hit_rate', 'roc_auc', 'specificity', 'negative_predictive_value', 'informedness', 'diagnostic_odds_ratio', 'threat', 'false_omission_rate', 'balanced_accuracy', 'average_precision', 'false_positive_rate', 'recall', 'miss_rate', 'precision', 'fowlkes_mallows_index', 'positive_likelihood_ratio', 'accuracy', 'positive_predictive_value', 'true_positive_rate', 'critical_success_index', 'sensitivity', 'f1', 'pr_auc', 'selectivity', 'true_negative_rate', 'false_negative_rate', 'fall_out', 'prevalence_threshold'}
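
Because the metric set is a dictionary from names to functions, an individual metric can also be looked up by name and called directly. This is a small sketch assuming the stored callables follow the shared (y_true, y_score) signature; the result should match the pr_auc_score value computed above.

# Look up a single metric function by its name and call it directly.
pr_auc = metric_set["pr_auc"]
print("{:.3f}".format(pr_auc(scores["y_true"], scores["y_score"])))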

Metric sets also allow filtering, keeping only the metrics that are relevant for a specific application. In our case we will only keep three of the metrics: roc_auc, pr_auc and accuracy.

metric_set.filter_metrics(["roc_auc", "pr_auc", "accuracy"])
metric_set.print_metrics()
{'roc_auc', 'pr_auc', 'accuracy'}

Score cards

Score cards allow the calculation of performance metrics for a whole metric set with ease. Let us create a score card that reuses the filtered metric set and calculates the performance metrics for the toy example. The ScoreCard() constructor takes the metric_set instance and the generate_report method is called with the scores DataFrame from earlier. The result is a DataFrame of the scores.

from rexmex.scorecard import ScoreCard

score_card = ScoreCard(metric_set)
report = score_card.generate_report(scores)
print(report)
    roc_auc  accuracy    pr_auc
0  0.792799  0.684188  0.900368

Score cards also allow more advanced reporting of the performance metrics. We could, for example, group on the source_group and target_group keys and get subgroup-specific performances, just like this:

report = score_card.generate_report(scores, ["source_group", "target_group"])
print(report)
                              roc_auc  accuracy    pr_auc
source_group target_group                                
0            0            0  0.580047  0.575397  0.912304
             1            0  0.574348  0.576501  0.921771
             2            0  0.561425  0.574823  0.916881
             3            0  0.578944  0.567548  0.924734
             4            0  0.578380  0.575534  0.921595
             5            0  0.565587  0.553327  0.927879
1            1            0  0.634120  0.655391  0.891537
             2            0  0.623156  0.622572  0.892552
             3            0  0.625783  0.632247  0.891662
             4            0  0.637060  0.633891  0.893618
             5            0  0.592871  0.603951  0.887618
2            2            0  0.858091  0.854871  0.930708
             3            0  0.834505  0.834299  0.909151
             4            0  0.837995  0.838511  0.909978
             5            0  0.834062  0.832095  0.908878
3            3            0  0.835858  0.843683  0.920581
             4            0  0.822551  0.820680  0.901642
             5            0  0.817555  0.816832  0.903191
4            4            0  0.921859  0.921260  0.947434
             5            0  0.938959  0.938816  0.958132
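
Since the report is an ordinary pandas DataFrame with a group-based index, standard pandas tooling can be used to post-process it, for example to flatten the index or persist the results. A minimal sketch:

# Flatten the group index into regular columns and save the report to disk.
flat_report = report.reset_index()
flat_report.to_csv("group_performance.csv", index=False)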

Utility functions

A core idea of rexmex is the use of wrapper functions to help with recurring data manipulation. Our utility functions can be used to wrap the metrics when the predictions need to be transformed. For example, when the y_score values are not binary, most classification metrics are not meaningful. Wrapping a classification metric in the binarize function ensures that there is a binarization step before the metric is computed. Let us take a look at this example snippet:

from rexmex.metrics.classification import accuracy_score
from rexmex.utils import binarize

new_accuracy_score = binarize(accuracy_score)
accuracy_value = new_accuracy_score(scores.y_true, scores.y_score)
print("{:.3f}".format(accuracy_value))
0.684
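
Conceptually, such a wrapper only needs to turn the continuous scores into hard 0/1 labels before delegating to the wrapped metric. The sketch below is an illustrative approximation of that idea, not the actual rexmex implementation, which may use a different thresholding rule.

import numpy as np

def binarize_sketch(metric):
    # Illustrative wrapper: threshold the scores at 0.5 before calling the metric.
    def wrapped(y_true, y_score):
        y_binary = (np.asarray(y_score) >= 0.5).astype(int)
        return metric(y_true, y_binary)
    return wrapped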