RexMex Documentation

rexmex is a recommender system evaluation metric library. First, it provides a comprehensive collection of metrics for the evaluation of recommender systems. Second, it includes a variety of classes and methods for reporting and plotting the performance results. The implemented metrics cover a range of well-known metrics and newly proposed metrics from data mining conferences and prominent journals.

Rexmex

Score Cards

class CoverageScoreCard(metric_set: Mapping[str, Callable[[numpy.array, numpy.array], float]], all_users: Collection[str], all_items: Collection[str])[source]

A coverage scorecard can be used to aggregate coverage-related metrics, plot them, and generate performance reports.

generate_report(recs_to_evaluate: pandas.core.frame.DataFrame, grouping: Optional[List[str]] = None)pandas.core.frame.DataFrame[source]

A method to calculate (aggregated) coverage/performance metrics based on a dataframe of recommendations. It assumes that the dataframe has user and item columns.

Parameters
  • recs_to_evaluate (pd.DataFrame) – A dataframe holding the recommendations (users, items). Contains the columns user and item.

  • grouping (list) – A list of performance grouping variable names (e.g., different recommender settings).

Returns

The performance report.

Return type

report (pd.DataFrame)

get_coverage_metrics(recommendations: List[Tuple])pandas.core.frame.DataFrame[source]

Gets all coverage (performance) values using the defined metric_set. It expects a list of tuples of user/item combinations, e.g., [(user_1, item_1), (user_2, item_1)]. The space of possible users and items to recommend is defined during initialisation of this class.

Parameters
  • recommendations (List[Tuple]) – Recommendations of items to users, made by the evaluated system. The user has to decide which score or confidence levels to use prior to calling this ScoreCard.

Returns

The coverage (performance) metrics calculated from the recommendations.

Return type

performance_metrics (pd.DataFrame)

metric_set: Mapping[str, Callable[[numpy.array, numpy.array], float]]
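
Example usage (a minimal sketch; it assumes CoverageScoreCard is importable from rexmex.scorecard and CoverageMetricSet from rexmex.metricset, mirroring the ScoreCard and ClassificationMetricSet imports used in the introduction below):

import pandas as pd

from rexmex.metricset import CoverageMetricSet
from rexmex.scorecard import CoverageScoreCard

# The full space of users and items the recommender could recommend over.
all_users = ["u1", "u2", "u3"]
all_items = ["i1", "i2", "i3", "i4"]

# One row per recommended (user, item) pair, with the column names generate_report expects.
recommendations = pd.DataFrame({"user": ["u1", "u1", "u2"], "item": ["i1", "i2", "i1"]})

score_card = CoverageScoreCard(CoverageMetricSet(), all_users, all_items)
report = score_card.generate_report(recommendations)
print(report)
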
class ScoreCard(metric_set: Mapping[str, Callable[[numpy.array, numpy.array], float]])[source]

A scorecard can be used to aggregate metrics, plot them, and generate performance reports.

filter_scores(scores: pandas.core.frame.DataFrame, training_set: pandas.core.frame.DataFrame, testing_set: pandas.core.frame.DataFrame, validation_set: pandas.core.frame.DataFrame, columns: List[str])pandas.core.frame.DataFrame[source]

A method to filter out those entries which also appear in either the training, testing or validation sets. The original is here: https://papers.nips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf

Parameters
  • scores (pd.DataFrame) – A dataframe with the scores.

  • training_set (pd.DataFrame) – A dataframe of training data points.

  • testing_set (pd.DataFrame) – A dataframe of testing data points.

  • validation_set (pd.DataFrame) – A dataframe of validation data points.

  • columns (list) – A list of column names used for cross-referencing.

Returns

The scores for data points which are not in the reference sets.

Return type

scores (pd.DataFrame)
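
Example usage (an illustrative sketch; the column names below are only assumptions made for the example):

import pandas as pd

from rexmex.metricset import ClassificationMetricSet
from rexmex.scorecard import ScoreCard

score_card = ScoreCard(ClassificationMetricSet())

scores = pd.DataFrame({
    "source_id": [0, 0, 1],
    "target_id": [10, 11, 10],
    "y_true": [1, 0, 1],
    "y_score": [0.9, 0.2, 0.7],
})
training_set = pd.DataFrame({"source_id": [0], "target_id": [10]})
testing_set = pd.DataFrame({"source_id": [1], "target_id": [10]})
validation_set = pd.DataFrame({"source_id": [2], "target_id": [12]})

filtered = score_card.filter_scores(scores, training_set, testing_set, validation_set,
                                    columns=["source_id", "target_id"])
# Following the documented semantics, only the (source_id=0, target_id=11) row should remain,
# since it appears in none of the reference sets.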

generate_report(scores_to_evaluate: pandas.core.frame.DataFrame, grouping: Optional[List[str]] = None)pandas.core.frame.DataFrame[source]

A method to calculate (aggregated) performance metrics based on a dataframe of ground truth and predictions. It assumes that the dataframe has y_true and y_score columns.

Parameters
  • scores_to_evaluate (pd.DataFrame) – A dataframe with the scores and the ground truth. It has the y_true and y_score keys.

  • grouping (list) – A list of performance grouping variable names.

Returns

The performance report.

Return type

report (pd.DataFrame)

get_performance_metrics(y_true: numpy.array, y_score: numpy.array)pandas.core.frame.DataFrame[source]

A method to get the performance metrics for a pair of vectors.

Parameters
  • y_true (np.array) – A vector of ground truth values.

  • y_score (np.array) – A vector of model predictions.

Returns

The performance metrics calculated from the vectors.

Return type

performance_metrics (pd.DataFrame)
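
Example usage (a minimal sketch using the ClassificationMetricSet shown later in the introduction):

import numpy as np

from rexmex.metricset import ClassificationMetricSet
from rexmex.scorecard import ScoreCard

score_card = ScoreCard(ClassificationMetricSet())

y_true = np.array([1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.3, 0.8, 0.4, 0.1])

# A dataframe of metric values computed from the two vectors.
metrics = score_card.get_performance_metrics(y_true, y_score)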

metric_set: Mapping[str, Callable[[numpy.array, numpy.array], float]]
print_metrics()[source]

Prints the names of the metrics.

Metric Sets

class ClassificationMetricSet[source]

A set of classification metrics with the following metrics included:

Area Under the Receiver Operating Characteristic Curve
Area Under the Precision Recall Curve
Average Precision
F-1 Score
Matthew’s Correlation Coefficient
Fowlkes-Mallows Index
Precision
Recall
Specificity
Accuracy
Balanced Accuracy
class CoverageMetricSet[source]

A set of coverage metrics with the following metrics included:

Item Coverage
User Coverage

class MetricSet[source]

A metric set is a special dictionary that contains metric name keys and evaluation metric function values.

add_metrics(metrics: List[Tuple])[source]

A method to add metric functions from a list of function names and functions.

Parameters

metrics (List[Tuple]) – A list of metric name and metric function tuples.

Returns

The metric set after the metrics were added.

Return type

self

filter_metrics(filter: Collection[str])[source]

A method to keep only the listed metrics.

Parameters

filter – A list of metric names to keep.

Returns

The metric set after the non-listed metrics were filtered out.

Return type

self

print_metrics()[source]

Prints the names of the metrics.
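
Example usage (a sketch that ties the MetricSet methods above together; the custom metric is purely illustrative and only relies on the documented (name, function) tuple format of add_metrics):

import numpy as np

from rexmex.metricset import ClassificationMetricSet


def positive_rate(y_true: np.ndarray, y_score: np.ndarray) -> float:
    # Illustrative custom metric: fraction of predictions scored above 0.5.
    return float(np.mean(y_score > 0.5))


metric_set = ClassificationMetricSet()
metric_set.add_metrics([("positive_rate", positive_rate)])
metric_set.filter_metrics(["roc_auc", "pr_auc", "positive_rate"])
metric_set.print_metrics()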

class RankingMetricSet[source]

A set of ranking metrics with the following metrics included:

class RatingMetricSet[source]

A set of rating metrics with the following metrics included:

Mean Absolute Error
Mean Squared Error
Root Mean Squared Error
Mean Absolute Percentage Error
Symmetric Mean Absolute Percentage Error
Coefficient of Determination
Pearson Correlation Coefficient
normalize_metrics()[source]

A method to normalize a set of metrics.

Returns

The metric set after the metrics were normalized.

Return type

self
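
Example usage (a short sketch, assuming the RatingMetricSet is combined with a ScoreCard as in the introduction below):

from rexmex.metricset import RatingMetricSet
from rexmex.scorecard import ScoreCard

metric_set = RatingMetricSet()
metric_set.normalize_metrics()  # every rating metric now rescales y_true and y_score first

score_card = ScoreCard(metric_set)
# report = score_card.generate_report(scores)  # scores: a dataframe with y_true and y_score columns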

Ranking Metrics

average_precision_at_k(relevant_items: numpy.array, recommendation: numpy.array, k=10)[source]

Calculate the average precision at k (AP@K) of items in a ranked list.

Parameters
  • relevant_items (array-like) – An N x 1 array of relevant items.

  • recommendation (array-like) – An N x 1 array of ordered items.

  • k (int) – the number of items considered in the predicted list.

Returns

The average precision @ k of a predicted list.

Return type

AP@K (float)

Original
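
Example usage (assuming the ranking metrics live in the rexmex.metrics.ranking module referenced by the signatures above):

import numpy as np

from rexmex.metrics.ranking import average_precision_at_k

ap_at_k = average_precision_at_k(
    relevant_items=np.array([1, 2]),
    recommendation=np.array([3, 2, 1]),
    k=3,
)
# Hits at ranks 2 and 3 give precisions 1/2 and 2/3, so AP@3 is roughly 0.583.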

average_recall_at_k(relevant_items: List, recommendation: List, k: int = 10)[source]

Calculate the average recall at k (AR@K) of items in a ranked list.

Parameters
  • relevant_items (array-like) – An N x 1 array of relevant items.

  • recommendation (array-like) – An N x 1 array of items.

  • k (int) – the number of items considered in the predicted list.

Returns

The average recall @ k of a predicted list.

Return type

AR@K (float)

discounted_cumulative_gain(y_true: numpy.array, y_score: numpy.array)[source]

Computes the Discounted Cumulative Gain (DCG), a sum of the true scores ordered by the predicted scores, and then penalized by a logarithmic discount based on ordering.

Parameters
  • y_true (array-like) – An N x M array of ground truth values, where M > 1 for multilabel classification problems.

  • y_score (array-like) – An N x M array of predicted values, where M > 1 for multilabel classification problems.

Returns

Discounted Cumulative Gain

Return type

DCG (float)
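
Example usage (a minimal sketch; module path assumed, as above):

import numpy as np

from rexmex.metrics.ranking import discounted_cumulative_gain

# Graded relevance labels and the model scores used to order them (a single row).
y_true = np.array([[3, 2, 3, 0, 1, 2]])
y_score = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]])

dcg = discounted_cumulative_gain(y_true, y_score)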

gmean_rank(relevant_items: Sequence[rexmex.metrics.ranking.X], recommendation: Sequence[rexmex.metrics.ranking.X])float[source]

Calculate the geometric mean rank (GMR) of items in a ranked list.

Parameters
  • relevant_items – An N x 1 sequence of relevant items.

  • recommendation – An N x 1 sequence of ordered items.

Returns

The geometric mean rank of the relevant items in the recommendation.

hits_at_k(relevant_items: numpy.array, recommendation: numpy.array, k=10)[source]

Calculate the number of hits (HITS@K) of relevant items in a ranked list.

Parameters
  • relevant_items (array-like) – A 1 x N array of relevant items.

  • recommendation (array-like) – A 1 x N array of predicted items.

  • k (int) – the number of items considered in the predicted list

Returns

The number of relevant items in the first k items of a prediction.

Return type

HITS@K (float)

intra_list_similarity(recommendations: List[list], items_feature_matrix: numpy.array)[source]

Calculate the intra-list similarity of recommended items. The items are represented by feature vectors, which are compared with cosine similarity. Each recommendation list consists of item indices, which are used to fetch the item features.

Parameters
  • recommendations (List[list]) – An M x N array of recommendation lists, where M is the number of lists and N the number of recommended items

  • items_feature_matrix (matrix-like) – An N x D matrix, where N is the number of items and D the number of features representing one item

Returns

Average intra-list similarity across the recommendation lists.

Return type

(float)

Original
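
Example usage (an illustrative sketch; module path assumed and the feature matrix is random toy data):

import numpy as np

from rexmex.metrics.ranking import intra_list_similarity

# Four items described by three features each; recommendations are lists of item indices.
items_feature_matrix = np.random.random((4, 3))
recommendations = [[0, 1, 2], [1, 2, 3]]

ils = intra_list_similarity(recommendations, items_feature_matrix)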

kendall_tau(relevant_items: numpy.array, recommendation: numpy.array)[source]

Calculate the Kendall’s tau, measuring the correspondence between two lists.

Parameters
  • relevant_items (array-like) – A 1 x N array of items.

  • recommendation (array-like) – A 1 x N array of items.

Returns

The tau statistic. p-value (float): two-sided p-value for the null hypothesis that there is no association between the two rankings.

Return type

Kendall tau (float)

mean_average_precision_at_k(relevant_items: List[list], recommendations: List[list], k: int = 10)[source]

Calculate the mean average precision at k (MAP@K) across recommendation lists. Each recommendation list should be paired with a list of relevant items. The first recommendation list is evaluated against the first list of relevant items, and so on.

Example usage:

import numpy as np
from rexmex.metrics.ranking import mean_average_precision_at_k

mean_average_precision_at_k(
    relevant_items=np.array(
        [
            [1,2],
            [2,3]
        ]
    ),
    recommendations=np.array([
        [3,2,1],
        [2,1,3]
    ])
)
>>> 0.708333...
Parameters
  • relevant_items (array-like) – An M x N array of relevant items.

  • recommendations (array-like) – An M x N array of recommendation lists.

  • k (int) – the number of items considered in the predicted list.

Returns

The mean average precision @ k across recommendations.

Return type

MAP@K (float)

mean_average_recall_at_k(relevant_items: List[list], recommendations: List[list], k: int = 10)[source]

Calculate the mean average recall at k (MAR@K) for a list of recommendations. Each recommendation should be paired with a list of relevant items. First recommendation list is evaluated against the first list of relevant items, and so on.

Parameters
  • relevant_items (array-like) – An M x R list where M is the number of recommendation lists, and R is the number of relevant items.

  • recommendations (array-like) – An M x N list where M is the number of recommendation lists and N is the number of recommended items.

  • k (int) – the number of items considered in the recommendation.

Returns

The mean average recall @ k across the recommendations.

Return type

MAR@K (float)

mean_rank(relevant_items: Sequence[rexmex.metrics.ranking.X], recommendation: Sequence[rexmex.metrics.ranking.X])float[source]

Calculate the arithmetic mean rank (MR) of items in a ranked list.

Parameters
  • relevant_items – An N x 1 sequence of relevant items.

  • recommendation – An N x 1 sequence of ordered items.

Returns

The arithmetic mean rank of the relevant items in the recommendation.

mean_reciprocal_rank(relevant_items: List, recommendation: List)[source]

Calculate the mean reciprocal rank (MRR) of items in a ranked list.

Parameters
  • relevant_items (array-like) – An N x 1 array of relevant items.

  • recommendation (array-like) – An N x 1 array of ordered items.

Returns

The mean reciprocal rank of the relevant items in the recommendation.

Return type

MRR (float)

normalized_discounted_cumulative_gain(y_true: numpy.array, y_score: numpy.array)[source]

Computes the Normalized Discounted Cumulative Gain (NDCG), a sum of the true scores ordered by the predicted scores, and then penalized by a logarithmic discount based on ordering. The score is normalized to the range [0.0, 1.0].

Parameters
  • y_true (array-like) – An N x M array of ground truth values, where M > 1 for multilabel classification problems.

  • y_score (array-like) – An N x M array of predicted values, where M > 1 for multilabel classification problems.

Returns

Normalized Discounted Cumulative Gain

Return type

NDCG (float)

normalized_distance_based_performance_measure(relevant_items: List, recommendation: List)[source]

Calculates the Normalized Distance-based Performance Measure (NDPM) between two ordered lists. Two matching orderings return 0.0 while two unmatched orderings return 1.0.

Parameters
  • relevant_items (List) – List of items

  • recommendation (List) – The predicted list of items

Returns

Normalized Distance-based Performance Measure

Return type

NDPM (float)

Metric Definition: Yao, Y. Y. “Measuring retrieval effectiveness based on user preference of documents.” Journal of the American Society for Information science 46.2 (1995): 133-145.

Definition from: Shani, Guy, and Asela Gunawardana. “Evaluating recommendation systems.” Recommender systems handbook. Springer, Boston, MA, 2011. 257-297

novelty(recommendations: List[list], item_popularities: dict, num_users: int, k: int = 10)[source]

Calculates the capacity of the recommender system to generate novel and unexpected results.

Parameters
  • recommendations (List[list]) – An M x N array of items, where M is the number of recommendation lists and N the number of recommended items

  • item_popularities (dict) – A dict mapping each item in the recommendations to a popularity value. Popular items have higher values.

  • num_users (int) – The number of users

  • k (int) – The number of items considered in each recommendation.

Returns

novelty

Return type

(float)

Metric Definition: Zhou, T., Kuscsik, Z., Liu, J. G., Medo, M., Wakeling, J. R., & Zhang, Y. C. (2010). Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences, 107(10), 4511-4515.

Original
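
Example usage (an illustrative sketch; module path assumed, and the popularity values stand in for, e.g., interaction counts):

from rexmex.metrics.ranking import novelty

recommendations = [[1, 2, 3], [2, 3, 4]]
item_popularities = {1: 100, 2: 50, 3: 10, 4: 2}

novelty_value = novelty(recommendations, item_popularities, num_users=200, k=3)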

personalization(recommendations: List[list])[source]

Calculates personalization, a measure of similarity between recommendations. A high value indicates that the recommendations are dissimilar, or “personalized”.

Parameters

recommendations (List[list]) – An M x N array of recommended items, where M is the number of recommendation lists and N the number of items

Returns

personalization

Return type

(float)

Original

rank(relevant_item: rexmex.metrics.ranking.X, recommendation: Sequence[rexmex.metrics.ranking.X])float[source]

Calculate the rank of an item in a ranked list of items.

Parameters
  • relevant_item – a target item in the predicted list of items.

  • recommendation – An N x 1 sequence of predicted items.

Returns

The rank of the item.

reciprocal_rank(relevant_item: rexmex.metrics.ranking.X, recommendation: Sequence[rexmex.metrics.ranking.X])float[source]

Calculate the reciprocal rank (RR) of an item in a ranked list of items.

Parameters
  • relevant_item – a target item in the predicted list of items.

  • recommendation – An N x 1 sequence of predicted items.

Returns

The reciprocal rank of the item.

Return type

RR (float)

spearmans_rho(relevant_items: numpy.array, recommendation: numpy.array)[source]

Calculate the Spearman’s rank correlation coefficient (Spearman’s rho) between two lists.

Parameters
  • relevant_items (array-like) – A 1 x N array of items.

  • recommendation (array-like) – A 1 x N array of items.

Returns

Spearman’s rho. p-value (float): two-sided p-value for the null hypothesis that the two rankings are uncorrelated.

Return type

(float)

Classification Metrics

accuracy_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the accuracy score for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of the accuracy score.

Return type

(float)

average_precision_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the average precision for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of average precision.

Return type

average_precision (float)

balanced_accuracy_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the balanced accuracy for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of the balanced accuracy score.

Return type

balanced_accuracy (float)

condition_negative(y_true: numpy.array)float[source]

Calculate the number of instances which are negative.

Parameters

y_true (array-like) – An N x 1 array of ground truth values.

Returns

The number of negative instances.

Return type

cn (float)

condition_positive(y_true: numpy.array)float[source]

Calculate the number of instances which are positive.

Parameters

y_true (array-like) – An N x 1 array of ground truth values.

Returns

The number of positive instances.

Return type

cp (float)

critical_success_index(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the critical success index (duplicate of threat_score()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The critical success index value.

Return type

ts (float)

diagnostic_odds_ratio(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the diagnostic odds ratio.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The diagnostic odds ratio value.

Return type

dor (float)

f1_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the F-1 score for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of the F-1 score.

Return type

f1 (float)

fall_out(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the fall out (duplicate of false_positive_rate()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The fall out value.

Return type

fpr (float)

false_discovery_rate(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the false discovery rate.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The false discovery rate value.

Return type

fdr (float)

false_negative(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the number of false negatives.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The number of false negatives.

Return type

fn (float)

false_negative_rate(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the false negative rate (duplicated in miss_rate()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The false negative rate value.

Return type

fnr (float)

false_omission_rate(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the false omission rate.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The false omission rate value.

Return type

fomr (float)

false_positive(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the number of false positives.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The number of false positives.

Return type

fp (float)

false_positive_rate(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the false positive rate (duplicated in fall_out()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The false positive rate value.

Return type

fpr (float)

fowlkes_mallows_index(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the Fowlkes-Mallows index.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The Fowlkes-Mallows index value.

Return type

fm (float)

hit_rate(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the hit rate (duplicate of true_positive_rate()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The hit rate.

Return type

tpr (float)

informedness(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the informedness.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The informedness value.

Return type

bm (float)

markedness(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the markedness.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The markedness value.

Return type

mk (float)

matthews_correlation_coefficient(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate Matthew’s correlation coefficient for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of Matthew’s correlation coefficient.

Return type

mat_cor (float)

miss_rate(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the miss rate (duplicate of false_negative_rate()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The miss rate value.

Return type

fnr (float)

negative_likelihood_ratio(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the negative likelihood ratio.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The negative likelihood ratio value.

Return type

lr_minus (float)

negative_predictive_value(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the negative predictive value.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The negative predictive value.

Return type

npv (float)

positive_likelihood_ratio(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the positive likelihood ratio.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The positive likelihood ratio value.

Return type

(float)

positive_predictive_value(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the positive predictive value (duplicated in precision_score()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The positive predictive value.

Return type

ppv (float)

pr_auc_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the precision recall area under the curve (PR AUC) for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of the precision-recall area under the curve.

Return type

pr_auc (float)

precision_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the precision for a ground-truth prediction vector pair.

Duplicate of positive_predictive_value(), but with an alternate implementation using sklearn.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of precision.

Return type

precision (float)

prevalence_threshold(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the prevalence threshold score.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The prevalence threshold value.

Return type

pthr (float)

recall_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the recall for a ground-truth prediction vector pair.

Duplicate of true_positive_rate(), but with alternate implementation from sklearn.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of recall.

Return type

recall (float)

Note

It’s surprising that the sklearn implementation of TPR needs to be binarized but the rexmex implementation does not

roc_auc_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the AUC for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of the area under the curve.

Return type

auc (float)

selectivity(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the selectivity (duplicate of true_negative_rate()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The selectivity score.

Return type

tnr (float)

sensitivity(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the sensitivity (duplicate of true_positive_rate()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The sensitivity score.

Return type

tpr (float)

specificity(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the specificity (duplicate of true_negative_rate()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The specificity score.

Return type

tnr (float)

threat_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the threat score.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The threat score value.

Return type

ts (float)

true_negative(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the number of true negatives.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The number of true negatives.

Return type

tn (float)

true_negative_rate(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the true negative rate (duplicated in specificity() and selectivity()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The true negative rate.

Return type

tnr (float)

true_positive(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the number of true positives.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The number of true positives.

Return type

tp (float)

true_positive_rate(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the true positive rate (duplicated in hit_rate(), sensitivity(), and recall_score()).

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The true positive rate.

Return type

tpr (float)
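
The confusion-matrix based metrics above share a single signature, so they compose naturally. A small sketch with already binarized predictions (rexmex.metrics.classification is the namespace used in the introduction below):

import numpy as np

from rexmex.metrics.classification import (
    condition_positive,
    true_positive,
    true_positive_rate,
)

y_true = np.array([1, 1, 0, 1, 0])
y_score = np.array([1, 0, 0, 1, 1])  # already binarized predictions

tp = true_positive(y_true, y_score)        # 2 correctly predicted positives
cp = condition_positive(y_true)            # 3 positive instances
tpr = true_positive_rate(y_true, y_score)  # tp / cp = 2/3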

Coverage Metrics

item_coverage(possible_users_items: Tuple[List[Union[int, str]], List[Union[int, str]]], recommendations: List[Tuple[Union[int, str], Union[int, str]]])float[source]

Calculates the coverage value for items in possible_users_items[1] given the collection of recommendations. Recommendations over users/items not in possible_users_items are discarded.

Parameters
  • possible_users_items (Tuple[List[Union[int, str]], List[Union[int, str]]]) – contains exactly two sub-lists: the first with users, the second with items.

  • recommendations (List[Tuple[Union[int, str], Union[int, str]]]) – contains user-item recommendation tuples, e.g., [(user_1, item_1), (user_2, item_1), ...].

Returns

A metric showing the fraction of items which got recommended at least once.

Return type

item coverage (float)

user_coverage(possible_users_items: Tuple[List[Union[int, str]], List[Union[int, str]]], recommendations: List[Tuple[Union[int, str], Union[int, str]]])float[source]

Calculates the coverage value for users in possible_users_items[0] given the collection of recommendations. Recommendations over users/items not in possible_users_items are discarded.

Parameters
  • possible_users_items (Tuple[List[Union[int, str]], List[Union[int, str]]]) – contains exactly two sub-lists: the first with users, the second with items.

  • recommendations (List[Tuple[Union[int, str], Union[int, str]]]) – contains user-item recommendation tuples, e.g., [(user_1, item_1), (user_2, item_1), ...].

Returns

A metric showing the fraction of users who got at least one recommendation out of all possible users.

Return type

user coverage (float)
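
Example usage (an illustrative sketch, assuming the two functions live in a rexmex.metrics.coverage module analogous to rexmex.metrics.classification):

from rexmex.metrics.coverage import item_coverage, user_coverage

possible_users_items = (["u1", "u2", "u3"], ["i1", "i2", "i3", "i4"])
recommendations = [("u1", "i1"), ("u1", "i2"), ("u2", "i1")]

item_cov = item_coverage(possible_users_items, recommendations)  # 2 of 4 items recommended
user_cov = user_coverage(possible_users_items, recommendations)  # 2 of 3 users served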

Rating Metrics

mean_absolute_error(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the mean absolute error (MAE) for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The mean absolute error value.

Return type

mae (float)

mean_absolute_percentage_error(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the mean absolute percentage error (MAPE) for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The mean absolute percentage error value.

Return type

mape (float)

mean_squared_error(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the mean squared error (MSE) for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The mean squared error value.

Return type

mse (float)

pearson_correlation_coefficient(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the Pearson correlation coefficient for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of the correlation coefficient.

Return type

rho (float)

r2_score(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the coefficient of determination (R^2) for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The coefficient of determination value.

Return type

r2 (float)

root_mean_squared_error(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the root mean squared error (RMSE) for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of the root mean squared error.

Return type

rmse (float)

symmetric_mean_absolute_percentage_error(y_true: numpy.array, y_score: numpy.array)float[source]

Calculate the symmetric mean absolute percentage error (SMAPE) for a ground-truth prediction vector pair.

Parameters
  • y_true (array-like) – An N x 1 array of ground truth values.

  • y_score (array-like) – An N x 1 array of predicted values.

Returns

The value of the symmetric mean absolute percentage error.

Return type

smape (float)
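
Example usage (a minimal sketch, assuming the rating metrics live in a rexmex.metrics.rating module analogous to rexmex.metrics.classification):

import numpy as np

from rexmex.metrics.rating import mean_absolute_error, root_mean_squared_error

y_true = np.array([3.0, 4.0, 5.0, 2.0])
y_score = np.array([2.5, 4.0, 4.5, 2.0])

mae = mean_absolute_error(y_true, y_score)       # 0.25
rmse = root_mean_squared_error(y_true, y_score)  # about 0.354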

Utility

class Annotator[source]

A class to wrap annotations and generate a registry.

annotate(*, lower: float, upper: float, higher_is_better: bool, link: str, description: str, name: Optional[str] = None, lower_inclusive: bool = True, upper_inclusive: bool = True, binarize: bool = False, duplicate_of: Optional[Callable[[numpy.array, numpy.array], float]] = None)[source]

Annotate a function.

duplicate(other, *, name: Optional[str] = None, binarize: Optional[bool] = None)[source]

Annotate a function as a duplicate.

Metric

A function that can be called on y_true and y_score and returns a floating point result

alias of Callable[[numpy.array, numpy.array], float]

binarize(metric)[source]

Binarize the predictions for a ground-truth prediction vector pair.

Parameters

metric (function) – The metric function which needs a binarization pre-processing step.

Returns

The function which wraps the metric and binarizes the probability scores.

Return type

metric_wrapper (function)

normalize(metric)[source]

Normalize the predictions for a ground-truth prediction vector pair.

Parameters

metric (function) – The metric function which needs a normalization pre-processing step.

Returns

The function which wraps the metric and normalizes predictions.

Return type

metric_wrapper (function)
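
Example usage (a short sketch mirroring the binarize example in the introduction; the rating metric import is an assumption for illustration):

from rexmex.metrics.rating import mean_absolute_error  # module path assumed
from rexmex.utils import normalize

normalized_mae = normalize(mean_absolute_error)
# value = normalized_mae(scores.y_true, scores.y_score)  # scores as in the introduction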

Synthetic Datasets

class DatasetReader[source]

Class to read synthetic test datasets.

read_dataset(dataset: str = 'erdos_renyi_example')[source]

A method to read the dataset.

Parameters

dataset (str) – Dataset of interest, one of: ("erdos_renyi_example"). Default is ‘erdos_renyi_example’.

Returns

The example dataset for testing the library.

Return type

data (pd.DataFrame)

Installation

Install the package by using pip:

$ pip install rexmex

Upgrade your outdated RexMex version by using:

$ pip install rexmex --upgrade

To check your current package version, simply run:

$ pip freeze | grep rexmex

Introduction by example

rexmex is a recommender system evaluation metric library. First, it provides a comprehensive collection of metrics for the evaluation of recommender systems. Second, it includes a variety of classes and methods for reporting and plotting the performance results. The implemented metrics cover a range of well-known metrics and newly proposed metrics from data mining conferences and prominent journals.

Overview


We briefly overview the fundamental concepts and features of rexmex through simple examples, organised into the following subsections.

Design philosophy

rexmex is designed with the assumption that end users might want to use the evaluation metrics and utility functions without using the metric sets and score cards. Because of this, the evaluation metrics and utility functions (e.g. binarisation and normalisation) can be used independently of the metric sets and score cards.
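
For example, a single metric can be called directly on a pair of vectors, without any MetricSet or ScoreCard (the module path follows the classification namespace used later in this introduction):

import numpy as np

from rexmex.metrics.classification import roc_auc_score

value = roc_auc_score(np.array([1, 0, 1, 1]), np.array([0.9, 0.2, 0.6, 0.8]))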

Synthetic toy datasets

rexmex is designed with the assumption that the predictions and the ground truth are stored in a pandas DataFrame. In our example we assume that this DataFrame has at least two columns, y_true and y_score. The first one contains the ground truth labels/ratings while the second one contains the predictions. Each row in the DataFrame is a single user-item-like pairing of a source and a target, with a ground truth value and a prediction. Additional columns represent groupings of the model predictions. Our library provides synthetic data which can be used for testing the library. The following lines import a dataset and print the head of the table.

from rexmex.dataset import DatasetReader

reader = DatasetReader()
scores = reader.read_dataset()

print(scores.head())
   source_id  target_id  source_group  target_group  y_true   y_score
0          0        322             0             0       1  0.244729
1          0        435             0             0       1  0.245294
2          0       2839             0             1       1  0.182597
3          0       3348             0             2       1  0.798757
4          0       5805             0             3       1  0.569672

Let us overview the structure of the scores DataFrame used in our example before we look at the core functionalities of the library. First of all, we observe that it is unindexed, has 6 columns, and each row is a prediction. The first two columns source_id and target_id correspond to the user and item identifiers. The next two columns source_group and target_group help with the calculation of group performance metrics. Finally, y_true is a vector of ground truth values and y_score represents predicted probabilities.

Evaluation metrics

The generic design of rexmex places each metric in the appropriate namespace; for example, pr_auc_score is in the rexmex.metrics.classification namespace because it is a classification metric. Functions in the same namespace have the same signature. This specific function takes a target and prediction vector (we use the toy dataset) and outputs the precision-recall area under the curve value as a float.

from rexmex.metrics.classification import pr_auc_score

pr_auc_value = pr_auc_score(scores["y_true"], scores["y_score"])
print("{:.3f}".format(pr_auc_value))
0.919

Metric sets

A MetricSet() is a base class which inherits from dict and contains the names of the evaluation metrics as keys and the evaluation metric functions as values. Each of these functions should have the same signature. There are specialised MetricSet() variants which inherit from the base class, such as the ClassificationMetricSet(). The following example prints the classification metrics stored in this metric set.

from rexmex.metricset import ClassificationMetricSet

metric_set = ClassificationMetricSet()
metric_set.print_metrics()
{'false_positive_rate', 'selectivity', 'true_positive_rate', 'critical_success_index', 'false_omission_rate', 'prevalence_threshold', 'positive_likelihood_ratio', 'precision', 'accuracy', 'markedness', 'recall', 'diagnostic_odds_ratio', 'informedness', 'balanced_accuracy', 'pr_auc', 'true_negative_rate', 'negative_likelihood_ratio', 'threat', 'false_discovery_rate', 'f1', 'matthews_correlation_coefficient', 'false_negative_rate', 'miss_rate', 'fall_out', 'fowlkes_mallows_index', 'hit_rate', 'specificity', 'positive_predictive_value', 'roc_auc', 'average_precision', 'sensitivity', 'negative_predictive_value'}

Metric sets also allow the filtering of metrics which are interesting for a specific application. In our case we will only keep 3 of the metrics: roc_auc, pr_auc and accuracy.

metric_set.filter_metrics(["roc_auc", "pr_auc", "accuracy"])
metric_set.print_metrics()
{'pr_auc', 'roc_auc', 'accuracy'}

Score cards

Score cards allow the calculation of performance metrics for a whole metric set with ease. Let us create a scorecard and reuse the filtered metrics with the scorecard. We will calculate the performance metrics for the toy example. The ScoreCard() constructor uses the metric_set instance and the generate_report method uses the scores from earlier. The result is a DataFrame of the scores.

from rexmex.scorecard import ScoreCard

score_card = ScoreCard(metric_set)
report = score_card.generate_report(scores)
print(report)
    roc_auc  accuracy    pr_auc
0  0.792799  0.684188  0.900368

Score cards also allow advanced reporting of the performance metrics. We could group on the source_group and target_group keys and get specific subgroup performances, just like this:

report = score_card.generate_report(scores, ["source_group", "target_group"])
print(report)
                              roc_auc  accuracy    pr_auc
source_group target_group                                
0            0            0  0.580047  0.575397  0.912304
             1            0  0.574348  0.576501  0.921771
             2            0  0.561425  0.574823  0.916881
             3            0  0.578944  0.567548  0.924734
             4            0  0.578380  0.575534  0.921595
             5            0  0.565587  0.553327  0.927879
1            1            0  0.634120  0.655391  0.891537
             2            0  0.623156  0.622572  0.892552
             3            0  0.625783  0.632247  0.891662
             4            0  0.637060  0.633891  0.893618
             5            0  0.592871  0.603951  0.887618
2            2            0  0.858091  0.854871  0.930708
             3            0  0.834505  0.834299  0.909151
             4            0  0.837995  0.838511  0.909978
             5            0  0.834062  0.832095  0.908878
3            3            0  0.835858  0.843683  0.920581
             4            0  0.822551  0.820680  0.901642
             5            0  0.817555  0.816832  0.903191
4            4            0  0.921859  0.921260  0.947434
             5            0  0.938959  0.938816  0.958132

Utility functions

A core idea of rexmex is the use of wrapper functions to help with recurring data manipulation. Our utility functions can be used to wrap the metrics when the predictions need to be transformed. In our example the y_score values are not binary, so most classification metrics are not directly meaningful. Wrapping the classification metrics in the binarize function ensures that there is a binarization step. Let us take a look at this example snippet:

from rexmex.metrics.classification import accuracy_score
from rexmex.utils import binarize

new_accuracy_score = binarize(accuracy_score)
accuracy_value = new_accuracy_score(scores.y_true, scores.y_score)
print("{:.3f}".format(accuracy_value))
0.684

External resources

Evaluation Strategies

  • Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko: Translating Embeddings for Modeling Multi-relational Data Paper

Metrics

  • Tao Zhou, Zoltan Kuscsik, Jian-Guo Liu, Matus Medo, Joseph R. Wakeling, Yi-Cheng Zhang: Solving the Apparent Diversity-Accuracy Dilemma of Recommender Systems Paper

  • Yiyu Yao: Measuring Retrieval Effectiveness Based on User Preference of Documents Paper

  • Shani, Guy, and Asela Gunawardana: Evaluating Recommendation Systems Paper

Survey Papers

  • Gunnar Schröder, Maik Thiele, Wolfgang Lehner: Setting Goals and Choosing Metrics for Recommender System Evaluations Paper

  • Iman Avazpour, Teerat Pitakrat, Lars Grunske, John Grundy: Dimensions and Metrics for Evaluating Recommendation Systems Paper