Use Fairness Metrics

Author: Nasim Sonboli, Robin Burke
Version: October 23, 2021

1. Introduction

Fairness can be defined in a number of ways for studying the output of recommender systems. We distinguish between consumer-side fairness (where consumers are the individuals getting recommendations from the system) and provider-side fairness (where providers are the individuals associated with the items being recommended). Consumer-side fairness asks whether the system is fair to different groups of users: for example, job seekers getting recommendations of job listings. Provider-side fairness asks whether the system is fair to different groups of item providers: for example, musical artists whose tracks are being recommended.

2. Feature files

All of the fairness metrics in librec-auto assume a numeric feature. For metrics implemented in Java, this value must be binary (0 or 1) with 1 corresponding to the protected value of the feature. For metrics implemented in Python, it is possible to specify protected values of a feature.

In order for a fairness metric to work, it needs to get the association between users (or items) and their features from a feature file.

All feature files have the same CSV format.

id,feature id,value

where id is a user id when user features are being stored and an item id when the file stores item features.

If there are multiple features associated with a given id, there will be multiple lines, one for each feature. If the value for a feature is 0, it can be omitted.

Feature files are stored in the same directory as the data being used in your study.
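
For example, a hypothetical item feature file might contain lines like the following (the ids and values are illustrative; the feature ids match those used in the configuration example below):

101,gdp-2021,3
101,borrower-gender,1
102,gdp-2021,1
103,borrower-gender,1

Here, item 102 has no borrower-gender line, which is equivalent to a value of 0 for that binary feature.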

3. Configuration

In order to use a fairness metric, you will need to specify the feature file and the protected feature(s) as well as the metric name.

The feature information is added in the top-level features section of the configuration file as in this example.

<features>
        <appender-class>net.librec.data.convertor.appender.ItemFeatureAppender</appender-class>
        <item-feature-file>item-features.csv</item-feature-file>
        <protected-feature name="fea:gdp" type="item" values="1 2 3 4">gdp-2021</protected-feature>
        <protected-feature name="fea:gender" type="item">borrower-gender</protected-feature>
</features>

For user features, use the UserFeatureAppender class and the corresponding user-feature-file element.

Note that multiple protected features can be identified and named. The names are used to refer to the features in other parts of the configuration. Note also that the GDP feature (fea:gdp) has associated values and the gender feature does not; the gender feature will therefore be treated as binary, as discussed above. The error handling associated with feature files is not very robust: if your files don’t match the specification here, the system may break in difficult-to-predict ways. The fea prefix is used as a convention to distinguish between the different types of names used in a configuration file (for example, alg is used to label algorithms in libraries). The name can be any legal XML attribute name.

The value inside the element (for example, gdp-2021 for the feature named fea:gdp) is assumed to be a feature id found in the appropriate feature file. Again, the error checking is not very robust, and unexplained errors will likely crop up if there is a mismatch.

Note that the feature declaration system is different from prior versions of librec-auto. This design makes it possible to declare multiple protected features and to use different protected features for algorithms, metrics and re-rankers. Not all aspects of this capability are fully realized in the current release, hence the limitations (especially with metrics and algorithms integrated with LibRec and implemented in Java).

In writing a configuration using a fairness metric, the reference to the protected feature will occur differently depending on whether the metric is implemented in Java as part of LibRec or in Python as a script invoked by librec-auto. For Java-based metrics, an element with the following form should appear as part of the <metric> element:

<protected-feature ref="fea:gender"/>

The ref attribute tells the system to look up the feature declaration defined in the features element. Currently, only a single feature and feature type are available to be accessed across all Java-implemented metrics.

For Python-based metrics, the protected feature is a script parameter and is specified similarly to other script parameters, but the ref syntax is still used. Example:

<param ref="fea:gdp"/>

When parsed, this will be turned into the following parameter, which is passed to the script. The metric script will have to read the feature data from the configuration file to get the rest of the information about the feature.

<param name="protected-feature">gdp-2021</param>
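
As a rough illustration (not the official librec-auto script interface; the argument names and file-handling details here are assumptions), a Python-based metric might consume this parameter and gather the protected ids from a feature file like this:

# Hypothetical sketch: read the protected-feature parameter and collect the
# ids that belong to the protected group from a feature file in the
# id,feature id,value format described above.
import argparse
import csv


def load_protected_ids(feature_file, feature_id, protected_values=None):
    """Return the set of ids whose value for feature_id marks them as protected."""
    protected = set()
    with open(feature_file, newline='') as f:
        for row in csv.reader(f):
            entity_id, feat, value = row[0], row[1], row[2]
            if feat != feature_id:
                continue
            if protected_values is None:
                # Binary convention: a value of 1 marks the protected group.
                if value == '1':
                    protected.add(entity_id)
            elif value in protected_values:
                protected.add(entity_id)
    return protected


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--protected-feature')  # e.g. gdp-2021 (assumed flag)
    parser.add_argument('--feature-file')       # e.g. item-features.csv (assumed flag)
    args = parser.parse_args()
    protected_items = load_protected_ids(args.feature_file, args.protected_feature)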

4. Available fairness metrics

The available metrics for fairness are listed below. Those marked “binary” require, as noted above, binary protected/unprotected group features in the feature file.

Provider-side

  • (Provider) Statistical Parity (`psp`). (binary) This metric computes the exposure (percentage of the recommendation list) belonging to the protected and unprotected groups and reports the protected value minus the unprotected value. It ranges from +1 (the recommendation lists consist only of protected items) to -1 (the recommendation lists consist only of unprotected items). A simple sketch of this computation appears after this list.
  • Discounted Proportional (Provider) Fairness (`DPPF`). (binary) This metric measures how much utility a group of (item) providers gets out of the system compared to everyone else. The utility of a provider group is calculated by summing the DCG contributions of all the items belonging to that group; utility is based on Normalized Discounted Cumulative Gain (nDCG) at the top-N cutoff. A larger (less negative) score is better (more fair). This metric is discussed by Kelly, F. P. et al. in “Rate control for communication networks: shadow prices, proportional fairness and stability”, Journal of the Operational Research Society, 1998.
  • P Percent Rule (`PPR`). (binary) This metric states that the ratio between the percentage of subjects having a certain sensitive attribute value who are assigned the positive decision outcome and the percentage of subjects not having that value who are also assigned the positive outcome should be no less than p%. The rule implies that each group’s probability of a positive outcome is at least p% of the other group’s. The 100%-rule implies perfect removal of disparate impact at the group level, and larger values of p are preferred. The final result should be greater than or equal to p% to be considered fair. This is derived from the “80%-rule” supported by the U.S. Equal Employment Opportunity Commission. PPercentRuleEvaluator is based on the 80%-rule discussed by Dan Biddle in the book “Adverse Impact and Test Validation: A Practitioner’s Guide to Valid and Defensible Employment Testing”, 2006, and on the p% rule discussed by Zafar et al. in “Fairness Constraints: Mechanisms for Fair Classification”, AISTATS 2017.
\[\min\left(\frac{a}{b}, \frac{b}{a}\right) \geq \frac{p}{100}, \quad \text{where } a = P[Y=1 \mid s=1] \text{ and } b = P[Y=1 \mid s=0]\]
  • Diversity by feature (`FeatureDiversity`). This metric is calculated as the average dissimilarity of all pairs of items in the recommended list at a specific cutoff position. In this extended version, however, the similarity matrix is computed from similarities in item features instead of from ratings over items. For more details, please refer to “Avoiding monotony: improving the diversity of recommendation lists”.
\[D = 1 - \frac{\sum_{i,j \in L}{s(i,j)}}{\frac{k(k-1)}{2}}\]
  • Gini Index (`GiniIndex`). (binary) This metric is a horizontal equity measure that calculates the degree of inequality in a distribution. Fairness in this context means that individuals with equal ability/needs should get equal resources. Here, it measures how fairly items are distributed across the recommendation lists of all users. The probability of an item is taken to be its probability of appearing in a recommendation list (estimated as the count of the item across all recommendation lists divided by the number of recommendation lists). The ideal (maximum fairness) case is when this distribution is uniform; the Gini index of a uniform distribution is zero, so smaller values are desired. For more details, refer to “Recommender systems and their impact on sales diversity” by Fleder, D.M. and Hosanagar, K. in the Proceedings of the 8th ACM Conference on Electronic Commerce, 2007.
  • Item Coverage (`ICOV`). This metric calculates the ratio of the number of unique items recommended to users to the total number of unique items in the dataset (train and test); see the sketch below.
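
As a rough illustration of how two of these metrics could be computed, here is a minimal Python sketch (illustrative code only, not the librec-auto implementation; the data structures and the sign convention for psp follow the descriptions above):

# Illustrative sketch of Provider-side Statistical Parity (psp) and
# Item Coverage (ICOV). rec_lists maps each user id to a list of
# recommended item ids; protected is the set of protected item ids;
# all_items is the set of unique items in the dataset (train and test).

def provider_statistical_parity(rec_lists, protected):
    """Protected exposure minus unprotected exposure, in [-1, +1]."""
    total = sum(len(items) for items in rec_lists.values())
    protected_count = sum(
        1 for items in rec_lists.values() for i in items if i in protected
    )
    protected_share = protected_count / total
    return protected_share - (1 - protected_share)


def item_coverage(rec_lists, all_items):
    """Fraction of the catalog appearing in at least one recommendation list."""
    recommended = {i for items in rec_lists.values() for i in items}
    return len(recommended) / len(all_items)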

Consumer-side

  • (Consumer) Statistical Parity (`csp`). (binary) This metric measures the statistical parity between the precision achieved for the protected group (p) and that of the unprotected group (u); that is, it reports the difference between the average precision of the protected and unprotected groups.
\[f = \frac{1}{|p|}\sum_{n=1}^{|p|} \text{precision}_n - \frac{1}{|u|}\sum_{m=1}^{|u|} \text{precision}_m\]

Note

  • If both groups are empty, the metric returns 0.
  • If the unprotected group is empty, the average precision of the protected group is returned.
  • If the protected group is empty, the average precision of the unprotected group is returned.
  • Otherwise, the formula above is computed.
  • Miscalibration (`CalibrationEvaluator` or `miscalib`). This method is based on the Kullback–Leibler divergence (KL-divergence). Miscalibration measures the discrepancy between the distribution of a user’s various (past) areas of interest and the distribution in her recommendation list. The higher this divergence, the more miscalibrated the user’s recommendation list is. This method was introduced by Harald Steck in “Calibrated recommendations”, Proceedings of the 12th ACM Conference on Recommender Systems, ACM, 2018.

Note

  • It is zero in case of perfect calibration.
  • It is very sensitive to small discrepancies between the two distributions.
  • It favors more uniform and less extreme distributions.
  • Discounted Proportional (Consumer) Fairness (`DPCF`). (binary) This metric measures how much utility a group of users gets out of the system compared to everyone else. The formula below sums the log of this quantity over all groups (the discount). Utility is calculated based on Normalized Discounted Cumulative Gain (nDCG) at the top-N cutoff. A larger (less negative) score is better (more fair). This metric is discussed by Kelly, F. P. et al. in “Rate control for communication networks: shadow prices, proportional fairness and stability”, Journal of the Operational Research Society, 1998.
\[f = \sum_{g \in G}{\log\left(\frac{u_g}{\sum_{g' \in G}{u_{g'}}}\right)}\]
  • Value Unfairness (`VALUNFAIRNESS`). (binary) This unfairness occurs when one class of users is consistently given higher or lower predictions than their true preferences. Larger values indicate that predictions for one class are consistently over-estimated while predictions for the other class are consistently under-estimated. A sketch of this family of unfairness measures appears after this list.
\[U_{\text{val}} = \frac{1}{n} \sum_{j=1}^{n}{\Big|(E_{g}[y]_j - E_{g}[r]_j) - (E_{\neg g}[y]_j - E_{\neg g}[r]_j)\Big|}\]
where \(E_{g}[y]_j\) is the average predicted score for the jth item from disadvantaged users, \(E_{\neg g}[y]_j\) is the average predicted score for advantaged users, and \(E_{g}[r]_j\) and \(E_{\neg g}[r]_j\) are the average ratings for the disadvantaged and advantaged users, respectively.

Note

Absolute Unfairness, Value Unfairness, Over-estimation Unfairness, Under-estimation Unfairness and Non-parity Unfairness are proposed by Sirui Yao and Bert Huang in “Beyond Parity: Fairness Objectives for Collaborative Filtering”, NeurIPS 2017.

  • Absolute Unfairness (`ABSUNFAIRNESS`). (binary) This metric measures the inconsistency in the absolute estimation error across the user types. Absolute unfairness is unsigned, so it captures a single statistic representing the quality of prediction for each user type; it does not consider the direction of the error. If one user type has small reconstruction error and the other has large reconstruction error, one type of user has the unfair advantage of good recommendations while the other receives poor recommendations; that is, one group might consistently get better recommendations than the other.
\[U_{\text{abs}} = \frac{1}{n} \sum_{j=1}^{n}{\Big|\,\big|E_{g}[y]_j - E_{g}[r]_j\big| - \big|E_{\neg g}[y]_j - E_{\neg g}[r]_j\big|\,\Big|}\]
  • Under-estimation Unfairness (`UNDERESTIMATE`). (binary) This metric measures the inconsistency in how much the predictions underestimate the true ratings. Underestimation unfairness is important in settings where missing recommendations are more critical than extra recommendations.
\[U_{under} = \frac{1}{n} \sum_{j=1}^{n}{\Big|max\left\{0,E_{g}[r]_j - E_{g}[y]_j\right\} - max\left\{0,E_{\neg g}[r]_j - E_{\neg g}[y]_j\right\}\Big|}\]
  • Over-estimation Unfairness (`OVERESTIMATE`). (binary) This metric measures the inconsistency in how much the predictions overestimate the true ratings. Overestimation unfairness may be important in settings where users may be overwhelmed by recommendations, so providing too many recommendations would be especially detrimental. For example, if users must invest large amounts of time to evaluate each recommended item, overestimation essentially costs the user time. Thus, uneven amounts of overestimation could cost one type of user more time than the other.
\[U_{over} = \frac{1}{n} \sum_{j=1}^{n}{\Big|max\left\{0,E_{g}[y]_j - E_{g}[r]_j\right\} - max\left\{0,E_{\neg g}[y]_j - E_{\neg g}[r]_j\right\}\Big|}\]
  • Non-parity Unfairness (`NONPAR`). (binary) This metric is based on the regularization term introduced by Kamishima et al. and can be computed as the absolute difference between the overall average ratings of protected users and those of unprotected users:
\[U_{\text{par}} = \Big| E_{g}[y] - E_{\neg g}[y] \Big|\]
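
The following is a minimal Python sketch of the Yao and Huang unfairness measures defined above (illustrative only, not the librec-auto implementation). The per-item group averages correspond to the E terms in the formulas; all names here are assumptions:

# Illustrative sketch (not librec-auto code) of value, absolute, and
# non-parity unfairness. avg_pred_g[j] / avg_true_g[j] are the average
# predicted / true ratings of item j among protected (disadvantaged) users;
# avg_pred_u / avg_true_u are the same for unprotected (advantaged) users.

def value_unfairness(avg_pred_g, avg_true_g, avg_pred_u, avg_true_u):
    n = len(avg_pred_g)
    return sum(
        abs((avg_pred_g[j] - avg_true_g[j]) - (avg_pred_u[j] - avg_true_u[j]))
        for j in range(n)
    ) / n


def absolute_unfairness(avg_pred_g, avg_true_g, avg_pred_u, avg_true_u):
    n = len(avg_pred_g)
    return sum(
        abs(abs(avg_pred_g[j] - avg_true_g[j]) - abs(avg_pred_u[j] - avg_true_u[j]))
        for j in range(n)
    ) / n


def non_parity_unfairness(preds_g, preds_u):
    # Absolute difference between the overall average predictions of the two groups.
    return abs(sum(preds_g) / len(preds_g) - sum(preds_u) / len(preds_u))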