Unfair Ratings in Online Reputation Systems


This section examines the problem of unfair online ratings in more detail. More specifically, the goal is to study a number of unfair rating scenarios and analyze how each compromises the reliability of an online reputation system.

The setting of this chapter is a large-scale B2C marketplace, such as eBay or eLance.com, where consumers transact with a large number of sellers. In a typical transaction t, a buyer b contracts with a seller s for the provision of a service. Upon conclusion of the transaction, b provides a numerical rating $R_s^b(t)$, reflecting some attribute Q of the service offered by s as perceived by b (ratings can only be submitted in conjunction with a transaction). For the sake of simplicity, I assume that $R_s^b(t)$ is a scalar quantity, although in most transactions there are several quality attributes and $R_s^b(t)$ would be a vector.

I further assume the existence of a ratings aggregation mechanism, whose goal is to store and process past ratings in order to calculate reliable personalized "reputation" estimates for seller s upon request of a prospective buyer b. In settings where the attribute Q for which ratings are provided is subjectively measurable, there exist four scenarios in which buyers and/or sellers can intentionally try to "rig the system," resulting in biased reputation estimates that deviate from a "fair" assessment of attribute Q for a given seller:

Unfair Ratings by Buyers

  • Unfairly high ratings ("ballot stuffing"): A seller colludes with a group of buyers in order to be given unfairly high ratings by them. This will have the effect of inflating the seller's reputation, therefore allowing her to receive more orders from buyers, and at a higher price, than she deserves.

  • Unfairly low ratings ("bad-mouthing"): Sellers can collude with buyers in order to "bad-mouth" other sellers that they want to drive out of the market. In such a situation, the conspiring buyers provide unfairly negative ratings to the targeted sellers, thus lowering their reputation.

Discriminatory Seller Behavior

  • Negative discrimination: Sellers provide good service to everyone except a few specific buyers whom they "don't like." If the number of buyers being discriminated against is relatively small, the cumulative reputation of sellers will remain good and an externality will be created against the victimized buyers.

  • Positive discrimination: Sellers provide exceptionally good service to a few select individuals and average service to the rest. The effect of this is equivalent to ballot stuffing. That is, if the favored group is sufficiently large, their favorable ratings will inflate the reputation of discriminating sellers and will create an externality against the rest of the buyers.

The observable effect of all four above scenarios is that there will be a dispersion of ratings for a given seller. If the rated attribute is not objectively measurable, it will be very difficult or impossible to distinguish ratings dispersion due to genuine taste differences from that which is due to unfair ratings or discriminatory behavior.

In the following analysis, I assume the use of collaborative filtering techniques in order to address the issue of subjective ratings (Goldberg, Nichols, Oki, & Terry, 1992; Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994; Shardanand & Maes, 1995; Billsus & Pazzani, 1998). More specifically, I assume that, in order to estimate the personalized reputation of s from the perspective of b, some collaborative filtering technique is used to identify the nearest-neighbor set N of b. N includes buyers who have previously rated s and who are the nearest neighbors of b, based on the similarity of their ratings (for other commonly rated sellers) with those of b. Sometimes, this step will filter out all unfair buyers. Suppose, however, that the colluders have taken collaborative filtering into account and have cleverly picked buyers whose tastes are similar to those of b in everything else except their ratings of s. In that case, the resulting set N will include some fair raters and some unfair raters.
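To make this nearest-neighbor step concrete, the sketch below selects N for a buyer b by correlating rating vectors over commonly rated sellers. This is a minimal illustration under my own assumptions; the Pearson similarity measure, the function names, and the neighborhood size k are not prescribed by the chapter:

```python
import numpy as np

def similarity(ratings_a, ratings_b):
    """Pearson correlation of two buyers' ratings over commonly rated sellers."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0  # too little overlap to judge taste similarity
    a = np.array([ratings_a[s] for s in common], dtype=float)
    b = np.array([ratings_b[s] for s in common], dtype=float)
    if a.std() == 0 or b.std() == 0:
        return 0.0  # correlation is undefined for constant rating vectors
    return float(np.corrcoef(a, b)[0, 1])

def nearest_neighbors(b, all_ratings, seller, k=20):
    """Return up to k buyers most similar to b who have previously rated `seller`.

    all_ratings: dict mapping each buyer to a dict of {seller: rating}.
    """
    candidates = [u for u in all_ratings if u != b and seller in all_ratings[u]]
    scored = sorted(((similarity(all_ratings[b], all_ratings[u]), u)
                     for u in candidates), reverse=True)
    return [u for _, u in scored[:k]]
```

Note that colluders who deliberately mirror b's ratings on every seller except s would score highly under any such similarity measure; this is exactly how they penetrate the set N.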

Effects When Seller Behavior is Steady Over Time

The simplest scenario to analyze is one where we can assume that seller behavior, and therefore the attribute Q that is being rated by buyers, remains steady over time. This means that collaborative filtering algorithms can take into account all ratings in their database, no matter how old.

To make our analysis more concrete, I will make the assumption that fair ratings can range between [Rmin, Rmax] and that they follow a distribution of the general form:

$$R_s^b \sim \min\!\left(R_{\max},\, \max\!\left(R_{\min},\, N(\mu, \sigma)\right)\right) \qquad (1)$$

which in the rest of the chapter will be approximated by $N(\mu, \sigma)$. The introduction of minimum and maximum rating bounds corresponds nicely with common practice. For example, Amazon.com allows buyers to rate products on a scale from 1 to 5. The assumption of normally distributed fair ratings requires more discussion. It is based on the previous assumption that those ratings belong to the nearest-neighbor set of a given buyer, and therefore represent a single taste cluster. Within a taste cluster, it is expected that fair ratings will be relatively closely clustered around some value and hence the assumption of normality.
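As a minimal sketch of this assumption, the following code draws fair ratings from the censored-normal form of equation (1); the parameter values are illustrative:

```python
import numpy as np

def sample_fair_ratings(mu, sigma, r_min=0.0, r_max=9.0, n=1000, rng=None):
    """Fair ratings per equation (1): normal draws censored to [r_min, r_max]."""
    rng = rng or np.random.default_rng(42)
    return np.clip(rng.normal(mu, sigma, size=n), r_min, r_max)

# With mu well inside the scale and a modest sigma, censoring is rare and the
# sample mean stays close to mu, justifying the N(mu, sigma) approximation.
ratings = sample_fair_ratings(mu=6.0, sigma=1.0)
print(round(ratings.mean(), 2))  # ~6.0
```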

In this chapter I focus on the reliable estimation of the unknown quality attribute Q. Suppose that the true value of Q is equal to μ, the mean of the fair ratings distribution. The goal of a reliable reputation system is the calculation of a fair mean reputation estimate (MRE) that is equal to or very close to the mean of the fair ratings distribution in the nearest-neighbor set (an unbiased estimator of μ). Ideally, therefore, the estimate $\widehat{MRE}$ calculated by the system should satisfy:

$$\widehat{MRE} = \mu \qquad (2)$$

The goal of unfair raters is to strategically introduce unfair ratings in order to maximize the distance between the actual MRE calculated by the reputation system and the fair MRE. More specifically the objective of a ballot-stuffing agent is to maximize the MRE while bad-mouthing agents aim to minimize it. Note that, in contrast to the case of fair ratings, it is not safe to make any assumptions about the form of the distribution of unfair ratings. Therefore, all analyses in the rest of this chapter will calculate system behavior under the most disruptive possible unfair ratings strategy.

I will only analyze the case of ballot stuffing; the case of bad-mouthing is symmetric. Assume that the initial collaborative filtering step constructs a nearest-neighbor set N, in which the proportion of unfair raters is δ and the proportion of fair raters is (1 − δ). Furthermore, my analysis assumes that the actual MRE is taken to be the sample mean of the most recent rating given to s by each qualifying rater in N. This simple estimator is consistent with the practice of most current-generation commercial recommender systems (Schafer, Konstan, & Riedl, 2001). In that case, the actual MRE is approximately equal to:

$$\widehat{MRE} \approx (1-\delta)\,\mu + \delta\, u \qquad (3)$$

where u is the mean value of unfair ratings. The strategy that maximizes the above MRE is one where u = Rmax, i.e., where all unfair buyers give the maximum possible rating to the seller.
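Equation (3) is easy to verify numerically. The sketch below contaminates a sample of fair ratings with a fraction δ of maximal unfair ratings; all parameter values are my own illustrative choices:

```python
import numpy as np

def contaminated_mre(mu, sigma, delta, u, n=10_000, r_max=9.0, rng=None):
    """Sample mean of a ratings set with a fraction delta of unfair ratings,
    which approximates (1 - delta)*mu + delta*u, per equation (3)."""
    rng = rng or np.random.default_rng(0)
    n_unfair = int(delta * n)
    fair = np.clip(rng.normal(mu, sigma, n - n_unfair), 0.0, r_max)
    unfair = np.full(n_unfair, u)  # worst case: every colluder rates r_max
    return float(np.concatenate([fair, unfair]).mean())

print(round(contaminated_mre(mu=6.0, sigma=1.0, delta=0.2, u=9.0), 2))
# ~6.6, matching (1 - 0.2)*6 + 0.2*9 = 6.6
```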

I define the mean reputation estimate bias B for a contaminated set of ratings to be:

$$B = \widehat{MRE} - \mu \qquad (4)$$

In the above scenario, the maximum MRE bias is given by:

$$B_{\max} = \delta\,(R_{\max} - \mu) \qquad (5)$$

Figure 2 tabulates some values of Bmax for several different values of μ and δ, in the special case where ratings range from [0, 9]. For the purpose of comparing this baseline case with the "immunization mechanisms" described in this chapter, I have marked (with asterisks) biases above 5% of the ratings range (i.e., biases greater than 0.5 points on ratings that range from 0-9). As can be seen, equation (5) can result in very significant inflation of a seller's MRE, especially for small μ and large δ.

Percentage of         Fair Mean Reputation Estimate μ (Rmin = 0, Rmax = 9)
unfair ratings (δ)       0        2        4        6        8
       9%              0.81*    0.63*    0.45     0.27     0.09
      18%              1.62*    1.26*    0.90*    0.54*    0.18
      27%              2.43*    1.89*    1.35*    0.81*    0.27
      36%              3.24*    2.52*    1.80*    1.08*    0.36
      45%              4.05*    3.15*    2.25*    1.35*    0.45

(Table entries are the maximum reputation bias Bmax.)

Figure 2: Maximum MRE Bias when MREs are Based on the Mean of the Ratings Set; Asterisked Entries Indicate Biases Above 5% of the Ratings Range
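The entries of Figure 2 follow directly from equation (5), and a few lines of code reproduce them:

```python
# Reproduce Figure 2 from equation (5): B_max = delta * (R_max - mu).
R_MAX = 9.0
for delta in (0.09, 0.18, 0.27, 0.36, 0.45):
    row = [round(delta * (R_MAX - mu), 2) for mu in (0, 2, 4, 6, 8)]
    print(f"{delta:.0%}: {row}")
# 9%:  [0.81, 0.63, 0.45, 0.27, 0.09]
# 18%: [1.62, 1.26, 0.9, 0.54, 0.18]
# ... and so on for the remaining rows.
```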

Effects When Seller Behavior Varies Over Time

This section expands the analysis by discussing some additional considerations that arise in environments where seller behavior may vary over time. I identify some additional unfair rating strategies that can be very disruptive in such environments.

In real-life trading communities, sellers may vary their service quality over time, improving it, deteriorating it, or even oscillating between phases of improvement and phases of deterioration. In his analysis of the economic effects of reputation, Shapiro (1981) proved that, in such environments, the most economically efficient way to estimate a seller's reputation (i.e., the way that induces the seller to produce at the highest quality level) is as a time-discounted average of recent ratings. Shapiro proved that efficiency is higher: (1) the higher the weight placed on recent quality ratings, and (2) the higher the discount factor of older ratings.

In this chapter I base my analysis on an approach that approximates Shapiro's desiderata, but is simpler to implement and analyze. The principal idea is to calculate time-varying personalized MREs as averages of ratings submitted within the most recent time window W = [t − ε, t]. This is equivalent to using a time-discounted average calculation where weights are equal to 1 for ratings submitted within W and 0 otherwise. More specifically, in order to calculate a time-varying personalized MRE, we first use collaborative filtering in order to construct an initial nearest-neighbor set Ninitial. Following that, we construct the active nearest-neighbor set Nactive, consisting only of those buyers u ∈ Ninitial who have submitted at least one rating for s within W. Finally, we base the calculation of the MRE on ratings $R_s^u(t)$ where u ∈ Nactive and t ∈ W.
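A minimal sketch of this windowed estimator is given below; consistent with the estimator used earlier in the chapter, it keeps only each active neighbor's most recent rating inside W. The timestamped-rating representation and the function name are my own assumptions:

```python
def windowed_mre(ratings, n_initial, t_now, window):
    """Time-windowed personalized MRE for one seller.

    ratings:   list of (buyer, timestamp, value) tuples for seller s.
    n_initial: nearest-neighbor set produced by the collaborative filtering step.
    Averages the latest rating inside W = [t_now - window, t_now] of each
    buyer in n_initial; buyers with no rating in W drop out of N_active.
    """
    latest = {}
    for buyer, ts, value in ratings:
        if buyer in n_initial and t_now - window <= ts <= t_now:
            if buyer not in latest or ts > latest[buyer][0]:
                latest[buyer] = (ts, value)
    if not latest:
        return None  # no active neighbors rated s within W
    return sum(v for _, v in latest.values()) / len(latest)
```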

According to equation (5), the maximum reputation bias due to unfair ratings is proportional to the fraction δ of unfair raters that "make it" into the active nearest-neighbor set Nactive. Therefore, an obvious strategy for unfair buyers is to try to increase δ by "flooding" the system with unfair ratings. Zacharia, Moukas, and Maes (1999) touch upon this issue and propose, as a solution, keeping only the last rating given by a given buyer to a given seller. In environments where reputation estimates use all available ratings, this simple rule ensures that eventually δ can never be more than the actual fraction of unfair raters in the community, usually a very small fraction. However, the strategy breaks down in environments where reputation estimates are based on ratings submitted within a relatively short time window (or where older ratings are heavily discounted). The following paragraph explains why.

Let us assume that the initial nearest-neighbor set Ninitial contains m fair raters and n unfair raters. In most cases n would be much smaller than m. Assume further that the average inter-arrival time of fair ratings for a given seller is λ, and that personalized MREs are based only on ratings for s submitted by buyers u ∈ Ninitial within the time window W = [t − kλ, t]. Based on the above assumptions, the average number of fair ratings submitted within W would be equal to k. To ensure accurate reputation estimates, the width of the time window W should be relatively small; therefore k should generally be a small number (say, between five and 20). For k much smaller than m, I can assume that every rating submitted within W is from a distinct fair rater. Assume now that unfair raters flood the system with ratings at a frequency much higher than the frequency of fair ratings. If the unfair ratings frequency is high enough, every one of the n unfair raters will have submitted at least one rating within the time window W. As suggested by Zacharia et al. (1999), I keep only the last rating sent by each rater. Even using that rule, however, the above scenario would result in an active nearest-neighbor set of raters where the fraction of unfair raters is δ = n/(n + k). This expression results in δ ≥ 0.5 for n ≥ k, independent of how small n is relative to m. For example, if n = 10 and k = 5, δ = 10/(10 + 5) = 0.67. We therefore see that, for relatively small time windows, even a small (e.g., five to 10) number of colluding buyers can successfully use unfair ratings flooding to dominate the set of ratings used to calculate MREs and completely bias the estimate provided by the system.
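The arithmetic of this flooding scenario can be made concrete in a few lines (the parameter values are illustrative):

```python
# Flooding: each of the n unfair raters guarantees one rating inside W,
# while fair ratings arrive once per lambda, leaving only ~k fair ratings in W.
n, k = 10, 5                              # unfair raters vs. fair ratings in W
delta = n / (n + k)                       # fraction of unfair raters in N_active
mu, r_max = 6.0, 9.0
mre = (1 - delta) * mu + delta * r_max    # equation (3) with u = R_max
print(round(delta, 2), round(mre, 2))     # 0.67 8.0
```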

The results of this section indicate that even a relatively small number of unfair raters can significantly compromise the reliability of online reputation systems. This requires the development of effective measures for addressing the problem. The next section proposes and analyzes several such measures.



