Original PDF Flash format improved-neighborhood-based-collaborative-filtering  


Improved Neighborhood Based Collaborative Filtering

Improved Neighborhood-based Collaborative Filtering
Robert M. Bell and Yehuda Koren
AT&T Labs – Research
180 Park Ave, Florham Park, NJ 07932
{rbell,yehuda}@research.att.com
ABSTRACT
1.
INTRODUCTION
Recommender systems based on collaborative filtering predict user
Recommender systems analyze patterns of user interest in items
preferences for products or services by learning past user-item re-
or products to provide personalized recommendations for items that
lationships. A predominant approach to collaborative filtering is
will suit a user’s taste. In essence, recommender systems attempt
neighborhood based (“k-nearest neighbors"), where a user-item pref-
to profile user preferences and model the interaction between users
erence rating is interpolated from ratings of similar items and/or
and products. Increasingly, their excellent ability to characterize
users. In this work, we enhance the neighborhood-based approach
and recommend items within huge collections represent a comput-
leading to a substantial improvement of prediction accuracy, with-
erized alternative to human recommendations. Because good per-
out a meaningful increase in running time. First, we remove certain
sonalized recommendations can add another dimension to the user
so-called “global effects" from the data to make the different ratings
experience, some of the largest e-commerce web sites have invested
more comparable, thereby improving interpolation accuracy. Sec-
in high-quality recommender systems. Two known examples are
ond, we show how to simultaneously derive interpolation weights
the web merchant Amazon.com and the online movie rental com-
for all nearest neighbors. Unlike previous approaches where each
pany Netflix, which make the recommender system a salient part of
interpolation weight is computed separately, simultaneous interpo-
their web sites. For a recent survey on recommender systems, refer
lation accounts for the many interactions between neighbors by
to [1].
globally solving a suitable optimization problem, also leading to
Broadly speaking, recommender systems use either of two strate-
improved accuracy. Our method is very fast in practice, generat-
gies. The content based approach profiles each user or product al-
ing a prediction in about 0.2 milliseconds. Importantly, it does not
lowing programs to associate users with matching products. For
require training many parameters or a lengthy preprocessing, mak-
example, a movie profile might describe its genre, the participating
ing it very practical for large scale applications. The method was
actors, its box office popularity, etc. User profiles could include de-
evaluated on the Netflix dataset. We could process the 2.8 million
mographic information or answers to a suitable questionnaire. Of
queries of the Qualifying set in 10 minutes yielding a RMSE of
course, content based strategies require gathering external informa-
0.9086. Moreover, when an extensive training is allowed, such as
tion that might not be available or easy to collect.
SVD-factorization at the preprocessing stage, our method can pro-
We focus on an alternative strategy that relies only on past user
duce results with a RMSE of 0.8982.
behavior–e.g., their previous transactions or product ratings–and
does not require the creation of explicit profiles. This approach is
known as Collaborative Filtering (CF), a term coined by the de-
Categories and Subject Descriptors
velopers of the first recommender system - Tapestry [6]. CF an-
H.2.8 [Database Management]: Database Applications—Data Min-
alyzes relationships between users and interdependencies among
ing
products, in order to identify new user-item associations. Besides
avoiding the need for extensive data collection about items or users,
CF requires no domain knowledge. In addition, it offers the po-
General Terms
tential to uncover patterns that would be difficult or impossible to
Algorithms
profile using content based techniques. This has led to many pa-
pers (e.g., [8]), research projects (e.g., [9]) and commercial systems
(e.g., [10]) based on CF.
Keywords
In a more abstract manner, the CF problem can be cast as missing
collaborative filtering, recommender systems, k nearest neighbors,
value estimation: we are given a user-item matrix of scores with
matrix factorization
many missing values, and our goal is to estimate the missing values
based on the given ones. The known user-item scores measure the
amount of interest between respective users and items. They can be
explicitly given by users that rate their interest in certain items or
Permission to make digital or hard copies of all or part of this work for
might be derived from historical purchases. We call these user-item
personal or classroom use is granted without fee provided that copies are
scores ratings, and they constitute the input to our algorithm.
not made or distributed for profit or commercial advantage and that copies
The most common form of CF is the neighborhood-based ap-
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
proach (also known as “k Nearest Neighbors" or kNN, for short).
permission and/or a fee.
These kNN methods identify pairs of items that tend to be rated
KDDCup’07, August 12, 2007, San Jose, California, USA.
similarly or like-minded users with similar history of rating or pur-
Copyright 2007 ACM 978-1-59593-834-3/07/0008 ...$5.00.
7

chasing, in order to deduce unknown relationships between users
Then, the estimated value of rui is taken as a weighted average
and items. Merits of the neighborhood-based approach that have
of the neighbors’ ratings:
made it popular include: its intuitiveness, sparing the need to train
s
and tune many parameters, and the ability to easily explain to the
uv rvi
rui ←
v∈N(u;i)
(1)
user the reasoning behind a recommendation.
s
v∈N(u;i)
uv
Three major components characterize the kNN approach: (1)
The similarities – denoted by s
data normalization, (2) neighbor selection, and (3) determination
uv – play a central role here as they
are used both for selecting the members of N(u; i) and for weight-
of interpolation weights. Our experience has shown little differ-
ing the above average. Common choices are the Pearson correla-
ence among various neighbor selection strategies (e.g., distance-
tion coefficient and the closely related cosine similarity. Various
based vs. correlation-based). However, the two other components,
methods differ by their how they normalize or center data prior to
namely data normalization and interpolation weights, have proved
activating interpolation rule (1). For example, prediction quality
vital to the success of the scheme. Accordingly, in this work we
can be improved by correcting for user-specific means. We present
revisit these two components and suggest novel methods to accom-
a more comprehensive treatment to data normalization in Section
plish them. We significantly improve the accuracy of kNN ap-
3.
proaches without meaningfully affecting running time. Our two
An analogous alternative to the user-oriented approach is the
main contributions are:
item-oriented approach [10, 14]. In those methods, a rating is esti-
1. It is customary to normalize the data before activating a kNN
mated using known ratings made by the same user on similar items.
method. This brings different ratings to a closer level, which
Now, to estimate the unknown rui, we identify a set of neighboring
allows a better mixing thereof. Usually this is achieved by
items N(i; u) that other users tend to rate similarly to their rating
adjusting for the varying mean ratings across users and/or
of i. Analogous to above, all items in N(i; u) must have been rated
items. In Section 3 we offer a more comprehensive treatment
by u. Then, in parallel to (1), the estimated value of rui is taken as
of this normalization issue, which considers additional ef-
a weighted average of the ratings of neighboring items:
fects that are readily available in virtually all ratings datasets.
sijruj
This allows us to explain and eliminate much of the interfer-
rui ←
j∈N(i;u)
(2)
ing variability from the data and to work with residuals that
s
j∈N(i;u)
ij
are more amenable to mutual interpolation.
As with the user-user similarities, the item-item similarities (de-
2. Past kNN methods relate items (or users) by various heuris-
noted by sij) are typically taken as either correlation coefficients
tic variants of correlation coefficients, which allowed direct
or cosine similarities. Sarwar et al. [14] recommend using an
interpolation from neighbors’ scores. In Section 4 we of-
adjusted cosine similarity, where ratings are translated by deduct-
fer a rigorous alternative to these interpolation weights based
ing user-means before computing the cosine similarity. They also
on global optimization of a cost function pertaining to all
found that item-oriented approaches deliver better quality estimates
weights simultaneously. This results in another improvement
than user-oriented approaches while allowing more efficient com-
of estimation quality with a minor increase in running time.
putations. This is because, typically, the number of items is signifi-
cantly lower than the number of users, which allows pre-computing
An encouraging evaluation of the results on the Netflix Prize
all item-item similarities for retrieval as needed.
user-movie dataset [3] is provided in Section 5.
Neighborhood-based methods became very popular because they
are intuitive and relatively simple to implement. In particular, they
2.
RELATED WORKS
do not require tuning many parameters or an extensive training
stage. They also provide a concise and intuitive justification for
2.1
General Framework
the computed predictions. This enables presenting the user a list
of similar items that he or she has previously rated, as the basis for
Before discussing previous works, we define our notational con-
the estimated rating. This way, the user actually understands the
ventions. We are given ratings about m users and n items, arranged
reasoning behind the recommendation, allowing him/her to better
in an m × n matrix R = {rui}1 u m,1 i n. We anticipate three
assess its relevance (e.g., downgrade the estimated rating if it is
characteristics of the data that may complicate prediction. First,
based on an item that he or she no longer likes), or even encourage
the numbers of users and items may be very large, as in the Netflix
the user to alter outdated ratings.
data, with the former likely much larger than the latter. Second, an
However, neighborhood-based methods raise some concerns:
overwhelming portion of the user-item matrix (e.g., 99%) may be
unknown. Third, the pattern of observed data may be very nonran-
1. The similarity function (suv or sij), which directly defines
dom, i.e., the amount of observed data may vary by several orders
the interpolation weights, is arbitrary. Different CF algo-
of magnitude among users or among items.
rithms use somewhat different similarity measures; all are
We reserve special indexing letters for distinguishing users from
trying to quantify the elusive notion of user- or item-similarity.
items: for users u, v, and for items i, j, k.
We could not find any fundamental justification for the cho-
2.2
Neighborhood-based collaborative filtering
sen similarities. Consider an extreme example, where a par-
ticular item is predicted perfectly by some other item. In that
The most common approach to CF is the neighborhood-based
case, we would want the latter item to receive all the weight,
approach. Its original form, which was shared by virtually all ear-
but that is impossible for bounded similarity scores like the
lier CF systems, is the user-oriented approach; see [8] for a good
Pearson correlation coefficient.
analysis. Such user-oriented methods estimate unknown ratings
based on recorded ratings of like minded users. More formally,
2. Another problem is that previous neighborhood-based meth-
in order to estimate the unknown rating rui, we resort to a set of
ods do not account for interactions among neighbors. Each
users N(u; i) that tend to rate similarly to u (“neighbors"), and that
similarity between an item i and a neighbor j ∈ N(i; u) is
actually rated item i (i.e., rvi is known for each v ∈ N(u; i)).
computed independently of the content of N(i; u) and the
8

other similarities: sik for k ∈ N(i; u) − {j}. For exam-
3.
NORMALIZING BY REMOVING GLOBAL
ple, suppose that our items are movies, and the neighbors set
EFFECTS
contains three movies that are highly correlated with each
Although neighborhood and factorization based CF are both pow-
other (e.g., sequels such as “Lord of the Rings 1–3"). An al-
erful prediction methods, there are several reasons to precede either
gorithm that ignores the similarity of the three movies when
technique with simple models that estimate what we call “global
determining their interpolation weights, may end up essen-
effects." First, there may be large user and item effects–i.e., sys-
tially triple counting the information provided by the group.
tematic tendencies for some users to give higher ratings than other
users and for some items to receive higher ratings than others. The
3. By definition, the interpolation weights sum to one, which
basic kNN interpolation method detailed in equations (1)-(2) re-
may cause overfitting. Suppose that an item has no useful
quires ratings where user and item effects have been taken out in
neighbors rated by a particular user. In that case, it would
order to, e.g., avoid predicting too high for low-rated items that
be best to ignore the neighborhood information, staying with
happen to have a lot of neighbors with high average ratings, and
the current data normalization. Nevertheless, the standard
vice versa. Or, to adjust for the fact that some items were mostly
neighborhood formula uses a weighted average of ratings for
rated by users that tend to rate higher, while some other items were
the uninformative neighbors.
rated by users that tend to rate lower.
Second, one may have access to information about either the
items or users that can benefit the model. Although factorization
4. Neighborhood methods may not work well if variability dif-
offers the potential to detect such structure through estimation of la-
fers substantially among neighboring items (users)
tent variables, directly incorporating variables such as movie genre
and user demographics may be more effective and does not require
In a recent work [2], we overcame most of these shortcomings,
training a factorization model. While such content based analysis
but the result was a rather slow algorithm, whose running time was
is beyond the scope of this paper, we do consider two types of eas-
several orders of magnitude slower than previous neighborhood-
ily derived variables–the number of ratings of an item or by a user
methods, thereby limiting its practical applicability. In this work,
and the average rating of an item or by a user. Variables like these
we show how to solve the aforementioned issues of neighborhood
allow us, for example, to distinguish users who like the most com-
based approaches, without compromising running time efficiency.
monly rated movies best from those who prefer more specialized
fare (after controlling for both movie and user effects).
2.3
Factorization-based collaborative filtering
Third, there may be characteristics of specific ratings, such as the
Matrix factorization is an alternative approach to CF. It is related
date of the rating, that explain some of the variation in scores. For
to the so-called model-based approaches, which train a compact
example, a particular user’s ratings may slowly, or suddenly, rise
model that explains the full ratings matrix. It is not directly re-
over time, above and beyond any change explained by the inherent
lated to the methods discussed in this paper, but we survey it here
quality of the items being rating. Similarly, ratings for some movies
because it can be integrated with our methods. In a way, factoriza-
may fall with time after their initial release dates, while others stand
tion complements the more localized neighborhood-based methods
the test of time quite well. To the extent that patterns like these
by providing a higher level perspective to the ratings data. Our
occur, neither factorization nor kNN could be expected to detect
experiments show that such a combination can improve upon the
the patterns.
application of either method alone; see Section 5.
The goal of a factorization-based CF is to uncover latent features
3.1
Methods
of the given data that explain the ratings. This can be achieved by
Our strategy is to estimate one “effect" at a time, in sequence
employing matrix factorization techniques such as Singular Value
(i.e., the main effect for items, the main effect for users, a user-time
Decomposition (SVD) or Principal Components Analysis (PCA).
interaction, etc.). At each step, we use residuals from the previous
Given an m × n matrix R, SVD computes the best rank-f approxi-
step as the dependent variable for the current step. Consequently,
mation Rf , which is defined as the product of two rank-f matrices
after the first step, the rui refer to residuals, rather than raw ratings.
Pm×f and Qn×f , where f
m, n. That is, Rf = P QT mini-
For each of the effects mentioned above, our goal is to estimate
mizes the Frobenius norm R − Rf F among all rank-f matrices.
either one parameter for each item or one parameter for each user.
In this sense, the matrix Rf captures the f most prominent fea-
For the rest of this subsection, we describe our methods for esti-
tures of the data, leaving out less significant features of the data
mating user specific parameters; the method for items is perfectly
that might be mere noise. Consequently, each unknown rating, rui
analogous.
is estimated as Rf .
ui
We denote by xui the explanatory variable of interest correspond-
Applying an SVD-based technique to CF raises unique difficul-
ing to user u and item i. For user main effects, the xui’s are iden-
ties due to the sparsity issue. The conventional SVD computation
tically 1. For other global effects, we center xui for each user by
requires that all entries of R are known. In fact, the goal of SVD
subtracting the mean of xui for that user. In each case, our model
is not properly defined when some entries of R are missing. Previ-
is:
ous works have adapted matrix factorizations techniques to handle
r
missing data in a variety of ways [2, 5, 7, 12, 13]. In all cases, a crit-
ui = θuxui + error
(3)
ical issue is to avoid overfitting for items and users with relatively
With sufficient ratings for user u, we might use the unbiased esti-
sparse data.
mator:
Factorization-based approaches tend to produce competitive re-
ruixui
sults in terms of accuracy. However, their major drawback is a
ˆ
θu =
i
x 2
lengthy training phase that is required for building the model. When-
i
ui
ever these factorization methods are applicable, they can be used
where each summation is over all items rated by u. However, for
in conjunction with the neighborhood-based methods discussed in
sparse data, some values of ˆ
θu may be based on very few observa-
this paper, as will be explained later in Section 5.
tions, thereby resulting in unreliable estimates.
9

To avoid overfitting, we shrink individual values of ˆ
θ
Effect
RMSE
Improvement
u towards
a common value. Shrinkage can be motivated from a Bayesian
Overall mean
1.1296
NA
perspective. Suppose that the true θu are independent random vari-
Movie effect
1.0527
.0769
ables drawn from a normal distribution,
User effect
0.9841
.0686
User×Time(user)1/2
0.9809
.0032
θu ∼ N(µ, τ 2)
User×Time(movie)1/2
0.9786
.0023
for known µ and τ 2, while
Movie×Time(movie)1/2
0.9767
.0019
Movie×Time(user)1/2
0.9759
.0008
ˆ
θu|θu ∼ N(θu, σ2u)
User×Movie average
0.9719
.0040
User×Movie support
0.9690
.0029
for known σ2u. Then, the best estimator for θu is its posterior mean:
Movie×User average
0.9670
.0020
τ 2 ˆ
θ
Movie×User support
0.9657
.0013
E(θ
u + σ2

u| ˆ
θu) =
τ 2 + σ2u
Table 1: RMSE for Netflix probe data after adding a series of
a linear combination of the empirical estimator ˆ
θu and the common
global effects to the model
mean µ. The parameter σ2u can be estimated from the formula for
the variance of a weighted mean, while µ can be estimated by the
the movie’s first rating by anyone. Somewhat surprisingly, the lat-
mean of the θu’s (optionally weighted by nu). Empirical Bayes [4]
ter interaction contributes almost as much as the first. Two addi-
suggests that the maximum likelihood estimate of τ 2 can be found
tional time interactions concentrate on the complementary movie
as the solution to:
viewpoint. That is, for each fixed movie they model how its rat-
[(ˆ
θu − µ)2 − σ2
ings change against time. Once again, the time variables are either
τ 2 =
u
u]/(τ 2 + σ2
u)2
(τ 2 + σ2
square root of days since first rating of the movie, or square root of
u
u)−2
days since first rating by the user.
In practice, we used a slightly simpler estimator for θu by as-
The next two effects interact users with movie characteristics.
suming µ = 0 and σ2u is proportional to 1/nu, which yields:
We measure the popularity of a movie in two different ways: (1)
its average rating; (2) its support, which is the number of ratings
n ˆ
uθu
associated with the movie. These effects—User×Movie average
nu + α
and User×Movie support—measure how users change their rat-
where n
ings based on the popularity of the movies. We try to estimate
u is the number of ratings by user u and α is a constant.
We determined α by cross validation.
here how closely a user follows the public opinion, or perhaps the
opposite, how contrarian is the user. The most effective interaction
3.2
Example
is between users and movie average–indicating that users varied in
terms of how much they agree with the consensus.
We illustrate the impact of estimating successive global effects
Finally, we interact movies with user characteristics, regressing
on the Netflix data [3]. The dataset is based on more than 100 mil-
the ratings of each movie on the mean or support of the users.
lion ratings of movies performed by anonymous Netflix customers
Although these interactions are more difficult to motivate, each
and is described in details in Section 5. Here, we report results on
contributes an additional modest decrease of the RMSE. Overall,
the Probe set, which is a test set containing about 1.4 million rat-
the eight interactions combine to reduce the RMSE by 0.0184 to
ings compiled by Netflix. Ratings are predicted by accumulating
0.9657.
a series of global effects. Accuracy of predictions is measured by
their root mean squared error (RMSE). The effects that we describe
here should be readily available in every user-item ratings system,
4.
A NEIGHBORHOOD RELATIONSHIPS
and are not particular to the Netflix data.
MODEL
Table 1 shows how the RMSE for the probe set declines with
In this section we assume that global effects have already been
the inclusion of each successive global effect. Not surprisingly,
removed, so whenever we refer to a rating we actually mean the
by far the largest improvements in RMSE are associated with the
residual remaining after removing global effects from the original
two sets of main effects–the movie effect and the user effect. These
rating data. However, our discussion here is completely general, so
two main effects are comparable to double centering enhanced with
the following method applies whether global effects are removed
shrinkage. They reduce the RMSE from 1.1296 based on use of the
or not.
mean rating in the training data, down to 0.9841.
We need to estimate the unknown rating by user u of item i, that
Of more interest are various interactions. The first interaction
is rui. As with all neighborhood-based methods, our first step is
term, User×Time(user)1/2, refers to allowing each user’s rating to
neighbor selection. We focus on an item-oriented method. Among
change linearly with the square root of the number of days since
all items rated by u, we select the K most similar to i, N(i; u),
the user’s first rating. Exploratory analysis indicated that this trans-
by using a similarity function such as the correlation coefficient, as
formation improved the fit relative to the untransformed number of
described later in this section. Typical values of K lie in the range
days. This term, which allows for the type of drift hypothesized
of 20–50; see Section 5.
above, reduces the RMSE by 0.0032. Technically, in (3) we set
Given a set of neighbors N(i; u), we need to compute interpola-
each xui to the square root of number of days elapsed since the
tion weights {wij |j ∈ N(i; u)} that will enable the best prediction
first rating by user u till the time u rated item i. Then, for each
rule of the form:
fixed user u, we center all computed values of xui, and calculate
ˆ
θ
rui ←
wijruj
(4)
u in order to regress rui on xui.
Similarly, User×Time(movie)1/2, allows each user’s rating to
j∈N(i;u)
change linearly with the square root of the number of days since
For notational convenience assume that the K neighbors in N(i; u)
10

are indexed by 1, . . . , K , and the corresponding interpolation weights
the corresponding K × K matrix ˆ
A and the vector ˆb ∈ RK:
are arranged within w ∈ RK .
|U(j, k)| · ¯
A
We seek a formal computation of the interpolation weights, that
ˆ
A
jk + β · avg
jk =
(11)
stems directly from their usage within prediction rule (4). As ex-
|U(j, k)| + β
plained earlier, it is important to derive all interpolation weights
ˆ
|U(i, j)| · ¯b
b
j + β · avg
(12)
simultaneously to account for interdependencies among the neigh-
j =
|U(i, j)| + β
bors. We achieve these goals by defining a suitable optimization
The parameter β controls the extent of the shrinkage. A typical
problem.
value when working with residuals of full global effects is β =
To start, we consider a hypothetical dense case, where all users
500.
but u rated both i and all its neighbors in N(i; u). In that case, we
Our best estimate for A and b are ˆ
A and ˆb, respectively. There-
could learn the interpolation weights by modeling the relationships
fore, we modify (6) so that the interpolation weights are defined as
between item i and its neighbors through a least squares problem:
the solution of the linear system:

2
ˆ
Aw = ˆb
(13)
min
wijrvj
(5)
w
rvi −

The resulting interpolation weights are used within (4) in order to
v=u
j∈N(i;u)
predict rui. When working with residuals of global effects, they
should be added back to the predicted r
Notice that the only unknowns here are the w
ui, which completes the
ij ’s. The optimal so-
computation.
lution to the least squares problem (5) is found by differentiation as
Notice that this method addresses all four concerns raised in Sec-
a solution of a linear system of equations. From a statistics view-
tion 2.1. First, interpolation weights are derived directly from the
point, it is equivalent to the result of a linear regression (without
ratings, not based on any similarity measure. Second, the interpo-
intercept) of rvi on the rvj for j ∈ N(i; u). Specifically, the opti-
lation weights formula explicitly accounts for relationships among
mal weights are given by:
the neighbors. Third, the sum of the weights is not constrained
to equal one. If an item (or user) has only weak neighbors, the
Aw = b
(6)
estimated weights may all be very small. Fourth, the method au-
tomatically adjusts for variations among items in their means or
Here, A is a K × K matrix defined as:
variances.
Ajk =
rvjrvk
(7)
4.1
Preprocessing
v=u
An efficient computation of an item-item neighborhood-based
method depends on precomputing certain values associated with
Similarly the vector b ∈ RK satisfies:
each movie-movie pair to enable their rapid retrieval.
First, we need a quick access to all item-item similarities, the
b
s
j =
rvjrvi
(8)
ij values. These similarities are required for identifying the K
v=u
neighbors that constitute N(i; u). Here, we usually take sij as the
Pearson correlation coefficient between i and j calculated on their
For a sparse ratings matrix there are likely to be very few users
common raters. It is important to shrink sij based on their support,
who rated i and all its neighbors N(i; u). Accordingly, it would be
e.g., multiplying the correlation by: |U(i, j)|/(|U(i, j)| + α) for
unwise to base A and b as given in (7)–(8) only on users with com-
some small α. An alternative, which works similarly well, is based
plete data. Even if there are enough users with complete data for
on the mean squared error between items:
A to be nonsingular, that estimate would ignore a large proportion
|U(i, j)|
of the information about pairwise relationships among ratings by
sij =
(r
the same user. However, we can still estimate A and b, up to the
u∈U(i,j)
ui − ruj )2 + α
same constant, by averaging over the given support, leading to the
The second set of values that should be precomputed is all pos-
following reformulation:
sible entries of ˆ
A and ˆb. To this end, for each two items i and j, we
compute:
r
¯
vj rvk
A
rvirvj
jk =
v∈U(j,k)
(9)
|U(j, k)|
¯
Aij =
v∈U(i,j)
|U(i, j)|
r
Then, the aforementioned baseline value avg, which is used in
¯
vj rvi
bj =
v∈U(i,j)
(10)
(11)-(12), is taken as the average entry of the precomputed n × n
|U(i, j)|
matrix ¯
A. In fact, we recommend using two different baseline val-
where U(j, k) is the set of users who rated both items j and k.
ues, one by averaging the non-diagonal entries of ¯
A and another
This is still not enough for overcoming the sparseness issue. The
one by averaging the diagonal entries. This accounts for the fact
averages represented by ¯
Ajk or ¯bj may differ by orders of magni-
that the diagonal entries are expected to have an inherently higher
tude in terms of the number of users included in the average. As
average because they sum only non-negative values. Finally, we
discussed in the previous section, averages based on relatively low
derive a full n × n matrix ˆ
A from ¯
A by (11). Here, the non-
support (small values of |U(j, k)|) can generally be improved by
diagonal average is used when deriving the non-diagonal entries
shrinkage towards a common mean. Thus, we compute a baseline
of ˆ
A, whereas the diagonal average is used when deriving the diag-
value which is defined by taking the average of all possible ¯
Ajk
onal entries of ˆ
A.
values. Let us denote this baseline value by avg; its precise com-
Because of symmetry, it is sufficient to store the values of sij
putation is described in the next subsection. Accordingly, we define
and ˆ
Aij only for i
j. We allocated one byte for each individual
11

NonNegativeQuadraticOpt (A ∈ RK×K , b ∈ RK )
4.3
User-oriented implementation
% Minimize xT Ax − 2bT x s.t. x
0
In the above discussion we derived an item-oriented approach,
but a parallel idea was successfully applied in a user-oriented fash-
do
ion, by switching the roles of users and items throughout the above
r ← Ax − b % the residual, or “steepest gradient"
discussion. However, as with previous neighborhood-based ap-
% find active variables - those that are pinned because of
proaches, an item-oriented approach enables a much faster imple-
% nonnegativity constraint, and set respective ri’s to zero
mentation by precomputing and storing in main memory a full
for i = 1, . . . , k do
item-item matrix containing all sij and ˆ
Aij values for future re-
if xi = 0 and ri < 0 then
trieval. Typically, the larger number of users will complicate such
ri ← 0
a precomputation in terms of both time and space, and out-of-
end if
core storage will be required. Moreover, our experience with the
end for
Netflix data, under various settings, showed that an item-oriented
α ← rT r % max step size
approach consistently delivered more accurate predictions than a
rT Ar
% adjust step size to prevent negative values:
user-oriented one. This advantage of item-oriented methods strength-
for i = 1, . . . , k do
ens a similar observation by Sarwar et al. [14]. However, when a
if r
user-oriented approach is computationally feasible, it can be used
i < 0 then
α ← min(α, −x
for improving prediction accuracy by mixing its results with those
i/ri)
end if
of the item-oriented approach as in [15].
end for
x ← x + αr
5.
EXPERIMENTAL STUDY
while r <
% stop when residual is close to 0
We evaluated our algorithms on the Netflix data of more than
return x
100 million movie ratings [3]. We are not aware of any publicly
available CF dataset that is close to the scope and quality of this
dataset.
Figure 1: Minimizing a quadratic function with non-negativity
To maintain compatibility with results published by others, we
constraints
adopted some standards that were set by Netflix, as follows. First,
quality of the results is measured by their root mean squared error
(RMSE), a measure that puts more emphasis on large errors com-
value, so overall space complexity for n items is exactly n(n + 1)
pared with the alternative of mean absolute error. In addition, we
bytes. E.g., for the Netflix dataset which contains 17770 movies,
report results on two test sets compiled by Netflix, one is known
overall memory requirements are 300MB, easily fitting within core
as the Probe set and the other is known as the Quiz set. These two
memory. A more comprehensive system, which includes data on
test sets were constructed similarly, with both containing about 1.4
100,000 items would require about 10GB of space, which can fit
million of the most recent movie ratings (by date) performed by
within core memory of current 64bit servers. For even larger sys-
the users. The true ratings of the Probe set are provided. However,
tems, disk resident storage of item-item values is still practical, as
we do not know the true ratings for the Quiz set, which are held
evident by the Amazon item-item recommender system, which is
by Netflix being the subject of the Netflix Prize contest. Netflix
accessing stored similarity information for several million catalog
provides RMSE values to competitors that submit their predictions
items [10]. Preprocessing time is quadratic in n and linear in the
for the Quiz set. The benchmark is Netflix’s proprietary CF sys-
number of ratings. The time required for computing all s
tem, Cinematch, which achieved a RMSE of 0.9514 on this Quiz
ij ’s and
ˆ
A
set. The two test sets contain many more ratings by users that do
ij ’s on the Netflix data (which contains 100 million ratings) was
about 15 minutes on a Pentium 4 PC. Notice that preprocessing can
not rate much and are harder to predict. In a way, they represent
be easily parallelized.
real requirements from a CF system, which needs to predict new
The fact that all possible entries of matrix ˆ
A are precomputed,
ratings from older ones, and to equally address all users, not only
saves the otherwise lengthy time needed to construct it. Thus, af-
the heavy raters.
ter quickly retrieving the relevant entries of ˆ
A, we can compute the
5.1
Experiments with the Probe set
interpolation weights by solving a K × K system of equations.
For typical values of K (between 20 and 50), this time is compa-
Table 2 compares our method to a kNN method that takes inter-
rable to the time needed for computing the K nearest neighbors,
polation weights as Pearson correlation coefficients (shrunk to im-
which is common to all neighborhood-based approaches. Hence,
prove results), which represent the common approach to the prob-
while our method relies on a much more detailed computation of
lem. All methods use item neighborhoods. Reported results are
the interpolation weights compared to previous methods, it does
RMSE values computed on the Probe set. We apply the methods
not significantly increase running time; see Section 5.
after varying stages of preprocessing the data. First, we applied
the methods to the raw data, where scores were not normalized
at all. Then, we removed the first two effects – the user effect
4.2
Calculation of the weights
and the item effect, which is similar to double-centering. When
The interpolation weights can be computed by solving (13) us-
no neighborhood-interpolation is performed, such a normalization
ing standard linear equations solvers. However, we experienced
alone delivers a RMSE of 0.9841, as can be seen in the second
a modest increase in accuracy when w is constrained to be non-
column of the table. A more substantial normalization is achieved
negative, which avoids certain redundant overfitting. This leads to
by accounting for all the global effects, as described in Section 3.
a quadratic program which can be solved by calling “NonNega-
Global effects on their own lower the Probe RMSE to 0.9657, be-
tiveQuadraticOpt( ˆ
A, ˆb)", as described in Figure 1. The function
fore applying any neighborhood interpolation.
is based on the principles of the Gradient Projection method; see,
Finally, our method can be combined with the factorization ap-
e.g., [11].
proach described in Subsection 2.3. Notice that unlike the previ-
12

No interpolation
Correlation-based interpolation
Jointly derived interpolation
Data normalization
(k = 0)
k = 20
k = 35
k = 50
k = 20
k = 35
k = 50
none (raw scores)
NA
0.9947
1.002
1.0085
0.9536
0.9596
0.9644
double centering
0.9841
0.9431
0.9470
0.9502
0.9216
0.9198
0.9197
global effects
0.9657
0.9364
0.9390
0.9413
0.9194
0.9179
0.9174
factorization
0.9167
0.9156
0.9142
0.9142
0.9071
0.9071
0.9071
Table 2: Comparing our interpolation scheme against conventional correlation-based interpolation, by reporting RMSEs on the
Probe set. Various levels of data normalization (preprocessing) are shown, and different sizes of neighborhoods (
K) are considered.
ously mentioned normalizations, factorization requires an exten-
5.2
Experiments with the Quiz set
sive training in order to learn the factors. Our implementation de-
When computing predictions for the Quiz set, the test set held
livered a RMSE of 0.9167 on the Probe set, before considering any
by Netflix, we subsume the Probe set into our training data. This
neighborhood information. A kNN method can be used to improve
makes Quiz results better than the Probe results. Accordingly, our
factorization results, by applying it to the residuals remaining after
method produced a RMSE of 0.9086 (4.5% improvement over Net-
subtracting the estimates generated by factorization. Effectively,
flix system) for the quiz set when applied to residuals of global ef-
error is smoothed by considering local information that the rather
fects. When applied to residuals of factorization the RMSE dropped
regional factorization approaches tend to miss.
to 0.8982 (5.6% improvement over Netflix system), at the cost of
The remaining columns of Table 2 contain results using the two
requiring the training of a factorization model.
methods for determining interpolation weights. We provide results
These results are comparable to recently reported results for the
representing the spectrum of viable choices for neighborhood size
Netflix data by methods that require an extensive training phase
– K. For each value of K, the same neighborhoods are used for
[2, 12]. Moreover, when mixed with other methods, results can be
the two methods, the only difference lies in the determination of
substantially improved. For example, an alternative method to in-
interpolation weights. The main findings are as follows.
troducing neighborhood-awareness into factorization-based results
was described in [2]. While the accuracy of the two methods is
• The jointly derived interpolation weights proposed in this pa-
similar, they differ considerably in their nature, and thus produce
per uniformly outperformed standard correlation based weights.
different ratings. Thus, they can be mixed effectively to produce
The improvement was 0.0071 when applied after factoriza-
a more accurate solution. Their mixtures achieve solutions with
tion, but was much larger with less extensive data normaliza-
RMSEs around 0.8900 (6.45% improvement over Netflix system),
tions.
which would currently rank high on the Netflix Prize Leaderboard1,
even before mixing with other known approaches.
• Use of the neighborhood model with our proposed interpo-
lation weights substantially improved the predictions across
all the data normalizations. The neighborhood approach re-
6.
SUMMARY
duced RMSE by 0.0096 even after the much more time con-
Collaborative filtering through neighborhood-based interpolation
suming factorization method. Notably, the correlation-based
(“kNN") is probably the most popular way to create a recommender
neighborhood model produced a much more modest improve-
system. The success of these methods depends on the choice of the
ment of 0.0025 relative to factorization.
interpolation weights, which are used to estimate unknown ratings
from neighboring known ones. Nevertheless, the literature lacks
• The best neighborhood size K varied depending on the ex-
a rigorous way to derive these weights. In this work we showed
tent of prior data normalization. Interestingly, we found no
how the interpolation weights can be computed as a global solu-
difference in RMSE over the range 20 to 50 neighbors for the
tion to an optimization problem that precisely reflects their role.
best combination–jointly derived interpolation weights after
Comparison to past kNN methods was made on the Netflix data,
factorization.
with encouraging results suggesting a significant improvement of
prediction accuracy without a meaningful increase in running time.
• Even with incorporation of neighborhood information, more
Our second, complementary contribution is related to data nor-
complete data normalization improved predictions. How-
malization. Normalization is essential to kNN methods, as oth-
ever, there were diminishing returns. For example, while
erwise mixing ratings pertaining to different unnormalized users
global effects reduced RMSE by 0.0184 before the neigh-
or items will lead to inferior results. This work offered a com-
borhood model, only 0.0023 survived after incorporation of
prehensive approach to data normalization, which is related to re-
the neighborhood model.
moving 10 effects that can be readily observed in user-item rating
data. Those effects cause much of the data variability and mask the
As for running time, the correlation based method required about
fundamental relationships between ratings. Hence, their removal
200 seconds to process the whole Probe set, regardless of the value
brings the ratings closer and facilitates improved estimation accu-
of K. Our method, with K = 20, required a slight increase in time
racy.
to about 235 seconds for the whole Probe set, where the added
time represents the effort needed to solve the K × K least squares
problem for each query. Not surprisingly, increasing K from 20 to
7.
REFERENCES
35 or 50 also raises running time to around 355 seconds (K = 35)
[1] G. Adomavicius and A. Tuzhilin, “Towards the Next
or 470 seconds (K = 50), due to the growing effort needed for
Generation of Recommender Systems: A Survey of the
solving the larger least squares problems. Even so, the slower mode
State-of-the-Art and Possible Extensions", IEEE
of our method (K = 50) processed a single rating query in less than
0.4 millisecond.
1www.netflixprize.com/leaderboard
13

Transactions on Knowledge and Data Engineering 17
(2005), 634–749.
[2] R. M. Bell, Y. Koren and C. Volinsky, “Modeling
Relationships at Multiple Scales to Improve Accuracy of
Large Recommender Systems", Proc. 13th ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining
, 2007.
[3] J. Bennet and S. Lanning, “The Netflix Prize",
www.cs.uic.edu/~liub/KDD-cup-2007/
NetflixPrize-description.pdf.
[4] B. Efron and C. Morris, “Data analysis using Stein’s
estimator and its generalization", Journal American
Statistical Association
70 (1975), 311–319.
[5] S. Funk, “Netflix Update: Try This At Home",
sifter.org/~simon/journal/20061211.html,
2006.
[6] D. Goldberg, D. Nichols, B. M. Oki and D. Terry, “Using
Collaborative Filtering to Weave an Information Tapestry",
Communications of the ACM 35 (1992), 61–70.
[7] K. Goldberg, T. Roeder, D. Gupta and C. Perkins,
“Eigentaste: A Constant Time Collaborative Filtering
Algorithm", Information Retrieval 4 (2001), 133–151.
[8] J. L. Herlocker, J. A. Konstan, A. Borchers and John Riedl,
“An Algorithmic Framework for Performing Collaborative
Filtering", Proc. 22nd ACM SIGIR Conference on
Information Retrieval
, pp. 230–237, 1999.
[9] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon and
J. Riedl, “GroupLens: Applying Collaborative Filtering to
Usenet News", Communications of the ACM 40 (1997),
77–87, www.grouplens.org.
[10] G. Linden, B. Smith and J. York, “Amazon.com
Recommendations: Item-to-item Collaborative Filtering",
IEEE Internet Computing 7 (2003), 76–80.
[11] J. Nocedal and S. Wright, Numerical Optimization, Springer
(1999).
[12] R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted
Boltzmann Machines for Collaborative Filtering", Proc. 24th
Annual International Conference on Machine Learning
,
2007.
[13] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl,
“Application of Dimensionality Reduction in Recommender
System – A Case Study", WEBKDD’2000.
[14] B. Sarwar, G. Karypis, J. Konstan and J. Riedl, “Item-based
Collaborative Filtering Recommendation Algorithms", Proc.
10th International Conference on the World Wide Web
, pp.
285-295, 2001.
[15] J. Wang, A. P. de Vries and M. J. T. Reinders,“Unifying
User-based and Item-based Collaborative Filtering
Approaches by Similarity Fusion", Proc. 29th ACM SIGIR
Conference on Information Retrieval
, pp. 501–508, 2006.
14