----------------------- REVIEW 1 ---------------------
PAPER: 92
TITLE: Top-N Recommendation with Missing Implicit Feedback
AUTHORS: Daryl Lim, Gert Lanckriet and Julian McAuley
OVERALL EVALUATION: 1 (weak accept)
REVIEWER'S CONFIDENCE: 5 (expert)
RELEVANCE FOR RECSYS: 5 (excellent)
NOVELTY: 4 (good)
TECHNICAL QUALITY: 3 (fair)
SIGNIFICANCE: 3 (fair)
PRESENTATION AND READABILITY: 4 (good)
Consider for Best Paper/Best Poster Award: 3 (No)
----------- REVIEW -----------
General: this paper focuses on top-N recommendation. The authors first define a reasonable model for the data, claiming that in the dataset many good (relevant) items are not marked as relevant, and that the items marked as relevant are only a uniform sample of all the relevant items. The authors then present an evaluation metric, ADG, that is designed to provide an unbiased estimate over the (unknown) set of all relevant items. The authors then proceed to suggest a latent factor model that can be optimized for ADG. The authors provide some experiments, which are perhaps the weakest part of the paper.
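For readers unfamiliar with the idea being reviewed: under the uniform-sampling assumption stated above, averaging a rank discount over the observed relevant items estimates the average over all relevant items. A minimal, hypothetical sketch of such a metric (the discount function and names here are illustrative, not the authors' exact formula):

```python
import math

def avg_discounted_gain(ranked_items, observed_relevant):
    """Average a rank discount over the observed relevant items.

    If the observed positives are a uniform sample of all positives,
    the mean discount over them is an unbiased estimate of the mean
    discount over *all* relevant items (discount choice illustrative).
    """
    rank = {item: r for r, item in enumerate(ranked_items, start=1)}
    gains = [1.0 / math.log2(1 + rank[i]) for i in observed_relevant]
    return sum(gains) / len(gains)
```

Note that, unlike DCG, this averages (rather than sums) over positives, which is what makes the estimate insensitive to how many positives happen to be observed.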
Major comments:
1) The authors conflate the prediction model and the evaluation metric: in section 3 the authors first describe a specific latent factor prediction model, and then describe the evaluation metric over this specific model. Tying the evaluation metric to the prediction model is unfortunate, because it does not allow us to measure the performance of other prediction models. I don't see why the specifics of the prediction model would change anything with respect to the properties of the metric.
2) The evaluation is not great. ADG, much like NDCG and AUC, has little meaning in real life, unlike precision or recall, which can be translated into terms that interest business people. Hence, a good evaluation would have focused on the correlation between ADG and precision/recall. To do that, one would expect a larger set of domains and methods, showing that the ordering of methods by precision and recall persists under ADG.
3) The test domains are also not great. The rating datasets are cast into relevant/not-relevant labels, which adds noise to the experiment. It would have been better if the authors had used a retail dataset, like the one offered in the current RecSys challenge, or something similar.
4) A second contribution is the optimization procedure for the suggested model with ADG as the optimization criterion. To understand the usefulness of this model, I would expect a comparison with more methods, which would also help with point 2 above. Specifically, comparing with a simple popularity method is a must in top-N recommendation. Popularity typically performs quite well on these tasks, and beating it is essential for showing value in a new method. I would also like to see a comparison with simple item-item (CP, Jaccard) methods. Finally, CofiRank has shown very good results on ranking and top-N recommendation, and should also be included in the testbed.
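The popularity baseline asked for above is trivial to implement, which is part of why its omission is notable. A minimal sketch, assuming interactions arrive as (user, item) pairs (the data format and function names here are hypothetical):

```python
from collections import Counter

def popularity_top_n(interactions, seen_by_user, user, n):
    """Rank items by global interaction count, excluding items the
    target user has already interacted with.

    interactions  : list of (user, item) pairs
    seen_by_user  : dict mapping user -> set of already-seen items
    """
    counts = Counter(item for _, item in interactions)
    seen = seen_by_user.get(user, set())
    ranked = [item for item, _ in counts.most_common() if item not in seen]
    return ranked[:n]
```

The same list is recommended to every user (minus their history), yet this non-personalized baseline is often surprisingly hard to beat on top-N metrics.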
Minor comments:
1) “evaluation measures computed on the observed feedback may not accurately reflect performance” – agreed, but one can argue that they provide a lower bound on the true performance, which is a good thing.
2) The implicit/explicit terminology is annoying. When you care about the movies that the user will watch, data on previously watched movies is a very explicit signal. Fifteen years ago someone coined the term "implicit rating", when people still cared mainly about rating prediction, and this annoying term stuck. I think it is time the community stopped calling selection events, such as purchases, movie watching, story reading, and so forth, "implicit feedback".
3) In equation 1 – consider using dot product instead of <,>
4) In the definition of I(k): should it be I(k1 > k2)?
5) In Algorithm 1, step 14, you probably mean f(u, i-), not f(u, v).
6) When claiming that the performance on the validation and test sets is identical regardless of their size, you probably mean when the sets are large enough; otherwise the sample statistics will be very noisy.
----------------------- REVIEW 2 ---------------------
PAPER: 92
TITLE: Top-N Recommendation with Missing Implicit Feedback
AUTHORS: Daryl Lim, Gert Lanckriet and Julian McAuley
OVERALL EVALUATION: 1 (weak accept)
REVIEWER'S CONFIDENCE: 4 (high)
RELEVANCE FOR RECSYS: 4 (good)
NOVELTY: 4 (good)
TECHNICAL QUALITY: 4 (good)
SIGNIFICANCE: 3 (fair)
PRESENTATION AND READABILITY: 3 (fair)
Consider for Best Paper/Best Poster Award: 3 (No)
----------- REVIEW -----------
The paper presents a novel evaluation metric, “Average Discounted Gain”, to measure the performance of a top-N recommender system under the MNAR missing data model. The authors claim that the proposed metric is able to give an unbiased estimation of performance on partially observed interactions.
Performance estimation under missing-data models is an interesting topic in the RecSys community, and the novelty of the idea and the way the authors present the new metric make the work an interesting contribution. However, there are still important issues which are either unclear or not covered well in the paper:
- The OPT-ADG algorithm is not clearly explained and it is very difficult for the reader to reproduce. For example, there is no explanation of the objective function in equation 1: what are the parameters? How are the parameters updated in the gradient step? What is the bias term?
- What is the range of the objective function (eq. 1), and why is 1 added to the difference between the scores in equation 5 (similarly, line 7 of Algorithm 1)?
- In practice most datasets are very sparse. However, in this work the authors "densified" the datasets and performed the experiments on dense subsets. How can you ensure similar performance when datasets are sparse? Also, why did the authors use datasets such as MovieLens and Amazon, which are explicit feedback datasets by default?
- Concerning the experiments, the reported number of iterations (1M) is extremely high, and in practice such a number is not scalable. Why so many? If the algorithm only becomes effective after such a high number of iterations, then its usefulness in practice becomes questionable.
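On the question of the +1 in equation 5: margin-based ranking losses of the WARP/WSABIE family (which [10] introduced) add a constant margin of 1 so that a positive item must outscore a sampled negative by at least one unit before its loss vanishes. A guess at the intended form, since equation 5 itself is not reproduced in this review:

```python
def hinge_rank_loss(f_pos, f_neg, margin=1.0):
    """Margin hinge for pairwise ranking: zero once the positive item
    outscores the negative by at least `margin`, linear in the gap
    otherwise. The constant 1 would be this default margin."""
    return max(0.0, margin + f_neg - f_pos)
```

If this reading is right, the +1 is not arbitrary but sets the scale of the score differences the model is pushed to achieve.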
My overall evaluation of this paper as a short contribution for RecSys is weak accept, but I recommend that the authors write a longer version of this paper with more detailed explanations of the algorithms, the learning process, and implementation details, and try datasets that are implicit feedback by nature.
----------------------- REVIEW 3 ---------------------
PAPER: 92
TITLE: Top-N Recommendation with Missing Implicit Feedback
AUTHORS: Daryl Lim, Gert Lanckriet and Julian McAuley
OVERALL EVALUATION: 2 (accept)
REVIEWER'S CONFIDENCE: 4 (high)
RELEVANCE FOR RECSYS: 5 (excellent)
NOVELTY: 5 (excellent)
TECHNICAL QUALITY: 5 (excellent)
SIGNIFICANCE: 5 (excellent)
PRESENTATION AND READABILITY: 5 (excellent)
Consider for Best Paper/Best Poster Award: 1 (Yes)
----------- REVIEW -----------
This is an excellent short paper, showing how to estimate a discounted gain metric in the common missing implicit feedback scenario.
In a sense, this is a continuation of Steck's work showing how AUC/ATOP can be estimated on such datasets. However, the metric proposed by the authors is more realistic, as it emphasizes items at the top of the ranked list (unlike AUC).
In addition to the theoretical creation of the estimated metric, the authors provide a method for optimizing it.
A shortcoming of the paper is that, given its short format, it skips essential details justifying the approximation formula and the related algorithm. These are clearly related to the WSABIE method by Weston and Bengio [10], but there is no discussion of the differences and relations. Still, I am not aware of another work doing the same in RecSys, and if that is indeed the case, this will be a nice contribution to the community.
------------------------- METAREVIEW ------------------------
PAPER: 92
TITLE: Top-N Recommendation with Missing Implicit Feedback
Following the consensus positive opinion of the three reviewers, I recommend accepting the paper.
Still, the reviewers pointed out necessary improvements: mostly a better explanation of how formula (5) was derived, and a justification of Algorithm 1 (targeting also those readers not familiar with [10]). The empirical study was also viewed as quite superficial. Admittedly, the 4-page limit does not allow adding much content, but pointing to an external technical report (or an arXiv manuscript) could perhaps fill this gap.