Assume I've built a recommender system that, given the movie ratings of many users, produces a list of 10 recommended movies for each user to watch. Imagine that I also have a large pool of movie items, together with a log of user ratings and of the movies each user actually decided to watch. I want to use this data set to evaluate my system.

I've seen in the literature that these "suggest some good items" tasks are usually evaluated using precision, recall and F1-scores (e.g. see [1]). I guess that I should be interested, in particular, in "precision at 10". However, I'm not quite sure how one is supposed to compute these measures (or whether they make any sense) in the scenario described above.

Apparently, the preferred thing to do is to randomly split the sample into a "training" and a "testing" part, and then feed the training data to my algorithm so that it can come up with a list of 10 predictions.

Precision sort of makes sense: I can check how many of the 10 predictions are actually found among the movies watched by the user in the test data. For recall, however, if the user watched a lot of movies in the test data, say 50 or so, there is no way to obtain a "good" recall score, simply because my system was constrained to produce only 10 movies, so I would get a recall of at most 10/50 = 1/5 = 0.2.

Alternatively, if I constrain the test to guessing only the "next 10 watched" movies of the user (so that there is a chance of getting "perfect recall"), then precision and recall will always be exactly the same number (if the number of recommended items equals the number of relevant items for the user, both metrics have the same numerator and the same denominator).

Am I doing something wrong? Or do these metrics simply not make much sense in this scenario?
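To make the arithmetic in the question concrete, here is a minimal Python sketch of precision and recall at k. The function names and the toy movie ids are made up for illustration; they do not come from any particular library.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations that are in the relevant set."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k=10):
    """Fraction of the relevant set that appears in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


# Case 1: the user watched 50 movies in the test split (toy ids 0..49)
# and all 10 recommendations are hits.
watched_50 = set(range(50))
perfect_list = list(range(10))
p_capped = precision_at_k(perfect_list, watched_50)  # 10 / 10
r_capped = recall_at_k(perfect_list, watched_50)     # 10 / 50, the best possible

# Case 2: only the "next 10 watched" movies count as relevant, and 7 of the
# 10 recommendations are hits. Both metrics share numerator and denominator.
watched_10 = set(range(10))
mixed_list = [0, 1, 2, 3, 4, 5, 6, 100, 101, 102]
p_equal = precision_at_k(mixed_list, watched_10)     # 7 / 10
r_equal = recall_at_k(mixed_list, watched_10)        # 7 / 10
```

In the first case recall cannot exceed 0.2 even for a perfect list; in the second, precision and recall coincide by construction.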
In the case of a "top-N" recommender system, it is helpful to construct an "unbiased" test data set (e.g. by adding a thousand random unwatched/unrated movies to the list of watched movies from the holdout data set for a given user), and then to score the resulting test data set with the model. Once this is done for a number of users, one can calculate a "precision vs recall" curve and a "recall-at-N vs N" curve (as well as sensitivity/specificity and lift curves), which can be used to judge the quality of a given model. The paper Performance of Recommender Algorithms on Top-N Recommendation Tasks by Cremonesi et al. has more details.
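A rough sketch of that procedure for a single user might look like the following. The `score` callable stands in for whatever model is being evaluated and is purely hypothetical, as are the function name, the toy ids and the toy scores; this is one reading of the description above, not the exact protocol from the paper.

```python
import random


def recall_at_n_curve(score, held_out, unwatched_pool, n_values,
                      n_random=1000, seed=0):
    """Rank held-out watched items among random unwatched ones; return recall@N.

    score: hypothetical callable, item -> float (higher means "recommend more").
    held_out: items the user watched in the holdout set.
    unwatched_pool: list of items the user has not watched or rated.
    """
    rng = random.Random(seed)
    n_draw = min(n_random, len(unwatched_pool))
    distractors = rng.sample(unwatched_pool, n_draw)

    # Score and rank the mixed candidate list, best first.
    candidates = list(held_out) + distractors
    ranked = sorted(candidates, key=score, reverse=True)

    relevant = set(held_out)
    curve = {}
    for n in n_values:
        hits = sum(1 for item in ranked[:n] if item in relevant)
        curve[n] = hits / len(relevant)
    return curve


# Toy model: it likes movies 1-3, dislikes 4-5, and is neutral about the rest.
toy_scores = {1: 1.0, 2: 1.0, 3: 1.0, 4: -1.0, 5: -1.0}
curve = recall_at_n_curve(
    score=lambda item: toy_scores.get(item, 0.0),
    held_out=[1, 2, 3, 4, 5],
    unwatched_pool=list(range(100, 2100)),
    n_values=[1, 3, 10, 1005],
)
```

With this toy model, recall@N climbs to 0.6 by N = 3 and only reaches 1.0 once the whole candidate list is included. In a real evaluation the per-user curves would be averaged over many users.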
If a given model includes time dynamics, then the split between training and test data should be made along the time dimension (not entirely at random).
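A time-based split can be as simple as the sketch below, assuming the log is a list of `(user, movie, timestamp)` tuples. That layout, the function name and the cutoff value are assumptions made for illustration.

```python
def temporal_split(log, cutoff):
    """Events strictly before the cutoff go to training, the rest to testing."""
    train = [event for event in log if event[2] < cutoff]
    test = [event for event in log if event[2] >= cutoff]
    return train, test


# Toy log with integer timestamps.
log = [
    ("u1", "m1", 1),
    ("u1", "m2", 5),
    ("u2", "m3", 3),
    ("u2", "m1", 8),
]
train, test = temporal_split(log, cutoff=5)
```

This way the model is only ever evaluated on events that happened after everything it was trained on.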
answered Mar 22, 2012 at 22:04

Comment (Oct 17, 2019 at 21:59): The link is broken.

Most of the time, recall does not produce a result that can be evaluated in absolute terms. You should use the recall value to compare one algorithm against another.
If an algorithm A has a recall value of 0.2 (as in your example) it is difficult to interpret what this value means. However, if another algorithm B has a recall value of 0.15 (given the same experimental setup) then you can conclude that algorithm A has a better performance than algorithm B with respect to recall.
Mean Absolute Error (MAE) is different: it can be interpreted on its own.
answered Aug 23, 2013 at 13:13

Another way to deal with this situation is to not use all of the ground truth data. To evaluate your model, you could compute the probability of each movie and keep only the most likely ones, say, arbitrarily, 6. Then, if a user has watched 50 movies for example, you would pick the top 6 to serve as your ground truth. Remember that what you are after is interpretability of the results; setting thresholds based on probability could be one way to obtain more meaningful recall and precision values.
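One possible reading of this suggestion, sketched in Python: rank the user's watched movies by some relevance probability and keep only the top 6 as ground truth. Where those probabilities come from is not specified above, so the `watched_probs` dictionary and all the values in it are made-up assumptions.

```python
def top_relevant(watched_probs, m=6):
    """Keep the m watched movies with the highest relevance probability."""
    ranked = sorted(watched_probs, key=watched_probs.get, reverse=True)
    return set(ranked[:m])


def recall(recommended, relevant):
    """Fraction of the relevant set that appears among the recommendations."""
    hits = sum(1 for item in recommended if item in relevant)
    return hits / len(relevant)


# Toy data: 50 watched movies with distinct, decreasing probabilities.
watched_probs = {movie: 1.0 - movie / 100 for movie in range(50)}
recommended = [0, 1, 2, 3, 60, 61, 62, 63, 64, 65]   # 4 of 10 are hits

truncated_truth = top_relevant(watched_probs, m=6)    # movies 0..5
recall_truncated = recall(recommended, truncated_truth)   # 4 / 6
recall_full = recall(recommended, set(watched_probs))     # 4 / 50
```

Against the truncated ground truth the same list scores about 0.67 instead of 0.08, at the cost of ignoring 44 movies the user did watch.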
answered Jan 7, 2022 at 14:05 Squid Game