==================================================
Q:
You use the L-BFGS algorithm to train the model, but I don't understand why you run it several times (50, to be exact) and use the validation set to pick the best run. Isn't the validation set used only to find the optimal hyperparameters? In short: why do you use the validation set together with L-BFGS?
A:
The algorithm alternates between fitting topics and fitting a latent factor model. This procedure repeats for 50 iterations. The algorithm terminates when no further improvement is seen on the validation set, as that is the version of the model we'd expect to generalize best to unseen data.
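The selection rule described above can be sketched as follows. This is a minimal illustration, not the HFT source: `OuterStep`, `BestIteration`, and `runWithValidation` are hypothetical names, and the callback stands in for one full outer iteration (refit topics, then refit the latent factor model) followed by a pass over the validation set. For simplicity the loop runs every iteration and remembers the best one; stopping as soon as the validation error stops improving would be a variation on the same idea.

```cpp
#include <cassert>
#include <functional>
#include <limits>

// Hypothetical callback: performs one outer iteration (refit topics, then
// refit the latent factor model) and returns the resulting validation error.
using OuterStep = std::function<double(int iteration)>;

struct BestIteration {
  int iteration;     // index of the iteration with the lowest validation error
  double validError; // validation error at that iteration
};

// Runs maxIter outer iterations and remembers which one had the lowest
// validation error. A real implementation would also snapshot the model
// parameters (or the test error) at that point.
BestIteration runWithValidation(const OuterStep& step, int maxIter = 50) {
  assert(maxIter > 0);
  BestIteration best{-1, std::numeric_limits<double>::infinity()};
  for (int it = 0; it < maxIter; ++it) {
    double validError = step(it);
    if (validError < best.validError) {
      best.iteration = it;
      best.validError = validError;
    }
  }
  return best;
}
```

The point is that the validation set is doing model selection over iterations here, in the same way it does model selection over hyperparameters.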
==================================================
Q:
I would like to ask you a question about the source code for your work "Hidden factors and hidden topics: understanding rating dimensions with review text".
In the paper you mentioned that review topics could be aligned both with user and item features and you showed the results of both approaches in Table 3.
My questions are:
Which strategy is implemented by default in the code?
Can I easily change it?
A:
In the code factors are attached to items by default. It's very easy to change -- just change the order that they get read in, i.e., change
uName >> bName
to
bName >> uName
Then the alignment will still be to the "items" but the items will actually be the user IDs.
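To illustrate why swapping the two extraction targets is enough, here is a small stand-alone sketch. It assumes each input line begins with a user ID followed by an item ID; `parseIds` and the `alignToUsers` flag are hypothetical and not part of the HFT code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>

// Returns {uName, bName} as the rest of the program would see them.
// With alignToUsers == true the two fields are read in the opposite order,
// so whatever the model treats as "items" is really the set of user IDs.
std::pair<std::string, std::string> parseIds(const std::string& line,
                                             bool alignToUsers) {
  std::istringstream in(line);
  std::string uName, bName;
  if (alignToUsers)
    in >> bName >> uName; // swapped order
  else
    in >> uName >> bName; // default order
  assert(!in.fail());     // both fields must be present
  return {uName, bName};
}
```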
==================================================
Q:
I want to replace the LDA in the HFT model with Twitter-LDA and see whether it gives better results, but I don't know which part of your code I should work on. Could you please give me some guidance on how to change it? Twitter-LDA has one more level of hierarchy, a topic distribution over users, as well as a Bernoulli distribution that controls whether a background word or a topic word is generated.
A:
You should start by looking into the "lsq" and "dl" functions, which implement the energy and its derivative. Make sure you fully understand how these relate to the equations in the paper.
==================================================
Q:
I am using your code and I understand that it saves the topic-weights for each word in the corpus for all topics in "model.out". In my application, I am trying to compute the topic distributions for "new documents" created by considering a subset of reviews for a particular product. Is it reasonable to sum the topic-weights over the words in my new document and use the transformation exp(kappa*gamma_k) / (sum_k'(exp(kappa*gamma_k'))), where gamma_k is the sum of topic-weights for topic-k to compute theta_k? How does the background-weight for each word factor into this? If this isn't correct, can you let me know how I can compute the topic-distribution?
A:
Sounds like you're on the right track -- to compute representations for new documents, you just want to get the topic representations (i.e., phi in a topic model) and then proceed exactly as if it were a regular topic model.
For this, you *do* want to include backgroundWords though. I only separated background words from the word distributions so that they'd be more easily visualizable. To produce the word weights for each topic, you'll want to add the background weights back to each dimension (as I recall, backgroundWords+topicWords in the code).
In terms of the existing code, the most relevant thing to look at is "wordZ()". This computes the normalization constant for phi. So the actual probability that a word is associated with a topic is given by:
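double* Z = new double [K];
wordZ(Z); // fills Z[k] with the normalization constant for topic k
for (int k = 0; k < K; k ++)
  for (int w = 0; w < nWords; w ++)
    // add the background weight back in, then normalize per topic
    phi[k][w] = exp(backgroundWords[w] + topicWords[w][k]) / Z[k];
delete [] Z;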
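For reference, here is the same computation as a self-contained sketch. It assumes that wordZ fills its array with, for each topic k, the sum over all words of exp(backgroundWords[w] + topicWords[w][k]); the function name `computePhi` and the use of std::vector are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// topicWords is indexed [w][k], backgroundWords is indexed [w].
// Returns phi indexed [k][w], where each row phi[k] sums to 1.
std::vector<std::vector<double>> computePhi(
    const std::vector<double>& backgroundWords,
    const std::vector<std::vector<double>>& topicWords, int K) {
  const std::size_t nWords = backgroundWords.size();
  assert(topicWords.size() == nWords);
  std::vector<std::vector<double>> phi(K, std::vector<double>(nWords));
  for (int k = 0; k < K; ++k) {
    // Per-topic normalization constant (assumed to match what wordZ computes).
    double Z = 0;
    for (std::size_t w = 0; w < nWords; ++w)
      Z += std::exp(backgroundWords[w] + topicWords[w][k]);
    for (std::size_t w = 0; w < nWords; ++w)
      phi[k][w] = std::exp(backgroundWords[w] + topicWords[w][k]) / Z;
  }
  return phi;
}
```

With phi in hand, a new document can be treated exactly as in a regular topic model.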
==================================================
Q:
I hope it would not be too much trouble to ask you one more question regarding your paper, "Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text." I have a question about the \mu in equation 5 of page 3. You mentioned the \mu is the "hyperparameter that trades-off the importance of [the rating error and the corpus likelihood]." I was wondering how you decided on the \mu value in the model. Did you get the \mu using cross validation or another method? I was not able to figure out how to replicate your \mu and was wondering if you could guide me through your steps in obtaining the \mu. Also, does the \mu in the paper translate to the lambda in your code?
We pre-processed your data for our model by removing stop words and using stemming. We would like to compare against your model. Having the latentReg and lambda parameters for various datasets would help us with our experiments.
I also wanted to confirm that you tune these parameters by performing grid search on a validation set.
A:
1) Yes indeed it's grid search on the validation set.
2) I used powers of 10: lambda \in 0.01 ... 10, and latentReg (mu) \in 1 ... 10000.
3) I computed the validation error after each iteration of the outer loop (i.e., after performing a complete gradient update/sampling), and chose the lowest validation error. The performance tended to deteriorate somewhat after it had achieved the lowest validation error. Probably this is due to the funky combination of sampling/MLE that I do. I gave the baseline (the latent factor model) the same advantage by computing the validation error between gradient updates.
4) I can hunt down my log files and find out what the optimal values of latentReg were for me, but based on how you're pre-processing the data, the optimal values will be different for you. I didn't do any stemming/stopword removal, though I did remove punctuation/capitalization. I also used a fixed dictionary size for all experiments (5K words), and ignored words that weren't among the 5K most frequently occurring. If you do anything differently it'll change the likelihood and therefore the optimal values, so I suggest you just run grid search in the same way to get the optimal values for your processed data.
5) Ultimately, the code on my website is the same as what I used to report performance in the paper.
6) In truth I'm still not happy with the regularizer I used... ideally there should be some kind of smarter normalization based on the document length. As it is, the current model regularizes too much on long documents (i.e., products with many reviews), so that it often performs *worse* than the latent factor model for highly reviewed products, rather than what it should do, which is gracefully decay so as to ignore the regularizer altogether when there are many reviews. I tried some dumb fixes, but none of them outperformed the naive regularizer that I used in the paper (in almost all cases, products with few reviews make up the bulk of the datasets).
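The grid search in points 1) and 2) could be organized as in the sketch below. `gridSearch`, `GridResult`, and the `validationError` callback are hypothetical; the callback stands in for training a model with the given lambda and latentReg and returning the lowest validation error it reached. The grid endpoints are the ones quoted in point 2).

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

struct GridResult {
  double lambda;
  double latentReg;
  double validError;
};

// Hypothetical callback: trains with (lambda, latentReg) and returns the
// lowest validation error reached during training.
using Evaluate = std::function<double(double lambda, double latentReg)>;

GridResult gridSearch(const Evaluate& validationError) {
  // Powers of 10 over the ranges given in the answer.
  const std::vector<double> lambdas = {0.01, 0.1, 1, 10};
  const std::vector<double> latentRegs = {1, 10, 100, 1000, 10000};
  GridResult best{0, 0, std::numeric_limits<double>::infinity()};
  for (double lambda : lambdas)
    for (double latentReg : latentRegs) {
      double err = validationError(lambda, latentReg);
      assert(err == err); // guard against NaN from a diverged run
      if (err < best.validError) best = {lambda, latentReg, err};
    }
  return best;
}
```

Because the optimal values depend on pre-processing, the search has to be rerun on whatever version of the data is actually used.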
==================================================
Q:
Recently, I have become interested in making recommendations using reviews and ratings. To the best of my knowledge, your work titled "Hidden Factors and Hidden Topics" at RecSys'13 is among the state of the art. So I downloaded your source code and the Amazon datasets shared on your profile site (http://cseweb.ucsd.edu/~jmcauley/) and ran your code on these data. But the experimental results differ somewhat from those in your paper. I think there must be some difference between my experiment and yours, so I have the following questions:
In my experiment, the parameters (latentReg = 0, lambda = 0.1, K = 5) are all left at the defaults in your code. Are these the settings used for the results reported in your paper? Do these parameters need to be adjusted?
In the Amazon datasets, some itemIDs and userIDs are "unknown" for some reasons. In my experiments, I discarded these records (ratings and the corresponding reviews). Did you discard them or how did you deal with them?
A:
First, you need to do hyperparameter tuning. As I recall I took lambda in {0.001, 0.01, 0.1, 1, 10} or something like that. You'll then report the *test* error for the iteration which minimizes the *validation* error.
Second, I didn't throw out any "unknown" user/item IDs. These users/items then will all just get treated as if they were a single user/item.
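One way to see the effect: if ID strings are mapped to integer indices through a lookup table, every record labelled "unknown" lands on the same index. The sketch below is illustrative only; `IdIndex` is a hypothetical helper, not the data structure used in the HFT code.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Assigns consecutive integer indices to ID strings in order of first use.
class IdIndex {
 public:
  int get(const std::string& id) {
    assert(!id.empty());
    auto it = index_.find(id);
    if (it != index_.end()) return it->second; // seen before: reuse index
    int next = static_cast<int>(index_.size());
    index_.emplace(id, next);
    return next;
  }
  int size() const { return static_cast<int>(index_.size()); }

 private:
  std::unordered_map<std::string, int> index_;
};
```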
==================================================
Q:
I downloaded the source code of HFT; it's excellent work. But I am puzzled about the update of kappa and gamma. Here is the code:
double q = -lambda * (beerTopicCounts[b][k] - beerWords[b] * exp(*kappa * gamma_beer[b][k]) / tZ);
I think beerWords[b] should also be beerTopicCounts[b][k], since this is the derivative of
-lambda * beerTopicCounts[b][k] * (*kappa * gamma_beer[b][k] - lZ);
A:
beerTopicCounts[b][k] is the number of times topic k appears for words in reviews of b.
beerWords[b] is just the total number of words in reviews of b. So yeah, if you add up beerTopicCounts[b] for all values of k you should get beerWords[b].
Other than that it's just the derivative of the likelihood expression wrt kappa. The negative part of the expression is what you get from differentiating the (log of the) normalization constant.
Q:
If beerTopicCounts[b] is summed over all values of k, that sum could be computed outside the loop. But the derivative term is inside the loop over k. I don't understand why beerWords also appears inside the loop over k.
A:
beerWords[b] shows up because the corresponding term appears for *every* word in the review.
beerTopicCounts[b][k] shows up because that term appears only for those words that have topic k.
I'd suggest going through the likelihood equation as it's written in the paper and computing its derivative with respect to kappa. You should reproduce what I have there.
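Carrying out that derivative for a single item b gives the following. This is a sketch with notation chosen here, assuming the topic part of the likelihood has the form shown in the question: n_{bk} is beerTopicCounts[b][k], N_b is beerWords[b], and Z_b is tZ (so lZ = log Z_b).

```latex
\begin{align*}
\ell_b &= \sum_{k} n_{bk}\,\bigl(\kappa\gamma_{bk} - \log Z_b\bigr),
  \qquad Z_b = \sum_{k'} e^{\kappa\gamma_{bk'}},
  \qquad N_b = \sum_{k} n_{bk} \\
\frac{\partial \ell_b}{\partial \gamma_{bk}}
  &= \kappa\Bigl(n_{bk} - N_b\,\frac{e^{\kappa\gamma_{bk}}}{Z_b}\Bigr) \\
\frac{\partial \ell_b}{\partial \kappa}
  &= \sum_{k} \gamma_{bk}\Bigl(n_{bk} - N_b\,\frac{e^{\kappa\gamma_{bk}}}{Z_b}\Bigr)
\end{align*}
```

The bracketed factor is the quantity multiplied by -lambda in q. N_b stays inside the sum over k because differentiating N_b log Z_b produces one term per topic, each weighted by that topic's softmax probability.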
==================================================
Q:
I found that your RecSys’13 paper (http://i.stanford.edu/~julian/pdfs/recsys_extended.pdf) is very interesting. I am also working in the same area.
There are several points in your paper that are still not clear to me.
1. How does your HFT model compare to the CTR model from the "Collaborative topic modeling for recommending scientific articles" paper?
Although there are some differences between the two models, both assume a linear mapping between hidden factors and hidden topics. CTR generates items' hidden factors from their topic distribution, and HFT does the inverse.
It would also be interesting to know how the two approaches compare in performance.
2. HFT has an advantage for cold-start users and items. However, this is shown for only two of the three categories (movies, music). Do you have any explanation for why it does not work for the Book category? How about other categories, like Beer, Clothing, etc.?
A:
1. Yeah, the two models are similar, but not directly comparable (for one, CTR is not predicting ratings). To compare the two you'd have to make some modifications to CTR, and it's not clear how best to do that.
I wouldn't say that CTR does things one way while HFT does the inverse. There's a one-to-one correspondence between factors and topic distributions in HFT: one does not "generate" the other, but rather a single parameter simultaneously defines both.
2. The cold-start plots were hard to generate in a nice way. To generate these plots, you need items that show up only a few times in the training set, but enough times in the test set to generate statistically significant results. Although the plots are nice and smooth on the biggest datasets, they get pretty noisy on the smaller ones. There are also certain other factors that may be at play, for instance products with only 1-2 reviews may be somehow different from products with many reviews (e.g. niche products versus mainstream products), so it can be hard to compare the two. From memory, results were similar on other datasets, but much noisier.
As for the book category, I'm not so sure. Books are a bit unique in the sense that people discuss a lot of things that aren't evaluative (e.g. the plot of the book). There are also loads of products with only 1-2 reviews. That means to get good performance *on average across the whole corpus* the model will be very reliant on review text (which is beneficial for products with few reviews), at the expense of poor performance on products with many reviews (in which case having a lot of ratings is more reliable than having a lot of text). If you were to train the same model only on books with >10 reviews, the model would behave quite differently (e.g. it would rely less on the corpus likelihood).
Long story short I suspect that this result in the case of books is simply due to the statistics of how much of the corpus consists of "new" products versus products with many reviews. I think some careful modifications to how the regularizer works might address this.
==================================================
Q:
First of all, I need to know how to get the properties of each user and business from your existing code. I have tried several times but failed. Could you please tell me how to get them?
A:
To save the item representations you'll want to do something like
for (int i = 0; i < nItems; i ++)
{
  printf("%s", corp->rItemIds[i].c_str());
  for (int k = 0; k < K; k ++)
    printf(" %f", gamma_item[i][k]);
  printf("\n");
}
Same thing for users.
==================================================
Q:
I am trying to do some evaluation work on your HFT model. I wish to extract the word distribution over topics and the topic distribution over items, i.e., phi and theta respectively. It seems that you have already answered how to extract phi. Could you please tell me how to get the topic proportions of each item?
A:
Note that gamma_i determines the topic proportions via the softmax function in eq. 4. If you get the code to spit out kappa and gamma that will be enough to compute the topic proportions.
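As a sketch of that last step, the function below turns one item's gamma vector and the scalar kappa into topic proportions, theta_k = exp(kappa * gamma_k) / sum_k' exp(kappa * gamma_k'). `topicProportions` is a hypothetical name; subtracting the maximum before exponentiating is a standard numerical-stability trick added here, not something taken from the HFT code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax of kappa * gamma: returns theta with theta[k] >= 0 and sum 1.
std::vector<double> topicProportions(const std::vector<double>& gamma,
                                     double kappa) {
  assert(!gamma.empty());
  std::vector<double> theta(gamma.size());
  // Shift by the largest exponent so that exp() cannot overflow.
  double maxExp = kappa * gamma[0];
  for (double g : gamma) maxExp = std::max(maxExp, kappa * g);
  double Z = 0;
  for (std::size_t k = 0; k < gamma.size(); ++k) {
    theta[k] = std::exp(kappa * gamma[k] - maxExp);
    Z += theta[k];
  }
  for (double& t : theta) t /= Z;
  return theta;
}
```

The shift cancels in the ratio, so the result is the same as the unshifted softmax.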