How predictable are you? Say you view 100 pages a day, and you like about 5% of them (which you indicate by pressing a “like” button). Is there some way I can predict which 5 of the next 100 pages you view you will actually like? If I know what your interests are, maybe I can.
Using AlchemyAPI, we extract keywords from the pages that a user views in a mobile app and marks as liked. We aggregate these keywords to form an interest graph for that individual. But how good is that graph? Are the keywords we extract really representative of a user’s interests? To find out, we test it: we build an interest graph from a subset of the pages the user has seen, then measure how well that graph predicts which of the other pages the user will like.
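To make the test concrete, here is a minimal sketch of the idea in Python. It is not our production code: the keyword lists stand in for AlchemyAPI output, the "interest graph" is simplified to a bag of keyword counts, and the function names and the scoring rule (sum of matching keyword counts) are assumptions chosen for illustration.

```python
from collections import Counter

def build_interest_profile(liked_pages):
    """Aggregate keywords over the pages a user liked (hypothetical helper)."""
    profile = Counter()
    for keywords in liked_pages:
        profile.update(set(keywords))  # count each keyword once per page
    return profile

def score_page(profile, keywords):
    """Score a page by summing the profile counts of its keywords."""
    return sum(profile[k] for k in set(keywords))

def predict_likes(profile, candidates, k=5):
    """Return the ids of the k highest-scoring candidate pages."""
    ranked = sorted(
        candidates,
        key=lambda page_id: score_page(profile, candidates[page_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy data: keywords from pages the user liked (the "training" subset).
liked = [
    ["baseball", "playoffs"],
    ["baseball", "stadium"],
    ["cooking", "pasta"],
]
profile = build_interest_profile(liked)

# Held-out pages the profile has not seen.
unseen = {
    "page_a": ["baseball", "trade"],
    "page_b": ["politics", "election"],
    "page_c": ["pasta", "recipe"],
}
top = predict_likes(profile, unseen, k=2)
print(top)
```

Accuracy is then the fraction of predicted pages that the user really did like.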
How did we do? Well, we were about 23.8% accurate over 100 tests. So on average, we got about 1 out of those 5 correct. Considering that, by chance, we would only expect to get about 9.4% correct, that’s not bad (that’s a p-value of 1.085 × 10⁻¹³ for you statisticians out there). Could we do better? Certainly. As a matter of fact, it’s pretty remarkable how good this system is when you consider how little it does. For example, it does not account for synonyms. It does not account for higher-level concepts and categories. And it does not account for other forms of user activity such as commenting or sharing (both of which imply a higher level of interest in the item in question).
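For the statisticians: one common way to compare a hit rate against chance is a one-sided binomial test. The sketch below only illustrates that kind of calculation. The trial count (500, assuming 5 predictions in each of 100 tests) and the hit count are assumptions, and this is not necessarily the test behind the p-value quoted above, so don't expect it to reproduce that number.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumed, illustrative numbers (not taken from the experiment's raw data).
n_trials = 500       # assumption: 100 tests x 5 predictions each
n_hits = 119         # assumption: 23.8% of 500
chance_rate = 0.094  # chance-level accuracy quoted in the post

p_value = binomial_tail(n_hits, n_trials, chance_rate)
print(f"one-sided p-value: {p_value:.3e}")
```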
As we move forward, we are looking at other ways of extracting the interests embedded in user activity. We’ll be considering hypernyms, or superordinate categories (e.g. if someone is interested in baseball, maybe they are interested in other sports as well), leveraging the vast amount of semantic data in WordNet, Freebase, and other online services. We’ll also be comparing users (if Sue likes what Mary likes, then Mary’s activity should tell us something about Sue). And, of course, we won’t be forgetting our non-English-speaking users either. It’s an exciting time to be mining the social web.
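As a rough sketch of the hypernym idea, the Python below expands a keyword list with its parent categories. The tiny `HYPERNYMS` table is a hand-written stand-in for a real resource such as WordNet, and the function name and `max_depth` parameter are hypothetical.

```python
# Hand-written toy hierarchy: child -> parent (an assumption for illustration).
HYPERNYMS = {
    "baseball": "sport",
    "soccer": "sport",
    "sport": "activity",
    "pasta": "food",
}

def expand_with_hypernyms(keywords, hypernyms, max_depth=2):
    """Return the keywords plus up to max_depth levels of parent categories."""
    expanded = set(keywords)
    for keyword in keywords:
        current = keyword
        for _ in range(max_depth):
            parent = hypernyms.get(current)
            if parent is None:
                break  # reached the top of the toy hierarchy
            expanded.add(parent)
            current = parent
    return expanded

baseball_fan = expand_with_hypernyms(["baseball"], HYPERNYMS)
soccer_page = expand_with_hypernyms(["soccer"], HYPERNYMS)

# The raw keywords share nothing, but the expanded sets overlap on the category.
print(baseball_fan & soccer_page)
```

With expansion, a baseball fan and a soccer page now share the "sport" and "activity" categories, which plain keyword matching would miss.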