
How predictable are you? Say you view 100 pages a day. And of those pages, you like about 5% of them (and indicate that by pressing a “like” button). Is there some way I can predict which 5 of the next 100 pages you view you will actually like? If I know what your interests are, maybe I can.
Using AlchemyAPI, we extract keywords from the pages in a mobile app that a user marks as liked. We aggregate these keywords to form an interest graph for that individual. But how good is that graph? Are the keywords we extract really representative of a user’s interests? To find out, we test it: we build an interest graph from a subset of the pages the user has seen, then check how well it predicts which of the other pages he or she will like.
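For the curious, here is a minimal sketch of that kind of test. It is an illustration under assumptions, not our actual pipeline: it assumes the keywords have already been extracted (no AlchemyAPI call is shown), it treats the "interest graph" as a simple bag of keyword counts, and every function name and the sample data are hypothetical.

```python
from collections import Counter


def build_interest_profile(liked_pages):
    """Aggregate keyword counts over the pages a user liked.

    Assumption: each page is a list of keyword strings that some
    extractor has already produced.
    """
    profile = Counter()
    for keywords in liked_pages:
        profile.update(keywords)
    return profile


def score_page(profile, keywords):
    """Score a page as the summed profile weight of its distinct keywords."""
    return sum(profile[k] for k in set(keywords))


def predict_likes(profile, candidate_pages, k=5):
    """Return the ids of the k highest-scoring candidate pages.

    candidate_pages maps a page id to that page's keyword list.
    """
    ranked = sorted(
        candidate_pages,
        key=lambda page_id: score_page(profile, candidate_pages[page_id]),
        reverse=True,
    )
    return ranked[:k]


def fraction_correct(predicted, actually_liked):
    """Fraction of predicted pages that the user really liked."""
    if not predicted:
        return 0.0
    return len(set(predicted) & set(actually_liked)) / len(predicted)


# Hypothetical toy data: two liked pages used for training,
# three held-out pages to rank.
training_likes = [["baseball", "pitcher"], ["baseball", "stadium"]]
held_out = {
    "a": ["baseball", "stadium"],
    "b": ["cooking"],
    "c": ["pitcher"],
}

profile = build_interest_profile(training_likes)
predicted = predict_likes(profile, held_out, k=2)
print(predicted, fraction_correct(predicted, {"a"}))
```

Repeating this over many random train/held-out splits and averaging the fraction correct gives an accuracy figure of the kind reported next.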
How did we do? Well, we were about 23.8% accurate over 100 tests. So on average, we got about 1 out of those 5 correct. Considering that, by chance, we would only expect to get about 9.4% correct, that’s not bad (that’s a p-value of 1.085 × 10⁻¹³ for you statisticians out there). Could we do better? Certainly. As a matter of fact, it’s pretty remarkable how good this system is when you consider how little it does. For example, it does not account for synonyms. It does not account for higher level concepts and categories. And it does not account for other forms of user activity such as commenting or sharing (both of which imply a higher level of interest in the item in question).
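A note for the statisticians: one common way to ask "is this better than chance?" is a one-sided binomial test, which gives the probability of getting at least as many correct predictions as observed if each prediction succeeded only at the chance rate. The sketch below shows that calculation in general form. Treating predictions as independent trials is an assumption, the function name is hypothetical, and this is not necessarily the test behind the figure quoted above.

```python
from math import comb


def binomial_tail(successes, trials, chance_rate):
    """P(X >= successes) for X ~ Binomial(trials, chance_rate).

    Assumption: each prediction is an independent trial that is
    correct with probability chance_rate under the null hypothesis.
    """
    return sum(
        comb(trials, i) * chance_rate**i * (1 - chance_rate) ** (trials - i)
        for i in range(successes, trials + 1)
    )
```

A small tail probability means the observed hit count would be very unlikely from guessing alone.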
As we move forward, we are looking at other ways of extracting the interests embedded in user activity. We’ll be considering hypernyms or superordinate categories (e.g. if someone is interested in baseball, maybe they are interested in other sports as well), leveraging the vast amount of semantic data in WordNet, Freebase, and other online services. We’ll also be comparing users (if Sue likes what Mary likes, then Mary’s activity should tell us something about Sue). And, of course, we won’t be forgetting our non-English-speaking users either. It’s an exciting time to be mining the social web.
Is the number really as high as 5%? I would expect something closer to 0.01% for websites in general…though considerably higher in user-generated/social content.
I tried to build a similar model once that combined different kinds of votes (dwell times, repeat visits, explicit likes, etc.), the hope being to get votes about more pages. Never really went anywhere, alas.
Also: please don't train the internet to think I am a teenage girl or a single mother. That happens enough already.
Hey Q,
This was for a particularly active user who likes a lot. But you're right, most users are probably closer to 0.01%. In fact, the vast majority of likes are made by only a small fraction of active users.
However, theoretically this should also hold true for views (especially repeat views), though there may be more noise.
And don't worry, you won't be getting any Justin Bieber ads in your Harvard Business Review app.