The evaluation process

The goal of this challenge is to build an algorithm that learns efficiently a policy to serve news articles on a web site. At each iteration of the evaluation process, you will be asked to pick an article from a list given a visitor (136 binary features + a timestamp). To build a smart algorithm, you might want to balance carefully exploration and exploitation and pay close attention to the “age” of the news articles (among other things of course). A quick look on the leaderboard is enough to figure out why that last point matters. It is the overall CTR (click through rate) of your algorithm that will be taken into account to rank it on the leaderboard.

Since the evaluation process is based on data, all your choices cannot be evaluated (simply because they don’t necessarily appear in the dataset). Please have a look at the following details to fully understand how this fact should impact the design of your algorithm.

The details

The evaluation of your algorithm will be performed on the 920.000 first lines of the dataset. The constructor of your policy will be called before that.

public MyPolicy() ;

Choosing an article
At each iteration, the evaluator will ask you to choose an article to display to a visitor by the way of the following method.

public YahooArticle getActionToPerform(YahooVisitor visitor,
List possibleActions) ;

A visitor is characterized by a 136-long binary vector and a time. The method visitor.getFeatures() returns an array containing them. You also have access to a timestamp via the method visitor.getTimestamp().
NB: The method must always return an article from the list. If it doesn’t return one or returns another one, an exception will be raised and the evaluation stopped.

Updating the policy
If and only if you have chosen the same article as in the data, the evaluator allows you to update your policy using the current line of the dataset via this method:

public void updatePolicy(YahooVisitor c, YahooArticle a, Boolean reward);

NB: In case you are wondering why you cannot learn from every line: It’s because if you could, exploration would be useless!

An iteration is evaluated only if you have chosen the same article as the one in the data (same as above). As you choose among around 30 articles and that a random policy was used to generate the data, this should happen roughly 9M/30 = 300k times. Your final score is the number of times you got a click divided by the number of times you chose the same ad as in the data. This evaluation process is very close from being unbiased (see this paper for theoretical and empirical proofs).

The evaluation of your algorithm must run in less than 36 hours. Passed that limit, the evaluation process will be killed. Note that a random policy runs in 30min. It lets you plenty of computationnal time. If an error (java exception) occurs during the evaluation, the process will also be stopped.

Programs that terminates before the end of the data (not answering “correctly” at least at random) receives a value but are ranked at the end of the leaderboard with a mention “Timeout”. You can use this behavior to early stop your algorithm and test quickly a new idea.

About the 100 given lines

The rewards, the possible choices and the attributes have been modified so your should not rely on this data except for testing purposes. The resulting click rate is roughly 0.5 (in the whole dataset the average click rate is around 0.03) and you have less articles available in order to make the testing process more relevant (as far as code testing is concerned of course).