Tentative schedule

The workshop took place on July 1st (the second day of workshops) in room FRN.

Video recording of the workshop


9:00-10:00 Causal reasoning and Learning Systems.
Leon Bottou, Jonas Peters, D. Max Chickering, Patrice Simard
Due to a sudden personal reason, Leon cannot attend ICML this
year. The talk will be given by Elon Portugaly.
10:00-10:30 On Bayesian Bandit Algorithms
Émilie Kaufmann
10:30-11:00 Coffee break
11:00-12:00 Two basic problems in finite stochastic optimization
Sebastien Bubeck
12:00-12:15 Dynamic Pricing with Limited Supply
Moshe Babaioff, Shaddin Dughmi, Robert Kleinberg, Aleksandrs Slivkins
download paper
12:15-12:30 Learning Hurdles for Sleeping Experts
Varun Kanade, Thomas Steinke
 download paper
12:30-12:45  Contextual Bandits: Approximated Linear Bayes for Large Contexts
Jose Antonio Martin H
download paper
12:45-14:00 Lunch
14:00-15:00 The knowledge gradient for optimal learning
Warren B. Powell
15:00-15:10 Introduction to the Exploration & Exploitation Challenge
Jérémie Mary
15:10-15:25 Presentation by the second-place team of Phase 1
Regulating Greed over Time for Yahoo! Front Page News Article

Cynthia Rudin,  Virot Ta Chiraphadhanakul, Edward Su (MIT)
15:25-15:45 Presentation by the winner of Phase 1
Upper Confidence Bound with Time Information on News Article

Ku-Chun Chou, Hsuan-Tien Lin (National Taiwan University)
15:45-16:00 Coffee break
16:00-16:10 Announcement of the winners of Phase 2
16:10-17:10 Open discussion on future of exploration & exploitation


You can find last year's proceedings (2011 edition) in the JMLR Workshop series


Confirmed invited speakers

(in the same order as in the schedule)

Leon Bottou, Jonas Peters, D. Max Chickering, Patrice Simard.

Title: Causal reasoning and Learning Systems.

Abstract: Using the ad placement problem as an example, we describe an
intricate link between causation and the interpretation of
experimental data in learning systems. Following this lead gives:
(a) a “non-regret” perspective on exploration/exploitation decisions,
(b) means to leverage the fine structure of the causal graph,
(c) interesting differential techniques,
(d) plenty of open questions.

Emilie Kaufmann

Title: On Bayesian Upper Confidence Bounds for Bandit Problems and Optimality of Thompson Sampling

Abstract: We present two algorithms based on Bayesian modelling of the multi-armed bandit (MAB) that are proved to be optimal in terms of frequentist regret. The first, Bayes-UCB, uses as an index a Bayesian upper confidence bound based on quantiles of the posterior distribution. The second is Thompson Sampling, a randomized Bayesian algorithm dating back to 1933 whose first finite-time regret analysis was proposed at COLT 2012 and for which we now have an optimal finite-time analysis.
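Thompson Sampling is easy to state even though its optimality is the subtle part. A minimal sketch of the algorithm for Bernoulli rewards follows; the function name and interface are ours, not from the talk:

```python
import random

def thompson_sampling(pull, n_arms, horizon, seed=0):
    """Thompson Sampling for Bernoulli bandits with uniform Beta(1, 1) priors.

    `pull(arm)` must return a 0/1 reward; returns the per-arm play counts.
    """
    rng = random.Random(seed)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior and play the arm
        # whose sampled mean is largest.
        sampled = [rng.betavariate(1 + successes[a], 1 + failures[a])
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: sampled[a])
        if pull(arm):
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [successes[a] + failures[a] for a in range(n_arms)]
```

On a two-arm problem with means 0.7 and 0.3, the posterior concentrates quickly and the better arm receives the vast majority of the plays.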

Sebastien Bubeck

Title: Two basic problems in finite stochastic optimization

Abstract: I will present a new theoretical perspective on two basic problems arising in stochastic optimization. The first one is arguably the most elementary problem in stochastic optimization: assume that one wants to find the maximum of a function defined on a finite set, and that one is given a budget of n noisy evaluations. What is the best sequential allocation procedure for the evaluations?
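The abstract does not spell out an allocation procedure, but as a concrete reference point, here is a hedged sketch of the classical successive-rejects strategy for this fixed-budget problem (Audibert, Bubeck & Munos, COLT 2010); the interface is our own:

```python
import math

def successive_rejects(evaluate, n_arms, budget):
    """Fixed-budget best-arm identification by successive rejects.

    `evaluate(arm)` returns one noisy evaluation; after spending roughly
    `budget` evaluations, the single surviving arm is returned.
    """
    log_bar = 0.5 + sum(1.0 / i for i in range(2, n_arms + 1))
    active = list(range(n_arms))
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    n_prev = 0
    for k in range(1, n_arms):
        # Phase k: bring every surviving arm up to n_k evaluations,
        # then reject the arm with the worst empirical mean.
        n_k = int(math.ceil((budget - n_arms) / (log_bar * (n_arms + 1 - k))))
        for arm in active:
            for _ in range(n_k - n_prev):
                totals[arm] += evaluate(arm)
                counts[arm] += 1
        n_prev = n_k
        worst = min(active, key=lambda a: totals[a] / counts[a])
        active.remove(worst)
    return active[0]
```

The phase lengths are chosen so that the total number of evaluations stays within the budget while the arms that survive longest are sampled the most.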

The second problem that I will discuss is inspired from the issue of security analysis of a power system. We formalize this problem as follows: Let X be a set, and A a subset of X of “interesting” elements in X. One can access X only through requests to a finite set of probabilistic experts. More precisely, when one makes a request to the i^th expert, the latter draws independently at random a point from a fixed probability distribution P_i over X. One is interested in discovering rapidly as many elements of A as possible, by making sequential requests to the experts.


Warren B. Powell

Title: The knowledge gradient for optimal learning

Abstract: The knowledge gradient is a policy for efficiently learning the best of a set of choices by maximizing the marginal value of information, a form of steepest ascent for a belief model. The policy has no tunable parameters, and has been adapted to both online (bandit) and offline (ranking and selection) problems. Versions of the knowledge gradient have been derived for both discrete and continuous alternatives, and a variety of Bayesian belief models including lookup tables with correlated beliefs, parametric models and two forms of nonparametric models. We have also derived the knowledge gradient for a robust (min-max) objective. We report on comparisons against popular policies such as Boltzmann, epsilon-greedy, interval estimation and UCB policies. We will discuss applications in drug discovery (a problem with 90,000 alternatives) and the tuning of parameters for black-box simulations (vector-valued continuous parameters). Additional information may be found at http://optimallearning.princeton.edu.
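To illustrate the "marginal value of information" idea, here is a hedged sketch of the knowledge-gradient factor for independent normal beliefs, one of the simplest settings the abstract mentions; the parameterization and names are ours:

```python
import math

def _pdf(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    # Standard normal cumulative distribution via erf.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def knowledge_gradient(mu, sigma2, noise_var):
    """Index of the alternative maximizing the knowledge-gradient factor,
    under independent normal beliefs (means mu, variances sigma2) and
    Gaussian measurement noise of variance noise_var."""
    factors = []
    for x in range(len(mu)):
        # Standard deviation of the change in our belief about x
        # induced by one more noisy measurement of x.
        posterior_var = 1.0 / (1.0 / sigma2[x] + 1.0 / noise_var)
        s = math.sqrt(sigma2[x] - posterior_var)
        best_other = max(mu[y] for y in range(len(mu)) if y != x)
        z = -abs(mu[x] - best_other) / s
        # Expected improvement of the estimated best value.
        factors.append(s * (z * _cdf(z) + _pdf(z)))
    return max(range(len(mu)), key=lambda x: factors[x])
```

With equal means, the factor grows with the belief uncertainty, so the policy measures the alternative it knows least about.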

Python version

Thanks to the work of José Antonio Martín, you can now enter the competition using Python. Thanks again to him :-)


All you have to do is download the archive exploChallenge-Python, extract it, and follow the readme.txt. You have to implement the MyPolicy class. To submit your algorithm, use the same form as for Java files (you will have to provide a .zip file embedding your Python code).


The version of Python on the cluster is 2.6.5, with numpy 1.3.0 and SciPy 0.7.0. The Python code needs slightly more time than the Java one: for a very basic policy on the first part of the data, Python needs 45 min where Java needs 26 min.


Edit May 3rd: A newer version of Python is now available (Python 2.7, numpy 1.7.0.dev-259fff8, scipy 0.11.0.dev-f14e0fe). To use it, replace python exploChallenge/Main.py with python2.7 exploChallenge/Main.py in go.sh before building your submission zip. This version of Python has not been extensively tested (and some modules could be missing), so do not hesitate to report bugs.

Call for papers

We think that three new kinds of challenges are becoming more and more important in Exploration and Exploitation:

  • Contextual bandit problems with large number of contexts and (unknown) correlations between arms with respect to the context,
  • Numerical difficulties due to large scale problems, with the need for very fast, robust algorithms able to learn and reply in a few milliseconds per query on a standard computer,
  • New theoretical breakthroughs in the analysis of bandit algorithms.

Balancing exploration and exploitation is of high practical importance in large-scale applications on the web, and also in information retrieval, where we want to help users quickly access the information they are looking for. This requires learning what users are interested in (exploration) and, based on what has been learned, serving adequate targeted content (exploitation).

The workshop will be a single-day event made up of invited talks and presentations of contributed work, with time for discussion. Depending on their quality and compatibility with the workshop objectives, slots for brief talks and posters will be allocated.

We invite contributions on the topic of exploration and exploitation in various domains including (but not limited to) bandit games, online optimisation, reinforcement learning, tree search, recommender systems, information retrieval, etc. Topics of interest include: applications and practical performance of techniques to trade exploration and exploitation, benchmark/comparison of different techniques, computational challenges in large scale applications, new techniques, theoretical analyses, etc.

Contributions should be communicated to the programme committee (the organisers) by way of an extended abstract (a pdf file, from 4 to 8 pages formatted in the ICML conference paper style), sent by email to jeremie.mary[at]inria.fr before May 7th 2012, midnight WST.

Dual Submissions Policy: we accept contributions already published if these conditions are fulfilled:

  • First publication of the work in the previous year
  • If the work is also an ICML’12 paper, then the workshop paper must emphasize the link between the paper and the challenge.

Send us your best papers!

You can find last year's proceedings in the JMLR Workshop series


Some remarks

Some remarks on the task:

  • The context is a set of 136 binary features.
  • You usually have to choose your favorite arm in a set of 30 arms.
  • The clickthrough rate (CTR) is not constant over time; you can try to take advantage of this.
  • There is a total of 30 million records; 10 million are used for the first part of the challenge.
  • A record will be used for the evaluation of your algorithm only about 1/30 of the time (data are collected with a random policy). To fulfill the time constraint, this means you should be able to choose an arm within 5 ms and to update your policy within 50 ms when receiving feedback.
  • The total number of documents (i.e. articles displayed) in the first part of the challenge is 246 (timestamp 1317513291 to 1317945293). The total number of documents in the full dataset is 652 (timestamp 1317513291 to 1318809293).
  • Data are presented in chronological order, so between two consecutive users the possible choices tend to be the same, but they evolve over time.
  • Evaluation is done on our servers to simulate online learning and to prevent the use of offline methods on the whole dataset. If you want more data for testing purposes, consider the R6 dataset from Yahoo!.
  • Winners of phase 1 will be known at the beginning of June; they are strongly encouraged to present their work at the workshop.
  • Phase 2 results will be known only at the workshop. The evaluation procedure will be the same but with more (and new) data. Participants cannot submit any new algorithm; we will use their best submission from phase 1.
  • If you dislike Java, you can contact us and rewrite the evaluator under the GPL license (specs will be provided). After reviewing your code, we may accept it and allow a new programming language for all participants. The time limit will remain the same whatever the language.
  • After the challenge, the data will be made available through the Yahoo! Webscope program.
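Since the 5 ms / 50 ms budgets are enforced on the evaluation cluster, it can be worth timing your policy locally before submitting. A rough self-check might look like the sketch below; the event-tuple layout and the choose/update interface are assumptions for illustration, not the actual challenge API:

```python
import time

def check_latency(policy, events, choose_budget=0.005, update_budget=0.050):
    """Return True when the slowest choice stays under `choose_budget`
    seconds and the slowest update stays under `update_budget` seconds."""
    worst_choose = worst_update = 0.0
    for features, candidates, displayed, clicked in events:
        start = time.time()
        policy.choose(features, candidates)
        worst_choose = max(worst_choose, time.time() - start)
        start = time.time()
        policy.update(features, displayed, clicked)
        worst_update = max(worst_update, time.time() - start)
    return worst_choose < choose_budget and worst_update < update_budget
```

Tracking the worst case rather than the average matters here, since a single slow decision can already break the time constraint.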

Getting started

First you have to subscribe

How to install everything?

First you need Java. The version on the cluster is:

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Using Java 1.5 will lead to crashes.

Download this archive and extract it. It contains everything you need to take part in the challenge:
- The Java source code, in which you have to write two functions.
- The first 100 lines of the data, so that you can test your implementation.

We recommend using Eclipse to work on this project because it is a very powerful tool for writing Java code. However, if you are used to coding with something else, don’t worry: everything will be just as simple.

With Eclipse:
If you don’t already have it, you can download Eclipse here.
To open the project, click on File > New > Java Project. Then untick the Use default location box and click on the Browse button. Select the ExploChallenge directory that was in the archive.

If you cannot see the Package Explorer tab, open it by clicking on Window > Show view > Package Explorer. In this explorer, you can browse the source code of the project. A double click on the class MyPolicy of the package myPolicy displays the class you have to fill in in order to participate.

Without Eclipse:
Nothing special to do here. The source code is in the src/ directory. The class MyPolicy of the package myPolicy is the one you have to fill in in order to participate. You will need ant (a rough equivalent of make for Java), especially if you plan on using external libraries later.

How to implement your solution?

These are the methods of the class you have to fill in:

public MyPolicy() {
    // Any initialization of your algorithm should be done here.
}

public YahooArticle getActionToPerform(YahooVisitor visitor,
        List<YahooArticle> possibleActions) {
    // Given a visitor, you have to choose the "best" article in the list.
    return possibleActions.get(0); // here we choose the first article!
}

public void updatePolicy(YahooVisitor c, YahooArticle a, Boolean reward) {
    // Update your policy given the visitor, the displayed article and
    // the associated reward (click or not click).
}

You can of course create new classes/packages and use them in this one. However do not write any code in the exploChallenge package (no new classes, no modifications) because this code won’t be taken into account on our cluster. You are given a modified version of the 100 first lines of the dataset in order to test your code. Please make sure that your code is at least working on this very small subset of the data before submitting it.
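To make the three-method interface concrete, here is a hedged sketch of what a minimal ε-greedy policy could look like (shown in Python for brevity). The class and method names only mirror the shape of MyPolicy, and article identifiers are assumed hashable; check the archive for the real API:

```python
import random

class EpsilonGreedyPolicy:
    """A minimal epsilon-greedy policy with the constructor / choose /
    update shape of MyPolicy. Illustrative only, not the challenge API."""

    def __init__(self, epsilon=0.05, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.clicks = {}  # article id -> observed clicks
        self.shows = {}   # article id -> observed displays

    def get_action_to_perform(self, visitor, possible_actions):
        # With probability epsilon explore a random article; otherwise
        # exploit the best empirical click-through rate (the +1 in the
        # denominator avoids division by zero for unseen articles).
        if self.rng.random() < self.epsilon:
            return self.rng.choice(possible_actions)
        return max(possible_actions,
                   key=lambda a: self.clicks.get(a, 0) /
                                 float(self.shows.get(a, 0) + 1))

    def update_policy(self, visitor, article, reward):
        self.shows[article] = self.shows.get(article, 0) + 1
        if reward:
            self.clicks[article] = self.clicks.get(article, 0) + 1
```

Keeping the statistics in dictionaries keyed by article id makes both methods fast enough for the per-query time limits.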

With Eclipse:


Compilation is automatic. You will be notified by way of small icons as soon as you write incorrect Java code. Clicking on an icon will suggest corrections. To run the test, just click on the run button (see picture) or use Ctrl+F11. The result will be displayed in the Console tab. If you cannot see it: Window > Show View > Console.

Without Eclipse:
We supply a build.xml file, which is to ant what a Makefile is to make. The command ant compile compiles the project; ant run runs the test.

How to submit an algorithm

You have to submit a jar file built from the project here.

With Eclipse:
Display the Ant tab with Window > Show View > Ant. Then drag the build.xml file from the package explorer to the Ant tab. This procedure only has to be done once. Four targets are available. The target jar builds a jar called submission.jar at the root of the project; clean and compile are useless under Eclipse, and run can be used to run the test.

An alternative way is to right-click on the project in the package explorer and then Export… > Java > Runnable JAR File. In the window that appears, select the following options:
Launch configuration: Main – ExploChallenge
Export destination: where you want to generate the jar.
Library handling: Extract required libraries into generated JAR

Without Eclipse:
The command ant jar builds a jar called submission.jar at the root of the project.

How to use an external library

In order to allow Eclipse to find the library you are using, you should right-click on your project, then Build Path > Configure Build Path…. In the window that appears, click on Add External JARs and select the jar of your library. If you are using ant (even with Eclipse), you also have to put a copy of your library in the lib directory of the project.

A few rules concerning submissions

Your submissions are NOT allowed to log any part of the data or to access the hidden parts of the data (even for debugging purposes). All files and outputs are kept on our servers and will be inspected both automatically and manually. Any attempt at cheating will lead to immediate and definitive exclusion.

Related publications

The Pascal-2 Exploration and Exploitation Challenge 2011 presents a problem similar to web advertising as presented by [1], [2], and [3]. It can be seen as a bandit problem in which arms are (visitor, ad) pairs and rewards are binary (1 for a click, 0 otherwise). Recent theoretical results on (possibly generalized) linear bandit algorithms ([12], [13]) and on Gaussian process optimization for trading off exploration and exploitation have sparked interest in Bayesian approaches ([6], [4] and [11]) and in KL-bandits ([5] and [10]), but for web-scale applications the computational cost is critical (as well as performance!).

There is also renewed interest in Thompson sampling for such applications, brought to the attention of the community by [7], with a first theoretical analysis by [8].

[1] Unbiased Offline Evaluation of Contextual-Bandit-Based News Article Recommendation Algorithms, Lihong Li, Wei Chu, John Langford & Xuanhui Wang, WSDM 2011.

[2] Multi-armed Bandit Problems with Dependent Arms, Sandeep Pandey, Deepayan Chakrabarti & Deepak Agarwal , ICML 2007

[3] Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine, Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, Ralf Herbrich, ICML 2010

[4] Regret Bounds for Gaussian Process Bandit Problems, Steffen Grunewalder, Jean-Yves Audibert, Manfred Opper & John Shawe-Taylor, AISTATS 2010

[5] The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, by Aurelien Garivier & Olivier Cappe, arXiv:1102.2490

[6] Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, Niranjan Srinivas, Andreas Krause, Sham M. Kakade, Matthias Seeger, ICML 2010

[7] An Empirical Evaluation of Thompson Sampling, Olivier Chapelle & Lihong Li, NIPS11

[8] Analysis of Thompson Sampling for the multi-armed bandit problem, Shipra Agrawal & Navin Goyal, arXiv:1111.1797v2

[9] ICML Exploration and Exploitation challenge: Keep it Simple !, Olivier Nicol, Jeremie Mary & Philippe Preux, Journal of Machine Learning Research (JMLR), 2012 (accepted, to appear)

[10] Finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. Odalric-Ambrym Maillard, Remi Munos & Gilles Stoltz. In Conference On Learning Theory, 2011

[11] On Bayesian Upper Confidence Bounds for Bandit Problems E. Kaufmann, Olivier Cappe, & Aurélien Garivier. AISTATS 2012.

[12] Parametric bandits : The generalized linear case. Sarah Filippi, Olivier Cappé, Aurélien Garivier, & Csaba Szepesvari, NIPS, 2010.

[13] Improved Algorithms for Linear Stochastic Bandits (extended version), Yasin Abbasi-Yadkori, David Pál & Csaba Szepesvári, NIPS 2011.

Raw data description

Each line in the files corresponds to a user visit. An example line is as follows:

1317513291 id-560620 0 |user 1 9 11 13 23 16 18 17 19 15 43 14 39 30 66 50 27 104 20 |id-552077 |id-555224 |id-555528 |id-559744 |id-559855 |id-560290 |id-560518 |id-560620 |id-563115 |id-563582 |id-563643 |id-563787 |id-563846 |id-563938 |id-564335 |id-564418 |id-564604 |id-565364 |id-565479 |id-565515 |id-565533 |id-565561 |id-565589 |id-565648 |id-565747 |id-565822

which contains the following fields delimited with spaces:

* timestamp: e.g., 1317513291
* displayed_article_id: e.g., id-560620
* user_click (0 for no-click and 1 for click): e.g., 0
* string “|user” indicates the start of user features
* user features are 136-dimensional binary vectors; the IDs of nonzero features are listed after the string “|user”
* The pool of available articles for recommendation for each user visit is the set of articles that appear in that line of data.
* All user IDs (bcookies in our data) are replaced by the common string “user”.

Note that each user is associated with a 136-dimensional binary feature vector. Feature IDs take integer values in {1,2,…,136}. Feature #1 is the constant (always 1) feature, and features #2-136 correspond to other user information such as age, gender, and behavior-targeting features. Some user features are not present, since not all users were logged in to Yahoo! when they visited the front page.
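A line in this format can be parsed with straightforward string splitting on the "|" delimiters. A hedged sketch (our own helper, not part of the challenge code):

```python
def parse_line(line):
    """Parse one log line into (timestamp, displayed_article, clicked,
    user_features, candidate_articles)."""
    parts = line.split("|")
    head = parts[0].split()
    timestamp = int(head[0])
    displayed = head[1]
    clicked = head[2] == "1"
    # The first "|"-section starts with the literal token "user",
    # followed by the IDs of the nonzero binary features.
    user_tokens = parts[1].split()
    assert user_tokens[0] == "user"
    features = [int(t) for t in user_tokens[1:]]
    # Every remaining "|"-section is one candidate article id.
    candidates = [p.strip() for p in parts[2:]]
    return timestamp, displayed, clicked, features, candidates
```

Note that the displayed article id always reappears in the candidate pool, since the pool is the set of articles available for that visit.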

A unique property of this data set is that the displayed article is chosen uniformly at random from the candidate article pool. Therefore, one can use an unbiased offline evaluation method [1] to compare bandit algorithms in a reliable way.
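The replay method of [1] is simple to sketch: an event contributes to the estimate only when the policy's choice coincides with the article that was actually displayed, which is unbiased precisely because the logged choices are uniformly random. The choose/update interface below is our own naming:

```python
def replay_evaluate(policy, events):
    """Unbiased offline ("replay") evaluation of a bandit policy on logs
    collected under a uniformly random display policy: an event counts
    only when the policy picks the article that was actually shown."""
    clicks, matches = 0, 0
    for timestamp, displayed, clicked, features, candidates in events:
        if policy.choose(features, candidates) == displayed:
            matches += 1
            clicks += int(clicked)
            # The policy only ever learns from events it "accepted",
            # exactly as in online deployment.
            policy.update(features, displayed, clicked)
    return clicks / float(matches) if matches else 0.0
```

The returned value estimates the click-through rate the policy would have achieved online; with ~30 candidate articles, roughly 1/30 of the logged events are expected to match.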

Related publications for further information:
* Evaluation methodology: http://dx.doi.org/10.1145/1935826.1935878
* Reference performance: http://doi.acm.org/10.1145/1772690.1772758
* First version of the data, collected for a different period of time: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r ("R6")