Simulating Context-Free Bandits
In this post I describe a framework for, and an experiment in, simulating context-free bandits.

The explore-exploit dilemma shows up in many aspects of everyday life. To put it in concrete terms, imagine a person receives a free 30-meal gift card from a new breakfast restaurant that just opened in their city. The restaurant may be well known for its good breakfast options, and as a breakfast lover, the person wants to find the best option on the menu. Note that "best" here means personally favored, not categorically best as defined by a food critic or by social media popularity. There are dozens of breakfast options, but not all of them are equally good according to the person’s preferences.

The resource constraint is the number of meals, limited here to 30. Assuming a limit of one meal per day, that gives 30 days as slots in which to try out the meals. The goal is to maximize total reward by spending as many of the free meals as possible on the most favored menu option. For instance, if the person tries menu item four and really enjoys it, the reward can be 1. If they don’t, the reward can be 0; this lost opportunity for a positive experience is also called regret. ...
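To make the gift-card scenario concrete, here is a minimal Python sketch of it. Several things in it are my assumptions for illustration and do not come from the setup above: each menu item is given a hidden, fixed probability of being enjoyed (a Bernoulli reward), the diner follows a simple epsilon-greedy rule, and regret is measured as the gap between the best item's enjoyment probability and that of the item actually ordered, summed over all meals. The name `simulate_meals`, the parameter `epsilon`, and the example probabilities are all hypothetical.

```python
import random


def simulate_meals(enjoy_probs, n_meals=30, epsilon=0.2, seed=0):
    """Spend n_meals on a menu whose items have hidden chances of being enjoyed.

    enjoy_probs: assumed per-item probability that the diner enjoys the meal.
    epsilon: assumed chance of ordering a random item instead of the current favorite.
    """
    rng = random.Random(seed)
    n_items = len(enjoy_probs)
    counts = [0] * n_items   # how many times each item was ordered
    totals = [0] * n_items   # summed rewards collected per item
    best_prob = max(enjoy_probs)
    choices, rewards = [], []
    regret = 0.0

    for _ in range(n_meals):
        if rng.random() < epsilon or sum(counts) == 0:
            # Explore: order a random item (always done for the very first meal).
            choice = rng.randrange(n_items)
        else:
            # Exploit: order the item with the best average reward so far.
            # Items never ordered are treated as having an average of 0.
            means = [totals[i] / counts[i] if counts[i] else 0.0
                     for i in range(n_items)]
            choice = max(range(n_items), key=lambda i: means[i])

        # Reward is 1 if the meal is enjoyed, 0 otherwise.
        reward = 1 if rng.random() < enjoy_probs[choice] else 0
        counts[choice] += 1
        totals[choice] += reward
        choices.append(choice)
        rewards.append(reward)
        # Expected opportunity lost by not ordering the best item.
        regret += best_prob - enjoy_probs[choice]

    return {"choices": choices, "total_reward": sum(rewards), "regret": regret}


if __name__ == "__main__":
    # Made-up preferences for a 12-item menu; item 4 is the hidden favorite.
    menu = [0.2, 0.3, 0.1, 0.4, 0.8, 0.5, 0.3, 0.2, 0.6, 0.1, 0.4, 0.3]
    result = simulate_meals(menu, n_meals=30, epsilon=0.2, seed=42)
    print("total reward:", result["total_reward"])
    print("regret:", round(result["regret"], 2))
```

With only 30 meals and a dozen items, every exploratory order is costly, which is exactly the tension the example illustrates: lowering `epsilon` wastes fewer meals on exploration but risks settling on a mediocre item early.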