Exploration Constant in UCB1 algorithm
I'm writing the UCB1 algorithm for a game. The algorithm I'm using is:
average(i) + sqrt( (2 * ln(totalcount)) / count(i) )
where average(i) is the average score of arm i, count(i) is the number of times arm i has been sampled, and totalcount is the total number of samples across all arms. The equation gives the score for a given arm i. The arm with the highest score, max(i), is chosen and sampled. The algorithm then repeats the equation with the new data it got from the sample, ad infinitum, or until it runs out of thinking time.
I have an assignment that tells me to "modify the exploration constant" of the algorithm. I notice that it exploits rather than explores most of the time; it hardly ever tries arms that it has visited only once. However, I don't see an exploration constant. Am I missing part of the algorithm?
The 2 is the exploration constant. The larger it is, the more the algorithm favors exploration over exploitation.
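To make that concrete, here is a minimal Python sketch of the formula with the 2 pulled out as a parameter. The names `ucb1_score`, `select_arm`, and `c` are just illustrative, and returning infinity for an arm that has never been sampled is my own assumption (the formula itself is undefined when count(i) is 0).

```python
import math

def ucb1_score(average, count, total_count, c=2.0):
    """Score of one arm; c is the exploration constant (2 in the formula)."""
    if count == 0:
        # Assumption: an arm that was never sampled is always tried first.
        return math.inf
    return average + math.sqrt(c * math.log(total_count) / count)

def select_arm(averages, counts, c=2.0):
    """Return the index of the arm with the highest UCB1 score."""
    total_count = sum(counts)
    scores = [ucb1_score(a, n, total_count, c)
              for a, n in zip(averages, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

# Two arms: arm 0 looks good and is well sampled, arm 1 looks poor.
averages = [0.9, 0.2]
counts = [10, 5]
print(select_arm(averages, counts, c=2.0))   # exploits arm 0
print(select_arm(averages, counts, c=50.0))  # explores arm 1
```

With these made-up numbers, c = 2 keeps picking the arm with the better average, while a much larger c makes the less-sampled arm win. If your program exploits too much, raising c is the knob to turn.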
Also beware that the formula only makes sense when payoffs are in the [0,1] range; otherwise a large payoff (say 1000) will nullify the influence of the "exploration" part of the formula, making the algorithm exploitation-only.
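One possible way to deal with that, if your game's scores are not already in [0,1], is to rescale each payoff before it goes into the average. This is only a sketch: `normalize_payoff`, `low`, and `high` are hypothetical names, and it assumes you know (or can bound) the smallest and largest payoff the game can produce.

```python
def normalize_payoff(payoff, low, high):
    """Linearly rescale payoff from [low, high] to [0, 1].

    low and high are assumed bounds on the raw payoff; values
    outside that range are clamped to 0 or 1.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    scaled = (payoff - low) / (high - low)
    return min(1.0, max(0.0, scaled))

# Raw scores between 0 and 1000 become comparable to the sqrt term.
print(normalize_payoff(250, 0, 1000))
```

The alternative is to leave the payoffs alone and scale the exploration constant up to match their range, which amounts to the same trade-off.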