— Ch. 1 · Foundations And Frameworks —
Reinforcement learning
~5 min read · Ch. 1 of 6
In 1959, Arthur Samuel described machine learning as giving computers the ability to improve through experience rather than explicit programming. This early concept laid the groundwork for what would become reinforcement learning decades later. The field defines an agent as any entity that takes actions within a dynamic environment to maximize a cumulative reward signal. Unlike supervised learning, which relies on labeled data, or unsupervised learning, which finds patterns in unlabeled data, reinforcement learning trains agents through direct interaction with their surroundings.
The environment is typically modeled as a Markov decision process, where states transition based on the actions taken by the agent. At each discrete time step, the agent receives the current state and reward information before choosing its next action from the available options. The environment then moves to a new state and determines the reward associated with that specific transition. This cycle repeats, and the agent's goal is to learn a policy that maximizes the expected cumulative reward over time.
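To make the loop concrete, here is a minimal sketch in Python. The one-dimensional corridor environment and the placeholder random policy are illustrative assumptions for this example, not part of any standard library.

```python
import random

class GridEnv:
    """A tiny 1-D corridor: move left or right, reward +1 for reaching the goal."""
    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right; position is clamped to the corridor
        self.state = max(0, min(self.size - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0      # reward depends on the transition
        return self.state, reward, done    # new state, reward, episode-end flag

env = GridEnv()
state = env.reset()
total_reward = 0.0
for t in range(100):                        # discrete time steps
    action = random.choice([0, 1])          # placeholder policy: act at random
    state, reward, done = env.step(action)  # environment transitions and rewards
    total_reward += reward
    if done:
        break
print("return:", total_reward)
```

A learning algorithm would replace the random choice with a policy that improves from the observed (state, action, reward) transitions; the surrounding loop stays the same.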
Biological brains appear hardwired to interpret signals like pain and hunger as negative reinforcement and pleasure and food intake as positive reinforcement. Animals learn behaviors that optimize these rewards, suggesting that reinforcement learning mechanisms exist in nature. The mathematical framework assumes full observability when the agent directly observes the environmental state, whereas under partial observability the agent has access only to a subset of the state, possibly corrupted by noise.
Exploration Versus Exploitation
The trade-off between exploration and exploitation has been most thoroughly studied through the multi-armed bandit problem and, for finite state space Markov decision processes, in Burnetas and Katehakis (1997). Reinforcement learning requires clever exploration mechanisms because selecting actions at random, without reference to an estimated probability distribution, performs poorly. The dynamics of small, finite Markov decision processes are relatively well understood compared to problems with large or infinite state spaces.
One practical method is ε-greedy, where the parameter ε (0 < ε < 1) controls how much exploration occurs relative to exploitation. With probability 1 − ε the agent exploits: it selects the action it believes has the best long-term effect, breaking ties between actions uniformly at random. With probability ε the agent explores, choosing an action uniformly at random from all possibilities.
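As a concrete illustration, the following sketch applies ε-greedy to a three-armed bandit. The arm reward probabilities and the incremental-mean value update are assumptions made for the example.

```python
import random

true_means = [0.2, 0.5, 0.8]   # hidden expected reward of each arm (assumed)
q = [0.0] * 3                  # estimated value of each action
n = [0] * 3                    # pull counts, used by the incremental mean update
epsilon = 0.1

def select_action():
    if random.random() < epsilon:           # explore with probability epsilon
        return random.randrange(len(q))     # uniform over all actions
    best = max(q)                            # exploit: pick a highest-value action,
    return random.choice(                    # breaking ties uniformly at random
        [a for a, v in enumerate(q) if v == best])

for step in range(10_000):
    a = select_action()
    reward = 1.0 if random.random() < true_means[a] else 0.0  # Bernoulli reward
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]           # incremental sample mean

print([round(v, 2) for v in q])              # estimates approach true_means
```

With ε = 0.1 the agent still samples every arm often enough for its value estimates to converge, while spending roughly 90% of its steps on the arm it currently ranks best.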
Epsilon is usually a fixed parameter, but it can be adjusted according to a schedule so that the agent explores progressively less over time, or adapted heuristically to the needs of a specific application. Simple exploration methods like these remain the most practical choice, since few exploration algorithms scale well to large or infinite state spaces.
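One common schedule is exponential decay toward a small floor, sketched below; the constants here are arbitrary choices for illustration.

```python
# Exponentially decaying epsilon schedule (constants are arbitrary assumptions).
EPS_START, EPS_MIN, DECAY = 1.0, 0.05, 0.999

def epsilon_at(step):
    # Decay from EPS_START toward EPS_MIN; never explore less than the floor.
    return max(EPS_MIN, EPS_START * DECAY ** step)

for step in (0, 1_000, 5_000):
    print(step, round(epsilon_at(step), 3))
```

The floor keeps a small amount of exploration alive indefinitely, which matters when the environment can change after the initial learning phase.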