Reinforcement Learning and Life Lessons

One of the most fundamental concepts in Reinforcement Learning is the tension between exploration and exploitation. To understand RL, you must understand these two terms.

In this framework, a learning agent interacts with its environment by making observations and taking actions. These actions are either rewarded or punished. Since agents don't have neurotransmitters like we do, a simple positive number serves as a reward, while a negative number acts as punishment. The scale of these rewards is what ultimately drives the agent to prioritize certain actions over others.

The Two Strategies

The agent’s ultimate goal is to maximize the cumulative reward. To achieve this, it employs two strategies:

Exploitation: When an agent first starts training, it tries random actions. Once it finds an action that yields a positive reward, it might decide to repeat that action indefinitely. This "greedy" strategy ensures a steady stream of known rewards, but it has a major flaw: by refusing to branch out, the agent may never discover an even larger reward hidden elsewhere.

Exploration: Conversely, an agent could explore forever, constantly trying new things while ignoring or forgetting the actions that previously paid off generously.

To solve this, researchers use the epsilon-greedy algorithm. This approach balances the two by exploiting known rewards most of the time (say, 90%) while exploring the action space every now and then (the remaining 10%). Often, this ratio shifts further toward exploitation as the simulation progresses and the agent becomes more "knowledgeable."

The Human Algorithm

Real life mirrors this logic deeply, which makes sense, considering RL is modeled after animal biology and behavior. When we are young, our "action space" is vast. Some of us study hard and receive early rewards in school, only to realize later that the professional world often rewards the opposite behavior: thinking outside the box. Others pour their energy into sports, where only a tiny percentage find success before their physical window closes.

Relationships follow a similar pattern. We might find a rewarding partnership but feel the itch to keep exploring. However, unlike a simulation, life is not a game; you cannot always "reset" to a previous state once an action is taken.

The Constraint of Time

In theory, an RL agent has no time constraints, it can run as long as there is compute power. But we "mere humans" have a finite clock. Our action space is constantly shrinking. At ten years old, the dream of an Olympic gold medal is a valid data point in your action space; a few decades later, that option simply disappears. As a kid I always had grandiose dreams about my future, it was sometimes being a very successful business person or a very famous musician. Now at the age of 40, I realize that my action space is getting smaller faster every year.

This leads to the ultimate human question: How do we know when we’ve found a "good enough" life path, versus when we are simply settling because we’re afraid to keep looking?