Reinforcement Learning for Trading: How an Agent Learns to Trade, and Why It Is Harder Than It Looks

Most machine-learning approaches to trading are supervised: you show the model a pile of labelled examples ("when the chart looked like this, price went up") and it learns to predict the next move. Reinforcement learning (RL) works differently. Instead of predicting a label, an agent learns a policy, a rule that maps the current market state to an action, by interacting with the market (in practice, usually a simulation that replays historical data) and being rewarded or punished for the results of those actions. It is the same family of methods behind game-playing systems that learn by trial and error, applied to the problem of deciding when to buy, sell, or stay flat.

The vocabulary you actually need

State — what the agent sees at each step: recent returns, indicator values, current position, unrealised P&L, time of day. Designing the state is most of the work.
Action — usually a small discrete set such as {go long, go flat, go short}, or a continuous target position size.
Reward — the feedback signal. This is where trading RL lives or dies (more below).
Policy — the strategy the agent is learning, often a neural network mapping state to action.
Episode — one pass through a stretch of historical data, after which the environment resets.
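To make the vocabulary concrete, here is a minimal sketch of a trading environment in plain Python. Everything in it is an assumption made for illustration: the class name ToyTradingEnv, the three-return lookback window, and the flat cost charged per unit of position change. It loosely follows the common reset/step convention but is not any particular library's API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical action encoding: index -> target position.
ACTIONS = {0: -1, 1: 0, 2: 1}  # go short, go flat, go long


@dataclass
class ToyTradingEnv:
    """Illustrative single-instrument environment (not a library API).

    Assumes len(prices) >= window + 2 so at least one step is possible.
    """
    prices: List[float]
    window: int = 3              # number of past returns in the state
    cost_per_unit: float = 0.0   # charged per unit of position change

    def __post_init__(self) -> None:
        self.reset()

    def reset(self) -> Tuple[float, ...]:
        """Start a new episode at the first bar with a full lookback window."""
        self.t = self.window
        self.position = 0
        return self._state()

    def _state(self) -> Tuple[float, ...]:
        """State: the last `window` returns up to bar t, plus current position."""
        rets = [
            self.prices[i] / self.prices[i - 1] - 1.0
            for i in range(self.t - self.window + 1, self.t + 1)
        ]
        return tuple(rets) + (float(self.position),)

    def step(self, action: int) -> Tuple[Tuple[float, ...], float, bool]:
        """Apply the action at bar t, then earn the return from t to t + 1."""
        new_position = ACTIONS[action]
        cost = self.cost_per_unit * abs(new_position - self.position)
        self.position = new_position
        self.t += 1
        bar_return = self.prices[self.t] / self.prices[self.t - 1] - 1.0
        reward = self.position * bar_return - cost
        done = self.t == len(self.prices) - 1  # episode ends at the last bar
        return self._state(), reward, done
```

One episode is then a call to reset() followed by step() until done comes back True. The policy is whatever function picks the action from the state in between.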

Why the reward function is everything

The naive choice is to reward the agent with the change in account equity at each step. It works, but it teaches the agent nothing about risk: a policy that earns 10% with stomach-churning drawdowns scores the same as one that earns 10% smoothly. In practice you shape the reward to reflect what you actually care about — penalise volatility of returns, subtract transaction costs and spread on every trade, and add a penalty for holding through high-uncertainty periods. If you forget to charge the agent for costs, it will "discover" a beautiful high-frequency strategy that evaporates the moment it meets a real broker.
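As a rough sketch of that kind of shaping, the function below takes the raw position return, subtracts a cost proportional to the size of the trade, and subtracts a penalty proportional to the volatility of the strategy's recent returns. The function name, the parameter names, and the default values for cost_rate and risk_aversion are illustrative assumptions, not recommended settings.

```python
from statistics import pstdev
from typing import Sequence


def shaped_reward(
    step_return: float,
    prev_position: int,
    new_position: int,
    recent_strategy_returns: Sequence[float],
    cost_rate: float = 0.0005,     # assumed cost per unit of position change
    risk_aversion: float = 0.1,    # assumed weight on the volatility penalty
) -> float:
    """Position P&L minus trading cost minus a volatility penalty."""
    pnl = new_position * step_return
    trade_cost = cost_rate * abs(new_position - prev_position)
    # pstdev needs data; with fewer than two points there is no spread to penalise.
    if len(recent_strategy_returns) > 1:
        vol_penalty = risk_aversion * pstdev(recent_strategy_returns)
    else:
        vol_penalty = 0.0
    return pnl - trade_cost - vol_penalty
```

With this form, two policies that earn the same step return no longer score the same: the one whose recent returns were choppier receives the smaller reward, and any change of position is paid for up front.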

Common algorithms

Deep Q-Networks (DQN) — a neural network estimates the value (expected cumulative future reward) of each action in each state, and the agent acts on the highest estimate; a natural fit for a small discrete action set.
Policy-gradient methods such as PPO — learn the policy directly and handle continuous position sizing well; PPO is a popular, stable default.
Actor-critic methods — combine a policy ("actor") with a value estimate ("critic") to reduce noise in the learning signal. PPO is usually implemented this way, so the categories overlap.
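To show what "learning the value of each action" means mechanically, here is a sketch of the tabular Q-learning update, the rule that DQN approximates with a neural network in place of the table. It is deliberately not a DQN: there is no network, replay buffer, or target network. The function name and the default values of alpha and gamma are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

QTable = DefaultDict[Tuple[Hashable, int], float]


def q_update(
    q: QTable,
    state: Hashable,
    action: int,
    reward: float,
    next_state: Hashable,
    done: bool,
    alpha: float = 0.1,    # learning rate (assumed value)
    gamma: float = 0.99,   # discount factor (assumed value)
    n_actions: int = 3,    # e.g. short, flat, long
) -> float:
    """Move Q(state, action) a fraction alpha toward the bootstrapped target."""
    # At the end of an episode there is no future value to bootstrap from.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Usage: unseen (state, action) pairs start at 0.0.
q_table: QTable = defaultdict(float)
q_update(q_table, state="s0", action=2, reward=1.0, next_state="s1", done=False)
```

A table only works when states are discrete and few. Market states are continuous, which is why the deep variant swaps the table for a function approximator.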

Why it is harder than the demos make it look

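Non-stationarity — markets change regime, so the environment the agent trained in may no longer exist by the time it trades. Supervised models suffer from this too, but an RL agent can confidently keep exploiting a pattern that stopped existing months ago, and nothing in its training tells it to stop.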
Low signal-to-noise — financial returns are mostly noise. RL is sample-hungry, and you do not have millions of independent market "lives" to learn from the way a game simulator does.
Backtest overfitting through the back door — every time you tweak the reward, the state, or the network and re-run on the same history, you are quietly fitting to that history. Hold out data you never touch during development, and treat walk-forward testing as mandatory, not optional.
Look-ahead leakage — if any feature in the state secretly contains future information, the agent will find it and your equity curve will look magical right up until it goes live.
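Two of the defences above are easy to sketch in code: walk-forward splits, where every test window lies strictly after its training window, and lagging features so the value seen at bar t was known before bar t. Both helpers below are hypothetical names written for illustration, and the rolling-window layout is just one reasonable choice (an expanding training window is another).

```python
from typing import List, Optional, Sequence, Tuple


def walk_forward_splits(
    n_samples: int, train_size: int, test_size: int
) -> List[Tuple[range, range]]:
    """Rolling (train, test) index ranges; each test block follows its train block."""
    splits = []
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        splits.append((train, test))
        start += test_size  # roll forward by one test block
    return splits


def lag_feature(values: Sequence[float], lag: int = 1) -> List[Optional[float]]:
    """Shift a feature so position t holds the value from t - lag.

    The first `lag` entries have no history and are returned as None.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1 to avoid look-ahead")
    return [None] * min(lag, len(values)) + list(values[: max(len(values) - lag, 0)])


# Usage: 10 bars, train on 4, test on the next 2, then roll forward.
for train_idx, test_idx in walk_forward_splits(10, train_size=4, test_size=2):
    assert max(train_idx) < min(test_idx)  # no test bar precedes training data
```

Splits like these only guard against leakage across time. A feature computed with future data inside a single bar (a close price used to decide a trade at the open, say) still needs a manual audit.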

A sane way to start

Begin with a single instrument, a discrete three-action space, daily or hourly bars, and a reward that already includes costs. Build the environment so a "do nothing" policy is the baseline you must beat. Train, then evaluate only on out-of-sample data, and compare against dumb benchmarks such as buy-and-hold and a simple moving-average rule. If your agent cannot beat those after costs, the problem is not your network depth; it is the signal, the reward, or hidden leakage.
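The benchmarks themselves take only a few lines. The sketch below scores any sequence of positions after costs and builds the three baselines mentioned above. The function names, the long-or-flat form of the moving-average rule, and the proportional cost model are all assumptions made for illustration.

```python
from typing import List, Sequence


def strategy_return(
    prices: Sequence[float], positions: Sequence[int], cost_rate: float = 0.0
) -> float:
    """Total compounded return; positions[t] is held from bar t to bar t + 1."""
    equity = 1.0
    prev = 0
    for t in range(len(prices) - 1):
        pos = positions[t]
        equity *= 1.0 - cost_rate * abs(pos - prev)  # pay for position changes
        equity *= 1.0 + pos * (prices[t + 1] / prices[t] - 1.0)
        prev = pos
    return equity - 1.0


def do_nothing(prices: Sequence[float]) -> List[int]:
    return [0] * (len(prices) - 1)


def buy_and_hold(prices: Sequence[float]) -> List[int]:
    return [1] * (len(prices) - 1)


def moving_average_rule(prices: Sequence[float], window: int) -> List[int]:
    """Long when the latest price is above its trailing average, else flat.

    Uses only prices up to and including bar t to decide the position at t.
    """
    positions = []
    for t in range(len(prices) - 1):
        if t + 1 < window:
            positions.append(0)  # not enough history yet
        else:
            avg = sum(prices[t - window + 1 : t + 1]) / window
            positions.append(1 if prices[t] > avg else 0)
    return positions
```

Run the agent's out-of-sample positions through the same strategy_return call, with the same cost_rate, so all four numbers are directly comparable.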

Reinforcement learning is a genuinely powerful framing for sequential decision-making, and position sizing under uncertainty is exactly that kind of problem. But it rewards discipline far more than cleverness: honest data splits, costs baked into the reward, and a healthy suspicion of any backtest that looks too good. Get those right and RL becomes a serious tool. Skip them and it becomes the most sophisticated way yet invented to fool yourself.

Questions or a setup you want to compare notes on? Reply below — happy to dig into state design and reward shaping.