Traditional reinforcement learning relies on scalar rewards, limiting its use of rich semantic knowledge. In contrast, humans learn by
combining numerical feedback with language and prior knowledge. We introduce
Prompted Policy Search (ProPS), a novel LLM-based reinforcement learning method that unifies numerical and
linguistic reasoning within a single framework. Unlike prior work that augments existing RL components
with language, ProPS places a large language model (LLM) at the center of the
policy optimization loop—directly proposing policy updates
based on both reward feedback and natural language input. Across 15 Gymnasium tasks, ProPS
outperforms all seven popular RL methods on 8 tasks and shows further gains with domain knowledge, demonstrating the
benefits of integrating semantics and numerics for more efficient, human-aligned RL.
The video below illustrates the policy search process of ProPS in the Gymnasium Swimmer environment. It
shows how the LLM synthesizes policy parameters based on reward feedback through its own reasoning,
combining both numerical and linguistic signals.
Summary of Results
(1) LLMs can directly perform reinforcement learning and optimize policies.
(2) LLMs can use semantic signals for smarter, more efficient policy search.
(3) We introduce ProPS, a method for LLM-based policy search.
(4) ProPS outperforms widely-adopted RL algorithms, including PPO and TRPO, on 8 out of 15 diverse RL tasks.
Overview
We present a novel reinforcement learning (RL) approach in which a large language model (LLM) directly
generates policy parameters without relying on a conventional RL optimizer or any external optimization
component beyond the reward signal. Traditional RL methods focus on numerical information (e.g., gradients
with respect to the reward) and as a result cannot incorporate important task-specific knowledge that is
difficult to express in numbers, such as domain semantics or user-provided guidance. To address this
limitation, we introduce Prompted Policy Search (ProPS), a new method that combines
numerical reasoning with linguistic reasoning to enable more flexible and informed
learning. By linguistic reasoning, we mean the ability of LLMs to understand, process, and analyze natural
language in order to draw (deductive and inductive) inferences and make informed decisions. This allows us
to embed valuable information like prior domain knowledge, goals, or user-provided policy hints directly
into the learning process using natural language. For example, traditional RL methods treat all input
features as raw numbers and do not distinguish between features expressed in different units, such as
meters versus kilometers. In contrast, an LLM can interpret text-based task descriptions that explain the
nature and context of each feature.
ProPS Prompt
You are a good global optimizer, helping me find the global maximum of a mathematical function
f(params). I will give you the function evaluation and the current iteration number at each step. Your
goal is to propose input values that efficiently lead us to the global maximum within a limited number
of iterations (400).
1. Regarding the parameters param: % definitions of parameters
2. Here’s how we’ll interact: % formatting instructions
3. Remember: % constraints to be respected
The figure above illustrates a truncated version of the prompt (full prompt in paper). The system message
specifies the role of the LLM as a global optimizer and indicates the total number of optimization
iterations. The prompt includes three key components: (1) definitions of the parameters to be
optimized, (2) formatting instructions for the LLM’s output, and (3) any additional constraints the
LLM must adhere to during optimization. At each iteration, the LLM receives the prompt along
with a history of previous parameter suggestions and their associated rewards (i.e., in-context
examples). It then proposes a new parameter vector, accompanied by a textual justification of the
update. These justifications add a layer of interpretability to the search process, as they describe
observed trends in the data.
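To make this loop concrete, here is a minimal Python sketch of prompted policy search. It is not the paper's implementation: the model name, prompt wording, JSON output format, and linear policy parameterization are illustrative assumptions; only the overall structure (query the LLM with the (parameters, reward) history, parse its proposal, evaluate it in the environment, append the result) follows the description above.

```python
import json
import numpy as np
import gymnasium as gym
from openai import OpenAI  # any chat-completion client would do; OpenAI is an assumption

client = OpenAI()
env = gym.make("Swimmer-v4")
N_PARAMS = env.observation_space.shape[0] * env.action_space.shape[0]

SYSTEM = (
    "You are a good global optimizer, helping me find the global maximum of a "
    "mathematical function f(params). At each step, reply with JSON of the form "
    f'{{"params": [...], "explanation": "..."}} containing exactly {N_PARAMS} numbers.'
)

def rollout(params, episodes=3):
    """Evaluate a linear policy a = W s (one plausible parameterization)."""
    W = np.asarray(params, dtype=float).reshape(env.action_space.shape[0], -1)
    total = 0.0
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = np.clip(W @ s, env.action_space.low, env.action_space.high)
            s, r, terminated, truncated, _ = env.step(a)
            total += r
            done = terminated or truncated
    return total / episodes

history = []  # in-context examples: (params, reward) pairs
for it in range(400):
    examples = "\n".join(f"params: {p}, reward: {r:.2f}" for p, r in history)
    user = f"Iteration {it}. Previous evaluations:\n{examples}\nPropose the next params."
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user}],
    ).choices[0].message.content
    proposal = json.loads(reply)            # assumes the model returns bare JSON
    reward = rollout(proposal["params"])    # reward feedback for the next prompt
    history.append((proposal["params"], reward))
```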
ProPS+ Prompt
You are a good global RL policy optimizer, helping me find an optimal policy in the following environment:
1. Environment: % definition of the environment, parameters and policy
In the cartpole environment, a pole is attached by an un-actuated joint to a cart which moves along a
frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by
applying forces in the left and right direction on the cart. The state is a vector of 4 elements, representing
the cart position (-4.8 to 4.8), cart velocity (-inf to inf), pole angle (-0.418 to 0.418 rad), and pole
angular velocity (-inf to inf) respectively. The goal is to keep the pole upright and the cart within the
bounding position of [-2.4, 2.4]. The action space consists of 2 actions (0: push left, 1: push right).
The policy is a linear policy with 10 parameters and works as follows: action = argmax(...) The reward
is +1 for every time step the pole is upright and the cart is within the bounding position. The episode
ends when the pole falls over or the cart goes out of bounds.
2. Regarding the parameters param: % definitions of parameters
3. Here’s how we’ll interact: % formatting instructions
4. Remember: % constraints to be respected
The figure above illustrates ProPS+, which is a semantically-augmented variant of
ProPS, where we extend the basic framework to incorporate rich, task-specific, and contextual
knowledge into the reinforcement learning process via semantically-informed prompts. The
example shown describes the CartPole environment, using text adapted from publicly available documentation (e.g., OpenAI
Gym/Gymnasium). In this example, the prompt specifies details such as the task description, action space (binary),
policy parameterization (linear), and reward structure. Additionally, it includes optional expert-provided guidance on
desirable or undesirable policy behaviors, framed as constraints.
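For concreteness, the linear CartPole policy described in the prompt can be written in a few lines. The parameter layout below (a 2x4 weight matrix plus one bias per action, giving 10 parameters) is our assumption of one natural choice consistent with the argmax form and the stated parameter count; the paper's exact parameterization may differ.

```python
import numpy as np
import gymnasium as gym

def cartpole_policy(params, state):
    """Linear CartPole policy with 10 parameters: 2x4 weights plus 2 biases.
    action = argmax(W @ state + b); 0 pushes left, 1 pushes right."""
    W = np.array(params[:8]).reshape(2, 4)
    b = np.array(params[8:10])
    return int(np.argmax(W @ state + b))

def evaluate(params, episodes=5):
    """Average episodic return of the policy; this is the scalar fed back to the LLM."""
    env = gym.make("CartPole-v1")
    returns = []
    for _ in range(episodes):
        s, _ = env.reset()
        done, total = False, 0.0
        while not done:
            s, r, terminated, truncated, _ = env.step(cartpole_policy(params, s))
            total += r
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))

print(evaluate(np.random.uniform(-1, 1, size=10)))
```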
Visualizing the Policy Search Process of ProPS
Below is a visualization of the policy search process of ProPS compared to PPO, where each point represents the parameters of a learned policy, projected to two dimensions with t-SNE. Hover over any point to watch the policy in action, see how the LLM explains its decision, and explore the parameter heatmap that shaped the behavior. Larger points indicate higher rewards obtained by the policy.
Experiment Results
We evaluate the performance of both ProPS and ProPS+, using GPT-4o, across 15 widely-used reinforcement learning benchmarks from the OpenAI Gym and Gymnasium suites. Tasks with continuous state spaces use linear policy representations, while discrete-state tasks use tabular policies. The selected environments span a diverse range of RL domains, including classic control problems (e.g., CartPole, MountainCar), games (e.g., Pong, Nim), and continuous control tasks (e.g., MuJoCo environments). Notably, in 7 out of 15 environments, ProPS outperforms all baseline algorithms. After incorporating domain knowledge, ProPS+ achieves the highest performance in 8 out of 15 tasks. The table below lists the average return and standard deviation over 10 random seeds for each method, alongside the best and second-best baseline results.
| Domain | Best Baseline | 2nd Best Baseline | ProPS | ProPS+ |
|---|---|---|---|---|
| Mount. Car (C) | SAC: 86.65 ± 0.84 | PPO: 78.16 ± 5.32 | 87.21 ± 29.28 | 89.16 ± 29.72 |
| Inverted Pend. | TRPO: 571.31 ± 358.88 | PPO: 218.65 ± 129.31 | 1000.00 ± 0.00 | 1000.00 ± 0.00 |
| Inv. Dbl. Pend. | TRPO: 3609.37 ± 4000.04 | PPO: 108.60 ± 4.12 | 128.17 ± 24.52 | 148.39 ± 48.65 |
| Reacher | PPO: -7.32 ± 0.38 | TRPO: -8.93 ± 1.39 | -11.32 ± 1.37 | -18.15 ± 22.06 |
| Swimmer | TRPO: 52.96 ± 18.86 | PPO: 39.40 ± 6.54 | 218.83 ± 58.45 | 227.30 ± 56.23 |
| Hopper | TRPO: 716.90 ± 385.20 | PPO: 351.75 ± 157.71 | 284.16 ± 165.62 | 356.22 ± 292.35 |
| Walker | TRPO: 519.38 ± 73.15 | PPO: 469.78 ± 159.17 | 147.17 ± 81.20 | 126.75 ± 136.44 |
| Frozen Lake | TRPO: 0.22 ± 0.05 | PPO: 0.16 ± 0.02 | 0.57 ± 0.17 | 0.19 ± 0.05 |
| Cliff Walking | TRPO: -66.60 ± 13.61 | PPO: -94.35 ± 3.96 | -100.00 ± 0.00 | -96.40 ± 22.90 |
| Maze | A2C: 0.97 ± 0.00 | TRPO: 0.97 ± 0.00 | 0.55 ± 0.00 | 0.97 ± 0.00 |
| Nim | A2C: 0.58 ± 0.10 | TRPO: 0.50 ± 0.10 | 0.33 ± 0.29 | 0.97 ± 0.09 |
| Mount. Car (D) | DQN: -194.36 ± 1.47 | A2C: -200.00 ± 0.00 | -126.11 ± 21.67 | -116.71 ± 15.20 |
| Navigation | TRPO: 4223.51 ± 19.70 | PPO: 4127.43 ± 24.29 | 2587.30 ± 707.55 | 2779.55 ± 270.65 |
| Pong | PPO: 2.29 ± 0.91 | TRPO: 1.36 ± 1.05 | 2.80 ± 0.26 | 2.99 ± 0.03 |
| Cart Pole | TRPO: 465.34 ± 62.32 | PPO: 365.86 ± 73.38 | 478.27 ± 65.17 | 500.00 ± 0.00 |
Means ± standard errors
Learning Curves: Rewards vs Iterations
Below are the learning curves (reward vs. iteration) of our proposed ProPS. As before, hover over any point to watch the policy in action, see how the LLM explains its decision, and explore the parameter heatmap that shaped the behavior.
ProPS+: Semantics Help Reasoning
Over the span of 250 iterations, the LLM consistently reaches policies that maximize the reward in the CartPole environment. At every iteration, the LLM explains and justifies its updates to the policy parameters. These justifications, shown below, identify trends and patterns in how certain parameters, or combinations of parameters, impact the reward, providing a level of interpretability and linguistic reasoning that traditional numerical optimizers do not afford.
LLM's Explanation on Iteration 36
I chose these parameters to explore combinations close to those that have provided higher episodic rewards (e.g., [66.00, 66.40]), by incorporating positive values near a previously successful range for indices such as cart velocity and pole angle. I hope this furthers the exploration around these promising regions with slightly varied step size.
LLM's Explanation on Iteration 123
In the previous iterations, I have observed that increasing the combination of parameters related to the pole angle and pole angular velocity, while balancing the contributions from cart position and velocity, generally yields higher episodic rewards. I'm incrementally exploring by slightly increasing the parameters where the local policy improvements occurred, aiming to explore a potentially higher reward solution while maintaining the control balance.
LLM's Explanation on Iteration 175
To explore new potential state-action dynamics for increased episodic reward, I chose to adjust `params[2]` to a slightly lower value and increased `params[7]` for larger adjustments in weights associated with pole angle and pole angular velocity. Additionally, `params[6]` is shifted up slightly to modify the cart's position handling. This variation is within unexplored combinations that could potentially offer better stability or control, aiming to increase the reward.
LLM's Explanation on Iteration 244
Having observed that increasing params related to pole angular velocity, especially params[7], has led to higher rewards recently, I will continue exploring this parameter by increasing it to further test its impact.
Why do LLMs work? Numerical Optimization
In this section, we demonstrate that large language models (LLMs), specifically Gemini-1.5-pro and GPT-4o, can effectively perform direct numerical optimization, such as minimizing complex mathematical functions. Given that policy search in reinforcement learning is fundamentally a numerical optimization problem, where the goal is to maximize reward within a parameterized policy space, we argue that LLMs can be used for direct policy search.
Specifically, how well can LLMs perform numerical optimization? To answer this, we conducted a thorough evaluation across several benchmark functions, such as Ackley and Rastrigin. The results report the mean and standard deviation of the final objective function values over 50 independent trials, each consisting of 100 optimization steps.
Gemini-1.5-pro consistently outperformed the other methods, achieving the best (lowest) average objective value in 12 out of the 20 tested optimization scenarios. This robust performance demonstrates the capability and potential of LLM-based approaches for policy optimization tasks.
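The two benchmark functions named above are standard test functions; a short sketch is given below so the optimization targets are concrete. Both have a global minimum of 0 at the origin, and the dimensionality is a free choice.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2 * np.pi):
    """Ackley function: global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def rastrigin(x, A=10.0):
    """Rastrigin function: global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    return A * x.size + np.sum(x**2 - A * np.cos(2 * np.pi * x))

# The LLM optimizer is queried exactly as in the ProPS sketch above, but with
# f(params) set to one of these functions instead of an RL return.
print(ackley([0.0, 0.0]), rastrigin([0.0, 0.0]))  # both ~0.0
```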
Try the Code Yourself!
More Insights
Impact of In-Context History Size
We first examine whether the number of in-context examples (i.e., the history length N) influences policy search performance. The figure shows the results on the Mountain Car task. We observe a clear, nearly linear improvement in average reward as N increases. When N = 1, which is analogous to a conventional optimizer maintaining only a single candidate, the reward plateaus around 100. In contrast, when the full history is utilized (unbounded N), the agent reaches the maximum reward of 200. This highlights the benefit of leveraging historical parameter-reward pairs, as the LLM is able to synthesize more effective updates over time.
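In a sketch like the loop shown earlier, varying N simply means truncating the (parameters, reward) history before it is placed into the prompt; the helper below (hypothetical names) illustrates the idea.

```python
import numpy as np

def format_history(history, n=None):
    """Keep only the last n (params, reward) pairs as in-context examples.
    n=None corresponds to the unbounded-history setting."""
    window = history if n is None else history[-n:]
    return "\n".join(
        f"params: {np.round(p, 3).tolist()}, reward: {r:.2f}" for p, r in window
    )

# n=1 mimics an optimizer that only remembers its most recent candidate:
# prompt_examples = format_history(history, n=1)
```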
Run Time Comparison
Next, we evaluate the computational efficiency of our proposed methods, ProPS and ProPS+, in comparison to the baselines. To ensure a fair comparison that accounts for potential differences in CPU utilization during training, we recorded the CPU time of the traditional RL algorithms. In this setting, ProPS and ProPS+ show modest time requirements compared with the baselines.
Effect of LLM Choice
We next assess the robustness of our method across different large language models. Specifically, we evaluate GPT-4o, Gemini-2.5-Flash, Claude-3.7-Sonnet and Qwen2.5-14B-Instruct on the Mountain Car and Swimmer tasks. As shown in the figure, all proprietary models show strong performance, demonstrating that modern LLMs are capable of supporting effective prompted policy search, albeit with differences in sample efficiency and final performance. By comparison, lightweight LLMs such as Qwen are free and resource-efficient but have more limited capabilities with regard to numerical optimization and policy search.
Fine-Tuning for Policy Search
We therefore explore whether a lightweight LLM can be explicitly fine-tuned to improve its prompted policy search capabilities. To this end, we perform GRPO fine-tuning of the Qwen2.5-14B-Instruct model using a dataset of 2000 randomly generated policy parameters for the Mountain Car Continuous task. After fine-tuning, we evaluate the model on three tasks: Mountain Car, Inverted Pendulum and Pong, to assess generalization. The fine-tuned model outperforms its pre-trained counterpart on all three tasks, suggesting that targeted fine-tuning can enhance general policy search capabilities beyond the training task.
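As a rough illustration of what such a setup could look like with Hugging Face's TRL library, the sketch below fine-tunes a model with GRPOTrainer against a rollout-based reward. The prompt text, reward parsing, policy parameterization (3 linear parameters for MountainCarContinuous-v0), and hyperparameters are all our assumptions, not the paper's training recipe.

```python
import re
import numpy as np
import gymnasium as gym
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

PROMPT = ("You are a good global optimizer... Propose 3 comma-separated parameter "
          "values for a linear MountainCarContinuous policy.")  # abbreviated, hypothetical

def policy_return(params):
    """Hypothetical stand-in reward: episodic return of a 3-parameter linear policy."""
    env = gym.make("MountainCarContinuous-v0")
    w = np.asarray(params, dtype=float)
    s, _ = env.reset(seed=0)
    total, done = 0.0, False
    while not done:
        a = np.clip([w[:2] @ s + w[2]], -1.0, 1.0)
        s, r, terminated, truncated, _ = env.step(a)
        total += r
        done = terminated or truncated
    return total

def rollout_reward(completions, **kwargs):
    """Score each sampled completion by parsing a parameter vector and rolling it out."""
    rewards = []
    for text in completions:
        nums = [float(x) for x in re.findall(r"-?\d+\.?\d*", text)][:3]
        rewards.append(policy_return(nums) if len(nums) == 3 else -1000.0)
    return rewards

# 2000 prompts sharing the same instruction; GRPO compares groups of sampled completions.
dataset = Dataset.from_dict({"prompt": [PROMPT] * 2000})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-14B-Instruct",
    reward_funcs=rollout_reward,
    args=GRPOConfig(output_dir="qwen-props-grpo", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```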
BibTeX
@article{zhou2025props,
author = {Zhou, Yifan and Grover, Sachin and El Mistiri, Mohamed and Kalirathinam, Kamalesh and Kerhalkar, Pratyush and Mishra, Swaroop and Kumar, Neelesh and Gaurav, Sanket and Aran, Oya and Ben Amor, Heni},
title = {Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs},
year = {2025},
}