(1) LLMs can directly optimize policies for RL tasks.
(2) LLMs can use semantic signals for smarter, more efficient policy search.
(3) Our proposed ProPS outperforms baselines on 8 out of 15 Gymnasium tasks.
Abstract
Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to
leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn
efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce
Prompted Policy Search (ProPS), a novel RL method that unifies numerical and
linguistic reasoning within a single framework. Unlike prior work that augments existing RL components
with language, ProPS places a large language model (LLM) at the center of the
policy optimization loop—directly proposing policy updates
based on both reward feedback and natural language input. We show that LLMs can perform numerical
optimization in-context, and that incorporating semantic signals, such as goals, domain knowledge, and
strategy hints can lead to more informed exploration and sample-efficient learning. ProPS is evaluated
across 15 Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to
seven widely-adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on 8 out of 15
tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the
potential of unifying semantics and numerics for transparent, generalizable, and human-aligned RL.
Visualization of Policy Search
Each point in the space below represents a learned policy. Hover over any point to watch
the policy in action, see how the LLM explains its decision, and explore the parameter heatmap that shaped the behavior. Larger points indicate higher rewards obtained by the policy.
[Interactive figure: policy-space visualization comparing ProPS and PPO from Episode 0 to Episode 8000, with per-point reward, LLM explanation, and policy-parameter heatmap.]
Overview
We present a novel reinforcement learning (RL) approach in which a large language model (LLM) directly
generates policy parameters without relying on a conventional RL optimizer or any external optimization
component beyond the reward signal. Traditional RL methods focus on numerical information (e.g., gradients of the
expected reward with respect to the policy parameters) and as a result cannot incorporate important task-specific knowledge that is
difficult to express in numbers, such as domain semantics or user-provided guidance. To address this
limitation, we introduce Prompted Policy Search (ProPS), a new method that combines
numerical reasoning with linguistic reasoning to enable more flexible and informed
learning. By linguistic reasoning, we mean the ability of LLMs to understand, process, and analyze natural
language in order to draw (deductive and inductive) inferences and make informed decisions. This allows us
to embed valuable information like prior domain knowledge, goals, or user-provided policy hints directly
into the learning process using natural language. For example, traditional RL methods treat all input
features as raw numbers and do not distinguish between features expressed in different units, such as
meters versus kilometers. In contrast, an LLM can interpret text-based task descriptions that explain the
nature and context of each feature.
ProPS Prompt
You are a good global optimizer, helping me find the global maximum of a mathematical function
f(params). I will give you the function evaluation and the current iteration number at each step. Your
goal is to propose input values that efficiently lead us to the global maximum within a limited number
of iterations (400).
1. Regarding the parameters param: % definitions of parameters
2. Here’s how we’ll interact: % formatting instructions
3. Remember: % constraints to be respected
The figure above illustrates a truncated version of the prompt (the full prompt is in the paper). The system message
specifies the role of the LLM as a global optimizer and indicates the total number of optimization
iterations. The prompt includes three key components: (1) definitions of the parameters to be
optimized, (2) formatting instructions for the LLM’s output, and (3) any additional constraints the
LLM must adhere to during optimization. At each iteration, the LLM receives the prompt along
with a history of previous parameter suggestions and their associated rewards (i.e., in-context
examples). It then proposes a new parameter vector, accompanied by a textual justification of the
update. These justifications add a layer of interpretability to the search process, as they describe
observed trends in the data.
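As a rough illustration of this loop (a sketch, not the paper's exact implementation), the code below assumes a generic query_llm(system_prompt, user_msg) helper standing in for whatever chat-completion API is used, and an evaluate_policy(params) oracle that returns the episodic return of a parameter vector; parameter parsing is done with a simple regular expression.

```python
import re
import numpy as np

def parse_params(reply, n_params):
    """Extract the first n_params numbers from the LLM's reply."""
    nums = re.findall(r"-?\d+\.?\d*(?:[eE]-?\d+)?", reply)
    return np.array([float(x) for x in nums[:n_params]])

def props_search(query_llm, system_prompt, evaluate_policy, n_params, iterations=400):
    """Prompted policy search sketch: the LLM sees the (params, reward) history as
    in-context examples and proposes the next parameter vector plus a justification."""
    params = np.random.uniform(-1.0, 1.0, n_params)        # random initial policy
    history = [(params, evaluate_policy(params))]
    for t in range(iterations):
        examples = "\n".join(
            f"Iteration {i}: params={[round(float(x), 3) for x in p]}, reward={r:.2f}"
            for i, (p, r) in enumerate(history)
        )
        user_msg = (examples + f"\nPropose the next {n_params} parameters as a list of "
                    "floats, followed by a one-sentence justification.")
        reply = query_llm(system_prompt, user_msg)          # placeholder for the chat API call
        params = parse_params(reply, n_params)
        history.append((params, evaluate_policy(params)))
    return max(history, key=lambda pair: pair[1])           # best (params, reward) found
```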
ProPS+ Prompt
You are a good global RL policy optimizer, helping me find an optimal policy in the following environment:
1. Environment: % definition of the environment, parameters and policy
In the cartpole environment, a pole is attached by an un-actuated joint to a cart which moves along a
frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by
applying forces in the left and right direction on the cart. The state is a vector of 4 elements, representing
the cart position (-4.8 to 4.8), cart velocity (-inf to inf), pole angle (-0.418 to 0.418 rad), and pole
angular velocity (-inf to inf) respectively. The goal is to keep the pole upright and the cart within the
bounding position of [-2.4, 2.4]. The action space consists of 2 actions (0: push left, 1: push right).
The policy is a linear policy with 10 parameters and works as follows: action = argmax(...) The reward
is +1 for every time step the pole is upright and the cart is within the bounding position. The episode
ends when the pole falls over or the cart goes out of bounds.
2. Regarding the parameters param: % definitions of parameters
3. Here’s how we’ll interact: % formatting instructions
4. Remember: % constraints to be respected
The figure above illustrates ProPS+, which is a semantically-augmented variant of
ProPS, where we extend the basic framework to incorporate rich, task-specific, and contextual
knowledge into the reinforcement learning process via semantically-informed prompts. The
example shown describes the CartPole environment, using text adapted from publicly available documentation (e.g., OpenAI
Gym/Gymnasium). In this example, the prompt specifies details such as the task description, action space (binary),
policy parameterization (linear), and reward structure. Additionally, it includes optional expert-provided guidance on
desirable or undesirable policy behaviors, framed as constraints.
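For concreteness, one plausible reading of the 10-parameter linear CartPole policy described in the prompt is a 2x4 weight matrix plus a 2-element bias, with the action chosen by argmax over the two scores; this layout is an assumption on our part, as the exact parameterization is elided above. The return oracle below is the only numerical feedback the LLM receives.

```python
import numpy as np
import gymnasium as gym

def linear_policy(params, state):
    """Assumed layout: a 2x4 weight matrix (one row per action) plus a 2-element bias."""
    W, b = params[:8].reshape(2, 4), params[8:]
    return int(np.argmax(W @ state + b))                   # action = argmax over action scores

def evaluate_policy(params, episodes=5, seed=0):
    """Average episodic return of the linear policy on CartPole-v1."""
    env = gym.make("CartPole-v1")
    total = 0.0
    for ep in range(episodes):
        state, _ = env.reset(seed=seed + ep)
        done = False
        while not done:
            state, reward, terminated, truncated, _ = env.step(linear_policy(params, state))
            total += reward
            done = terminated or truncated
    env.close()
    return total / episodes
```

Plugging evaluate_policy into the props_search sketch above, with the ProPS+ prompt as the system message, recovers the full loop: the LLM only ever sees scalar returns and the textual environment description, never gradients.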
Learning Curves: Rewards vs Iterations
The learning curve of ProPS (reward vs. iteration) is shown below. As before, hover over any point to watch the policy in action, read the LLM's explanation of its decision, and explore the parameter heatmap that shaped the behavior.
[Interactive figure: ProPS learning curve with per-point reward, LLM explanation, and policy-parameter heatmap.]
Experiment Results
We evaluate the performance of both ProPS and ProPS+, using GPT-4o, across 15 widely-used
reinforcement learning benchmarks from the OpenAI Gym and Gymnasium suites. For tasks with
continuous state spaces, we employ linear policy representations, while discrete-state tasks use tabular policies. The
selected environments span a diverse range of RL domains, including classic control problems (e.g., CartPole,
MountainCar), games (e.g., Pong, Nim), and continuous control tasks (e.g., MuJoCo environments).
Notably, in 7 out of 15 environments, ProPS outperforms all baseline algorithms. After incorporating
domain knowledge, ProPS+ achieves the highest performance in 8 out of 15 tasks. The table below reports the
average return and standard deviation over 10 random seeds for each method; for the baselines, only the best and
second-best results are shown, together with the algorithm that achieved them.
| Domain | Best Baseline | 2nd Best Baseline | ProPS | ProPS+ |
|---|---|---|---|---|
| Mount. Car (C) | SAC: 86.65 ± 0.84 | PPO: 78.16 ± 5.32 | 87.21 ± 29.28 | 89.16 ± 29.72 |
| Inverted Pend. | TRPO: 571.31 ± 358.88 | PPO: 218.65 ± 129.31 | 1000.00 ± 0.00 | 1000.00 ± 0.00 |
| Inv. Dbl. Pend. | TRPO: 3609.37 ± 4000.04 | PPO: 108.60 ± 4.12 | 128.17 ± 24.52 | 148.39 ± 48.65 |
| Reacher | PPO: -7.32 ± 0.38 | TRPO: -8.93 ± 1.39 | -11.32 ± 1.37 | -18.15 ± 22.06 |
| Swimmer | TRPO: 52.96 ± 18.86 | PPO: 39.40 ± 6.54 | 218.83 ± 58.45 | 227.30 ± 56.23 |
| Hopper | TRPO: 716.90 ± 385.20 | PPO: 351.75 ± 157.71 | 284.16 ± 165.62 | 356.22 ± 292.35 |
| Walker | TRPO: 519.38 ± 73.15 | PPO: 469.78 ± 159.17 | 147.17 ± 81.20 | 126.75 ± 136.44 |
| Frozen Lake | TRPO: 0.22 ± 0.05 | PPO: 0.16 ± 0.02 | 0.57 ± 0.17 | 0.19 ± 0.05 |
| Cliff Walking | TRPO: -66.60 ± 13.61 | PPO: -94.35 ± 3.96 | -100.00 ± 0.00 | -96.40 ± 22.90 |
| Maze | A2C: 0.97 ± 0.00 | TRPO: 0.97 ± 0.00 | 0.55 ± 0.00 | 0.97 ± 0.00 |
| Nim | A2C: 0.58 ± 0.10 | TRPO: 0.50 ± 0.10 | 0.33 ± 0.29 | 0.97 ± 0.09 |
| Mount. Car (D) | DQN: -194.36 ± 1.47 | A2C: -200.00 ± 0.00 | -126.11 ± 21.67 | -116.71 ± 15.20 |
| Navigation | TRPO: 4223.51 ± 19.70 | PPO: 4127.43 ± 24.29 | 2587.30 ± 707.55 | 2779.55 ± 270.65 |
| Pong | PPO: 2.29 ± 0.91 | TRPO: 1.36 ± 1.05 | 2.80 ± 0.26 | 2.99 ± 0.03 |
| Cart Pole | TRPO: 465.34 ± 62.32 | PPO: 365.86 ± 73.38 | 478.27 ± 65.17 | 500.00 ± 0.00 |
Mean return ± standard deviation over 10 random seeds.
ProPS+: Semantics Help Reasoning
Within 250 iterations, the LLM consistently reaches policies that maximize the reward in the CartPole environment. At each iteration, it also provides a textual justification for its policy-parameter updates. These justifications, shown below, describe trends and patterns in how individual parameters or combinations of parameters affect the reward, providing a level of interpretability and linguistic reasoning that traditional numerical optimizers do not offer.
LLM's Explanation on Iteration 36
I chose these parameters to explore combinations close to those that have provided higher episodic rewards (e.g., [66.00, 66.40]), by incorporating positive values near a previously successful range for indices such as cart velocity and pole angle. I hope this furthers the exploration around these promising regions with slightly varied step size.
LLM's Explanation on Iteration 123
In the previous iterations, I have observed that increasing the combination of parameters related to the pole angle and pole angular velocity, while balancing the contributions from cart position and velocity, generally yields higher episodic rewards. I'm incrementally exploring by slightly increasing the parameters where the local policy improvements occurred, aiming to explore a potentially higher reward solution while maintaining the control balance.
LLM's Explanation on Iteration 175
To explore new potential state-action dynamics for increased episodic reward, I chose to adjust `params[2]` to a slightly lower value and increased `params[7]` for larger adjustments in weights associated with pole angle and pole angular velocity. Additionally, `params[6]` is shifted up slightly to modify the cart's position handling. This variation is within unexplored combinations that could potentially offer better stability or control, aiming to increase the reward.
LLM's Explanation on Iteration 244
Having observed that increasing params related to pole angular velocity, especially params[7], has led to higher rewards recently, I will continue exploring this parameter by increasing it to further test its impact.
Why do LLMs work? Numerical Optimization
Our experiments demonstrate that large language models (LLMs), specifically Gemini-1.5-pro and GPT-4o, can effectively
perform direct numerical optimization tasks, such as minimizing complex mathematical functions. Given that policy search
in reinforcement learning is fundamentally a numerical optimization problem—where the goal is to maximize rewards within
a parameterized policy space—we investigated how well LLMs perform compared to classical optimization algorithms,
including Gradient Descent and Adam.
We conducted a thorough evaluation across several numerical optimization benchmarks, such as the Ackley and Rastrigin
functions. The results reflect the mean and standard deviation of the final objective function values obtained
over 50 independent trials, each consisting of 100 optimization steps.
Gemini-1.5-pro consistently outperformed other methods, achieving the best (lowest) average objective value in 12 out of
the 20 tested optimization scenarios. This robust performance highlights the capability and potential of LLM-based
approaches for policy optimization tasks.
Try the Code Yourself!
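As a starting point, here is a minimal sketch of the numerical-optimization setting described above: the standard Ackley and Rastrigin test functions plus a finite-difference gradient-descent baseline. The learning rate, starting range, and finite-difference scheme are our own illustrative choices rather than the paper's exact protocol; an LLM-based optimizer would replace gradient_descent with a loop that feeds the (x, f(x)) history to the model as text, as in the ProPS sketch above.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2 * np.pi):
    """Ackley function: global minimum 0 at x = 0."""
    d = len(x)
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def rastrigin(x):
    """Rastrigin function: global minimum 0 at x = 0."""
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def gradient_descent(f, x0, lr=0.01, steps=100, eps=1e-5):
    """Classical baseline: gradient descent with finite-difference gradients."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(len(x))])
        x -= lr * grad
    return x, f(x)

# Example: 2-D Rastrigin from a random start.
x_final, f_final = gradient_descent(rastrigin, np.random.uniform(-5.12, 5.12, size=2))
print(f"final point {x_final}, objective {f_final:.4f}")
```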
More Insights
Impact of In-Context History Size
We first examine whether the number of in-context examples (i.e., the history length N) influences policy
search performance. The figure shows the results on the Mountain Car task. We observe a
clear, nearly linear improvement in average reward as N increases. When N=1 (which is analogous to a conventional
optimizer maintaining only a single candidate), the reward plateaus around 100. In contrast, when the full history is
utilized (unbounded N), the agent reaches the maximum reward of 200. This highlights the benefit of leveraging
historical parameter-reward pairs, as the LLM is able to synthesize more effective updates over time.
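A hypothetical helper showing how this truncation would be applied when rendering the in-context examples (the name and formatting are illustrative; n=None corresponds to the unbounded setting, n=1 to the single-candidate setting):

```python
def format_history(history, n=None):
    """Render the most recent n (params, reward) pairs as in-context examples.
    n=None keeps the full history; n=1 mimics an optimizer tracking a single candidate."""
    recent = history if n is None else history[-n:]
    return "\n".join(
        f"params={[round(float(x), 3) for x in params]}, reward={reward:.2f}"
        for params, reward in recent
    )
```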
Run Time Comparison
Next, we evaluate the computational efficiency of our proposed methods, ProPS and ProPS+, in comparison to the baselines.
To ensure a fair comparison that accounts for potential differences in CPU utilization during training, we recorded the
CPU time for the traditional RL algorithms. In this setting, ProPS and ProPS+ show modest time requirements compared with the baselines.
Effect of LLM Choice
We next assess the robustness of our method across different large language models. Specifically, we evaluate GPT-4o,
Gemini-2.5-Flash, Claude-3.7-sonnet and
Qwen2.5-14B-Instruct on the Mountain Car and Swimmer tasks. As shown in
the figure, all proprietary models show strong performance, demonstrating that modern
LLMs are capable of supporting effective prompted policy search, albeit with differences in sample efficiency and final
performance. By comparison, lightweight LLMs such as Qwen are free and resource-efficient but have more limited
capabilities with regard to numerical optimization and policy search.
Fine-Tuning for Policy Search
We therefore explore whether a lightweight LLM can be explicitly fine-tuned to improve its prompted policy search
capabilities. To this end, we perform GRPO fine-tuning of the Qwen2.5-14B-Instruct model using a
dataset of 2000 randomly generated policy parameters for the Mountain Car Continuous task.
After fine-tuning, we evaluate the model on three tasks (Mountain Car, Inverted Pendulum, and Pong) to assess generalization.
The fine-tuned model outperforms its pre-trained counterpart on all
tasks, suggesting that targeted fine-tuning can enhance general policy search capabilities beyond the training task.
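The fine-tuning code itself is not shown here; a rough sketch of what such a setup could look like with Hugging Face TRL's GRPOTrainer follows. The parameter count, the prompt construction, and the evaluate_policy oracle (an environment-return function analogous to the CartPole sketch above, here wrapping MountainCarContinuous-v0) are assumptions on our part, not the paper's code.

```python
import re
import numpy as np
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

N_PARAMS = 3  # assumed linear-policy size for MountainCarContinuous-v0 (2 weights + 1 bias)

def build_prompt(rng):
    """Illustrative prompt: a small (params, reward) history in the ProPS style.
    In practice the history rewards would come from evaluate_policy as well."""
    history = [(rng.uniform(-1, 1, N_PARAMS), rng.uniform(-50, 100)) for _ in range(5)]
    lines = [f"params={[round(float(x), 3) for x in p]}, reward={r:.2f}" for p, r in history]
    return "\n".join(lines) + f"\nPropose the next params as a list of {N_PARAMS} floats."

def parse_params(text):
    nums = re.findall(r"-?\d+\.?\d*", text)
    return np.array([float(x) for x in nums[:N_PARAMS]]) if len(nums) >= N_PARAMS else None

def policy_search_reward(completions, **kwargs):
    """Score each completion by the episodic return of the parameters it proposes.
    evaluate_policy is an assumed environment-return oracle; malformed outputs are penalized."""
    rewards = []
    for text in completions:
        params = parse_params(text)
        rewards.append(-100.0 if params is None else float(evaluate_policy(params)))
    return rewards

rng = np.random.default_rng(0)
dataset = Dataset.from_dict({"prompt": [build_prompt(rng) for _ in range(2000)]})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-14B-Instruct",
    reward_funcs=policy_search_reward,
    args=GRPOConfig(output_dir="qwen2.5-14b-props-grpo", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```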
BibTeX
@article{zhou2025props,
author = {xxx},
title = {Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs},
year = {2025},
}