Abstract

Traditional reinforcement learning relies on scalar rewards, limiting its use of rich semantic knowledge. In contrast, humans learn by combining numerical feedback with language and prior knowledge. We introduce Prompted Policy Search (ProPS), a novel LLM-based reinforcement learning method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augments existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop—directly proposing policy updates based on both reward feedback and natural language input. Across 15 Gymnasium tasks, ProPS outperforms all seven popular RL methods on 8 tasks and shows further gains with domain knowledge, demonstrating the benefits of integrating semantics and numerics for more efficient, human-aligned RL.

The video below illustrates the policy search process of ProPS in the Gymnasium Swimmer environment. It shows how the LLM reasons over reward feedback to synthesize policy parameters, combining numerical and linguistic signals.

Summary of Results

(1) LLMs can directly perform reinforcement learning and optimize policies.

(2) LLMs can use semantic signals for smarter, more efficient policy search.

(3) We introduce ProPS, a method for LLM-based policy search.

(4) ProPS outperforms widely-adopted RL algorithms, including PPO and TRPO, on 8 out of 15 diverse RL tasks.


Overview

We present a novel reinforcement learning (RL) approach in which a large language model (LLM) directly generates policy parameters without relying on a conventional RL optimizer or any external optimization component beyond the reward signal. Traditional RL methods focus on numerical information (e.g., gradients of the expected reward with respect to the policy parameters) and, as a result, cannot incorporate important task-specific knowledge that is difficult to express in numbers, such as domain semantics or user-provided guidance. To address this limitation, we introduce Prompted Policy Search (ProPS), a new method that combines numerical reasoning with linguistic reasoning to enable more flexible and informed learning. By linguistic reasoning, we mean the ability of LLMs to understand, process, and analyze natural language in order to draw (deductive and inductive) inferences and make informed decisions. This allows us to embed valuable information such as prior domain knowledge, goals, or user-provided policy hints directly into the learning process using natural language. For example, traditional RL methods treat all input features as raw numbers and do not distinguish between features expressed in different units, such as meters versus kilometers. In contrast, an LLM can interpret text-based task descriptions that explain the nature and context of each feature.

Example
ProPS Prompt
You are a good global optimizer, helping me find the global maximum of a mathematical function f(params). I will give you the function evaluation and the current iteration number at each step. Your goal is to propose input values that efficiently lead us to the global maximum within a limited number of iterations (400).
1. Regarding the parameters param:    % definitions of parameters
2. Here’s how we’ll interact:         % formatting instructions
3. Remember:                          % constraints to be respected

The figure above illustrates a truncated version of the prompt (the full prompt is in the paper). The system message specifies the role of the LLM as a global optimizer and indicates the total number of optimization iterations. The prompt includes three key components: (1) definitions of the parameters to be optimized, (2) formatting instructions for the LLM's output, and (3) any additional constraints the LLM must adhere to during optimization. At each iteration, the LLM receives the prompt along with a history of previous parameter suggestions and their associated rewards (i.e., in-context examples). It then proposes a new parameter vector, accompanied by a textual justification of the update. These justifications add a layer of interpretability to the search process, as they describe observed trends in the data.
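To make the interaction loop concrete, below is a minimal Python sketch of prompted policy search under these assumptions: `query_llm` stands in for the underlying model call (e.g., GPT-4o), and the prompt wording and linear-policy rollout are simplified illustrations rather than the authors' exact implementation.

```python
# Minimal sketch of the ProPS loop (illustrative simplification, not the released code).
import numpy as np
import gymnasium as gym

def evaluate(env, params):
    """Roll out a linear policy (action = argmax(W @ [obs, 1])) and return the episodic reward."""
    W = np.asarray(params).reshape(env.action_space.n, env.observation_space.shape[0] + 1)
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        action = int(np.argmax(W @ np.append(obs, 1.0)))   # linear scores with a bias term
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    return total

def build_prompt(history, n_params, iteration, budget=400):
    """Assemble the ProPS prompt: optimizer role, parameter definitions, and in-context history."""
    lines = [
        f"You are a good global optimizer maximizing f(params) with {n_params} parameters.",
        f"This is iteration {iteration} of {budget}. Previous (params, reward) pairs:",
    ]
    lines += [f"params={np.round(p, 3).tolist()}, reward={r:.2f}" for p, r in history]
    lines.append("Propose the next params as a JSON list, followed by a brief justification.")
    return "\n".join(lines)

def props(env, query_llm, n_params, iterations=400):
    """query_llm(prompt) -> (np.ndarray, str) wraps the underlying LLM and parses its reply."""
    history = []
    for t in range(iterations):
        params, justification = query_llm(build_prompt(history, n_params, t))
        history.append((params, evaluate(env, params)))
    return max(history, key=lambda pair: pair[1])           # best (params, reward) found
```

For CartPole (4-dimensional state plus a bias, 2 actions), this parameterization yields the 10 policy parameters referenced in the ProPS+ prompt below.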

ProPS+ Prompt
You are a good global RL policy optimizer, helping me find an optimal policy in the following environment:
1. Environment:                                          % definition of the environment, parameters and policy
In the CartPole environment, a pole is attached by an un-actuated joint to a cart which moves along a frictionless track. The pendulum is placed upright on the cart, and the goal is to balance the pole by applying forces in the left and right direction on the cart. The state is a vector of 4 elements, representing the cart position (-4.8 to 4.8), cart velocity (-inf to inf), pole angle (-0.418 to 0.418 rad), and pole angular velocity (-inf to inf), respectively. The goal is to keep the pole upright and the cart within the bounding position of [-2.4, 2.4]. The action space consists of 2 actions (0: push left, 1: push right). The policy is a linear policy with 10 parameters and works as follows: action = argmax(...). The reward is +1 for every time step the pole is upright and the cart is within the bounding position. The episode ends when the pole falls over or the cart goes out of bounds.
2. Regarding the parameters param:    % definitions of parameters
3. Here’s how we’ll interact:         % formatting instructions
4. Remember:                          % constraints to be respected

The figure above illustrates ProPS+, a semantically augmented variant of ProPS that extends the basic framework to incorporate rich, task-specific, contextual knowledge into the reinforcement learning process via semantically informed prompts. The example shown describes the CartPole environment, using text adapted from publicly available documentation (e.g., OpenAI Gym/Gymnasium). In this example, the prompt specifies details such as the task description, action space (binary), policy parameterization (linear), and reward structure. Additionally, it includes optional expert-provided guidance on desirable or undesirable policy behaviors, framed as constraints.
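To make this concrete, here is a hypothetical sketch of how such a semantically informed prompt could be assembled in code; the wording is condensed from the figure above, and the helper name is our own (the full prompt appears in the paper).

```python
# Hypothetical sketch of a ProPS+ prompt for CartPole (wording condensed from the figure above).
CARTPOLE_DESCRIPTION = (
    "In the CartPole environment, a pole is attached by an un-actuated joint to a cart "
    "moving along a frictionless track. The state has 4 elements: cart position, cart velocity, "
    "pole angle, and pole angular velocity. There are 2 actions (0: push left, 1: push right). "
    "The policy is linear with 10 parameters and selects action = argmax of the two action scores. "
    "The reward is +1 per time step while the pole stays upright and the cart stays within [-2.4, 2.4]."
)

def build_props_plus_prompt(history, iteration, budget=400, guidance=""):
    """ProPS+ prepends the environment semantics (and optional expert guidance) to the ProPS prompt."""
    parts = [
        "You are a good global RL policy optimizer, helping me find an optimal policy "
        "in the following environment:",
        "1. Environment: " + CARTPOLE_DESCRIPTION,
        "2. Regarding the parameters param: a flat vector of 10 real numbers.",
        "3. Here's how we'll interact: I will send (params, reward) pairs; reply with the next "
        "params as a JSON list plus a short justification.",
        f"4. Remember: you have {budget} iterations in total; this is iteration {iteration}.",
    ]
    if guidance:                                   # e.g., expert hints on desirable behaviors
        parts.append("Expert guidance: " + guidance)
    parts += [f"params={list(p)}, reward={r:.2f}" for p, r in history]
    return "\n".join(parts)
```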

Visualizing the Policy Search Process of ProPS

Below is a visualization of the policy search process of ProPS compared to PPO, where each point represents the parameters of a learned policy, projected to two dimensions with t-SNE. Hover over any point to watch the policy in action, see how the LLM explains its decision, and explore the parameter heatmap that shaped the behavior. Larger points indicate higher rewards obtained by the policy.


          
[Interactive figure: t-SNE maps of the policies explored by ProPS and PPO over training; hovering a point reveals its iteration, reward, episode rollout, LLM explanation, and policy-parameter heatmap.]
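For readers who want to produce a similar static view of their own runs, here is a minimal sketch of the projection step, assuming scikit-learn and matplotlib; this is not the code behind the interactive figure above.

```python
# Sketch: project explored policy parameter vectors to 2D with t-SNE; size/color encode reward.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_policy_search(param_history, reward_history, perplexity=30):
    """param_history: (n_policies, n_params) array; reward_history: per-policy episodic rewards."""
    X = np.asarray(param_history)
    rewards = np.asarray(reward_history, dtype=float)
    xy = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(X)
    sizes = 20 + 180 * (rewards - rewards.min()) / (np.ptp(rewards) + 1e-8)   # larger = higher reward
    plt.scatter(xy[:, 0], xy[:, 1], s=sizes, c=rewards, cmap="viridis")
    plt.colorbar(label="episodic reward")
    plt.title("t-SNE projection of policies explored during search")
    plt.show()
```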

Experiment Results

We evaluate the performance of both ProPS and ProPS+, using GPT-4o, across 15 widely-used reinforcement learning benchmarks from the OpenAI Gym and Gymnasium suites. For tasks with continuous state spaces, we employ linear policy representations, while discrete-state tasks use tabular policies. The selected environments span a diverse range of RL domains, including classic control problems (e.g., CartPole, MountainCar), games (e.g., Pong, Nim), and continuous control tasks (e.g., MuJoCo environments). Notably, in 7 out of 15 environments, ProPS outperforms all baseline algorithms. After incorporating domain knowledge, ProPS+ achieves the highest performance in 8 out of 15 tasks. The table below displays the average return and standard deviation over 10 random seeds for each method; for each task, the best and second-best baselines are listed alongside ProPS and ProPS+.

Domain | Best Baseline | 2nd Best Baseline | ProPS | ProPS+
Mount. Car (C) | SAC: 86.65 ± 0.84 | PPO: 78.16 ± 5.32 | 87.21 ± 29.28 | 89.16 ± 29.72
Inverted Pend. | TRPO: 571.31 ± 358.88 | PPO: 218.65 ± 129.31 | 1000.00 ± 0.00 | 1000.00 ± 0.00
Inv. Dbl. Pend. | TRPO: 3609.37 ± 4000.04 | PPO: 108.60 ± 4.12 | 128.17 ± 24.52 | 148.39 ± 48.65
Reacher | PPO: -7.32 ± 0.38 | TRPO: -8.93 ± 1.39 | -11.32 ± 1.37 | -18.15 ± 22.06
Swimmer | TRPO: 52.96 ± 18.86 | PPO: 39.40 ± 6.54 | 218.83 ± 58.45 | 227.30 ± 56.23
Hopper | TRPO: 716.90 ± 385.20 | PPO: 351.75 ± 157.71 | 284.16 ± 165.62 | 356.22 ± 292.35
Walker | TRPO: 519.38 ± 73.15 | PPO: 469.78 ± 159.17 | 147.17 ± 81.20 | 126.75 ± 136.44
Frozen Lake | TRPO: 0.22 ± 0.05 | PPO: 0.16 ± 0.02 | 0.57 ± 0.17 | 0.19 ± 0.05
Cliff Walking | TRPO: -66.60 ± 13.61 | PPO: -94.35 ± 3.96 | -100.00 ± 0.00 | -96.40 ± 22.90
Maze | A2C: 0.97 ± 0.00 | TRPO: 0.97 ± 0.00 | 0.55 ± 0.00 | 0.97 ± 0.00
Nim | A2C: 0.58 ± 0.10 | TRPO: 0.50 ± 0.10 | 0.33 ± 0.29 | 0.97 ± 0.09
Mount. Car (D) | DQN: -194.36 ± 1.47 | A2C: -200.00 ± 0.00 | -126.11 ± 21.67 | -116.71 ± 15.20
Navigation | TRPO: 4223.51 ± 19.70 | PPO: 4127.43 ± 24.29 | 2587.30 ± 707.55 | 2779.55 ± 270.65
Pong | PPO: 2.29 ± 0.91 | TRPO: 1.36 ± 1.05 | 2.80 ± 0.26 | 2.99 ± 0.03
Cart Pole | TRPO: 465.34 ± 62.32 | PPO: 365.86 ± 73.38 | 478.27 ± 65.17 | 500.00 ± 0.00

Mean return ± standard deviation over 10 random seeds.
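As an aside, the tabular parameterization used for the discrete-state tasks above can be sketched as follows; this is our own illustration (FrozenLake, for instance, has 16 states and 4 actions, so the table holds 64 parameters).

```python
# Sketch of a tabular policy rollout for discrete-state tasks such as FrozenLake (illustrative).
import numpy as np
import gymnasium as gym

def run_tabular_policy(params, env_id="FrozenLake-v1"):
    """params is a flat (n_states * n_actions) vector; the action is the argmax of the state's row."""
    env = gym.make(env_id)
    table = np.asarray(params).reshape(env.observation_space.n, env.action_space.n)
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(int(np.argmax(table[obs])))
        total += reward
        done = terminated or truncated
    env.close()
    return total
```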

Learning Curves: Rewards vs Iterations

Below is the learning curve of ProPS (reward vs. iteration). As above, hover over any point to watch the policy in action, see how the LLM explains its decision, and explore the parameter heatmap that shaped the behavior.


          
[Interactive figure: ProPS learning curve (reward vs. iteration); hovering a point reveals its LLM explanation and policy-parameter heatmap.]

ProPS+: Semantics Help Reasoning

Over the span of 250 iterations, the LLM consistently reaches policies that maximize the reward in the CartPole environment. At every iteration, the LLM explains and justifies its updates to the policy parameters. These justifications, as shown below, identify trends and patterns in how certain parameters, or combinations of parameters, affect the reward, providing a level of interpretability and linguistic reasoning that is not afforded by traditional numerical optimizers.

LLM's Explanation on Iteration 36

I chose these parameters to explore combinations close to those that have provided higher episodic rewards (e.g., [66.00, 66.40]), by incorporating positive values near a previously successful range for indices such as cart velocity and pole angle. I hope this furthers the exploration around these promising regions with slightly varied step size.

LLM's Explanation on Iteration 123

In the previous iterations, I have observed that increasing the combination of parameters related to the pole angle and pole angular velocity, while balancing the contributions from cart position and velocity, generally yields higher episodic rewards. I'm incrementally exploring by slightly increasing the parameters where the local policy improvements occurred, aiming to explore a potentially higher reward solution while maintaining the control balance.

LLM's Explanation on Iteration 175

To explore new potential state-action dynamics for increased episodic reward, I chose to adjust `params[2]` to a slightly lower value and increased `params[7]` for larger adjustments in weights associated with pole angle and pole angular velocity. Additionally, `params[6]` is shifted up slightly to modify the cart's position handling. This variation is within unexplored combinations that could potentially offer better stability or control, aiming to increase the reward.

LLM's Explanation on Iteration 244

Having observed that increasing params related to pole angular velocity, especially params[7], has led to higher rewards recently, I will continue exploring this parameter by increasing it to further test its impact.

Why do LLMs work? Numerical Optimization

In this section, our experiments demonstrate that large language models (LLMs), specifically Gemini-1.5-pro and GPT-4o, can effectively perform direct numerical optimization tasks, such as minimizing complex mathematical functions. Given that policy search in reinforcement learning is fundamentally a numerical optimization problem, where the goal is to maximize rewards within a parameterized policy space, we argue that LLMs can be used for direct policy search.

Specifically, how well can LLMs perform numerical optimization? To answer this, we conducted a thorough evaluation across several benchmark functions, such as Ackley and Rastrigin. The results report the mean and standard deviation of the final objective function values over 50 independent trials, each consisting of 100 optimization steps.

Gemini-1.5-pro consistently outperformed the other methods, achieving the best (lowest) average objective value in 12 out of the 20 tested optimization scenarios. This robust performance demonstrates the capability and potential of LLM-based approaches for policy optimization tasks.
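For reference, here is a sketch of two of the benchmark functions named above in their standard forms (both have a global minimum of 0 at the origin); dimensionalities and the full evaluation protocol follow the paper.

```python
# Standard definitions of two benchmark test functions used in the numerical optimization study.
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley function: highly multimodal, global minimum 0 at x = 0."""
    x = np.asarray(x, dtype=float)
    return (-a * np.exp(-b * np.sqrt(np.mean(x**2)))
            - np.exp(np.mean(np.cos(c * x))) + a + np.e)

def rastrigin(x, A=10.0):
    """Rastrigin function: regular grid of local minima, global minimum 0 at x = 0."""
    x = np.asarray(x, dtype=float)
    return A * x.size + np.sum(x**2 - A * np.cos(2.0 * np.pi * x))

print(ackley([0.0, 0.0]), rastrigin([0.0, 0.0]))   # both evaluate to ~0 at the global minimum
```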


          

          

Try the Code Yourself!

More Insights

Impact of In-Context History Size

We first examine whether the number of in-context examples (i.e., the history length N) influences policy search performance. The figure shows the results on the Mountain Car task. We observe a clear, nearly linear improvement in average reward as N increases. When N=1 (which is analogous to a conventional optimizer maintaining only a single candidate), the reward plateaus around 100. In contrast, when the full history is utilized (unbounded N), the agent reaches the maximum reward of 200. This highlights the benefit of leveraging historical parameter-reward pairs, as the LLM is able to synthesize more effective updates over time.
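In terms of the earlier ProPS sketch, varying N simply amounts to truncating the in-context history before the prompt is assembled; a minimal illustration (`build_prompt` refers to the hypothetical helper above):

```python
# Sketch: keep only the most recent N (params, reward) pairs as in-context examples.
# N = None corresponds to the unbounded-history setting described above.
def truncate_history(history, n=None):
    return list(history) if n is None else list(history)[-n:]

# e.g.: prompt = build_prompt(truncate_history(history, n=10), n_params, iteration)
```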

Run Time Comparison

Next, we evaluate the computational efficiency of our proposed methods, ProPS and ProPS+, in comparison to the baselines. To ensure a fair comparison that accounts for potential differences in CPU utilization during training, we recorded the CPU time of the traditional RL algorithms. In this setting, ProPS and ProPS+ show modest time requirements compared with the baselines.

Effect of LLM Choice

We next assess the robustness of our method across different large language models. Specifically, we evaluate GPT-4o, Gemini-2.5-Flash, Claude-3.7-Sonnet, and Qwen2.5-14B-Instruct on the Mountain Car and Swimmer tasks. As shown in the figure, all proprietary models perform strongly, demonstrating that modern LLMs are capable of supporting effective prompted policy search, albeit with differences in sample efficiency and final performance. By comparison, lightweight LLMs such as Qwen are free and resource-efficient but have more limited capabilities with regard to numerical optimization and policy search.

Fine-Tuning for Policy Search

We therefore explore whether a lightweight LLM can be explicitly fine-tuned to improve its prompted policy search capabilities. To this end, we perform GRPO fine-tuning of the Qwen2.5-14B-Instruct model using a dataset of 2000 randomly generated policy parameters for the Mountain Car Continuous task. After fine-tuning, we evaluate the model on three tasks, Mountain Car, Inverted Pendulum, and Pong, to assess generalization. The fine-tuned model outperforms its pre-trained counterpart on all three tasks, suggesting that targeted fine-tuning can enhance general policy search capabilities beyond the training task.
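The fine-tuning setup is detailed in the paper; as a loose, hypothetical illustration of how a GRPO reward signal for policy search could be wired up (assuming TRL's GRPOTrainer with plain-string completions; `rollout_return` and `policy_search_reward` are our own names, not the authors' training code), consider:

```python
# Loose sketch of a GRPO reward function that scores LLM-proposed policy parameters
# by their episodic return in MountainCarContinuous (illustrative assumptions only).
import json
import numpy as np
import gymnasium as gym

def rollout_return(params, env_id="MountainCarContinuous-v0"):
    """Episodic return of a linear policy a = clip(w @ [obs, 1]) in the given environment."""
    env = gym.make(env_id)
    w = np.asarray(params, dtype=float)                 # expected length: obs_dim + 1 = 3
    obs, _ = env.reset(seed=0)
    total, done = 0.0, False
    while not done:
        action = np.clip([w @ np.append(obs, 1.0)], env.action_space.low, env.action_space.high)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += float(reward)
        done = terminated or truncated
    env.close()
    return total

def policy_search_reward(completions, **kwargs):
    """GRPO reward: parse proposed parameters from each completion and roll them out."""
    rewards = []
    for text in completions:
        try:
            rewards.append(rollout_return(json.loads(text)))
        except Exception:
            rewards.append(-200.0)                      # penalty for unparsable proposals
    return rewards

# Hypothetical wiring with TRL (argument names may differ by version):
# trainer = trl.GRPOTrainer(model="Qwen/Qwen2.5-14B-Instruct",
#                           reward_funcs=policy_search_reward,
#                           train_dataset=prompt_dataset)
```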

BibTeX

@article{zhou2025props,
  author = {Zhou, Yifan and Grover, Sachin and El Mistiri, Mohamed and Kalirathinam, Kamalesh and Kerhalkar, Pratyush and Mishra, Swaroop and Kumar, Neelesh and Gaurav, Sanket and Aran, Oya and Ben Amor, Heni},
  title  = {Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs},
  year   = {2025},
}