Summary of Results

(1) LLMs can directly optimize policies for RL tasks.

(2) LLMs can use semantic signals for smarter, more efficient policy search.

(3) Our proposed ProPS outperforms baselines on 8 out of 15 Gymnasium tasks.


Abstract

Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce Prompted Policy Search (ProPS), a novel RL method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augments existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop—directly proposing policy updates based on both reward feedback and natural language input. We show that LLMs can perform numerical optimization in-context, and that incorporating semantic signals, such as goals, domain knowledge, and strategy hints, can lead to more informed exploration and sample-efficient learning. ProPS is evaluated across 15 Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to seven widely adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on 8 out of 15 tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the potential of unifying semantics and numerics for transparent, generalizable, and human-aligned RL.

Visualization of Policy Search

Each point in the space below represents a learned policy. Hover over any point to watch the policy in action, see how the LLM explains its decision, and explore the parameter heatmap that shaped the behavior. Larger points indicate higher rewards obtained by the policy.


          
[Interactive figure: learned policies plotted by iteration and reward for ProPS and PPO (episodes 0 to 8000); hovering over a point shows a rollout video, the LLM's explanation, and the policy-parameter heatmap.]

Overview

We present a novel reinforcement learning (RL) approach in which a large language model (LLM) directly generates policy parameters without relying on a conventional RL optimizer or any external optimization component beyond the reward signal. Traditional RL methods operate on numerical information alone (e.g., gradients of the expected reward with respect to the policy parameters) and as a result cannot incorporate important task-specific knowledge that is difficult to express in numbers, such as domain semantics or user-provided guidance. To address this limitation, we introduce Prompted Policy Search (ProPS), a new method that combines numerical reasoning with linguistic reasoning to enable more flexible and informed learning. By linguistic reasoning, we mean the ability of LLMs to understand, process, and analyze natural language in order to draw (deductive and inductive) inferences and make informed decisions. This allows us to embed valuable information, such as prior domain knowledge, goals, or user-provided policy hints, directly into the learning process using natural language. For example, traditional RL methods treat all input features as raw numbers and do not distinguish between features expressed in different units, such as meters versus kilometers. In contrast, an LLM can interpret text-based task descriptions that explain the nature and context of each feature.

Example
ProPS Prompt
You are a good global optimizer, helping me find the global maximum of a mathematical function f(params). I will give you the function evaluation and the current iteration number at each step. Your goal is to propose input values that efficiently lead us to the global maximum within a limited number of iterations (400).
1. Regarding the parameters param:   % definitions of parameters
2. Here’s how we’ll interact:        % formatting instructions
3. Remember:                         % constraints to be respected

The figure above illustrates a truncated version of the prompt (full prompt in paper). The system message specifies the role of the LLM as a global optimizer and indicates the total number of optimization iterations. The prompt includes three key components: (1) definitions of the parameters to be optimized, (2) formatting instructions for the LLM’s output, and (3) any additional constraints the LLM must adhere to during optimization. At each iteration, the LLM receives the prompt along with a history of previous parameter suggestions and their associated rewards (i.e., in-context examples). It then proposes a new parameter vector, accompanied by a textual justification of the update. These justifications add a layer of interpretability to the search process, as they describe observed trends in the data.
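
To make the loop concrete, here is a minimal Python sketch of a ProPS-style iteration under this prompt format. The helpers `query_llm` and `evaluate_policy`, the exact prompt wording, and the reply parser are illustrative assumptions rather than the paper's implementation.

```python
import re
import numpy as np

def parse_params(reply, n_params):
    """Hypothetical parser: extract the first n_params floats from the LLM reply."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", reply)
    return [float(x) for x in nums[:n_params]]

def props_loop(evaluate_policy, query_llm, n_params, n_iters=400):
    """Minimal ProPS-style loop (illustrative sketch, not the paper's code).

    evaluate_policy: maps a parameter vector -> episodic reward (the only
                     numerical feedback available to the optimizer).
    query_llm:       sends a prompt string to an LLM and returns its reply.
    """
    history = []  # (params, reward) pairs used as in-context examples

    for it in range(n_iters):
        # Assemble the prompt: role/instruction text plus the history so far.
        lines = [f"Iteration {it} of {n_iters}. Propose {n_params} parameters."]
        for params, reward in history:
            lines.append(f"params: {np.round(params, 3).tolist()} -> reward: {reward:.2f}")
        lines.append("Reply with a list of numbers and a one-sentence justification.")
        reply = query_llm("\n".join(lines))

        # Parse the proposed parameter vector and evaluate it in the environment.
        params = np.array(parse_params(reply, n_params))
        reward = evaluate_policy(params)
        history.append((params, reward))

    return max(history, key=lambda pr: pr[1])  # best (params, reward) found
```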

ProPS+ Prompt
You are a good global RL policy optimizer, helping me find an optimal policy in the following environment:
1. Environment:                                          % definition of the environment, parameters and policy
In the cartpole environment, a pole is attached by an un-actuated joint to a cart which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart. The state is a vector of 4 elements, representing the cart position (-4.8 to 4.8), cart velocity (-inf to inf), pole angle (-0.418 to 0.418 rad), and pole angular velocity (-inf to inf) respectively. The goal is to keep the pole upright and the cart within the bounding position of [-2.4, 2.4]. The action space consists of 2 actions (0: push left, 1: push right). The policy is a linear policy with 10 parameters and works as follows: action = argmax(...) The reward is +1 for every time step the pole is upright and the cart is within the bounding position. The episode ends when the pole falls over or the cart goes out of bounds.
2. Regarding the parameters param:   % definitions of parameters
3. Here’s how we’ll interact:        % formatting instructions
4. Remember:                         % constraints to be respected

The figure above illustrates ProPS+, a semantically augmented variant of ProPS that extends the basic framework to incorporate rich, task-specific contextual knowledge into the reinforcement learning process via semantically informed prompts. The example shown describes the CartPole environment, using text adapted from publicly available documentation (e.g., OpenAI Gym/Gymnasium). In this example, the prompt specifies details such as the task description, action space (binary), policy parameterization (linear), and reward structure. Additionally, it includes optional expert-provided guidance on desirable or undesirable policy behaviors, framed as constraints.
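
As an illustration of the policy parameterization mentioned in the prompt, the sketch below implements one plausible reading of the 10-parameter linear CartPole policy (a 2x4 weight matrix plus one bias per action, with the action chosen by argmax) together with the episodic-reward evaluation that supplies the numerical feedback. The truncated `argmax(...)` in the prompt is filled in here by assumption.

```python
import numpy as np
import gymnasium as gym

def linear_cartpole_policy(params, obs):
    """Linear policy with 10 parameters: a 2x4 weight matrix plus one bias per action.

    Score each of the 2 actions (push left / push right) and take the argmax.
    This is our reading of the truncated `action = argmax(...)` in the prompt.
    """
    W = np.asarray(params[:8]).reshape(2, 4)   # weights: 2 actions x 4 state features
    b = np.asarray(params[8:10])               # one bias per action
    return int(np.argmax(W @ obs + b))

def episodic_reward(params, seed=0):
    """Run one CartPole episode and return the total reward (the ProPS feedback)."""
    env = gym.make("CartPole-v1")
    obs, _ = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(linear_cartpole_policy(params, obs))
        total += reward
        done = terminated or truncated
    env.close()
    return total
```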

Learning Curves: Rewards vs Iterations

Below is the learning curve of our proposed ProPS. As before, hover over any point to watch the policy in action, see how the LLM explains its decision, and explore the parameter heatmap that shaped the behavior.


          
[Interactive figure: learning curve (reward vs. iteration) for ProPS; hovering over a point shows the rollout, the LLM's explanation, and the policy-parameter heatmap.]

Experiment Results

We evaluate the performance of both ProPS and ProPS+, using GPT-4o, across 15 widely used reinforcement learning benchmarks from the OpenAI Gym and Gymnasium suites. For tasks with continuous state spaces, we employ linear policy representations, while discrete-state tasks use tabular policies. The selected environments span a diverse range of RL domains, including classic control problems (e.g., CartPole, MountainCar), games (e.g., Pong, Nim), and continuous control tasks (e.g., MuJoCo environments). Notably, ProPS outperforms all baseline algorithms in 7 out of 15 environments, and with domain knowledge incorporated, ProPS+ achieves the highest performance in 8 out of 15 tasks. The table below reports the average return and standard deviation over 10 random seeds for each method, alongside the best and second-best baseline for each task.

Domain          | Best Baseline           | 2nd Best Baseline       | ProPS             | ProPS+
Mount. Car (C)  | SAC: 86.65 ± 0.84       | PPO: 78.16 ± 5.32       | 87.21 ± 29.28     | 89.16 ± 29.72
Inverted Pend.  | TRPO: 571.31 ± 358.88   | PPO: 218.65 ± 129.31    | 1000.00 ± 0.00    | 1000.00 ± 0.00
Inv. Dbl. Pend. | TRPO: 3609.37 ± 4000.04 | PPO: 108.60 ± 4.12      | 128.17 ± 24.52    | 148.39 ± 48.65
Reacher         | PPO: -7.32 ± 0.38       | TRPO: -8.93 ± 1.39      | -11.32 ± 1.37     | -18.15 ± 22.06
Swimmer         | TRPO: 52.96 ± 18.86     | PPO: 39.40 ± 6.54       | 218.83 ± 58.45    | 227.30 ± 56.23
Hopper          | TRPO: 716.90 ± 385.20   | PPO: 351.75 ± 157.71    | 284.16 ± 165.62   | 356.22 ± 292.35
Walker          | TRPO: 519.38 ± 73.15    | PPO: 469.78 ± 159.17    | 147.17 ± 81.20    | 126.75 ± 136.44
Frozen Lake     | TRPO: 0.22 ± 0.05       | PPO: 0.16 ± 0.02        | 0.57 ± 0.17       | 0.19 ± 0.05
Cliff Walking   | TRPO: -66.60 ± 13.61    | PPO: -94.35 ± 3.96      | -100.00 ± 0.00    | -96.40 ± 22.90
Maze            | A2C: 0.97 ± 0.00        | TRPO: 0.97 ± 0.00       | 0.55 ± 0.00       | 0.97 ± 0.00
Nim             | A2C: 0.58 ± 0.10        | TRPO: 0.50 ± 0.10       | 0.33 ± 0.29       | 0.97 ± 0.09
Mount. Car (D)  | DQN: -194.36 ± 1.47     | A2C: -200.00 ± 0.00     | -126.11 ± 21.67   | -116.71 ± 15.20
Navigation      | TRPO: 4223.51 ± 19.70   | PPO: 4127.43 ± 24.29    | 2587.30 ± 707.55  | 2779.55 ± 270.65
Pong            | PPO: 2.29 ± 0.91        | TRPO: 1.36 ± 1.05       | 2.80 ± 0.26       | 2.99 ± 0.03
Cart Pole       | TRPO: 465.34 ± 62.32    | PPO: 365.86 ± 73.38     | 478.27 ± 65.17    | 500.00 ± 0.00

Means ± standard errors

ProPS+: Semantics Help Reasoning

Over the span of 250 iterations, the LLM consistently reaches policies that maximize the reward in the CartPole environment. Alongside each update, the LLM provides a textual justification for its choice of policy parameters. These justifications, as shown below, describe trends and patterns in how certain parameters or combinations of parameters affect the reward, providing a level of interpretability and linguistic reasoning that is not afforded by traditional numerical optimizers.

LLM's Explanation on Iteration 36

I chose these parameters to explore combinations close to those that have provided higher episodic rewards (e.g., [66.00, 66.40]), by incorporating positive values near a previously successful range for indices such as cart velocity and pole angle. I hope this furthers the exploration around these promising regions with slightly varied step size.

LLM's Explanation on Iteration 123

In the previous iterations, I have observed that increasing the combination of parameters related to the pole angle and pole angular velocity, while balancing the contributions from cart position and velocity, generally yields higher episodic rewards. I'm incrementally exploring by slightly increasing the parameters where the local policy improvements occurred, aiming to explore a potentially higher reward solution while maintaining the control balance.

LLM's Explanation on Iteration 175

To explore new potential state-action dynamics for increased episodic reward, I chose to adjust `params[2]` to a slightly lower value and increased `params[7]` for larger adjustments in weights associated with pole angle and pole angular velocity. Additionally, `params[6]` is shifted up slightly to modify the cart's position handling. This variation is within unexplored combinations that could potentially offer better stability or control, aiming to increase the reward.

LLM's Explanation on Iteration 244

Having observed that increasing params related to pole angular velocity, especially params[7], has led to higher rewards recently, I will continue exploring this parameter by increasing it to further test its impact.

Why do LLMs work? Numerical Optimization

Our experiments demonstrate that large language models (LLMs), specifically Gemini-1.5-pro and GPT-4o, can effectively perform direct numerical optimization tasks, such as minimizing complex mathematical functions. Given that policy search in reinforcement learning is fundamentally a numerical optimization problem—where the goal is to maximize rewards within a parameterized policy space—we investigated how well LLMs perform compared to classical optimization algorithms, including Gradient Descent and Adam.

We conducted a thorough evaluation across several numerical optimization benchmarks, such as the Ackley and Rastrigin functions. The results presented reflect the mean and standard deviation of the final objective function values obtained over 50 independent trials, each consisting of 100 optimization steps.

Gemini-1.5-pro consistently outperformed other methods, achieving the best (lowest) average objective value in 12 out of the 20 tested optimization scenarios. This robust performance highlights the capability and potential of LLM-based approaches for policy optimization tasks.
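
For reference, the sketch below spells out the 2-D Ackley objective and the 50-trial, 100-step protocol described above, with a finite-difference gradient-descent step standing in for the classical baselines; an LLM-based optimizer would simply replace `propose_next`. The helper names, starting range, and step sizes are our assumptions, not the paper's experimental setup.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2 * np.pi):
    """Standard Ackley function; global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def gd_step(f, x, lr=0.05, eps=1e-4):
    """Finite-difference gradient-descent step (illustrative classical baseline)."""
    grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])
    return x - lr * grad

def run_trials(f, propose_next, n_trials=50, n_steps=100, dim=2, seed=0):
    """Mean and std of the final objective value over independent trials."""
    rng = np.random.default_rng(seed)
    finals = []
    for _ in range(n_trials):
        x = rng.uniform(-5.0, 5.0, size=dim)   # random starting point per trial
        for _ in range(n_steps):
            x = propose_next(f, x)
        finals.append(f(x))
    return float(np.mean(finals)), float(np.std(finals))

mean_final, std_final = run_trials(ackley, gd_step)  # classical-baseline reference run
```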


          

          
[Figures: final objective values achieved by LLM-based and classical optimizers on the numerical optimization benchmarks.]


More Insights

Impact of In-Context History Size

We first examine whether the number of in-context examples (i.e., the history length N) influences policy search performance. The figure shows results on the Mountain Car task. We observe a clear, nearly linear improvement in average reward as N increases. When N=1 (analogous to a conventional optimizer maintaining only a single candidate), the reward plateaus around 100. In contrast, when the full history is utilized (unbounded N), the agent reaches the maximum reward of 200. This highlights the benefit of leveraging historical parameter-reward pairs, as the LLM is able to synthesize more effective updates over time.
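
In implementation terms, bounding the history amounts to truncating the parameter-reward list to its most recent N entries before rendering it into the prompt, as in this illustrative helper (names follow the loop sketch above and are assumptions, not the paper's code):

```python
def prompt_history(history, n=None):
    """Keep only the last n (params, reward) pairs as in-context examples.

    n=None corresponds to the unbounded-history setting; n=1 mimics an
    optimizer that only ever sees its single latest candidate.
    """
    recent = history if n is None else history[-n:]
    return "\n".join(
        f"params: {list(map(float, p))} -> reward: {r:.2f}" for p, r in recent
    )
```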

Run Time Comparison

Next, we evaluate the computational efficiency of our proposed methods, ProPS and ProPS+, in comparison to the baselines. To ensure a fair comparison that accounts for potential differences in CPU utilization during training, we recorded the CPU time for the traditional RL algorithms. In this setting, ProPS and ProPS+ show modest time requirements compared with the baselines.

Effect of LLM Choice

We next assess the robustness of our method across different large language models. Specifically, we evaluate GPT-4o, Gemini-2.5-Flash, Claude-3.7-sonnet, and Qwen2.5-14B-Instruct on the Mountain Car and Swimmer tasks. As shown in the figure, all proprietary models perform strongly, demonstrating that modern LLMs can support effective prompted policy search, albeit with differences in sample efficiency and final performance. By comparison, lightweight LLMs such as Qwen are free and resource-efficient but have more limited capabilities with regard to numerical optimization and policy search.

Fine-Tuning for Policy Search

Motivated by this, we explore whether a lightweight LLM can be explicitly fine-tuned to improve its prompted policy search capabilities. To this end, we perform GRPO fine-tuning of the Qwen2.5-14B-Instruct model using a dataset of 2000 randomly generated policy parameters for the Mountain Car Continuous task. After fine-tuning, we evaluate the model on three tasks, Mountain Car, Inverted Pendulum, and Pong, to assess generalization. The fine-tuned model outperforms its pre-trained counterpart on all three tasks, suggesting that targeted fine-tuning can enhance general policy search capabilities beyond the training task.
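
The sketch below shows one way such a fine-tuning reward could be wired up: each completion is parsed into policy parameters and scored by a rollout in MountainCarContinuous-v0. The 3-parameter linear policy, the parsing rule, the failure penalty, and the assumption that this function would be plugged into a GRPO trainer (e.g., TRL's `GRPOTrainer`, which accepts reward functions of this general form) are all ours, not the paper's training setup.

```python
import re
import numpy as np
import gymnasium as gym

def rollout_reward(params, seed=0):
    """Episodic return of an assumed 3-parameter linear policy in MountainCarContinuous-v0."""
    env = gym.make("MountainCarContinuous-v0")
    obs, _ = env.reset(seed=seed)
    w, b = np.asarray(params[:2]), float(params[2])   # weights on (position, velocity) + bias
    total, done = 0.0, False
    while not done:
        action = np.clip([w @ obs + b], -1.0, 1.0)    # continuous force in [-1, 1]
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    env.close()
    return total

def policy_search_reward(completions, **kwargs):
    """Hypothetical GRPO reward function: parse each completion into parameters
    and score it with an environment rollout; penalize unparseable completions."""
    rewards = []
    for text in completions:                          # assumes plain-text completions
        nums = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", text)][:3]
        rewards.append(rollout_reward(nums) if len(nums) == 3 else -200.0)
    return rewards
```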

BibTeX

@article{zhou2025props,
  author = {xxx},
  title  = {Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs},
  year   = {2025},
}