I'm training a Stable-Baselines3 PPO agent on a custom Gymnasium environment. The task is to keep the agent's yaw error as close to zero as possible. Early in training the agent learns to keep the yaw aligned, but after a while the yaw error starts to drift (sometimes increasing, sometimes decreasing), even though the training logs (e.g., an increasing explained_variance) suggest that learning is still progressing. No matter how I adjust the reward calculation or tweak the model parameters, I can't get the agent to consistently maintain the correct yaw.
Below is a simplified version of my environment's step() and calculate_reward() functions, along with the PPO model parameters I've experimented with:
def step(self, action, train=True):
    self.current_step += 1

    # Scale the action before passing it to the agent dynamics
    done = self.agent.step(action * 25)

    # Compute reward components
    reward, reward_dict, distance_error, yaw_error = self.calculate_reward(done)

    # Reward difference: use the delta between the current and previous reward
    prev_reward = reward
    reward = reward - self.prev_reward
    self.prev_reward = prev_reward

    # Construct the state: agent kinematics, errors, and actuator readings
    state = np.concatenate([
        self.agent.nu[:3],
        np.array([self.agent.eta[-1], self.agent.ref[-1], distance_error, yaw_error]),
        self.agent.prop_f,
        self.agent.prop_r,
        self.agent.prop_m
    ])

    self.info = {
        "reward": reward,
        "reward_dict": reward_dict,
        "prop_f": np.mean(self.agent.prop_f),
        "prop_m": np.mean(self.agent.prop_m),
        "distance_error": distance_error,
        "yaw_error": yaw_error,
        "pos_x": [self.agent.eta[0], self.agent.ref[0]],
        "pos_y": [self.agent.eta[1], self.agent.ref[1]],
        "pos_z": self.agent.eta[2],
    }

    if not train:
        self._render()
        print("\nStep:", self.current_step)
        for k, v in self.info.items():
            if k != "reward_dict":
                print(f"{k:<20}: {v}")
        for k, v in self.info["reward_dict"].items():
            print(f"{k:<20}: {v}")

    terminated = done == 1  # Episode ended because the task was completed
    truncated = self.current_step >= self.config['max_steps_per_ep']  # Time limit reached
    return state, reward, terminated, truncated, self.info
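To make the reward-difference step concrete, here is that logic in isolation (a standalone sketch with made-up reward values, not output from the environment):

prev = 0.0  # plays the role of self.prev_reward after reset
for raw in [0.20, 0.35, 0.50, 0.50, 0.45]:  # hypothetical outputs of calculate_reward()
    delta = raw - prev  # this is what step() actually returns as the reward
    prev = raw
    print(f"raw={raw:.2f}  returned={delta:+.2f}")

So a step that merely holds a good yaw (raw reward unchanged, the 0.50 -> 0.50 transition) returns roughly zero, which is what my first question below is about.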
def calculate_reward(self, done):
    # Get the current yaw and build the rotation matrix used to transform velocities
    current_yaw = self.agent.eta[5]
    rotation_matrix = np.array([
        [np.cos(current_yaw), -np.sin(current_yaw), 0],
        [np.sin(current_yaw),  np.cos(current_yaw), 0],
        [0, 0, 1]
    ])

    # Convert body-fixed velocities to the global frame
    global_vel = rotation_matrix @ self.agent.nu[:3]

    vehicle_pos = np.array(self.agent.eta[:3])
    ref_pos = np.array(self.agent.ref[:3])
    direction_to_target = ref_pos - vehicle_pos
    distance = np.linalg.norm(direction_to_target)

    # 1. Directional progress (global frame)
    target_direction = direction_to_target / (distance + 1e-6)
    velocity_toward_target = np.dot(global_vel, target_direction)
    progress_reward = velocity_toward_target / 2.5  # Normalized by the target speed

    # 2. Yaw alignment (body-fixed frame)
    desired_yaw = np.arctan2(direction_to_target[1], direction_to_target[0])
    yaw_error = np.arctan2(np.sin(desired_yaw - current_yaw),
                           np.cos(desired_yaw - current_yaw))  # Wrapped to [-pi, pi]
    yaw_reward = np.exp(-2 * yaw_error**2)  # Gaussian reward based on the yaw error

    # 6. Terminal conditions (components 3-5 are dropped in this simplified version)
    terminal_reward = 0
    if done == 1:
        terminal_reward = 5 + 5.0 * (1 - self.current_step / self.config['max_steps_per_ep'])
    elif self.current_step >= self.config['max_steps_per_ep']:
        terminal_reward = -2 * distance

    # Combined reward
    reward = (
        0.1 * progress_reward +
        1.0 * (yaw_reward - 0.7) +
        terminal_reward
    )

    # Dynamic weight adjustment when close to the target
    if distance < 5:
        reward += 0.1 * yaw_reward  # Emphasize precise alignment near the target

    reward_dict = {
        'progress': progress_reward,
        'yaw': yaw_reward,
        'terminal': terminal_reward,
        'total': reward
    }
    return reward, reward_dict, distance, abs(yaw_error)
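As a sanity check on the yaw-alignment term, here it is in isolation (standalone, with example angles I made up, not values from the simulator):

import numpy as np

def yaw_terms(desired_yaw, current_yaw):
    # Wrap the error to [-pi, pi] exactly as in calculate_reward()
    yaw_error = np.arctan2(np.sin(desired_yaw - current_yaw),
                           np.cos(desired_yaw - current_yaw))
    yaw_reward = np.exp(-2 * yaw_error**2)
    return yaw_error, yaw_reward

for desired, current in [(0.0, 0.0), (0.0, 0.3), (3.0, -3.0)]:
    e, r = yaw_terms(desired, current)
    print(f"error={e:+.3f} rad  yaw_reward={r:.3f}")
    # error=+0.000 -> reward 1.000; error=-0.300 -> ~0.835;
    # (3.0, -3.0) wraps to ~-0.283 rather than 6.0, reward ~0.852

With the -0.7 offset in the combined reward, the yaw term only contributes positively while |yaw_error| stays below roughly 0.42 rad (about 24 degrees).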
For reference, the agent attributes used in the code above:

agent.nu: the body-fixed velocity vector (e.g., [forward_velocity, lateral_velocity, vertical_velocity, ...]). It describes the vehicle's motion and is transformed into the global frame using the current yaw.
agent.eta: the pose of the agent (position and orientation). agent.eta[:3] is the current position (x, y, z); agent.eta[5] (equivalently agent.eta[-1]) is the current yaw.
agent.ref: the reference (target) pose. agent.ref[:3] is the target position.
agent.prop_f, agent.prop_r, agent.prop_m: actuator/propeller signals for the front, rear, and middle propellers, respectively. They are included in the state vector so the agent knows its current actuation status; the action changes the propeller forces.
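Purely for illustration, this is how those pieces end up concatenated into the state vector; the array lengths below are dummies I picked arbitrarily (the real propeller arrays have whatever length the vehicle model defines):

import numpy as np

# Dummy stand-ins with the same layout as the attributes described above
nu = np.zeros(6)      # body-fixed velocities; nu[:3] = [u, v, w]
eta = np.zeros(6)     # pose; eta[:3] = position, eta[5] = yaw
ref = np.zeros(6)     # reference pose; ref[:3] = target position
prop_f = np.zeros(2)  # front propeller signals (length is illustrative)
prop_r = np.zeros(2)  # rear propeller signals
prop_m = np.zeros(2)  # middle propeller signals

distance_error, yaw_error = 0.0, 0.0

state = np.concatenate([
    nu[:3],
    np.array([eta[-1], ref[-1], distance_error, yaw_error]),
    prop_f,
    prop_r,
    prop_m,
])
print(state.shape)  # (3 + 4 + 2 + 2 + 2,) = (13,) with these illustrative lengths

The observation_space Box in the environment is sized to match this concatenated vector.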
Model parameters used (subject to change):
"model_params": {
"tensorboard_log": "models/tensorboard_logs/",
"policy": "MlpPolicy",
"learning_rate": 0.0003,
"policy_kwargs": {
"net_arch": {
"pi": [128, 64],
"vf": [128, 64]
},
"activation_fn": "Tanh",
"ortho_init": true,
"optimizer_kwargs": {
"eps": 1e-05
},
"log_std_init": -0.9
},
"vf_coef": 0.5,
"ent_coef": 0.0,
"max_grad_norm": 0.5,
"clip_range": 0.2,
"clip_range_vf": "",
"n_epochs": 10,
"target_kl": 0.05,
"gae_lambda": 0.95,
"verbose": 2,
"device": "cpu"
}
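In case it matters, this is roughly how that dict gets turned into the actual SB3 model. The env below is a stand-in purely so the snippet runs, the string "Tanh" is mapped to torch.nn.Tanh by my own loader, I treat the empty clip_range_vf as None, and parameters not in the dict (n_steps, batch_size, gamma) stay at SB3 defaults:

import gymnasium as gym
import torch as th
from stable_baselines3 import PPO

# Stand-in continuous-control env so the snippet runs; in my code this is the custom yaw environment
env = gym.make("Pendulum-v1")

policy_kwargs = dict(
    net_arch=dict(pi=[128, 64], vf=[128, 64]),
    activation_fn=th.nn.Tanh,       # "Tanh" string mapped to the torch class
    ortho_init=True,
    optimizer_kwargs=dict(eps=1e-5),
    log_std_init=-0.9,
)

model = PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=3e-4,
    policy_kwargs=policy_kwargs,
    vf_coef=0.5,
    ent_coef=0.0,
    max_grad_norm=0.5,
    clip_range=0.2,
    clip_range_vf=None,             # "" in the JSON means no value-function clipping
    n_epochs=10,
    target_kl=0.05,
    gae_lambda=0.95,
    tensorboard_log="models/tensorboard_logs/",
    verbose=2,
    device="cpu",
)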
Has anyone encountered similar issues where a PPO agent initially learns the task but later "forgets" it or destabilizes? Specifically:
Could the differential reward (subtracting the previous reward) be a source of instability?
How might one better balance multiple reward components (yaw alignment vs. progress vs. terminal rewards) to ensure the agent consistently maintains low yaw error?
Are there any tips for tuning PPO hyperparameters in such a mixed-reward scenario?
Any advice or pointers to resources on reward shaping in PPO would be greatly appreciated!