Working with Continuous Actions
Let’s start with the project from our previous tutorial and add continuous actions to it. You can either follow along using the starter project or check out the complete project if you prefer.
Discrete vs. Continuous Actions
In the previous guide, we worked with discrete actions: our agent had to choose between a finite set of options (0 or 1) to match a pattern. In a real scenario, the agent might digest mountains of sensor readings and camera input and still only have to decide which button to press.
Discrete choices aren't always enough, though. For controlling things like:
- Steering angles in vehicles
- Joint torques in robotic arms
- Power levels in engines
Our agent will need to output continuous actions—precise floating-point values rather than categorical choices.
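To make the distinction concrete, here is a minimal sketch of how the two kinds of actions look as RLMatrix environment methods. The continuous attribute is exactly what we'll use below; the discrete signature follows the pattern from the previous tutorial, so treat its exact form as an assumption:
```csharp
// Discrete: the agent picks one of a fixed number of options (here 2),
// delivered as an integer index. Attribute shape assumed from the previous tutorial.
[RLMatrixActionDiscrete(2)]
public void MakeChoice(int choice)
{
    aiChoice = choice;
}

// Continuous: the agent outputs an arbitrary floating-point value.
[RLMatrixActionContinuous]
public void MakeChoiceContinuous(float input)
{
    aicontinuousChoice = input;
}
```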
Adding Continuous Actions to Our Environment
Let’s modify our environment to include both discrete and continuous actions. We’ll keep our original pattern matching task but add a second pattern where we expect the AI to output the square root of this new value.
Notice that we never explain what a square root is or how to compute one. We only encode our EXPECTATIONS in the rewards, and the agent has to figure out what we want through trial and error, guided by those reward signals alone!
First, add new fields to track the second pattern and continuous action in `PatternMatchingEnvironment.cs`:
```csharp
private int pattern = 0;
private int pattern2 = 0;
private int aiChoice = 0;
private float aicontinuousChoice = 0f;
private bool roundFinished = false;
```
Next, add a second observation method and our continuous action method:
```csharp
[RLMatrixObservation]
public float SeePattern() => pattern;

[RLMatrixObservation]
public float SeePattern2() => pattern2;

[RLMatrixActionContinuous]
public void MakeChoiceContinuous(float input)
{
    aicontinuousChoice = input;
}
```
Now, let’s create our reward functions:
```csharp
[RLMatrixReward]
public float GiveReward() => aiChoice == pattern ? 1.0f : -1.0f;

// Add +2 reward when the AI's continuous output is close to the square root
// of the second pattern
[RLMatrixReward]
public float ExtraRewards() =>
    Math.Abs(aicontinuousChoice - Math.Sqrt(pattern2)) < 0.1f ? 2f : 0.0f;
```
Finally, we need to update our `StartNewRound` method to generate both patterns:
```csharp
[RLMatrixReset]
public void StartNewRound()
{
    pattern = Random.Shared.Next(2);
    pattern2 = Random.Shared.Next(10);
    aiChoice = 0;
    roundFinished = false;
}
```
Notice we’re using a range of 0 to 9 for pattern2, so the agent has to predict square roots anywhere from 0 to 3 rather than a single fixed value, which makes the challenge more interesting.
Fixing Compilation Errors
When you try to build the solution, you’ll encounter a series of errors. This is actually helpful—RLMatrix uses strong typing to prevent runtime errors and guide you toward the correct implementation for continuous actions.
Error 1: Environment Type Mismatch
```
Argument 1: cannot convert from 'PatternMatchingExample.PatternMatchingEnvironment' to 'RLMatrix.IEnvironmentAsync<float[]>'
```
This occurs because RLMatrix has different interfaces for continuous and discrete environments to ensure type safety. Let’s update our code in `Program.cs`:
```diff
-var env = new List<IEnvironmentAsync<float[]>> {
+var env = new List<IContinuousEnvironmentAsync<float[]>> {
     environment,
     //new PatternMatchingEnvironment().RLInit() //you can add more than one to train in parallel
 };
```
Error 2: Agent Type Mismatch
After this change, we’ll get a second error:
```
Argument 2: cannot convert from 'System.Collections.Generic.List<RLMatrix.IContinuousEnvironmentAsync<float[]>>' to 'System.Collections.Generic.IEnumerable<RLMatrix.IEnvironmentAsync<float[]>>'
```
This is because we’re trying to use a discrete agent with a continuous environment. We need to change the agent type:
```diff
-var agent = new LocalDiscreteRolloutAgent<float[]>(learningSetup, env);
+var agent = new LocalContinuousRolloutAgent<float[]>(learningSetup, env);
```
Error 3: Algorithm Options Mismatch
This leads to our third error:
```
Argument 1: cannot convert from 'RLMatrix.DQNAgentOptions' to 'RLMatrix.PPOAgentOptions'
```
This final error shows that DQN is incompatible with continuous actions. We need to switch to PPO (Proximal Policy Optimization), which can handle both discrete and continuous action spaces:
```diff
-var learningSetup = new DQNAgentOptions(
-    batchSize: 32,
-    memorySize: 1000,
-    gamma: 0.99f,
-    epsStart: 1f,
-    epsEnd: 0.05f,
-    epsDecay: 150f);
+var learningSetup = new PPOAgentOptions(
+    batchSize: 128,
+    memorySize: 1000,
+    gamma: 0.99f,
+    width: 128,
+    lr: 1E-03f);
```
Our First Training Run
Now let’s run the training and see what happens:
```
Step 800/1000 - Last 50 steps accuracy: 42.0%
Press Enter to continue...
Step 850/1000 - Last 50 steps accuracy: 38.0%
Press Enter to continue...
Step 900/1000 - Last 50 steps accuracy: 40.0%
Press Enter to continue...
Step 950/1000 - Last 50 steps accuracy: 38.0%
Press Enter to continue...
Step 1000/1000 - Last 50 steps accuracy: 37.0%
Press Enter to continue...
```
Surprise! The AI is hardly learning at all. Accuracy never climbs above 50%, and if we inspect the dashboard we see the agent regularly collects the +1 reward for the discrete action (matching the pattern) but rarely earns the +2 reward for the continuous action (predicting √pattern2).
Why Is This Happening?
Ask yourself: why does the AI learn the discrete action so much more easily than the continuous one?
Your first instinct might be the learning rate (`lr`). Maybe it’s too low? Let’s try changing it to `1E-02f` and running the training again…
Did that help? Probably not. In fact, you might notice that while the agent learns the discrete action faster, it hardly explores the continuous action space at all, and the accuracy gets even worse as training progresses.
So what’s really going on?
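A big part of the answer is reward sparsity. The +2 reward only fires when the output lands within 0.1 of the target square root, and an untrained agent almost never stumbles into that narrow band by chance. A quick back-of-the-envelope check (a standalone sketch, not part of the tutorial project; the uniform-guess assumption is mine) makes this concrete:
```csharp
// Rough Monte Carlo estimate of how often a blind guess earns the +2 reward.
// Assumes early outputs are roughly uniform over [0, 3], the range of
// sqrt(pattern2) for pattern2 in 0..9.
var rng = new Random(42);
int hits = 0, trials = 100_000;
for (int t = 0; t < trials; t++)
{
    int pattern2 = rng.Next(10);                 // same distribution as the environment
    float guess = (float)(rng.NextDouble() * 3); // naive uniform guess in [0, 3]
    if (Math.Abs(guess - Math.Sqrt(pattern2)) < 0.1f) hits++;
}
Console.WriteLine($"Hit rate of the +2 reward: {100.0 * hits / trials:F1}%");
// Typically only a few percent of guesses qualify, so the continuous reward is
// almost always 0 and the agent gets very little signal about which way to move.
```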
Adding a Guiding Signal
Let’s try to remedy this by providing a more helpful reward signal. We’ll add a reward that increases as the agent gets closer to the correct square root, rather than only rewarding exact matches:
```csharp
[RLMatrixReward]
public float ExtraSupportingReward() =>
    0.5f / (1 + Math.Abs(aicontinuousChoice - (float)Math.Sqrt(pattern2)));

// Don't forget to set your lr back to 1E-03f!
```
This reward function creates a gradient—a continuous signal that gets stronger as the agent approaches the correct value. Even when it’s not exactly right, it gets feedback about whether it’s getting “warmer” or “colder.”
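To see that gradient in numbers, here is a tiny standalone sketch (not part of the environment class) evaluating the shaping formula at a few distances from the true square root:
```csharp
// The shaping term 0.5 / (1 + |error|) from ExtraSupportingReward,
// evaluated at a few example distances from the target value.
static float ShapedReward(float distance) => 0.5f / (1 + distance);

foreach (var d in new[] { 0f, 0.5f, 1f, 3f })
{
    Console.WriteLine($"distance {d}: reward {ShapedReward(d):F3}");
}
// distance 0   -> 0.500 (exactly right)
// distance 0.5 -> 0.333
// distance 1   -> 0.250
// distance 3   -> 0.125 (far off, but still a usable, non-zero signal)
```
Unlike the all-or-nothing ±0.1 check, every guess now produces a slightly different reward, which is exactly the kind of slope a gradient-based learner can follow.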
Let’s run the training again with this change and see what happens:
```
Step 850/1000 - Last 50 steps accuracy: 35.0%
Press Enter to continue...
Step 900/1000 - Last 50 steps accuracy: 40.0%
Press Enter to continue...
Step 950/1000 - Last 50 steps accuracy: 47.0%
Press Enter to continue...
Step 1000/1000 - Last 50 steps accuracy: 36.0%
Press Enter to continue...
```
We’re seeing some small improvements, but it’s still not great. The dashboard might show hints that learning is progressing, but clearly, we need more training time for this more complex task.
Extending Training Time
For more complex challenges like continuous action prediction, we often need more training steps. Let’s modify our program to train for 10,000 steps instead of 1,000:
```csharp
for (int i = 0; i < 10000; i++)
{
    await agent.Step();

    if ((i + 1) % 500 == 0)
    {
        Console.WriteLine($"Step {i + 1}/10000 - Last 500 steps accuracy: {environment.RecentAccuracy:F1}%");
        environment.ResetStats();

        Console.WriteLine("\nPress Enter to continue...");
        Console.ReadLine();
    }
}
```
Experiment: Learning Rate Impact
As you watch the longer training progress, try experimenting with different learning rates. What happens if you lower it even further? What if you raise it significantly?
In my experiments, setting a very high learning rate causes the model to get stuck collecting only the +1 rewards for the discrete action while failing to explore the continuous action space adequately.
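If you want to reproduce this, the only thing to change is the `lr` argument of the `PPOAgentOptions` shown earlier; the alternative values below are just suggestions to try:
```csharp
// Baseline from above is lr: 1E-03f. Swap in one of these and compare runs.
var learningSetup = new PPOAgentOptions(
    batchSize: 128,
    memorySize: 1000,
    gamma: 0.99f,
    width: 128,
    lr: 1E-02f   // aggressive; also try 1E-04f for a slower, steadier run
);
```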
Key Takeaways
Through this exercise, we’ve learned several important lessons:
- Continuous actions are inherently harder to learn than discrete ones, due to the sparse reward problem. When possible, discretize your action space!
- Reward engineering matters enormously for continuous control problems. Providing a signal about “getting warmer” transforms an impossible learning task into a tractable one.
- Complex tasks require more training time. As we add dimensions to our action space, we need to scale training duration accordingly.
- Algorithm selection is critical. DQN can’t handle continuous actions at all, while PPO can handle discrete, continuous, or mixed action spaces.
- Learning rate tuning is delicate, especially with PPO. Higher isn’t always better and can sometimes be worse for exploration.
These principles will serve you well as you tackle more complex reinforcement learning challenges with RLMatrix.
Next Steps
Now that you understand the challenges of continuous action spaces and how to address them, you’re ready to try a classic reinforcement learning problem with more complex observations.