RL Self-Learning Page

Domain Randomization

Domain randomization is a technique used in Reinforcement Learning (RL) and Sim-to-Real transfer to improve the generalization of an RL agent. The idea is to train the agent in a simulated environment whose physical and visual properties are randomized during training, so the policy becomes robust to variations it will meet outside the simulator.

What is Randomized?

Commonly randomized parameters include:

Object Properties: Mass, friction, size, shape
Robot Properties: Joint stiffness, damping, actuator latency, sensor noise
Environment Properties: Lighting conditions, textures, background noise
External Disturbances: Wind, force perturbations

Why is it Useful?

Helps the policy generalize better to unseen real-world conditions.
Prevents overfitting to a specific simulated environment.
Reduces the Sim-to-Real Gap by exposing the agent to a variety of simulated experiences.

For your dual-arm manipulation task, you could use domain randomization in:

Box properties: Varying weight, friction, and position.
Grasp perturbations: Adding slight noise to object grasping.
Sensor noise: Randomizing the accuracy of object pose estimation.
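
The three items above can be sketched as a per-episode sampler. This is a minimal illustration only: the names `EpisodeParams`, `sample_episode_params`, and `noisy_pose` are hypothetical, and every numeric range is an assumed placeholder rather than a tuned value for any real robot or simulator.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeParams:
    box_mass_kg: float
    box_friction: float
    box_xy_m: tuple
    pose_noise_std_m: float


def sample_episode_params(rng):
    """Draw one set of randomized parameters for a training episode.

    All ranges are illustrative assumptions, not tuned values.
    """
    return EpisodeParams(
        box_mass_kg=rng.uniform(0.5, 3.0),
        box_friction=rng.uniform(0.4, 1.2),
        box_xy_m=(rng.uniform(-0.05, 0.05), rng.uniform(-0.05, 0.05)),
        pose_noise_std_m=rng.uniform(0.0, 0.01),
    )


def noisy_pose(true_xyz, noise_std, rng):
    """Simulate imperfect pose estimation with zero-mean Gaussian noise."""
    return tuple(c + rng.gauss(0.0, noise_std) for c in true_xyz)


rng = random.Random(0)
params = sample_episode_params(rng)  # resample at every episode reset
observed = noisy_pose((0.4, 0.0, 0.1), params.pose_noise_std_m, rng)
```

In a real setup the sampled values would be written into the simulator at each reset, and the noisy pose would replace the ground-truth pose in the observation.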

Value Function and Q-Function

In Reinforcement Learning, we estimate how good a certain state or action is by using value functions.

1. Value Function (V-function)

Measures the expected cumulative reward an agent will get starting from a given state and following a policy.
Defined as:

$$ \begin{equation}V^{\pi}(s) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \pi \right]\end{equation} $$
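Here, $V^\pi(s)$ is the value of state $s$ under policy $\pi$, $r_t$ is the reward at step $t$, and $\gamma \in [0, 1)$ is the discount factor that down-weights later rewards.
Example: In your setup, $V(s)$ tells you how good a given robot + object state is if you keep following the current policy.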
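
One simple way to approximate this expectation is a Monte Carlo estimate: roll out the policy several times from the same state and average the discounted returns. The sketch below assumes finite episodes given as plain reward lists; the function names and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one finite episode."""
    g = 0.0
    # Work backwards so each step is g_t = r_t + gamma * g_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_value_estimate(episodes, gamma):
    """Average the discounted returns of episodes that start in the same state."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Two hypothetical rollouts from the same start state.
episodes = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
value = mc_value_estimate(episodes, gamma=0.5)  # (1.75 + 1.0) / 2 = 1.375
```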

2. Q-Function (Q-value function)

Measures the expected cumulative reward starting from a given state-action pair and following a policy afterward.
Defined as:

$$ \begin{equation}Q^{\pi}(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi \right]\end{equation} $$
Here, $Q^\pi(s, a)$ is the expected return if you take action $a$ in state $s$ and then follow policy $\pi$.
Example: $Q(s, a)$ in your setup will tell you how good it is to apply a certain joint velocity command given the current robot + object state.
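
To make the "how good is this velocity command" idea concrete, the sketch below scores a few candidate joint velocity commands with a Q-function and picks the best one. `toy_q` is a hypothetical stand-in for a learned Q-function, and the state layout is an assumption made only for this example.

```python
def greedy_action(q_fn, state, candidate_actions):
    """Return the candidate action with the highest Q(s, a)."""
    return max(candidate_actions, key=lambda a: q_fn(state, a))


def toy_q(state, action):
    """Hypothetical Q: prefers joint velocities close to a target stored in the state."""
    target = state["target_velocity"]
    return -sum((a - t) ** 2 for a, t in zip(action, target))


state = {"target_velocity": (0.1, -0.2)}
candidates = [(0.0, 0.0), (0.1, -0.1), (0.3, -0.2)]
best = greedy_action(toy_q, state, candidates)  # (0.1, -0.1)
```

Enumerating candidates only works for a small discrete set. With continuous joint velocities, methods such as DDPG and SAC instead train an actor network to output actions that score highly under the learned Q-function.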

Difference Between V and Q

Concept	Value Function (V-function)	Q-Function (Q-value function)
Definition	Expected return from state $s$	Expected return from state-action pair $(s, a)$
Input	State $s$	State $s$, action $a$
Output	Expected cumulative discounted reward when following $\pi$ from $s$	Expected cumulative discounted reward when taking action $a$ in state $s$, then following $\pi$
Used in	Critic or baseline in actor-critic methods (e.g., A2C, PPO)	Q-learning, DDPG, SAC (off-policy methods)

Since your action space is joint velocities, your Q-function will evaluate how good different velocity commands are in a given state.
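
The two functions are also tied together by standard identities, written here in the same notation as the definitions above. Averaging Q over the actions the policy would choose recovers V:

$$ \begin{equation}V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q^{\pi}(s, a) \right]\end{equation} $$

and taking one step with action $a$, then following $\pi$, expresses Q through V at the next state $s_1$:

$$ \begin{equation}Q^{\pi}(s, a) = \mathbb{E} \left[ r_0 + \gamma V^{\pi}(s_1) \mid s_0 = s, a_0 = a \right]\end{equation} $$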

How It Relates to Your Task

Domain Randomization: You can randomize object mass, friction, and sensor noise to make your policy robust.
Value Functions: Help estimate how good a given robot-object configuration is.
Q-Function: Helps determine which joint velocity command leads to better lifting performance.

Domain Randomization vs. Curriculum Learning

Domain Randomization and Curriculum Learning are different concepts in reinforcement learning, though they can be used together.

1. Domain Randomization

Goal: Improve the generalization of policies by training in a variety of simulated environments with randomized parameters.
How it works: The environment's properties (e.g., friction, mass, textures, lighting, sensor noise) are randomized during training.
Example: Training a robot to grasp objects by varying object shapes, weights, and surface friction so that it performs well in the real world despite these variations.
Why? Helps overcome the sim-to-real gap by making the policy robust to unseen variations.

2. Curriculum Learning

Goal: Train an agent progressively by increasing task complexity over time.
How it works: The environment starts with easy tasks and gradually increases in difficulty as the agent improves.
Example: Training a quadruped robot to walk by first letting it learn to balance, then move forward, then navigate uneven terrain.
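
A curriculum like the one above can be sketched as a small scheduler that promotes the agent to the next stage once its recent success rate passes a threshold. The class name `Curriculum`, the threshold of 0.8, and the window of 5 episodes are assumptions chosen for illustration; real setups vary in how they measure progress.

```python
class Curriculum:
    """Advance to the next stage once the recent success rate is high enough."""

    def __init__(self, stages, threshold=0.8, window=5):
        self.stages = stages
        self.threshold = threshold
        self.window = window
        self.index = 0
        self.recent = []

    @property
    def stage(self):
        return self.stages[self.index]

    def record(self, success):
        """Log one episode outcome; promote if a full window meets the threshold."""
        self.recent.append(bool(success))
        self.recent = self.recent[-self.window:]
        full = len(self.recent) == self.window
        rate = sum(self.recent) / self.window
        if full and rate >= self.threshold and self.index < len(self.stages) - 1:
            self.index += 1
            self.recent = []  # start fresh statistics for the harder stage


curriculum = Curriculum(["balance", "walk forward", "uneven terrain"])
for outcome in [True, True, False, True, True]:
    curriculum.record(outcome)
# 4 of 5 recent episodes succeeded, so the agent moves on to "walk forward".
```

The training loop would read `curriculum.stage` at each reset to configure the task, which is also a natural place to combine a curriculum with domain randomization, for example by widening the randomization ranges as stages advance.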