The growing interest in language-conditioned robot manipulation aims to develop robots that can understand and execute complex tasks by interpreting language commands and manipulating objects accordingly. While language-conditioned approaches demonstrate impressive capabilities in familiar environments, they struggle to adapt to unfamiliar environment settings. In this study, we propose a general-purpose, language-conditioned approach that combines base skill priors and imitation learning under unstructured data to improve generalization to unfamiliar environments. We assess our model's performance in both simulated and real-world environments using a zero-shot setting. In simulation, the proposed approach surpasses previously reported scores on the CALVIN benchmark, especially in the challenging Zero-Shot Multi-Environment setting: the average completed task length, i.e., the average number of tasks the agent can complete in sequence, improves by more than a factor of 2.5 compared to the state-of-the-art method HULC. In addition, we conduct a zero-shot evaluation of our policy in a real-world setting, after training exclusively in simulated environments and without any additional adaptation. In this evaluation, we set up ten tasks and achieve an average improvement of 30\% over the current state-of-the-art approach, demonstrating strong generalization in both simulated environments and the real world.
Notation | Definition |
---|---|
\(\mathcal{S}\) | State space |
\(\mathcal{A}\) | Action space |
\(\mathcal{A}_\textrm{skill}\) | Skill space |
\(\mathcal{A}_z\) | Skill embedding space |
\(\mathcal{P}\) | Environment dynamics |
\(\mathcal{G}\) | Multi-context goal space |
\(\mathcal{I}\) | Language instruction set |
\(x\) | Action sequence \((a_0, a_1, ...)\) |
\(y\) | Base skills, \(y \in \{\textrm{translation}, \textrm{rotation}, \textrm{grasping}\}\) |
\(z\) | Skill embedding in the latent space |
\(N_h\) | Horizon of action sequence (skill) |
\(N_o\) | Number of observations |
\(N_z\) | Skill embedding dimension |
\(f_{\boldsymbol{\kappa}}\) | Base skill locator network with parameters \(\boldsymbol{\kappa}\) |
\(f_{\boldsymbol{\theta}}\) | Skill generator network with parameters \(\boldsymbol{\theta}\) |
\(f_{\boldsymbol{\phi}}\) | Encoder network for action sequences |
\(f_{\boldsymbol{\Phi}}\) | Encoder network of our SPIL model |
\(f_{\boldsymbol{\lambda}}\) | Skill embedding selector network with parameters \(\boldsymbol{\lambda}\) |
\(f_{\boldsymbol{\omega}}\) | Base skill selector network with parameters \(\boldsymbol{\omega}\) |
Our method aims to learn a goal-conditioned policy \(\pi(a|s,l)\) that outputs an action \(a \in \mathcal{A}\), conditioned on the current state \(s \in \mathcal{S}\) and a language instruction \(l \in \mathcal{I}\), under the environment dynamics \(\mathcal{P}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}\). The environment is characterized as follows:
A multidimensional action space \(\mathcal{A} \subset \mathbb{R}^7\). This action space contains all parameters required to drive the agent to complete tasks: the first three parameters are the displacement of the end-effector's position, the next three are the rotation of the end effector, and the final parameter controls the gripper (see the sketch after this list).
A visual state space \(\mathcal{S} \subset \mathbb{R}^{N_o \times H \times W \times 3}\), where \(N_o\) is the number of observations, \(H\) and \(W\) are the height and width of the images, and 3 is the number of channels, since the agent only has access to visual observations from cameras.
A multi-context goal space \(\mathcal{G} \subset \mathcal{I} \cup \mathbb{R}^{H \times W \times 3}\) consisting of language instructions and goal images, where \(\mathcal{I}\) is the natural language instruction set and \(H\), \(W\) are the height and width of the images, respectively.
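To make this action-space convention concrete, the following is a minimal sketch of how a single 7-dimensional action could be split into its components; the function and variable names are illustrative and not taken from our released code.

```python
import numpy as np

def split_action(action: np.ndarray):
    """Split a 7-dim action into its components (illustrative names, not from the released code)."""
    assert action.shape == (7,)
    delta_pos = action[:3]   # displacement of the end-effector position (x, y, z)
    delta_rot = action[3:6]  # rotation of the end effector (e.g., Euler angles)
    gripper = action[6]      # gripper control parameter (e.g., open/close command)
    return delta_pos, delta_rot, gripper

# Example: a zero motion with the gripper commanded to close.
delta_pos, delta_rot, gripper = split_action(np.array([0, 0, 0, 0, 0, 0, -1], dtype=np.float32))
```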
Mees et al. [1] introduce the CALVIN benchmark to facilitate learning language-conditioned tasks across four manipulation environments. We mainly use this benchmark to evaluate our SPIL model's performance. The CALVIN benchmark comprises three main components.
CALVIN environments. CALVIN includes four distinct environments (A, B, C, D) that share the same underlying structure. Each environment consists of a Franka Emika Panda robot arm equipped with a gripper and a desk featuring a sliding door and a drawer that can be opened and closed. The desk also has a button that toggles the green light and a switch that controls a light bulb. Note that each environment has a desk with different textures, and the positions of static elements such as the sliding door, drawer, light, switch, and button differ across environments.
CALVIN dataset. To comprehensively explore the possible scenarios within the given space, the data collectors engaged in teleoperated play using an HTC Vive VR headset for a total of 24 hours, spending roughly the same amount of time (6 hours) in each of the four environments. For language instructions, 400 natural language instructions corresponding to over 34 different tasks are used to label episodes procedurally, based on the recorded state of the environment in the CALVIN dataset.
CALVIN challenge. The authors of CALVIN introduce evaluation protocols and metrics of different difficulty levels. The protocols are
Single Environment: Training in a single environment and evaluating the policy in the same environment.
Multi Environment: Training in all four environments and evaluating the policy in one of them.
Zero-Shot Multi Environment: This involves training the agent in three different environments and then testing its ability to generalize and perform well in a fourth environment that it has not previously encountered.
and the metrics are
Multi-Task Language Control (MTLC): The most straightforward evaluation, which verifies how well the learned multi-task language-conditioned policy generalizes to 34 manipulation tasks.
Long-Horizon Multi-Task Language Control (LH-MTLC): In this evaluation, the 34 tasks from the previous evaluation are treated as subgoals, and valid sequences of five sequential tasks are computed; the reported score is the average number of tasks the agent completes consecutively, as sketched below.
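As an illustration of how the long-horizon results can be aggregated, the sketch below (not the official CALVIN evaluation code) computes the per-position success rates and the average completed task length from per-rollout counts of consecutively solved tasks.

```python
from typing import List

def lh_mtlc_summary(completed_per_chain: List[int], chain_len: int = 5):
    """completed_per_chain[i] = number of consecutive tasks solved in rollout i (0..chain_len)."""
    n = len(completed_per_chain)
    # Success rate for solving at least k tasks in a row, for k = 1..chain_len.
    success_rates = [sum(c >= k for c in completed_per_chain) / n for k in range(1, chain_len + 1)]
    # Average completed task length: mean number of consecutively solved tasks per rollout.
    avg_len = sum(completed_per_chain) / n
    return success_rates, avg_len

# Example: three evaluation chains where 5, 3, and 0 tasks were completed consecutively.
print(lh_mtlc_summary([5, 3, 0]))
```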
To showcase our model's performance relative to other skill-based reinforcement learning approaches, we adapted the CALVIN benchmark to match the assessment criteria used by those approaches. This modified benchmark focuses on a subset of tasks within the CALVIN benchmark. The baselines we choose are two skill-based reinforcement learning approaches, SpiRL [2] and SkiMo [3]. We assess these methods using a fixed task chain comprising four tasks: Open Drawer - Turn on Lightbulb - Move Slider Left - Turn on LED. This task sequence is evaluated 1000 times to determine the average success rate. The outcomes are presented in Table 2. Our SPIL approach attains a nearly perfect success rate on this task sequence, outperforming the baseline methods that train the agent with skill-based reinforcement learning. All experiments are evaluated with 3 random seeds.
Model | SpiRL | SkiMo | SPIL (ours) |
---|---|---|---|
Avg. Len. (#/4.00) | \(3.02 (0.53)\) | \(3.64 (0.21)\) | \(\textbf{3.99} (0.01)\) |
For the Single Environment setting, the skill embedding space is generated from the action sequences in the training data of environment D. For the Zero-shot Multi Environment setting, the skill embedding space is generated from the action sequences in the training data of environments A, B, and C. Important hyperparameters are listed in Tables 3 and 4.
The hyperparameters used to train the agent in the Single Environment and Zero-shot Multi Environment settings are listed in Tables 3 and 4. An image augmentation strategy is applied to the camera observations. For simulation evaluations, the static observation undergoes a random shift of 10 pixels and normalization with mean=[0.48145466, 0.4578275, 0.40821073] and std=[0.26862954, 0.26130258, 0.27577711]; the gripper observation undergoes a random shift of 4 pixels and normalization with the same mean and std as the static observation.
For real-world experiments, we apply stronger image augmentation. The static observation goes through the following transforms: a random shift of 10 pixels, color jitter with 0.2 brightness and 0.2 contrast, random rotation in the range of (-5, 5) degrees, random perspective with a distortion scale of 0.1, and finally normalization with mean=[0.48145466, 0.4578275, 0.40821073] and std=[0.26862954, 0.26130258, 0.27577711]. The gripper observation goes through center cropping, a random shift of 4 pixels, color jitter with 0.2 brightness and 0.2 contrast, and finally the same normalization as the static observation.
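For illustration, the static-camera pipeline for the real-world experiments could be assembled from standard torchvision transforms roughly as follows. This is a minimal sketch that assumes 200x200 inputs and approximates the random pixel shift by padding followed by a random crop; it is not the exact implementation used in our experiments.

```python
import torch
from torchvision import transforms

IMG_SIZE = 200  # assumed static-camera resolution; adjust to the actual input size

# Random shift of 10 pixels approximated by edge padding followed by a random crop back to IMG_SIZE.
static_transform = transforms.Compose([
    transforms.RandomCrop(IMG_SIZE, padding=10, padding_mode="edge"),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(degrees=(-5, 5)),
    transforms.RandomPerspective(distortion_scale=0.1, p=0.5),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])

# Example: augment a float image tensor of shape (C, H, W) with values in [0, 1].
augmented = static_transform(torch.rand(3, IMG_SIZE, IMG_SIZE))
```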
We have also implemented an augmentation strategy for action sequences during skill embedding training, where we randomly set the last three relative actions of a sequence to zero, indicating still (no-motion) actions.
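A minimal sketch of this augmentation, assuming relative action sequences of shape \((N_h, 7)\) and an illustrative application probability (the exact probability is not specified here):

```python
import numpy as np

def augment_action_sequence(actions: np.ndarray, p: float = 0.5, rng=np.random) -> np.ndarray:
    """With probability p, set the last three relative actions of the sequence to zero (still actions)."""
    actions = actions.copy()
    if rng.random() < p:  # p is an assumed value for illustration
        actions[-3:] = 0.0
    return actions

# Example: a skill of horizon N_h = 5 with 7-dim relative actions.
seq = np.random.randn(5, 7).astype(np.float32)
aug = augment_action_sequence(seq)
```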
Table 3: Hyperparameters for the Single Environment (D \(\rightarrow\) D) setting.
Description | Value |
---|---|
Batch Size | 64 |
Learning Rate | \(1.0 \times 10^{-4}\) |
Skill Embedding Dimension | 20 |
Horizon Length \(H\) of Skill | 5 |
Magic Scales \(w_1, w_2, w_3\) | 1.4, 3.0, 0.75 |
Plan prior matching weight \(\beta\) | \(5.0 \times 10^{-4}\) |
Regularizer weight \(\beta_1\) | \(1.0 \times 10^{-4}\) |
Regularizer weight \(\beta_2\) | \(1.0 \times 10^{-5}\) |
Regularizer weight \(\gamma_1\) | \(5.0 \times 10^{-3}\) |
Regularizer weight \(\gamma_2\) | \(1.0 \times 10^{-5}\) |
Table 4: Hyperparameters for the Zero-shot Multi Environment (ABC \(\rightarrow\) D) setting.
Description | Value |
---|---|
Batch Size | 32 |
Learning Rate | \(1.0 \times 10^{-4}\) |
Skill Embedding Dimension | 20 |
Horizon Length \(H\) of Skill | 5 |
Magic Scales \(w_1, w_2, w_3\) | 1.4, 3.0, 0.75 |
Plan prior matching weight \(\beta\) | \(1.0 \times 10^{-4}\) |
Regularizer weight \(\beta_1\) | \(1.0 \times 10^{-4}\) |
Regularizer weight \(\beta_2\) | \(1.0 \times 10^{-5}\) |
Regularizer weight \(\gamma_1\) | \(5.0 \times 10^{-3}\) |
Regularizer weight \(\gamma_2\) | \(1.0 \times 10^{-5}\) |
Table 5: Training time for each model.
Environment | LangLfP | HULC | SPIL (ours) |
---|---|---|---|
D \(\rightarrow\) D | 30 | 42 | 43 |
ABC \(\rightarrow\) D | 102 | 122 | 125 |
Hardware and Software: All experiments were performed on a virtual machine with 40 virtual processing units, 356 GB of RAM, and two Tesla V100 (16 GB) GPUs, running Ubuntu 20.04 LTS (Focal). Table 5 shows the training time for each model; training was done with 40 epochs for the Single Environment setting (D \(\rightarrow\) D) and 30 epochs for the Zero-shot Multi Environment setting (ABC \(\rightarrow\) D).
We observe that predicting skill embeddings instead of actions leads to faster model convergence. This may be because the action sequences are compressed into skill embeddings, reducing dimensional complexity and making it easier for the agent to learn.
For both SPIL and HULC, we observed that the agent often moves toward the target object but misjudges the distance between the gripper and the object, preventing it from making contact. This issue arises from differences between the simulator observations and the real world, particularly in camera angles and lighting conditions. Incorporating base skills reduces the noise in actions when manipulating the target object. Our SPIL model moves more smoothly than the HULC model, which exhibits noticeable oscillations; as a result, our SPIL model achieves a higher success rate than HULC in the real-world experiment.
We found that estimating distance using gripper and static cameras remains challenging for the agent. Even slight disturbances in camera angle can significantly degrade the agent's performance. For this reason, we include stronger data augmentations like random perspective transforms and color jitter in the training process. We also hypothesize that incorporating depth information could improve the model's performance in the zero-shot real-world experiments.
We define \(y\) as the indicator for base skills, so that the base skill distribution in the latent space can be written as \(z \sim p(z|y)\). For a given action sequence \(x\), we employ the approximate variational posteriors \(q(z|x)\) and \(q(y,z|x)\) to estimate the intractable true posterior. Following the standard VAE procedure, we measure the Kullback-Leibler (KL) divergence between the true posterior and its approximation to determine the evidence lower bound (ELBO):
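Concretely, since the KL divergence is non-negative, the log-likelihood can be bounded as follows (a standard sketch of the derivation; the exact form in the original may be arranged differently):

\[
\log p(x) = \mathrm{KL}\big(q(y,z|x)\,\|\,p(y,z|x)\big) + \underbrace{\mathbb{E}_{q(y,z|x)}\big[\log p(x|y,z)\big] - \mathrm{KL}\big(q(y,z|x)\,\|\,p(y,z)\big)}_{\textrm{ELBO}} \;\geq\; \textrm{ELBO}.
\]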
The objective of our model is to learn a policy \(\pi(x|s_c,s_g)\) conditioned on the current state \(s_c\) and the goal state \(s_g\), outputting \(x\), a sequence of actions, namely a skill. Since we introduce the base skill concept into our model, the policy \(\pi(\cdot)\) should also find the best base skill \(y\) for the current observation. We therefore have \(\pi(x,y|s_c,s_g)\), where \(y\) is the base skill the agent chooses based on the current state and goal state.
Our formulation is inspired by the conditional variational autoencoder (CVAE).
We focus on the ELBO term:
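Written out in conditional form, with the factorizations \(q(y,z|x,c) = q(y|x,c)\,q(z|x,y,c)\) and \(p(y,z|c) = p(y)\,p(z|y,c)\) (a reconstruction in standard CVAE form; the exact arrangement in the original derivation may differ), the ELBO reads:

\[
\textrm{ELBO} = \mathbb{E}_{q(y,z|x,c)}\big[\log p(x|y,z,c)\big] - \mathrm{KL}\big(q(y,z|x,c)\,\|\,p(y,z|c)\big),
\]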
where the variables \(x, y, z, c\) and the distributions above are defined as follows:
\(x\): action sequence (skill) the agent chooses
\(y\): base skill, \(y \in \{\textrm{translation}, \textrm{rotation}, \textrm{grasping}\}\)
\(z\): skill embeddings in the latent skill space
\(c\): a combination of the current state and the goal state \((s_c,s_g)\)
\(q(y|x,c)\): This corresponds to the base skill labeler in our SPIL model. We define this network with parameters \(\boldsymbol{\omega}\) and simplify the term to \(q_{\boldsymbol{\omega}}(y|c)\) by dropping the dependence on \(x\).
\(q(z|x,y,c)\): This refers to the encoder network \(f_{\boldsymbol{\Phi}}\) combined with the skill embedding selector network \(f_{\boldsymbol{\lambda}}\), which take \(c\) as input in our setting. It can be written as \(q_{\boldsymbol{\Phi}, \boldsymbol{\lambda}}(z|c)\), dropping the dependence on \(x\) and \(y\).
\(p(x|y,z,c)\): It is the skill generator network \(f_{\boldsymbol{\theta}}\) with parameters \(\boldsymbol{\theta}\). This network only takes \(z,c\) as input in our setting and we consider \(x\) and \(y\) to be conditionally independent given \(z,c\). It can then be formalized as \(p(x|z,c)\). Note that the parameters of this network are pretrained and frozen during the training process.
\(p(z|y,c)\): It is the base skill prior locator \(f_{\boldsymbol{\kappa}}\) with parameters \(\boldsymbol{\kappa}\). We assume \(z\) and \(c\) are conditionally independent given \(y\), so that we have \(p_{\boldsymbol{\kappa}}(z|y)\). Note that the parameters of this network are frozen during training.
\(p(y)\): It is the prior distribution for base skills \(y\) which is drawn from a categorical distribution.
Based on the above analysis, the whole equation can be simplified as follows:
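Substituting these parameterizations and conditional-independence assumptions gives, as a sketch (the exact form in the paper may include additional weighting factors such as the regularizer weights in Tables 3 and 4):

\[
\mathcal{L} = \mathbb{E}_{q_{\boldsymbol{\Phi},\boldsymbol{\lambda}}(z|c)}\big[\log p_{\boldsymbol{\theta}}(x|z,c)\big]
- \mathbb{E}_{q_{\boldsymbol{\omega}}(y|c)}\Big[\mathrm{KL}\big(q_{\boldsymbol{\Phi},\boldsymbol{\lambda}}(z|c)\,\|\,p_{\boldsymbol{\kappa}}(z|y)\big)\Big]
- \mathrm{KL}\big(q_{\boldsymbol{\omega}}(y|c)\,\|\,p(y)\big).
\]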
[1] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,” IEEE Robotics and Automation Letters (RA-L), 2022.
[2] K. Pertsch, Y. Lee, and J. J. Lim, “Accelerating reinforcement learning with learned skill priors,” in Conference on Robot Learning (CoRL), 2020.
[3] L. X. Shi, J. J. Lim, and Y. Lee, “Skill-based model-based reinforcement learning,” in Conference on Robot Learning (CoRL), 2022.
@ARTICLE{zhou2024languageconditioned,
author={Zhou, Hongkuan and Bing, Zhenshan and Yao, Xiangtong and Su, Xiaojie and Yang, Chenguang and Huang, Kai and Knoll, Alois},
journal={IEEE Robotics and Automation Letters},
title={Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data},
year={2024},
volume={},
number={},
pages={1-8},
keywords={Imitation Learning;Robotic Manipulation},
doi={10.1109/LRA.2024.3466076}
}