For humans, working with deformable objects is not significantly more difficult than handling rigid objects. We learn naturally to shape them, fold them, and manipulate them in different ways and still recognize them. But for robots and artificial intelligence systems, manipulating deformable objects present a huge challenge. Consider the series of steps that a robot must take to shape a ball of dough into pizza crusts. It must keep track of the dough as it changes shape, and at the same time, it must choose the right tool for each step of the work. These are challenging tasks for current AI systems, which are more stable in handling rigid-body objects, which have more predictable states. Now, a new deep learning technique developed by researchers at MIT, Carnegie Mellon University, and the University of California at San Diego, shows promise to make robotics systems more stable in handling deformable objects. Called DiffSkill, the technique uses deep neural networks to learn simple skills and a planning module for combining the skills to solve tasks that require multiple steps and tools.
Handling deformable objects with reinforcement learning and deep learning
If an AI system wants to handle an object, it has to be able to detect and define its state and predict how it will look in the future. This is a problem that has been largely solved for rigid objects. With a good set of training examples, a deep neural network will be able to detect a rigid object from different angles. However, when it comes to deformable objects, the space of possible states becomes much more complicated. “For rigid objects, we can describe its state with six numbers: Three numbers for its XYZ coordinates and another three numbers for its orientation,” Xingyu Lin, Ph.D. student at CMU and lead author of the DiffSkill paper, told TechTalks. “However, deformable bodies, such as the dough or fabrics, have infinite degrees of freedom, making it much more difficult to describe their states precisely. Furthermore, the ways they deform are also harder to model in a mathematical way compared to rigid bodies.” The development of differentiable physics simulators enabled the application of gradient-based methods to solve deformable object manipulation tasks. This is in contrast to the traditional reinforcement learning approach that tries to learn the dynamics of the environment and objects through pure trial-and-error interactions. DiffSkill was inspired by PlasticineLab, a differentiable physics simulator that was presented at the ICLR conference in 2021. PlasticineLab showed that differentiable simulators can help short-horizon tasks. But differentiable simulators still struggle with long-horizon problems that require multiple steps and the use of different tools. AI systems based on differentiable simulators also require the agent to know the full simulation state and relevant physical parameters of the environment. This is especially limiting for real-world applications, where the agent usually perceives the world through visual and depth sensory data (RGB-D). “We started to ask if we can extract [the steps required to accomplish a task] as skills and also learn abstract notions about the skills so that we can chain them to solve more complex tasks,” Lin said. DiffSkill is a framework where the AI agent learns skill abstraction using the differentiable physics model and composes them to accomplish complicated manipulation tasks. Lin’s past work was focused on using reinforcement learning for the manipulation of deformable objects such as cloth, ropes, and liquids. For DiffSkill, he chose dough manipulation because of the challenges it poses. “Dough manipulation is particularly interesting because it cannot be easily performed with the robot gripper, but requires using different tools sequentially, something humans are good at but is not very common for robots to do,” Lin said. Once trained, DiffSkill can successfully accomplish a set of dough manipulation tasks using only RGB-D input.
Learning abstract skills with neural networks
DiffSkill is composed of two key components: a “neural skill abstractor” that uses neural networks to learn individual skills and a “planner” that composes the skill to solve long-horizon tasks. DiffSkill uses a differentiable physics simulator to generate training examples for the skill abstractor. These samples show how to achieve a short-horizon goal with a single tool, such as using a roller to spread the dough or a spatula to displace the dough. These examples are presented to the skill abstractor as RGB-D videos. Given an image observation, the skill abstractor must predict whether the desired goal is feasible or not. The model learns and tunes its parameters by comparing its prediction with the actual outcome of the physics simulator.
— Xingyu Lin (@Xingyu2017) April 27, 2022 At the same time, DiffSkill trains a variational autoencoder (VAE) to learn a latent-space representation of the examples generated by the physics simulator. The VAE encodes images in a lower-dimension space that preserves important features and discards information that is not relevant to the task. By transferring the high-dimensional image space into the latent space, the VAE plays an important role in enabling DiffSkill to plan over long horizons and predict outcomes by observing sensory data. One of the important challenges of training the VAE is making sure it learns the right features and generalizes to the real world, where the composition of visual data is different from those generated by the physics simulator. For example, the color of the roller pin or the table is not relevant to the task, but the position and angle of the roller and the location of the dough are. Currently, the researchers are using a technique called “domain randomization,” which randomizes the irrelevant properties of the training environment such as background and lighting, and keeps the important features such as the position and orientation of tools. This makes the VAE more stable when applied to the real world. “Doing this is not easy, as we need to cover all possible variations that are different between the simulation and the real world [known as the sim2real gap],” Lin said. “A better way is to use a 3D point cloud as representation of the scene, which is much easier to transfer from simulation to the real world. In fact, we are working on a follow-up project using point cloud as input.”
Planning long-horizon deformable object tasks
Once the skill abstractor is trained, DiffSkill uses the planner module to solve long-horizon tasks. The planner must determine the number and sequence of skills needed to go from the initial state to the destination. This planner iterates over possible combinations of skills and the intermediate outcomes they yield. The variational autoencoder comes in handy here. Instead of predicting full image outcomes, DiffSkill uses the VAE to predict the latent-space outcome of intermediate steps toward the final goal. The combination of abstract skills and latent-space representations makes it much more computationally efficient to draw a trajectory from the initial state to the goal. In fact, the researchers didn’t need to optimize the search function and used an exhaustive search of all combinations. “The computation is not too much since we are planning over the skills and the horizon is not very long,” Lin said. “This exhaustive search eliminates the need for designing a sketch for the planner and might lead to novel solutions not considered by the designer in a more general way, although we did not observe this in the limited tasks we tried. Furthermore, more sophisticated search techniques could be applied as well” According to the DiffSkill paper, “optimization can be done efficiently in around 10 seconds for each skill combination on a single NVIDIA 2080Ti GPU.”
Preparing the pizza dough with DiffSkill
The researchers tested the performance of DiffSkill against several baseline methods that have been applied to deformable objects, including two model-free reinforcement learning algorithms and a trajectory optimizer that only uses the physics simulator. The models were tested on several tasks that require multiple steps and tools. For example, in one of the tasks, the AI agent must lift the dough with a spatula, place it on a cutting board, and spread it with a roller. The results show that DiffSkill is significantly better than other techniques at solving long-horizon, multiple-tool tasks using only sensory information. The experiments show that when well trained, DiffSkill’s planner can find good intermediate states between the initial and goal states and find decent sequences of skills to solve tasks. “One takeaway is that a set of skills can provide very important temporal abstraction, allowing us to reason over long-horizon,” Lin said. “This is also similar to how human approaches different tasks: thinking at different temporal abstractions instead of thinking what to do at every next second.” However, there are also limits to DiffSkill’s capacity. For example, when performing one of the tasks that required three-stage planning, DiffSkill’s performance degrades significantly (though it is still better than other techniques). Lin also mentioned that in some cases, the feasibility predictor produces false positives. The researchers believe that learning a better latent space can help solve this problem. The researchers are also exploring other directions to improve DiffSkill, including a more efficient planner algorithm that can be used for longer horizon tasks. Lin hopes that one day, he can use DiffSkill on real pizza-making robots. “We are still far from this. Various challenges emerge from control, sim2real transfer, and safety. But we are now more confident at trying some long-horizon tasks,” he said. This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech, and what we need to look out for. You can read the original article here.