January 6, 2015

Imagine a self-learning robot that can enrich its knowledge about fine-grained manipulation actions (such as preparing food) simply by “watching” demo videos. That’s the idea behind a new robot-training system, based on recent developments in deep neural networks for computer vision, developed by researchers at the University of Maryland and NICTA in Australia. The system aims to improve on previous automated robot-training systems such as Robobrain, discussed on KurzweilAI last August.

Robobrain, which has been downloading and processing about 1 billion images, 120,000 YouTube videos, and 100 million how-to documents and appliance manuals, is designed to deduce where and how to grasp an object from its appearance. The problem with Robobrain (and other automated robot-training systems): there’s a large variation in exactly where (which frames) the action appears in the videos, and without 3D information, it’s hard for the robot to infer which part of the image to model and how to associate a specific action such as a hand movement (cracking an egg on a dish, for example) with a specific object (“hmm, did the human just crack the egg on the dish, or did it break open magically from an arm movement, or do spheroid objects spontaneously implode?”).

The researchers took a crack, if you will, at this problem by developing a deep-learning system that uses object recognition (it’s an egg) and grasping-type recognition (it’s using a “precision large diameter” grasping type), and can also learn what tool to use and how to grasp it. Their system handles all of that with recognition modules based on a convolutional neural network (CNN), and it predicts the most likely action to take using language derived by mining a large database, as in the sketch below. The robot also needs to understand the hierarchical and recursive structure of an action. For that, the researchers used a parsing module based on a manipulation action grammar.
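To make the two-level design concrete, here is a minimal Python sketch (not the authors’ code: the classifier stubs, class labels, and majority-vote aggregation are illustrative assumptions) of how per-frame outputs from the two CNN recognition modules, one for grasp type and one for the object, might be combined into a single prediction for a clip before being handed to the parser.

```python
from collections import Counter

def classify_grasp(frame):
    # Stand-in for the grasp-type CNN; a real module would return one of a
    # small set of grasp classes, e.g. "precision large diameter".
    return "precision large diameter"

def classify_object(frame):
    # Stand-in for the object-recognition CNN.
    return "egg"

def recognize_clip(frames):
    """Aggregate noisy per-frame CNN predictions for one clip by majority vote."""
    grasp = Counter(classify_grasp(f) for f in frames).most_common(1)[0][0]
    obj = Counter(classify_object(f) for f in frames).most_common(1)[0][0]
    return grasp, obj  # handed on to the grammar-based parsing module

print(recognize_clip(range(30)))  # ('precision large diameter', 'egg')
```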
Most likely parse tree generated for the buttering-corn clip (credit: Yezhou Yang et al.)
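The parsing step can be pictured with a toy example. The sketch below uses NLTK’s Viterbi PCFG parser with a made-up, simplified grammar and probabilities (not the paper’s actual manipulation action grammar) to parse a “visual sentence” assembled from the recognition outputs into a tree in the spirit of the buttering-corn parse above.

```python
import nltk

# Toy probabilistic grammar: HP = hand phrase, AP = action phrase,
# H = hand, A = action, O = object. Rules and probabilities are
# invented for illustration only.
grammar = nltk.PCFG.fromstring("""
    HP -> H AP       [1.0]
    AP -> A O AP     [0.4]
    AP -> A O O      [0.2]
    AP -> A O        [0.4]
    H  -> 'hand'     [1.0]
    A  -> 'grasp'    [0.5]
    A  -> 'spread'   [0.5]
    O  -> 'knife'    [0.4]
    O  -> 'butter'   [0.3]
    O  -> 'corn'     [0.3]
""")

# A "visual sentence" assembled from the recognition modules' outputs.
tokens = "hand grasp knife spread butter corn".split()

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse(tokens):
    tree.pretty_print()  # prints the most likely parse tree
```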
Abstract of Robot Learning Manipulation Action Plans by “Watching” Unconstrained Videos from the World Wide Web

In order to advance action generation and creation in robots beyond simple learned schemas we need computational tools that allow us to automatically interpret and represent human actions. This paper presents a system that learns manipulation action plans by processing unconstrained videos from the World Wide Web. Its goal is to robustly generate the sequence of atomic actions of seen longer actions in video in order to acquire knowledge for robots. The lower level of the system consists of two convolutional neural network (CNN) based recognition modules, one for classifying the hand grasp type and the other for object recognition. The higher level is a probabilistic manipulation action grammar based parsing module that aims at generating visual sentences for robot manipulation. Experiments conducted on a publicly available unconstrained video dataset show that the system is able to learn manipulation actions by “watching” unconstrained videos with high accuracy.