OpenAI has trained a neural network to play Minecraft using a small amount of labeled contractor data together with a large unlabeled dataset of videos of people playing the game.
The AI research and deployment company says that, with a small amount of fine-tuning, its model can learn to craft diamond tools, a task that typically takes proficient human players over 20 minutes (24,000 actions). The model uses the native human interface of keypresses and mouse movements, which makes it quite general and a step toward general computer-using agents.
A spokesperson for the Microsoft-backed firm said: “The internet contains an enormous amount of publicly available videos that we can learn from. You can watch a person make a gorgeous presentation, a digital artist draw a beautiful sunset, and a Minecraft player build an intricate house. However, these videos only provide a record of what happened but not precisely how it was achieved, i.e. you will not know the exact sequence of mouse movements and keys pressed.
“If we would like to build large-scale foundation models in these domains as we’ve done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where ‘action labels’ are simply the next words in a sentence.”
OpenAI introduces a novel yet simple semi-supervised imitation learning technique called Video PreTraining (VPT) to make use of the abundance of unlabeled video data available on the internet. To start, the team collects a small dataset from contractors, recording both their video and the actions they took, in this case keypresses and mouse movements. With this data, the team trains an inverse dynamics model (IDM), which predicts the action taken at each step in the video. Importantly, the IDM can use both past and future information to infer the action at each step.
The spokesperson added: “This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.”
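As a rough illustration of the pipeline described above, the following Python sketch uses a toy one-dimensional world rather than Minecraft. Everything in it is an assumption made for illustration: the integer "frames", the three actions, and the helper names train_idm, pseudo_label and behavioral_cloning are hypothetical, and simple lookup tables stand in for the neural networks OpenAI trained. It is not OpenAI's implementation.

```python
from collections import Counter, defaultdict

# Toy world (assumption): a "frame" is just the player's integer position,
# and the possible actions are "left", "right" and "stay".

def train_idm(labeled_episodes):
    """Toy inverse dynamics model: maps a change between frames to an action.

    It is non-causal: to label step t it looks at frame t AND frame t + 1.
    """
    votes = defaultdict(Counter)
    for frames, actions in labeled_episodes:
        for t, action in enumerate(actions):
            votes[frames[t + 1] - frames[t]][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}

def pseudo_label(idm, frames):
    """Use the trained IDM to infer the action behind each frame transition."""
    return [idm[frames[t + 1] - frames[t]] for t in range(len(frames) - 1)]

def behavioral_cloning(videos, idm):
    """Toy policy: maps the CURRENT frame only to the most common action."""
    votes = defaultdict(Counter)
    for frames in videos:
        for t, action in enumerate(pseudo_label(idm, frames)):
            votes[frames[t]][action] += 1
    return {frame: c.most_common(1)[0][0] for frame, c in votes.items()}

# Step 1: a small labeled "contractor" dataset of (frames, actions).
contractor_data = [([0, 1, 1, 0], ["right", "stay", "left"])]
idm = train_idm(contractor_data)

# Step 2: a larger set of unlabeled "videos" of players walking to position 3.
unlabeled_videos = [[0, 1, 2, 3, 3], [5, 4, 3, 3], [1, 2, 3]]

# Step 3: label the videos with the IDM and clone the behavior.
policy = behavioral_cloning(unlabeled_videos, idm)

print(pseudo_label(idm, [0, 1, 2, 3, 3]))
print(policy[0], policy[5], policy[3])
```

In this sketch the IDM only has to recognize what changed between two frames, while the cloned policy has to decide what to do from the current frame alone, which mirrors the spokesperson's point about why the IDM needs far less labeled data.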
VPT paves the way for agents to learn to act by watching the vast number of videos on the internet, according to OpenAI.
The spokesperson said: “Compared to generative video modeling or contrastive methods that would only yield representational priors, VPT offers the exciting possibility of directly learning large scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.”