It turns out a leopard can change its spots.
Thanks to NVIDIA researchers’ new GPU-accelerated deep learning technique, a leopard — or at least a picture of it — can simultaneously turn into a house cat, a tiger or even a dog. It works for video, too.
The ability to turn one image or video into many could help game developers and filmmakers move faster, spend less and create richer experiences. It also could boost autonomous vehicles’ ability to handle a wider variety of road conditions by more quickly and easily generating diverse training data.
One to Many
With this discovery, the researchers one-up their earlier work on image translation, presented in December at the Conference on Neural Information Processing Systems, better known as NIPS. The method described in the NIPS paper worked on a one-to-one basis, mapping each image or video to a single counterpart.
The new technology, disclosed in a paper published today, is “multimodal” — it simultaneously converts one image to many.
Multimodal image translation is just the latest example of groundbreaking work from our 200-person-strong NVIDIA Research team. Spread across 11 locations worldwide, our researchers are pushing the boundaries of technology in machine learning, computer vision, self-driving cars, robotics, graphics, computer architecture, programming systems and other areas.
Sunshine on a Cloudy Day
Like the NIPS research, multimodal image translation relies on two deep learning techniques — unsupervised learning and generative adversarial networks (GANs) — to give machines more “imaginative” capability, such as imagining how a sunny street would look during a rainstorm or in wintertime.
Now, instead of translating a summer driving video into just one instance of winter, the researchers can create a diverse set of winter driving videos in which the amount of snow varies. The technology works the same way for different times of day and other weather conditions, providing sunshine on a cloudy day or turning darkness to dawn, afternoon or twilight.
In the world of gaming, multimodal image translation could give studios a faster and easier way to create new characters or new worlds. Artists could set aside more tedious tasks to develop richer and more complex stories.
The Multimodal Unsupervised Image-to-Image Translation framework, dubbed MUNIT, works by separating image content from style. In a picture of a cat, for example, the pose of the cat is the content and the breed is the style. The pose is fixed. If you’re converting a picture of a house cat to a leopard or a dog, the position of the animals must remain identical. What varies is the breed or species — domestic shorthair, leopard or collie, for example.
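This content/style factorization can be sketched in a few lines of Python. The snippet below is a toy illustration, not NVIDIA's actual MUNIT code: the "networks" are just random linear maps standing in for trained convolutional encoders and decoders, and all names and dimensions are made up. It shows the key idea — hold one image's content code fixed and recombine it with style codes sampled from a prior to get many output variants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for trained networks: random linear maps. In the real
# framework these would be convolutional encoders and a decoder.
IMG_DIM, CONTENT_DIM, STYLE_DIM = 64, 16, 8
W_content = rng.normal(size=(CONTENT_DIM, IMG_DIM))              # content encoder
W_style = rng.normal(size=(STYLE_DIM, IMG_DIM))                  # style encoder
W_decode = rng.normal(size=(IMG_DIM, CONTENT_DIM + STYLE_DIM))   # decoder

def encode(image):
    """Split an image into a content code (pose) and a style code (breed)."""
    return W_content @ image, W_style @ image

def decode(content, style):
    """Recombine a content code with any style code into an image."""
    return W_decode @ np.concatenate([content, style])

# Translate one house-cat "image" into several variants: keep its
# content code fixed, sample fresh style codes from a Gaussian prior.
cat = rng.normal(size=IMG_DIM)
content, _ = encode(cat)
variants = [decode(content, rng.normal(size=STYLE_DIM)) for _ in range(3)]

print(len(variants), variants[0].shape)  # 3 (64,)
```

Sampling the style code from a prior, rather than fixing it, is what makes the translation multimodal: each draw yields a different output sharing the same underlying content.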
No Data? No Problem
This research is built on a deep learning method that’s good at generating visual data. A GAN uses two competing neural networks — one to generate images and one to assess whether the generated images are real or fake. GANs are particularly useful when there’s a shortage of data.
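The adversarial game described above can be demonstrated on toy one-dimensional data. The sketch below is purely illustrative and unrelated to NVIDIA's implementation: the "generator" is a single learnable offset applied to noise, the "discriminator" is a one-feature logistic classifier, and all constants are arbitrary. It still captures the GAN dynamic — the discriminator learns to tell real samples from fakes, and the generator updates to fool it.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Real data clusters around 4.0; the generator shifts unit Gaussian
# noise by a single learnable offset b.
w, c = 0.0, 0.0   # discriminator: D(x) = sigmoid(w*x + c)
b = 0.0           # generator:     G(z) = z + b
lr_d, lr_g = 0.1, 0.05

for _ in range(1000):
    real = rng.normal(4.0, 0.5, size=64)
    fake = rng.normal(0.0, 1.0, size=64) + b

    # Discriminator: gradient ascent on log D(real) + log(1 - D(fake)).
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr_d * np.mean((1 - d_real) * real - d_fake * fake)
    c += lr_d * np.mean((1 - d_real) - d_fake)

    # Generator: gradient ascent on log D(fake) — shift b to fool D.
    d_fake = sigmoid(w * fake + c)
    b += lr_g * np.mean((1 - d_fake) * w)

# b should end near the real data mean of 4.0: the generator has
# learned to produce samples the discriminator cannot tell apart.
print(round(b, 2))
```

The same two-player setup, scaled up to convolutional networks over images, is what lets a GAN produce convincing pictures without ever being told what "realistic" means.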
Typically, image translation requires datasets of corresponding images — pictures of collies, Labrador retrievers or tigers positioned in exactly the same way as an original cat image. That sort of data is difficult, if not impossible, to find. The advantage of MUNIT is that it works without it.
MUNIT could also be handy for generating training data for autonomous cars, eliminating the need to capture the same footage from the same vantage point, with the same perspective, and with the oncoming traffic and other details in exactly the same locations.
In addition, GANs eliminate the need for people to label the content of each image or video, a task that takes extensive time and manpower.
“My goal is to enable machines to have human-like imaginative capabilities,” said Ming-Yu Liu, one of the authors of the paper. “A person can imagine what a scene looks like in wintertime, whether the trees are bare of leaves or covered with snow. I hope to develop artificial intelligence that can do this.”