Luxo Jr.

Pixar - Luxo Jr.

As a child, I was captivated by the earliest animated shorts produced by Pixar. The combination of groundbreaking computer graphics and storytelling brought the characters to life. I can still vividly remember watching Luxo Jr. bounce across the screen, chasing, playing with and eventually puncturing a small blue and yellow ball.

Back then, animators laboured within significant constraints to deliver a believable and captivating scene with simulated light, objects with physical properties and of course motion. These early animations paved the way for the feature-length movies we enjoy today. Early pioneers in this space, most notably Ed Catmull, co-founder of Pixar, achieved this with hardware that took 1.5 hours to render a single frame. On modern hardware, this would be possible in real-time.  

While the early development of GPUs was driven predominantly by the need to support increasingly realistic computer graphics for games and animation, today Artificial Intelligence and Machine Learning are having a significant influence on their design. GPUs are the foundation for both training Machine Learning models and making predictions with them.

In this article, I provide a summary of cutting-edge developments in the application of Artificial Intelligence to computer graphics and explore how these advances may influence the way we visually communicate in the near future.

Image Upscaling

NVIDIA and AMD both upscale images through techniques called Deep Learning Super Sampling (DLSS) and FidelityFX Super Resolution (FSR) respectively, with DLSS relying on trained neural networks. This allows the rendering pipeline to run at a lower resolution while the GPU efficiently reconstructs a higher-fidelity image. NVIDIA's latest release of DLSS is said to be able to artificially generate 85% of all the pixels on the screen. Even with modern, powerful GPUs, rendering an image is an expensive process, so being able to exchange rendering for a highly plausible generated image is a significant advantage.

Image Upscaling

Super Sampling - NVIDIA
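Neither NVIDIA nor AMD publish the internals of their production upscalers, but the core idea can be sketched in a few lines. Below is a minimal, illustrative learned upscaler in PyTorch using sub-pixel convolution; the layer sizes, scale factor and resolutions are assumptions for the sake of the example, not DLSS or FSR internals.

```python
# Minimal sketch of a learned 2x upscaler using sub-pixel convolution.
# Layer widths, scale factor and resolutions are illustrative assumptions,
# not the internals of DLSS or FSR.
import torch
import torch.nn as nn

class TinyUpscaler(nn.Module):
    def __init__(self, scale: int = 2, channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            # Predict scale^2 sub-pixel values for every output channel.
            nn.Conv2d(32, channels * scale * scale, kernel_size=3, padding=1),
        )
        # Rearranges the extra channels into a (H*scale, W*scale) image.
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, low_res: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.features(low_res))

# Render at 960x540, reconstruct at 1920x1080.
frame = torch.rand(1, 3, 540, 960)
upscaled = TinyUpscaler(scale=2)(frame)
print(upscaled.shape)  # torch.Size([1, 3, 1080, 1920])
```

Production upscalers such as DLSS also feed in motion vectors and previous frames so the network can reuse detail over time, but the principle is the same: render fewer pixels and let a trained network reconstruct the rest.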

Upscaling images is also helping breathe new life into classic computer games when played on modern, high-resolution monitors. 

Another area being positively impacted is photography. Canon's first DSLR cameras in the mid-1990s delivered a little over one megapixel of resolution, which limited how large a physical print could be before it appeared grainy and pixelated. Upscaling images using Real-ESRGAN allows us to stretch print sizes further and to rescue old photographs.
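To put the numbers into perspective, here is the back-of-the-envelope arithmetic, assuming a roughly 1.3-megapixel image and a common 300 DPI print-quality target (both figures are illustrative):

```python
# Rough print-size arithmetic. The sensor resolution (~1.3 MP) and the
# 300 DPI print-quality target are illustrative assumptions.
width_px, height_px = 1268, 1012   # roughly the resolution of a mid-90s DSLR
dpi = 300

print(f'Native print: {width_px / dpi:.1f}" x {height_px / dpi:.1f}"')  # ~4.2" x 3.4"

upscale = 4                        # e.g. a 4x Real-ESRGAN pass
print(f'Upscaled print: {width_px * upscale / dpi:.1f}" x {height_px * upscale / dpi:.1f}"')
```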

Optical Flow Acceleration

In addition to generating missing pixels during the upscaling process, GPUs are also capable of generating intermediate frames in video or animation sequences through a process called optical flow acceleration. For example, consider two frames of a person riding a motorcycle. Instead of using a combination of rendering and upscaling to create the intermediate frame, the frame is interpolated based on how points within the scene are moving. This effectively allows computer-generated imagery to run at a higher frame rate, which makes it smoother to watch.

DLSS Motion Flow Acceleration

Optical Flow Acceleration - NVIDIA
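To make the idea concrete, here is a rough sketch of frame interpolation using classical dense optical flow in OpenCV. It stands in for the dedicated optical-flow hardware described above, the half-way warp is a crude approximation, and the file names are placeholders.

```python
# Sketch of frame interpolation using classical dense optical flow (OpenCV).
# The half-way warp is a crude approximation of what dedicated optical-flow
# hardware and trained networks do; file names are placeholders.
import cv2
import numpy as np

frame_a = cv2.imread("frame_0.png")
frame_b = cv2.imread("frame_1.png")
gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

# Dense per-pixel motion vectors describing how points flow from A to B.
flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Warp frame A half-way along the flow field to approximate the in-between frame.
h, w = gray_a.shape
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x + 0.5 * flow[..., 0]).astype(np.float32)
map_y = (grid_y + 0.5 * flow[..., 1]).astype(np.float32)
midpoint = cv2.remap(frame_a, map_x, map_y, cv2.INTER_LINEAR)
cv2.imwrite("frame_0_5.png", midpoint)
```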

If we combine the effects of super sampling and frame interpolation, GPUs can help us significantly compress images and video. If the content is being streamed across a network, this translates into a significant improvement in quality (resolution and frames per second) for the same bandwidth.
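A quick back-of-the-envelope calculation shows why the combination is so powerful. If half of the spatial resolution and half of the frames are generated rather than rendered or transmitted (illustrative assumptions, not measurements of any particular product):

```python
# Back-of-the-envelope: pixels that must actually be rendered (or transmitted)
# per second if half the resolution and half the frames are generated by AI.
# The numbers are illustrative assumptions, not measurements of any product.
full = 1920 * 1080 * 60                    # target: 1080p at 60 fps
reduced = (1920 // 2) * (1080 // 2) * 30   # source: 540p at 30 fps
print(f"Pixel workload reduced by a factor of {full / reduced:.0f}")  # 8
```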

AI Simulated Materials and Textures

In games and virtual worlds, objects are increasingly being created with realistic material textures. These textures are applied to a model that has been skilfully unwrapped along carefully chosen seams. The unwrapped model is laid flat over a high-resolution texture map, and the regions of the map it covers become the model's texture, sampled at different levels of detail depending on viewing distance.

If you've ever zoomed in on a digital photograph, you'll have noticed that the image looks pixelated. The same thing happens with a 3D model that has a low-resolution image texture. One solution is to use higher-resolution texture maps so we can zoom in further before the texture breaks down. The problem with this approach is that texture maps grow in size very quickly: if we double the resolution of the map along each axis, the space required to store it increases by a factor of four (see the sketch after the figure below). AI models trained on image textures now provide a solution to this problem, generating hyper-realistic detail on demand.

UV Unwrapping

UV Unwrapping - Blender Community
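The arithmetic behind the storage problem, and the basic idea of a UV lookup, can be sketched in a few lines; the texture sizes and RGBA format are illustrative assumptions.

```python
# Why texture maps get expensive: doubling the resolution along each axis
# quadruples the storage. Sizes assume uncompressed RGBA at 4 bytes per texel.
import numpy as np

def texture_megabytes(side: int, bytes_per_texel: int = 4) -> float:
    return side * side * bytes_per_texel / (1024 ** 2)

for side in (2048, 4096, 8192):
    print(f"{side} x {side}: {texture_megabytes(side):.0f} MB")
# 2048 x 2048: 16 MB, 4096 x 4096: 64 MB, 8192 x 8192: 256 MB

# A UV lookup is simply indexing into that map: u, v in [0, 1] select a texel.
texture = np.zeros((2048, 2048, 4), dtype=np.uint8)
u, v = 0.25, 0.75
texel = texture[int(v * (texture.shape[0] - 1)), int(u * (texture.shape[1] - 1))]
```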

There have also been recent advances in simulating materials. Unlike textures, which represent a 2D image mapped onto the surface of a 3D object, materials attempt to simulate the physical properties of 3D objects: how diffuse, specular, transparent or reflective the material is. Accurately simulating materials, especially caustics, can require a significant number of rendering cycles. However, a recent paper from the research team at NVIDIA demonstrated that real material data could be trained and compressed into a tiny two-layer neural network for each material. The results are staggeringly realistic and even have the benefit of containing less noise than the equivalent rendered image.

Realtime Neural Appearance Models

Real-time Neural Appearance Models
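The paper's exact architecture, encodings and training data are NVIDIA's own, but the shape of the idea, a tiny per-material network that maps shading inputs to a colour response, can be sketched as follows. The input features and layer widths here are assumptions for illustration only.

```python
# Illustrative sketch of a tiny per-material neural appearance model: a small
# MLP mapping shading inputs to an RGB response. The input features and layer
# widths are assumptions for illustration, not the architecture from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMaterial(nn.Module):
    def __init__(self, in_features: int = 8, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB response in [0, 1]
        )

    def forward(self, uv, view_dir, light_dir):
        # uv: (N, 2); view_dir, light_dir: unit vectors (N, 3) -> features (N, 8)
        x = torch.cat([uv, view_dir, light_dir], dim=-1)
        return self.net(x)

mat = TinyMaterial()
rgb = mat(torch.rand(1024, 2),
          F.normalize(torch.randn(1024, 3), dim=-1),
          F.normalize(torch.randn(1024, 3), dim=-1))
print(rgb.shape)  # torch.Size([1024, 3])
```

The appeal is that a network this small is cheap to evaluate inside a renderer and far more compact than the measured material data it was trained on.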

Inpainting and Outpainting of Images and Video

Inpainting is the process of removing sections of an image or video and replacing them with something new yet plausible within the scene; removing strangers, dustbins or other unwanted artefacts from a photograph is a typical example. Adobe and others have offered AI-based inpainting since early this year and appear to have integrated it smoothly into Photoshop's workflow. Inpainting in video is a more complex problem, but it has already been introduced into products such as Runway.

Image Inpainting

Image Inpainting - Removing Unwanted Artefacts From Images
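For a sense of how accessible this has become, here is a sketch of diffusion-based inpainting using the open-source Hugging Face diffusers library; this is not Adobe's or Runway's pipeline, and the checkpoint and file names are examples.

```python
# Sketch of diffusion-based inpainting with the open-source diffusers library.
# White pixels in the mask are regenerated to match the prompt; the rest of the
# image is kept. The checkpoint and file names are examples, not a product's pipeline.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("street_photo.png").convert("RGB")
mask = Image.open("stranger_mask.png").convert("RGB")   # white = region to replace

result = pipe(prompt="empty cobbled street, early morning light",
              image=image, mask_image=mask).images[0]
result.save("street_photo_clean.png")
```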

An even more challenging problem is outpainting, where an image is extended in one or more directions with content that is plausible but entirely generated.

Consider the famous painting “Girl with a Pearl Earring” by the Dutch master Johannes Vermeer. AI has been used to significantly outpaint, or extend, the original image (outlined by a red box), creating a plausible but completely fabricated surrounding for the 350-year-old painting.

Girl With A Pearl Earring

Image Out-Painting - Creating A Plausible Background For "The Girl with a Pearl Earring"
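Outpainting can be framed as inpainting over an enlarged canvas: paste the original into a bigger image and mask everything that is new. Here is a minimal sketch that reuses the `pipe` object from the inpainting example earlier; the padding, prompt and file names are illustrative.

```python
# Outpainting framed as inpainting over an enlarged canvas: paste the original
# into a bigger image and mask everything that is new. `pipe` is the inpainting
# pipeline from the earlier sketch; padding, prompt and file names are illustrative.
from PIL import Image

original = Image.open("girl_with_a_pearl_earring.png").convert("RGB")
w, h = original.size
pad = 256   # pixels of new content to invent on every side

canvas = Image.new("RGB", (w + 2 * pad, h + 2 * pad), "black")
canvas.paste(original, (pad, pad))

mask = Image.new("RGB", canvas.size, "white")                 # white = generate
mask.paste(Image.new("RGB", (w, h), "black"), (pad, pad))     # black = keep original

extended = pipe(prompt="17th century Dutch interior, soft window light",
                image=canvas, mask_image=mask).images[0]
extended.save("girl_with_a_pearl_earring_outpainted.png")
```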

But what if we had other paintings of the room we could use for reference? Could we outpaint the portrait based on real reference images? That was the focus of a recent paper from Cornell University and Google titled “RealFill: Reference-Driven Generation for Authentic Image Completion”.

Hyper Realistic Images with Character Consistency

Gen AI solutions such as DALL-E, Midjourney and Stable Diffusion have continued to evolve and improve rapidly. They are able to create hyper-realistic images based on detailed prompts from users. Historically, they struggled to generate hands and fingers, details that would instantly give an image away as artificially generated. Improvements to the models, including training sets with more hand data and better prompting, have significantly improved performance in recent months.

Hyper-realistic Image Generation and Character Consistency

Hyper-realistic Image Generation With Character Consistency
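As an illustration of how prompt-driven generation works with an openly available model, here is a short sketch using diffusers; the checkpoint and prompt are examples only, not the model or prompt behind the image above.

```python
# Sketch of prompt-driven image generation with an openly available model via
# diffusers. The checkpoint and prompt are examples only, not the model or
# prompt behind the image above.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a man reading a newspaper by the window of a sunlit cafe, "
           "photorealistic, 50mm lens, shallow depth of field",
    negative_prompt="deformed hands, extra fingers",   # a common prompting trick
).images[0]
image.save("cafe.png")
```

Negative prompts like the one above are one of the simple prompting tricks commonly used to reduce the tell-tale hand and finger artefacts.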

Text-to-image models have also struggled with character consistency, that is, carrying the same character across different scenes. An example would be asking an AI to recreate the man sitting by the window in the cafe scene above and place him in a chair in a living room. The latest release of DALL-E v3 has demonstrated significant improvements in this space.

Another area where Generative AI has historically struggled is generating images that contain text. In the past, if the prompt that generated the man sitting by the window had asked for him to be reading a newspaper with a collection of headlines, the model would probably have performed poorly. Again, DALL-E v3 has recently made significant improvements here.

Being able to generate realistic, believable scenes with consistent characters and legible embedded text opens up a host of new use cases.

Motion Capture and Transposition

A few months ago, a team of researchers at ByteDance released a paper called Magic Avatar. It described how they are able to capture motion from an input video and use it to generate a new synthetic AI video, transposing both the actor and the environment described by a Generative AI text prompt.

Motion Tracking and Transposition

Motion Capture and Image Transposition

While changing the avatar between the source and the destination has a number of creative use cases in the production of movies and games, what if the virtual avatar is simply a high-fidelity version of the source actor? It could be used to massively compress video streams: instead of transmitting the images, we need only transmit the motion (the middle image) together with the prompt describing the transposed avatar and scene.
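The potential saving is easy to illustrate with some rough numbers; the keypoint count, precision and frame size below are assumptions, not measurements.

```python
# Back-of-the-envelope: transmitting motion instead of pixels.
# The keypoint count, precision and frame size are illustrative assumptions.
frame_bytes = 1920 * 1080 * 3       # one uncompressed 1080p RGB frame
keypoints = 33                      # a typical full-body pose skeleton
motion_bytes = keypoints * 3 * 4    # x, y, z as 32-bit floats per keypoint

print(f"{frame_bytes // motion_bytes}x less data per frame "
      f"({motion_bytes} bytes of motion vs {frame_bytes:,} bytes of pixels)")
# The prompt describing the avatar and scene only needs to be sent once.
```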

3D Environment Reconstruction from Images and Videos

Until recently, the most efficient way to create a 3D representation of a physical object was to scan it using a process called photogrammetry, which works by taking overlapping photographs of an object from different angles and combining them in software to generate a 3D model. Recently, however, NVIDIA's research team published a paper, "Neuralangelo: High-Fidelity Neural Surface Reconstruction", in which they demonstrated generating a 3D scene from video captured on a mobile phone. Previous attempts at this struggled to capture areas of fine detail.

High-Fidelity Neural Surface Reconstruction

Neuralangelo: High-Fidelity Neural Surface Reconstruction

This has profound implications for the future of virtual reality and augmented reality. 

Meta's first attempt at placing avatars into the metaverse was described as looking "creepy and dead-eyed" by social-media users. This resulted in a change of direction and a focus on hyper-realistic avatars based on models generated from scans of a user's face.

Hyper-realistic avatars

Hyper-realistic avatars based on 3D models created from scans of a user's face

However, for this to be widely adopted, the process of scanning needs to be easily accessible and inexpensive. The mobile phone, with its high-resolution camera and TrueDepth camera system, seems to be the only viable approach to creating an accurate 3D map of the face. Avatar models, their textures and their materials also need to be efficiently stored and brought to life through AI-augmented rendering pipelines for an immersive and believable user experience.

Predictions

If we combine some of the recent advances in the application of AI to image and video processing, there are complementary streams of development that I believe will impact the telecommunications, media and entertainment industries.

Advances in Gen AI have led to improved image upscaling and optical flow acceleration, which allow video to be communicated with far less transmitted data. As GPUs become ubiquitous in communication devices and their power consumption decreases, software and network infrastructure will be able to significantly compress video streams while maintaining quality, making communication markedly more efficient.

Further improvements can be gained from foveated imaging, where only the sections of a video stream that are actively being observed carry the greatest amount of information and detail.

As motion capture, and the transposition of that motion onto other objects and actors, continues to evolve and become more reliable, the effective compression of the stream will increase again, potentially significantly. And if motion capture is possible using the high-resolution cameras and TrueDepth systems on phones, the barrier to adoption for immersive metaverse experiences falls considerably.

Once we transition to representing actors and objects in 3D, we also benefit from the improvements in simulated materials and textures mentioned earlier.

If we combine this with inpainting and outpainting, we'll be able to isolate objects and actors from a livestream and extend the scene not only with a plausible AI-generated background, but with one grounded in a corpus of real images and videos.

If we then add the ability to recreate 3D scenes from video captured on a mobile phone, the experience will be far more advanced than the background separation and virtual backgrounds we see today.

Finally, the ability of AI not only to capture, transpose and generate actors in a video stream, but also to simulate other entities, such as the behaviour of mechanical systems and materials, will make for a hyper-realistic communications experience.

While these advances will arguably improve how we communicate, face-to-face interaction is still the most effective way to connect people today, especially during first meetings. We have evolved to pick up on subtle, often subconscious cues when communicating in person, something COVID made all too apparent. That being said, I do believe AI will have a profound impact on how we communicate over the next few years. The speed at which we move from in-person to remote and virtual interaction will depend on how effectively the technology captures the subtlety of motion and communication, which, in part, is why those first Pixar films connected so deeply with those who watched them.