Jun 24, 2022

What Are Computational Cameras?

Computer Vision
Machine Learning

Computational cameras can run multiple computer vision functions (including AI-based computer vision) directly on the camera. To understand exactly what that means, let’s unpack an example image from an Apple iPhone.

Computational Photography in action. Tiny Optics + Math = A photo that looks like it was taken with HUGE optics.

Playing armchair reverse-engineer, we can identify several things that are happening behind the scenes:

  • The person of interest is detected using semantic segmentation (OAK-D-Lite how-to, here)
  • An edge detector is used to refine the edges of the person, since the semantic segmentation will be neither perfect nor high enough in resolution (OAK-D-Lite edge detector how-to, here)
  • The depth of the scene is estimated using disparity depth (OAK-D-Lite how-to, here)
  • A blur is applied to the non-person mask in proportion to depth (distance) to mimic the blur of a large lens focused on a close subject (we implemented this for all OAK devices, here).

The first three functions are standard features of all OAK-D models (OAK-D-Lite, OAK-D, OAK-D-PoE - any OAK-D*). The fourth function is where things get interesting. Since it isn't a standard feature of any OAK-D model, we decided to add it by writing custom computer vision code that runs directly on OAK-D-Lite. We will explain how we did it a little later, but for now, let's take a look at each of the four computer vision functions we identified in the iPhone image above.

Semantic Segmentation

The key enabler of the photo above is AI. Without that, such photos are not really possible (realistically speaking). The AI technique being used here is semantic segmentation, which, in the simplest terms, is the ability for the computer to identify and segment a specific item in the scene - in this case, the closest person to the camera.

Take a look at the example below. See how the person is continually painted green?  We are performing person-class semantic segmentation on an OAK-D. The other two panes are performing depth sensing, which we will cover later. 

GIF: semantic segmentation

Note how the edges are not quite tack-sharp. The green mask at times bleeds out of (or into) the main subject, specifically on edges that are soft or undefined. The reason is that, at the moment, semantic segmentation models generally run at low resolution. They have to, for now, because they are computationally intensive. Although backpropagation, the training method behind modern neural networks, was popularized in 1986, the computing power available at that time made training a model like this impractical. Not so, anymore.

Although we aren’t at 1986 computational levels now, semantic segmentation models still need to be quite low resolution to run at ~30FPS ("real-time"). Whereas an OAK-D-Lite has a 4208x3120 pixel color camera, popular semantic segmentation models are only 256x256 pixels. To do the math for you, the semantic segmentation network has roughly 1/200th the pixel count of the color camera on OAK-D-Lite. When its output is applied to a full-resolution image, it is not high-enough resolution to make the edges look right. Enter edge detection.
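The "1/200th" figure can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the two resolutions quoted above; the variable names are ours.

```python
# Resolutions quoted in the text.
camera_width, camera_height = 4208, 3120   # OAK-D-Lite color camera
model_width, model_height = 256, 256       # typical segmentation model input

camera_pixels = camera_width * camera_height
model_pixels = model_width * model_height
ratio = camera_pixels / model_pixels

print(f"camera pixels: {camera_pixels:,}")   # 13,128,960
print(f"model pixels:  {model_pixels:,}")    # 65,536
print(f"ratio:         {ratio:.1f}x")        # about 200x

# Per-axis view: how many camera pixels one mask pixel has to cover.
print(camera_width / model_width, camera_height / model_height)
```

Put differently, each pixel of the segmentation mask has to stand in for a patch of roughly 16x12 camera pixels, which is why the mask edges look soft at full resolution.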

Edge Detection

Image: edge detection

Edge detectors are algorithmic computer vision filters with a low computational cost, at least relative to a semantic segmentation network. Better yet, our OAK cameras have a dedicated piece of hardware specifically for edge filtering.

Because the edge filter can run at the full resolution and full frame-rate, it makes an excellent partner for the semantic segmentation network. The edge filter is non-discriminatory; it has no idea what anything is. It can’t distinguish a dog from a dishtowel, or a human from a humidor, but it can give you edges with super-fine granularity at high resolution.
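To make "an edge filter has no idea what anything is" concrete, here is a minimal pure-Python sketch of a Sobel filter, a common classical edge detector. The text above does not say which filter the OAK hardware block implements, so treat the choice of Sobel (and the function name `sobel_magnitude`) as an assumption for illustration only.

```python
def sobel_magnitude(image):
    """Return an approximate gradient magnitude (|gx| + |gy|) per pixel.

    `image` is a 2D list of numbers. Border pixels are left at 0 to keep
    the sketch short.
    """
    height, width = len(image), len(image[0])
    kernel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to vertical edges
    kernel_y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to horizontal edges
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += kernel_x[dy][dx] * pixel
                    gy += kernel_y[dy][dx] * pixel
            edges[y][x] = abs(gx) + abs(gy)
    return edges


# Tiny test image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]
edges = sobel_magnitude(image)
print(edges[2])   # [0, 0, 36, 36, 0, 0]
```

The filter lights up exactly where brightness changes and nowhere else. It only does a handful of multiply-adds per pixel, which is why this kind of filter is cheap enough to run at full resolution.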

By combining these computer vision functions, we can tie the semantic segmentation (low resolution) to the closest major edge (high resolution), such as the well-defined edge around the person in the image above, to create an accurate mask around the person. But now what?
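One plausible way to "tie the mask to the closest major edge" is to move each mask boundary to the strongest edge found within a small search radius. The sketch below does this for a single image row in pure Python. It is our own illustration under that assumption, not the method used by the iPhone or by any OAK example, and `snap_row_to_edges` is a hypothetical name.

```python
def snap_row_to_edges(mask_row, edge_row, radius):
    """Shift each mask boundary in one image row to the strongest nearby edge.

    mask_row[x] is 0 or 1 (coarse mask, already upsampled to full width).
    edge_row[x] is the edge strength of the transition between x-1 and x.
    """
    width = len(mask_row)
    refined = list(mask_row)
    for b in range(1, width):
        if mask_row[b] == mask_row[b - 1]:
            continue  # no boundary between b-1 and b
        lo, hi = max(1, b - radius), min(width - 1, b + radius)
        best = max(range(lo, hi + 1), key=lambda x: edge_row[x])
        if edge_row[best] == 0:
            continue  # no edge nearby: keep the coarse boundary
        if best > b:
            # Boundary moves right: extend the value on the left side.
            for x in range(b, best):
                refined[x] = mask_row[b - 1]
        else:
            # Boundary moves left: extend the value on the right side.
            for x in range(best, b):
                refined[x] = mask_row[b]
    return refined


# The coarse mask is shifted one pixel to the right of the real object,
# whose true edges sit at x=3 and x=9.
coarse_mask = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
edge_row    = [0, 0, 0, 5, 0, 0, 0, 0, 0, 5, 0, 0]
print(snap_row_to_edges(coarse_mask, edge_row, radius=2))
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

A real implementation would work in 2D and handle overlapping search windows, but the division of labor is the same: the segmentation says roughly where the person is, and the edge map says exactly where the outline is.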

The next component to consider is depth. We need to combine the depth of the person with the depth of the overall scene to create the desired effect.

Depth Sensing

The video below demonstrates semantic segmentation (top left) in combination with depth sensing so that we can isolate, then alter, the background to make a new image. The bottom left frame shows a depth map of the entire scene, while the bottom right shows how the background (the blue depth level) can easily be removed. Once we have identified the background, we can replace it as needed. In this case, we swap out a boring background for a more interesting one.
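The idea of removing the background by depth can be sketched in a few lines. Stereo depth follows from disparity through the standard relation depth = focal length x baseline / disparity. The focal length, baseline, threshold, and function names below are illustrative assumptions, not values read from any OAK device.

```python
def disparity_to_depth_mm(disparity_px, focal_px, baseline_mm):
    """Standard stereo relation: depth = focal_length * baseline / disparity."""
    if disparity_px <= 0:
        return float("inf")   # no match found: treat as infinitely far
    return focal_px * baseline_mm / disparity_px


def replace_background(frame, depth_mm, new_background, max_subject_depth_mm):
    """Keep pixels closer than the threshold, swap the rest for the new background."""
    return [
        [pixel if depth <= max_subject_depth_mm else bg_pixel
         for pixel, depth, bg_pixel in zip(frame_row, depth_row, bg_row)]
        for frame_row, depth_row, bg_row in zip(frame, depth_mm, new_background)
    ]


FOCAL_PX, BASELINE_MM = 450.0, 75.0   # illustrative values only

disparity = [[45, 45, 5],
             [45, 9, 0]]
depth = [[disparity_to_depth_mm(d, FOCAL_PX, BASELINE_MM) for d in row]
         for row in disparity]

frame = [["P", "P", "w"],      # P = person pixel, w = wall pixel
         ["P", "w", "w"]]
beach = [["b"] * 3 for _ in range(2)]

result = replace_background(frame, depth, beach, max_subject_depth_mm=1500)
print(result)   # [['P', 'P', 'b'], ['P', 'b', 'b']]
```

In practice you would combine this depth test with the person mask from the segmentation network, so that a nearby chair is not kept along with the subject.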

Now that we are able to identify what is the main subject and what is the background, if we want to use a filter to create a blurred background as seen in the original image above, we can apply a blur effect to whatever depth (or as seen in the example, color) we want. We could even stack blur effects by mapping them to a specific color to make a complex, yet aesthetically pleasing result. This is very likely what the iPhone is doing.
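As a rough illustration of mapping blur strength to depth, the sketch below blurs each pixel with a window whose radius grows with its distance from the subject's depth. It uses a simple box (average) blur rather than a Gaussian to stay short, and the step size, maximum radius, and function names are assumptions made for this example.

```python
def blur_radius(depth_mm, subject_depth_mm, mm_per_step=1000, max_radius=4):
    """Blur radius in pixels: 0 at the subject's depth, larger farther from it."""
    steps = int(abs(depth_mm - subject_depth_mm) // mm_per_step)
    return min(steps, max_radius)


def depth_blur(image, depth_mm, subject_depth_mm):
    """Box-blur every pixel with a radius chosen from that pixel's depth."""
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            r = blur_radius(depth_mm[y][x], subject_depth_mm)
            # Average all pixels inside the (clipped) window around (x, y).
            window = [image[yy][xx]
                      for yy in range(max(0, y - r), min(height, y + r + 1))
                      for xx in range(max(0, x - r), min(width, x + r + 1))]
            output[y][x] = sum(window) / len(window)
    return output


# One image row: the left half is the subject (800 mm), the right half is
# background (3000 mm) with a high-contrast pattern.
image = [[0, 10, 0, 10, 0, 10]]
depth = [[800, 800, 800, 3000, 3000, 3000]]
output = depth_blur(image, depth, subject_depth_mm=800)
print([round(v, 2) for v in output[0]])   # [0.0, 10.0, 0.0, 6.0, 5.0, 6.67]
```

The subject pixels come through untouched while the background pattern is smoothed out. Note that this naive version lets subject pixels leak into nearby background windows; a more careful version would exclude masked pixels from the average.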

But how do you make the blur happen?

Gaussian Blur

All the above features are built-in capabilities of OAK cameras, but blurring isn't. Lucky for you, the OAK camera system easily accommodates custom programs so that you can make the OAK system work like you need it to. That’s right, go wild. Use your imagination!

Image: Gaussian blur

Above we have a Gaussian blur running on OAK-D-Lite, in this case implemented with Kornia, a computer vision library built on PyTorch that reimplements many OpenCV-style operations. You can find the how-to for any OAK model here, but it’s so simple that the complete model, shown in the section on running custom code below, is only a few lines long.

Easy, no? You could then take this capability and define a more-complex and creative model which takes depth into account to blur things that are farther from (or closer to) the camera than the subject of interest.
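For readers curious what a 9x9 Gaussian blur with sigma 2.5 (the parameters used in this post's Kornia model) actually computes, here is a dependency-free sketch. It builds the 1D Gaussian weights and applies them along rows and then columns, which is equivalent to a full 2D Gaussian kernel. The edge handling here (repeating the border pixel) is our simplification and may differ from Kornia's default, so outputs near the image border are not guaranteed to match.

```python
import math


def gaussian_kernel_1d(size, sigma):
    """Normalized 1D Gaussian weights, centered on the middle tap."""
    center = (size - 1) / 2
    weights = [math.exp(-((i - center) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, kernel):
    """Convolve one row with the kernel, repeating the border pixel at the ends."""
    half = len(kernel) // 2
    last = len(row) - 1
    return [
        sum(k * row[min(max(x + i - half, 0), last)] for i, k in enumerate(kernel))
        for x in range(len(row))
    ]


def gaussian_blur(image, size=9, sigma=2.5):
    """Separable Gaussian blur: blur the rows first, then the columns."""
    kernel = gaussian_kernel_1d(size, sigma)
    rows_blurred = [blur_row(row, kernel) for row in image]
    columns_blurred = [blur_row(list(column), kernel) for column in zip(*rows_blurred)]
    return [list(row) for row in zip(*columns_blurred)]


kernel = gaussian_kernel_1d(9, 2.5)
print([round(w, 3) for w in kernel])   # symmetric weights that sum to 1

# A single bright pixel gets spread over its neighbors.
impulse = [[0.0] * 9 for _ in range(9)]
impulse[4][4] = 1.0
blurred = gaussian_blur(impulse)
print(round(blurred[4][4], 4), round(blurred[4][5], 4))
```

On the camera you would not write this by hand: the point of using Kornia is that the same operation becomes one library call inside a PyTorch model.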

Running Custom Computer Vision Code

We're just scratching the surface here, but it's really exciting. Here are three major techniques:

  • Kornia - Fills the gap between classical and deep computer vision. Built on PyTorch.
  • PyTorch - A popular machine learning framework. You can create your own computer vision functions within its framework.
  • OpenCL - We still don't know if this is possible or not. But it looks interesting.
  • G-API - Aims to build hybrid AI/CV workloads. And we're super excited to share that the pull request for direct support for all OAK models is here.

Wait, you said three things, and then had a list of four things! Kornia and PyTorch are actually the same technique. Kornia is quite powerful, and has already done a lot of the work you would otherwise do yourself in PyTorch directly. It is effectively a great open-source team implementing a bunch of useful computer vision functions on top of PyTorch. If you're looking to implement a custom CV function on OAK, check out Kornia first. If it's not there, then you can dive into implementing it in PyTorch.

import torch.nn as nn
import kornia

class Model(nn.Module):
  def forward(self, image):
    # 9x9 Gaussian kernel, sigma 2.5 along both axes
    return kornia.filters.gaussian_blur2d(image, (9, 9), (2.5, 2.5))

GIF: Kornia

Beyond Kornia or PyTorch, picking between the rest is largely based on what you have experience in. Fancy OpenCL? Have at it! Are you a fan of the up-and-coming G-API? It’s all yours. The takeaway here is that you can both run custom CV code directly on OAK models and have a choice of how you prefer to do it.

Computational Cameras - Way More Than Great Photos

Our team is only scratching the surface of custom CV in OAK. The whole world is just scratching the surface of what computational cameras can do. And OAK is at the forefront of that. Below is just one example, courtesy of Cortic Technology.


In this example, the camera has shifted its purpose from merely capturing an image to becoming a fall detector. The data product isn't a photo, or even a video, rather it is a notification of "Is there a fall and where did it occur?" And remember, all computation is done on camera and NO video has to leave the premises. 

The potential applications of simultaneous computer vision functions are endless, and we are confident that as this technology becomes more widely available it will prove yet again what is often the case with technology: the limits are only your imagination.

Erik Kokalj, Director of Applications Engineering