Augmented reality (AR) is our reality augmented with digital data.
The digital data can be in the form of text, pictures, videos, 3D assets, or a combination of all of the above.
The AR system will need to understand reality and reconstruct it to create its digital twin.
It also needs to let the user interact with both the digital twin and the digital data.
The camera has no inherent sense of direction. You can’t command it to take pictures of things that are in front of or behind you. You can only point it in the direction you want.
It is an array of sensors that light up when light rays, streaming through an aperture, pour down on them.
Giving the camera a sense of direction and an understanding of the world is a complex process. The field of study that deals with analyzing and understanding pixels is called Computer Vision (CV).
Teaching a machine to see the world as we do, classify objects as we do, and adapt to changes in the world as we do is Machine Learning (ML).
We learn, and we become a little more intelligent. When a machine learns and produces results that are a little better than before, that's Artificial Intelligence (AI).
ML, AI, and CV, along with a few hardware sensors that form the IMU (Inertial Measurement Unit), make AR possible.
At the heart of an augmented reality app is the camera. Everything flows from the data that the camera is bringing into the system.
The augmented world is formed by detecting shapes and movements. This constitutes the physical layer.
Beyond the physical layer of forms, there is the semantic layer which gives meaning to the things we see.
The semantic layer involves training neural networks so machines can understand the world as we do.
Self-driving cars have solved that problem to a large extent, but the hardware on those cars is more advanced than what's available on a mobile phone. On mobile, we are still building the foundational layers of understanding the physical world.
This article is about the physical layer.
The main components of the physical layer are:
The IMU (Inertial Measurement Unit) comprises an accelerometer that measures linear acceleration, a gyroscope that measures rotation, and a magnetometer that measures heading.
The traditional odometer in your car detects wheel rotation to measure the distance traveled.
The visual odometer on the phone measures distance traveled and orientation from a sequence of images.
Visual odometry combined with the components of the IMU forms the Visual Inertial Odometry (VIO) system.
We can’t use the traditional odometric sensors because they were meant for wheels and ours is legged motion.
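To see why the camera is needed at all, consider what pure inertial dead reckoning looks like. The sketch below is a hypothetical 2D model, not any framework's actual filter: it integrates gyroscope and accelerometer readings into a path, and any small bias in those readings compounds at every step. That accumulating drift is exactly what the visual half of VIO corrects.

```python
import math

def dead_reckon(samples, dt):
    """Integrate 2D IMU samples into a path.

    Each sample is (a, w): linear acceleration along the current heading
    (m/s^2) and angular velocity (rad/s). Returns a list of (x, y)
    positions. Any bias in a or w compounds every step: that is drift.
    """
    x = y = v = heading = 0.0
    path = [(x, y)]
    for a, w in samples:
        heading += w * dt                  # gyroscope: integrate rotation
        v += a * dt                        # accelerometer: integrate to speed
        x += v * math.cos(heading) * dt    # speed: integrate to position
        y += v * math.sin(heading) * dt
        path.append((x, y))
    return path

# Accelerate straight ahead at 1 m/s^2 for one second, sampled at 10 Hz.
path = dead_reckon([(1.0, 0.0)] * 10, dt=0.1)
```

A real VIO system fuses these integrated estimates with camera-derived motion, typically through a Kalman-style filter, so the inertial drift gets reset by visual evidence many times per second.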
We don't need smart glasses to see the world the same way we see it without the glasses.
The usefulness of AR comes from using the information about the world and then adding value to it.
An extinct animal in a museum can come alive, or pointing our cameras at buildings can show us more information. If you visit London right now, you can't see Big Ben, as Elizabeth Tower is under renovation for a few more years. But if you download the Big Ben AR app and point it at the scaffolding-laden structure, you can see and hear Big Ben again.
Data from the camera is processed by computer vision algorithms that try to identify real-world objects in the data stream and return a set of points, a.k.a. a point cloud.
Take a magnifying glass and look closely at a picture, and you'll start to see the pixels. Alternatively, if you keep zooming into a picture in Photoshop, you'll start to see the pixels. The picture, then, is nothing more than a point cloud. In the case of pictures, these points are called pixels.
Pixels in a picture are organized and arranged in a way that, from a distance, completes the picture. The point cloud data on the phone is not as put together, because the system is still trying to understand what the picture of the real world is.
Connecting the points in the cloud results in a mesh. It looks like Spiderman covering up your furniture with spiderwebs. With new information coming in through the camera, the mesh is continuously changing and always expanding until the system can start locking onto objects. This is also called Spatial Mapping.
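A toy version of that spiderweb can be sketched in a few lines. This is not how a real spatial-mapping pipeline works (those triangulate actual surfaces from dense depth data); it only illustrates the idea that a mesh grows as nearby points get linked, and distant points stay unconnected until more data arrives.

```python
from itertools import combinations

def build_mesh_edges(points, max_dist):
    """Connect every pair of 3D points closer than max_dist.

    A real spatial-mapping pipeline triangulates surfaces out of the
    point cloud; this toy version only produces the 'spiderweb' of
    edges, to show how a mesh grows as nearby points get linked.
    """
    edges = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        dist_sq = sum((a - b) ** 2 for a, b in zip(p, q))
        if dist_sq <= max_dist ** 2:
            edges.append((i, j))
    return edges

# Three nearby points get linked into a web; the far-away point stays
# unconnected until new camera data fills the gap between them.
cloud = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5)]
edges = build_mesh_edges(cloud, max_dist=1.5)
```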
While mapping the environment, the program is also trying to localize itself, to figure out where the device is in relation to its surroundings and which direction it is facing. This is called SLAM (Simultaneous Localization & Mapping).
The program then tries to find familiar patterns in this mesh. This brings us to our next topic — Pattern Matching.
There is a video from Microsoft in the reference section below showing how spatial mapping works on the Hololens.
The first releases of Apple's ARKit and Google's ARCore, the two native AR frameworks on mobile, recognized only horizontal planes. It took a few months for these two frameworks to recognize vertical planes.
The horizontal planes give us the ground and the tabletops where we can place digital objects.
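A crude way to picture horizontal plane detection is to look for a height at which many points cluster. The sketch below is only an illustration of that intuition, not how ARKit or ARCore actually fit planes (they use far more robust estimation over tracked feature points): it buckets the heights of points in a cloud and reports the most populated level.

```python
from collections import Counter

def find_horizontal_plane(points, bin_size=0.05, min_points=3):
    """Guess a horizontal plane's height from a point cloud.

    Buckets point heights (the y coordinate, in meters) and returns the
    height of the most populated bucket, or None if no bucket has enough
    points. Real frameworks fit planes far more robustly than this.
    """
    bins = Counter(round(y / bin_size) for _, y, _ in points)
    level, count = bins.most_common(1)[0]
    if count < min_points:
        return None
    return level * bin_size

# Three points cluster near 0.7 m (a tabletop, say); one outlier above.
cloud = [(0.1, 0.70, 0.3), (0.4, 0.71, 0.2), (0.2, 0.72, 0.5), (0.0, 1.4, 0.1)]
height = find_horizontal_plane(cloud)
```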
Many AR games use the horizontal plane very creatively. It’s fun for now because it’s new. Beyond games, for AR to be useful in real life, we’ll need more than just planes.
In addition to the program detecting shapes, we can tell the program to look for a complex pattern. This pattern is provided to the system in the form of an image.
There is a lot we can do with image recognition, like placing a digitally animated character on the phone screen when you point the phone at a picture of that character on a billboard, having pictures in a museum come alive, or providing information about the artist.
We are also able to scan a 3D object and generate a pattern for the program to then recognize 3D objects in the real world. This is called object recognition.
If you extend object recognition to multiple objects in a room it becomes Room recognition.
Expand it further on a larger scale outdoors and it becomes Scene recognition.
Wikitude supports Object, Room, and Scene Recognition. ARKit added object recognition in its latest release, but the use cases are somewhat limited. The necessary step of scanning a 3D object to feed it to the system takes time to do right on consumer hardware.
Though it’s nice to be able to detect more complex patterns than just planes, there is a serious drawback to this approach.
Feeding patterns limits what the program can really learn about the world. If we were to feed it every picture of everything in our world from every possible angle, it would take many years and billions of dollars, and as soon as we told the program about a picture, the picture may already be outdated.
Things in our world change all the time.
The time of day changes, seasons change, lights change and make things look different. Mountains get snow in the winter and look different from how they look in summer. The same tree will shed its leaves. Things fade with time. Buildings are demolished; new ones go up. We could spend a few years on this paragraph writing about everything that changes.
Many people use the word persistence when they really mean tracking an anchor. Then they ask a question on Stack Overflow and downvote a response for the wrong reason.
Let’s get some definitions out of the way.
Anchoring is placing a digital object in the augmented world.
Tracking is maintaining a consistent definition of the augmented world as the camera moves around so the digital objects in the augmented world stay where they are anchored.
But what is it tracking?
It’s tracking feature points, usually points of high contrast, from the sequence of images to form an optical flow. ARKit uses a combination of this Feature-based Method and another method called the Direct Method to get the job done.
More information on feature points and extracting features from images can be found in the References section.
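As a rough intuition for what "points of high contrast" means, here is a deliberately naive detector. It is not ARKit's feature extractor or any standard corner detector; it just flags pixels whose intensity differs sharply from a neighbor, which is the raw ingredient real detectors refine.

```python
def find_feature_points(img, threshold=50):
    """Pick pixels whose local contrast exceeds a threshold.

    img is a 2D list of grayscale values (0-255). A pixel qualifies if
    its intensity differs from a horizontal or vertical neighbor by more
    than `threshold` -- a crude stand-in for a real corner detector.
    """
    h, w = len(img), len(img[0])
    points = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = abs(img[y][x + 1] - img[y][x - 1])
            dy = abs(img[y + 1][x] - img[y - 1][x])
            if max(dx, dy) > threshold:
                points.append((x, y))
    return points

# A single dark pixel on a light background: the high-contrast border
# around it shows up as feature points; the flat regions do not.
img = [[200] * 5 for _ in range(5)]
img[2][2] = 20
features = find_feature_points(img)
```

Tracking then matches points like these across successive frames to form the optical flow that keeps anchors locked in place.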
Persistence is the ability to remember where a digital object was placed in the real world when the app is turned off and turned back on again.
Persistence works by saving (serializing) the map that we obtained from the point cloud of the world. When you turn the app on you have a choice to reload the saved map. This way you don’t need to perform spatial mapping in that space all over again.
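Conceptually, the save-and-reload cycle looks like the sketch below. This is an assumption-laden toy, not ARKit's actual ARWorldMap, which is an opaque binary blob holding raw feature points and plane extents; here the "map" is reduced to named anchor positions so the round trip is easy to see.

```python
import json
import os
import tempfile

def save_world_map(anchors, path):
    """Serialize anchor positions so they survive an app restart.

    `anchors` maps an anchor name to an (x, y, z) position in the mapped
    space. ARKit's ARWorldMap saves much more, in an opaque binary
    format; this sketch keeps only the core idea: write the map out on
    exit, read it back on launch.
    """
    with open(path, "w") as f:
        json.dump({name: list(pos) for name, pos in anchors.items()}, f)

def load_world_map(path):
    with open(path) as f:
        return {name: tuple(pos) for name, pos in json.load(f).items()}

# Place a digital vase, "quit" the app, relaunch, and reload the map.
map_path = os.path.join(tempfile.gettempdir(), "world_map.json")
save_world_map({"vase": (0.3, 0.0, -1.2)}, map_path)
restored = load_world_map(map_path)
```

The hard part in practice is not the serialization but relocalization: on reload, the device must match what the camera currently sees against the saved map before any restored anchor means anything.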
The term Persistence in the human vision system is tied to the ability of the brain to persist an image for a fraction of a second after the image is taken away. This is how motion pictures work. 24 frames per second is a good frame rate for us to persist the previous frame and fuse it into the next frame to get a sense of fluid motion. But, that’s not Persistence in AR. That’s Tracking.
When we were 2, we looked at a pattern and pointed at it. A voice gently told us that it’s a table! Table! Then we looked at 100 other tables, and were told they were tables!
Slightly different forms but common features can be tagged with a common label. At some point, we became confident about tables.
AR on mobile phones has trouble sorting through all the different ways a plane can appear in the real world.
A horizontal plane can be at many different heights, and under different lighting conditions, they can look different.
A vertical plane can be angled in many different ways.
There can be any number of slanted planes that are neither horizontal nor vertical.
Walls come up from the ground. Some walls slope at 45 degrees. If we follow the edges, we are now following a polygon and not a rectangle. And walls intersect each other. They always do.
As the camera spends more time gathering data, the confidence level of the program can rise or fall.
With new data coming in, maybe that plane now seems bigger or smaller or located a little farther down the hall. Data from another plane located 50 feet behind the first one comes in and now these two planes look like one giant vertical plane from the camera’s point of view.
With more time maybe the program will be able to resolve that the planes are separate. But we don’t have more time. We want the program to do all of these computations very quickly. For AR to be useful on mobile phones it has to work in real-time.
As the understanding of the world around the camera changes, the definition of the world changes and it gets harder for the program to accurately remember where a particular object in the digital world is anymore, especially when it is not looking at it.
In the real world, this is how the problem might look:
I see a jukebox and walk on to take a seat at the bar. Now I know the jukebox is behind me. Maybe I order a beer and set it down a few inches in front of me to my right. If this was a Philip K. Dick novel, maybe this is when the world shifts when I blink. The world rotates on its axis and maybe I forget to rotate with it. I reach for the beer, and it’s not exactly where I placed it just a second ago. The jukebox is not really behind me either. It’s to my left now!
Imagine everything around you potentially shifting like this 30 times a second! (Phone cameras record video at 30 frames per second by default.)
We need a lot of power on these chips to be able to process the world in real time and keep it all together.
Our brain is our hardware, and our intelligence, our software, has evolved over millions of years. The camera is less than 200 years old, and machine learning is newer than that.
There are many challenges to getting it right. Cloud computing is a real thing, and with the upcoming 5G network speeds, we should be able to use the power of the cloud and stream the results back to the phone in real time.
AR has enormous potential in education, travel, fashion, retail, utility, entertainment, publishing, MRO, remote guidance, games…the list is big and growing.
Billions of dollars are pouring into this technology. Some people think it is the next biggest revolution since the internet.
Magic Leap, a company that makes AR headsets, became the highest-funded startup in the history of technology, with a reported $2.3 billion in VC funds, and that too without a product. When it closed its funding rounds, all it had were videos showing the potential product. Now they have a product. It is good but fell way short of what they showed in those videos. Later they admitted the videos were doctored in visual-effects software. Not sure how they got away with that.
But, that is the frenzy that exists in the AR startup world today. No investor wants to miss out.
Whatever the financial projections may be, at the end of the day, I believe it’ll improve the way we connect.
A complicated computer vision algorithm, developed over half a century, detects facial features in real time. Snap uses that algorithm to overlay digital images, and users use it in fun ways that build relationships and human connection.
Delightful experiences born out of seemingly disconnected layers of technology.
The technology of scanning pixels intelligently, developed deep inside a lab (CSAIL), finds its meaning by helping people communicate in a new way and find new meanings in things that connect them to the world.
For me, that completes the circuit and that is what I’m looking forward to.