Challenges that retailers face nowadays
Brick-and-mortar retailers were already battling revenue lost to online shopping before the COVID-19 pandemic. Now, with people being forced to do their shopping online, it’s become even more apparent that retailers must modernize in order to meet the needs of consumers.
By leveraging modern AR technologies, retailers can enhance the online shopping experience and give sales a needed boost. In this article, we’ll show you how this approach can be applied to jewelry stores, using our ring fitting application as an example.
We’ve seen a few such applications in mobile app stores, but all of them were either non-interactive or required additional AR markers. Such limitations spoil the user experience, so our goal was to create a solution that works without markers and makes use of modern AR trends.
What we were looking for was an app that could recognize your hand without any markers and render a 3D model of the ring on the proper finger.
As a teaser, we’ll show you the video with the final results. If you are interested in the implementation, we will go into more detail in a future article.
Fortunately, the MediaPipe team has already created a TensorFlow Lite model capable of detecting a hand along with the exact positions of the fingers. We really recommend checking out their GitHub repo, as they have many more great examples besides this one.
To use this pretrained model, we needed to shrink the camera preview from its 640×480 resolution to 256×256 pixels. We do that by downscaling and cropping so that only the central part of the frame is preserved. We can then apply the model, and it returns a set of points, one for each hand joint.
You can find the guide to TFLite model inference here: https://www.tensorflow.org/lite/guide/inference
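The crop-and-scale arithmetic can be sketched as follows. This is a minimal illustration, not the app’s code: `CropRect`, `centerSquare` and `modelToFrame` are hypothetical names, and it assumes the model reports joint positions in pixels of its 256×256 input.

```kotlin
// Hypothetical helper type: a square crop inside the camera frame.
data class CropRect(val left: Int, val top: Int, val size: Int)

// Largest centered square that fits into a frame of the given size.
fun centerSquare(frameWidth: Int, frameHeight: Int): CropRect {
    val size = minOf(frameWidth, frameHeight)
    return CropRect((frameWidth - size) / 2, (frameHeight - size) / 2, size)
}

// Map a point from model input coordinates back to camera frame coordinates.
// Assumption: the model input is the crop scaled down to modelSize x modelSize.
fun modelToFrame(x: Double, y: Double, crop: CropRect, modelSize: Int = 256): Pair<Double, Double> {
    val scale = crop.size.toDouble() / modelSize
    return Pair(crop.left + x * scale, crop.top + y * scale)
}
```

For a 640×480 frame, `centerSquare` yields a 480×480 square that starts 80 pixels from the left edge, so the center of the model input maps back to the center of the frame.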
The next step was to figure out the width of the finger in pixels so that we could adjust the rendered ring to it. For this task we used some basic math and computer vision. We intentionally marked two dots in red on the previous GIF to show that, for now, they are the only ones we are interested in: we draw the ring on this one finger only. Support for other fingers, as well as a UI for finger selection, will be implemented in the future.
In order to find edges, we will use a simple Sobel filter. Those familiar with computer vision (CV) already know what a kernel is and how it is applied to an image, but for others we’ll take a moment to explain. A kernel is essentially an array (for images, it’s usually 2-dimensional) of numerical coefficients along with an anchor point in that array, which is typically located at the center.
How does convolution with a kernel work?
Assume you want to know the resulting value for a particular location in the image. The value of the convolution is calculated in the following way:
Place the kernel anchor on top of the chosen pixel, with the rest of the kernel overlaying the neighbouring pixels of the image.
Multiply the kernel coefficients by the corresponding image pixel values and sum the result.
Write the result to the output image at the location of the anchor.
Repeat the process for all pixels by scanning the kernel over the entire image.
The mathematical notation for these operations can be expressed as:
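One common way to write it, assuming an input image I, an output image H, and a kernel K of size M_i × M_j whose anchor sits at (a_i, a_j) (these symbols are chosen here for illustration):

```latex
H(x, y) = \sum_{i=0}^{M_i - 1} \sum_{j=0}^{M_j - 1} I(x + i - a_i,\; y + j - a_j)\, K(i, j)
```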
Fortunately, OpenCV provides you with the function filter2D so you don’t have to code all of these operations.
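To make the steps above concrete, here is a minimal pure-Kotlin sketch of the same operation for a single-channel image. It is only an illustration of what happens per pixel, not a replacement for filter2D. The function name `correlate` is hypothetical, and the border handling (clamping coordinates to the image) is a simplifying assumption; OpenCV uses a reflecting border by default.

```kotlin
// Slides the kernel over the image and writes the weighted sum for every pixel
// into a separate output image. The anchor is assumed to be the kernel center.
fun correlate(image: Array<IntArray>, kernel: Array<IntArray>): Array<IntArray> {
    val rows = image.size
    val cols = image[0].size
    val anchorRow = kernel.size / 2
    val anchorCol = kernel[0].size / 2
    val out = Array(rows) { IntArray(cols) }
    for (y in 0 until rows) {
        for (x in 0 until cols) {
            var sum = 0
            for (i in kernel.indices) {
                for (j in kernel[i].indices) {
                    // Assumption: pixels outside the image repeat the nearest border pixel.
                    val yy = (y + i - anchorRow).coerceIn(0, rows - 1)
                    val xx = (x + j - anchorCol).coerceIn(0, cols - 1)
                    sum += kernel[i][j] * image[yy][xx]
                }
            }
            out[y][x] = sum
        }
    }
    return out
}
```

Unlike an 8-bit output image, this sketch keeps negative sums as they are, which makes it easy to see the sign of the response on each side of an edge.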
Kernels used in the Sobel edge detection
We are going to use only the horizontal one, in order to avoid noise from the vertical filter and thus increase the accuracy of our edge detection. As you may notice, the filter goes from negative to positive values to detect a change in the image gradient. Unfortunately, this correctly detects only edges that go from dark to bright regions of the image and misses edges that go in the opposite direction. So we need another trick here: we apply two filters, one that goes from left to right and another that goes from right to left.
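import org.opencv.core.CvType
import org.opencv.core.Mat

// Here you can see the difference between the left and right kernels:
// only the sign of the coefficients changes.
// Note: every row has the same weight here (a Prewitt-style kernel);
// the classic Sobel kernel weights the middle row by 2.
val leftKernel = Mat(3, 3, CvType.CV_16SC1)
val rightKernel = Mat(3, 3, CvType.CV_16SC1)
leftKernel.put(0, 0,
    -1.0, 0.0, 1.0,
    -1.0, 0.0, 1.0,
    -1.0, 0.0, 1.0)
rightKernel.put(0, 0,
    1.0, 0.0, -1.0,
    1.0, 0.0, -1.0,
    1.0, 0.0, -1.0)
If you’re interested in this topic, we recommend the following resource, which lets you try different kernels yourself and see the result: https://setosa.io/ev/image-kernels/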
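Why two kernels are needed becomes clearer with a small pure-Kotlin sketch. It assumes the result is stored as an 8-bit image, so negative responses are clipped to 0, and it assumes the rows above and below the scanned row look the same. `edgeResponse` is a hypothetical helper written for this illustration.

```kotlin
// Response of a 3x3 kernel made of three identical rows (kernelRow) along one
// image row. The result is clamped to 0..255 to mimic an 8-bit output image.
fun edgeResponse(row: IntArray, kernelRow: IntArray): IntArray =
    IntArray(row.size) { x ->
        val left = row[maxOf(x - 1, 0)]
        val right = row[minOf(x + 1, row.size - 1)]
        val raw = 3 * (kernelRow[0] * left + kernelRow[1] * row[x] + kernelRow[2] * right)
        raw.coerceIn(0, 255)
    }

// A bright finger (200) on a dark background (20), seen along one row.
val sampleRow = intArrayOf(20, 20, 20, 200, 200, 200, 20, 20, 20)
val leftResponse = edgeResponse(sampleRow, intArrayOf(-1, 0, 1))   // fires on dark-to-bright
val rightResponse = edgeResponse(sampleRow, intArrayOf(1, 0, -1))  // fires on bright-to-dark
```

With this sample row, the first kernel responds only at the left side of the finger and the second only at the right side, so both are needed to find the two edges.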
You may ask: what if the hand is turned slightly, so that the target finger is not exactly vertical? Will the filters still find the edges correctly? Yes, they will still find the edges, but not as precisely. So, to get the best results, we rotate the image to straighten the target finger so that it always points upward. To achieve that, we calculate the slope of the line between the red dots from the previous demo and find the angle by which the image has to be rotated. With this angle, we build a rotation matrix and apply an affine transformation to the image. You can see the result of that operation in the next GIF.
import org.opencv.imgproc.Imgproc

// p13 and p14 are the points we are interested in;
// they are marked in red on the initial GIF.
// Calculate the slope as a ratio
val slope = (p14.y - p13.y) / (p14.x - p13.x)
// Convert the ratio to degrees
val slopeAngle = Math.toDegrees(Math.atan(slope))
// Build the rotation matrix in order to apply the affine transformation.
// `angle` is the rotation that brings the finger upright, presumably derived
// from slopeAngle; its definition is not shown in this excerpt.
val rotMat = Imgproc.getRotationMatrix2D(p13, angle, 1.0)
// We need the inverse rotation matrix to transform our points back to
// the original coordinates
val inverseRotMat = Imgproc.getRotationMatrix2D(p13, -angle, 1.0)
// This operation turns our image as shown above
Imgproc.warpAffine(image, image, rotMat, image.size())
As a result, we have a perfectly vertical finger, just as we want it to be. Now we can apply our Sobel filters in both directions (left to right and vice versa).
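The geometry behind the rotation can be checked with a small pure-Kotlin sketch. It is an illustration under stated assumptions, not the app’s code: it uses `atan2` instead of the slope ratio so that the direction of the finger is kept, it assumes image coordinates with y growing downward, and it builds the 2×3 matrix by hand using the layout documented for getRotationMatrix2D. `Pt`, `uprightAngle`, `rotationMatrix` and `transform` are hypothetical names.

```kotlin
import kotlin.math.atan2
import kotlin.math.cos
import kotlin.math.sin

data class Pt(val x: Double, val y: Double)

// Angle in degrees that makes the vector base -> tip point straight up
// (toward smaller y) when used with the matrix layout below.
fun uprightAngle(base: Pt, tip: Pt): Double =
    Math.toDegrees(atan2(tip.y - base.y, tip.x - base.x)) + 90.0

// 2x3 rotation matrix around `center`, in the layout documented for
// getRotationMatrix2D: [a, b, (1-a)*cx - b*cy; -b, a, b*cx + (1-a)*cy].
fun rotationMatrix(center: Pt, angleDeg: Double, scale: Double = 1.0): Array<DoubleArray> {
    val a = scale * cos(Math.toRadians(angleDeg))
    val b = scale * sin(Math.toRadians(angleDeg))
    return arrayOf(
        doubleArrayOf(a, b, (1 - a) * center.x - b * center.y),
        doubleArrayOf(-b, a, b * center.x + (1 - a) * center.y)
    )
}

// Apply a 2x3 affine matrix to a point.
fun transform(m: Array<DoubleArray>, p: Pt): Pt =
    Pt(m[0][0] * p.x + m[0][1] * p.y + m[0][2],
       m[1][0] * p.x + m[1][1] * p.y + m[1][2])
```

Rotating with `uprightAngle` places the fingertip joint directly above the base joint, and a matrix built with the negated angle maps points found in the rotated image back to the original frame.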
// Applying a kernel to our image with the OpenCV library is as easy as this
val leftSobel = Mat()
val rightSobel = Mat()
Imgproc.filter2D(image, leftSobel, -1, leftKernel)
Imgproc.filter2D(image, rightSobel, -1, rightKernel)
// The first argument is the source image (in our case grayscale, with the
// rotation applied)
// The second argument is where the result will be stored
// The third argument is the depth of the result; -1 keeps the depth of the source
// The last argument is our kernel
You can find documentation on Filtering with OpenCV here: https://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html
In this GIF you can see the difference between applying the left and the right horizontal Sobel operator. One highlights left-to-right edges and the other right-to-left edges. By left-to-right edges we mean that, moving from left to right, the shade goes from dark to bright; right-to-left edges go the other way.
Now we just need to decide where we want to put our ring and place it according to detected values.
As a result, we get the 4 points needed to render our 3D model of the ring correctly.
Fixing the finger in place with feature recognition
Not everything goes as well as it would under perfect conditions, and the first tests revealed another issue. As you can notice, although the coordinates are detected quite well, they tend to jiggle slightly from frame to frame, creating an annoying jittering effect. We bet no one will use our app if the ring shakes on their finger.
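One possible way to pick the finger edges from the two filtered images is sketched below. It is shown only as a hypothetical illustration and not as the exact logic of our app: it assumes the finger is already vertical, takes one row of each filtered image at the height where the ring should sit, and looks for the strongest response on each side of the finger’s center line. The name `fingerEdges` and the `maxHalfWidth` limit are assumptions made for this example.

```kotlin
// Returns the x positions of the strongest edge to the left and to the right
// of centerX, searching at most maxHalfWidth pixels in each direction.
fun fingerEdges(
    leftRow: IntArray,   // one row of the left-kernel result
    rightRow: IntArray,  // the same row of the right-kernel result
    centerX: Int,
    maxHalfWidth: Int
): Pair<Int, Int> {
    // An edge may show up in either filtered image, depending on whether the
    // finger is brighter or darker than the background.
    fun strength(x: Int) = maxOf(leftRow[x], rightRow[x])
    val lo = maxOf(centerX - maxHalfWidth, 0)
    val hi = minOf(centerX + maxHalfWidth, leftRow.size - 1)
    val leftEdge = (lo..centerX).maxByOrNull { strength(it) } ?: centerX
    val rightEdge = (centerX..hi).maxByOrNull { strength(it) } ?: centerX
    return Pair(leftEdge, rightEdge)
}
```

The difference between the two returned positions is the finger width in pixels for that row. Repeating the search on a second row would give four points in total, which can then be mapped back to the original frame with the inverse rotation matrix.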
This problem appears due to imperfections of the pretrained model. The model is trained on small images, so even minor imperfections can produce noise that ends up as this jittering effect. In order to fix this and “glue” our ring to the previously detected position, we can use another CV trick called homography. This technique is a huge topic that deserves an article of its own, so we’ll simplify the description a little here. Using one of the popular feature detectors (SIFT, SURF, ORB, BRISK, etc.), we can detect a number of simple features on the current and previous frames and find out how the image of the finger changed in between. Based on these changes, we can move our previously detected points to a new position, getting rid of the jittering effect.
On the left we have the previous frame, and on the right the current one. Dots represent detected features, lines between dots connect similar features on both images. Using these similar features we can find homography and calculate a new transformation matrix for our previously detected dots.
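Once a homography is known, moving a previously detected point to its new position is a small calculation. The sketch below shows that step in pure Kotlin, assuming a row-major 3×3 matrix; `Point2` and `applyHomography` are hypothetical names. In OpenCV, the matrix would typically be estimated from the matched features with `Calib3d.findHomography`, and `Core.perspectiveTransform` performs the same mapping for a whole set of points.

```kotlin
data class Point2(val x: Double, val y: Double)

// Map a point through a 3x3 homography, including the perspective divide.
fun applyHomography(h: Array<DoubleArray>, p: Point2): Point2 {
    val w = h[2][0] * p.x + h[2][1] * p.y + h[2][2]
    require(w != 0.0) { "Point is mapped to infinity by this homography" }
    return Point2(
        (h[0][0] * p.x + h[0][1] * p.y + h[0][2]) / w,
        (h[1][0] * p.x + h[1][1] * p.y + h[1][2]) / w
    )
}
```

For example, a homography that is a pure translation by (5, -3) moves the point (10, 20) to (15, 17).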
The first results are pretty satisfactory: we’ve got more stable placement points for our ring. Still, this is just a temporary solution. In the future we’ll address the problem by training our own model that doesn’t have issues with jittering.
The obvious next step is to re-train the model for our needs on higher-resolution images. That will allow us to get rid of all the extra calculations for edge detection and jitter removal. We would also like to try training it not on single images but on a few consecutive frames of the camera stream, which should give us smoother ring placement.
In the next iteration we want to detect ring sizes, and for that we will need to measure the distance to the finger. For that purpose we can use the modern multi-camera features of smartphones and tablets, or even the LiDAR scanner that appeared on the new iPad Pro; we are pretty sure that Apple won’t stop there and that future iPhones will have one too. As a final result, we want to try a full 3D reconstruction of the hand with a depth map to handle all scenarios.
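Why the distance matters can be shown with a rough sketch. It assumes a simple pinhole camera model and a focal length expressed in pixels; both are assumptions for illustration, and `fingerWidthMm` is a hypothetical helper, not part of the app.

```kotlin
// Under a pinhole camera model, an object that spans widthPx pixels at a
// distance of distanceMm has a real width of widthPx * distanceMm / focalLengthPx.
fun fingerWidthMm(widthPx: Double, distanceMm: Double, focalLengthPx: Double): Double {
    require(focalLengthPx > 0.0) { "Focal length must be positive" }
    return widthPx * distanceMm / focalLengthPx
}
```

For instance, a finger that is 40 pixels wide at a distance of 300 mm, seen with a focal length of 600 pixels, would be about 20 mm wide. Without the distance, the same 40 pixels could belong to a larger finger farther away.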