Let’s talk about fakes!
I’ve been seeing this video on LinkedIn for a week or more. Three days ago a crowd of people sent it to me with the words “look what they can do.” And today various ML channels started reposting it, each with its own caption. Why do people lack critical thinking?
I’m not 100% sure it’s fake, only 90%: a 40% chance that the person IDs and the cup counts are completely faked, and another 50% chance that the IDs and counts are heavily hand-tuned, for example by training on footage from the same day, or by hand-cutting the clip to a moment where everything more or less works.
It’s a cafe in Eastern Europe.
Why do I think so? There are two reasons:
- I’ve been working on people tracking for 5 years now, and I know what the technology is capable of.
- There are a number of objective bugs/logic errors in this video.
Since item 1 is purely subjective, let’s move on to item 2:
- The tracking looks very bad: an out-of-the-box model rendered as-is, with nothing tuned.
- Detections are lost very often. The model is either super-lightweight or poorly trained.
- The girl “Vika” is the only one whose cup counter increases, yet her track breaks 3-4 times over the video. Recognizing a girl in black against a black background predictably fails.
- And she becomes “Vika” again only one frame before her counter ticks. What would have happened if she had passed one frame earlier?
- The girl “Elena”: the first time her track broke for a couple of seconds, it was picked up again (same color); the second time the gap was larger and a new track (different color) started. Judging by the timings and the processing, this is plain DeepSORT (I wonder whether it’s the GPL-3 version or not).
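Real DeepSORT adds a Kalman filter and appearance embeddings on top, but the ID-switch failure mode described above can be illustrated with a bare-bones greedy IoU tracker. This is a sketch of the general technique, not the video authors’ code; all names and thresholds here are my own:

```python
# Minimal greedy IoU tracker sketch. When detections drop out for more
# frames than `max_age`, the track dies and the same person comes back
# with a NEW id -- the "Vika"/"Elena" track breaks seen in the video.
# Illustration of the failure mode only, not DeepSORT's actual code.

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class Tracker:
    def __init__(self, iou_thr=0.3, max_age=2):
        self.iou_thr, self.max_age = iou_thr, max_age
        self.next_id, self.tracks = 0, {}  # id -> (box, missed_frames)

    def update(self, detections):
        """Match boxes to live tracks; return {box: track_id}."""
        assigned = {}
        for det in detections:
            # Greedily match each detection to the best live track.
            best = max(self.tracks, default=None,
                       key=lambda t: iou(self.tracks[t][0], det))
            if (best is not None
                    and iou(self.tracks[best][0], det) >= self.iou_thr
                    and best not in assigned.values()):
                self.tracks[best] = (det, 0)
                assigned[det] = best
            else:
                self.tracks[self.next_id] = (det, 0)  # new identity!
                assigned[det] = self.next_id
                self.next_id += 1
        for t in list(self.tracks):
            if t not in assigned.values():
                box, missed = self.tracks[t]
                if missed + 1 > self.max_age:
                    del self.tracks[t]  # track is lost for good
                else:
                    self.tracks[t] = (box, missed + 1)
        return assigned

# One frame with a person, three empty frames, then the person again:
trk = Tracker(max_age=2)
first = trk.update([(0, 0, 10, 10)])   # gets id 0
trk.update([]); trk.update([]); trk.update([])
second = trk.update([(1, 1, 11, 11)])  # a NEW id: the "track break"
```

An occlusion only a couple of frames long survives; anything longer than `max_age` produces a fresh identity, exactly like the color change on “Elena’s” track.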
- How is the “cup counter increment” implemented? Globally, there are three approaches to doing this from video alone:
- Detect the cup itself.
- Recognize an “action” from a sequence of frames.
- Detect entry into the aisle between the tables.
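The first of these approaches boils down to: detect cup boxes, then credit each cup to the nearest person track. A minimal sketch of that idea, with a box format and distance threshold I made up; a real system would also need temporal smoothing so a flickering detection doesn’t double-count:

```python
import math

def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def attribute_cups(cup_boxes, person_tracks, max_dist=80.0):
    """Credit each detected cup to the nearest person track.

    cup_boxes:     list of (x1, y1, x2, y2)
    person_tracks: dict track_id -> (x1, y1, x2, y2)
    Returns dict track_id -> cups credited this frame.
    Sketch only: no temporal smoothing, so an unstable detector
    (as in the video) would double-count or miss cups.
    """
    counts = {}
    for cup in cup_boxes:
        cx, cy = center(cup)
        best_id, best_d = None, max_dist
        for tid, pbox in person_tracks.items():
            px, py = center(pbox)
            d = math.hypot(cx - px, cy - py)
            if d < best_d:
                best_id, best_d = tid, d
        if best_id is not None:
            counts[best_id] = counts.get(best_id, 0) + 1
    return counts

# A cup right next to person track 7 gets credited to that track:
print(attribute_cups([(100, 100, 120, 120)],
                     {7: (90, 90, 150, 200)}))  # {7: 1}
```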
- The first option (detecting the cup itself) is most likely what is running: those are the dots/tracks on the cups. It’s the only way to handle the “waiter put it down, client picked it up” situation. But look at the detection quality and the false positives! I drew a red bbox where a detection holds and a green one where it’s missed. Detection barely works. Moreover, it’s clearly cut off in the customer zone: there isn’t a single detection there.
- Another drawback of this approach: when a person occludes the cup with their body, there’s no detection, and when two cups stand close together, NMS will likely suppress one of them. In general, it’s clear why this won’t work in production.
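The NMS point is easy to demonstrate with textbook non-maximum suppression (the generic algorithm, not the video’s pipeline): two cups standing side by side produce overlapping boxes, and the lower-scoring one gets discarded.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, thr=0.5):
    """Standard greedy non-maximum suppression; returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thr]
    return keep

# Two cups side by side: their boxes overlap with IoU = 0.6, so the
# weaker detection is suppressed and only one cup is "seen".
boxes = [(0, 0, 20, 20), (5, 0, 25, 20), (100, 0, 120, 20)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the second nearby cup is gone
```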
- There are options not shown in the video, for example attributing the cup by whoever rings it up at the register. But then it’s unclear why “Elena” pays and “Vika” gets the +1.
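If the cup were attributed via the register, the usual trick is timestamp fusion: when the POS emits a sale event, credit the track closest to the register zone at that moment. A hedged sketch; the register coordinates, time skew, and data layout are all my assumptions:

```python
import math

REGISTER_XY = (320.0, 40.0)  # assumed image coordinates of the register

def credit_sale(event_ts, track_positions, max_skew=2.0):
    """Pick the track nearest the register within max_skew seconds of
    the POS event. track_positions: list of (track_id, ts, (x, y)).
    Returns a track id, or None if nobody was there in time. Sketch only."""
    best_id, best_d = None, float("inf")
    for tid, ts, (x, y) in track_positions:
        if abs(ts - event_ts) > max_skew:
            continue
        d = math.hypot(x - REGISTER_XY[0], y - REGISTER_XY[1])
        if d < best_d:
            best_id, best_d = tid, d
    return best_id

# Track 1 stands at the register when the sale fires; track 2 is far away:
print(credit_sale(10.0, [(1, 10.1, (300.0, 50.0)),
                         (2, 10.2, (100.0, 200.0))]))  # 1
```

Note that even this correctly credits only the person who pays, which is exactly the mismatch in the video: the system would credit “Elena,” not “Vika.”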
- A product problem. The girl “Anna” clearly makes the coffee and puts it on the table next to her; she is unlikely to carry it over (she’s busy preparing something else). So who gets the coffee: “Elena” who paid, “Vika” who will pick it up, or “Anna” who made it? And how will that be resolved, given that the system has already lost the cup at least once?
Well, a couple of final thoughts.
- So it’s a fake, fine. The idea itself is decent, and a real product could be built (use proper cameras, define the task properly, etc.). I don’t see anything wrong with this video. Moreover, there’s a chance it wasn’t the authors who made the video go viral; it may simply be some students’ demo (btw, as a course project it looks very good).
- In general, the task this video sets cannot be solved with high quality using cameras alone.
- Privacy. Believe it or not, Europe is full of such systems. They’ve enabled a lot: visitor counting in stores, smart stores, brand recognition on people’s clothing, etc. The GDPR is not about forbidding filming (in a public place or on your own property you can film); it’s about not storing private information. You can stream in real time, you can collect statistics, you can store blurred faces, and you can film people who have signed a consent form.
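The “statistics yes, personal data no” rule maps directly onto implementation: persist only anonymous aggregates, and let frames, faces, and track ids die in memory. A minimal sketch of the idea, not any particular product:

```python
from collections import Counter

class VisitorStats:
    """Privacy-friendly aggregation sketch: only per-hour visitor counts
    are kept; frames, faces, and track ids never leave process memory."""

    def __init__(self):
        self.hourly = Counter()
        self._seen = set()  # (hour, track_id) pairs, RAM only

    def observe(self, hour, track_id):
        # Count each track at most once per hour; the id is not persisted.
        if (hour, track_id) not in self._seen:
            self._seen.add((hour, track_id))
            self.hourly[hour] += 1

    def report(self):
        # Only anonymous aggregates leave the process.
        return dict(self.hourly)

stats = VisitorStats()
stats.observe(14, "track-1")
stats.observe(14, "track-1")  # same person again -- not double-counted
stats.observe(14, "track-2")
stats.observe(15, "track-3")
print(stats.report())  # {14: 2, 15: 1}
```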
By Computer Vision Engineer
Telegram: @CVML_team