Automatically recognize dozens of well-known players in images and videos using machine learning.


The FIFA World Cup comes around only once every four years, and it brings together famous football players from nations all over the world rather than from a single region.

A famous face can be a powerful marketing tool. Celebrities promote everything from products and services to social causes. They can shine a bright spotlight on any type of business, which is why such a wide variety of companies seek them out to advertise their products or services.

At AIM Technologies, we used our world-class visual recognition technology to detect and track celebrities during the world’s most-followed event, the FIFA World Cup 2022.

Celebrity Recognition System:

For someone who has not watched the match (or its replays or highlight clips), the whole game may be condensed into a single image. Players can rarely be recognized from the main feed, so instead of working on the main feed we first detect the close-ups. The system consists of the following steps:

1- Person detection.
2- Close-up detection.
3- Face detection.
4- Face recognition.
5- Jersey number recognition.

Person Detection:

Identifying and localizing objects in images is a computer vision task called “object detection”, and several algorithms have emerged in the past few years to tackle it. One of the most popular algorithms to date for real-time object detection is YOLO (You Only Look Once), initially proposed by Redmon et al.

The class of interest among all the detected objects is the “person” class, filtered by a threshold on the class confidence score. We detect every person in each frame and store the person’s bounding box; later, we recognize the celebrities within those bounding boxes.
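A minimal sketch of this filtering step is given below. It assumes the detector output has already been parsed into simple records; the `Detection` structure, the label string "person", the 0.5 threshold, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """One detector output (hypothetical structure for this sketch)."""
    label: str                       # class name, e.g. "person"
    confidence: float                # class confidence score in [0, 1]
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels


def keep_persons(detections: List[Detection],
                 min_confidence: float = 0.5) -> List[Detection]:
    """Keep only 'person' detections at or above the confidence threshold."""
    return [
        d for d in detections
        if d.label == "person" and d.confidence >= min_confidence
    ]


# Mocked detector output for a single frame (values are made up).
frame_detections = [
    Detection("person", 0.92, (100, 50, 400, 700)),
    Detection("person", 0.31, (900, 300, 930, 380)),
    Detection("sports ball", 0.88, (640, 600, 660, 620)),
]

# Store the bounding box of every confidently detected person.
person_boxes = [d.box for d in keep_persons(frame_detections)]
print(person_boxes)
```

In practice the threshold would be tuned on real footage: a lower value keeps more people at the cost of more false detections.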

Close-up Detection:

[Figure: a close-up frame compared with a main feed frame]

As mentioned earlier, we have to differentiate between the main feed and the close-ups, because we cannot recognize the players’ faces from the main feed.

After many experiments on different sets of matches, we came up with a solution: a simple equation applied to every person’s bounding box.

Given every person’s bounding box in a frame, if the ratio of the bounding box area to the frame area is greater than the close-up threshold, that bounding box is a close-up and is valid to be passed to the face recognition system.
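The rule above amounts to checking whether area(bounding box) / area(frame) > threshold. A minimal sketch follows. The 0.10 threshold, the function names, and the 1920x1080 frame size are assumptions made for this example; the threshold actually used is not stated here.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def box_area(box: Box) -> int:
    """Area of a bounding box, clamped at zero for degenerate boxes."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def is_close_up(box: Box, frame_width: int, frame_height: int,
                threshold: float = 0.10) -> bool:
    """True if the box covers more than `threshold` of the frame area."""
    frame_area = frame_width * frame_height
    return box_area(box) / frame_area > threshold


# A large box (about a third of a 1920x1080 frame) versus a small one.
print(is_close_up((600, 100, 1300, 1080), 1920, 1080))  # large box
print(is_close_up((900, 300, 930, 380), 1920, 1080))    # small box
```

With these values the first box counts as a close-up and the second does not, so only the first would be passed on to face detection.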

Face Detection:

Close-up bounding boxes are passed to the face recognition system. Modern face recognition pipelines consist of four common stages, illustrated in the figure below: detection, alignment, representation, and verification. In this section, we look at the face detection and alignment stages.

Face detection is the first and essential step of face recognition: it locates the faces in an image. There are many methods for face detection, and appearance-based methods are generally the most accurate.

The appearance-based method depends on a set of representative training face images to learn face models. In general, appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of face images. This approach is also used in feature extraction for face recognition. RetinaFace is a cutting-edge, deep-learning-based face detector, available for Python, that also returns facial landmarks. Its detection performance is strong even in crowds, as shown in the following illustration.

This step takes every close-up bounding box: if it contains a face, the bounding box is passed to the Face Recognition step; if not, it is passed to the Jersey Number Recognition step. We therefore consider this step crucial for the overall performance of the pipeline.
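The routing described above can be sketched as follows. The face detector is passed in as a plain callable so that the example stays self-contained; in a real pipeline that callable would wrap a detector such as RetinaFace. The function name, the route keys, and the stand-in detector are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def route_close_ups(close_up_boxes: List[Box],
                    has_face: Callable[[Box], bool]) -> Dict[str, List[Box]]:
    """Send boxes with a visible face to face recognition, the rest to jersey reading."""
    routes: Dict[str, List[Box]] = {
        "face_recognition": [],
        "jersey_number_recognition": [],
    }
    for box in close_up_boxes:
        step = "face_recognition" if has_face(box) else "jersey_number_recognition"
        routes[step].append(box)
    return routes


# Stand-in for a real face detector: a face is "visible" only in the first box.
boxes_with_faces = {(600, 100, 1300, 1080)}
routes = route_close_ups(
    [(600, 100, 1300, 1080), (200, 150, 800, 1000)],
    has_face=lambda box: box in boxes_with_faces,
)
print(routes)
```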

Face Recognition:

Our face recognition system consists of two stages: representation and verification.

  • Face Representation:

Deep learning comes into play in the representation stage. We feed face images to a convolutional neural network, but the task here is not classification: the network produces a vector representation of each face. Two of the most popular face recognition models are VGG-Face and ArcFace. Conveniently, both are provided by the deepface framework.

We have detected and aligned the face images and fed them to a face recognition model. We now have a vector representation of the face in each person’s bounding box.

  • Face Verification:

To assign a name to each detected and represented face, we trained a softmax classifier on top of the embedding vectors, treating identification as a classification task. The output of the softmax model is one of the top-tier football players.

  • At AIM Technologies, we always focus on the performance and accuracy of our systems, so we increased the accuracy of the face recognition system by ensembling two softmax models (one for VGG-Face and one for ArcFace). Ensembling is a machine learning approach that combines multiple models in the prediction process. The ensemble increased the accuracy of the system by 10% compared to each model individually.
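One common way to ensemble two softmax classifiers is to average their class probabilities and take the most likely class. The sketch below illustrates that idea only; how the two models are actually combined is not specified here, so the averaging rule is an assumption, and the player labels and logit values are made up.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_a: Sequence[float], logits_b: Sequence[float],
                     labels: Sequence[str]) -> Tuple[str, float]:
    """Average the two probability vectors and return the top label and its score."""
    probs_a = softmax(logits_a)
    probs_b = softmax(logits_b)
    averaged = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
    best = max(range(len(labels)), key=averaged.__getitem__)
    return labels[best], averaged[best]


# Hypothetical labels and classifier outputs for one face.
labels = ["Player A", "Player B", "Player C"]
vgg_face_logits = [2.0, 1.9, 0.1]   # this model narrowly prefers Player A
arcface_logits = [0.5, 3.0, 0.2]    # this model clearly prefers Player B

name, score = ensemble_predict(vgg_face_logits, arcface_logits, labels)
print(name, round(score, 3))
```

Here the first model alone would pick Player A by a small margin, while the averaged probabilities favor Player B, because the second model is much more confident.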

What if the close-up bounding box contains a famous player, but seen from behind?

Jersey Number Recognition:

The answer to the previous question is to detect the jersey number of that famous player, and this is the last step in our Celebrity Recognition System. We can treat this problem as object detection, where the objects to detect are single digits from 0 to 9.

Object detectors are deep learning models, so we need data for the training process. We processed some football matches with this pipeline and extracted the bounding boxes that show a player from behind.

We labeled our own training/dev/test sets in-house, which allowed us to ensure the quality of the data and its labels. We managed to collect data that spans various fonts, colors, and jersey colors.

We trained YOLO (You Only Look Once), one of the most popular algorithms to date for real-time object detection, on the labeled data, and it achieved good results for jersey number recognition on different sets of images.
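Because the detector outputs single digits, a multi-digit jersey number has to be assembled from the digits found inside one player’s bounding box. A minimal sketch of one way to do this is to sort the digits from left to right and concatenate them. This assembly rule, the `DigitDetection` structure, and the 0.5 confidence threshold are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional


class DigitDetection(NamedTuple):
    """One detected digit inside a player's bounding box (hypothetical structure)."""
    digit: int         # 0 to 9
    confidence: float  # detector confidence in [0, 1]
    x_center: float    # horizontal center of the digit box, in pixels


def read_jersey_number(digits: List[DigitDetection],
                       min_confidence: float = 0.5) -> Optional[int]:
    """Drop weak detections, order digits left to right, and join them into a number."""
    kept = [d for d in digits if d.confidence >= min_confidence]
    if not kept:
        return None  # no reliable digit found in this box
    kept.sort(key=lambda d: d.x_center)
    return int("".join(str(d.digit) for d in kept))


# Mocked detections: "1" is left of "0", and a weak "7" is discarded.
detections = [
    DigitDetection(0, 0.90, 130.0),
    DigitDetection(1, 0.95, 100.0),
    DigitDetection(7, 0.20, 300.0),
]
jersey_number = read_jersey_number(detections)
print(jersey_number)
```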

So, now we can recognize the famous players throughout the football match from both the face in front and the jersey number on the back, as illustrated in the above figure.

What if we detect the same number more than once in the same frame? In this situation, we use the predefined color palettes of the football jerseys: for every detected jersey number, we measure the jersey color, compare it to the color palettes, and assign the number to the correct team and hence to the correct famous football player.
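The color comparison can be sketched as a nearest-color lookup. The example below assumes the jersey color has already been measured as a single RGB value and uses squared Euclidean distance in RGB space. The distance measure, the palettes, the roster, and the player names are all hypothetical.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]


def squared_distance(a: RGB, b: RGB) -> int:
    """Squared Euclidean distance between two RGB colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_team(jersey_color: RGB, palettes: Dict[str, List[RGB]]) -> str:
    """Return the team whose palette contains the color nearest to jersey_color."""
    return min(
        palettes,
        key=lambda team: min(squared_distance(jersey_color, c) for c in palettes[team]),
    )


# Hypothetical palettes and roster, for illustration only.
palettes = {
    "Team A": [(117, 170, 219), (255, 255, 255)],  # light blue and white
    "Team B": [(0, 35, 149)],                      # dark blue
}
roster = {("Team A", 10): "Player X", ("Team B", 10): "Player Y"}

# Two players wearing number 10 in the same frame, with measured jersey colors.
measured = [(10, (110, 160, 210)), (10, (15, 40, 140))]
names = [roster[(closest_team(color, palettes), number)] for number, color in measured]
print(names)
```

The same number is resolved to two different players because each measured color falls closest to a different team’s palette.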

AIM Technologies is the first Middle East-based customer experience platform to introduce a multilingual text analytics solution with the world’s highest accuracy in the Arabic language, and the first end-to-end automated customer research tool. AIM Technologies has a vision of harnessing the power of AI to develop a fully automated customer insights and actions platform that helps brands enhance their customers’ experience.

To learn more about the products we’ve built using our AI models, you can reach out to us here

We are also hiring for our Data Science team. Feel free to send us your CV with the subject “DS candidate”.