Vulnerability Analysis of AI Camera-Based Facial Recognition Systems

By Sebastian Bukvic

Abstract

This research paper evaluates two facial recognition software packages, one open-source and one paid: Vladmandic and Visage SDK, respectively. Prior to the testing of a non-invasive Institutional Review Board (IRB) protocol involving 10 lab-team members in a lab environment, an initial hypothesis was formed suggesting that facial recognition software carries innate biases related to demographics, age, sex, and physical traits. While external research supported these claims, the results of the protocol tests suggested otherwise. The tests probed the accuracy of detecting emotions displayed by the lab participants. Both programs exhibited limited accuracy, with Vladmandic achieving only 36% accuracy and Visage SDK reaching 50%. Notably, the protocol results contradicted previous research: demographic factors such as race, sex, and hair characteristics did not significantly impact the accuracy of either program. However, discrepancies in the experiment methodology and research protocol were noted, such as participant expression variability and subjective demographic data collection. This lab experiment emphasized the importance of refining research protocols and considering every potential external factor, such as lighting conditions, for the robust evaluation of facial recognition software. Additionally, the discrepancy between these findings and the existing literature prompts a critical reflection on the past external research, this series of tests, and the complexities of bias assessment in technological systems.

Introduction

Facial recognition technologies and systems are becoming an increasingly prevalent feature of today's societies. From commonplace uses such as smartphone lock screens to more secure and niche examples, such as retinal scans for restricted areas, these technologies are widely known and used by the populace. It is important to understand how widespread these technologies are, how their capabilities function, and whether there is room for improvement. This line of thinking prompts further research and investigation into the subject in order to form conclusions about the potential pitfalls of these facial recognition technologies.

The following research addresses the testing of an experiment protocol relating to facial recognition technologies. The protocol seeks to either confirm or deny the existence of innate biases as well as inaccuracies. Two facial recognition programs were chosen for the testing of this protocol based on four criteria:

1. The program must have a level of popularity, established use cases, and continuous developer support; this prevents using a lesser-known program that may lack regular updates and proof of accurate results.
2. The program must be able to analyze and detect emotions, an important criterion that is explained further in this paper.
3. One of the programs must be open-source, free, and easily accessible to any individual, while the other must be a paid program.
4. Both programs must run on a laptop computer and efficiently and effectively use the webcam as their primary input device.

Using these criteria, the two facial recognition programs selected, open-source and paid, were Vladmandic and Visage SDK, respectively.

Prior to the start of any testing of this non-invasive IRB protocol in a lab environment on a voluntary basis, a hypothesis was formed suggesting that facial recognition software carries innate biases towards different demographics, ages, sexes, and physical traits. This preliminary hypothesis was formed on the basis of external research supporting this claim. However, ultimately, the results of the protocol testing contradicted the preliminary hypothesis and the information presented in the literature (Lohr 2018; Atay et al. 2021; Perkowitz 2021). To briefly introduce the concept of this experimental test, the 10 lab-team member participants had their basic demographics recorded, and then they displayed different emotions to be detected by the programs. If the emotion displayed matched what the program detected, it would be a successful detection. In accordance with the hypothesis, the accuracy of detection should vary with different demographics and hypothetically be less accurate with individuals who were minorities or women, which external research has shown to be less fairly represented in data sets (Lohr 2018; Atay et al. 2021; Perkowitz 2021).
Ultimately, after the testing of the protocol, Vladmandic achieved a cumulative accuracy of 36 percent and Visage SDK achieved 50 percent. As mentioned earlier, the analyzed results of the test did not indicate any innate biases in the software, and demographic factors did not significantly impact accuracy. However, before drawing firm conclusions, a deeper reflection on the experiment methodology and research protocol is necessary. This reflection uncovered discrepancies and potential sources of inconsistency in the protocol, including participant expression variability, a lack of a diverse data set, and subjective demographic data collection. Discrepancies in the setting and testing area were also noted, such as variable lighting conditions, among other factors. In addition, the protocol not only tested for the existence of biases in facial recognition technologies but also tested its own strength, underscoring the importance of continuously improving and refining research protocols and accounting for varying external factors. Ultimately, these findings expose a contradiction between the experiment protocol, the hypothesis, and the existing literature, which prompts a critical reflection on the results of both the external research and the protocol test itself. From this, we can develop conclusions regarding the complexities of bias assessment in technological systems.

Introducing Vladmandic and Visage SDK

The two facial recognition systems chosen for this experiment protocol were Vladmandic and Visage SDK. Vladmandic is an open-source facial recognition and eye-gaze tracking program distributed through a GitHub repository. It is completely free for anyone to use, and searches for keywords such as “facial recognition” and “eye-gaze tracking” show it as one of the most popular and widely used programs on GitHub. From the Vladmandic GitHub page, you can follow a link and run a demo version of Vladmandic on hosted servers, but for better accuracy you can download the full version onto your local device and run it there. Vladmandic’s capabilities include, but are not limited to, detecting approximate age, emotion, gaze, sex, and an individual’s distance from the camera.

Figure 1: A screenshot of the Vladmandic software in use

Figure 2: A screenshot of the Visage SDK software in use

Meanwhile, Visage SDK is a more robust, proprietary program from Visage Technologies, a company that specializes in facial recognition technologies and solutions. It is paid, licensed software that has been used in a variety of real-world cases. Notably, it has been used in the medical field to help prevent neurodevelopmental disorders in young children, such as amblyopia, by detecting symptoms early on. The version used in the lab protocol experiment was the demo version, hosted on the Visage Technologies servers; the full version would have many more capabilities. While Vladmandic appeared to have more features than the Visage SDK demo, the Visage SDK demo had arguably much better tracking and more accurate performance due to its tracking of 151 facial points.

The Protocol Standards, Elements, and Goals

The primary goal of the research protocol was for each facial recognition program (Vladmandic and Visage SDK) to successfully detect the emotion expressed by the volunteer lab participant. Both Vladmandic and Visage SDK display probabilities (as percentages) that an expression portrays a particular emotion; in other words, if a participant showed a happy face, either program could, for example, display 67% Happy, 23% Surprised, and 10% Neutral. In that case, happiness would be the emotion detected, as it had the highest probability, and that would count as a successful match between the program and the participant. Each facial expression was held by the participant for 5 to 10 seconds, timed on a standard analog wristwatch. Five emotions were tested: Happiness, Sadness, Anger, Fear, and Surprise. Every lab volunteer signed a consent form prior to any experimentation taking place. After that, each participant’s demographics and other physical traits were recorded in an Excel sheet. These demographics included age, sex, race, hair color, hair length (short, medium, or long), facial hair, glasses, makeup, hat, and contacts; all were factors that could potentially skew the results of the testing and demonstrate biases or inaccuracies in facial recognition programs.
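The matching rule described above amounts to an argmax over the reported probabilities. The following is a minimal sketch of that rule; the function names and the dictionary format are hypothetical illustrations, not part of either program's actual API:

```python
def detected_emotion(probabilities):
    """Return the emotion label with the highest reported probability."""
    return max(probabilities, key=probabilities.get)

def is_successful_detection(displayed, probabilities):
    """A trial counts as a match when the top-scoring emotion equals the one displayed."""
    return detected_emotion(probabilities) == displayed

# Example reading for a participant showing a happy face,
# mirroring the 67% Happy / 23% Surprised / 10% Neutral case above:
reading = {"Happy": 0.67, "Surprised": 0.23, "Neutral": 0.10}
print(detected_emotion(reading))                   # Happy
print(is_successful_detection("Happy", reading))   # True
```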

Figure 3: The anonymized demographics table.

The volunteer lab participants were then given a cue denoting that the protocol testing and the screen capture had started. OBS Studio was used for screen recording, and standards such as distance from the camera, lighting, and posture were kept consistent throughout the testing. Each participant was then verbally given an emotion to express along with a prompt relating to that emotion to help elicit a more realistic and accurate expression. There were two prompts per emotion. An example is prompt 1 for the happy emotion: “You won a gift card to your favorite restaurant from a raffle.” Participants held each expression for 5 to 10 seconds to allow the facial recognition program to process the input and yield a more accurate result. This process was repeated for both Vladmandic and Visage SDK, with all the emotions expressed for one program before moving on to the next. All the emotions and the screen were recorded for later data review. After the experiment was completed for all lab-team participants, the recorded results were analyzed and calculated.

The goal of this lab experiment was to test whether Vladmandic, Visage SDK, and, to some extent, facial recognition software overall contain any inherent biases regarding different demographics and physical traits such as race, sex, hair color, and hair length. Additionally, this protocol can serve as a template for future experiments of this nature because it inherently tests the strength and efficiency of the experiment standards and elements. The protocol test revealed what discrepancies may occur in testing environments and which points could potentially be corrected. This research also tested the software packages themselves: which emotions are easier to detect, which are not, and what these programs may benefit from in future updates.


Results and Conclusions

After the protocol testing was completed and the results were calculated, out of 50 potential correct emotion detections among the 10 participants, Vladmandic produced accurate readings on only 18 occasions, whereas Visage SDK produced accurate readings 25 out of 50 times. Therefore, Vladmandic was accurate only 36 percent of the time and Visage SDK 50 percent of the time. Vladmandic correctly detected 6 happy, 7 sad, 2 angry, 0 fearful, and 3 surprised emotions, while Visage SDK correctly detected 9 happy, 6 sad, 5 angry, 0 fearful, and 5 surprised emotions. Based purely on the collected data, Vladmandic is much stronger at detecting happy and sad emotions than the other three emotions tested. On the other hand, aside from the fearful emotions, which are the outlier in this case, Visage SDK produced more comprehensive and balanced results, with happy emotions detected best.
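The overall accuracies follow directly from the per-emotion counts reported above (10 participants per emotion, 50 trials per program). A short sketch of the arithmetic, using the counts from this section:

```python
# Correct detections per emotion, out of 10 participants each (50 trials per program),
# taken from the results reported in this section.
vladmandic = {"Happy": 6, "Sad": 7, "Angry": 2, "Fearful": 0, "Surprised": 3}
visage_sdk = {"Happy": 9, "Sad": 6, "Angry": 5, "Fearful": 0, "Surprised": 5}

def overall_accuracy(correct_counts, trials_per_emotion=10):
    """Total correct detections divided by total trials across all emotions."""
    total_trials = trials_per_emotion * len(correct_counts)
    return sum(correct_counts.values()) / total_trials

print(f"Vladmandic: {overall_accuracy(vladmandic):.0%}")  # Vladmandic: 36%
print(f"Visage SDK: {overall_accuracy(visage_sdk):.0%}")  # Visage SDK: 50%
```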

Figure 4: Vladmandic results, with correct detections in green and incorrect detections in red.

Figure 5: Visage SDK results, with correct detections in green and incorrect detections in red.

Figure 6: Bar chart comparing Vladmandic and Visage SDK accuracies with each emotion.

To further understand the results and why the accuracies for one of the best open-source programs and one of the best paid programs were so low, it is important to consider other inaccuracies in the testing. It was also observed that the supposed biases in facial recognition systems had little to no effect on the results of the experiment. These results become even more unusual when external research is considered, which directly contradicts the conclusions of this protocol test (Lohr 2018; Atay et al. 2021; Perkowitz 2021). According to that research, there should be demographic biases and a trend of inaccuracies in facial recognition programs due to a variety of factors, such as biased training sets and programmers. Nevertheless, this discrepancy could be the result of a weak research protocol.

The following are some considerations that may have affected the accuracy of the experiment itself. The lighting of the recording space may have skewed results. The duration of each facial expression could have been more strictly timed; rather than a range of 5 to 10 seconds, a fixed duration might have yielded more consistent and accurate results. The demographic checklist could have been less subjective. For example, the checklist offered three hair-length options: short, medium, and long. Participants were asked for their hair length, and even when participants had similar hair lengths, their responses were often subjective, as what one individual considers long hair may be medium or short to another. Additionally, some of the male participants described their hair as medium length, yet compared to the female participants’ hair it would be extremely short, which may indicate bias in the protocol standards and demographic checklists themselves. Furthermore, differing hairstyles could affect occlusion of the camera or face. Likewise, for the hat option on the checklist, a simple “yes” is not enough to determine whether the particular hat occludes the face. This suggests the demographic checklist questions were too rudimentary and could be more detailed. Additionally, there may have been bias in the pool of lab participants, as there may not have been enough demographic diversity to develop accurate results.

Yet another consideration arises when the facial expressions themselves are analyzed. Expressions that participants exaggerated were often the ones detected most accurately, which suggests that for more accurate results, facial expressions must be purposefully exaggerated to some extent. How might accuracy be increased for participants who maintained realistic, natural expressions? On the other hand, for some individuals, exaggerated expressions could produce unrealistic and artificial results. People display emotions differently, which could also play a role in the results. Testing this protocol was important because it can help set precedents and assist future lab experiments.


Bibliography

Atay, Mustafa, Hailey Gipson, Tony Gwyn, and Kaushik Roy. 2021. “Evaluation of Gender Bias in Facial Recognition with Traditional Machine Learning Algorithms.” In 2021 IEEE Symposium Series on Computational Intelligence (SSCI), 1–7. Orlando, FL, USA: IEEE. https://doi.org/10.1109/SSCI50451.2021.9660186.

Lohr, Steve. 2018. “Facial Recognition Is Accurate, If You’re a White Guy.” The New York Times, February 9, 2018, sec. Technology. https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html.

Perkowitz, Sidney. 2021. “The Bias in the Machine: Facial Recognition Technology and Racial Disparities.” MIT Case Studies in Social and Ethical Responsibilities of Computing, no. Winter 2021 (February). https://doi.org/10.21428/2c646de5.62272586.