Analyzing & Recognizing Distorted Text from CAPTCHA Images

Team: 10

School: Capital High

Area of Science: Cyber Security, Artificial Intelligence

Interim: Problem Definition:
Half of all web traffic is generated by bots, which are computer programs that simulate human activity, and 66 percent of those bots are created with malicious intent. One example is a DDoS (Distributed Denial of Service) attack, in which users cannot access a business website[1]. A successful DDoS attack not only takes the site down in the short term, but can also have severe long-term effects: damage to the business's online brand reputation, a significant increase in fees from hosting providers, and, in some cases, the compromise of the entire business's data. CAPTCHA (Completely Automated Public Turing Test To Tell Computers and Humans Apart) helps combat this. It was in use on over four million websites as of December 2020[2]. Yet while CAPTCHA does stop unreliable and poorly designed bots, a growing number of malicious bots get past it easily. Because recent bots use AI (Artificial Intelligence), at some point CAPTCHAs will no longer be able to distinguish between genuine users and malicious bots. To understand how these malicious bots pass through CAPTCHAs, we will create a machine-learning algorithm that can read and type out the distorted text displayed in an image. We will use text images similar to the ones CAPTCHA has used to keep bots from recognizing the content of distorted images. By running this algorithm in place of a human, we will see whether CAPTCHA can be bypassed easily.

Problem Solution & Expected Results:
To make the program read and analyze text from an image, we will use both MNIST (a large database of handwritten digits that is commonly used for training various image processing systems[3]) and our own handwritten and typed images. With these two datasets, we will use PyTorch, a Python library for creating machine-learning algorithms, together with the NumPy library and the Anaconda coding environment, to create a machine-learning algorithm that will read, analyze, and output different examples of text[4]. For example, we expect our program to recognize and output the text displayed in these two CAPTCHAs, which we have created at[5]:

Based on these two examples, our program should read the first distorted text and output 3x4mpl3, and read the second and output t3t?ng. It is expected to type out the text exactly as shown. If it does, that will support our hypothesis that CAPTCHA can be bypassed easily, a result that would be both alarming and eye-opening.
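One way to check the "exactly as shown" requirement is sketched below. It is only an illustration, not our finished program: it assumes the trained model produces one list of scores per character position, and the alphabet and the helper names (ALPHABET, decode, exact_match, char_accuracy, one_hot) are hypothetical choices made for this example. The scores here are faked so that the sketch runs without a trained model.

```python
import string

# Assumed alphabet: lowercase letters, digits, and "?". The real set of
# characters our CAPTCHAs may contain has not been fixed yet.
ALPHABET = string.ascii_lowercase + string.digits + "?"


def decode(scores_per_char):
    """Turn per-character score lists into text by picking the top score."""
    chars = []
    for scores in scores_per_char:
        best = max(range(len(scores)), key=scores.__getitem__)
        chars.append(ALPHABET[best])
    return "".join(chars)


def exact_match(predicted, expected):
    """True only if the whole string was typed out exactly as shown."""
    return predicted == expected


def char_accuracy(predicted, expected):
    """Fraction of positions that match, penalizing length differences."""
    longest = max(len(predicted), len(expected))
    if longest == 0:
        return 1.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / longest


def one_hot(ch):
    """Fake score list standing in for a model that is sure about ch."""
    return [1.0 if c == ch else 0.0 for c in ALPHABET]


# Stand-in for model output on the first example CAPTCHA.
scores = [one_hot(c) for c in "3x4mpl3"]
prediction = decode(scores)
print(prediction, exact_match(prediction, "3x4mpl3"))
```

Exact match is the strict test that matters for bypassing a CAPTCHA, since a single wrong character fails the challenge. The per-character accuracy is a softer measure that can show whether the program is improving before it gets whole strings right.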

Project Progress:
We have done some research on how AI works in CAPTCHA and on AI recognition rates for different kinds of text distortion[6, 7]. At present, we are experimenting with the PyTorch and NumPy libraries and learning how they work. Once we are past that hurdle, we will be on our way to training the computer to analyze and correctly output the distorted text.

[1] Gayer, Ofer. “Understanding Bots and How They Hurt Your Business.” Imperva, Feb 2016,
[2] SimilarTech. “Captcha Technologies Market Share and Web Usage Statistics.” SimilarTech,
[3] PyTorch. “What is PyTorch?” PyTorch, 2017,
[4] Vision System Design. “Support vector machines speed pattern recognition.” Vision System Design, 1 Sep 2004,
[6] Von Ahn, L., Blum, M., Hopper, N. J., & Langford, J. (2003). CAPTCHA: Using Hard AI Problems for Security. Advances in Cryptology — EUROCRYPT 2003, 294–311. doi:10.1007/3-540-39200-9_18
[7] Yan, J., & El Ahmad, A. S. (2008). A low-cost attack on a Microsoft captcha. Proceedings of the 15th ACM Conference on Computer and Communications Security - CCS ’08. doi:10.1145/1455770.1455839

Team Members:

  Hansel Chavez
  Jonathan Garcia
  Manuel Bojorquez
  Isel Aragon

Sponsoring Teacher: Irina Cislaru
