EXTENDED ESSAY – MATHEMATICS
BUILDING A PREDICTIVE TEXT LIST USING JACCARD INDEX AND BAYESIAN STATISTICS
RESEARCH QUESTION
How can I build a predictive text list for an incomplete word in a paragraph by using the Jaccard Index and Bayesian Statistics?
Word Count: 3852
TABLE OF CONTENTS
1. INTRODUCTION
1.1. RATIONALE
1.2. AIM OF THE STUDY AND APPROACH
2. WORD SIMILARITY COMPARISON
2.1. WHAT IS JACCARD INDEX AND JACCARD DISTANCE
2.1.1. Jaccard Index
2.1.2. Jaccard Distance
2.2. FINDING SIMILAR WORDS
2.2.1. Why Jaccard Index and Jaccard Distance
2.2.2. Applying Jaccard Index
3. TAKING SEMANTICS AND SYNTAX INTO ACCOUNT
3.1. BAYESIAN STATISTICS
3.1.1. What is Bayes’ Theorem
3.1.2. Differences Between Bayesian Statistics and Classical Statistics
3.1.3. Why Bayesian Statistics Instead of Classical Statistics
3.2. USING BAYES TO FIND SEMANTIC VALUES
3.2.1. What is Semantic Value and How it is Calculated
3.2.2. Using Bayes’ Theorem in Semantics
3.3. SYNTACTIC APPROPRIATENESS
4. COMBINING THE DATA
5. CONCLUSION
6. BIBLIOGRAPHY
7. APPENDICES
1. INTRODUCTION
1.1. RATIONALE
As personal computers have become widespread, our writing has moved to digital systems. Nowadays, instead of a pen, most people use a computer or smartphone to write texts on a variety of topics. However, typing still takes a lot of time. To enhance the quality and speed of our writing, computer scientists and engineers develop various algorithms.
I am interested in linguistics and data science. While doing research for my Mathematics Extended Essay, I came across a subfield of Artificial Intelligence: Natural Language Processing, which is, in essence, the processing of text or speech in a human language by software.[1] I asked myself, “Why not combine linguistics and computer science in your Extended Essay?” I decided to create a basic algorithm that uses the power of mathematics and computing to process a text as it is being written and make the best predictions for the upcoming word while it is typed. Computer scientists have worked on this problem for a long time, but there is still remarkable room for improvement. By using the power of statistics, we can go further in processing texts and make more accurate predictions.
1.2. AIM OF THE STUDY AND APPROACH
In this study, I aimed to develop a mathematical algorithm that makes the most accurate predictions for the upcoming word. Unlike most other algorithms, I did not only consider letter-by-letter similarities between words; I also considered the semantic and syntactic value of words in context. By using massive word databases, I tried to find the best suggestion for an unfinished word.
[1] Brownlee, J. (2017, September 22). What Is Natural Language Processing? Machine Learning Mastery. https://machinelearningmastery.com/natural-language-processing/ Retrieval Date: December 7, 2020.
To do that, I used the Jaccard Index and Bayesian Statistics to create a mathematical index that indicates the accuracy of a prediction. In my study, I worked on a randomly picked article: using a website, I chose a piece of a random article and cut it into parts.[2] Then I altered it so that it appears to be in the middle of being written. Here is the original piece that I took from the article:
“Nationwide, Republicans have a major advantage in redistricting heading into the November
elections. The party controls the process in twenty states, including key swing states like
Florida, Ohio, Michigan, Virginia, and Wisconsin, compared with seven for Democrats (the
rest are home to either a split government or independent redistricting commissions).”[3]
I cut a sentence off in the middle of it. Here is the version that I will study:
“Nationwide, Republicans have a major advantage in redistricting heading into the
November elections. The party controls the process in twenty st…”
By using mathematical and statistical tools, I tried to find the best suggestion for the word that is currently being typed. Of course, this study requires computing power, so I used several computer programs, tools, and pieces of code, which are included in the appendix. Expanded versions of the tables are also included in the appendix.
For the statistical work, I used two databases for the frequencies and genre-specific frequencies of words. Before determining that my paragraph belongs to a web page, I used COCA (the Corpus of Contemporary American English),[4] which contains more than 1 billion words from 8 different genres, to obtain fairer results. However, after finding that my paragraph is part of a web page,
[2] Website I used for picking a random article: https://longform.org/random
[3] Berman, A. (2012, January 31). How the GOP Is Resegregating the South. The Nation. https://www.thenation.com/article/archive/how-gop-resegregating-south/ Retrieval Date: December 7, 2020.
[4] https://www.english-corpora.org/coca/. Retrieval Date: February 16, 2021.
I used the iWeb Corpus,[5] which contains more than 14 billion words from 22 million web pages, to work with statistics focused only on web pages.
2. WORD SIMILARITY COMPARISON
2.1. WHAT IS JACCARD INDEX AND JACCARD DISTANCE
2.1.1. Jaccard Index
The Jaccard Similarity Index is a method for comparing two sets of data. It was invented by a Swiss professor of botany, Paul Jaccard.[6] It is a measure of similarity for two finite data sets and is expressed in a range of 0% to 100%; as the similarity increases, the percentage value increases.
The Jaccard Similarity Index compares the shared elements of the sets with their combined elements. Mathematically, to calculate the Jaccard Similarity Index for two finite sets, we divide the number of common elements (the intersection of the two sets) by the total number of elements (the union of the two sets). In mathematical notation, the Jaccard Similarity Index is expressed as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Or, equivalently, using $|A \cup B| = |A| + |B| - |A \cap B|$:

$$J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
In a Venn diagram, the Jaccard Index can be shown like this:

FIGURE 1: Venn representation of Jaccard Index

[5] https://www.english-corpora.org/iweb/. Retrieval Date: February 17, 2021.
[6] Paul Jaccard. Wikipedia. https://en.wikipedia.org/wiki/Paul_Jaccard Retrieval Date: December 7, 2020.
2.1.2. Jaccard Distance

Jaccard Distance is also a method for comparing two sets of data. However, unlike the Jaccard Index, which measures how similar the sets are, Jaccard Distance measures how dissimilar they are. It is also expressed in the range of 0% to 100%, but since it measures dissimilarity, the value increases as the similarity between the two sets decreases.

In Jaccard Distance, 100% or 1.00 means the sets are completely dissimilar, while 0% or 0.00 means the sets are equal to each other.
$$D(A, B) = 1 - J(A, B)$$

$$D(A, B) = 1 - \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

FIGURE 2: Venn representation of Jaccard Distance
2.2. FINDING SIMILAR WORDS
2.2.1. Why Jaccard Index and Jaccard Distance
There are several reasons why I picked the Jaccard Index for measuring the similarity of words. First of all, in the Jaccard Index, words are treated as sets of letters. This gave me the opportunity to compare words letter by letter, which is an essential feature for me. Since I base my prediction on letters that have already been typed, the Jaccard Index narrows the circle of possible words. Secondly, the Jaccard Index is widely used in computer science, and specifically in Python, the programming language that I used in this study. This gave me great flexibility, because English contains hundreds of thousands of words, and comparing the letters of each word with the characters typed so far is nearly impossible without a computer algorithm.
2.2.2. Applying Jaccard Index
To apply the Jaccard Similarity Index to words, we must first define the sets. For example, if we want to calculate the similarity index between two random words, “soldier” and “laborer”, we first define the two sets $A = \{s, o, l, d, i, e, r\}$ and $B = \{l, a, b, o, r, e\}$. Then we apply the Jaccard formula:

$$J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

$$J(A, B) = \frac{|\{l, o, e, r\}|}{|\{s, o, l, d, i, e, r, a, b\}|}$$

$$J(A, B) = \frac{4}{9} \approx 0.44$$
Since we have calculated these sets’ Jaccard Index, we can calculate their dissimilarity (Jaccard Distance) with the basic formula:

$$D(A, B) = 1 - \frac{|\{l, o, e, r\}|}{|\{s, o, l, d, i, e, r, a, b\}|} = 1 - \frac{4}{9}$$

$$D(A, B) \approx 0.56$$
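As a quick check, here is a minimal Python sketch (not the essay's original code, which appears in the appendix) that reproduces these values:

```python
# A minimal sketch reproducing the soldier/laborer example; words are
# treated as sets of letters, so duplicates are discarded automatically.
def jaccard_index(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def jaccard_distance(a: str, b: str) -> float:
    return 1 - jaccard_index(a, b)

print(jaccard_index("soldier", "laborer"))     # 4/9 = 0.444...
print(jaccard_distance("soldier", "laborer"))  # 5/9 = 0.555...
```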
English contains a great many words, so to find candidate completions for the word that is currently being typed, we need to use multiple methods. The first of them is comparing the fragment of the word already written with each probable word. There is no consensus on a complete list of English words, but I used an online source that contains more than 307,113 words.[7]
Using Python code, I first found the number of words starting with the letters “s” and “t” in the set of 307,113 English words. Assuming the upcoming word starts with these
letters in the order “st”, I picked only the words that contain these letters in the correct order. That left 4886 words in total.

[7] https://github.com/dwyl/english-words Retrieval Date: December 12, 2020.
After that, again using Python code, I calculated the Jaccard Similarity Index and Jaccard Distance of each of these words with the phrase “st”. This may give the impression that shorter words are more likely to be suggested, but that is not the case: since words are regarded as sets, repeated letters do not affect the similarity of words. Here is a real example that I encountered:
$$J(\text{st}, \text{stab}) = J(\text{st}, \text{staatsrat})$$

$$\frac{|\{s, t\}|}{|\{s, t, a, b\}|} = \frac{|\{s, t\}|}{|\{s, t, a, r\}|} = 0.5$$
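This behaviour can be seen directly in Python, where converting a string to a set discards repeated letters (a small illustrative snippet, not from the essay):

```python
# Sets keep each letter only once, so a long word with repeated letters
# can score the same as a short one against the typed fragment "st".
print(set("stab"))       # {'s', 't', 'a', 'b'}
print(set("staatsrat"))  # {'s', 't', 'a', 'r'} -- repetitions collapse
```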
Here are the Jaccard Similarity and Jaccard Distance of the English words, among those starting with the phrase “st”, whose similarity index is higher than or equal to 0.66:
TABLE 1: Jaccard Similarity and Distance of Words Starting with “st”

WORD   JACCARD SIMILARITY   JACCARD DISTANCE
st     1.0000               0.0000
sta    0.6667               0.3333
stat   0.6667               0.3333
stats  0.6667               0.3333
std    0.6667               0.3333
stet   0.6667               0.3333
stets  0.6667               0.3333
stg    0.6667               0.3333
sty    0.6667               0.3333
stk    0.6667               0.3333
stm    0.6667               0.3333
stoot  0.6667               0.3333
stoss  0.6667               0.3333
stot   0.6667               0.3333
stott  0.6667               0.3333
str    0.6667               0.3333
stu    0.6667               0.3333
stuss  0.6667               0.3333
stut   0.6667               0.3333
As you can see, there is no correlation between the length of a word and its similarity to the phrase “st”.
This table gave me a chance to make a prediction about the word that is currently being typed, but since there is a total of 4886 words that start with the “st” phrase, this prediction may be misleading for several reasons. First, more than one word has the same similarity ratio. Second, even though these suggestions gave a general outlook, they were not sufficient once I took the language’s own characteristics into account. In language, words must be coherent and follow grammatical rules at the same time.

To overcome this difficulty and make my suggestion more accurate, I developed a new model that takes semantic values into account by using Bayesian Statistics.
3. TAKING SEMANTICS AND SYNTAX INTO ACCOUNT
3.1. BAYESIAN STATISTICS
3.1.1. What is Bayes’ Theorem
Bayes’ theorem is a way of calculating probability with the help of prior knowledge; in other terms, it is one of the tools of conditional probability. It was formulated by Thomas Bayes, a Presbyterian minister and mathematician who lived during the 18th century. However, it was published in 1763, after Bayes’ death, when the theorem was discovered among his notes.[8] As an application of conditional probability, Bayes’ Theorem aims to find the probability of an event in light of relevant prior knowledge. It is a way of calculating the probability of an event or situation given that related knowledge is true. Its formula is as follows:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Or in the more widely used form:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$
In words, this equation finds the probability of A given that B is true, which is called the posterior in Bayesian terminology. To do that, it multiplies the probability of B given that A is true (the likelihood) by the probability of A (the prior), and divides the product by the probability of B (the marginal likelihood).

As you can see, Bayes’ Theorem uses three main components: 1. likelihood, 2. prior, 3. marginal likelihood. The likelihood is the probability that we would get if the hypothesis were
[8] Routledge, R. Bayes's theorem. Encyclopedia Britannica. https://www.britannica.com/topic/Bayess-theorem Retrieval Date: February 15, 2021.
true. Basically, to find it, we assume that the hypothesis is true and calculate the probability of the event happening. The prior is the main feature of Bayesian Statistics: it is our estimate of how probable the hypothesis is. If there are multiple hypotheses, the sum of the priors must equal 1. The marginal likelihood is the overall probability of the observation, whether the hypotheses are true or not. Its formula is as follows:
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \times P(A_i)$$
And the final outcome of Bayes’ rule, $P(A \mid B)$ in our formula, is called the posterior.
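To make the three components concrete, here is a minimal Python sketch with purely illustrative numbers (not taken from the essay’s data):

```python
# Illustrative only: two competing hypotheses with priors summing to 1.
priors = [0.7, 0.3]        # P(A1), P(A2)
likelihoods = [0.2, 0.6]   # P(B|A1), P(B|A2)

# Marginal likelihood by the law of total probability:
# P(B) = sum_i P(B|A_i) * P(A_i)
marginal = sum(l * p for l, p in zip(likelihoods, priors))  # 0.32

# Posterior for the first hypothesis: P(A1|B) = P(B|A1) * P(A1) / P(B)
posterior = likelihoods[0] * priors[0] / marginal
print(posterior)  # 0.4375
```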
3.1.2. Differences Between Bayesian Statistics and Classical Statistics
There are two main schools of statistics: Classical statistics and Bayesian statistics. While Classical statistics is the product of frequentist methods, Bayesian statistics approaches probability as a subjective experience of uncertainty. The frequentist method relies on repeated experiments and interprets only the data set; in this method, the null hypothesis is assumed true. In contrast, Bayesian statistics relies on combining the data with prior knowledge. Unlike the frequentist method, the Bayesian approach does not need large data sets: prior knowledge about the hypothesis is used along with the experimental data, so it works with smaller data sets.
Prior knowledge is a game changer in statistics, and it is the main aspect that distinguishes these two approaches. While there is no concept of prior knowledge in Classical (frequentist) statistics, Bayesian statistics gives us the ability to include our prior opinion or knowledge about the hypothesis. However, the nature of this prior knowledge is not clearly defined; it may be completely subjective or objective, depending on the choice of the person doing the calculation. There are multiple ways of deciding on the prior, and they affect the posterior. Both informative and noninformative priors can be used in calculations, and the
posterior can vary greatly according to the type of prior. Because of that, Bayesian statistics is sometimes criticized for being subjective and lacking scientific certainty.
Nonetheless, in Bayesian statistics this is regarded as a richness, not a weakness. The frequentist approach ignores all past studies and surrounding effects; it focuses only on the data. Classical (frequentist) statistics supposes that “nothing is going on”; however, “always something is going on”.[9] Since we do not run our experiments in an environment completely free of outside effects, ignoring the surroundings may be misleading. This means that, contrary to popular belief, Bayesian statistics may sometimes be more accurate than Classical statistics. The key part, however, is how we decide on the prior.
3.1.3. Why Bayesian Statistics Instead of Classical Statistics
As I mentioned in the previous section, Bayesian statistics takes outside effects more into consideration. In Bayesian statistics, we do not need to rely only on the raw data we obtain from calculations; we can also use other resources. In language, the main outside resource is habit. Each person has different habits in their use of language, and to make more accurate predictions we should also use them in our calculations.

In modern systems that perform tasks similar to the algorithm I am working on, language habits are an integral part of word prediction. They store data about how many times the user has used each word and tend to suggest more frequently used words. They also learn the topics that the user writes about most and use them in their predictions. This gives them the ability to make more accurate predictions for specific users.
[9] Schoot, Rens & Kaplan, David & Denissen, Jaap & Asendorpf, Jens & Neyer, Franz & Aken, Marcel. (2013). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. (p. 2). Child Development, 85. 10.1111/cdev.12169.
Learning is endless, and this is where Bayesian statistics shows its strengths. One of the biggest advantages of Bayesian statistics is its openness to updating: we can update the prior as we learn more about the user’s habits. We do not only use data; we can also learn from the data. That gives us the ability to improve the accuracy of predictions as the user types more.

I do not have any user-related data, but if I did, the use of Bayesian statistics would outperform Classical statistics in these respects. Since I want this study to be open to further development, I chose Bayesian statistics, and I still benefitted from some of its features.
I make my predictions with the help of Bayesian Statistics while trying to guess which word the writer intended to write. This gives me the opportunity to include a prior, in our context the habit factor, in my calculations. Language is strongly linked to our writing habits; for example, among synonyms, each person has their own preference. So predictions based only on semantics may be misleading. To overcome this problem, we should integrate these two factors. This gives more accurate results especially when we have personal data, but I do not work with personal data, since I have no information about the writer of this text. Therefore, I use the general writing habits of English writers.
3.2. USING BAYES TO FIND SEMANTIC VALUES
3.2.1. What is Semantic Value and How it is Calculated
“Colorless green ideas sleep furiously.”[10] This is by far the most famous sentence in linguistics, created by Noam Chomsky. Even though it is grammatically correct, it does not have any meaning. With this example, Chomsky showed that there is no necessary bond between the
[10] Chomsky, N. (2002). Syntactic Structures. (p. 15). Berlin: Mouton de Gruyter.
grammatical structure of a language and its semantics. This is also true for our prediction algorithm: our suggestion should not only be grammatically correct but also semantically appropriate for the sentence.
To do that, I found the words that fit the general topic of the paragraph. We have 24 words before the “st” phrase. By finding the areas in which these words are generally used, I aimed to determine the general topic of the paragraph. After that, I checked which words would be appropriate suggestions.
3.2.2. Using Bayes’ Theorem in Semantics
What I wanted was actually pretty simple: finding a semantically proper suggestion for completing the “st” phrase. To do that, I first decided on the context of the words already written. COCA classifies words according to how frequently each is used in 8 main genres: blog posts, general web pages, TV and movie subtitles, spoken language, fiction, popular magazines, newspapers, and academic writings. I checked each word in the paragraph and reached the following statistical overview:
TABLE 2: Total Frequencies of the Paragraph’s Words in Each Genre

GENRE               TOTAL FREQUENCY
Blog posts          12,176,265
Web pages           13,221,161
Subtitles           8,528,563
Spoken language     11,703,651
Fiction             11,517,946
Magazines           13,082,154
Newspapers          12,746,944
Academic writings   13,093,324
This table shows how frequently the words in the paragraph are used in each category, in total. According to these numbers, it is clear that the main genre of our paragraph is web pages.
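A small sketch of this step, with the totals copied from Table 2 (the genre keys are shortened labels, not COCA’s official names):

```python
# Genre totals from Table 2; the genre with the highest summed
# frequency is taken as the paragraph's main genre.
genre_totals = {
    "blog posts": 12_176_265,
    "web pages": 13_221_161,
    "subtitles": 8_528_563,
    "spoken language": 11_703_651,
    "fiction": 11_517_946,
    "magazines": 13_082_154,
    "newspapers": 12_746_944,
    "academic writings": 13_093_324,
}
print(max(genre_totals, key=genre_totals.get))  # web pages
```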
It is reasonable to expect the correct suggestion to be among the more frequently used words. For that reason, I picked the 107 most frequently used words among those starting with “st”.[11] Then I formulated the probability of each of the 107 words being used in the web page genre. For example, the formula for one of the least frequently used words, “styles”, is as follows:
$$P(\text{styles} \mid \text{web page}) = \frac{P(\text{web page} \mid \text{styles}) \times P(\text{styles})}{P(\text{web page})}$$
According to this formula, to find the probability of “styles” being the next word given that the genre of the paragraph is web pages, we multiply the probability of the genre being web pages given that the text includes “styles” by the probability of “styles” being used, and divide the product by the probability of the genre of the paragraph being web pages.
Since I knew from COCA the probability of the genre being web pages given that the text includes “styles”, or in basic terms the probability of “styles” being used in web pages, I substituted it. For the main feature of Bayesian Statistics, the prior, I used the general usage ratio of “styles” among the 107 most popular words starting with “st”. With these substitutions, the formula became:
$$P(\text{styles} \mid \text{web page}) = \frac{0.126607818411097 \times 0.00180771229791463}{P(\text{web page})}$$
However, since I did not have any direct statistic for $P(\text{web page})$, I used the marginal likelihood formula:
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \times P(A_i) = 0.124323532630005$$

[11] I picked 107 words because that was the maximum number of words I could obtain from the COCA data set I was using.
With this addition, the formula became:

$$P(\text{styles} \mid \text{web page}) = \frac{0.126607818411097 \times 0.00180771229791463}{0.124323532630005}$$

And the result was:

$$P(\text{styles} \mid \text{web page}) = 0.0018409266975627$$
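The same computation can be checked with a few lines of Python, a minimal sketch using the three values given above:

```python
# Bayes' rule for the word "styles", with the figures quoted in the text.
likelihood = 0.126607818411097   # P(web page | styles), from COCA
prior = 0.00180771229791463      # usage ratio of "styles" among the 107 words
marginal = 0.124323532630005     # P(web page), from the marginal-likelihood sum

posterior = likelihood * prior / marginal
print(posterior)  # ≈ 0.0018409266975627
```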
This, though, was the probability for just one of the 107 words. For the rest, I calculated the probabilities by following the same steps. Here is a Bayes’ box for the 10 words with the highest probability:
TABLE 3: Bayes’ Box for Top 10 Words in Semantics

WORD      FREQUENCY  PRIOR (USAGE RATIO)  LIKELIHOOD   PRIOR × LIKELIHOOD  POSTERIOR
still     791726     0.090240405          0.124828539  0.011264578         0.090606965
state     577192     0.065787962          0.154553424  0.010167755         0.081784635
states    396934     0.045242274          0.158016194  0.007149012         0.057503289
story     319852     0.036456519          0.163194227  0.005949493         0.047854926
start     275954     0.031453054          0.143378969  0.004509706         0.036273957
students  383366     0.043695803          0.079600173  0.003478193         0.027976952
stop      270980     0.030886121          0.109273747  0.003375042         0.027147251
started   234505     0.026728725          0.119617919  0.003197234         0.025717049
study     261496     0.029805141          0.105523603  0.003145146         0.025298073
strong    152080     0.017333978          0.134468701  0.002330877         0.018748482
As you can see in the table above, different parameters affect the posterior probability. Each word has its own characteristics, and with the use of Bayesian Statistics I was
able to assess these different aspects together. This is just a small portion of the table; the full version is included in the appendix.
However, I still lacked enough information to make my suggestion more accurate. Language has another aspect, syntax, and I could reach the correct result only with an algorithm that incorporates multiple aspects of language.
3.3. SYNTACTIC APPROPRIATENESS
Syntax is a subfield of linguistics focused on the arrangement of words in a sentence or a paragraph. In language, words should follow each other in the correct order to convey a proper meaning. Syntax essentially studies this, and our behavior in placing words in order.
In my case, I used syntax as another aspect of my algorithm. Since I was trying to find the correct word suggestion using statistics alone, I hoped that more statistics from different areas would increase the accuracy of my prediction. My suggestion had to be consistent with “twenty”, so I included in my research the usage frequency of each word starting with “st” that follows “twenty”. However, I did not need to calculate this conditional probability myself using Bayes’ theorem: the iWeb Corpus already contains this data, so I included it directly. I used only the words common to the top 100 most used “twenty st” phrases and the semantic calculations, which means 57 words. Here is a small sample:
TABLE 4: The 10 Most Used Words Starting with “st” After “twenty”

WORD             FREQUENCY  USAGE RATIO
twenty students  532        0.303306727
twenty states    366        0.208665906
twenty steps     140        0.07981756
twenty stories   134        0.076396807
twenty studies   50         0.028506271
twenty straight  39         0.022234892
twenty staff     34         0.019384265
twenty state     31         0.017673888
twenty stores    31         0.017673888
twenty standard  29         0.016533637
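For illustration, the usage ratios in Table 4 follow from dividing each frequency by the combined frequency of all tracked phrases. A sketch using only the ten sample rows; the implied total of about 1,754 occurrences is an assumption inferred from the ratios, since the full list is not shown:

```python
# Frequencies of "twenty + st-word" phrases from Table 4 (sample only).
frequencies = {
    "students": 532, "states": 366, "steps": 140, "stories": 134,
    "studies": 50, "straight": 39, "staff": 34, "state": 31,
    "stores": 31, "standard": 29,
}
# The full set of tracked phrases appears to total ~1754 occurrences
# (inferred: 532 / 0.303306727 ≈ 1754); this total is an assumption.
TOTAL = 1754
ratios = {w: f / TOTAL for w, f in frequencies.items()}
print(round(ratios["students"], 9))  # ≈ 0.303306727
```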
4. COMBINING THE DATA
As we can see from the sample tables, the correct word, “states”, is not the first option in any single category. If we evaluated each category on its own, we could make multiple suggestions to complete the phrase “st”, but all of them would be wrong. Hence, I needed to combine the statistics from all areas and reach one correct answer. Each statistic was gathered in a different way and offers a different perspective. The way to obtain the correct result was to combine these different views into one simple mathematical index. I created this index by multiplying the Jaccard Similarity Index, the semantic posterior probability, and the syntactic probability, and normalizing the product into a number out of 1. The final table includes the 57 words that are popular in both the semantic and the syntactic statistics.
TABLE 5: Final List of Suggestions

WORD         JACCARD SIMILARITY  SEMANTIC   SYNTACTIC  PRODUCT    FINAL INDEX  PERCENTAGE INDEX
states       0.5                 0.0575     0.2087     0.0120     0.4355       43.5547
students     0.333               0.0280     0.3033     0.0085     0.3080       30.8016
state        0.5                 0.0818     0.0177     0.0014     0.0525       5.2468
stories      0.333               0.0157     0.0764     0.0012     0.0436       4.3627
story        0.4                 0.0479     0.0143     0.0007     0.0248       2.4759
…
stands       0.4                 0.0049975  0.0017104  0.0000085  0.0003103    0.0310264
styles       0.4                 0.0018409  0.0039909  0.0000073  0.0002667    0.0266684
stones       0.4                 0.0018290  0.0039909  0.0000073  0.0002650    0.0264957
steel        0.5                 0.0023351  0.0028506  0.0000067  0.0002416    0.0241620
studying     0.25                0.0026981  0.0022805  0.0000062  0.0002233    0.0223349
striking     0.2857              0.0024579  0.0022805  0.0000056  0.0002035    0.0203466
strategic    0.25                0.0031492  0.0017104  0.0000054  0.0001955    0.0195516
stayed       0.333               0.0035177  0.0011403  0.0000040  0.0001456    0.0145598
statistical  0.333               0.0025615  0.0011403  0.0000029  0.0001060    0.0106021
stroke       0.333               0.0015650  0.0017104  0.0000027  0.0000972    0.0097160
stairs       0.4                 0.0013972  0.0011403  0.0000016  0.0000578    0.0057830
The final table is a predictive text list in which every possible word is attached to an index. This index shows how probable each word is compared with all the words in the list. As we can see from the final table, I managed to predict the word correctly. According to my index, “states” became the most appropriate suggestion to complete the word, with an index number of 0.4355, about 41% higher than the second option, “students”. This shows how strong conditional probability and Bayesian Statistics are; they led me to the correct result. Even though I did not use Bayes’ rule while calculating the syntactic value, since its conditional frequency had already been put into a data set, I used conditional probability throughout and, where possible, Bayesian Statistics.
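A sketch of the combination step using Table 5’s own figures. One caveat: the prose above says the Jaccard similarity is one of the multiplied factors, but the tabulated PRODUCT column matches the semantic and syntactic values alone, so that is what this sketch reproduces; the normalizing total over all 57 words is back-computed from the table and is an assumption:

```python
# Semantic and syntactic values for three of the 57 words (Table 5).
scores = {
    "states":   (0.0575, 0.2087),
    "students": (0.0280, 0.3033),
    "state":    (0.0818, 0.0177),
}
products = {w: sem * syn for w, (sem, syn) in scores.items()}

# Normalizer: sum of products over all 57 words, inferred from the
# table as 0.0120 / 0.4355 ≈ 0.0276 (assumption, back-computed).
TOTAL_PRODUCT = 0.0120 / 0.4355

for word, product in products.items():
    final_index = product / TOTAL_PRODUCT
    print(f"{word}: {final_index:.4f} ({final_index * 100:.2f}%)")
# states ≈ 0.4355, students ≈ 0.3082, state ≈ 0.0525
```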
Though linguistics and mathematics seem very different, the mathematical interpretation of linguistics can be quite accurate. Bayesian statistics is one of the strongest tools in mathematics, and it becomes very effective when used correctly. In my research, it was one of my two main tools, together with the Jaccard Index, and it played a huge role in making the correct prediction.
5. CONCLUSION
In this paper, I tried to build a predictive text list that suggests the correct word to complete the “st” phrase in a random paragraph. To do that, I used the Jaccard Similarity Index and Bayesian Statistics. I ended up with a list containing 57 words, and I managed to predict the word correctly. While doing so, I tried to use purely statistical methods and data to avoid subjectivity. Especially when using Bayes’ theorem, which is often regarded as subjective, I based all my decisions entirely on data.
Before choosing my Extended Essay topic, I was sure that I wanted to do research that combines computer science and mathematics. Since I aim to develop myself in mathematical computer science and Artificial Intelligence, Natural Language Processing seemed very attractive to me: it is highly linked with statistics, and the mathematical background it contains is fascinating. I also wanted research that I could develop further in the future. These considerations led me to this research question.
However, I faced some challenges while doing the background calculations and putting them into written form. Even though the mathematics in this paper demanded great effort by its nature, since it requires processing big data sets, translating it into text was demanding as well. Creating a prediction index from zero requires correctly building logical links between the different variables in the data. Still, I learned a great deal from this research about statistics and the vast areas in which it can be used effectively. It widened my perspective on the applications of mathematics and helped me in making career plans as a computer scientist specializing in data engineering.
6. BIBLIOGRAPHY
• Brewer, Brendon J. STATS 331: Introduction to Bayesian Statistics. University of Auckland.
• Routledge, Richard. Bayes's theorem. Encyclopedia Britannica. https://www.britannica.com/topic/Bayess-theorem Retrieval Date: February 15, 2021.
• Schoot, Rens & Kaplan, David & Denissen, Jaap & Asendorpf, Jens & Neyer, Franz & Aken, Marcel. (2013). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. (p. 2). Child Development, 85. 10.1111/cdev.12169.
• Chomsky, Noam. (2002). Syntactic Structures. Berlin: Mouton de Gruyter.
• Berman, Ari. (2012, January 31). How the GOP Is Resegregating the South. The Nation. https://www.thenation.com/article/archive/how-gop-resegregating-south/ Retrieval Date: December 7, 2020.
• Brownlee, Jason. (2017, September 22). What Is Natural Language Processing? Machine Learning Mastery. https://machinelearningmastery.com/natural-language-processing/ Retrieval Date: December 7, 2020.
• Paul Jaccard. Wikipedia. https://en.wikipedia.org/wiki/Paul_Jaccard Retrieval Date: December 7, 2020.
• Sieg, Adrien. (2019, November 13). Text similarities: Estimate the degree of similarity between two texts. https://medium.com/@adriensieg/text-similarities-da019229c894 Retrieval Date: November 20, 2020.
• An Intuitive (and Short) Explanation of Bayes' Theorem. BetterExplained. https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Retrieval Date: January 28, 2021.
• Mahendru, Khyati. (2019, June 13). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2019/06/introduction-powerful-bayes-theorem-data-science/ Retrieval Date: December 3, 2020.
• Glen, Stephanie. (2020, September 16). Jaccard Index / Similarity Coefficient. Statistics How To. https://www.statisticshowto.com/jaccard-index/ Retrieval Date: January 3, 2021.

DATABASES:
• https://www.english-corpora.org/coca/ Retrieval Date: February 16, 2021.
• https://www.wordfrequency.info/samples.asp Retrieval Date: February 16, 2021.
• https://www.english-corpora.org/iweb/ Retrieval Date: February 17, 2021.
• https://github.com/dwyl/english-words Retrieval Date: December 12, 2020.

PROGRAMS:
• Visual Studio 2019
• Python 3.7
7. APPENDICES
FIGURE 3: Code for Finding Words Starting with “st”
FIGURE 4: Code for Calculating Jaccard Index
FIGURE 5: Code for Calculating Jaccard Distance
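The code in Figures 3–5 was included as screenshots that are not reproduced here. The following is a minimal reconstruction consistent with the steps described in the essay, not the original code; it assumes the dwyl word list has been saved locally as words.txt, one word per line:

```python
# Hypothetical reconstruction of Figures 3-5 (original screenshots lost).

def jaccard_index(a: str, b: str) -> float:
    """Figure 4: Jaccard similarity of two words as sets of letters."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def jaccard_distance(a: str, b: str) -> float:
    """Figure 5: Jaccard distance as the complement of similarity."""
    return 1 - jaccard_index(a, b)

# Figure 3: find the words starting with "st" (4886 in the essay's list).
with open("words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]
st_words = [w for w in words if w.startswith("st")]
print(len(st_words))

# Rank candidates by their similarity to the typed fragment "st".
ranked = sorted(st_words, key=lambda w: jaccard_index("st", w), reverse=True)
for w in ranked[:10]:
    print(w, jaccard_index("st", w), jaccard_distance("st", w))
```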
TABLE 6: General and Genre-Specific Frequencies of the Words in the Paragraph

WORD FREQ BLOG WEB TV/MOVIES SPOKEN FICTION MAGAZINE NEWS ACADEMIC
nationwide 15733 1546 1889 214 1911 106 3216 5059 1792
republican 124514 21658 20867 904 43517 475 10889 22121 4080
have 5025573 781709 687895 820686 879668 423071 503134 522808 406599
a 21889251 2783458 2827106 2519099 2716641 2749208 3104298 2959649 2229222
major 196857 24133 25756 10307 24642 6983 31077 36638 37315
advantage 55691 9484 8721 3212 5275 3253 8873 7923 8949
in 16560377 2003430 2257672 1225718 2020330 1671503 2310522 2355671 2699192
heading 28118 3089 2997 4616 3420 5655 3519 3822 1000
into 1461816 166362 180584 116756 148250 307485 226799 171402 144177
the 50074257 6272412 7101104 3784652 5769026 6311500 6805845 6582642 7447070
november 87176 24573 22587 1138 7249 2224 10651 11000 7733
elections 39380 7212 6709 352 8261 232 3155 7398 6061
party 243697 39715 35760 31649 43730 15044 24195 33310 20277
controls 26347 3131 3786 1272 1766 1639 4708 2995 7049
process 220128 31106 33489 5266 26450 5496 27362 22973 67985
twenty 36338 2939 3815 2718 3313 14068 3807 971 4707
redistricting 1724 308 424 4 202 4 104 562 116
TOTAL 96086977 12176265 13221161 8528563 11703651 11517946 13082154 12746944 13093324
TABLE 7: Bayes’ Box for All Words Used in Semantics Calculations
WORD       FREQUENCY  PRIOR (USAGE RATIO)  LIKELIHOOD   PRIOR × LIKELIHOOD  POSTERIOR
still      791726     0.090240405          0.124828539  0.011264578         0.090606965
state      577192     0.065787962          0.154553424  0.010167755         0.081784635
states     396934     0.045242274          0.158016194  0.007149012         0.057503289
story      319852     0.036456519          0.163194227  0.005949493         0.047854926
start      275954     0.031453054          0.143378969  0.004509706         0.036273957
students   383366     0.043695803          0.079600173  0.003478193         0.027976952
stop       270980     0.030886121          0.109273747  0.003375042         0.027147251
started    234505     0.026728725          0.119617919  0.003197234         0.025717049
study      261496     0.029805141          0.105523603  0.003145146         0.025298073
strong     152080     0.017333978          0.134468701  0.002330877         0.018748482
stay       203720     0.023219871          0.093201453  0.002164126         0.017407209
street     189237     0.021569108          0.09668828   0.00208548          0.016774619
stuff      153066     0.017446361          0.118713496  0.002071119         0.016659103
step       128479     0.014643951          0.139283463  0.00203966          0.016406067
stories    112939     0.012872712          0.151940428  0.001955885         0.015732222
stand      138407     0.015775538          0.120716438  0.001904367         0.01531783
statement  86767      0.009889645          0.183952424  0.001819224         0.014632984
studies    136505     0.01555875           0.111592982  0.001736247         0.013965556
student    147156     0.016772743          0.100220175  0.001680967         0.01352091
standard   86472      0.009856021          0.163266722  0.00160916          0.012943328