EXTENDED ESSAY – MATHEMATICS
BUILDING A PREDICTIVE TEXT LIST USING JACCARD INDEX AND BAYESIAN STATISTICS
RESEARCH QUESTION
How can I build a predictive text list for an incomplete word in a paragraph by using the Jaccard Index and Bayesian Statistics?
Word Count: 3852
TABLE OF CONTENTS
1. INTRODUCTION
1.1. RATIONALE
1.2. AIM OF THE STUDY AND APPROACH
2. WORD SIMILARITY COMPARISON
2.1. WHAT IS JACCARD INDEX AND JACCARD DISTANCE
2.1.1. Jaccard Index
2.1.2. Jaccard Distance
2.2. FINDING SIMILAR WORDS
2.2.1. Why Jaccard Index and Jaccard Distance
2.2.2. Applying Jaccard Index
3. TAKING SEMANTICS AND SYNTAX INTO ACCOUNT
3.1. BAYESIAN STATISTICS
3.1.1. What is Bayes’ Theorem
3.1.2. Differences Between Bayesian Statistics and Classical Statistics
3.1.3. Why Bayesian Statistics Instead of Classical Statistics
3.2. USING BAYES TO FIND SEMANTIC VALUES
3.2.1. What is Semantic Value and How it is Calculated
3.2.2. Using Bayes’ Theorem in Semantics
3.3. SYNTACTIC APPROPRIATENESS
4. COMBINING THE DATA
5. CONCLUSION
6. BIBLIOGRAPHY
7. APPENDICES
1. INTRODUCTION
1.1. RATIONALE
As personal computers have become widespread, our writing has moved to digital systems. Nowadays, instead of a pen, most people use a computer or smartphone to write texts on a variety of topics. However, typing still takes a lot of time. To enhance the quality and speed of our writing, computer scientists and engineers develop various algorithms.
I am interested in linguistics and data science. While doing research for my Mathematics Extended Essay, I came across a subfield of Artificial Intelligence: Natural Language Processing, which is, in essence, the processing of text or speech in a human language by software.[1] I asked myself, “Why not combine linguistics and computer science in your Extended Essay?” I decided to create a basic algorithm that uses the power of mathematics and computing to process a text as it is being written and make the best predictions for the upcoming word while it is typed. Computer scientists have worked on this problem for a long time, but there is still remarkable room for improvement. By using the power of statistics, we can go further in processing texts and make more accurate predictions.
1.2. AIM OF THE STUDY AND APPROACH
In this study, I aimed to develop a mathematical algorithm that makes the most accurate predictions for the upcoming word. Unlike most other algorithms, I did not only consider letter-by-letter similarities between words; I also considered the semantic and syntactic value of words in context. By using massive word databases, I tried to find the best suggestion for an unfinished word.
[1] Brownlee, J. (2017, September 22). What Is Natural Language Processing? Machine Learning Mastery. https://machinelearningmastery.com/natural-language-processing/ Retrieval Date: December 7, 2020.
To do that, I used the Jaccard Index and Bayesian Statistics to create a mathematical index that indicates the accuracy of a prediction. In my study, I worked on a randomly picked article: using a website, I chose a piece of a random article and cut it into parts.[2] Then I altered it so that it appears to be in the middle of being written. Here is the original piece that I took from the article:
“Nationwide, Republicans have a major advantage in redistricting heading into the November
elections. The party controls the process in twenty states, including key swing states like
Florida, Ohio, Michigan, Virginia, and Wisconsin, compared with seven for Democrats (the
rest are home to either a split government or independent redistricting commissions).”[3]
I cut a sentence off in the middle of it. Here is the version that I will study:
“Nationwide, Republicans have a major advantage in redistricting heading into the
November elections. The party controls the process in twenty st…”
By using mathematical and statistical tools, I tried to find the best suggestion for the word that is currently being typed. Of course, this study requires computing power, so I used several computer programs, tools, and pieces of code, which are included in the appendix. Expanded versions of the tables are also included in the appendix.
For the statistical work, I used two databases for the frequencies and genre-specific frequencies of words. Before determining that my paragraph belongs to a web page, I used COCA (the Corpus of Contemporary American English),[4] which contains more than 1 billion words from 8 different genres, to obtain fairer results. However, after finding that my paragraph is part of a web page,
[2] Website I used for picking a random article: https://longform.org/random
[3] Berman, A. (2012, January 31). How the GOP Is Resegregating the South. The Nation. https://www.thenation.com/article/archive/how-gop-resegregating-south/ Retrieval Date: December 7, 2020.
[4] https://www.english-corpora.org/coca/. Retrieval Date: February 16, 2021.
I used the iWeb Corpus,[5] which contains more than 14 billion words from 22 million web pages, to work with statistics focused only on web pages.
2. WORD SIMILARITY COMPARISON
2.1. WHAT IS JACCARD INDEX AND JACCARD DISTANCE
2.1.1. Jaccard Index
The Jaccard Similarity Index is a method for comparing two sets of data. It was invented by a Swiss professor of botany, Paul Jaccard.[6] It is a measure of similarity for two finite data sets and is expressed in a range of 0% to 100%; as the similarity increases, the percentage value increases.
The Jaccard Similarity Index compares the shared elements of the sets with their combined elements. Mathematically, to calculate the Jaccard Similarity Index for two finite sets, we divide the number of common elements (the intersection of the two sets) by the total number of elements (the union of the two sets). In mathematical notation, the Jaccard Similarity Index is expressed as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Or, equivalently, using $|A \cup B| = |A| + |B| - |A \cap B|$:

$$J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
In a Venn diagram, the Jaccard Index can be shown like this:

FIGURE 1: Venn representation of Jaccard Index

[5] https://www.english-corpora.org/iweb/. Retrieval Date: February 17, 2021.
[6] Paul Jaccard. Wikipedia. https://en.wikipedia.org/wiki/Paul_Jaccard Retrieval Date: December 7, 2020.
2.1.2. Jaccard Distance

Jaccard Distance is also a method for comparing two sets of data. However, unlike the Jaccard Index, which measures how similar the sets are, Jaccard Distance measures how dissimilar they are. It is also expressed in the range of 0% to 100%, but since it measures dissimilarity, the value increases as the similarity between the two sets decreases.

In Jaccard Distance, 100% or 1.00 means the sets are completely dissimilar, while 0% or 0.00 means the sets are equal to each other.
$$D(A, B) = 1 - J(A, B)$$

$$D(A, B) = 1 - \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

FIGURE 2: Venn representation of Jaccard Distance
2.2. FINDING SIMILAR WORDS
2.2.1. Why Jaccard Index and Jaccard Distance
There are several reasons why I picked the Jaccard Index for measuring the similarity of words. First of all, in the Jaccard Index, words are treated as sets of letters. This gave me the opportunity to compare words letter by letter, which is an essential feature for me. Since I base my prediction on letters that have already been typed, the Jaccard Index narrows the circle of possible words. Secondly, the Jaccard Index is widely used in computer science, and specifically in Python, the programming language that I used in this study. This gave me great flexibility, because English contains hundreds of thousands of words, and comparing the letters of each word with the characters typed so far is nearly impossible without a computer algorithm.
2.2.2. Applying Jaccard Index
To apply the Jaccard Similarity Index to words, we must first define the sets. For example, if we want to calculate the similarity index between two random words, “soldier” and “laborer”, we first define the two sets $A = \{s, o, l, d, i, e, r\}$ and $B = \{l, a, b, o, r, e\}$. Then we apply the Jaccard formula:

$$J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

$$J(A, B) = \frac{|\{l, o, e, r\}|}{|\{s, o, l, d, i, e, r, a, b\}|}$$

$$J(A, B) = \frac{4}{9} \approx 0.44$$
Since we have calculated these sets’ Jaccard Index, we can calculate their dissimilarity (Jaccard Distance) with the basic formula:

$$D(A, B) = 1 - \frac{|\{l, o, e, r\}|}{|\{s, o, l, d, i, e, r, a, b\}|} = 1 - \frac{4}{9}$$

$$D(A, B) \approx 0.56$$
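As a quick check, here is a minimal Python sketch (not the essay's original code, which appears in the appendix) that reproduces these values:

```python
# A minimal sketch reproducing the soldier/laborer example; words are
# treated as sets of letters, so duplicates are discarded automatically.
def jaccard_index(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def jaccard_distance(a: str, b: str) -> float:
    return 1 - jaccard_index(a, b)

print(jaccard_index("soldier", "laborer"))     # 4/9 = 0.444...
print(jaccard_distance("soldier", "laborer"))  # 5/9 = 0.555...
```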
English contains a great many words, so to find candidate completions for the word that is currently being typed, we need to use multiple methods. The first of them is comparing the fragment of the word already written with each probable word. There is no consensus on a complete list of English words, but I used an online source that contains more than 307,113 words.[7]
Using Python code, I first found the number of words starting with the letters “s” and “t” in the set of 307,113 English words. Assuming the upcoming word starts with these
letters in the order “st”, I picked only the words that contain these letters in the correct order. That left 4886 words in total.

[7] https://github.com/dwyl/english-words Retrieval Date: December 12, 2020.
After that, again using Python code, I calculated the Jaccard Similarity Index and Jaccard Distance of each of these words with the phrase “st”. This may give the impression that shorter words are more likely to be suggested, but that is not the case: since words are regarded as sets, repeated letters do not affect the similarity of words. Here is a real example that I encountered:
$$J(\text{st}, \text{stab}) = J(\text{st}, \text{staatsrat})$$

$$\frac{|\{s, t\}|}{|\{s, t, a, b\}|} = \frac{|\{s, t\}|}{|\{s, t, a, r\}|} = 0.5$$
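This behaviour can be seen directly in Python, where converting a string to a set discards repeated letters (a small illustrative snippet, not from the essay):

```python
# Sets keep each letter only once, so a long word with repeated letters
# can score the same as a short one against the typed fragment "st".
print(set("stab"))       # {'s', 't', 'a', 'b'}
print(set("staatsrat"))  # {'s', 't', 'a', 'r'} -- repetitions collapse
```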
Here are the Jaccard Similarity and Jaccard Distance of the English words, among those starting with the phrase “st”, whose similarity index is higher than or equal to 0.66:
TABLE 1: Jaccard Similarity and Distance of Words Starting with “st”

WORD   JACCARD SIMILARITY   JACCARD DISTANCE
st     1.0000               0.0000
sta    0.6667               0.3333
stat   0.6667               0.3333
stats  0.6667               0.3333
std    0.6667               0.3333
stet   0.6667               0.3333
stets  0.6667               0.3333
stg    0.6667               0.3333
sty    0.6667               0.3333
stk    0.6667               0.3333
stm    0.6667               0.3333
stoot  0.6667               0.3333
stoss  0.6667               0.3333
stot   0.6667               0.3333
stott  0.6667               0.3333
str    0.6667               0.3333
stu    0.6667               0.3333
stuss  0.6667               0.3333
stut   0.6667               0.3333
As you can see, there is no correlation between the length of a word and its similarity to the phrase “st”.
This table gave me a chance to make a prediction about the word that is currently being typed, but since there is a total of 4886 words that start with the “st” phrase, this prediction may be misleading for several reasons. First, more than one word has the same similarity ratio. Second, even though these suggestions gave a general outlook, they were not sufficient once I took the language’s own characteristics into account. In language, words must be coherent and follow grammatical rules at the same time.

To overcome this difficulty and make my suggestion more accurate, I developed a new model that takes semantic values into account by using Bayesian Statistics.
3. TAKING SEMANTICS AND SYNTAX INTO ACCOUNT
3.1. BAYESIAN STATISTICS
3.1.1. What is Bayes’ Theorem
Bayes’ theorem is a way of calculating probability with the help of prior knowledge; in other terms, it is one of the tools of conditional probability. It was formulated by Thomas Bayes, a Presbyterian minister and mathematician who lived during the 18th century. However, it was published in 1763, after Bayes’ death, when the theorem was discovered among his notes.[8] As an application of conditional probability, Bayes’ Theorem aims to find the probability of an event in light of relevant prior knowledge. It is a way of calculating the probability of an event or situation given that related knowledge is true. Its formula is as follows:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Or in the more widely used form:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$
In words, this equation finds the probability of A given that B is true, which is called the posterior in Bayesian terminology. To do that, it multiplies the probability of B given that A is true (the likelihood) by the probability of A (the prior), and divides the product by the probability of B (the marginal likelihood).

As you can see, Bayes’ Theorem uses three main components: 1. likelihood, 2. prior, 3. marginal likelihood. The likelihood is the probability that we would get if the hypothesis were
[8] Routledge, R. Bayes's theorem. Encyclopedia Britannica. https://www.britannica.com/topic/Bayess-theorem Retrieval Date: February 15, 2021.
true. Basically, to find it, we assume that the hypothesis is true and calculate the probability of the event happening. The prior is the main feature of Bayesian Statistics: it is our estimate of how probable the hypothesis is. If there are multiple hypotheses, the sum of the priors must equal 1. The marginal likelihood is the overall probability of the observation, whether the hypotheses are true or not. Its formula is as follows:
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \times P(A_i)$$
And the final outcome of Bayes’ rule, $P(A \mid B)$ in our formula, is called the posterior.
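To make the three components concrete, here is a minimal Python sketch with purely illustrative numbers (not taken from the essay’s data):

```python
# Illustrative only: two competing hypotheses with priors summing to 1.
priors = [0.7, 0.3]        # P(A1), P(A2)
likelihoods = [0.2, 0.6]   # P(B|A1), P(B|A2)

# Marginal likelihood by the law of total probability:
# P(B) = sum_i P(B|A_i) * P(A_i)
marginal = sum(l * p for l, p in zip(likelihoods, priors))  # 0.32

# Posterior for the first hypothesis: P(A1|B) = P(B|A1) * P(A1) / P(B)
posterior = likelihoods[0] * priors[0] / marginal
print(posterior)  # 0.4375
```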
3.1.2. Differences Between Bayesian Statistics and Classical Statistics
There are two main schools of statistics: Classical statistics and Bayesian statistics. While Classical statistics is the product of frequentist methods, Bayesian statistics approaches probability as a subjective experience of uncertainty. The frequentist method relies on repeated experiments and interprets only the data set; in this method, the null hypothesis is assumed true. In contrast, Bayesian statistics relies on combining the data with prior knowledge. Unlike the frequentist method, the Bayesian approach does not need large data sets: prior knowledge about the hypothesis is used along with the experimental data, so it works with smaller data sets.
Prior knowledge is a game changer in statistics, and it is the main aspect that distinguishes these two approaches. While there is no concept of prior knowledge in Classical (frequentist) statistics, Bayesian statistics gives us the ability to include our prior opinion or knowledge about the hypothesis. However, the nature of this prior knowledge is not clearly defined; it may be completely subjective or objective, depending on the choice of the person doing the calculation. There are multiple ways of deciding on the prior, and they affect the posterior. Both informative and noninformative priors can be used in calculations, and the
posterior can vary greatly according to the type of prior. Because of that, Bayesian statistics is sometimes criticized for being subjective and lacking scientific certainty.
Nonetheless, in Bayesian statistics this is regarded as a richness, not a weakness. The frequentist approach ignores all past studies and surrounding effects; it focuses only on the data. Classical (frequentist) statistics supposes that “nothing is going on”; however, “always something is going on”.[9] Since we do not run our experiments in an environment completely free of outside effects, ignoring the surroundings may be misleading. This means that, contrary to popular belief, Bayesian statistics may sometimes be more accurate than Classical statistics. The key part, however, is how we decide on the prior.
3.1.3. Why Bayesian Statistics Instead of Classical Statistics
As I mentioned in the previous section, Bayesian statistics takes outside effects more into consideration. In Bayesian statistics, we do not need to rely only on the raw data we obtain from calculations; we can also use other resources. In language, the main outside resource is habit. Each person has different habits in their use of language, and to make more accurate predictions we should also use them in our calculations.

In modern systems that perform tasks similar to the algorithm I am working on, language habits are an integral part of word prediction. They store data about how many times the user has used each word and tend to suggest more frequently used words. They also learn the topics that the user writes about most and use them in their predictions. This gives them the ability to make more accurate predictions for specific users.
[9] Schoot, Rens & Kaplan, David & Denissen, Jaap & Asendorpf, Jens & Neyer, Franz & Aken, Marcel. (2013). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. (p. 2). Child Development, 85. 10.1111/cdev.12169.
Learning is endless, and this is where Bayesian statistics shows its strengths. One of the biggest advantages of Bayesian statistics is its openness to updating: we can update the prior as we learn more about the user’s habits. We do not only use data; we can also learn from the data. That gives us the ability to improve the accuracy of predictions as the user types more.

I do not have any user-related data, but if I did, the use of Bayesian statistics would outperform Classical statistics in these respects. Since I want this study to be open to further development, I chose Bayesian statistics, and I still benefitted from some of its features.
I make my predictions with the help of Bayesian Statistics while trying to guess which word the writer intended to write. This gives me the opportunity to include a prior, in our context the habit factor, in my calculations. Language is strongly linked to our writing habits; for example, among synonyms, each person has their own preference. So predictions based only on semantics may be misleading. To overcome this problem, we should integrate these two factors. This gives more accurate results especially when we have personal data, but I do not work with personal data, since I have no information about the writer of this text. Therefore, I use the general writing habits of English writers.
3.2. USING BAYES TO FIND SEMANTIC VALUES
3.2.1. What is Semantic Value and How it is Calculated
“Colorless green ideas sleep furiously.”[10] This is by far the most famous sentence in linguistics, created by Noam Chomsky. Even though it is grammatically correct, it does not have any meaning. With this example, Chomsky showed that there is no necessary bond between the
[10] Chomsky, N. (2002). Syntactic Structures. (p. 15). Berlin: Mouton de Gruyter.
grammatical structure of a language and its semantics. This is also true for our prediction algorithm: our suggestion should not only be grammatically correct but also semantically appropriate for the sentence.
To do that, I found the words that fit the general topic of the paragraph. We have 24 words before the “st” phrase. By finding the areas in which these words are generally used, I aimed to determine the general topic of the paragraph. After that, I checked which words would be appropriate suggestions.
3.2.2. Using Bayes’ Theorem in Semantics
What I wanted was actually pretty simple: finding a semantically proper suggestion for completing the “st” phrase. To do that, I first decided on the context of the words already written. COCA classifies words according to how frequently each is used in 8 main genres: blog posts, general web pages, TV and movie subtitles, spoken language, fiction, popular magazines, newspapers, and academic writings. I checked each word in the paragraph and reached the following statistical overview:
TABLE 2: Total Frequencies of the Paragraph’s Words in Each Genre

GENRE               TOTAL FREQUENCY
Blog posts          12,176,265
Web pages           13,221,161
Subtitles           8,528,563
Spoken language     11,703,651
Fiction             11,517,946
Magazines           13,082,154
Newspapers          12,746,944
Academic writings   13,093,324
This table shows how frequently the words in the paragraph are used in each category, in total. According to these numbers, it is clear that the main genre of our paragraph is web pages.
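A small sketch of this step, with the totals copied from Table 2 (the genre keys are shortened labels, not COCA’s official names):

```python
# Genre totals from Table 2; the genre with the highest summed
# frequency is taken as the paragraph's main genre.
genre_totals = {
    "blog posts": 12_176_265,
    "web pages": 13_221_161,
    "subtitles": 8_528_563,
    "spoken language": 11_703_651,
    "fiction": 11_517_946,
    "magazines": 13_082_154,
    "newspapers": 12_746_944,
    "academic writings": 13_093_324,
}
print(max(genre_totals, key=genre_totals.get))  # web pages
```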
It is reasonable to expect the correct suggestion to be among the more frequently used words. For that reason, I picked the 107 most frequently used words among those starting with “st”.[11] Then I formulated the probability of each of the 107 words being used in the web page genre. For example, the formula for one of the least frequently used words, “styles”, is as follows:
$$P(\text{styles} \mid \text{web page}) = \frac{P(\text{web page} \mid \text{styles}) \times P(\text{styles})}{P(\text{web page})}$$
According to this formula, to find the probability of “styles” being the next word given that the genre of the paragraph is web pages, we multiply the probability of the genre being web pages given that the text includes “styles” by the probability of “styles” being used, and divide the product by the probability of the genre of the paragraph being web pages.
Since I knew from COCA the probability of the genre being web pages given that the text includes “styles”, or in basic terms the probability of “styles” being used in web pages, I substituted it. For the main feature of Bayesian Statistics, the prior, I used the general usage ratio of “styles” among the 107 most popular words starting with “st”. With these substitutions, the formula became:
$$P(\text{styles} \mid \text{web page}) = \frac{0.126607818411097 \times 0.00180771229791463}{P(\text{web page})}$$
However, since I did not have any direct statistic for $P(\text{web page})$, I used the marginal likelihood formula:
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \times P(A_i) = 0.124323532630005$$

[11] I picked 107 words because that was the maximum number of words I could obtain from the COCA data set I was using.
With this addition, the formula became:

$$P(\text{styles} \mid \text{web page}) = \frac{0.126607818411097 \times 0.00180771229791463}{0.124323532630005}$$

And the result was:

$$P(\text{styles} \mid \text{web page}) = 0.0018409266975627$$
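The same computation can be checked with a few lines of Python, a minimal sketch using the three values given above:

```python
# Bayes' rule for the word "styles", with the figures quoted in the text.
likelihood = 0.126607818411097   # P(web page | styles), from COCA
prior = 0.00180771229791463      # usage ratio of "styles" among the 107 words
marginal = 0.124323532630005     # P(web page), from the marginal-likelihood sum

posterior = likelihood * prior / marginal
print(posterior)  # ≈ 0.0018409266975627
```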
This, though, was the probability for just one of the 107 words. For the rest, I calculated the probabilities by following the same steps. Here is a Bayes’ box for the 10 words with the highest probability:
TABLE 3: Bayes’ Box for Top 10 Words in Semantics

WORD      FREQUENCY  PRIOR (USAGE RATIO)  LIKELIHOOD   PRIOR × LIKELIHOOD  POSTERIOR
still     791726     0.090240405          0.124828539  0.011264578         0.090606965
state     577192     0.065787962          0.154553424  0.010167755         0.081784635
states    396934     0.045242274          0.158016194  0.007149012         0.057503289
story     319852     0.036456519          0.163194227  0.005949493         0.047854926
start     275954     0.031453054          0.143378969  0.004509706         0.036273957
students  383366     0.043695803          0.079600173  0.003478193         0.027976952
stop      270980     0.030886121          0.109273747  0.003375042         0.027147251
started   234505     0.026728725          0.119617919  0.003197234         0.025717049
study     261496     0.029805141          0.105523603  0.003145146         0.025298073
strong    152080     0.017333978          0.134468701  0.002330877         0.018748482
As you can see in the table above, different parameters affect the posterior probability. Each word has its own characteristics, and with the use of Bayesian Statistics I was
able to assess these different aspects together. This is just a small portion of the table; the full version is included in the appendix.
However, I still lacked enough information to make my suggestion more accurate. Language has another aspect, syntax, and I could reach the correct result only with an algorithm that incorporates multiple aspects of language.
3.3. SYNTACTIC APPROPRIATENESS
Syntax is a subfield of linguistics focused on the arrangement of words in a sentence or a paragraph. In language, words should follow each other in the correct order to convey a proper meaning. Syntax essentially studies this, and our behavior in placing words in order.
In my case, I used syntax as another aspect of my algorithm. Since I was trying to find the correct word suggestion using statistics alone, I hoped that more statistics from different areas would increase the accuracy of my prediction. My suggestion had to be consistent with “twenty”, so I included in my research the usage frequency of each word starting with “st” that follows “twenty”. However, I did not need to calculate this conditional probability myself using Bayes’ theorem: the iWeb Corpus already contains this data, so I included it directly. I used only the words common to the top 100 most used “twenty st” phrases and the semantic calculations, which means 57 words. Here is a small sample:
TABLE 4: The 10 Most Used Words Starting with “st” After “twenty”

WORD             FREQUENCY  USAGE RATIO
twenty students  532        0.303306727
twenty states    366        0.208665906
twenty steps     140        0.07981756
twenty stories   134        0.076396807
twenty studies   50         0.028506271
twenty straight  39         0.022234892
twenty staff     34         0.019384265
twenty state     31         0.017673888
twenty stores    31         0.017673888
twenty standard  29         0.016533637
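For illustration, the usage ratios in Table 4 follow from dividing each frequency by the combined frequency of all tracked phrases. A sketch using only the ten sample rows; the implied total of about 1,754 occurrences is an assumption inferred from the ratios, since the full list is not shown:

```python
# Frequencies of "twenty + st-word" phrases from Table 4 (sample only).
frequencies = {
    "students": 532, "states": 366, "steps": 140, "stories": 134,
    "studies": 50, "straight": 39, "staff": 34, "state": 31,
    "stores": 31, "standard": 29,
}
# The full set of tracked phrases appears to total ~1754 occurrences
# (inferred: 532 / 0.303306727 ≈ 1754); this total is an assumption.
TOTAL = 1754
ratios = {w: f / TOTAL for w, f in frequencies.items()}
print(round(ratios["students"], 9))  # ≈ 0.303306727
```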
4. COMBINING THE DATA
As we can see from the sample tables, the correct word, “states”, is not the first option in any single category. If we evaluated each category on its own, we could make multiple suggestions to complete the phrase “st”, but all of them would be wrong. Hence, I needed to combine the statistics from all areas and reach one correct answer. Each statistic was gathered in a different way and offers a different perspective. The way to obtain the correct result was to combine these different views into one simple mathematical index. I created this index by multiplying the Jaccard Similarity Index, the semantic posterior probability, and the syntactic probability, and normalizing the product into a number out of 1. The final table includes the 57 words that are popular in both the semantic and the syntactic statistics.
TABLE 5: Final List of Suggestions

WORD         JACCARD SIMILARITY  SEMANTIC   SYNTACTIC  PRODUCT    FINAL INDEX  PERCENTAGE INDEX
states       0.5                 0.0575     0.2087     0.0120     0.4355       43.5547
students     0.333               0.0280     0.3033     0.0085     0.3080       30.8016
state        0.5                 0.0818     0.0177     0.0014     0.0525       5.2468
stories      0.333               0.0157     0.0764     0.0012     0.0436       4.3627
story        0.4                 0.0479     0.0143     0.0007     0.0248       2.4759
…
stands       0.4                 0.0049975  0.0017104  0.0000085  0.0003103    0.0310264
styles       0.4                 0.0018409  0.0039909  0.0000073  0.0002667    0.0266684
stones       0.4                 0.0018290  0.0039909  0.0000073  0.0002650    0.0264957
steel        0.5                 0.0023351  0.0028506  0.0000067  0.0002416    0.0241620
studying     0.25                0.0026981  0.0022805  0.0000062  0.0002233    0.0223349
striking     0.2857              0.0024579  0.0022805  0.0000056  0.0002035    0.0203466
strategic    0.25                0.0031492  0.0017104  0.0000054  0.0001955    0.0195516
stayed       0.333               0.0035177  0.0011403  0.0000040  0.0001456    0.0145598
statistical  0.333               0.0025615  0.0011403  0.0000029  0.0001060    0.0106021
stroke       0.333               0.0015650  0.0017104  0.0000027  0.0000972    0.0097160
stairs       0.4                 0.0013972  0.0011403  0.0000016  0.0000578    0.0057830
The final table is a predictive text list in which every possible word is attached to an index. This index shows how probable each word is compared with all the words in the list. As we can see from the final table, I managed to predict the word correctly. According to my index, “states” became the most appropriate suggestion to complete the word, with an index number of 0.4355, about 41% higher than the second option, “students”. This shows how strong conditional probability and Bayesian Statistics are; they led me to the correct result. Even though I did not use Bayes’ rule while calculating the syntactic value, since its conditional frequency had already been put into a data set, I used conditional probability throughout and, where possible, Bayesian Statistics.
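A sketch of the combination step using Table 5’s own figures. One caveat: the prose above says the Jaccard similarity is one of the multiplied factors, but the tabulated PRODUCT column matches the semantic and syntactic values alone, so that is what this sketch reproduces; the normalizing total over all 57 words is back-computed from the table and is an assumption:

```python
# Semantic and syntactic values for three of the 57 words (Table 5).
scores = {
    "states":   (0.0575, 0.2087),
    "students": (0.0280, 0.3033),
    "state":    (0.0818, 0.0177),
}
products = {w: sem * syn for w, (sem, syn) in scores.items()}

# Normalizer: sum of products over all 57 words, inferred from the
# table as 0.0120 / 0.4355 ≈ 0.0276 (assumption, back-computed).
TOTAL_PRODUCT = 0.0120 / 0.4355

for word, product in products.items():
    final_index = product / TOTAL_PRODUCT
    print(f"{word}: {final_index:.4f} ({final_index * 100:.2f}%)")
# states ≈ 0.4355, students ≈ 0.3082, state ≈ 0.0525
```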
Though linguistics and mathematics seem very different, the mathematical interpretation of linguistics can be quite accurate. Bayesian statistics is one of the strongest tools in mathematics, and it becomes very effective when used correctly. In my research, it was one of my two main tools, together with the Jaccard Index, and it played a huge role in making the correct prediction.
5. CONCLUSION
In this paper, I tried to build a predictive text list that suggests the correct word to complete the “st” phrase in a random paragraph. To do that, I used the Jaccard Similarity Index and Bayesian Statistics. I ended up with a list containing 57 words, and I managed to predict the word correctly. While doing so, I tried to use purely statistical methods and data to avoid subjectivity. Especially when using Bayes’ theorem, which is often regarded as subjective, I based all my decisions entirely on data.
Before choosing my Extended Essay topic, I was sure that I wanted to do research that combines computer science and mathematics. Since I aim to develop myself in mathematical computer science and Artificial Intelligence, Natural Language Processing seemed very attractive to me: it is highly linked with statistics, and the mathematical background it contains is fascinating. I also wanted research that I could develop further in the future. These considerations led me to this research question.
However, I faced some challenges while doing the background calculations and putting them into written form. Even though the mathematics in this paper demanded great effort by its nature, since it requires processing big data sets, translating it into text was demanding as well. Creating a prediction index from zero requires correctly building logical links between the different variables in the data. Still, I learned a great deal from this research about statistics and the vast areas in which it can be used effectively. It widened my perspective on the applications of mathematics and helped me in making career plans as a computer scientist specializing in data engineering.
6. BIBLIOGRAPHY
• Brewer, Brendon J. STATS 331: Introduction to Bayesian Statistics. University of Auckland.
• Routledge, Richard. Bayes's theorem. Encyclopedia Britannica. https://www.britannica.com/topic/Bayess-theorem Retrieval Date: February 15, 2021.
• Schoot, Rens & Kaplan, David & Denissen, Jaap & Asendorpf, Jens & Neyer, Franz & Aken, Marcel. (2013). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. (p. 2). Child Development, 85. 10.1111/cdev.12169.
• Chomsky, Noam. (2002). Syntactic Structures. Berlin: Mouton de Gruyter.
• Berman, Ari. (2012, January 31). How the GOP Is Resegregating the South. The Nation. https://www.thenation.com/article/archive/how-gop-resegregating-south/ Retrieval Date: December 7, 2020.
• Brownlee, Jason. (2017, September 22). What Is Natural Language Processing? Machine Learning Mastery. https://machinelearningmastery.com/natural-language-processing/ Retrieval Date: December 7, 2020.
• Paul Jaccard. Wikipedia. https://en.wikipedia.org/wiki/Paul_Jaccard Retrieval Date: December 7, 2020.
• Sieg, Adrien. (2019, November 13). Text similarities: Estimate the degree of similarity between two texts. https://medium.com/@adriensieg/text-similarities-da019229c894 Retrieval Date: November 20, 2020.
• An Intuitive (and Short) Explanation of Bayes' Theorem. BetterExplained. https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Retrieval Date: January 28, 2021.
• Mahendru, Khyati. (2019, June 13). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2019/06/introduction-powerful-bayes-theorem-data-science/ Retrieval Date: December 3, 2020.
• Glen, Stephanie. (2020, September 16). Jaccard Index / Similarity Coefficient. Statistics How To. https://www.statisticshowto.com/jaccard-index/ Retrieval Date: January 3, 2021.

DATABASES:
• https://www.english-corpora.org/coca/ Retrieval Date: February 16, 2021.
• https://www.wordfrequency.info/samples.asp Retrieval Date: February 16, 2021.
• https://www.english-corpora.org/iweb/ Retrieval Date: February 17, 2021.
• https://github.com/dwyl/english-words Retrieval Date: December 12, 2020.

PROGRAMS:
• Visual Studio 2019
• Python 3.7
7. APPENDICES
FIGURE 3: Code for Finding Words Starting with “st”
FIGURE 4: Code for Calculating Jaccard Index
FIGURE 5: Code for Calculating Jaccard Distance
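The code in Figures 3–5 was included as screenshots that are not reproduced here. The following is a minimal reconstruction consistent with the steps described in the essay, not the original code; it assumes the dwyl word list has been saved locally as words.txt, one word per line:

```python
# Hypothetical reconstruction of Figures 3-5 (original screenshots lost).

def jaccard_index(a: str, b: str) -> float:
    """Figure 4: Jaccard similarity of two words as sets of letters."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def jaccard_distance(a: str, b: str) -> float:
    """Figure 5: Jaccard distance as the complement of similarity."""
    return 1 - jaccard_index(a, b)

# Figure 3: find the words starting with "st" (4886 in the essay's list).
with open("words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]
st_words = [w for w in words if w.startswith("st")]
print(len(st_words))

# Rank candidates by their similarity to the typed fragment "st".
ranked = sorted(st_words, key=lambda w: jaccard_index("st", w), reverse=True)
for w in ranked[:10]:
    print(w, jaccard_index("st", w), jaccard_distance("st", w))
```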
TABLE 6: General and Genre-Specific Frequencies of the Words in the Paragraph

WORD FREQ BLOG WEB TV/MOVIES SPOKEN FICTION MAGAZINE NEWS ACADEMIC
nationwide 15733 1546 1889 214 1911 106 3216 5059 1792
republican 124514 21658 20867 904 43517 475 10889 22121 4080
have 5025573 781709 687895 820686 879668 423071 503134 522808 406599
a 21889251 2783458 2827106 2519099 2716641 2749208 3104298 2959649 2229222
major 196857 24133 25756 10307 24642 6983 31077 36638 37315
advantage 55691 9484 8721 3212 5275 3253 8873 7923 8949
in 16560377 2003430 2257672 1225718 2020330 1671503 2310522 2355671 2699192
heading 28118 3089 2997 4616 3420 5655 3519 3822 1000
into 1461816 166362 180584 116756 148250 307485 226799 171402 144177
the 50074257 6272412 7101104 3784652 5769026 6311500 6805845 6582642 7447070
november 87176 24573 22587 1138 7249 2224 10651 11000 7733
elections 39380 7212 6709 352 8261 232 3155 7398 6061
party 243697 39715 35760 31649 43730 15044 24195 33310 20277
controls 26347 3131 3786 1272 1766 1639 4708 2995 7049
process 220128 31106 33489 5266 26450 5496 27362 22973 67985
twenty 36338 2939 3815 2718 3313 14068 3807 971 4707
redistricting 1724 308 424 4 202 4 104 562 116
TOTAL 96086977 12176265 13221161 8528563 11703651 11517946 13082154 12746944 13093324
TABLE 7: Bayes’ Box for All Words Used in Semantics Calculations
WORD       FREQUENCY  PRIOR (USAGE RATIO)  LIKELIHOOD   PRIOR × LIKELIHOOD  POSTERIOR
still      791726     0.090240405          0.124828539  0.011264578         0.090606965
state      577192     0.065787962          0.154553424  0.010167755         0.081784635
states     396934     0.045242274          0.158016194  0.007149012         0.057503289
story      319852     0.036456519          0.163194227  0.005949493         0.047854926
start      275954     0.031453054          0.143378969  0.004509706         0.036273957
students   383366     0.043695803          0.079600173  0.003478193         0.027976952
stop       270980     0.030886121          0.109273747  0.003375042         0.027147251
started    234505     0.026728725          0.119617919  0.003197234         0.025717049
study      261496     0.029805141          0.105523603  0.003145146         0.025298073
strong     152080     0.017333978          0.134468701  0.002330877         0.018748482
stay       203720     0.023219871          0.093201453  0.002164126         0.017407209
street     189237     0.021569108          0.09668828   0.00208548          0.016774619
stuff      153066     0.017446361          0.118713496  0.002071119         0.016659103
step       128479     0.014643951          0.139283463  0.00203966          0.016406067
stories    112939     0.012872712          0.151940428  0.001955885         0.015732222
stand      138407     0.015775538          0.120716438  0.001904367         0.01531783
statement  86767      0.009889645          0.183952424  0.001819224         0.014632984
studies    136505     0.01555875           0.111592982  0.001736247         0.013965556
student    147156     0.016772743          0.100220175  0.001680967         0.01352091
standard   86472      0.009856021          0.163266722  0.00160916          0.012943328