EXTENDED ESSAY – MATHEMATICS
BUILDING A PREDICTIVE TEXT LIST USING JACCARD INDEX AND BAYESIAN
STATISTICS
RESEARCH QUESTION
How can I build a predictive text list for an uncompleted word in a paragraph by using the
Jaccard Index and Bayesian Statistics?
Word Count: 3852
TABLE OF CONTENTS
1. INTRODUCTION
1. 1. RATIONALE
1. 2. AIM OF THE STUDY AND APPROACH
2. WORD SIMILARITY COMPARISON
2. 1. WHAT IS JACCARD INDEX AND JACCARD DISTANCE
2. 1. 1. Jaccard Index
2. 1. 2. Jaccard Distance
2. 2. FINDING SIMILAR WORDS
2. 2. 1. Why Jaccard Index and Jaccard Distance
2. 2. 2. Applying Jaccard Index
3. TAKING SEMANTICS AND SYNTAX INTO ACCOUNT
3. 1. BAYESIAN STATISTICS
3. 1. 1. What is Bayes’ Theorem
3. 1. 2. Differences Between Bayesian Statistics and Classical Statistics
3. 1. 3. Why Bayesian Statistics Instead of Classical Statistics
3. 2. USING BAYES TO FIND SEMANTIC VALUES
3. 2. 1. What is Semantic Value and How it is Calculated
3. 2. 2. Using Bayes’ Theorem in Semantics
3. 3. SYNTACTIC APPROPRIATENESS
4. COMBINING THE DATA
5. CONCLUSION
6. BIBLIOGRAPHY
7. APPENDICES
1. INTRODUCTION
1. 1. RATIONALE
As personal computers have become widespread, our writing process has moved to digital
systems. Nowadays, instead of a pen, most people use a computer or smartphone to write
texts on a variety of topics. Typing, however, is time-consuming. To improve the quality and
speed of our writing, computer scientists and engineers develop different algorithms.
I am interested in linguistics and data science. While doing research for my Mathematics
Extended Essay, I came across a subfield of Artificial Intelligence: Natural Language
Processing. In essence, it is the processing of text or speech in a human language by software.1
I asked myself, “Why not combine linguistics and computer science in my Extended Essay?”
I decided to create a basic algorithm that uses the power of mathematics and computing to
process a text during the writing process and make the best predictions for the upcoming word
while it is being typed. Computer scientists have been working on this problem for a long time,
but there is still remarkable room for improvement. By using the power of statistics, we can go
further in processing texts to make more accurate predictions.
1. 2. AIM OF THE STUDY AND APPROACH
In this study, I aimed to develop a mathematical algorithm that makes the most accurate
predictions for the upcoming word. Unlike most other algorithms, I did not only consider letter-
by-letter similarities between words; I also considered the semantic and syntactic value of
words in context. By using massive corpora, I tried to find the best suggestion for the word
being typed.
1 Brownlee, J. (2017, September 22). What Is Natural Language Processing? Machine Learning Mastery. https://machinelearningmastery.com/natural-language-processing/ Retrieval Date: December 7, 2020.
To do that, I used the Jaccard Index and Bayesian Statistics to create a mathematical
index that indicates the accuracy of each prediction. In my study, I worked on a randomly
picked article. Using a website, I picked a piece of a random article and cut it into parts.2
Then, I edited it so that it appears to be in the middle of being written. Here is the original
piece that I took from the article:
“Nationwide, Republicans have a major advantage in redistricting heading into the November
elections. The party controls the process in twenty states, including key swing states like
Florida, Ohio, Michigan, Virginia, and Wisconsin, compared with seven for Democrats (the
rest are home to either a split government or independent redistricting commissions).”3
I cut off a sentence in the middle of it. Here is the version that I will study:
“Nationwide, Republicans have a major advantage in redistricting heading into the
November elections. The party controls the process in twenty st…”
By using mathematical and statistical tools, I tried to find the best suggestion for the
word that is currently being typed. Of course, this study requires computing power, so I used
some computer programs, tools, and code. I included them as an appendix; expanded versions
of the tables are also included as an appendix.
In the statistical work, I used two databases for the overall and genre-specific frequencies
of words. Before determining that my paragraph belongs to a web page, I used COCA (the
Corpus of Contemporary American English),4 which contains more than 1 billion words from
8 different genres, to obtain fairer results. However, after finding that my paragraph is part of
a web page, I used the iWeb Corpus,5 which contains more than 14 billion words from 22
million web pages, to work with statistics focused only on web pages.

2 Website I used for picking a random article: https://longform.org/random
3 Berman, A. (2012, January 31). How the GOP Is Resegregating the South. The Nation. https://www.thenation.com/article/archive/how-gop-resegregating-south/ Retrieval Date: December 7, 2020.
4 https://www.english-corpora.org/coca/. Retrieval Date: February 16, 2021.
2. WORD SIMILARITY COMPARISON
2. 1. WHAT IS JACCARD INDEX AND JACCARD DISTANCE
2. 1. 1. Jaccard Index
The Jaccard Similarity Index is a method of comparing two sets of data. It was invented by a
Swiss professor of botany, Paul Jaccard.6 It is a measure of similarity for two finite data sets,
expressed in a range of 0% to 100%. As the similarity increases, the percentage value
increases.
The Jaccard Similarity Index is measured by comparing the intersection of the sets with
their union. Mathematically, to calculate the Jaccard Similarity Index for two finite sets,
we divide the number of common objects (the intersection of the two sets) by the total number
of objects (their union). In mathematical notation, the Jaccard Similarity Index is expressed
as follows:

𝐽(𝐴, 𝐵) = |𝐴 ∩ 𝐵| / |𝐴 ∪ 𝐵|

Or, in a simpler way:

𝐽(𝐴, 𝐵) = |𝐴 ∩ 𝐵| / (|𝐴| + |𝐵| − |𝐴 ∩ 𝐵|)
In a Venn diagram, the Jaccard Index can be shown as in Figure 1:
5 https://www.english-corpora.org/iweb/. Retrieval Date: February 17, 2021.
6 Paul Jaccard. Wikipedia. https://en.wikipedia.org/wiki/Paul_Jaccard Retrieval Date: December 7, 2020.
2. 1. 2. Jaccard Distance
Jaccard Distance is also a method of comparing two sets of data. However, unlike the
Jaccard Index, which measures how similar the sets are, the Jaccard Distance measures how
dissimilar they are. It is also expressed in the range of 0% to 100%, but since it measures
dissimilarity, the value increases as the similarity between the two sets decreases.

In Jaccard Distance, 100% or 1.00 means the sets are completely dissimilar, while 0% or
0.00 means the sets are equal to each other.

𝐷(𝐴, 𝐵) = 1 − 𝐽(𝐴, 𝐵)

𝐷(𝐴, 𝐵) = 1 − |𝐴 ∩ 𝐵| / (|𝐴| + |𝐵| − |𝐴 ∩ 𝐵|)
FIGURE 1: Venn representation of Jaccard Index
2. 2. FINDING SIMILAR WORDS
2. 2. 1. Why Jaccard Index and Jaccard Distance
There are several reasons why I picked the Jaccard Index for measuring the similarity
of words. First of all, in the Jaccard Index, words are treated as sets of letters. This gave
me the opportunity to compare words letter by letter, which is an essential feature for my
purposes. Since I base my prediction on the letters that have already been typed, the Jaccard
Index narrows the circle of possible words. Secondly, the Jaccard Index is widely used in
Computer Science and is easy to implement in Python, the programming language that I also
used in this study. This gave me unique flexibility, because English contains hundreds of
thousands of words, and comparing the letters of each word with the characters that have just
been typed is nearly impossible without a computer algorithm.
FIGURE 2: Venn Representation of Jaccard Distance
2. 2. 2. Applying Jaccard Index
To apply the Jaccard Similarity Index to words, we must first define the sets. For
example, if we want to calculate the similarity index between two random words, “soldier” and
“laborer”, we first define two sets as 𝐴 = {𝑠, 𝑜, 𝑙, 𝑑, 𝑖, 𝑒, 𝑟}, 𝐵 = {𝑙, 𝑎, 𝑏, 𝑜, 𝑟, 𝑒}. Then we apply
the Jaccard formula:
𝐽(𝐴, 𝐵) = |𝐴 ∩ 𝐵| / (|𝐴| + |𝐵| − |𝐴 ∩ 𝐵|)

𝐽(𝐴, 𝐵) = |{𝑙, 𝑜, 𝑒, 𝑟}| / |{𝑠, 𝑜, 𝑙, 𝑑, 𝑖, 𝑒, 𝑟, 𝑎, 𝑏}| = 4/9 ≈ 0.44
Since we have calculated these sets’ Jaccard Index, we can calculate their dissimilarity, the
Jaccard Distance, with a basic formula:

𝐷(𝐴, 𝐵) = 1 − |{𝑙, 𝑜, 𝑒, 𝑟}| / |{𝑠, 𝑜, 𝑙, 𝑑, 𝑖, 𝑒, 𝑟, 𝑎, 𝑏}| = 1 − 4/9 ≈ 0.56
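The worked example above can be verified with a short Python sketch (the function names are my own, not taken from the code in the appendix):

```python
def jaccard_index(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 − J(A, B)."""
    return 1 - jaccard_index(a, b)

# Words are treated as sets of letters.
a = set("soldier")  # {s, o, l, d, i, e, r}
b = set("laborer")  # {l, a, b, o, r, e}
print(round(jaccard_index(a, b), 2))     # → 0.44
print(round(jaccard_distance(a, b), 2))  # → 0.56
```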
English contains a great many words, so to find predictions for the word that is currently
being typed, we need to use multiple methods. The first of them is comparing the similarity
of the typed piece of the word with each probable word. There is no consensus on the full list
of English words, but I used an online source that contains more than 307,113 words.7

Using a Python script, I first found the words starting with the letters “s, t” in the set
of 307,113 English words. Assuming the upcoming word starts with these letters in the order
“st”, I picked only the words that contain these letters in the correct order. That left 4,886
words in total.

7 https://github.com/dwyl/english-words Retrieval Date: December 12, 2020.
After that, using a Python script again, I calculated the Jaccard Similarity Index and
Jaccard Distance of each of these words with the phrase “st”. That may give the impression
that shorter words are more likely to be suggested, but that is not the case. Since words are
regarded as sets, repeated letters do not affect the similarity of words. Here is a real example
that I encountered:
𝐽(𝑠𝑡, 𝑠𝑡𝑎𝑏) = 𝐽(𝑠𝑡, 𝑠𝑡𝑎𝑎𝑡𝑠𝑟𝑎𝑡)

|{𝑠, 𝑡}| / |{𝑠, 𝑡, 𝑎, 𝑏}| = |{𝑠, 𝑡}| / |{𝑠, 𝑡, 𝑎, 𝑟}| = 0.5
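A minimal Python sketch of this filtering-and-scoring step (the word list here is a tiny stand-in for the full 307,113-word file, and the function names are my own):

```python
def jaccard_index(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Stand-in for the full 307,113-word list loaded from the source file.
words = ["stab", "staatsrat", "story", "strong", "apple"]

# Keep only words beginning with the typed fragment "st", then score each
# against the fragment, treating both as sets of letters.
prefix = "st"
candidates = [w for w in words if w.startswith(prefix)]
scores = {w: jaccard_index(set(w), set(prefix)) for w in candidates}
print(scores["stab"], scores["staatsrat"])  # both 0.5: repeated letters vanish
```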
Here are the Jaccard Similarity and Jaccard Distance values of the English words whose
similarity index is greater than or equal to 0.66, among the words starting with the phrase “st”:
TABLE 1: Jaccard Similarity and Distance of Words Starting with “st”
WORD   JACCARD SIMILARITY   JACCARD DISTANCE
st     1.0000               0.0000
sta    0.6667               0.3333
stat   0.6667               0.3333
stats  0.6667               0.3333
std    0.6667               0.3333
stet   0.6667               0.3333
stets  0.6667               0.3333
stg    0.6667               0.3333
sty    0.6667               0.3333
stk    0.6667               0.3333
stm    0.6667               0.3333
stoot  0.6667               0.3333
stoss  0.6667               0.3333
stot   0.6667               0.3333
stott  0.6667               0.3333
str    0.6667               0.3333
stu    0.6667               0.3333
stuss  0.6667               0.3333
stut   0.6667               0.3333
As you can see, there is no correlation between the length of a word and its similarity to the
phrase “st”.

This table gave me a chance to make a prediction about the word that is currently being
typed, but since a total of 4,886 words start with the phrase “st”, this prediction may be
misleading for several reasons. First, more than one word has the same similarity ratio.
Second, even though these suggestions gave a rough shortlist, they were not sufficient once I
took the language’s own characteristics into account. In language, words must be coherent
and follow grammatical rules at the same time.

To overcome this difficulty and make my suggestion more accurate, I developed a new
model that takes semantic values into account by using Bayesian Statistics.
3. TAKING SEMANTICS AND SYNTAX INTO ACCOUNT
3. 1. BAYESIAN STATISTICS
3. 1. 1. What is Bayes’ Theorem
Bayes’ theorem is a way of calculating probability with the help of prior knowledge;
in other terms, it is one of the tools of conditional probability. It is named after Thomas Bayes,
a Presbyterian minister and mathematician who lived during the 18th century. It was
published in 1763, after Bayes’s death, when the theorem was discovered among his notes.8
As a branch of conditional probability, Bayes’ Theorem aims to find the probability of an
event in light of relevant prior knowledge; that is, it calculates the probability of an event
given that related information is known to be true. Its formula is as follows:
𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵) / 𝑃(𝐵)

Or, in the more widely used form:

𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴) × 𝑃(𝐴) / 𝑃(𝐵)
Verbally, this equation finds the probability of A given that B is true; this is called the
posterior in Bayesian terminology. To do that, it multiplies the probability of B given that A
is true (the likelihood) by the probability of A (the prior), and divides by the probability of B
(the marginal likelihood).

As you can see, three main components are used in Bayes’ Theorem: the likelihood, the
prior, and the marginal likelihood. The likelihood is the probability of the observed event if
the hypothesis is true: to find it, we assume the hypothesis holds and calculate the probability
of the event happening. The prior is the main feature of Bayesian Statistics; it is our estimate
of how probable the hypothesis is. If there are multiple hypotheses, the sum of the priors must
equal 1. The marginal likelihood is the overall probability of the observed event, whether the
hypotheses are true or not.

8 Routledge, R. Bayes’s theorem. Encyclopedia Britannica. https://www.britannica.com/topic/Bayess-theorem Retrieval Date: February 15, 2021.
Its formula is as follows:
𝑃(𝐵) = ∑ᵢ₌₁ⁿ 𝑃(𝐵|𝐴ᵢ) × 𝑃(𝐴ᵢ)

And the final outcome of Bayes’ Rule, 𝑃(𝐴|𝐵) in our formula, is called the posterior.
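A minimal sketch of these three components in Python, with purely illustrative numbers (not taken from the essay’s data):

```python
def posterior(priors, likelihoods, i):
    """P(A_i | B) = P(B | A_i) P(A_i) / sum_j P(B | A_j) P(A_j)."""
    # The marginal likelihood expands over all competing hypotheses.
    marginal = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / marginal

priors = [0.5, 0.5]       # two hypotheses; priors must sum to 1
likelihoods = [0.8, 0.2]  # P(B | A_1), P(B | A_2)
print(posterior(priors, likelihoods, 0))  # → 0.8
```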
3. 1. 2. Differences Between Bayesian Statistics and Classical Statistics
There are two main schools of statistics: Classical statistics and Bayesian statistics.
While Classical statistics is the product of frequentist methods, Bayesian statistics approaches
probability as a subjective experience of uncertainty. The frequentist method relies on repeated
experiments and interprets only the data set; in this method, the null hypothesis is assumed
true. Contrary to the frequentist method, Bayesian statistics relies on combining the data with
prior knowledge. Unlike the frequentist method, large data sets are not needed in the Bayesian
approach: prior knowledge about the hypothesis is used along with the experimental data, so
it works with smaller data sets.

Prior knowledge is a game changer in statistics, and it is the main aspect that distinguishes
these two approaches. While there is no concept of prior knowledge in Classical (frequentist)
statistics, Bayesian statistics gives us the ability to include our prior opinion or knowledge
about the hypothesis. However, the nature of this prior knowledge is not strictly defined; it
may be subjective or objective, depending on the choice of the person doing the calculation.
There are multiple ways of deciding on the prior, and they affect the posterior. Both
informative and noninformative priors can be used in calculations, and the posterior can vary
greatly according to the type of prior. Because of that, Bayesian statistics is sometimes
criticized for being subjective and lacking scientific certainty.
Nonetheless, in Bayesian statistics this is regarded as richness, not weakness. The
frequentist approach ignores all past studies and surrounding effects; it focuses only on the
data. Classical (frequentist) statistics supposes that “nothing is going on”; however, “always
something is going on”.9 Since we do not run our experiments in an environment completely
free of outside effects, ignoring the surroundings may be misleading. That means that, contrary
to popular belief, Bayesian statistics may sometimes be more accurate than Classical statistics.
The key part, however, is how we decide on the prior.
3. 1. 3. Why Bayesian Statistics Instead of Classical Statistics
As I mentioned in the previous section, Bayesian statistics takes outside effects into
consideration. In Bayesian statistics, we do not need to rely only on the raw data we obtain
from calculations; we can also use other resources. In language, the main outside resource is
habit. Each person has different habits in their use of language, and to make more accurate
predictions, we should also use them in our calculations.
In modern systems that perform tasks similar to the algorithm I am working on,
language habits are an integral part of word prediction. They store data about how many times
the user uses each word and tend to suggest more frequently used words. They also learn the
topics that the user writes about most and use them in their predictions. This gives them the
ability to make more accurate predictions for specific users.

9 Schoot, Rens & Kaplan, David & Denissen, Jaap & Asendorpf, Jens & Neyer, Franz & Aken, Marcel. (2013). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. (p. 2). Child Development, 85. 10.1111/cdev.12169.
Learning is endless, and this is where Bayesian statistics shows its strengths. One of the
biggest advantages of Bayesian statistics is its openness to updating: we can update the prior
as we learn more about the user’s habits. We do not only use data; we can also learn from the
data. That gives us the ability to improve the accuracy of predictions as the user types more.

I do not have any user-related data, but in a situation where I did, Bayesian statistics
would outperform Classical statistics in these respects. Since I want this study to be open to
further development, I chose Bayesian statistics, and I still benefited from some of its features.
I will make my predictions with the help of Bayesian Statistics while trying to guess which
word the writer intends to write. This gives me an opportunity to include a prior, in our
context the habit factor, in my calculations. Language is strongly linked with our writing
habits; for example, with synonyms, each person has their own preference, so predictions
based only on semantics may be misleading. To overcome this problem, we should integrate
these two factors. This gives more accurate results especially when we have personal data,
but since I do not have any information about the writer of this text, I will use the general
writing habits of English writers.
3. 2. USING BAYES TO FIND SEMANTIC VALUES
3. 2. 1. What is Semantic Value and How it is Calculated
“Colorless green ideas sleep furiously.”10 This is by far the most famous sentence in
linguistics, created by Noam Chomsky. Even though it is grammatically correct, it does not
have any meaning. With this example, Chomsky showed that there is no necessary bond
between the grammatical structure of a language and its semantic side. This is also true for
our prediction algorithm: our suggestion should not only be grammatically correct, it also has
to be semantically appropriate for the sentence.

10 Chomsky, N. (2002). Syntactic Structures. (p. 15). Berlin: Mouton de Gruyter.
To achieve that, I found the words that fit the general topic of this paragraph. We have 24
words before the “st” phrase. By finding the areas in which these words are generally used, I
aimed to determine the general topic of the paragraph. After that, I checked which words
would be appropriate suggestions.
3. 2. 2. Using Bayes’ Theorem in Semantics
What I wanted is actually quite simple: to find a semantically proper suggestion for
completing the “st” phrase. To do that, I first decided on the context of the words that are
already written. COCA classifies words according to how frequently each one is used in each
of 8 main genres: blog posts, general web pages, TV and movie subtitles, spoken language,
fiction, popular magazines, newspapers, and academic writing. I checked each word in the
paragraph and reached the following statistical overview:
TABLE 2: Total Frequencies of Words in Paragraphs in Each Genre
GENRE             TOTAL FREQUENCY
Blog posts        12,176,265
Web pages         13,221,161
Subtitles         8,528,563
Spoken language   11,703,651
Fiction           11,517,946
Magazines         13,082,154
Newspapers        12,746,944
Academic writing  13,093,324
This table shows the total counts of how frequently the words in the paragraph are used in
each genre. According to these numbers, it is clear that the main genre of our paragraph is
web pages.
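The genre decision amounts to summing each genre’s frequencies over the typed words and taking the largest total; a sketch using the Table 2 totals:

```python
# Totals from Table 2: each genre's frequency summed over the 24 words
# already typed in the paragraph.
genre_totals = {
    "blog posts": 12176265, "web pages": 13221161, "subtitles": 8528563,
    "spoken language": 11703651, "fiction": 11517946, "magazines": 13082154,
    "newspapers": 12746944, "academic writings": 13093324,
}
# The paragraph's main genre is the one with the largest total.
main_genre = max(genre_totals, key=genre_totals.get)
print(main_genre)  # → web pages
```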
It is reasonable to expect the correct suggestion to be among the more frequently used
words. For that reason, I picked the 107 most frequently used words among those starting
with “st”.11 Then I formulated the probability of each of the 107 words being used in the web
page genre. For example, the formula for one of the least frequently used words, “styles”, is
as follows:
𝑃(𝑠𝑡𝑦𝑙𝑒𝑠|𝑤𝑒𝑏 𝑝𝑎𝑔𝑒) = 𝑃(𝑤𝑒𝑏 𝑝𝑎𝑔𝑒|𝑠𝑡𝑦𝑙𝑒𝑠) × 𝑃(𝑠𝑡𝑦𝑙𝑒𝑠) / 𝑃(𝑤𝑒𝑏 𝑝𝑎𝑔𝑒)
According to this formula, to find the probability of “styles” being the upcoming word given
that the genre of the paragraph is web pages, we multiply the probability of the genre being
web pages given that the text includes “styles” by the probability of “styles” being used, and
divide by the probability of the genre of the paragraph being web pages.
Since I knew the probability of the genre being web pages given that the text includes
“styles”, in basic terms the probability of “styles” being used in the web page genre, from
COCA, I substituted it. For the main feature of Bayesian Statistics, the prior, I used the
general usage ratio of “styles” among the 107 most popular words starting with “st”. With
these substitutions, the formula became:

𝑃(𝑠𝑡𝑦𝑙𝑒𝑠|𝑤𝑒𝑏 𝑝𝑎𝑔𝑒) = 0.126607818411097 × 0.00180771229791463 / 𝑃(𝑤𝑒𝑏 𝑝𝑎𝑔𝑒)
However, since I did not have any direct statistic for 𝑃(𝑤𝑒𝑏 𝑝𝑎𝑔𝑒), I used the marginal
likelihood formula:

𝑃(𝑤𝑒𝑏 𝑝𝑎𝑔𝑒) = ∑ᵢ₌₁ⁿ 𝑃(𝑤𝑒𝑏 𝑝𝑎𝑔𝑒|𝐴ᵢ) × 𝑃(𝐴ᵢ) = 0.124323532630005

With this addition, the formula became:

𝑃(𝑠𝑡𝑦𝑙𝑒𝑠|𝑤𝑒𝑏 𝑝𝑎𝑔𝑒) = 0.126607818411097 × 0.00180771229791463 / 0.124323532630005

And the result was:

𝑃(𝑠𝑡𝑦𝑙𝑒𝑠|𝑤𝑒𝑏 𝑝𝑎𝑔𝑒) = 0.0018409266975627

11 I picked 107 words because that was the maximum number of words I could obtain from the COCA data set I was using.
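This computation can be reproduced in a few lines of Python, using the three values quoted above:

```python
# Posterior for "styles", from the values quoted above (COCA-derived).
likelihood = 0.126607818411097   # P(web page | styles)
prior = 0.00180771229791463      # usage ratio of "styles" among the 107 words
marginal = 0.124323532630005     # P(web page), via the marginal likelihood sum
posterior = likelihood * prior / marginal
print(posterior)  # ≈ 0.0018409, matching the result above
```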
However, this was the probability for just one of the 107 words. For the rest, I calculated
the probabilities by following the same steps. Here is a Bayes’ box for the ten words with the
highest probability:
TABLE 3: Bayes’ Box for Top 10 Words in Semantics
WORD      FREQUENCY  PRIOR (USAGE RATIO)  LIKELIHOOD   PRIOR × LIKELIHOOD  POSTERIOR
still     791726     0.090240405          0.124828539  0.011264578         0.090606965
state     577192     0.065787962          0.154553424  0.010167755         0.081784635
states    396934     0.045242274          0.158016194  0.007149012         0.057503289
story     319852     0.036456519          0.163194227  0.005949493         0.047854926
start     275954     0.031453054          0.143378969  0.004509706         0.036273957
students  383366     0.043695803          0.079600173  0.003478193         0.027976952
stop      270980     0.030886121          0.109273747  0.003375042         0.027147251
started   234505     0.026728725          0.119617919  0.003197234         0.025717049
study     261496     0.029805141          0.105523603  0.003145146         0.025298073
As you can see in the table above, there are different parameters that affect the posterior
probability. Each word has its own characteristics, and with the use of Bayesian Statistics I
was able to assess their different aspects together. This is just a small portion of the table; the
full version is included as an appendix.

However, I still lacked enough information to make my suggestion more accurate.
Language has another aspect, syntax, and I could reach the correct suggestion only with an
algorithm that covers multiple aspects of language.
3. 3. SYNTACTIC APPROPRIATENESS
Syntax is a subtopic of linguistics focused on the arrangement of words in a sentence or
a paragraph. In language, words should follow each other in the correct order to have a proper
meaning. Syntax, in essence, studies this ordering and our behavior in placing words in order.
In my case, I used syntax as another aspect of my algorithm. Since I was trying to find
the correct word suggestion using only statistics, I hoped that more statistics from different
areas would increase the accuracy of my prediction. My suggestion had to be consistent with
“twenty”, so I included in my research the usage frequency of each word starting with “st”
that follows “twenty”. However, I did not need to calculate this conditional probability myself
using Bayes’ theorem: the iWeb Corpus already contains this data, so I included it directly. I
used only the words common to both the top 100 most used “twenty st…” phrases and the
semantic calculations, which left 57 words. Here is a small sample:
TABLE 4: 10 Most Used Words Starting with “st” After “twenty”
PHRASE           FREQUENCY  USAGE RATIO
twenty students  532        0.303306727
twenty states    366        0.208665906
twenty steps     140        0.079817560
twenty stories   134        0.076396807
twenty studies   50         0.028506271
twenty straight  39         0.022234892
twenty staff     34         0.019384265
twenty state     31         0.017673888
twenty stores    31         0.017673888
twenty standard  29         0.016533637
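The usage-ratio column can be reconstructed by dividing each phrase count by the total count of the top phrases; the total of about 1,754 is my own inference from the table, not a figure given in the corpus:

```python
# Hypothetical reconstruction of the usage-ratio column in Table 4.
counts = {"twenty students": 532, "twenty states": 366, "twenty steps": 140}
total = 1754  # inferred: 532 / 0.303306727 ≈ 1754, the implied total count
ratios = {phrase: round(c / total, 9) for phrase, c in counts.items()}
print(ratios["twenty states"])  # → 0.208665906, as in the table
```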
4. COMBINING THE DATA
As we can see from the sample tables, the correct word, “states”, is not the first option
in any single category. If we evaluated each category one by one, we could make multiple
suggestions to complete the phrase “st”, but all of them would be wrong. Hence, I needed to
combine the statistics from all areas and reach one correct answer. Each of these statistics
was gathered in a different way and offers a different perspective; the way to obtain the
correct result was to combine these different views into one simple mathematical index. I
created this index by multiplying the Jaccard Similarity Index, the semantic posterior
probability, and the syntactic probability, and normalizing the products into numbers out of 1.
The final table includes the 57 words that are commonly popular in both the semantic and
syntactic statistics.
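The combination rule described above can be sketched in Python, using the top three rows of the final table as sample inputs; since the full study normalizes over all 57 words, the index values produced here will not match the table exactly:

```python
# Sketch of the combination step: multiply the three scores for each word,
# then normalize the products so the final indices sum to 1.
# Sample inputs: (Jaccard, semantic, syntactic) for three candidate words.
scores = {
    "states":   (0.5,   0.0575, 0.2087),
    "students": (0.333, 0.0280, 0.3033),
    "state":    (0.5,   0.0818, 0.0177),
}
products = {w: j * sem * syn for w, (j, sem, syn) in scores.items()}
total = sum(products.values())  # in the full study, summed over all 57 words
final_index = {w: p / total for w, p in products.items()}
best = max(final_index, key=final_index.get)
print(best)  # "states" ranks first among these three candidates
```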
TABLE 5: Final List of Suggestions
WORD         JACCARD SIMILARITY  SEMANTIC   SYNTACTIC  PRODUCT    FINAL INDEX  PERCENTAGE INDEX
states       0.5     0.0575      0.2087     0.0120     0.4355     43.5547
students     0.333   0.0280      0.3033     0.0085     0.3080     30.8016
state        0.5     0.0818      0.0177     0.0014     0.0525     5.2468
stories      0.333   0.0157      0.0764     0.0012     0.0436     4.3627
story        0.4     0.0479      0.0143     0.0007     0.0248     2.4759
steps        0.5     0.0084      0.0798     0.0007     0.0244     2.4384
still        0.5     0.0906      0.0057     0.0005     0.0188     1.8751
studies      0.333   0.0140      0.0285     0.0004     0.0145     1.4451
strong       0.333   0.0187      0.0131     0.0002     0.0089     0.8924
staff        0.5     0.0118      0.0194     0.0002     0.0083     0.8322
student      0.333   0.0135      0.0160     0.0002     0.0078     0.7835
standard     0.333   0.0129      0.0165     0.0002     0.0078     0.7768
straight     0.2857  0.0094      0.0222     0.0002     0.0076     0.7561
study        0.4     0.0253      0.0046     0.0001     0.0042     0.4188
stars        0.5     0.0073      0.0154     0.0001     0.0041     0.4077
stores       0.4     0.004234    0.017674   0.000075   0.002716   0.271613
stone        0.4     0.005000    0.014823   0.000074   0.002690   0.269044
star         0.5     0.012268    0.005131   0.000063   0.002285   0.228489
step         0.5     0.016406    0.003421   0.000056   0.002037   0.203712
street       0.5     0.016775    0.002851   0.000048   0.001736   0.173574
start        0.5     0.036274    0.001140   0.000041   0.001501   0.150137
standards    0.333   0.008990    0.004561   0.000041   0.001488   0.148839
studio       0.333   0.003396    0.011973   0.000041   0.001476   0.147579
statements   0.333   0.005812    0.005131   0.000030   0.001082   0.108242
starts       0.5     0.007032    0.003991   0.000028   0.001019   0.101866
stations     0.333   0.002831    0.009122   0.000026   0.000937   0.093741
stages       0.4     0.002357    0.009692   0.000023   0.000829   0.082925
structures   0.333   0.003192    0.006842   0.000022   0.000793   0.079277
stock        0.4     0.007265    0.002851   0.000021   0.000752   0.075171
store        0.4     0.011840    0.001710   0.000020   0.000735   0.073511
stay         0.5     0.017407    0.001140   0.000020   0.000720   0.072048
st           1.0     0.008653    0.002281   0.000020   0.000716   0.071627
statement    0.333   0.014633    0.001140   0.000017   0.000606   0.060565
station      0.333   0.007009    0.002281   0.000016   0.000580   0.058019
stocks       0.4     0.002092    0.006271   0.000013   0.000476   0.047626
strategies   0.2857  0.003509    0.003421   0.000012   0.000436   0.043566
status       0.5     0.010270    0.001140   0.000012   0.000425   0.042507
stage        0.4     0.009545    0.001140   0.000011   0.000395   0.039505
streets      0.5     0.004701    0.002281   0.000011   0.000389   0.038917
stops        0.5     0.002682    0.003991   0.000011   0.000388   0.038847
stands       0.4     0.0049975   0.0017104  0.0000085  0.0003103  0.0310264
styles       0.4     0.0018409   0.0039909  0.0000073  0.0002667  0.0266684
stones       0.4     0.0018290   0.0039909  0.0000073  0.0002650  0.0264957
steel        0.5     0.0023351   0.0028506  0.0000067  0.0002416  0.0241620
studying     0.25    0.0026981   0.0022805  0.0000062  0.0002233  0.0223349
striking     0.2857  0.0024579   0.0022805  0.0000056  0.0002035  0.0203466
strategic    0.25    0.0031492   0.0017104  0.0000054  0.0001955  0.0195516
stayed       0.333   0.0035177   0.0011403  0.0000040  0.0001456  0.0145598
statistical  0.333   0.0025615   0.0011403  0.0000029  0.0001060  0.0106021
stroke       0.333   0.0015650   0.0017104  0.0000027  0.0000972  0.0097160
stairs       0.4     0.0013972   0.0011403  0.0000016  0.0000578  0.0057830
The final table is a predictive text list in which every possible word is attached to an
index. This index shows how probable each word is compared with all the other words in the
list. As we can see from the final table, I managed to predict the word correctly: according to
my index, “states” became the most appropriate suggestion to complete the word, with an
index of 0.4355, about 12.8 percentage points ahead of the second option, “students”. This
shows how strong conditional probability and Bayesian Statistics are; they led me to the
correct result. Even though I did not use Bayes’ rule while calculating the syntactic value,
since its conditional frequency was already available in a data set, I consistently used
conditional probability and, where possible, Bayesian Statistics.
Though linguistics and mathematics seem very different, the mathematical interpretation
of linguistics can be quite accurate. Bayesian statistics is one of the strongest tools in
mathematics, and it is very effective when used correctly. In my research, it was one of my
two main tools, alongside the Jaccard Index, and it played a huge role in making the correct
prediction.
5. CONCLUSION
In this paper, I tried to build a predictive text list that suggests the correct word to
complete the “st” phrase in a random paragraph. To do that, I used the Jaccard Similarity
Index and Bayesian Statistics. I ended up with a list of 57 words, and I managed to predict
the word correctly. While doing that, I tried to use purely statistical methods and data to avoid
subjectivity. Especially when using Bayes’ theorem, which is often regarded as subjective, I
based all my decisions entirely on data.
Before choosing my Extended Essay topic, I was sure that I wanted to do research
combining computer science and mathematics. Since I aim to develop myself in mathematical
computer science and Artificial Intelligence, Natural Language Processing seemed very
attractive to me: it is highly linked with statistics, and the mathematical background it contains
is fascinating. I also wanted to do research that I could develop further in the future. These
considerations led me to choose this research question.
However, I faced some challenges while doing the background calculations and
reflecting them in written text. Even though the mathematics in this paper demanded
considerable effort by its nature, since it requires processing large data sets, expressing it in
writing was equally demanding. Creating a prediction index from scratch requires correctly
building logical links between the different variables in the data. Still, I learned a great deal
from this research about statistics and the vast range of areas in which it can be used
effectively. It widened my perspective on the applications of mathematics and will help me as
I plan a career as a computer scientist specialized in data engineering.
6. BIBLIOGRAPHY
• Brewer, Brendon J. STATS 331: Introduction to Bayesian Statistics. University of
Auckland.
• Routledge, Richard. Bayes's theorem. Encyclopedia Britannica.
https://www.britannica.com/topic/Bayess-theorem Retrieval Date: February 15, 2021.
• Schoot, Rens & Kaplan, David & Denissen, Jaap & Asendorpf, Jens & Neyer, Franz
& Aken, Marcel. (2013). A Gentle Introduction to Bayesian Analysis: Applications to
Developmental Research. (p. 2). Child development. 85. 10.1111/cdev.12169.
• Chomsky, Noam. (2002). Syntactic Structures. Berlin: Mouton de Gruyter.
• Berman, Ari. (2012, January 31). How the GOP Is Resegregating the South. The
Nation. https://www.thenation.com/article/archive/how-gop-resegregating-south/
Retrieval Date: December 7, 2020
• Brownlee, Jason. (2017, September 22). What Is Natural Language Processing?
Machine Learning Mastery. https://machinelearningmastery.com/natural-language-
processing/ Retrieval Date: December 7, 2020.
• Paul Jaccard. Wikipedia. https://en.wikipedia.org/wiki/Paul_Jaccard Retrieval Date:
December 7, 2020.
• Sieg, Adrien. (2019, November 13). Text similarities: Estimate the degree of
similarity between two texts. https://medium.com/@adriensieg/text-similarities-da019229c894
Retrieval Date: November 20, 2020.
• An Intuitive (and Short) Explanation of Bayes' Theorem. BetterExplained.
https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-
theorem/. Retrieval Date: January 28, 2021
• Mahendru, Khyati. (2019, June 13). Analytics Vidhya.
https://www.analyticsvidhya.com/blog/2019/06/introduction-powerful-bayes-theorem-
data-science/. Retrieval Date: December 3, 2020.
• Glen, Stephanie. (2020, September 16). Jaccard Index / Similarity Coefficient.
Statistics How To. https://www.statisticshowto.com/jaccard-index/. Retrieval Date:
January 3, 2021.
DATABASES:
• https://www.english-corpora.org/coca/. Retrieval Date: February 16, 2021.
• https://www.wordfrequency.info/samples.asp Retrieval Date: February 16, 2021.
• https://www.english-corpora.org/iweb/. Retrieval Date: February 17, 2021.
• https://github.com/dwyl/english-words Retrieval Date: December 12, 2020.
PROGRAMS:
• Visual Studio 2019
• Python 3.7
7. APPENDICES
FIGURE 3: Code for Finding Words Starting with “st”
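The original figure was an image that did not survive extraction. Below is a minimal Python sketch of what such code likely looks like: filtering a vocabulary for words that begin with “st”. The essay used the dwyl/english-words list; here a small inline sample (`SAMPLE_VOCAB`, a hypothetical name) stands in for it.

```python
# Hedged sketch, not the author's original code: collect every word in a
# vocabulary that starts with the prefix "st".

SAMPLE_VOCAB = ["state", "states", "story", "still", "nation", "party", "step"]

def words_starting_with(prefix, vocab):
    """Return the words in vocab that begin with the given prefix."""
    return [w for w in vocab if w.lower().startswith(prefix)]

st_words = words_starting_with("st", SAMPLE_VOCAB)
print(st_words)  # the candidate words competing to complete the "st" phrase
```

With the full english-words list, the same function would return the complete candidate pool that the later tables are built from.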
FIGURE 4: Code for Calculating Jaccard Index
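The code in this figure was also lost as an image. A plausible sketch is given below, treating each word as a set of characters (one common choice; the original may instead have compared n-grams), with the Jaccard index defined as |A ∩ B| / |A ∪ B|.

```python
# Hedged sketch of a Jaccard index calculation between two words,
# where each word is modelled as the set of its letters.

def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B| for the character sets of words a and b."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

print(jaccard_index("still", "state"))  # 2 shared letters of 6 distinct ones
```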
FIGURE 5: Code for Calculating Jaccard Distance
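As with the previous figures, only the caption survives. Since the Jaccard distance is simply one minus the Jaccard index, the sketch is a one-line complement of the function above (again under the character-set assumption).

```python
# Hedged sketch of a Jaccard distance calculation: 1 minus the Jaccard
# index of the two words' character sets. A distance of 0 means the
# letter sets are identical; 1 means they share no letters.

def jaccard_distance(a, b):
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)
```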
TABLE 6: General and Genre Specific Frequencies of Words in The Paragraph
WORD FREQ BLOG WEB TVM SPOK FICTION MAGAZINE NEWS ACADEMIC
nationwide 15733 1546 1889 214 1911 106 3216 5059 1792
republican 124514 21658 20867 904 43517 475 10889 22121 4080
have 5025573 781709 687895 820686 879668 423071 503134 522808 406599
a 21889251 2783458 2827106 2519099 2716641 2749208 3104298 2959649 2229222
major 196857 24133 25756 10307 24642 6983 31077 36638 37315
advantage 55691 9484 8721 3212 5275 3253 8873 7923 8949
in 16560377 2003430 2257672 1225718 2020330 1671503 2310522 2355671 2699192
heading 28118 3089 2997 4616 3420 5655 3519 3822 1000
into 1461816 166362 180584 116756 148250 307485 226799 171402 144177
the 50074257 6272412 7101104 3784652 5769026 6311500 6805845 6582642 7447070
november 87176 24573 22587 1138 7249 2224 10651 11000 7733
elections 39380 7212 6709 352 8261 232 3155 7398 6061
party 243697 39715 35760 31649 43730 15044 24195 33310 20277
controls 26347 3131 3786 1272 1766 1639 4708 2995 7049
process 220128 31106 33489 5266 26450 5496 27362 22973 67985
twenty 36338 2939 3815 2718 3313 14068 3807 971 4707
redistricting 1724 308 424 4 202 4 104 562 116
TOTAL 96086977 12176265 13221161 8528563 11703651 11517946 13082154 12746944 13093324
TABLE 7: Bayes’ Box for All Words Used in Semantics Calculations
WORD FREQUENCY RATIO LIKELIHOOD RATIO x LIKELIHOOD POSTERIOR
still 791726 0,090240405 0,124828539 0,011264578 0,090606965
state 577192 0,065787962 0,154553424 0,010167755 0,081784635
states 396934 0,045242274 0,158016194 0,007149012 0,057503289
story 319852 0,036456519 0,163194227 0,005949493 0,047854926
start 275954 0,031453054 0,143378969 0,004509706 0,036273957
students 383366 0,043695803 0,079600173 0,003478193 0,027976952
stop 270980 0,030886121 0,109273747 0,003375042 0,027147251
started 234505 0,026728725 0,119617919 0,003197234 0,025717049
study 261496 0,029805141 0,105523603 0,003145146 0,025298073
strong 152080 0,017333978 0,134468701 0,002330877 0,018748482
stay 203720 0,023219871 0,093201453 0,002164126 0,017407209
street 189237 0,021569108 0,09668828 0,00208548 0,016774619
stuff 153066 0,017446361 0,118713496 0,002071119 0,016659103
step 128479 0,014643951 0,139283463 0,00203966 0,016406067
stories 112939 0,012872712 0,151940428 0,001955885 0,015732222
stand 138407 0,015775538 0,120716438 0,001904367 0,01531783
statement 86767 0,009889645 0,183952424 0,001819224 0,014632984
studies 136505 0,01555875 0,111592982 0,001736247 0,013965556
student 147156 0,016772743 0,100220175 0,001680967 0,01352091
standard 86472 0,009856021 0,163266722 0,00160916 0,012943328
star 107361 0,012236936 0,124635575 0,001525158 0,012267649
starting 93909 0,010703686 0,137729078 0,001474209 0,011857842
store 97668 0,011132134 0,13223369 0,001472043 0,011840422
staff 109761 0,012510486 0,117528084 0,001470333 0,011826671
style 72902 0,008309322 0,166072261 0,001379948 0,011099651
status 75175 0,008568397 0,149012305 0,001276797 0,010269951
stage 84800 0,009665448 0,122771226 0,001186639 0,009544765
straight 89870 0,010243323 0,11370869 0,001164755 0,00936874
standards 68608 0,007819894 0,142927938 0,001117681 0,008990103
stupid 66726 0,007605385 0,1457303 0,001108335 0,008914926
st 111984 0,012763862 0,084279897 0,001075737 0,008652722
standing 90839 0,010353769 0,102345909 0,001059666 0,008523454
structure 63205 0,007204064 0,145906178 0,001051117 0,008454694
steps 69771 0,007952452 0,13157329 0,00104633 0,008416189
strategy 66602 0,007591252 0,133224228 0,001011339 0,008134732
stopped 88086 0,010039984 0,097177758 0,000975663 0,007847775
stated 38017 0,004333152 0,220138359 0,000953893 0,007672667
strength 58373 0,006653316 0,142000582 0,000944775 0,007599323
stars 69625 0,007935811 0,114298025 0,000907048 0,007295864
stock 71592 0,008160009 0,110682758 0,000903172 0,007264693
starts 58138 0,006626531 0,131927483 0,000874222 0,007031827
station 74456 0,008486446 0,102678092 0,000871372 0,007008907
storm 51639 0,005885779 0,133542478 0,000786002 0,006322226
stood 83875 0,009560017 0,07961848 0,000761154 0,006122365
strange 55570 0,006333832 0,119003059 0,000753745 0,006062773
steve 58774 0,006699022 0,11185218 0,0007493 0,006027018
statements 36168 0,004122405 0,175265428 0,000722515 0,005811571
stress 50001 0,005699081 0,126617468 0,000721603 0,005804237
struggle 41878 0,004773227 0,140766035 0,000671908 0,005404513
stick 52290 0,00595998 0,112583668 0,000670996 0,005397179
stone 57458 0,006549025 0,094921508 0,000621643 0,005000206
stands 49137 0,005600603 0,110934734 0,000621301 0,004997456
stuck 45673 0,005205778 0,115297878 0,000600215 0,004827849
strike 40442 0,004609552 0,129864992 0,000598619 0,004815013
strongly 31957 0,003642438 0,161404387 0,000587905 0,004728835
streets 54014 0,00615648 0,094938349 0,000584486 0,004701331
stores 41045 0,004678282 0,112510659 0,000526357 0,004233765
statistics 29092 0,003315887 0,155025437 0,000514047 0,004134751
stronger 32229 0,00367344 0,135809364 0,000498888 0,004012817
storage 26955 0,003072313 0,160601002 0,000493417 0,003968811
string 21127 0,002408041 0,191035168 0,000460021 0,003700189
struck 35801 0,004080574 0,111477333 0,000454892 0,003658933
stayed 39818 0,004538429 0,096363454 0,000437339 0,003517747
strategies 39768 0,00453273 0,096233152 0,000436199 0,003508579
studio 39195 0,00446742 0,09450185 0,000422179 0,003395813
struggling 26677 0,003040627 0,138471342 0,00042104 0,003386645
stable 25662 0,002924938 0,138726522 0,000405766 0,003263794
studied 33515 0,003820018 0,105922721 0,000404627 0,003254626
structures 26349 0,003003242 0,132149228 0,000396876 0,003192284
stream 24393 0,002780298 0,141843972 0,000394369 0,003172115
strategic 26107 0,002975659 0,131573907 0,000391519 0,003149195
staying 32297 0,003681191 0,102579187 0,000377614 0,003037346
stretch 26762 0,003050315 0,115761154 0,000353108 0,002840235
stations 24804 0,002827143 0,124496049 0,000351968 0,002831067
strikes 20666 0,002355497 0,143714313 0,000338519 0,002722885
studying 27294 0,003110952 0,107825896 0,000335441 0,002698131
stops 26908 0,003066956 0,108703731 0,00033339 0,002681629
stephen 26970 0,003074023 0,104857249 0,000322334 0,0025927
statistical 17579 0,002003643 0,158939644 0,000318458 0,002561528
striking 19419 0,002213365 0,138060662 0,000305579 0,002457931
strip 21378 0,00243665 0,122836561 0,00029931 0,002407507
stomach 27740 0,003161787 0,094520548 0,000298854 0,00240384
stages 19026 0,002168571 0,135130874 0,000293041 0,002357083
stolen 22366 0,002549262 0,11459358 0,000292129 0,002349749
steel 32776 0,003735787 0,077709299 0,000290305 0,00233508
steady 24491 0,002791468 0,101180025 0,000282441 0,002271821
steal 20956 0,002388551 0,117675129 0,000281073 0,002260819
stability 21016 0,00239539 0,1161496 0,000278224 0,002237899
stewart 20236 0,002306486 0,120181854 0,000277198 0,002229648
stadium 25212 0,002873647 0,094756465 0,000272297 0,002190226
stepped 34088 0,003885328 0,069877963 0,000271499 0,002183808
stranger 19155 0,002183274 0,122631167 0,000267737 0,002153554
stopping 18673 0,002128336 0,124136454 0,000264204 0,002125134
stem 19238 0,002192735 0,120178813 0,00026352 0,002119633
stocks 25119 0,002863047 0,090847566 0,000260101 0,002092129
stake 19616 0,002235819 0,107616232 0,00024061 0,001935357
styles 15860 0,001807712 0,126607818 0,000228871 0,001840927
stones 16858 0,001921464 0,11834144 0,000227389 0,001829008
structural 16140 0,001839627 0,121623296 0,000223741 0,001799671
struggled 16187 0,001844984 0,117748811 0,000217245 0,001747413
steven 21003 0,002393908 0,087416083 0,000209266 0,001683238
stanford 15967 0,001819908 0,113859836 0,000207214 0,001666735
stroke 17508 0,00199555 0,097498286 0,000194563 0,001564971
staring 25605 0,002918441 0,062331576 0,000181911 0,001463207
stairs 23029 0,00262483 0,066177428 0,000173705 0,001397197
stole 17208 0,001961356 0,077754533 0,000152504 0,001226673
stared 23981 0,002733339 0,046828739 0,000127999 0,001029562
stir 19479 0,002220204 0,05303147 0,000117741 0,00094705
TOTAL 8773520 0,124323533
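The arithmetic behind this Bayes’ box can be checked directly from its columns: each word’s posterior is its frequency ratio (prior) multiplied by its likelihood, divided by the total of the “ratio x likelihood” column. The sketch below reproduces that calculation for the first three rows of Table 7 (decimal commas written as dots), normalising over just those three words rather than the full list.

```python
# Hedged sketch of the Bayes' box normalisation in Table 7, using the
# first three rows of the table. In the full table the normalising
# constant is the TOTAL of the "ratio x likelihood" column (0.124323533).

priors = {"still": 0.090240405, "state": 0.065787962, "states": 0.045242274}
likelihoods = {"still": 0.124828539, "state": 0.154553424, "states": 0.158016194}

# Posterior ∝ prior × likelihood, normalised so the posteriors sum to 1.
products = {w: priors[w] * likelihoods[w] for w in priors}
total = sum(products.values())
posteriors = {w: p / total for w, p in products.items()}
print(posteriors)
```

Running the same calculation over all rows reproduces the POSTERIOR column, e.g. 0.011264578 / 0.124323533 ≈ 0.0906 for “still”.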
TABLE 8: Usage Frequencies After Twenty for Words Starting with “st”
WORD FREQUENCY RATIO WORD FREQUENCY RATIO
st 543864 0,006317719 staff 2602407 0,03023049
stage 1569009 0,018226169 stages 427339 0,004964123
stairs 28180 0,000327349 standard 2427376 0,028197267
standards 1225651 0,0142376 stands 595947 0,006922733
star 1466255 0,017032542 stars 831227 0,00965583
start 5497965 0,063866327 starts 985564 0,011448664
state 6327227 0,073499331 statement 1456109 0,016914683
statements 619896 0,007200934 states 3419996 0,039727896
station 1048322 0,012177683 stations 380514 0,004420187
statistical 173317 0,002013312 status 1256637 0,014597545
stay 2338130 0,027160554 stayed 352807 0,004098332
steam 366250 0,004254491 steel 923989 0,010733387
step 2670622 0,0310229 steps 1337165 0,015532987
still 9337823 0,10847149 stock 1648872 0,019153887
stocks 320921 0,003727933 stone 763898 0,008873712
stones 273985 0,003182708 stops 384562 0,00446721
store 2370994 0,027542314 stores 861682 0,010009606
stories 1251367 0,014536326 story 3185323 0,037001851
straight 1228434 0,014269928 strategic 566227 0,006577495
strategies 699520 0,008125874 street 2085396 0,024224705
streets 510693 0,005932392 striking 265952 0,003089394
strings 265409 0,003083086 stroke 307775 0,003575224
strong 2479756 0,028805732 structures 427084 0,00496116
student 2920987 0,033931229 students 6528359 0,075835752
studies 1471363 0,017091879 studio 789269 0,009168431
study 2888625 0,033555301 studying 374100 0,004345679
stunning 458706 0,005328493 styles 546658 0,006350175
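The RATIO column of Table 8 follows the same pattern as the earlier tables: each word’s “after twenty” frequency divided by the combined frequency of all listed “st” words. The sketch below illustrates this with three frequencies taken from the table (decimal commas written as dots), normalising over just those three.

```python
# Hedged sketch of Table 8's RATIO column: a word's conditional frequency
# divided by the total frequency of the candidate set. Sample frequencies
# are taken from the table itself; the real total spans all listed words.

freqs = {"still": 9337823, "students": 6528359, "state": 6327227}
total = sum(freqs.values())
ratios = {w: f / total for w, f in freqs.items()}
print(ratios)
```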

ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 

Recently uploaded (20)

POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 

IB Mathematics Extended Essay (2021) - Building A Predictive Text List Using Jaccard Index And Bayesian Statistics

  • 1. EXTENDED ESSAY – MATHEMATICS BUILDING A PREDICTIVE TEXT LIST USING JACCARD INDEX AND BAYESIAN STATISTICS RESEARCH QUESTION How can I build a predictive text list for an uncompleted word in a paragraph by using the Jaccard Index and Bayesian Statistics? Word Count: 3852
  • 2. 2 TABLE OF CONTENTS
    1. INTRODUCTION ................................................. 4
    1. 1. RATIONALE ................................................. 4
    1. 2. AIM OF THE STUDY AND APPROACH ............................. 4
    2. WORD SIMILARITY COMPARISON ................................... 6
    2. 1. WHAT IS JACCARD INDEX AND JACCARD DISTANCE ................ 6
    2. 1. 1. Jaccard Index .......................................... 6
    2. 1. 2. Jaccard Distance ....................................... 7
    2. 2. FINDING SIMILAR WORDS ..................................... 8
    2. 2. 1. Why Jaccard Index and Jaccard Distance ................. 8
    2. 2. 2. Applying Jaccard Index ................................. 9
    3. TAKING SEMANTIC AND SYNTAX INTO ACCOUNT ..................... 12
    3. 1. BAYESIAN STATISTICS ...................................... 12
    3. 1. 1. What is Bayes’ Theorem ................................ 12
    3. 1. 2. Differences Between Bayesian Statistics and Classical Statistics ... 13
    3. 1. 3. Why Bayesian Statistics Instead of Classical Statistics ... 14
    3. 2. USING BAYES TO FIND SEMANTIC VALUES ...................... 15
    3. 2. 1. What is Semantic Value and How it is Calculated ....... 15
  • 3. 3
    3. 2. 2. Using Bayesian Theorem in Semantics ................... 16
    3. 3. SYNTACTIC APPROPRIATENESS ................................ 19
    4. COMBINING THE DATA .......................................... 20
    5. CONCLUSION .................................................. 23
    6. BIBLIOGRAPHY ................................................ 24
    7. APPENDICES .................................................. 26
  • 4. 4 1. INTRODUCTION 1. 1. RATIONALE As personal computers become widespread, our writing process is being transferred to digital systems. Nowadays, instead of a pen, most people use their computer or smartphone to write texts about a variety of topics. However, typing still takes a lot of time. To enhance the quality and speed of our writing, computer scientists and engineers develop different algorithms. I am interested in linguistics and data science. While doing research for my Mathematics Extended Essay, I came across a subfield of Artificial Intelligence: Natural Language Processing. It is, in essence, the processing of text or speech in a human language by software.1 I asked myself, “why not combine linguistics and computer science in your Extended Essay?”. I decided to create a basic algorithm that uses the power of mathematics and computing to process a text during the writing process and make the best predictions for the word currently being typed. Computer scientists have been working on this issue for a long time, but there is still remarkable room for improvement. By using the power of statistics, we can go further in processing texts to make more accurate predictions. 1. 2. AIM OF THE STUDY AND APPROACH In this study, I aimed to develop a mathematical algorithm that makes the most accurate predictions for the upcoming word. Unlike most other algorithms, I did not only consider letter-by-letter similarities between words; I also considered the semantic and syntactic value of words in a context. By using massive libraries, I tried to find the best suggestion for a misspelled word. 1 Brownlee, J. (2017, September 22). What Is Natural Language Processing? Machine Learning Mastery. https://machinelearningmastery.com/natural-language-processing/ Retrieval Date: December 7, 2020.
  • 5. 5 To do that, I used the Jaccard Index and Bayesian Statistics and created a mathematical index which indicates the accuracy of a prediction. In my study, I worked on a randomly picked article. By using a website, I picked a piece of a random article and cut it into parts.2 Then, I changed it so that it appears to be in the middle of being written. Here is the original piece that I took from the article: “Nationwide, Republicans have a major advantage in redistricting heading into the November elections. The party controls the process in twenty states, including key swing states like Florida, Ohio, Michigan, Virginia, and Wisconsin, compared with seven for Democrats (the rest are home to either a split government or independent redistricting commissions).”3 I cut off a sentence in the middle of it. Here is the version that I will study: “Nationwide, Republicans have a major advantage in redistricting heading into the November elections. The party controls the process in twenty st…” By using mathematical and statistical tools, I tried to find the best suggestion for the word that is currently being typed. Of course, this study needs computing power, so I used some computer programmes, tools, and code. I included them as an appendix. Expanded lists for tables are also included as an appendix. In the statistical parts of the study, I used two databases for the frequencies and genre-specific frequencies of words. Before deciding that my paragraph belongs to a web page, I used COCA (the Corpus of Contemporary American English)4 , which contains more than 1 billion words from 8 different genres, to obtain fairer results. However, after finding that my paragraph is part of a web page, 2 Website I used for picking a random article: https://longform.org/random 3 Berman, A. (2012, January 31). How the GOP Is Resegregating the South. The Nation. https://www.thenation.com/article/archive/how-gop-resegregating-south/ Retrieval Date: December 7, 2020 4 https://www.english-corpora.org/coca/. 
Retrieval Date: February 16, 2021.
  • 6. 6 I used “The iWeb Corpus”5 , which contains more than 14 billion words from 22 million web pages, to work with statistics focused only on web pages. 2. WORD SIMILARITY COMPARISON 2. 1. WHAT IS JACCARD INDEX AND JACCARD DISTANCE 2. 1. 1. Jaccard Index The Jaccard Similarity Index is a method for comparing two sets of data. It was introduced by a Swiss professor of botany, Paul Jaccard.6 It is a measure of similarity for two finite data sets, and it is expressed in a range of 0% to 100%. As the similarity increases, the percentage value increases. The Jaccard Similarity Index is measured by comparing the shared elements of the sets with the combination of the sets. Mathematically, to calculate the Jaccard Similarity Index for two finite sets, we divide the number of common objects -the intersection of the two sets- by the total number of objects -the union of the two sets-. In mathematical notation, the Jaccard Similarity Index is expressed as follows: J(A, B) = |A ∩ B| / |A ∪ B| Or, equivalently: J(A, B) = |A ∩ B| / (|A| + |B| − |A ∩ B|) In a Venn diagram, the Jaccard Index can be shown like this: 5 https://www.english-corpora.org/iweb/. Retrieval Date: February 17, 2021. 6 Paul Jaccard. Wikipedia. https://en.wikipedia.org/wiki/Paul_Jaccard Retrieval Date: 2020, December 7.
  • 7. 7 2. 1. 2. Jaccard Distance Jaccard Distance is also a method for comparing two sets of data. However, unlike the Jaccard Index, which measures how similar the sets are, Jaccard Distance measures how dissimilar the sets are. It is also expressed in the range of 0% to 100%, but since it measures dissimilarity, the value increases as the similarity between the two sets decreases. In Jaccard Distance, 100% or 1.00 means the sets are completely dissimilar, while 0% or 0.00 means the sets are equal to each other. D(A, B) = 1 − J(A, B) = 1 − |A ∩ B| / (|A| + |B| − |A ∩ B|) FIGURE 1: Venn representation of Jaccard Index
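The two formulas above translate directly into a few lines of code. Below is a minimal Python sketch (the function names are mine, not from the essay's appendix), treating each word as the set of its distinct letters:

```python
# Minimal sketch of the two measures defined above; each word is
# treated as the set of its distinct letters.
def jaccard_index(a: str, b: str) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def jaccard_distance(a: str, b: str) -> float:
    """D(A, B) = 1 − J(A, B): dissimilarity instead of similarity."""
    return 1 - jaccard_index(a, b)

# Equal letter sets give similarity 1.0; disjoint sets give distance 1.0.
print(jaccard_index("state", "state"))   # 1.0
print(jaccard_distance("st", "quiz"))    # 1.0
```

Because Python's `set` type handles the intersection and union directly, the implementation mirrors the mathematical definition term by term.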
  • 8. 8 2. 2. FINDING SIMILAR WORDS 2. 2. 1. Why Jaccard Index and Jaccard Distance There are several reasons why I picked the Jaccard Index for measuring the similarity of words. First of all, in the Jaccard Index words are treated as sets of letters. This gave me the opportunity to compare words letter by letter, which is an essential feature for me. Since I will base my prediction on the letters that have already been typed, the Jaccard Index will narrow the circle of possible words. Secondly, the Jaccard Index is widely used in Computer Science and specifically in Python -a programming language that I also used in this study. This gave me unique flexibility, because English contains hundreds of thousands of words, and comparing the letters of each word with the characters that have just been typed is nearly impossible without a computer algorithm. FIGURE 2: Venn Representation of Jaccard Distance
  • 9. 9 2. 2. 2. Applying Jaccard Index To apply the Jaccard Similarity Index to words, we must first define the sets. For example, if we want to calculate the similarity index between two random words, “soldier” and “laborer”, we first define two sets as A = {s, o, l, d, i, e, r}, B = {l, a, b, o, r, e}. Then we apply the Jaccard formula: J(A, B) = |A ∩ B| / (|A| + |B| − |A ∩ B|) = |{l, o, e, r}| / |{s, o, l, d, i, e, r, a, b}| = 4/9 ≈ 0.44 Since we have calculated these sets’ Jaccard Index, we can calculate their dissimilarity -the Jaccard Distance- with the basic formula: D(A, B) = 1 − |{l, o, e, r}| / |{s, o, l, d, i, e, r, a, b}| = 1 − 4/9 ≈ 0.56 English contains a lot of words, so to find predictive words for the word that is currently being typed, we need to use multiple methods. The first of them is comparing the piece of the word already written with each probable word. There is no consensus on a list of all English words, but I used an online source that contains more than 307,113 words.7 Using a Python script, I first found the number of words starting with the letters “s, t” from the set of 307,113 English words. By assuming the upcoming word starts with these 7 https://github.com/dwyl/english-words Retrieval Date: December 12, 2020.
  • 10. 10 letters in the order “st”, I picked only the words that contain these letters in the correct order. That means 4886 words in total. After that, using a Python script again, I calculated the Jaccard Similarity Index and Jaccard Distance of each of these words with the phrase “st”. That may give you the impression that shorter words are more likely to be possible word suggestions, but that is not the case. Since the words are regarded as sets, repeated letters do not affect the similarity of words. Here is a real example that I encountered: J(st, stab) = J(st, staatsrat), since |{s, t}| / |{s, t, a, b}| = |{s, t}| / |{s, t, a, r}| = 0.5 Here are the Jaccard Similarity and Jaccard Distance of the English words with a similarity index higher than or equal to 0.66 among the words starting with the phrase “st”: TABLE 1: Jaccard Similarity and Distance of Words Starting with “st”
WORD    JACCARD SIMILARITY    JACCARD DISTANCE
st      1.000                 0.000
sta     0.667                 0.333
stat    0.667                 0.333
stats   0.667                 0.333
std     0.667                 0.333
stet    0.667                 0.333
stets   0.667                 0.333
stg     0.667                 0.333
sty     0.667                 0.333
  • 11. 11
stk     0.667                 0.333
stm     0.667                 0.333
stoot   0.667                 0.333
stoss   0.667                 0.333
stot    0.667                 0.333
stott   0.667                 0.333
str     0.667                 0.333
stu     0.667                 0.333
stuss   0.667                 0.333
stut    0.667                 0.333
As you can see, there is no correlation between the length of a word and its similarity to the phrase “st”. This table gave me a chance to make a prediction about the word that is currently being typed, but since there are a total of 4886 words that start with the “st” phrase, this prediction may be misleading for a few reasons. First, there is more than one word with the same similarity ratio. Second, even though these suggestions gave a general outlook, they were not sufficient once I took the language’s own characteristics into account. In language, words must be coherent and follow grammatical rules at the same time. To overcome this difficulty and make my suggestion more accurate, I developed a new model that takes semantic values into account by using Bayesian Statistics.
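The filtering and scoring step described in this section can be sketched in a few lines of Python: keep only the words starting with the typed phrase, score each with the Jaccard Index, and observe that repeated letters collapse into one set element, which produces the ties seen above. The mini word list is a made-up stand-in for the 307,113-word source:

```python
def jaccard_index(a: str, b: str) -> float:
    """J(A, B) over the letter sets of two words."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Tiny stand-in for the full 307,113-word list used in the essay.
words = ["st", "sta", "stab", "staatsrat", "story", "table"]
typed = "st"

# Keep only candidates that start with the typed phrase...
candidates = [w for w in words if w.startswith(typed)]
# ...and score each one against the typed phrase.
scores = {w: round(jaccard_index(typed, w), 3) for w in candidates}

# Repeated letters collapse in a set, so "stab" and "staatsrat" tie.
print(scores)  # {'st': 1.0, 'sta': 0.667, 'stab': 0.5, 'staatsrat': 0.5, 'story': 0.4}
```

“table” is dropped by the prefix filter, and “staatsrat” scores exactly as “stab” despite being more than twice as long, which is the point the paragraph above makes.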
  • 12. 12 3. TAKING SEMANTIC AND SYNTAX INTO ACCOUNT 3. 1. BAYESIAN STATISTICS 3. 1. 1. What is Bayes’ Theorem Bayes’ theorem is a way of calculating probability with the help of a priori knowledge; in other terms, it is one of the tools of conditional probability. It was formulated by Thomas Bayes, a Presbyterian minister and mathematician who lived during the 18th century. However, it was published in 1763, after the death of Thomas Bayes, when the theorem was discovered among his notes.8 As a branch of conditional probability, Bayes’ Theorem aims to find the probability of an event in light of relevant prior knowledge. It is a way of calculating the probability of an event or situation given that related knowledge is true. Its formula is as follows: P(A|B) = P(A ∩ B) / P(B) Or, in the more widely used form: P(A|B) = P(B|A) × P(A) / P(B) Verbally, this equation is formulated to find the probability of A given that B is true, which is called the posterior in Bayesian terminology. To do that, we multiply the probability of B given that A is true -the likelihood- by the probability of A -the prior- and divide the product by the probability of B -the marginal likelihood-. As you see, in Bayes’ Theorem 3 main components are used: 1. Likelihood 2. Prior 3. Marginal Likelihood. The likelihood is the probability that we would get if the hypothesis is 8 Routledge, R. Bayes's theorem. Encyclopedia Britannica. https://www.britannica.com/topic/Bayess-theorem Retrieval Date: February 15, 2021.
  • 13. 13 true. Basically, to find it, we assume that the hypothesis is true and calculate the probability of the event happening. The prior is the main feature of Bayesian Statistics. It is our estimation of how probable the hypothesis is. If there are multiple hypotheses, the sum of the priors must be equal to 1. The marginal likelihood is the probability that we would obtain whether or not the hypotheses are true. Its formula is as follows: P(B) = Σ_{i=1}^{n} P(B|A_i) × P(A_i) And the final outcome of Bayes’ Rule, in our formula P(A|B), is called the posterior. 3. 1. 2. Differences Between Bayesian Statistics and Classical Statistics There are two main schools of statistics: Classical statistics and Bayesian statistics. While Classical statistics is the product of frequentist methods, Bayesian statistics approaches probability as a subjective experience of uncertainty. The frequentist method relies on repeating experiments and interpreting only the data set. In this method, the null hypothesis is assumed true. Contrary to the frequentist method, Bayesian statistics relies on combining the data with prior knowledge. Unlike the frequentist method, large data sets are not needed in the Bayesian approach. In this method, prior knowledge about the hypothesis is used along with the experimental data, so it works with smaller data sets. Prior knowledge is a game-changer in statistics, and it is the main aspect that distinguishes these two approaches. While there is no prior-knowledge concept in Classical (frequentist) statistics, Bayesian statistics gives us the ability to include our prior opinion or knowledge about the hypothesis. However, the nature of this prior knowledge is not clearly defined, and it may be completely subjective or objective, depending on the choice of the person doing the calculation. There are multiple ways of deciding on the prior, and they affect the posterior. Both informative and noninformative priors can be used in calculations, and the
  • 14. 14 posterior can vary greatly according to the type of prior. Because of that, Bayesian statistics is sometimes criticized for being subjective and lacking scientific certainty. Nonetheless, in Bayesian statistics this is regarded as a richness, not a weakness. The frequentist approach ignores all past studies and surrounding effects; it focuses only on the data. Classical (frequentist) statistics supposes that “nothing is going on”. However, “always something is going on”.9 Since we do not do our experiments in an environment that is completely free of outside effects, ignoring the surroundings may be misleading. That means that, contrary to popular belief, Bayesian statistics may sometimes be more accurate than Classical statistics. However, the key part is how we decide on the prior. 3. 1. 3. Why Bayesian Statistics Instead of Classical Statistics As I mentioned in the previous section, Bayesian statistics takes outer effects more into consideration. In Bayesian statistics, we do not need to rely only on the raw data we obtain from calculations; we can also use other resources. In language, the main outer resource is habit. Each person has different habits in their use of language, and to make more accurate predictions, we should also use them in our calculations. In modern systems that perform tasks similar to the algorithm I am working on, language habits are an integral part of word prediction. They store data about how many times the user uses each word and tend to suggest more frequently used words. They also learn the topics that the user writes about more and use them in their predictions. This gives them the ability to make more accurate predictions for specific users. 9 Schoot, Rens & Kaplan, David & Denissen, Jaap & Asendorpf, Jens & Neyer, Franz & Aken, Marcel. (2013). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. (p. 2). Child Development. 85. 10.1111/cdev.12169.
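The three components defined in section 3.1.1 -prior, likelihood, and marginal likelihood- can be sketched as one small function. The numbers below are illustrative placeholders, not figures from the corpora:

```python
def posteriors(priors, likelihoods):
    """Bayes' theorem over competing hypotheses A_1..A_n:
    P(A_k|B) = P(B|A_k) * P(A_k) / sum_i P(B|A_i) * P(A_i)."""
    # Marginal likelihood P(B): the prior-weighted sum over all hypotheses.
    marginal = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / marginal for p, l in zip(priors, likelihoods)]

# Two rival hypotheses with equal priors (summing to 1, as required)
# but unequal likelihoods:
result = posteriors([0.5, 0.5], [0.8, 0.2])
print(result)       # -> [0.8, 0.2] up to floating point
print(sum(result))  # posteriors over all hypotheses always sum to 1
```

Because the marginal likelihood is computed from the same priors and likelihoods, the returned posteriors are automatically normalised, which is exactly the role P(B) plays in the formulas above.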
  • 15. 15 Learning is endless, and this is where Bayesian statistics shows its strengths. One of the biggest advantages of Bayesian statistics is its openness to updating. We can update the prior as we learn more about the user’s habits. We do not only use data; we can also learn from the data. That gives us the ability to enhance the accuracy of predictions as the user types more. I do not have any user-related data, but in a situation where I did, the use of Bayesian statistics would outperform Classical statistics in these aspects. Since I want this study to be open to further development, I chose Bayesian statistics and benefited from several of its features. I will make my predictions with the help of Bayesian Statistics while trying to guess which word the writer intended to write. This gives me the opportunity to include a prior -in our context, the habit factor- in my calculations. Language is strongly linked with our writing habits. For example, among synonyms, each person has their own preference. So, predictions based only on semantics may be misleading. To overcome this problem, we should integrate these two factors. This will give more accurate results, especially when we have personal data, but I am not working with personal data since I do not have any information about the writer of this text. Therefore, I will use the general writing habits of English writers. 3. 2. USING BAYES TO FIND SEMANTIC VALUES 3. 2. 1. What is Semantic Value and How it is Calculated “Colorless green ideas sleep furiously.”10 This is by far the most famous quote in linguistics, created by Noam Chomsky. Even though it is grammatically correct, it does not have any meaning. With this example, Chomsky showed that there is no necessary bond between the 10 Chomsky, N. (2002). Syntactic Structures. (p. 15). Berlin: Mouton de Gruyter.
  • 16. 16 grammatical structure of a language and its semantic side. This is also true for our prediction algorithm. Our suggestion should not only be grammatically correct; it also has to be semantically appropriate for the sentence. To achieve that, I looked for the words that fit the general topic of this paragraph. We have 24 words before the “st” phrase. By finding in which genres these words are generally used, I aimed to identify the general topic of the paragraph. After that, I checked which words would be appropriate suggestions. 3. 2. 2. Using Bayes’ Theorem in Semantics What I wanted is actually pretty simple: finding a semantically proper suggestion for completing the “st” phrase. To do that, I first decided on the context of the words that are already written. COCA classifies words according to how frequently a word is used in each of 8 main genres. These genres are blog posts, general web pages, TV and movie subtitles, spoken language, fiction, popular magazines, newspapers, and academic writing. I checked each word in the paragraph and reached a statistical overview: TABLE 2: Total Frequencies of Words in the Paragraph in Each Genre
GENRE               TOTAL FREQUENCY
BLOG POSTS          12176265
WEB PAGES           13221161
SUBTITLES            8528563
SPOKEN LANG.        11703651
FICTION             11517946
MAGAZINES           13082154
NEWSPAPERS          12746944
ACADEMIC WRITINGS   13093324
This table shows the totals of how frequently the words in the paragraph are used in each category. According to these numbers, it is clear that the main genre of our paragraph is web pages.
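The genre decision just described is a simple tally: sum each typed word's per-genre frequency and take the genre with the largest total. Below is a sketch of that step; the frequency table is a made-up stand-in for the COCA data, not real corpus figures:

```python
from collections import Counter

# Hypothetical per-genre frequencies for three of the 24 typed words;
# real values would come from COCA's 8-genre breakdown.
genre_freq = {
    "republicans":   {"web pages": 1200, "newspapers": 900,  "fiction": 40},
    "redistricting": {"web pages": 800,  "newspapers": 700,  "fiction": 10},
    "elections":     {"web pages": 1100, "newspapers": 1000, "fiction": 90},
}

totals = Counter()
for word, freqs in genre_freq.items():
    totals.update(freqs)  # add this word's counts to the running totals

main_genre = totals.most_common(1)[0][0]
print(main_genre)  # web pages
```

With the full 24 words and all 8 genres, the same loop reproduces the totals in Table 2 and the conclusion that the paragraph belongs to the web-pages genre.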
  • 17. 17 It is reasonable to expect the correct suggestion to be among the more frequently used words. For that reason, I picked the 107 most frequently used words among the ones starting with “st”.11 Then I formulated the probability of each of the 107 words being used in the web-page genre. For example, the formula for one of the least frequently used words, “styles”, is as follows: P(styles|web page) = P(web page|styles) × P(styles) / P(web page) According to this formula, to find the probability of “styles” being the following word given that the genre of the paragraph is web page, we need to multiply the probability of the genre being web page given that it includes “styles” by the probability of “styles” being used, and divide the product by the probability of the genre of the paragraph being web page. As I already knew the probability of the genre being web page given that it includes “styles” -in basic terms, the probability of “styles” being used in the “web page” genre- from COCA, I substituted it. For the main feature of Bayesian Statistics, the prior, I used the general usage ratio of “styles” among the most popular 107 words starting with “st”. With substitutions, the formula became: P(styles|web page) = 0.126607818411097 × 0.00180771229791463 / P(web page) However, since I did not have any direct statistics for P(web page), I used the following marginal-likelihood formula: P(B) = Σ_{i=1}^{n} P(B|A_i) × P(A_i) 11 I picked 107 words because that was the maximum number of words that I could obtain from the COCA data set which I was using.
  • 18. 18 = 0.124323532630005 With this addition, the formula became: P(styles|web page) = 0.126607818411097 × 0.00180771229791463 / 0.124323532630005 And the result was: P(styles|web page) = 0.0018409266975627 Though, this was the probability for just one of the 107 words. For the rest, I calculated the probabilities by following the same steps. Here is the Bayes’ Box for the 10 of them with the highest probability: TABLE 3: Bayes’ Box for Top 10 Words in Semantics
WORD      FREQUENCY   PRIOR (USAGE RATIO)   LIKELIHOOD    PRIOR × LIKELIHOOD   POSTERIOR
still     791726      0.090240405           0.124828539   0.011264578          0.090606965
state     577192      0.065787962           0.154553424   0.010167755          0.081784635
states    396934      0.045242274           0.158016194   0.007149012          0.057503289
story     319852      0.036456519           0.163194227   0.005949493          0.047854926
start     275954      0.031453054           0.143378969   0.004509706          0.036273957
students  383366      0.043695803           0.079600173   0.003478193          0.027976952
stop      270980      0.030886121           0.109273747   0.003375042          0.027147251
started   234505      0.026728725           0.119617919   0.003197234          0.025717049
study     261496      0.029805141           0.105523603   0.003145146          0.025298073
As you can see in the table above, there are different parameters affecting the posterior probability. Each word has its own characteristics, and with the use of Bayesian Statistics, I was
19

able to assess their different parts together. This is just a small portion of the table; the full version is included as an appendix. However, I still lacked enough information to make my suggestion more accurate. Language has another aspect, syntax, and I could reach the correct suggestion only with an algorithm that accounts for multiple aspects of language.

3. 3 SYNTACTIC APPROPRIATENESS

Syntax is a subtopic of linguistics focused on the arrangement of words in a sentence or a paragraph. In language, words should follow each other in the correct order to produce a proper meaning. Syntax assesses this ordering and our behavior in placing words in order. In my case, I used syntax as another aspect of my algorithm. Since I was trying to find the correct word suggestion purely with statistics, I hoped that statistics from another area would increase the accuracy of my prediction. My suggestion had to be consistent with "twenty", so I included in my research the usage frequency of each word starting with "st" that follows "twenty". However, I did not need to calculate this conditional probability myself with Bayes' theorem: the iWeb Corpus already contains this data, so I included it directly. I only used the words common to both the top 100 most used "twenty st" phrases and the semantic calculations, which left 57 words. Here is a small sample:

TABLE 4: 10 Most Used Words Starting with "st" After "twenty"

WORD FREQUENCY USAGE RATIO
twenty students 532 0,303306727
twenty states 366 0,208665906
20

twenty steps 140 0,07981756
twenty stories 134 0,076396807
twenty studies 50 0,028506271
twenty straight 39 0,022234892
twenty staff 34 0,019384265
twenty state 31 0,017673888
twenty stores 31 0,017673888
twenty standard 29 0,016533637

4. COMBINING THE DATA

As we can see from the sample tables, the correct word "states" is not the first option in any category. If we evaluated each category one by one, we could make multiple suggestions to complete the phrase "st"; however, all of them would be wrong. Hence, I needed to combine the statistics from all areas and reach one correct answer. Each set of statistics was gathered in a different way and offers a different perspective. The way to obtain the correct result was to combine these different views into one simple mathematical index. I created this index by multiplying the Jaccard Similarity Index, the semantic posterior probability, and the syntactic probability, and turning each product into a share out of 1. The final table includes the 57 words that are common to both the semantic and the syntactic statistics.

TABLE 5: Final List of Suggestions

WORD JACCARD SIMILARITY SEMANTIC SYNTACTIC PRODUCT FINAL INDEX PERCENTAGE INDEX
states 0.5 0,0575 0,2087 0,0120 0,4355 43,5547
students 0.333 0,0280 0,3033 0,0085 0,3080 30,8016
state 0.5 0,0818 0,0177 0,0014 0,0525 5,2468
stories 0.333 0,0157 0,0764 0,0012 0,0436 4,3627
story 0.4 0,0479 0,0143 0,0007 0,0248 2,4759
21

steps 0.5 0,0084 0,0798 0,0007 0,0244 2,4384
still 0.5 0,0906 0,0057 0,0005 0,0188 1,8751
studies 0.333 0,0140 0,0285 0,0004 0,0145 1,4451
strong 0.333 0,0187 0,0131 0,0002 0,0089 0,8924
staff 0.5 0,0118 0,0194 0,0002 0,0083 0,8322
student 0.333 0,0135 0,0160 0,0002 0,0078 0,7835
standard 0.333 0,0129 0,0165 0,0002 0,0078 0,7768
straight 0.2857 0,0094 0,0222 0,0002 0,0076 0,7561
study 0.4 0,0253 0,0046 0,0001 0,0042 0,4188
stars 0.5 0,0073 0,0154 0,0001 0,0041 0,4077
stores 0.4 0,004234 0,017674 0,000075 0,002716 0,271613
stone 0.4 0,005000 0,014823 0,000074 0,002690 0,269044
star 0.5 0,012268 0,005131 0,000063 0,002285 0,228489
step 0.5 0,016406 0,003421 0,000056 0,002037 0,203712
street 0.5 0,016775 0,002851 0,000048 0,001736 0,173574
start 0.5 0,036274 0,001140 0,000041 0,001501 0,150137
standards 0.333 0,008990 0,004561 0,000041 0,001488 0,148839
studio 0.333 0,003396 0,011973 0,000041 0,001476 0,147579
statements 0.333 0,005812 0,005131 0,000030 0,001082 0,108242
starts 0.5 0,007032 0,003991 0,000028 0,001019 0,101866
stations 0.333 0,002831 0,009122 0,000026 0,000937 0,093741
stages 0.4 0,002357 0,009692 0,000023 0,000829 0,082925
structures 0.333 0,003192 0,006842 0,000022 0,000793 0,079277
stock 0.4 0,007265 0,002851 0,000021 0,000752 0,075171
store 0.4 0,011840 0,001710 0,000020 0,000735 0,073511
stay 0.5 0,017407 0,001140 0,000020 0,000720 0,072048
st 1.0 0,008653 0,002281 0,000020 0,000716 0,071627
statement 0.333 0,014633 0,001140 0,000017 0,000606 0,060565
station 0.333 0,007009 0,002281 0,000016 0,000580 0,058019
stocks 0.4 0,002092 0,006271 0,000013 0,000476 0,047626
strategies 0.2857 0,003509 0,003421 0,000012 0,000436 0,043566
status 0.5 0,010270 0,001140 0,000012 0,000425 0,042507
stage 0.4 0,009545 0,001140 0,000011 0,000395 0,039505
streets 0.5 0,004701 0,002281 0,000011 0,000389 0,038917
stops 0.5 0,002682 0,003991 0,000011 0,000388 0,038847
22

stands 0.4 0,0049975 0,0017104 0,0000085 0,0003103 0,0310264
styles 0.4 0,0018409 0,0039909 0,0000073 0,0002667 0,0266684
stones 0.4 0,0018290 0,0039909 0,0000073 0,0002650 0,0264957
steel 0.5 0,0023351 0,0028506 0,0000067 0,0002416 0,0241620
studying 0.25 0,0026981 0,0022805 0,0000062 0,0002233 0,0223349
striking 0.2857 0,0024579 0,0022805 0,0000056 0,0002035 0,0203466
strategic 0.25 0,0031492 0,0017104 0,0000054 0,0001955 0,0195516
stayed 0.333 0,0035177 0,0011403 0,0000040 0,0001456 0,0145598
statistical 0.333 0,0025615 0,0011403 0,0000029 0,0001060 0,0106021
stroke 0.333 0,0015650 0,0017104 0,0000027 0,0000972 0,0097160
stairs 0.4 0,0013972 0,0011403 0,0000016 0,0000578 0,0057830

The final table is a predictive text list in which every possible word is attached to an index. This index shows how probable each word is compared with all the other words in the list. As the final table shows, I managed to predict the word correctly. According to my index, "states" became the most appropriate suggestion to complete the word, with the index number 0,4355, about 41.4% higher than the second option, "students". This shows how strong conditional probability and Bayesian Statistics are; they led me to the correct result. Even though I did not use Bayes' rule while calculating the syntactic value, since its conditional frequency had already been put into a data set, I used conditional probability throughout and Bayesian Statistics wherever possible. Though linguistics and mathematics seem very different, a mathematical interpretation of linguistics can be quite accurate. Bayesian Statistics is one of the strongest tools in mathematics and becomes very effective when used correctly. In my research, it was one of my two main tools, together with the Jaccard Index, and played a huge role in making the correct prediction.
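The scoring pipeline of Sections 3 and 4 can be sketched in Python (the language listed under Programs). This is a minimal illustration, not the original code: it uses only three candidate words, with prior and likelihood values excerpted from Table 3 and Jaccard and syntactic values excerpted from Tables 4 and 5, and follows the combination rule as described in the text (all three factors multiplied, then normalised). Because only three words are included, the computed posteriors and shares differ from the full 107- and 57-word tables.

```python
# Sketch of the scoring pipeline with three sample candidate words.
# Semantic step (Section 3.2): prior P(word) and likelihood P(genre | word),
# values excerpted from Table 3.
semantic = {
    # word: (prior, likelihood)
    "states":   (0.045242274, 0.158016194),
    "students": (0.043695803, 0.079600173),
    "state":    (0.065787962, 0.154553424),
}

# Marginal likelihood P(genre) via the law of total probability,
# here restricted to the three sample words.
evidence = sum(p * l for p, l in semantic.values())

# Posterior P(word | genre) by Bayes' theorem.
posterior = {w: (p * l) / evidence for w, (p, l) in semantic.items()}

# Syntactic step (Section 3.3): usage ratio of "twenty <word>" from Table 4,
# and Jaccard similarity of each word with "st" from Table 5.
syntactic = {"states": 0.208665906, "students": 0.303306727, "state": 0.017673888}
jaccard = {"states": 0.5, "students": 0.333, "state": 0.5}

# Combination step (Section 4): multiply the three scores per word and
# normalise the products into an index that sums to 1.
product = {w: jaccard[w] * posterior[w] * syntactic[w] for w in semantic}
total = sum(product.values())
index = {w: pr / total for w, pr in product.items()}

for w, share in sorted(index.items(), key=lambda kv: -kv[1]):
    print(f"{w:8s} {share:.4f}")
```

Even on this reduced word list, "states" comes out on top, matching the outcome of Table 5.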
23

5. CONCLUSION

In this paper, I tried to build a predictive text list that suggests the correct word to complete the "st" phrase in a random paragraph. To do that, I used the Jaccard Similarity Index and Bayesian Statistics. I ended up with a list containing 57 words, and I managed to predict the word correctly. Throughout, I tried to use purely statistical methods and data to avoid subjectivity. Especially when using Bayes' theorem, which is often regarded as subjective, I based all my decisions entirely on data.

Before choosing my Extended Essay topic, I was sure that I wanted to do research combining computer science and mathematics. Since I aim to develop myself in mathematical computer science or Artificial Intelligence, Natural Language Processing seemed very attractive to me. It is highly linked with statistics, and the mathematical background it contains is fascinating. I also wanted to do research that I could develop further in the future. These motivations led me to this research question. However, I faced some challenges in doing the background calculations and reflecting them in written text. The mathematics behind this paper required considerable effort, since by its nature it involves processing big data sets, and expressing it in text was demanding as well. Creating a prediction index from zero requires correctly building logical bonds between the different variables in the data. Still, I learned a great deal from this research about statistics and the vast range of areas in which it can be used effectively. It widened my perspective on the applications of mathematics and helped me in making career plans as a computer scientist specializing in data engineering.
24

6. BIBLIOGRAPHY

• Brewer, Brendon J. STATS 331: Introduction to Bayesian Statistics. University of Auckland.
• Routledge, Richard. Bayes's Theorem. Encyclopedia Britannica. https://www.britannica.com/topic/Bayess-theorem Retrieval Date: February 15, 2021.
• Schoot, Rens & Kaplan, David & Denissen, Jaap & Asendorpf, Jens & Neyer, Franz & Aken, Marcel. (2013). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. (p. 2). Child Development. 85. 10.1111/cdev.12169.
• Chomsky, Noam. (2002). Syntactic Structures. Berlin: Mouton de Gruyter.
• Berman, Ari. (2012, January 31). How the GOP Is Resegregating the South. The Nation. https://www.thenation.com/article/archive/how-gop-resegregating-south/ Retrieval Date: December 7, 2020.
• Brownlee, Jason. (2017, September 22). What Is Natural Language Processing? Machine Learning Mastery. https://machinelearningmastery.com/natural-language-processing/ Retrieval Date: December 7, 2020.
• Paul Jaccard. Wikipedia. https://en.wikipedia.org/wiki/Paul_Jaccard Retrieval Date: December 7, 2020.
• Sieg, Adrien. (2019, November 13). Text Similarities: Estimate the Degree of Similarity Between Two Texts. https://medium.com/@adriensieg/text-similarities-da019229c894 Retrieval Date: November 20, 2020.
• An Intuitive (and Short) Explanation of Bayes' Theorem. BetterExplained. https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Retrieval Date: January 28, 2021.
25

• Mahendru, Khyati. (2019, June 13). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2019/06/introduction-powerful-bayes-theorem-data-science/ Retrieval Date: December 3, 2020.
• Glen, Stephanie. (2020, September 16). Jaccard Index / Similarity Coefficient. Statistics How To. https://www.statisticshowto.com/jaccard-index/ Retrieval Date: January 3, 2021.

DATA BASES:
• https://www.english-corpora.org/coca/ Retrieval Date: February 16, 2021.
• https://www.wordfrequency.info/samples.asp Retrieval Date: February 16, 2021.
• https://www.english-corpora.org/iweb/ Retrieval Date: February 17, 2021.
• https://github.com/dwyl/english-words Retrieval Date: December 12, 2020.

PROGRAMS:
• Visual Studio 2019
• Python 3.7
26

7. APPENDICES

FIGURE 3: Code for Finding Words Starting with "st"

FIGURE 4: Code for Calculating Jaccard Index

FIGURE 5: Code for Calculating Jaccard Distance
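The code in Figures 3 to 5 appears as images in the original document and does not survive text extraction. The following is a reconstruction sketch, not the author's code; the function names are my own. It treats each word as its set of distinct characters, which reproduces the Jaccard values in Table 5 (for example, J("st", "states") = |{s,t}| / |{s,t,a,e}| = 2/4 = 0.5).

```python
# Reconstruction sketch of the computations behind Figures 3-5.
# Each word is treated as its set of distinct characters.

def find_words_with_prefix(prefix, words):
    """Figure 3: filter a word list down to the words starting with a prefix."""
    return [w for w in words if w.startswith(prefix)]

def jaccard_index(a, b):
    """Figure 4: |A intersect B| / |A union B| over the character sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

def jaccard_distance(a, b):
    """Figure 5: the complement of the Jaccard index."""
    return 1.0 - jaccard_index(a, b)

words = ["states", "students", "straight", "apple", "study"]
for w in find_words_with_prefix("st", words):
    print(w, round(jaccard_index("st", w), 4), round(jaccard_distance("st", w), 4))
```

With this character-set definition, `jaccard_index("st", "straight")` gives 2/7 ≈ 0.2857, matching the value for "straight" in Table 5.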
27

TABLE 6: General and Genre-Specific Frequencies of Words in the Paragraph

WORD FREQ BLOG WEB TVM SPOK FICTION MAGAZINE NEWS ACADEMIC
nationwide 15733 1546 1889 214 1911 106 3216 5059 1792
republican 124514 21658 20867 904 43517 475 10889 22121 4080
have 5025573 781709 687895 820686 879668 423071 503134 522808 406599
a 21889251 2783458 2827106 2519099 2716641 2749208 3104298 2959649 2229222
major 196857 24133 25756 10307 24642 6983 31077 36638 37315
advantage 55691 9484 8721 3212 5275 3253 8873 7923 8949
in 16560377 2003430 2257672 1225718 2020330 1671503 2310522 2355671 2699192
heading 28118 3089 2997 4616 3420 5655 3519 3822 1000
into 1461816 166362 180584 116756 148250 307485 226799 171402 144177
the 50074257 6272412 7101104 3784652 5769026 6311500 6805845 6582642 7447070
november 87176 24573 22587 1138 7249 2224 10651 11000 7733
elections 39380 7212 6709 352 8261 232 3155 7398 6061
party 243697 39715 35760 31649 43730 15044 24195 33310 20277
controls 26347 3131 3786 1272 1766 1639 4708 2995 7049
process 220128 31106 33489 5266 26450 5496 27362 22973 67985
twenty 36338 2939 3815 2718 3313 14068 3807 971 4707
redistricting 1724 308 424 4 202 4 104 562 116
TOTAL 96086977 12176265 13221161 8528563 11703651 11517946 13082154 12746944 13093324

TABLE 7: Bayes' Box for All Words Used in Semantics Calculations

WORD FREQUENCY RATIO LIKELIHOOD RATIO x LIKELIHOOD POSTERIOR
still 791726 0,090240405 0,124828539 0,011264578 0,090606965
state 577192 0,065787962 0,154553424 0,010167755 0,081784635
states 396934 0,045242274 0,158016194 0,007149012 0,057503289
story 319852 0,036456519 0,163194227 0,005949493 0,047854926
start 275954 0,031453054 0,143378969 0,004509706 0,036273957
students 383366 0,043695803 0,079600173 0,003478193 0,027976952
stop 270980 0,030886121 0,109273747 0,003375042 0,027147251
started 234505 0,026728725 0,119617919 0,003197234 0,025717049
study 261496 0,029805141 0,105523603 0,003145146 0,025298073
strong 152080 0,017333978 0,134468701 0,002330877 0,018748482
stay 203720 0,023219871 0,093201453 0,002164126 0,017407209
street 189237 0,021569108 0,09668828 0,00208548 0,016774619
stuff 153066 0,017446361 0,118713496 0,002071119 0,016659103
step 128479 0,014643951 0,139283463 0,00203966 0,016406067
stories 112939 0,012872712 0,151940428 0,001955885 0,015732222
stand 138407 0,015775538 0,120716438 0,001904367 0,01531783
statement 86767 0,009889645 0,183952424 0,001819224 0,014632984
studies 136505 0,01555875 0,111592982 0,001736247 0,013965556
student 147156 0,016772743 0,100220175 0,001680967 0,01352091
standard 86472 0,009856021 0,163266722 0,00160916 0,012943328
28

star 107361 0,012236936 0,124635575 0,001525158 0,012267649
starting 93909 0,010703686 0,137729078 0,001474209 0,011857842
store 97668 0,011132134 0,13223369 0,001472043 0,011840422
staff 109761 0,012510486 0,117528084 0,001470333 0,011826671
style 72902 0,008309322 0,166072261 0,001379948 0,011099651
status 75175 0,008568397 0,149012305 0,001276797 0,010269951
stage 84800 0,009665448 0,122771226 0,001186639 0,009544765
straight 89870 0,010243323 0,11370869 0,001164755 0,00936874
standards 68608 0,007819894 0,142927938 0,001117681 0,008990103
stupid 66726 0,007605385 0,1457303 0,001108335 0,008914926
st 111984 0,012763862 0,084279897 0,001075737 0,008652722
standing 90839 0,010353769 0,102345909 0,001059666 0,008523454
structure 63205 0,007204064 0,145906178 0,001051117 0,008454694
steps 69771 0,007952452 0,13157329 0,00104633 0,008416189
strategy 66602 0,007591252 0,133224228 0,001011339 0,008134732
stopped 88086 0,010039984 0,097177758 0,000975663 0,007847775
stated 38017 0,004333152 0,220138359 0,000953893 0,007672667
strength 58373 0,006653316 0,142000582 0,000944775 0,007599323
stars 69625 0,007935811 0,114298025 0,000907048 0,007295864
stock 71592 0,008160009 0,110682758 0,000903172 0,007264693
starts 58138 0,006626531 0,131927483 0,000874222 0,007031827
station 74456 0,008486446 0,102678092 0,000871372 0,007008907
storm 51639 0,005885779 0,133542478 0,000786002 0,006322226
stood 83875 0,009560017 0,07961848 0,000761154 0,006122365
strange 55570 0,006333832 0,119003059 0,000753745 0,006062773
steve 58774 0,006699022 0,11185218 0,0007493 0,006027018
statements 36168 0,004122405 0,175265428 0,000722515 0,005811571
stress 50001 0,005699081 0,126617468 0,000721603 0,005804237
struggle 41878 0,004773227 0,140766035 0,000671908 0,005404513
stick 52290 0,00595998 0,112583668 0,000670996 0,005397179
stone 57458 0,006549025 0,094921508 0,000621643 0,005000206
stands 49137 0,005600603 0,110934734 0,000621301 0,004997456
stuck 45673 0,005205778 0,115297878 0,000600215 0,004827849
strike 40442 0,004609552 0,129864992 0,000598619 0,004815013
strongly 31957 0,003642438 0,161404387 0,000587905 0,004728835
streets 54014 0,00615648 0,094938349 0,000584486 0,004701331
stores 41045 0,004678282 0,112510659 0,000526357 0,004233765
statistics 29092 0,003315887 0,155025437 0,000514047 0,004134751
stronger 32229 0,00367344 0,135809364 0,000498888 0,004012817
storage 26955 0,003072313 0,160601002 0,000493417 0,003968811
string 21127 0,002408041 0,191035168 0,000460021 0,003700189
struck 35801 0,004080574 0,111477333 0,000454892 0,003658933
stayed 39818 0,004538429 0,096363454 0,000437339 0,003517747
strategies 39768 0,00453273 0,096233152 0,000436199 0,003508579
studio 39195 0,00446742 0,09450185 0,000422179 0,003395813
struggling 26677 0,003040627 0,138471342 0,00042104 0,003386645
29

stable 25662 0,002924938 0,138726522 0,000405766 0,003263794
studied 33515 0,003820018 0,105922721 0,000404627 0,003254626
structures 26349 0,003003242 0,132149228 0,000396876 0,003192284
stream 24393 0,002780298 0,141843972 0,000394369 0,003172115
strategic 26107 0,002975659 0,131573907 0,000391519 0,003149195
staying 32297 0,003681191 0,102579187 0,000377614 0,003037346
stretch 26762 0,003050315 0,115761154 0,000353108 0,002840235
stations 24804 0,002827143 0,124496049 0,000351968 0,002831067
strikes 20666 0,002355497 0,143714313 0,000338519 0,002722885
studying 27294 0,003110952 0,107825896 0,000335441 0,002698131
stops 26908 0,003066956 0,108703731 0,00033339 0,002681629
stephen 26970 0,003074023 0,104857249 0,000322334 0,0025927
statistical 17579 0,002003643 0,158939644 0,000318458 0,002561528
striking 19419 0,002213365 0,138060662 0,000305579 0,002457931
strip 21378 0,00243665 0,122836561 0,00029931 0,002407507
stomach 27740 0,003161787 0,094520548 0,000298854 0,00240384
stages 19026 0,002168571 0,135130874 0,000293041 0,002357083
stolen 22366 0,002549262 0,11459358 0,000292129 0,002349749
steel 32776 0,003735787 0,077709299 0,000290305 0,00233508
steady 24491 0,002791468 0,101180025 0,000282441 0,002271821
steal 20956 0,002388551 0,117675129 0,000281073 0,002260819
stability 21016 0,00239539 0,1161496 0,000278224 0,002237899
stewart 20236 0,002306486 0,120181854 0,000277198 0,002229648
stadium 25212 0,002873647 0,094756465 0,000272297 0,002190226
stepped 34088 0,003885328 0,069877963 0,000271499 0,002183808
stranger 19155 0,002183274 0,122631167 0,000267737 0,002153554
stopping 18673 0,002128336 0,124136454 0,000264204 0,002125134
stem 19238 0,002192735 0,120178813 0,00026352 0,002119633
stocks 25119 0,002863047 0,090847566 0,000260101 0,002092129
stake 19616 0,002235819 0,107616232 0,00024061 0,001935357
styles 15860 0,001807712 0,126607818 0,000228871 0,001840927
stones 16858 0,001921464 0,11834144 0,000227389 0,001829008
structural 16140 0,001839627 0,121623296 0,000223741 0,001799671
struggled 16187 0,001844984 0,117748811 0,000217245 0,001747413
steven 21003 0,002393908 0,087416083 0,000209266 0,001683238
stanford 15967 0,001819908 0,113859836 0,000207214 0,001666735
stroke 17508 0,00199555 0,097498286 0,000194563 0,001564971
staring 25605 0,002918441 0,062331576 0,000181911 0,001463207
stairs 23029 0,00262483 0,066177428 0,000173705 0,001397197
stole 17208 0,001961356 0,077754533 0,000152504 0,001226673
stared 23981 0,002733339 0,046828739 0,000127999 0,001029562
stir 19479 0,002220204 0,05303147 0,000117741 0,00094705
TOTAL 8773520 0,124323533
30

TABLE 8: Usage Frequencies After "twenty" for Words Starting with "st"

WORD FREQUENCY RATIO
st 543864 0,006317719
staff 2602407 0,03023049
stage 1569009 0,018226169
stages 427339 0,004964123
stairs 28180 0,000327349
standard 2427376 0,028197267
standards 1225651 0,0142376
stands 595947 0,006922733
star 1466255 0,017032542
stars 831227 0,00965583
start 5497965 0,063866327
starts 985564 0,011448664
state 6327227 0,073499331
statement 1456109 0,016914683
statements 619896 0,007200934
states 3419996 0,039727896
station 1048322 0,012177683
stations 380514 0,004420187
statistical 173317 0,002013312
status 1256637 0,014597545
stay 2338130 0,027160554
stayed 352807 0,004098332
steam 366250 0,004254491
steel 923989 0,010733387
step 2670622 0,0310229
steps 1337165 0,015532987
still 9337823 0,10847149
stock 1648872 0,019153887
stocks 320921 0,003727933
stone 763898 0,008873712
stones 273985 0,003182708
stops 384562 0,00446721
store 2370994 0,027542314
stores 861682 0,010009606
stories 1251367 0,014536326
story 3185323 0,037001851
straight 1228434 0,014269928
strategic 566227 0,006577495
strategies 699520 0,008125874
street 2085396 0,024224705
streets 510693 0,005932392
striking 265952 0,003089394
strings 265409 0,003083086
stroke 307775 0,003575224
strong 2479756 0,028805732
structures 427084 0,00496116
student 2920987 0,033931229
students 6528359 0,075835752
studies 1471363 0,017091879
studio 789269 0,009168431
study 2888625 0,033555301
studying 374100 0,004345679
stunning 458706 0,005328493
styles 546658 0,006350175