Based on work published in ACM Transactions on Information Systems (TOIS), 36(1), July 2017 by Lulwah Alkwai, Michael L. Nelson, and Michele C. Weigle
Presented at ACM SIGIR 2019 on July 24, 2019 by Michele C. Weigle
Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages
1. Comparing the Archival Rate of Arabic,
English, Danish, and Korean Language
Web Pages
Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle (@weiglemc)
Web Sciences and Digital Libraries (WS-DL) Group (@WebSciDL)
Department of Computer Science
Old Dominion University
Norfolk, Virginia, USA
July 24, 2019 / ACM SIGIR 2019 / Paris, France
Published in ACM Transactions on Information Systems (TOIS), 36(1), July 2017
Extended version of Best Student Paper awardee at IEEE/ACM JCDL 2015
3. @weiglemc, @WebSciDL
Web archives are essential for studying recent
history and culture
ACM TOIS 36(1) 2017 / ACM SIGIR 2019 3
https://twitter.com/NetPreserve/status/1141321443373920256 (photo cropped and enlarged)
http://web.archive.org/web/19970222174751/http://www1.geocities.com/
4. @weiglemc, @WebSciDL
The Internet Archive holds the largest web
archive
https://archive.org/web
5. @weiglemc, @WebSciDL
But it’s not the only one
http://timetravel.mementoweb.org/list/19990518173206/http://geocities.com
https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
6. @weiglemc, @WebSciDL
We’ve studied recent (2010s) events in the
Middle East
• Iranian Elections and Protests - Jun 2009
– SalahEldeen and Nelson, TPDL 2012
• Egyptian Revolution - Jan 20 - Mar 1, 2011
– SalahEldeen and Nelson, TPDL 2012
– AlNoamany, Weigle, Nelson, ACM WebSci 2017
• Syrian Uprising - Mar 2012
– SalahEldeen and Nelson, TPDL 2012
• Egypt’s Presidential Election - 2012
– AlNoamany, Weigle, Nelson, TPDL 2015, IJDL 2016
7. @weiglemc, @WebSciDL
But, we can only study the past Web that already
exists in the archives
Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012.
11% of resources
shared in social
media disappear
each year
8. @weiglemc, @WebSciDL
How well is the Arabic Web archived?
• Arabic is the 4th most popular language on the Internet
• Anecdotally known that web archives and search engines
favor Western and English language pages
9. @weiglemc, @WebSciDL
We investigated the state of archival of Arabic language
web pages in 2014-2015
• Gathered Arabic language web pages
• Analyzed domains, TLDs, GeoIP
• Analyzed presence in Google’s index and web archives
• Compared this to the archival and indexing of English,
Danish, and Korean language web pages
10. @weiglemc, @WebSciDL
How to gather and detect Arabic
language web pages?
11. @weiglemc, @WebSciDL
Gathered URIs from Arabic website directories
• 15,743 seed URIs
– 15,092 were unique
– 11,014 were live on the Web in March-May 2014
Directory   Registered Country   Year Estab.   Directory URI           URIs
DMOZ        US                   1999          Dmoz.org/world/arabic   4,086
Raddadi     Saudi Arabia         2000          Raddadi.com             3,271
Star28      Lebanon              2004          Star28.com              8,386
Total                                                                  15,743
12. @weiglemc, @WebSciDL
We used four methods to determine the language
of the 11,014 URIs
• HTTP Content-Language header
• HTML Title tag
• Language Detection API
• Trigrams
https://github.com/kent37/guess-language
http://detectlanguage.com
https://github.com/decultured/Python-Language-Detector
Sample of test results evaluated by a native speaker (first author)
13. @weiglemc, @WebSciDL
HTTP Content-Language classified 41% of the
seed URIs as Arabic
HTTP response header
ex:
Content-Language: ar
HTML tag
ex:
<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar">
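The two signals on this slide can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function names `header_says_arabic` and `html_lang` are hypothetical, and the regular expression is a rough stand-in for a real HTML parser.

```python
import re


def header_says_arabic(headers):
    """Return True if a Content-Language header lists Arabic ('ar', 'ar-SA', ...)."""
    # HTTP header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("content-language", "")
    # The header may carry several comma-separated language tags.
    tags = [tag.strip().lower() for tag in value.split(",")]
    return any(tag == "ar" or tag.startswith("ar-") for tag in tags)


# Matches the first lang or xml:lang attribute on the opening <html> tag.
LANG_ATTR = re.compile(r'<html\b[^>]*?\blang\s*=\s*["\']?([A-Za-z-]+)', re.IGNORECASE)


def html_lang(html):
    """Return the language declared on the <html> tag, or None if there is none."""
    match = LANG_ATTR.search(html)
    return match.group(1).lower() if match else None


page = '<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar">'
print(header_says_arabic({"Content-Language": "ar"}), html_lang(page))
```

Both checks rely on what the server or page author declares, so they say nothing about pages that omit the header or attribute.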
14. @weiglemc, @WebSciDL
HTML Title tag classified 38% of the seed URIs as
Arabic
Extract text from HTML
title tag
ex:
<TITLE>دليل العرب الشامل</TITLE>
guess-language library
https://github.com/kent37/guess-language
15. @weiglemc, @WebSciDL
The Language Detection API test classified 39%
of the seed URIs as Arabic
Extract title and text
Language Detection API
http://detectlanguage.com
16. @weiglemc, @WebSciDL
The Trigram test classified 36% of the seed URIs
as Arabic
Extract title and text
Test sequences of three letters (trigrams)
Python-Language-Detector tool
https://github.com/decultured/Python-Language-Detector
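To make the trigram idea concrete, here is a toy sketch: it counts overlapping three-character sequences and picks the language whose sample text has the most similar trigram profile. It is not how the Python-Language-Detector tool works internally. The sample texts, the function names, and the use of cosine similarity are all assumptions chosen for illustration.

```python
from collections import Counter


def trigrams(text):
    """Count overlapping three-character sequences, ignoring case and extra spaces."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def similarity(profile_a, profile_b):
    """Cosine similarity between two trigram count profiles (0.0 if either is empty)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[t] * profile_b[t] for t in shared)
    norm_a = sum(c * c for c in profile_a.values()) ** 0.5
    norm_b = sum(c * c for c in profile_b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_language(text, samples):
    """Return the key of the sample whose trigram profile is closest to the text."""
    target = trigrams(text)
    profiles = {lang: trigrams(sample) for lang, sample in samples.items()}
    return max(profiles, key=lambda lang: similarity(target, profiles[lang]))


# Tiny made-up reference texts; a real detector would use large corpora.
SAMPLES = {
    "ar": "دليل المواقع العربية الشامل على شبكة الإنترنت",
    "en": "a comprehensive directory of web sites on the internet",
}

print(guess_language("دليل العرب الشامل", SAMPLES))
```

With reference texts this small, the result is only meaningful for clearly distinct scripts. Separating languages that share an alphabet needs much larger profiles.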
17. @weiglemc, @WebSciDL
We took the union to obtain 7976 Arabic seed
URIs
72.4% of the
seed URIs were
determined to be
in Arabic
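Taking the union means a URI counts as Arabic if at least one of the four tests says so. The sketch below shows that set operation on made-up URI sets (the set contents and the name `union_of_tests` are hypothetical), then recomputes the slide's percentage from the reported counts of 7,976 Arabic seeds out of 11,014 live seeds.

```python
def union_of_tests(*results):
    """Combine per-test result sets: a URI is kept if any test flagged it."""
    arabic = set()
    for flagged in results:
        arabic |= set(flagged)
    return arabic


# Made-up results for four tests over a handful of placeholder URIs.
header_test = {"uri-1", "uri-2"}
title_test = {"uri-2", "uri-3"}
api_test = {"uri-2"}
trigram_test = {"uri-4"}

arabic_seeds = union_of_tests(header_test, title_test, api_test, trigram_test)
print(sorted(arabic_seeds))

# Share of live seed URIs classified as Arabic, using the counts on the slides.
share = 100 * 7976 / 11014
print(round(share, 1))
```

A union favors recall: a single false positive from any one test is enough to admit a page, which is why the notes mention a manual check by a native reader.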
18. @weiglemc, @WebSciDL
We expanded the dataset by crawling the live
Web and the past Web
• Crawled the 7976 live Arabic seed URIs, 2 levels deep
– gathered all URIs linked from each seed URI
– then, gathered all URIs linked from those URIs
– 575,242 additional URIs
• Crawled the most recent memento of the Arabic seed URIs
– 515,821 additional URIs
• Total of 663,443 unique crawled URIs gathered
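The two-level crawl described above amounts to a breadth-first traversal that stops after two hops from each seed. The following is a sketch under stated assumptions: `crawl` and `get_links` are hypothetical names, and the link graph is a small in-memory dictionary rather than live or archived pages.

```python
def crawl(seeds, get_links, max_depth=2):
    """Collect URIs reachable within max_depth link hops of the seeds.

    get_links is any callable that returns the URIs linked from a given URI.
    The seeds themselves are not included in the result.
    """
    seen = set(seeds)
    frontier = list(seeds)
    discovered = set()
    for _ in range(max_depth):
        next_frontier = []
        for uri in frontier:
            for link in get_links(uri):
                if link not in seen:      # skip seeds and already-found URIs
                    seen.add(link)
                    discovered.add(link)
                    next_frontier.append(link)
        frontier = next_frontier          # the next level starts from the new URIs
    return discovered


# Toy link graph standing in for fetched pages.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["a", "d"], "d": ["e"]}
found = crawl(["a"], lambda uri: LINKS.get(uri, []))
print(sorted(found))
```

Here `e` is three hops from the seed `a`, so it is not collected. The same function could be pointed at the most recent memento of each seed by swapping in a different `get_links`.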
20. @weiglemc, @WebSciDL
292,670 were live and Arabic
Total Arabic Dataset = 300,646 URIs
7,976 seed URIs
+ 292,670 crawled URIs
21. @weiglemc, @WebSciDL
What are the characteristics of our
Arabic language dataset?
22. @weiglemc, @WebSciDL
Dataset has 17,536 unique domains
Rank Domain URIs GeoIP Category
1 Alarab.net 284 US News
2 Aljarida.com 248 US News
3 Arabic.cnn.com 245 US News
4 Alarabiya.net 231 US News
5 Ar.wikipedia.org 230 US Encyclopedia
6 Aljazeera.net 213 US News
7 Moheet.com 142 US News
8 Facebook.com 133 US Social
9 Al-sharq.com 132 US Middle East Portal
10 Lakii.com 123 US General Portal
17 Kuwaitclub.com.kw 71 Kuwait Sport
23. @weiglemc, @WebSciDL
First Arabic GeoIP is at rank 17
24. @weiglemc, @WebSciDL
6 of the top 10 domains are news websites
25. @weiglemc, @WebSciDL
Popular Western domains are in the Top 10
26. @weiglemc, @WebSciDL
TLD Percent
com 57.97%
net 15.07%
org 6.40%
gov.sa 1.94%
info 1.68%
edu.sa 1.27%
ws 1.16%
org.sa 0.97%
com.sa 0.80%
gov.eg 0.80%
Other 11.94%
Over half have a .com TLD
27. @weiglemc, @WebSciDL
Only ~10% have an Arabic ccTLD
ccTLD Country Percent
.sa Saudi Arabia 5.33%
.eg Egypt 2.00%
.jo Jordan 2.00%
.ae United Arab Emirates 1.06%
.kw Kuwait 0.82%
28. @weiglemc, @WebSciDL
Most are geo-located in the US
Geo-location Percent
United States 57.97%
Arabic Countries 10.53%
Germany 9.75%
Netherlands 5.29%
France 4.37%
Canada 3.31%
United Kingdom 3.07%
Other 5.71%
29. @weiglemc, @WebSciDL
Within Arabic countries, most are geo-located in
Saudi Arabia
Geo-location Percent
Saudi Arabia 4.75%
Egypt 1.97%
Jordan 1.42%
Kuwait 0.71%
United Arab Emirates 0.67%
30. @weiglemc, @WebSciDL
How well are these Arabic web pages
indexed and archived?
31. @weiglemc, @WebSciDL
53.77% of Arabic language web pages are
archived
• Used Memento aggregator to determine if archived
– checks multiple web archives
• 161,678 URIs were archived
• 97% of those were found in the Internet Archive
• Mementos also found in 9 other archives
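A Memento aggregator answers a TimeMap request with a list of archived copies (mementos) of a URI across many archives. The sketch below parses a link-format TimeMap and reports whether any mementos are listed. The sample TimeMap text, the `archive.example` host, and the function names are made up for illustration, and the URI pattern in `timemap_uri` is an assumption about the Time Travel service at timetravel.mementoweb.org rather than something stated on the slides.

```python
import re

# Assumed aggregator TimeMap pattern; verify against the service before relying on it.
AGGREGATOR_TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/"

ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')   # <uri>; followed by its parameters
REL = re.compile(r'rel="([^"]*)"')


def timemap_uri(uri_r):
    """Build the (assumed) aggregator TimeMap URI for an original resource."""
    return AGGREGATOR_TIMEMAP + uri_r


def memento_uris(timemap_text):
    """Return the URIs whose rel value contains the token 'memento'."""
    uris = []
    for uri, params in ENTRY.findall(timemap_text):
        rel = REL.search(params)
        if rel and "memento" in rel.group(1).split():
            uris.append(uri)
    return uris


def is_archived(timemap_text):
    """A URI counts as archived if its TimeMap lists at least one memento."""
    return len(memento_uris(timemap_text)) > 0


# Made-up TimeMap response for illustration only.
SAMPLE = '''<http://example.com/>; rel="original",
<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/web/20140301000000/http://example.com/>; rel="first memento"; datetime="Sat, 01 Mar 2014 00:00:00 GMT",
<http://archive.example/web/20150101142427/http://example.com/>; rel="last memento"; datetime="Thu, 01 Jan 2015 14:24:27 GMT"'''

print(is_archived(SAMPLE), len(memento_uris(SAMPLE)))
```

Grouping the memento URIs by host would give a per-archive breakdown, which is the kind of count behind the Internet Archive share on this slide.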
35. @weiglemc, @WebSciDL
We looked at indexing of Arabic seed URIs
• Used Google Custom Search API to query
– limited to 1000 queries/day
– tested only seeds
36. @weiglemc, @WebSciDL
69% of Arabic language seed URIs were indexed
by Google
Arabic Seed Dataset
(Live, Indexed, Archived) Percent
(1, 1, 1) 43.34%
(1, 1, 0) 25.59%
(1, 0, 1) 15.27%
(1, 0, 0) 15.76%
All seeds were live
• 82% of those listed in DMOZ were indexed
• 74.6% of the top-level seeds (path depth = 0) were indexed
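The (Live, Indexed, Archived) table can be produced by tallying one tuple per URI. Below is a minimal sketch with made-up records; the field names and `tuple_distribution` are hypothetical. The last lines add the two "indexed" rows from the slide to show where the 69% headline comes from.

```python
from collections import Counter


def tuple_distribution(records):
    """Return the percentage of records in each (live, indexed, archived) tuple."""
    counts = Counter(
        (int(r["live"]), int(r["indexed"]), int(r["archived"])) for r in records
    )
    total = sum(counts.values())
    return {key: 100 * n / total for key, n in counts.items()}


# Four made-up seed records, one per tuple that appears on the slide.
records = [
    {"live": True, "indexed": True, "archived": True},
    {"live": True, "indexed": True, "archived": False},
    {"live": True, "indexed": False, "archived": True},
    {"live": True, "indexed": False, "archived": False},
]
print(tuple_distribution(records))

# Indexed share = (1, 1, 1) + (1, 1, 0), using the percentages on the slide.
indexed_share = round(43.34 + 25.59, 2)
print(indexed_share)
```

Because every seed was live at analysis time, only four of the eight possible tuples occur.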
42. @weiglemc, @WebSciDL
We chose English, Danish, and Korean
• English
– most popular language on the Internet
– over 65 countries have English as an official language
• Danish
– European language
– 96% of population in Denmark uses the Internet
– Denmark has government initiative to archive Danish cultural heritage on the Web
• Korean
– Asian language
– 92% of population in South Korea uses the Internet (highest in Asian countries)
43. @weiglemc, @WebSciDL
We gathered seed URIs from DMOZ
• English
– sample of size 10,000
• Danish
– sample of size 10,000
• Korean
– all 2,347 URIs in DMOZ
44. @weiglemc, @WebSciDL
Crawled the seed URIs to expand the dataset
                  Arabic    English   Danish    Korean
Seeds             15,743    10,000    10,000    2,347
  Live            11,014     9,384     9,245    2,070
  In language      7,976     8,576     6,331    1,517
Crawled          663,443   224,249   174,369   16,016
  Live           482,905   176,261   131,484   11,099
  In language    292,670   137,950    99,019    7,965
Total            300,646   146,526   105,350    9,482
December 2015 - March 2016
562,004 total
45. @weiglemc, @WebSciDL
English and Arabic web pages were .com, Danish
and Korean were ccTLD
[Chart: TLD distribution by language; bar labels 47.37%, 84.64%, 66.34%, 60.28%, 37.50%]
46. @weiglemc, @WebSciDL
English and Arabic web pages were located in the
US, Danish and Korean were in their countries
[Chart: GeoIP location by language; bar labels 89.54%, 83.12%, 57.97%]
47. @weiglemc, @WebSciDL
Arabic is archived less than English, but more
than Danish and Korean
* Danish government archive is dark (not publicly available)
Arabic English Danish Korean
Seeds and Crawled 300,646 146,526 105,350 9,482
Archived 161,678 107,398 41,703 3,972
Percent 53.77% 73.30% 39.59% 41.89%*
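The Percent row is simply archived URIs divided by total URIs for each language. This short sketch recomputes the rates from the counts in the table above; the names `COUNTS` and `archival_rate` are hypothetical.

```python
# (seeds and crawled, archived) counts copied from the table on this slide.
COUNTS = {
    "Arabic": (300646, 161678),
    "English": (146526, 107398),
    "Danish": (105350, 41703),
    "Korean": (9482, 3972),
}


def archival_rate(total, archived):
    """Percentage of URIs in the dataset that have at least one archived copy."""
    return 100 * archived / total


rates = {lang: archival_rate(total, archived) for lang, (total, archived) in COUNTS.items()}

# List languages from most to least archived.
for lang in sorted(rates, key=rates.get, reverse=True):
    print(f"{lang}: {rates[lang]:.2f}%")
```

Sorting by rate gives the ordering stated in the slide title: English first, then Arabic, with Korean and Danish below.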
48. @weiglemc, @WebSciDL
>90% of pages listed in DMOZ were archived,
regardless of language
* DMOZ closed in 2017, but curlie.org began in 2018 with DMOZ data
Arabic English Danish Korean
DMOZ Seeds 2,904 8,576 6,331 1,517
Archived 2,774 8,014 6,164 1,358
Percent 95.52% 93.44% 97.36% 89.52%
49. @weiglemc, @WebSciDL
There are a few caveats
• DMOZ is no longer operational
• Some web pages are multilingual
• Not easy to characterize “the Arabic Web”
– 67% of Arabic dataset had neither Arabic ccTLD nor Arabic geo-location
• The language of a web page may shift over time
51. @weiglemc, @WebSciDL
Only about half of Arabic language web pages are
archived
• Analyzed over 500,000 web pages (2014-2016)
• Arabic is archived less than English, but more than Danish and Korean
• Most Arabic and English pages are geo-located in the US, Danish in
Denmark, and Korean in South Korea
• Web pages present in DMOZ are highly likely to be archived, regardless
of language
Old Dominion University Web Science and Digital Libraries (WS-DL)
Department of Computer Science https://ws-dl.cs.odu.edu/ @WebSciDL
Editor's Notes
Because the four tests gave broadly similar results, we added any URI that at least one test classified as Arabic to our collection of Arabic websites.
The reliability of the tests was measured by having a native reader (the first author) quickly evaluate a sample of pages.
Next, we measured the number of URIs reported as Arabic.
Figure 1 shows the intersection between the four language tests.
3.3 Crawling Arabic Seed URIs
To increase the size of our dataset, we crawled the Arabic seed URIs between January and March 2014.
Our first pass was to gather additional URIs linked from the live Web versions of our seed URIs.
This resulted in collecting 575,242 URIs, all of which were available on the live Web.
To gather even more URIs, we crawled the Arabic seed URIs that had at least one archived version (or, memento).
We crawled the most recent memento and gathered 515,821 URIs.
Of these, only 335,283 were available on the live Web.
Combining the three sets (original URIs, crawled live, and crawled archived),
we obtained a total of 663,443 unique URIs.
Sankey Diagram
Figures show the summary of collecting Arabic URIs for seed URIs and for crawled URIs.
Combining the seed URIs and crawled URIs, we collected 300,646 Arabic URIs that we analyze in the remainder of the paper.
4.1 Unique Domains
First, we investigate the number of unique domains in our dataset.
Out of the 300,646 Arabic URIs, there are 17,536 unique domains.
The most frequent domains are shown in Table 3.
We also tested the GeoIP location of the top-level webpage of each of these domains and found that the top 16 are all located in the US.
The first domain we find located in an Arabic country is the 17th most frequent.
We note that several of these top domains are popular Western sites, such as cnn.com and wikipedia.org.
This indicates that the Arabic language community is already using services on Western sites that are likely to be archived.
We investigate the top-level domain (TLD) and country code TLD (ccTLD), together termed effective TLD, of the unique Arabic language domains.
Generic TLDs such as .com, .net, and .org are open for any registrant.
In addition to TLDs, many sites also use the two-letter ccTLD of their home country.
Although a small percentage of the websites add the ccTLD, it may be a good indication of the source of the website.
Table 4 shows the distribution of the top 10 effective TLDs.
We also checked if the ccTLD was from a country where Arabic is an official language (listed in Table 1).
.ws is the Internet country code top-level domain (ccTLD) for Samoa.
The .ws country code has been marketed as a domain hack, with .ws purportedly standing for "World Site" or "Web site", providing a "global" Internet presence to registrants, as it supports all internationalized domain names.
We note that the top Arabic ccTLD, .sa for Saudi Arabia, is used in fewer URIs than the generic TLDs .com, .net, and .org.
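Checking whether a URI carries an Arabic ccTLD comes down to looking at the last label of its hostname. The sketch below is illustrative only: `ARABIC_CCTLDS` contains just the five ccTLDs shown on the slides, not the full list of Arabic-speaking countries from Table 1, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

# Only the ccTLDs listed on the slides; Table 1 of the paper has the full set.
ARABIC_CCTLDS = {"sa", "eg", "jo", "ae", "kw"}


def last_label(uri):
    """Return the final dot-separated label of the URI's hostname, lowercased."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def has_arabic_cctld(uri):
    """True if the hostname ends in one of the listed Arabic ccTLDs."""
    return last_label(uri) in ARABIC_CCTLDS


for uri in ["http://Kuwaitclub.com.kw/", "http://www.aljazeera.net/news"]:
    print(uri, has_arabic_cctld(uri))
```

Recovering effective TLDs such as gov.sa or edu.sa takes more than the last label; a public suffix list is the usual tool for that.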
4.6 GeoIP Location
Earlier we looked at the ccTLD of the URIs to help determine where the hosts of the webpages might be located.
Now we look at the GeoIP location of the IP address of each unique hostname.
Steps:
First, we obtained the IP addresses of the hostnames using nslookup, which uses DNS to convert the hostname to its IP address.
Then we used the MaxMind GeoLite2 database, which tests at 99.8% accuracy at the country level, to determine location from the IP address.
We used this method to determine GeoIP for the Arabic URI dataset (300,646 URIs).
Table 10 shows the top GeoIP locations, with Arabic countries grouped together.
We found that less than 11% of the URIs are hosted in Arabic countries.
Table 11 shows the top 5 GeoIP locations from Arabic countries.
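The two lookup steps described above (hostname to IP address, then IP address to country) can be sketched as one small function. To keep the example runnable offline, the DNS table and the `FakeReader` class are made-up stand-ins. The assumption is that a real reader object exposes `country(ip).country.iso_code`, as the `geoip2` package's database reader does.

```python
import socket
from types import SimpleNamespace


def country_code(hostname, reader, resolve=socket.gethostbyname):
    """Resolve a hostname to an IP address, then look up its country code."""
    ip_address = resolve(hostname)          # DNS step (the notes used nslookup)
    response = reader.country(ip_address)   # GeoIP step
    return response.country.iso_code


class FakeReader:
    """Offline stand-in for a GeoIP reader; the mapping passed in is made up."""

    def __init__(self, table):
        self.table = table

    def country(self, ip_address):
        iso_code = self.table[ip_address]
        return SimpleNamespace(country=SimpleNamespace(iso_code=iso_code))


# Made-up DNS and GeoIP tables using documentation-range IP addresses.
DNS = {"example-news.test": "192.0.2.10", "example-club.test": "198.51.100.7"}
reader = FakeReader({"192.0.2.10": "US", "198.51.100.7": "KW"})

codes = [country_code(host, reader, resolve=DNS.__getitem__) for host in DNS]
print(codes)
```

With the real database, a reader opened from a local GeoLite2 country file could be passed in place of `FakeReader`, leaving `resolve` at its default. The file name and location would depend on the local setup.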
Table 7 lists the top 10 archived URI-Rs with the most mementos.
Figure 5 shows the number of URI-Ms with Memento-Datetimes in each year.
URI-M: an archived snapshot of the URI-R at a specific date and time, which is called the Memento-Datetime, e.g., URI-M_i = URI-R@t_i.
Example: www.AlJazeera.net @ Jan-1-2015 14:24:27
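In the Internet Archive's Wayback Machine, the URI-M itself encodes both parts of URI-R@t: a 14-digit timestamp followed by the original URI, as in the geocities example earlier in the deck. The sketch below splits such a URI-M into its URI-R and Memento-Datetime. The function name is hypothetical, the pattern covers only web.archive.org-style URIs, and treating the timestamp as UTC is an assumption.

```python
import re
from datetime import datetime, timezone

# Wayback-style URI-M: http://web.archive.org/web/<YYYYMMDDhhmmss>/<URI-R>
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_uri_m(uri_m):
    """Return (URI-R, Memento-Datetime) for a Wayback-style URI-M, else None."""
    match = WAYBACK.match(uri_m)
    if not match:
        return None
    # Assumption: the 14-digit timestamp is interpreted as UTC.
    when = datetime.strptime(match.group(1), "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return match.group(2), when


uri_r, memento_datetime = split_uri_m(
    "http://web.archive.org/web/19970222174751/http://www1.geocities.com/"
)
print(uri_r, memento_datetime.isoformat())
```

Other archives use different URI-M layouts, so in general the Memento-Datetime should be read from the archive's response headers or TimeMap rather than parsed from the URI.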
4.7 Search Engine Indexing
We are also interested in how well these Arabic web pages are indexed by search engines such as Google.
We used the Google Custom Search API to determine if the Arabic seed URIs are indexed by Google.
We tested only the seed URIs because we were limited by the restriction of 1000 requests per day in the API.
However, we note that the Google user web interface may produce different results than the Custom Search API.
For the Arabic seed URIs, we can indicate if they were present on the live Web, in the Google index, and present in an archive, creating a (live, indexed, archived) tuple.
In Table , we show the percentage of our Arabic seed URI dataset (7,976 URIs) that fell into each permutation of the tuple.
We note that all of our Arabic seeds were present on the live Web at the time of our analysis.