Scraping, Transforming, and
Enriching Bibliographic
Data with Google Sheets
presentation for CAMIG & Word Lab
January 31, 2020
Michael P. Williams
Area Studies Technical Services Coordinator, Penn Libraries
mpw2@upenn.edu
 Area Studies Technical Services acquires, catalogs, and
processes materials in non-Western languages and in non-Latin
scripts from suppliers across the world
 Firm ordering is often done for long lists of materials
advertised as spreadsheets by a small set of vendors, and
selected by Area Studies bibliographers.
 We are transitioning from single, copy-paste-click-heavy
transactional orders to batch ordering mediated by
spreadsheets (Excel) and MarcEdit software to transform
tabular data into brief MARC records.
What Our Department Does
Through seven use cases of real world applications, you’ll see how I have used:
 Google Sheets' IMPORTHTML, IMPORTXML, and IMPORTDATA functions to fetch a
variety of bibliographic info on the web, with the help of HTML structures and
XPath references.
 Text/number formulas (applicable in Excel and Google Sheets) such as
CONCATENATE, SPLIT, SUBSTITUTE, ROUNDUP, LEFT, RIGHT, MID, LEN, CHAR,
VALUE, TEXT, MOD, and PROPER to manipulate text strings.
 Conditional formulas like IF, IFNA (or IFERROR) to make statements so the
spreadsheet can “make choices.”
 Third-party applications, such as add-ons like MatchMarc, which queries OCLC with
an API; a Google Apps Script function like "importRegex" to apply Regular
Expressions to web scraping; or my own (really clunky) home-grown "ISBN
Toolkit" to clean, validate, and reconstruct ISBNs for additional bibliographic
information.
What’s in This Slide Deck?
I. Building Useful Brief Records in
Japanese Acquisitions
Use Case 1: Scraping Bibliographic Data
from a Bookseller’s Site
 Context: A certain vendor, Japan Publication Trading (JPT), lists their titles
with a unique reference number (e.g. JPTB1907-0001, the first title in their
July 2019 catalog) and can send us readymade MARC records from this data
for ordering purposes. Their catalog also lists prices.
 Problem: From a “wish list” of ISBNs compiled by a bibliographer, how can we
determine which titles JPT readily stocks (and which we can fast-track order)
and which titles they will need to source for us?
 Solution: Use Google Sheets IMPORTHTML function to query an ISBN on the
vendor site and return information stored as an unordered list (<ul>).
 Assumptions: We will retrieve one and ONLY one result.
Using one formula in
Google Sheets, we can
turn this website’s <ul>
element into spreadsheet
cells:
(ISBN is in A2)
=SPLIT((SUBSTITUTE(IMPORTHTML((CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2)),"list",2), CHAR(10), "|")),"|")
=SPLIT((SUBSTITUTE(IMPORTHTML((CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2)),"list",2), CHAR(10), "|")),"|")
1. It CONCATENATEs the base URL with the search query in A2 (the ISBN):
CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2)
 https://jptbooknews.jptco.co.jp/product?q=9784065137338
2. It IMPORTs the HTML from the URL it just built, finding the 2nd occurrence of the
“list” element on the page
3. It takes the line breaks in the list (defined by formula CHAR(10)) and SUBSTITUTES a
pipe (“|”) for each
4. It then SPLITs the data in that list at the pipe, sending it across spreadsheet cells.
(Essentially a “text to columns” function)
5. Afterward, additional formulas fetch and clean that data with other
SPLIT/SUBSTITUTE functions for easier readability.
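A hedged illustration of steps 2 through 4, using made-up list contents (a real JPT result will differ):
 IMPORTHTML returns the second list as one value whose lines are separated by CHAR(10): Sample Title, JPTB1907-0001, ¥1,944
 SUBSTITUTE swaps each line break for a pipe, giving: Sample Title|JPTB1907-0001|¥1,944
 SPLIT then breaks at each pipe, writing “Sample Title”, “JPTB1907-0001”, and “¥1,944” into three adjacent cells.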
How Does This Work?
=ROUNDUP(E2*GOOGLEFINANCE("CURRENCY:JPYUSD"),2)
1. Fetches the yen price written to cell E2 and converts to USD with Google Finance formulas
2. Rounds up the value to 2 decimal places with ROUNDUP
3. Gets estimated US price
=VALUE(SUBSTITUTE(LEFT(B3,6),"JPTB","20"))
1. Takes the LEFT-most 6 characters in cell B3 (where the vendor reference number is written), e.g.
“JPTB19”
2. SUBSTITUTEs the string “JPTB” with the digits 20
3. Gets the numerical VALUE of this text string
4. Gets estimated year of publication based on the vendor reference number prefix.
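A hedged worked example for both formulas (the exchange rate is invented; GOOGLEFINANCE fetches the live one): if E2 holds 2400 and the JPY→USD rate were 0.0092, then 2400 × 0.0092 = 22.08, which ROUNDUP leaves as 22.08. And if B3 holds JPTB1907-0001, LEFT(B3,6) gives “JPTB19”, SUBSTITUTE turns it into “2019”, and VALUE returns the number 2019.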
More Behind the Scenes Work
Use Case 2: Using Known ISBNs to Fetch
Bibliographic Data from a Union Catalog
 Context: Even if JPT doesn’t stock a title, they can source it for us. But what
they can’t easily source is sufficient bibliographic data—especially accurate
Romanized Japanese required for us to make useful MARC records for both
ordering and pre-acquisition patron discovery.
 Problem: From that same wish list of ISBNs, how can we get critical
bibliographic data and accurate romanization?
 Solution: Use Google Sheets IMPORTXML function to query an ISBN on the
union catalog, retrieve the catalog ID, and then use the catalog ID for further
IMPORTXML and IMPORTDATA functions. Finally, Google Translate is used to
get romanization from retrieved data.
 Assumptions: We will retrieve one and ONLY one result.
First, we get the NCID (the record identifier) from search
results:
=SUBSTITUTE(IMPORTXML(CONCATENATE("https://ci.nii.ac.jp/books/search?advanced=false&count=20&sortorder=3&q=",A2),"/html/body/div/div[3]/div[1]/div/form/div/ul/li/div/dl/dt/a/@href"),"/ncid/","")
1. It CONCATENATEs the base URL with the search query in A2 (the ISBN)
2. It IMPORTs the XML from the URL it just built, using an XPath absolute
reference to drill down the page to the NCID value, which occurs as a
hyperlink (a link to the actual bib record)
 /html/body/div/div[3]/div[1]/div/form/div/ul/li/div/dl/dt/a/@href
3. It takes the URL portion it fetched and SUBSTITUTES the string “/ncid/” with
nothing (“”)
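A hedged example of step 3 (this NCID is made up for illustration): if the href fetched were /ncid/BB12345678, SUBSTITUTE strips the “/ncid/” portion and leaves the bare identifier BB12345678, ready for reuse in the next set of formulas.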
Then, we get the bib record data using the NCID we fetched
1. =CONCATENATE("https://ci.nii.ac.jp/ncid/",D2) [NCID is in D2]
 builds the URL for the bib record
2. =IMPORTXML(M2, "/html/body/div/div[3]/div[2]/div/div[4]/ul/li[7]/dl/dd")
[bib record URL is in M2]
 fetches pagination, which is in the bib record only
3. =CONCATENATE(M2,".tsv") [bib record URL is in M2]
 builds a URL for a .tsv file provided by CiNii with each bib (contains limited
info)
4. =INDEX(IMPORTDATA(N2),2) [.tsv URL is in N2]
 uses Google Sheets IMPORTDATA function to fetch the .tsv file
 INDEX function specifies row 2 only (removes header row), to get bib data in
one row
At last, use Google Translate to take the Japanese title
reading (katakana) and transliterate it into roman
characters.
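A minimal sketch of that step, assuming the katakana reading sits in G2 (the cell placement is an assumption), using Google Sheets’ built-in GOOGLETRANSLATE function:
=GOOGLETRANSLATE(G2,"ja","en")
Results can come back translated rather than transliterated, so the output still needs staff review against ALA-LC romanization.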
…to make something like this:
*(note: actual data here fetched from MatchMarc)
II. Discovering Titles & Scraping
Bibliographic Details in South Asia
Acquisitions
Use Case 3: Scraping Bibliographic Data
from Vendor Catalog Searches
 Context: There are many South Asian languages for which we do not have
sufficient time or expertise to collect, but which are important for a
representative collection. The catalog of Hindi Book Centre
(https://www.hindibook.com/) provides long catalog lists for both Urdu and
Punjabi.
 Problem: These lists are long, and it would be time-consuming to search
through each without good filtering mechanisms, which aren’t at the initial
list level.
 Solution: Scrape the URLs from a list of titles in a search, then scrape the bib
data from each URL fetched so they can be sorted and filtered.
 Assumptions: Every catalog record has the same data in the same place, with
no data fields missing. (…this proves to be mostly true)
https://www.hindibook.com/index.php?p=sr&String=urdu-books&Field=keywords
https://www.hindibook.com/index.php?p=sr&format=fullpage&Field=bookcode&String=9788178018539
 First, we want to get all the URLs of results for Urdu titles. There are 1514 results, and we can
get up to 72 results per page. Since 1514 / 72 = 21.027, that means our results cover 22 pages.
 When we click Page 2, we see the URL syntax:
https://www.hindibook.com/index.php?&p=sr&String=urdu-books&Field=keywords&perpage=72&startrow=72.
 We can guess that the start row is “0” for Page 1, and can use Google Sheets to CONCATENATE
URLs using multiples of 72 (that is, a column of =A1+72, in succession) until we reach the page
that starts at row 1512 (that’s page 22), as sketched below.
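A minimal sketch of that URL-building column (the cell layout is an assumption): put 0 in A1 and =A1+72 in A2, fill down through A22, then next to each start row build the page URL:
=CONCATENATE("https://www.hindibook.com/index.php?&p=sr&String=urdu-books&Field=keywords&perpage=72&startrow=",A1)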
 For each page URL, we expect up to 72 titles returned. But we don’t
need the titles, we need the URLs to the titles’ records.
 If we right-click on any linked title, we can inspect the element, e.g.:
<a href="index.php?p=sr&amp;format=fullpage&amp;Field=bookcode&amp;String=9788178018539 " class="h7 steelblue"> 1857 KI JUNG-E-AZADI KA GUMNAM SHAHEED RAJA NAHAR SINGH </a>
 Using XPath and IMPORTXML we can say “get me the href wherever there
is an <a> with class="h7 steelblue"”:
 =IMPORTXML(A2,"//a[@class='h7 steelblue']/@href")
 All 72 URLs (relative, hence almost complete) are returned in an array, which means we’d need
to be cautious about how we sort the list page URLs in column A.
 For each href fragment in B2, we build a full URL
=CONCATENATE("https://www.hindibook.com/",B2)
 Then we IMPORT the XML from each URL in C2:
=IMPORTXML(C2,"//div[@id='panel1d']") [all book data is contained in this <div>]
 Optionally, we can get additional data (like the
Hindi Book Centre vendor number) with
additional XPaths
=IMPORTXML(C2,"//div[@id='panel2d']")
This would be helpful if we decided to use them
as a vendor.
III. Checking Holdings & Fetching
OCLC Data Using ISBNs
Use Case 4: Matching OCLC Data to
Known ISBNs
 Context: We’ve scraped a lot of Urdu ISBNs from Hindi Book Centre, but we
need to make informed decisions about whether we should, and how we
could, acquire these. We’d want to gauge whether these have OCLC records
(for easy cataloging), whether our reliable South Asian vendor DK Agencies
can provide them, and then get accurate information for ordering purposes
(titles romanized by Hindi Book Centre do not match ALA/LC romanization
standards).
 Problem: There are 1500+ items, and no staff member has time to search for
them one by one to check for duplication, get additional info, and correct
romanization.
 Solution: Use MatchMarc, a Google Sheets add-on, to query the ISBNs we
fetched and return information from OCLC.
MatchMarc: A Google Sheets
Add-on that uses the
WorldCat Search API
By Michelle Suranofsky and
Lisa McColl
Lehigh University Libraries
has developed a new tool for
querying WorldCat using the
WorldCat Search API. […] The
tool will return a single
“best” OCLC record number,
and its bibliographic
information for a given ISBN
or LCCN, allowing the user to
set up and define “best.”
Code4Lib Journal, Issue 46,
2019-11-05
https://journal.code4lib.org/articles/14813
These Hindi Book Centre ISBNs…
…searched against these criteria…
…match this OCLC data.
Making decisions with this data: Assuming the bibliographer wanted all of these titles…
 If a local record is found, our holdings are in OCLC. Title is a duplicate so we don’t need to
order.
 If the existing OCLC record does not have vernacular scripts (indicated by 066$a), we’d
prefer to get DK Agencies to sell us the book and provide a MARC record with those scripts.
The DK number was in a 938$n subfield.
 If the existing OCLC record already has scripts, and good cataloging, we can order from
any vendor. (DK is good but not cheap).
 If the existing OCLC record is missing call numbers or subjects, we may want to weigh
options in purchasing.
 If there is no OCLC record at all, this will require original cataloging we cannot handle.
Purchase from DK if wanted.
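These rules can themselves be drafted as spreadsheet logic. A hedged sketch of the first two, assuming hypothetical columns (F2 flagging a found local record, G2 holding the record’s 066$a value; MatchMarc’s real output layout may differ):
=IF(F2="found","duplicate: do not order",IF(G2="","no scripts: prefer DK + MARC","scripts present: any vendor"))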
Use Case 5: From Known ISBNs, Check
Franklin to Confirm Holdings
 Context: (Ideally) OCLC will display our holdings for all items we have
cataloged; our holdings for these are sent to OCLC. But for those items
already on order but not yet cataloged, we should confirm whether they are
in Alma/Franklin.
 Problem: Once again, staff time is valuable, and copying and pasting
ISBNs/titles into Alma/Franklin is time-consuming with possibly little payoff.
 Solution: Use IMPORTXML, IF, and IFNA functions to query Franklin, retrieve
an MMS ID, a title, a link, or otherwise tell us we don’t have the title.
 Assumptions: We will retrieve one and ONLY one result.
First query Franklin with an ISBN to check for an MMS ID…
=IFNA(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//div[@class='availability-ajax-load']/@data-availability-ids"),"no Franklin result found")
…then fetch the title from the first search result…
=INDEX(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//h3[@class='index_title document-title-heading col-sm-9 col-lg-10']"),1,2)
…then generate a link to that bib.
=IF(B2="no Franklin result found","no Franklin link",(CONCATENATE("https://franklin.library.upenn.edu/catalog/FRANKLIN_",B2)))
=IFNA(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//div[@class='availability-ajax-load']/@data-availability-ids"),"no Franklin result found")
1. It CONCATENATEs the base search URL with the search
query in A2 (the ISBN)
2. It IMPORTs the XML from the URL it just built, retrieving
the MMS ID from the “data-availability-ids” attribute of
the <div> element whose class is “availability-ajax-load”
3. And IF such an element is not applicable (IFNA), it will
display the text “no Franklin result found” instead
How It Works 1:
Perform a Query to Find an MMS ID
=INDEX(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//h3[@class='index_title document-title-heading col-sm-9 col-lg-10']"),1,2)
1. As above, it CONCATENATEs the same search URL with the ISBN in
A2, this time retrieving the title text from the <h3> tag
2. The <h3> tag contains a text break, so the INDEX function says “get
row 1, column 2” (where the title will appear)
How It Works 2:
Perform a Query to Find Title
=IF(B2="no Franklin result found","no Franklin link",(CONCATENATE("https://franklin.library.upenn.edu/catalog/FRANKLIN_",B2)))
1. IF the result in B2 is the text “no Franklin result found”, it displays
“no Franklin link”
2. Otherwise, it CONCATENATEs the Franklin URL with the MMS ID
retrieved in B2 to generate a link
How It Works 3:
Generate a Link to the Bib Record
IV. Putting REGEX (Regular
Expressions) to Work in Google Sheets
Use Case 6: Google’s IMPORT[X] Functions Are
Slow, and Frequently Time Out
 Context: Google Sheets is doing a lot of work importing HTML, XML, and
DATA. This causes timeouts, and results take time to load.
 Problem: Sometimes the functions are working so hard that no results load at
all. Google also throttles the number of queries you can run per day, and how
many can run at once, across all Google Sheets in your Google Drive.
 Solution: Make your own custom function with Google Apps Script (or borrow
one!) to bypass those speed issues.
 Assumptions: You can program, you know a programmer, or you are willing to
search for a solution online and just see what happens.
You don’t have to be a programmer, but you can fake it. Google Apps Script
(based on JavaScript) can plug into Google Drive applications, like Google
Sheets. Many scripts are available in forums like Stack Overflow, etc.
custom importRegex function developed by Josh Bradley (@josh_b_rad)
https://stackoverflow.com/questions/39014766/to-exceed-the-importxml-limit-on-google-spreadsheet
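For flavor, here is a minimal Apps Script sketch of what such a custom function can look like (a simplified stand-in, not Bradley’s exact code):
// Custom function: fetch a URL and return the first regex capture group.
function importRegex(url, pattern) {
  var html = UrlFetchApp.fetch(url).getContentText(); // download the page HTML
  var match = html.match(new RegExp(pattern));        // run the regular expression
  return match ? match[1] : "no match";               // first capture group, or a fallback
}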
For example… we know Leila Books has catalog numbers, so we can
guess likely URLs by assuming the numbers in A1 match something in
their catalog:
=CONCATENATE("https://www.leilabooks.com/en/details.php?bookno=",A1)
We use the custom importRegex function to return the desired data
from the URL in column B (anchored as $B1) with regular expressions, e.g.:
=importRegex($B1,"Book Title</td><td width='75%' class='colmn2'>(.*)</td>")
Addendum: ISBN Toolkit
Use Case 7: ISBNs Should be Unique and
Valid… But Sometimes Aren’t
 Context: In a perfect world, every resource should have a valid ISBN to
differentiate titles, editions of those titles, and formats of those editions (i.e.
the 1st edition of a print title has a different ISBN from the eBook edition and
from its 2nd edition, etc.). ISBNs also come in different, equivalent “flavors”
(ISBN-10 and ISBN-13), which look similar but are distinct.
 Problem: The world isn’t perfect, and ISBNs aren’t free. Publishers recycle
them across titles/editions, fail to use them as expected, or format them
improperly. Or they provide one flavor (ISBN-10) when we really expected
another (ISBN-13) for a particular application.
 Solution: Use Excel/Google Sheets to attempt to fix ISBNs using known rules
for calculating ISBNs.
 Assumptions: You have a lot of ISBNs, suspect some are broken/invalid,
and/or you also want to convert between ISBN-10s and -13s.
Using some functions like SUBSTITUTE and IF, we can clean ISBNs of extra characters
like hyphens, and then use the LEN (length) function to determine if they are valid
lengths (10 or 13).
If they are, we can then calculate the valid check digit, determine if the ISBN we
entered is valid, and if it isn’t, we can “reconstruct” the supposedly valid ISBN.
Determining the ISBN type
=IF(LEN(SUBSTITUTE(A2,"-",""))=13,"ISBN-13",IF(LEN(SUBSTITUTE(A2,"-",""))=10,"ISBN-10","N/A"))
1. We’ve nested some IF statements. The first one, IF(LEN(SUBSTITUTE(A2,"-",""))=13,
will SUBSTITUTE the hyphens (“-”) with nothing (“”), then it will calculate the LENgth.
IF that LEN is 13, it will display “ISBN-13”
2. If that LEN is not 13, it will try again, this time looking for 10 digits in that column. If
it’s 10, it will display “ISBN-10”.
3. Otherwise, it displays “N/A”: The ISBN type cannot be determined, so we cannot
presume where the missing or extra digits are.
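A hedged worked example: given the hyphenated input 81-85360-86-6 in A2, SUBSTITUTE(A2,"-","") returns 8185360866, LEN counts 10 characters, and the formula displays “ISBN-10”.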
Readymade formulas help us first calculate the ISBN-10 check digit, using the 9-digit
“root” (the ISBN minus the 978- prefix, and minus the check digit).
=IF(LEN(SUBSTITUTE(A2,"-",""))=10,MOD(MID((SUBSTITUTE(A2,"-","")),1,1)+
MID((SUBSTITUTE(A2,"-","")),2,1)*2+MID((SUBSTITUTE(A2,"-","")),3,1)*3+
MID((SUBSTITUTE(A2,"-","")),4,1)*4+MID((SUBSTITUTE(A2,"-","")),5,1)*5+
MID((SUBSTITUTE(A2,"-","")),6,1)*6+MID((SUBSTITUTE(A2,"-","")),7,1)*7+
MID((SUBSTITUTE(A2,"-","")),8,1)*8+MID((SUBSTITUTE(A2,"-","")),9,1)*9,11),
IF(LEN(SUBSTITUTE(A2,"-",""))=13,MOD(MID((SUBSTITUTE(A2,"-","")),4,1)+
MID((SUBSTITUTE(A2,"-","")),5,1)*2+MID((SUBSTITUTE(A2,"-","")),6,1)*3+
MID((SUBSTITUTE(A2,"-","")),7,1)*4+MID((SUBSTITUTE(A2,"-","")),8,1)*5+
MID((SUBSTITUTE(A2,"-","")),9,1)*6+MID((SUBSTITUTE(A2,"-","")),10,1)*7+
MID((SUBSTITUTE(A2,"-","")),11,1)*8+MID((SUBSTITUTE(A2,"-","")),12,1)*9,11),
"BAD ISBN"))
And one more function helps us get a value of “X” for ISBN-10s that end in X
=IF(C2=10,"X",C2)
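A hedged worked example using the commonly cited valid ISBN-10 0306406152: the weighted sum is 0×1 + 3×2 + 0×3 + 6×4 + 4×5 + 0×6 + 6×7 + 1×8 + 5×9 = 145, and MOD(145,11) = 2, matching the printed check digit, so this ISBN validates.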
Helpful sources:
• http://drziegler.net/generating-eanisbn-13-check-digits-in-excel/
• http://useroffline.blogspot.com/2008/08/tip-spreadsheet-conversion-for-isbn-10.html
With the ISBN-10 check digit calculated (value of 0-9 or else X), we can
reconstruct a valid ISBN-10 (and write it as a 10-character TEXT value, since it
may end with an “X”)…
=TEXT(IF(LEN(SUBSTITUTE(A2,"-",""))=13,CONCATENATE(MID((SUBSTITUTE(A2,"-","")),4,9),D2),
IF(LEN(SUBSTITUTE(A2,"-",""))=10,CONCATENATE(MID((SUBSTITUTE(A2,"-","")),1,9),D2),
"Cannot validate")),"0000000000")
…and with that ISBN-10, we calculate the valid ISBN-13:
=TEXT(IF(LEN(E3)=10,CONCATENATE("978",MID(E3,1,9),
MOD((10-MOD(SUM(9,21,8,PRODUCT(MID(E3,1,1),3),MID(E3,2,1),PRODUCT(MID(E3,3,1),3),
MID(E3,4,1),PRODUCT(MID(E3,5,1),3),MID(E3,6,1),PRODUCT(MID(E3,7,1),3),
MID(E3,8,1),PRODUCT(MID(E3,9,1),3)),10)),10)),"Cannot validate"),0)
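A hedged worked example, continuing with E3 = 0306406152: SUM(9,21,8,…) evaluates to 93, MOD(93,10) is 3, and 10 minus 3 gives a check digit of 7, so the formula returns 9780306406157, the correct ISBN-13 form of that ISBN-10.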
Helpful sources:
• http://drziegler.net/generating-eanisbn-13-check-digits-in-excel/
• http://useroffline.blogspot.com/2008/08/tip-spreadsheet-conversion-for-isbn-10.html
And sometimes the “invalid” ISBN is “valid” in the context of the title in hand (or on a
vendor spreadsheet):
“Invalid” ISBN-10 8185360866 vs. valid ISBN-10 8185360863:
two editions matched (different date)
Questions/Demo Time?
Use Case Addendum: Fetching Book Data
from Known Bookseller URLs (my first
experiment, rebuilt!)
 Context: We want to expand language coverage of titles in our South Asia
collections, but no staffing model can accommodate the dozens of languages
we need for a representative collection. We found a vendor, DC Books, which
has a great website for Malayalam books with info about them in English.
 Problem: A staff member who knows Malayalam can make book
recommendations, but asking him to copy and paste bibliographic data one at
a time into a spreadsheet for a bibliographer to review is laborious and
time-consuming.
 Solution: Use structured data on the DC Books website, and have the staff
member just record the URLs of books he recommends to us.
=IMPORTXML(A2,"//span[@style='font-size:14px; line-height:26px; color:#333;']")
 Takes the URL identified, and imports a <span>
element where the bib data lives.
=PROPER() functions normalize ALL CAPS to Proper Case
for title, author, and publisher.
=RIGHT([cell],4) takes the YYYY from the DD-MM-YYYY
date
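A hedged worked example of those cleanup formulas (the title is invented): PROPER("ENTE PRIYA KRITHIKAL") returns “Ente Priya Krithikal”, and RIGHT("23-05-2019",4) returns “2019”.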