Scraping, Transforming, and
Enriching Bibliographic
Data with Google Sheets
presentation for CAMIG & Word Lab
January 31, 2020
Michael P. Williams
Area Studies Technical Services Coordinator, Penn Libraries
mpw2@upenn.edu
 Area Studies Technical Services acquires, catalogs, and
processes materials in non-Western languages and in non-Latin
scripts from suppliers across the world
 Firm ordering is often done for long lists of materials
advertised as spreadsheets by a small set of vendors, and
selected by Area Studies bibliographers.
 We are transitioning from single, copy-paste-click-heavy
transactional orders to batch ordering mediated by
spreadsheets (Excel) and MarcEdit software to transform
tabular data into brief MARC records.
What Our Department Does
Through seven use cases of real world applications, you’ll see how I have used:
 Google Sheets' IMPORTHTML, IMPORTXML, and IMPORTDATA functions to fetch a
variety of bibliographic info on the web, with the help of HTML structures and
XPath references.
 Text/number formulas (applicable in Excel and Google Sheets) such as
CONCATENATE, SPLIT, SUBSTITUTE, ROUNDUP, LEFT, RIGHT, MID, LEN, CHAR,
VALUE, TEXT, MOD, and PROPER to manipulate text strings.
 Conditional formulas like IF, IFNA (or IFERROR) to make statements so the
spreadsheet can “make choices.”
 Third-party applications, such as add-ons like MatchMarc, which queries OCLC with
an API; a Google Apps Script function like "importRegex" to apply Regular
Expressions to web scraping; or my own (really clunky) home-grown "ISBN
Toolkit" to clean, validate, and reconstruct ISBNs for additional bibliographic
information.
What’s in This Slide Deck?
I. Building Useful Brief Records in
Japanese Acquisitions
Use Case 1: Scraping Bibliographic Data
from a Bookseller’s Site
 Context: A certain vendor, Japan Publication Trading (JPT), lists their titles
with a unique reference number (e.g. JPTB1907-0001, the first title in their
July 2019 catalog) and can send us readymade MARC records from this data
for ordering purposes. Their catalog also lists prices.
 Problem: From a “wish list” of ISBNs compiled by a bibliographer, how can we
determine which titles JPT readily stocks (and which we can fast-track order)
and which titles they will need to source for us?
 Solution: Use Google Sheets IMPORTHTML function to query an ISBN on the
vendor site and return information stored as an unordered list (<ul>).
 Assumptions: We will retrieve one and ONLY one result.
Using one formula in
Google Sheets, we can
turn this website’s <ul>
element into spreadsheet
cells:
(ISBN is in A2)
=SPLIT((SUBSTITUTE(IMPORTHTML((CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2)),"list",2), CHAR(10), "|")),"|")
=SPLIT((SUBSTITUTE(IMPORTHTML((CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2)),"list",2), CHAR(10), "|")),"|")
1. It CONCATENATEs the base URL with the search query in A2 (the ISBN):
CONCATENATE("https://jptbooknews.jptco.co.jp/product?q=",A2)
 https://jptbooknews.jptco.co.jp/product?q=9784065137338
2. It IMPORTs the HTML from the URL it just built, finding the 2nd occurrence of the
“list” element on the page
3. It takes the line breaks in the list (defined by formula CHAR(10)) and SUBSTITUTES a
pipe (“|”) for each
4. It then SPLITs the data in that list at the pipe, sending it across spreadsheet cells.
(Essentially a “text to columns” function)
5. Afterward, additional formulas fetch and clean that data with other
SPLIT/SUBSTITUTE functions for easier readability.
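A hedged illustration of steps 2 through 4, using made-up list contents (a real JPT result will differ):
 IMPORTHTML returns the second list as one value whose lines are separated by CHAR(10): Sample Title, JPTB1907-0001, ¥1,944
 SUBSTITUTE swaps each line break for a pipe, giving: Sample Title|JPTB1907-0001|¥1,944
 SPLIT then breaks at each pipe, writing “Sample Title”, “JPTB1907-0001”, and “¥1,944” into three adjacent cells.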
How Does This Work?
=ROUNDUP(E2*GOOGLEFINANCE("CURRENCY:JPYUSD"),2)
1. Fetches the yen price written to cell E2 and converts to USD with Google Finance formulas
2. Rounds up the value to 2 decimal places with ROUNDUP
3. Gets estimated US price
=VALUE(SUBSTITUTE(LEFT(B3,6),"JPTB","20"))
1. Takes the LEFT-most 6 characters in cell B3 (where the vendor reference number is written), e.g.
“JPTB19”
2. SUBSTITUTEs the string “JPTB” with the digits 20
3. Gets the numerical VALUE of this text string
4. Gets estimated year of publication based on the vendor reference number prefix.
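A hedged worked example for both formulas (the exchange rate is invented; GOOGLEFINANCE fetches the live one): if E2 holds 2400 and the JPY→USD rate were 0.0092, then 2400 × 0.0092 = 22.08, which ROUNDUP leaves as 22.08. And if B3 holds JPTB1907-0001, LEFT(B3,6) gives “JPTB19”, SUBSTITUTE turns it into “2019”, and VALUE returns the number 2019.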
More Behind the Scenes Work
Use Case 2: Using Known ISBNs to Fetch
Bibliographic Data from a Union Catalog
 Context: Even if JPT doesn’t stock a title, they can source it for us. But what
they can’t easily source is sufficient bibliographic data—especially accurate
Romanized Japanese required for us to make useful MARC records for both
ordering and pre-acquisition patron discovery.
 Problem: From that same wish list of ISBNs, how can we get critical
bibliographic data and accurate romanization?
 Solution: Use Google Sheets IMPORTXML function to query an ISBN on the
union catalog, retrieve the catalog ID, and then use the catalog ID for further
IMPORTXML and IMPORTDATA functions. Finally, Google Translate is used to
get romanization from retrieved data.
 Assumptions: We will retrieve one and ONLY one result.
First, we get the NCID (the record identifier) from search
results:
=SUBSTITUTE(IMPORTXML(CONCATENATE("https://ci.nii.ac.jp/books/search?advanced=false&count=20&sortorder=3&q=",A2),"/html/body/div/div[3]/div[1]/div/form/div/ul/li/div/dl/dt/a/@href"),"/ncid/","")
1. It CONCATENATEs the base URL with the search query in A2 (the ISBN)
2. It IMPORTs the XML from the URL it just built, using an XPath absolute
reference to drill down the page to the NCID value, which occurs as a
hyperlink (a link to the actual bib record)
 /html/body/div/div[3]/div[1]/div/form/div/ul/li/div/dl/dt/a/@href
3. It takes the URL portion it fetched and SUBSTITUTES the string “/ncid/” with
nothing (“”)
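A hedged example of step 3 (this NCID is made up for illustration): if the href fetched were /ncid/BB12345678, SUBSTITUTE strips the “/ncid/” portion and leaves the bare identifier BB12345678, ready for reuse in the next set of formulas.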
Then, we get the bib record data using the NCID we fetched
1. =CONCATENATE("https://ci.nii.ac.jp/ncid/",D2) [NCID is in D2]
 builds the URL for the bib record
2. =IMPORTXML(M2, "/html/body/div/div[3]/div[2]/div/div[4]/ul/li[7]/dl/dd")
[bib record URL is in M2]
 fetches pagination, which is in the bib record only
3. =CONCATENATE(M2,".tsv") [bib record URL is in M2]
 builds a URL for a .tsv file provided by CiNii with each bib (contains limited
info)
4. =INDEX(IMPORTDATA(N2),2) [.tsv URL is in N2]
 uses Google Sheets IMPORTDATA function to fetch the .tsv file
 INDEX function specifies row 2 only (removes header row), to get bib data in
one row
At last, use Google Translate to take the Japanese title
reading (katakana) and transliterate it into roman
characters.
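A minimal sketch of that step, assuming the katakana reading sits in G2 (the cell placement is an assumption), using Google Sheets’ built-in GOOGLETRANSLATE function:
=GOOGLETRANSLATE(G2,"ja","en")
Results can come back translated rather than transliterated, so the output still needs staff review against ALA-LC romanization.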
…to make something like this:
*(note: actual data here fetched from MatchMarc)
II. Discovering Titles & Scraping
Bibliographic Details in South Asia
Acquisitions
Use Case 3: Scraping Bibliographic Data
from Vendor Catalog Searches
 Context: There are many South Asian languages for which we do not have
sufficient time or expertise to collect, but which are important for a
representative collection. The catalog of Hindi Book Centre
(https://www.hindibook.com/) provides long catalog lists for both Urdu and
Punjabi.
 Problem: These lists are long, and it would be time-consuming to search
through each without good filtering mechanisms, which aren’t at the initial
list level.
 Solution: Scrape the URLs from a list of titles in a search, then scrape the bib
data from each URL fetched so they can be sorted and filtered.
 Assumptions: Every catalog record has the same data in the same place, with
no data fields missing. (…this proves to be mostly true)
https://www.hindibook.com/index.php?p=sr&String=urdu-books&Field=keywords
https://www.hindibook.com/index.php?p=sr&format=fullpage&Field=bookcode&String=9788178018539
 First, we want to get all the URLs of results for Urdu titles. There are 1514 results, and we can
get up to 72 results per page. Since 1514 / 72 = 21.027, that means our results cover 22 pages.
 When we click Page 2, we see the URL syntax:
https://www.hindibook.com/index.php?&p=sr&String=urdu-books&Field=keywords&perpage=72&startrow=72.
 We can guess that the start row is “0” for Page 1, and can use Google Sheets to CONCATENATE
URLs using multiples of 72 (that is, a column of =A1+72, in succession) until we reach the page
that starts at row 1512 (that’s page 22), as sketched below.
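A minimal sketch of that URL-building column (the cell layout is an assumption): put 0 in A1 and =A1+72 in A2, fill down through A22, then next to each start row build the page URL:
=CONCATENATE("https://www.hindibook.com/index.php?&p=sr&String=urdu-books&Field=keywords&perpage=72&startrow=",A1)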
 For each page URL, we expect up to 72 titles returned. But we don’t
need the titles, we need the URLs to the titles’ records.
 If we right-click on any linked title, we can inspect the element, e.g.:
<a href="index.php?p=sr&amp;format=fullpage&amp;Field=bookcode&amp;String=9788178018539 " class="h7 steelblue"> 1857 KI JUNG-E-AZADI KA GUMNAM SHAHEED RAJA NAHAR SINGH </a>
 Using XPath and IMPORTXML we can say “get me the href wherever there
is an <a> with class="h7 steelblue"”:
 =IMPORTXML(A2,"//a[@class='h7 steelblue']/@href")
 All 72 URLs (relative, hence almost complete) are returned in an array, which means we’d need
to be cautious about how we sort the list page URLs in column A.
 For each href fragment in B2, we build a full URL
=CONCATENATE("https://www.hindibook.com/",B2)
 Then we IMPORT the XML from each URL in C2:
=IMPORTXML(C2,"//div[@id='panel1d']") [all book data is contained in this <div>]
 Optionally, we can get additional data (like the
Hindi Book Centre vendor number) with
additional XPaths
=IMPORTXML(C2,"//div[@id='panel2d']")
This would be helpful if we decided to use them
as a vendor.
III. Checking Holdings & Fetching
OCLC Data Using ISBNs
Use Case 4: Matching OCLC Data to
Known ISBNs
 Context: We’ve scraped a lot of Urdu ISBNs from Hindi Book Centre, but we
need to make informed decisions about whether we should, and how we
could, acquire these. We’d want to gauge whether these have OCLC records
(for easy cataloging), whether our reliable South Asian vendor DK Agencies
can provide them, and then get accurate information for ordering purposes
(titles romanized by Hindi Book Centre do not match ALA/LC romanization
standards).
 Problem: There are 1500+ items, and no staff member has time to search for
them one by one to check for duplication, get additional info, and correct
romanization.
 Solution: Use MatchMarc, a Google Sheets add-on, to query the ISBNs we
fetched and return information from OCLC.
MatchMarc: A Google Sheets
Add-on that uses the
WorldCat Search API
By Michelle Suranofsky and
Lisa McColl
Lehigh University Libraries
has developed a new tool for
querying WorldCat using the
WorldCat Search API. […] The
tool will return a single
“best” OCLC record number,
and its bibliographic
information for a given ISBN
or LCCN, allowing the user to
set up and define “best.”
Code4Lib Journal, Issue 46,
2019-11-05
https://journal.code4lib.org/articles/14813
These Hindi Book Centre ISBNs…
…searched against these criteria…
…match this OCLC data.
Making decisions with this data: Assuming the bibliographer wanted all of these titles…
 If a local record is found, our holdings are in OCLC. Title is a duplicate so we don’t need to
order.
 If the existing OCLC record does not have vernacular scripts (indicated by 066$a), we’d
prefer to get DK Agencies to sell us the book and provide a MARC record with those scripts.
The DK number was in a 938$n subfield.
 If the existing OCLC record already has scripts, and good cataloging, we can order from
any vendor. (DK is good but not cheap).
 If the existing OCLC record is missing call numbers or subjects, we may want to weigh
options in purchasing.
 If there is no OCLC record at all, this will require original cataloging we cannot handle.
Purchase from DK if wanted.
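These rules can themselves be drafted as spreadsheet logic. A hedged sketch of the first two, assuming hypothetical columns (F2 flagging a found local record, G2 holding the record’s 066$a value; MatchMarc’s real output layout may differ):
=IF(F2="found","duplicate: do not order",IF(G2="","no scripts: prefer DK + MARC","scripts present: any vendor"))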
Use Case 5: From Known ISBNs, Check
Franklin to Confirm Holdings
 Context: (Ideally) OCLC will display our holdings for all items we have
cataloged; our holdings for these are sent to OCLC. But for those items
already on order but not yet cataloged, we should confirm whether they are
in Alma/Franklin.
 Problem: Once again, staff time is valuable, and copying and pasting
ISBNs/titles into Alma/Franklin is time-consuming with possibly little payoff.
 Solution: Use IMPORTXML, IF, and IFNA functions to query Franklin, retrieve
an MMS ID, a title, a link, or otherwise tell us we don’t have the title.
 Assumptions: We will retrieve one and ONLY one result.
First query Franklin with an ISBN to check for an MMS ID…
=IFNA(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//div[@class='availability-ajax-load']/@data-availability-ids"),"no Franklin result found")
…then fetch the title from the first search result…
=INDEX(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//h3[@class='index_title document-title-heading col-sm-9 col-lg-10']"),1,2)
…then generate a link to that bib.
=IF(B2="no Franklin result found","no Franklin link",(CONCATENATE("https://franklin.library.upenn.edu/catalog/FRANKLIN_",B2)))
=IFNA(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//div[@class='availability-ajax-load']/@data-availability-ids"),"no Franklin result found")
1. It CONCATENATEs the base search URL with the search
query in A2 (the ISBN)
2. It IMPORTs the XML from the URL it just built, retrieving
the MMS ID from the “data-availability-ids” attribute of
the <div> element whose class is “availability-ajax-load”
3. And IF such an element is not applicable (IFNA), it will
display the text “no Franklin result found” instead
How It Works 1:
Perform a Query to Find an MMS ID
=INDEX(IMPORTXML(CONCATENATE("https://franklin.library.upenn.edu/catalog?utf8=?&search_field=isxn_search&q=",A2),"//h3[@class='index_title document-title-heading col-sm-9 col-lg-10']"),1,2)
1. As above, it CONCATENATEs the same search URL with the ISBN in
A2, this time retrieving the title text from the <h3> tag
2. The <h3> tag contains a text break, so the INDEX function says “get
row 1, column 2” (where the title will appear)
How It Works 2:
Perform a Query to Find Title
=IF(B2="no Franklin result found","no Franklin link",(CONCATENATE("https://franklin.library.upenn.edu/catalog/FRANKLIN_",B2)))
1. IF the result in B2 is the text “no Franklin result found”, it displays
“no Franklin link”
2. Otherwise, it CONCATENATEs the Franklin URL with the MMS ID
retrieved in B2 to generate a link
How It Works 3:
Generate a Link to the Bib Record
IV. Putting REGEX (Regular
Expressions) to Work in Google Sheets
Use Case 6: Google’s IMPORT[X] Functions Are
Slow, and Frequently Time Out
 Context: Google Sheets is doing a lot of work importing HTML, XML, and
DATA. This causes timeouts, and results take time to load.
 Problem: Sometimes the functions are working so hard that no results load at
all. Google also throttles the number of queries you can run per day, and how
many can run at once, across all Google Sheets in your Google Drive.
 Solution: Make your own custom function with Google Apps Script (or borrow
one!) to bypass those speed issues.
 Assumptions: You can program, you know a programmer, or you are willing to
search for a solution online and just see what happens.
You don’t have to be a programmer, but you can fake it. Google Apps Script
(based on JavaScript) can plug into Google Drive applications, like Google
Sheets. Many scripts are available in forums like Stack Overflow, etc.
custom importRegex function developed by Josh Bradley (@josh_b_rad)
https://stackoverflow.com/questions/39014766/to-exceed-the-importxml-limit-on-google-spreadsheet
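For flavor, here is a minimal Apps Script sketch of what such a custom function can look like (a simplified stand-in, not Bradley’s exact code):
// Custom function: fetch a URL and return the first regex capture group.
function importRegex(url, pattern) {
  var html = UrlFetchApp.fetch(url).getContentText(); // download the page HTML
  var match = html.match(new RegExp(pattern));        // run the regular expression
  return match ? match[1] : "no match";               // first capture group, or a fallback
}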
For example… we know Leila Books has catalog numbers, so we can
guess likely URLs by assuming the numbers in A1 match something in
their catalog:
=CONCATENATE("https://www.leilabooks.com/en/details.php?bookno=",A1)
We use the custom importRegex function to return the desired data
from the URL in column B (anchored as $B1) with regular expressions, e.g.:
=importRegex($B1,"Book Title</td><td width='75%' class='colmn2'>(.*)</td>")
Addendum: ISBN Toolkit
Use Case 7: ISBNs Should be Unique and
Valid… But Sometimes Aren’t
 Context: In a perfect world, every resource should have a valid ISBN to
differentiate titles, editions of those titles, and formats of those editions (i.e.
the 1st edition of a print title has a different ISBN from the eBook edition and
from its 2nd edition, etc.). ISBNs also come in different, equivalent “flavors”
(ISBN-10 and ISBN-13), which look similar but are distinct.
 Problem: The world isn’t perfect, and ISBNs aren’t free. Publishers recycle
them across titles/editions, fail to use them as expected, or format them
improperly. Or they provide one flavor (ISBN-10) when we really expected
another (ISBN-13) for a particular application.
 Solution: Use Excel/Google Sheets to attempt to fix ISBNs using known rules
for calculating ISBNs.
 Assumptions: You have a lot of ISBNs, suspect some are broken/invalid,
and/or you also want to convert between ISBN-10s and -13s.
Using some functions like SUBSTITUTE and IF, we can clean ISBNs of extra characters
like hyphens, and then use the LEN (length) function to determine if they are valid
lengths (10 or 13).
If they are, we can then calculate the valid check digit, determine if the ISBN we
entered is valid, and if it isn’t, we can “reconstruct” the supposedly valid ISBN.
Determining the ISBN type
=IF(LEN(SUBSTITUTE(A2,"-",""))=13,"ISBN-13",IF(LEN(SUBSTITUTE(A2,"-",""))=10,"ISBN-10","N/A"))
1. We’ve nested some IF statements. The first one, IF(LEN(SUBSTITUTE(A2,"-",""))=13,
will SUBSTITUTE the hyphens (“-”) with nothing (“”), then it will calculate the LENgth.
IF that LEN is 13, it will display “ISBN-13”
2. If that LEN is not 13, it will try again, this time looking for 10 digits in that column. If
it’s 10, it will display “ISBN-10”.
3. Otherwise, it displays “N/A”: The ISBN type cannot be determined, so we cannot
presume where the missing or extra digits are.
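A hedged worked example: given the hyphenated input 81-85360-86-6 in A2, SUBSTITUTE(A2,"-","") returns 8185360866, LEN counts 10 characters, and the formula displays “ISBN-10”.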
Readymade formulas help us first calculate the ISBN-10 check digit, using the 9-digit
“root” (the ISBN minus the 978- prefix, and minus the check digit).
=IF(LEN(SUBSTITUTE(A2,"-",""))=10,MOD(MID((SUBSTITUTE(A2,"-","")),1,1)+
MID((SUBSTITUTE(A2,"-","")),2,1)*2+MID((SUBSTITUTE(A2,"-","")),3,1)*3+
MID((SUBSTITUTE(A2,"-","")),4,1)*4+MID((SUBSTITUTE(A2,"-","")),5,1)*5+
MID((SUBSTITUTE(A2,"-","")),6,1)*6+MID((SUBSTITUTE(A2,"-","")),7,1)*7+
MID((SUBSTITUTE(A2,"-","")),8,1)*8+MID((SUBSTITUTE(A2,"-","")),9,1)*9,11),
IF(LEN(SUBSTITUTE(A2,"-",""))=13,MOD(MID((SUBSTITUTE(A2,"-","")),4,1)+
MID((SUBSTITUTE(A2,"-","")),5,1)*2+MID((SUBSTITUTE(A2,"-","")),6,1)*3+
MID((SUBSTITUTE(A2,"-","")),7,1)*4+MID((SUBSTITUTE(A2,"-","")),8,1)*5+
MID((SUBSTITUTE(A2,"-","")),9,1)*6+MID((SUBSTITUTE(A2,"-","")),10,1)*7+
MID((SUBSTITUTE(A2,"-","")),11,1)*8+MID((SUBSTITUTE(A2,"-","")),12,1)*9,11),
"BAD ISBN"))
And one more function helps us get a value of “X” for ISBN-10s that end in X
=IF(C2=10,"X",C2)
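A hedged worked example using the commonly cited valid ISBN-10 0306406152: the weighted sum is 0×1 + 3×2 + 0×3 + 6×4 + 4×5 + 0×6 + 6×7 + 1×8 + 5×9 = 145, and MOD(145,11) = 2, matching the printed check digit, so this ISBN validates.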
Helpful sources:
• http://drziegler.net/generating-eanisbn-13-check-digits-in-excel/
• http://useroffline.blogspot.com/2008/08/tip-spreadsheet-conversion-for-isbn-10.html
With the ISBN-10 check digit calculated (value of 0-9 or else X), we can
reconstruct a valid ISBN-10 (and write it as a 10-character TEXT value, since it
may end with an “X”)…
=TEXT(IF(LEN(SUBSTITUTE(A2,"-",""))=13,CONCATENATE(MID((SUBSTITUTE(A2,"-","")),4,9),D2),
IF(LEN(SUBSTITUTE(A2,"-",""))=10,CONCATENATE(MID((SUBSTITUTE(A2,"-","")),1,9),D2),
"Cannot validate")),"0000000000")
…and with that ISBN-10, we calculate the valid ISBN-13:
=TEXT(IF(LEN(E3)=10,CONCATENATE("978",MID(E3,1,9),
MOD((10-MOD(SUM(9,21,8,PRODUCT(MID(E3,1,1),3),MID(E3,2,1),PRODUCT(MID(E3,3,1),3),
MID(E3,4,1),PRODUCT(MID(E3,5,1),3),MID(E3,6,1),PRODUCT(MID(E3,7,1),3),
MID(E3,8,1),PRODUCT(MID(E3,9,1),3)),10)),10)),"Cannot validate"),0)
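A hedged worked example, continuing with E3 = 0306406152: SUM(9,21,8,…) evaluates to 93, MOD(93,10) is 3, and 10 minus 3 gives a check digit of 7, so the formula returns 9780306406157, the correct ISBN-13 form of that ISBN-10.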
Helpful sources:
• http://drziegler.net/generating-eanisbn-13-check-digits-in-excel/
• http://useroffline.blogspot.com/2008/08/tip-spreadsheet-conversion-for-isbn-10.html
And sometimes the “invalid” ISBN is “valid” in the context of the title in hand (or on a
vendor spreadsheet):
“Invalid” ISBN-10 8185360866 vs. valid ISBN-10 8185360863:
two editions matched (different date)
Questions/Demo Time?
Use Case Addendum: Fetching Book Data
from Known Bookseller URLs (my first
experiment, rebuilt!)
 Context: We want to expand language coverage of titles in our South Asia
collections, but no staffing model can accommodate the dozens of languages
we need for a representative collection. We found a vendor, DC Books, which
has a great website for Malayalam books with info about them in English.
 Problem: A staff member who knows Malayalam can make book
recommendations, but asking him to copy and paste bibliographic data one at
a time into a spreadsheet for a bibliographer to review is laborious and
time-consuming.
 Solution: Use structured data on the DC Books website, and have the staff
member just record the URLs of books he recommends to us.
=IMPORTXML(A2,"//span[@style='font-size:14px; line-height:26px; color:#333;']")
 Takes the URL identified, and imports a <span>
element where the bib data lives.
=PROPER() functions normalize ALL CAPS to Proper Case
for title, author, and publisher.
=RIGHT([cell],4) takes the YYYY from the DD-MM-YYYY
date
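A hedged worked example of those cleanup formulas (the title is invented): PROPER("ENTE PRIYA KRITHIKAL") returns “Ente Priya Krithikal”, and RIGHT("23-05-2019",4) returns “2019”.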