1. The Challenge of Using Map-Reduce to Query Open Data
M. Pelucchi¹, G. Psaila², M. Toccu³
¹ University of Bergamo, Italy (current affiliation: Tabulaex srl, Milan, Italy)
² University of Bergamo, Italy
³ University of Milano Bicocca, Italy
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 1 / 31
2. Open Data
(The Open Definition)
“Open means anyone can freely access, use, modify, and share
for any purpose (subject, at most, to requirements that preserve
provenance and openness)” (http://opendefinition.org)
Open data refers to the information collected and held by Public
Administration that are made available to everyone through the
Internet
3. Open Data - Distribution size
http://opendatamonitor.eu/
4. Benefits of Open Data
Better decisions
Increased administrative transparency
Improved quality of public services
Driving innovation
Development of new services or new products (creation of a new economy)
5. Problems in re-using Open Data
Quality of Open Data
Lack of Standardization
Variety of formats and structures
Heterogeneous and federated data sources
Dimension
Unstructured data sets
Availability of Open Data
Incorrect or outdated metadata
6. Hammer prototype: motivations
Public Administrations openly publish many data sets concerning
citizens and territories
Open Data corpora have become so huge that it is impossible to
deal with them by hand; as a consequence, it is necessary to use
tools that include innovative techniques for querying them
The need for a centralized query engine arises, able to select
single data items from within a mess of heterogeneous open data
sets
7. Open Data Value Chain
8. Goals
Developing a technique:
for blindly querying a corpus of open data published by an open data portal
Distinctive features:
finding data sets with specific items (rows or objects)
a Blind Querying approach
9. Blindly Querying Open Data Portals
Users of Open Data Portals wishing to retrieve useful pieces of
information from within the mess of published data sets have to face
several issues:
Only keyword-based search: search engines usually provide
simple keyword-based retrieval, treating data sets as
documents without structure
Large data sets: traditional search engines return the entire
data set, instead of returning only the items of interest
Heterogeneity of data sets: the structure of data sets is not
consistent, even when two data sets concern the same topic
Schema unawareness: the user cannot be aware of the real
schema of thousands of data sets
10. Preliminary Definition
Open Data Set ods:
ods: <ods_id, dataset_name, schema, metadata>
ods_id: a unique identifier
dataset_name: the name of the data set (not unique)
schema: a set of field names
metadata: a set of pairs (label, value), which are additional
meta-data associated with the data set
Open Data Corpus C: the collection of Open Data Sets
published by an Open Data Portal
11. Query
q: <dn, P, sc>
dn: Data Set Name
P = { pn1, pn2, . . . }
set of field names (properties) of interest
sc: selection condition on field values
q1:<dn=nutrition,
P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa"
OR county = "Nairobi"
OR county = "Turkana") >
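As a sketch, the query structure q = <dn, P, sc> above could be represented as follows; the class and parameter names here are illustrative, not the Hammer prototype's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Query:
    dn: str                     # data set name
    P: Set[str]                 # field names (properties) of interest
    sc: Callable[[dict], bool]  # selection condition on field values

# q1 from the slide; the misspellings ("Monbasa", "stuting") are copied
# verbatim, since the query assessment step later corrects such terms.
q1 = Query(
    dn="nutrition",
    P={"county", "underweight", "stuting", "wasting"},
    sc=lambda row: row.get("county") in {"Monbasa", "Nairobi", "Turkana"},
)

print(q1.sc({"county": "Nairobi", "wasting": 4.1}))  # True
```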
12. Problem Definition
Relevance Measure: rm(ods) ∈ [0, 1]
Minimum Threshold: th ∈ [0, 1]
A data set ods is relevant if rm(ods) > 0 and rm(ods) ≥ th
The result set of a query q: RS = {o1, o2, . . . }
is the set of items (rows or JSON objects) oi such that odsj is
relevant, oi ∈ Instance(odsj), and oi satisfies the selection
condition q.sc
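The problem definition above can be sketched directly: keep the data sets whose relevance measure is positive and at least the threshold, then collect their items that satisfy the selection condition. This is a minimal illustration with made-up inputs, not the prototype's code.

```python
def result_set(datasets, rm, th, sc):
    """datasets: {ods_id: [row, ...]}; rm: {ods_id: relevance in [0, 1]};
    th: minimum threshold; sc: selection condition on a row."""
    rs = []
    for ods_id, rows in datasets.items():
        # a data set is relevant if rm > 0 and rm >= th
        if rm[ods_id] > 0 and rm[ods_id] >= th:
            rs.extend(row for row in rows if sc(row))
    return rs

datasets = {"ods1": [{"county": "Nairobi"}, {"county": "Kisumu"}],
            "ods2": [{"county": "Nairobi"}]}
print(result_set(datasets, {"ods1": 0.8, "ods2": 0.1}, 0.5,
                 lambda r: r["county"] == "Nairobi"))  # [{'county': 'Nairobi'}]
```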
14. Step 1 - Query Assessment
q1:<dn=nutrition, P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa" OR county = "Nairobi"
OR county = "Turkana") >
Terms in the query are searched
within the catalog
A missing term t is replaced
with term t′, the term in the
meta-data catalog with the
highest similarity score w.r.t. t
Similarity measure:
Jaro-Winkler
We obtain the assessed query q′
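The assessment step can be sketched as follows. The slides use Jaro-Winkler similarity; to keep this example self-contained, difflib's ratio is used here purely as a stand-in similarity measure.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in for Jaro-Winkler; any string similarity in [0, 1] works here
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def assess(terms, catalog):
    """Replace each term missing from the catalog with the catalog term
    that has the highest similarity score w.r.t. it."""
    assessed = []
    for t in terms:
        if t in catalog:
            assessed.append(t)
        else:
            assessed.append(max(catalog, key=lambda c: similarity(t, c)))
    return assessed

catalog = {"county", "mombasa", "nairobi", "turkana",
           "stunting", "wasting", "underweight"}
print(assess(["stuting", "monbasa", "county"], catalog))
# → ['stunting', 'mombasa', 'county']
```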
15. Step 2 - Keywords Selection
Relevant Terms for q: the most representative/informative terms
for finding potentially relevant data sets
16. Step 3 - Neighbour Queries
For each selected keyword, a pool of similar terms in the catalog
is identified
From the assessed query q, neighbour queries are derived
(queries similar to q)
Q: the set of queries to process (q and its neighbour queries)
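One plausible way to derive neighbour queries, sketched under the assumption that a neighbour query substitutes each keyword with a similar catalog term: expand every keyword into a pool of candidates and take the cross product. The cutoff and pool size are illustrative, not the prototype's values.

```python
from difflib import get_close_matches
from itertools import product

def neighbour_queries(terms, catalog, n=3, cutoff=0.6):
    """For each term, collect up to n similar catalog terms (or keep the
    term itself if nothing is close), then combine the pools."""
    pools = [get_close_matches(t, catalog, n=n, cutoff=cutoff) or [t]
             for t in terms]
    return [list(combo) for combo in product(*pools)]

print(neighbour_queries(["county", "wasting"],
                        ["county", "country", "wasting", "tasting"]))
```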
17. Step 4 - Data Set Retrieval
For each query in Q, data sets are retrieved:
Vector Space Model approach
Relevant keywords for each query are used
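A minimal Vector Space Model sketch of this step, using plain TF-IDF vectors and cosine similarity over the catalog terms of each data set; the weighting details are illustrative and may differ from the prototype's.

```python
import math
from collections import Counter

def tfidf_rank(query_terms, docs):
    """docs: {ods_id: [term, ...]}. Returns (ods_id, score) pairs, best first."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    idf = {t: math.log(n / df[t]) + 1 for t in df}

    def vec(terms):
        tf = Counter(terms)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    qv = vec(query_terms)
    return sorted(((d, cosine(qv, vec(t))) for d, t in docs.items()),
                  key=lambda p: p[1], reverse=True)

docs = {"ods1": ["nutrition", "county", "wasting"],
        "ods2": ["roads", "traffic", "county"]}
print(tfidf_rank(["nutrition", "county"], docs))
```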
18. Step 5 - Schema Fitting
The full set of field names in each query in Q is compared with
the schema of each selected data set
Data sets whose schema better fits the query will be more
relevant than other data sets
Relevance Measure of data set ods w.r.t. query nq
For a data set ods and a query nq ∈ Q
rm(ods, nq) = (1 − α) × krm(ods, nq) + α × sfd(ods, nq)
krm(ods, nq): Keyword-based Relevance Measure (VSM)
sfd(ods, nq): Schema Fitting Degree
α: balance factor
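Numerically, the combined measure rm = (1 − α) · krm + α · sfd can be sketched as below. The schema fitting degree is approximated here as the fraction of query fields present in the data set schema; the paper's actual definition may differ.

```python
def sfd(schema, query_fields):
    # illustrative Schema Fitting Degree: share of query fields in the schema
    return len(set(schema) & set(query_fields)) / len(query_fields)

def rm(krm, schema, query_fields, alpha=0.5):
    # convex combination of keyword relevance and schema fitting
    return (1 - alpha) * krm + alpha * sfd(schema, query_fields)

# krm = 0.8, schema covers 3 of 4 query fields: 0.5*0.8 + 0.5*0.75 = 0.775
print(rm(0.8, ["county", "wasting", "stunting"],
         ["county", "wasting", "stunting", "underweight"], alpha=0.5))
```

With α near 1 the schema fit dominates; with α near 0 the keyword score does, which is what the balance factor trades off.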
19. Step 6 - Instance Filtering
For each neighbour query nqi ∈ Q and each data set odsj
relevant for nqi
items (rows or JSON objects) in odsj that satisfy the selection
condition are selected
and inserted into the result set RS.
20. Map-Reduce in the Hammer Prototype
The final perspective application of the Hammer prototype is to build
a centralized query engine for many open data portals and to make
the blind querying technique available, in order to further simplify the
work of open data users.
Three main services provided by the Hammer prototype:
Open Data Portal Crawling: each portal is queried to get the list
of open data sets it publishes, together with their schema and
meta-data, so that a local catalog of data sets is built
Indexing: the inverted index to perform the technique (VSM
Dataset Retrieval) is built
Querying: queries are actually executed and evaluated on the
actual content of data sets
21. Crawler
The Crawler:
Gets information about the data sets contained in the corpus
published by an open data portal and builds the local catalog
Has been designed to connect with the several platforms usually
adopted to publish open data portals, through several connectors
(CKAN Connector, Socrata Connector, ...)
Implemented with Map-Reduce:
Splits the computation by end point URL; each end point is
assigned to a single worker; during the input split phase the
Crawler queries the open data portals and retrieves the list of all
data sets
Map downloads a single data set; for each data set, it retrieves
the meta-data and the first row to obtain the schema
Reduce merges the data sets by their unique identifier and stores
the data in the MongoDB data lake
22. Indexer
The Indexer:
Extracts terms and labels from schema and meta-data of each
open data set in the corpus C and builds the Inverted Index IN
Within the Indexer we adopted the Map-Reduce technique. For each
data set and for each label in the schema and in the meta-data:
The map primitive generates a pair <t, ods_id>
Reduce transforms the set of pairs into a nested structure
IN= { <t,ods_list: { ods_id1, . . . , ods_idn}> }
Finally, the inverted index IN is stored into the local database
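The Indexer's two phases can be sketched in plain Python (no Hadoop): map emits <t, ods_id> pairs from each data set's labels, and reduce groups them into the nested structure IN.

```python
from collections import defaultdict

def map_phase(ods_id, labels):
    # emit one <t, ods_id> pair per schema/meta-data label
    return [(t, ods_id) for t in labels]

def reduce_phase(pairs):
    # group the pairs into IN = { t: ods_list }
    index = defaultdict(set)
    for term, ods_id in pairs:
        index[term].add(ods_id)
    return dict(index)

corpus = {"ods1": ["county", "wasting"], "ods2": ["county", "roads"]}
pairs = [p for ods_id, labels in corpus.items()
         for p in map_phase(ods_id, labels)]
IN = reduce_phase(pairs)
print(IN["county"])  # contains 'ods1' and 'ods2'
```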
24. Query Engine
The Query Engine exploits the inverted
index IN to retrieve the list of data
sets relevant for the query (that is,
for each neighbour query).
The picture shows how the query
engine works:
It receives the query and
generates the set Q of queries
to process
Then, by exploiting the catalog
and the inverted index IN, it
determines relevant data sets
and then provides (as output)
the items within those data sets
that satisfy the selection
condition nq.sc of a query
nq ∈ Q
26. Case Study - Experimental Evaluation
and application
We implemented our technique within the Hammer prototype so as
to evaluate its effectiveness and study its performance
Africa Open Data portal: 2,301 data sets, non-homogeneous,
different structures for the same topics, about 5 million rows
We assembled a testbed for evaluating the Hammer prototype: a
(small) Hadoop cluster
On local PCs, used for development and tuning
On the Google Cloud Platform, used to evaluate performance
in an industrial setting
27. Case Study - Experimental Evaluation
and application
We have executed 9 blind queries on our Testbed and have collected
performance metrics for each phase
28. Case Study - Query
q1:<dn=nutrition, P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa" OR county = "Nairobi" OR county =
"Turkana") >
243 neighbour queries / 73 selected neighbour queries
3 relevant data sets / 3 rows found
about 1.5 hours to perform the query and create the data visualization
https://mauropelucchiunimib.carto.com/builder/d9a7b6b4-8605-4d61-9929-ee9197f466cf/embed
29. Conclusions
In this work we faced the challenge of parallelizing many parts of
the prototype by applying the Map-Reduce approach, in order to
evaluate the feasibility of this Big Data approach in the area of
Big Open Data
We demonstrated that the adoption of modern standard
approaches and technologies specifically designed for big data
management, such as Apache Hadoop and MongoDB, can be
effective in this context
We discovered that, although Apache Solr behaves quite well,
our technique is capable of better focusing on data sets of
interest; furthermore, it extracts only items of interest (while
Apache Solr does not in a classic configuration)
30. Future Developments
Improve performance
Replace native Hadoop with Spark (10x faster)
Introduce multi-language search
Extend query language: spatial joins
Query disambiguation: improve the generation of neighbour
queries
31. Thanks for your attention