1. The Challenge of Using Map-Reduce to Query Open Data
M. Pelucchi¹, G. Psaila², M. Toccu³
¹ University of Bergamo, Italy (current affiliation: Tabulaex srl, Milan, Italy)
² University of Bergamo, Italy
³ University of Milano Bicocca, Italy
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 1 / 31
2. Open Data
(The Open Definition)
“Open means anyone can freely access, use, modify, and share
for any purpose (subject, at most, to requirements that preserve
provenance and openness)” (http://opendefinition.org)
Open data refers to the information collected and held by Public
Administration that are made available to everyone through the
Internet
3. Open Data - Distribution size
http://opendatamonitor.eu/
4. Benefits of Open Data
Better decisions
Increased administrative transparency
Improved quality of public services
Driving innovation
Development of new services or new products (creation of a new economy)
5. Problems in re-using Open Data
Quality of Open Data
Lack of Standardization
Variety of formats and structures
Heterogeneous and federated data sources
Dimension
Unstructured data sets
Availability of Open Data
Incorrect or outdated metadata
6. Hammer prototype: motivations
Public Administrations openly publish many data sets concerning
citizens and territories
Open Data corpora have become so huge that it is impossible to
deal with them by hand; as a consequence, it is necessary to use
tools that include innovative techniques for querying them
The need for a centralized query engine arises, able to select
single data items from within a mess of heterogeneous open data
sets
7. Open Data Value Chain
8. Goals
Developing a technique:
for blindly querying a corpus of open data published by an open data portal
Distinctive features:
finding data sets with specific items (rows or objects)
a Blind Querying approach
9. Blindly Querying Open Data Portals
Users of Open Data Portals wishing to retrieve useful pieces of
information from within the mess of published data sets have to face
several issues:
Only keyword-based search: search engines usually provide
simple keyword-based retrieval, treating data sets as
documents without structure
Large data sets: traditional search engines return the entire
data set, instead of returning only the items of interest
Heterogeneity of data sets: the structure of data sets is not
consistent, even when two data sets concern the same topic
Schema unawareness: the user cannot be aware of the real
schema of thousands of data sets
10. Preliminary Definition
Open Data Set ods:
ods: <ods_id, dataset_name, schema, metadata>
ods_id: a unique identifier
dataset_name: the name of the data set (not unique)
schema: a set of field names
metadata: a set of pairs (label, value), which are additional
meta-data associated with the data set
Open Data Corpus C: the collection of Open Data Sets
published by an Open Data Portal
11. Query
q: <dn, P, sc>
dn: Data Set Name
P = { pn1, pn2, . . . }
set of field names (properties) of interest
sc: selection condition on field values
q1:<dn=nutrition,
P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa"
OR county = "Nairobi"
OR county = "Turkana") >
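As a sketch, the query structure q = <dn, P, sc> above could be represented as follows; the class and parameter names here are illustrative, not the Hammer prototype's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Query:
    dn: str                     # data set name
    P: Set[str]                 # field names (properties) of interest
    sc: Callable[[dict], bool]  # selection condition on field values

# q1 from the slide; the misspellings ("Monbasa", "stuting") are copied
# verbatim, since the query assessment step later corrects such terms.
q1 = Query(
    dn="nutrition",
    P={"county", "underweight", "stuting", "wasting"},
    sc=lambda row: row.get("county") in {"Monbasa", "Nairobi", "Turkana"},
)

print(q1.sc({"county": "Nairobi", "wasting": 4.1}))  # True
```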
12. Problem Definition
Relevance Measure: rm(ods) ∈ [0, 1]
Minimum Threshold: th ∈ [0, 1]
A data set ods is relevant if rm(ods) > 0 and rm(ods) ≥ th
The result set of a query q: RS = {o1, o2, . . . }
is the set of items (rows or JSON objects) oi such that odsj is
relevant, oi ∈ Instance(odsj), and oi satisfies the selection
condition q.sc
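The problem definition above can be sketched directly: keep the data sets whose relevance measure is positive and at least the threshold, then collect their items that satisfy the selection condition. This is a minimal illustration with made-up inputs, not the prototype's code.

```python
def result_set(datasets, rm, th, sc):
    """datasets: {ods_id: [row, ...]}; rm: {ods_id: relevance in [0, 1]};
    th: minimum threshold; sc: selection condition on a row."""
    rs = []
    for ods_id, rows in datasets.items():
        # a data set is relevant if rm > 0 and rm >= th
        if rm[ods_id] > 0 and rm[ods_id] >= th:
            rs.extend(row for row in rows if sc(row))
    return rs

datasets = {"ods1": [{"county": "Nairobi"}, {"county": "Kisumu"}],
            "ods2": [{"county": "Nairobi"}]}
print(result_set(datasets, {"ods1": 0.8, "ods2": 0.1}, 0.5,
                 lambda r: r["county"] == "Nairobi"))  # [{'county': 'Nairobi'}]
```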
14. Step 1 - Query Assessment
q1:<dn=nutrition, P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa" OR county = "Nairobi"
OR county = "Turkana") >
Terms in the query are searched
within the catalog
A missing term t is replaced
with term t′, the term in the
meta-data catalog with the
highest similarity score w.r.t. t
Similarity measure:
Jaro-Winkler
We obtain the assessed query q′
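The assessment step can be sketched as follows. The slides use Jaro-Winkler similarity; to keep this example self-contained, difflib's ratio is used here purely as a stand-in similarity measure.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in for Jaro-Winkler; any string similarity in [0, 1] works here
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def assess(terms, catalog):
    """Replace each term missing from the catalog with the catalog term
    that has the highest similarity score w.r.t. it."""
    assessed = []
    for t in terms:
        if t in catalog:
            assessed.append(t)
        else:
            assessed.append(max(catalog, key=lambda c: similarity(t, c)))
    return assessed

catalog = {"county", "mombasa", "nairobi", "turkana",
           "stunting", "wasting", "underweight"}
print(assess(["stuting", "monbasa", "county"], catalog))
# → ['stunting', 'mombasa', 'county']
```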
15. Step 2 - Keywords Selection
Relevant Terms for q: the most representative/informative terms
for finding potentially relevant data sets
16. Step 3 - Neighbour Queries
For each selected keyword, a pool of similar terms in the catalog
is identified
From the assessed query q, neighbour queries are derived
(queries similar to q)
Q: the set of queries to process (q and its neighbour queries)
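One plausible way to derive neighbour queries, sketched under the assumption that a neighbour query substitutes each keyword with a similar catalog term: expand every keyword into a pool of candidates and take the cross product. The cutoff and pool size are illustrative, not the prototype's values.

```python
from difflib import get_close_matches
from itertools import product

def neighbour_queries(terms, catalog, n=3, cutoff=0.6):
    """For each term, collect up to n similar catalog terms (or keep the
    term itself if nothing is close), then combine the pools."""
    pools = [get_close_matches(t, catalog, n=n, cutoff=cutoff) or [t]
             for t in terms]
    return [list(combo) for combo in product(*pools)]

print(neighbour_queries(["county", "wasting"],
                        ["county", "country", "wasting", "tasting"]))
```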
17. Step 4 - Data Set Retrieval
For each query in Q, data sets are retrieved:
Vector Space Model approach
Relevant keywords for each query are used
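A minimal Vector Space Model sketch of this step, using plain TF-IDF vectors and cosine similarity over the catalog terms of each data set; the weighting details are illustrative and may differ from the prototype's.

```python
import math
from collections import Counter

def tfidf_rank(query_terms, docs):
    """docs: {ods_id: [term, ...]}. Returns (ods_id, score) pairs, best first."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    idf = {t: math.log(n / df[t]) + 1 for t in df}

    def vec(terms):
        tf = Counter(terms)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    qv = vec(query_terms)
    return sorted(((d, cosine(qv, vec(t))) for d, t in docs.items()),
                  key=lambda p: p[1], reverse=True)

docs = {"ods1": ["nutrition", "county", "wasting"],
        "ods2": ["roads", "traffic", "county"]}
print(tfidf_rank(["nutrition", "county"], docs))
```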
18. Step 5 - Schema Fitting
The full set of field names in each query in Q is compared with
the schema of each selected data set
Data sets whose schema better fits the query will be more
relevant than other data sets
Relevance Measure of data set ods w.r.t. query nq
For a data set ods and a query nq ∈ Q
rm(ods, nq) = (1 − α) × krm(ods, nq) + α × sfd(ods, nq)
krm(ods, nq): Keyword-based Relevance Measure (VSM)
sfd(ods, nq): Schema Fitting Degree
α: balance factor
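Numerically, the combined measure rm = (1 − α) · krm + α · sfd can be sketched as below. The schema fitting degree is approximated here as the fraction of query fields present in the data set schema; the paper's actual definition may differ.

```python
def sfd(schema, query_fields):
    # illustrative Schema Fitting Degree: share of query fields in the schema
    return len(set(schema) & set(query_fields)) / len(query_fields)

def rm(krm, schema, query_fields, alpha=0.5):
    # convex combination of keyword relevance and schema fitting
    return (1 - alpha) * krm + alpha * sfd(schema, query_fields)

# krm = 0.8, schema covers 3 of 4 query fields: 0.5*0.8 + 0.5*0.75 = 0.775
print(rm(0.8, ["county", "wasting", "stunting"],
         ["county", "wasting", "stunting", "underweight"], alpha=0.5))
```

With α near 1 the schema fit dominates; with α near 0 the keyword score does, which is what the balance factor trades off.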
19. Step 6 - Instance Filtering
For each neighbour query nqi ∈ Q and each data set odsj
relevant for nqi
items (rows or JSON objects) in odsj that satisfy the selection
condition are selected
and inserted into the result set RS.
20. Map-Reduce in the Hammer Prototype
The final perspective application of the Hammer prototype is to build
a centralized query engine for many open data portals and to make
the blind querying technique available, in order to further simplify the
work of open data users.
Three main services provided by the Hammer prototype:
Open Data Portal Crawling: each portal is queried to get the list
of open data sets it publishes, together with their schema and
meta-data, so that a local catalog of data sets is built
Indexing: the inverted index to perform the technique (VSM
Dataset Retrieval) is built
Querying: queries are actually executed and evaluated on the
actual content of data sets
21. Crawler
The Crawler:
Gets information about the data sets contained in the corpus
published by an open data portal and builds the local catalog
Has been designed to connect with the several platforms usually
adopted to publish open data portals, through several connectors
(CKAN Connector, Socrata Connector, ...)
Implemented with Map-Reduce:
Splits the computation by end point URL; each end point is
assigned to a single worker; during the input split phase the
Crawler queries the open data portals and retrieves the list of all
data sets
Map downloads a single data set; for each data set, it retrieves
the meta-data and the first row to obtain the schema
Reduce merges the data sets by their unique identifier and stores
the data in the MongoDB data lake
22. Indexer
The Indexer:
Extracts terms and labels from schema and meta-data of each
open data set in the corpus C and builds the Inverted Index IN
Within the Indexer we adopted the Map-Reduce technique. For each
data set and for each label in the schema and in the meta-data:
The map primitive generates a pair <t, ods_id>
Reduce transforms the set of pairs into a nested structure
IN= { <t,ods_list: { ods_id1, . . . , ods_idn}> }
Finally, the inverted index IN is stored into the local database
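The Indexer's two phases can be sketched in plain Python (no Hadoop): map emits <t, ods_id> pairs from each data set's labels, and reduce groups them into the nested structure IN.

```python
from collections import defaultdict

def map_phase(ods_id, labels):
    # emit one <t, ods_id> pair per schema/meta-data label
    return [(t, ods_id) for t in labels]

def reduce_phase(pairs):
    # group the pairs into IN = { t: ods_list }
    index = defaultdict(set)
    for term, ods_id in pairs:
        index[term].add(ods_id)
    return dict(index)

corpus = {"ods1": ["county", "wasting"], "ods2": ["county", "roads"]}
pairs = [p for ods_id, labels in corpus.items()
         for p in map_phase(ods_id, labels)]
IN = reduce_phase(pairs)
print(IN["county"])  # contains 'ods1' and 'ods2'
```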
24. Query Engine
The Query Engine exploits the inverted
index IN to retrieve the list of data
sets relevant for the query (that is,
for each neighbour query).
The picture shows how the query
engine works:
It receives the query and
generates the set Q of queries
to process
Then, by exploiting the catalog
and the inverted index IN, it
determines relevant data sets
and then provides (as output)
the items within those data sets
that satisfy the selection
condition nq.sc of a query
nq ∈ Q
26. Case Study - Experimental Evaluation
and application
We implemented our technique within the Hammer prototype so as
to evaluate its effectiveness and study its performance
Africa Open Data portal: 2,301 data sets, non-homogeneous,
different structures for the same topics, about 5 million rows
We assembled a testbed for evaluating the Hammer prototype: a
(small) Hadoop cluster
On local PCs, used for development and tuning
On the Google Cloud Platform, used to evaluate performance
in an industrial setting
27. Case Study - Experimental Evaluation
and application
We have executed 9 blind queries on our Testbed and have collected
performance metrics for each phase
28. Case Study - Query
q1:<dn=nutrition, P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa" OR county = "Nairobi" OR county =
"Turkana") >
243 neighbour queries / 73 selected neighbour queries
3 relevant data sets / 3 rows found
about 1.5 hours to perform the query and create the data visualization
https://mauropelucchiunimib.carto.com/builder/d9a7b6b4-8605-4d61-9929-ee9197f466cf/embed
29. Conclusions
In this work we faced the challenge of parallelizing many parts of
the prototype by applying the Map-Reduce approach, in order to
evaluate the feasibility of this Big Data approach in the area of
Big Open Data
We demonstrated that the adoption of modern standard
approaches and technologies specifically designed for big data
management, such as Apache Hadoop and MongoDB, can be
effective in this context
We discovered that, although Apache Solr behaves quite well,
our technique is capable of better focusing on data sets of
interest; furthermore, it extracts only items of interest (while
Apache Solr does not in a classic configuration)
30. Future Developments
Improve performance
Replace native Hadoop with Spark (10x faster)
Introduce multi-language search
Extend query language: spatial joins
Query disambiguation: improve the generation of neighbour
queries
31. Thanks for your attention