SlideShare a Scribd company logo
1 of 31
Download to read offline
The Challenge of Using Map-Reduce to
Query Open Data
M. Pelucchi1
, G.Psaila2
, M.Toccu3
1
University of Bergamo, Italy
(current affiliation Tabulaex srl, Milan Italy)
2
University of Bergamo, Italy
3
University of Milano Bicocca, Italy
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 1 / 31
Open Data
(The Open Definition)
“Open means anyone can freely access, use, modify, and share
for any purpose (subject, at most, to requirements that preserve
provenance and openness)” (http://opendefinition.org)
Open data refers to the information collected and held by Public
Administration that are made available to everyone through the
Internet
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 2 / 31
Open Data - Distribution size
http://opendatamonitor.eu/
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 3 / 31
Benefits of Open Data
Better decisions
Increase administrative transparency
Improve the quality of public services
Lead Innovation
Development of new services or new products (creation of new
economy)
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 4 / 31
Problem for re-use Open Data
Quality of Open Data
Lack of Standardization
Variety of formats and structures
Heterogeneous and federated data sources
Dimension
Unstructured data sets
Availability of Open Data
Incorrect or not updated metadata
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 5 / 31
Hammer prototype: motivations
Public Administrations openly publish many data sets concerning
citizens and territories
Open Data corpora has become so huge that it is impossible to
deal with them by hand; as a consequence, it is necessary to use
tools that include innovative techniques able to query them
The need for a centralized query engine arises, able to select
single data items from within a mess of heterogeneous open data
sets
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 6 / 31
Open Data Value Chain
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 7 / 31
Goals
Developing a technique:
for blindly querying a corpus of open data
from within a corpus published by a portal of open data
Distinctive features:
finding data sets with specific items (rows or objects)
Blindly Querying approach
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 8 / 31
Blindly Querying Open Data Portals
Users of Open Data Portals wishing to retrieve useful pieces of
information from within the mess of published data sets, have to face
several issues
Only keyword-based search: Search engines usually provide a
simple keyword-based retrieval, thinking about data sets as
documents without structure
Large data sets: Traditional search engines return the entire
data set, instead of returning only the items of interest
Heterogeneity of data sets: The structure of data sets is not
coherent, even though two data sets concern the same topic
Schema Unawareness: the user cannot be aware of the real
schema of thousands of data sets
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 9 / 31
Preliminary Definition
Open Data Set ods:
ods: <ods_id, dataset_name,schema, metadata>
ods_id: is a unique identifier
dataset_name: is the name of the data set (not unique)
schema is a set of fields names,
metadata is a set of pairs (label, value), which are additional
meta-data associated with the data set
Open Data Corpus C: the collection of Open Data Sets
published by an Open Data Portal
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 10 / 31
Query
q: <dn, P, sc>
dn: Data Set Name
P = { pn1, pn2, . . . }
set of field names (properties) of interest
sc: selection condition on field values
q1:<dn=nutrition,
P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa"
OR county = "Nairobi"
OR county = "Turkana") >
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 11 / 31
Problem Definition
Relevance Measure
rm(ods)∈ [0, 1]
Minimum Threshold
th∈ [0, 1]
A data set ods is relevant if
rm(ods)> 0
and rm(ods)≥th
The result set of a query q
RS = {o1, o2, . . . }
the set of items (rows or
JSON objects) oi
such that odsj is relevant
oi ∈ Instance(odsj )
and oi satisfies the
selection condition q.sc
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 12 / 31
Processing Steps
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 13 / 31
Step 1 - Query Assessment
q1:<dn=nutrition, P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa" OR county = "Nairobi"
OR county = "Turkana") >
Terms in the query are searched
within the catalogue
A missing term t is replaced
with term t, the term in the
meta-data catalog with the
highest similarity score w.r.t. t
Similarity measure:
Jaro-Winkler
We obtain the assessed query q
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 14 / 31
Step 2 - Keywords Selection
Relevant Terms for q:
the most
representative/informative
terms
for finding potentially
relevant data sets
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 15 / 31
Step 3 - Neighbour Queries
For each selected keyword, a pool of similar terms in the catalog
are identified
From the assessed query q, Neighbour Queries is derived
(queries which are similar to q)
Q: the set of queries to process (q and neighbour queries)
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 16 / 31
Step 4 - Data Set Retrieval
For each query in Q, data sets are retrieved:
Vector Space Model approach
Relevant keywords for each query are used
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 17 / 31
Step 5 - Schema Fitting
The full set of field names in each query in Q is compared with
the schema of each selected data set
Data sets whose schema better fits the query will be more
relevant than other data sets
Relevance Measure of data set ods w.r.t. query nq
For a data set ods and a query nq∈Q
rm(ods, nq)= +(α − 1) × krm(ods, nq) + α × sfd(ods, nq)
krm(ods, nq): Keyword-based Relevance Measure (VSM)
sfd(ods) Schema Fitting Degree
α: balance factor
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 18 / 31
Step 6 - Instance Filtering
For each neighbour query nqi ∈ Q and each data set odsj
relevant for nqi
items (rows or JSON objects) in odsJ that satisfy the selection
condition are selected
and inserted into the result set RS.
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 19 / 31
Map-Reduce in the Hammer Prototype
The final perspective application of the Hammer prototype is to build
a centralized query engine for many open data portals and making
available the blindly queries technique, in order to further simplify the
work of open data users.
Three main services provided by the Hammer prototype:
Open Data Portal Crawling: portal is queried to get the list
of open data sets they publish, getting their schema and
meta-data, so that a local catalog of data sets is locally built
Indexing: the inverted index to perform the technique (VSM
Dataset Retrieval) is built
Querying: queries are actually executed and evaluated on the
actual content of data sets
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 20 / 31
Crawler
The Crawler:
Gets information about the data sets contained in the corpus
published by an open data portal and builds the local catalog
Has been designed to connect with several platforms usually
adopted to publish open data portals through several connector
(CKAN Connector, Socrata Connector, ...)
Implement with Map-Reduce
Splits the computation by the end point URL; an end point is
assigned to a single worker; during the input split phase the
Crawler queries the open data portals and retrieve the list of all
data sets
Map is used to download a single data set; for each data set,
retrieve the metadata and the first row to obtain the schema
Reduce merges the data sets by their unique identifier and store
the data on MongoDB data lake
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 21 / 31
Indexer
The Indexer:
Extracts terms and labels from schema and meta-data of each
open data set in the corpus C and builds the Inverted Index IN
Within Indexer we adopted the Map-Reduce technique. For each data
set and for each label in the schema and in the meta-data:
The map primitive generates a pair <t, ods_id>
Reduce transforms the set of pairs into a nested structure
IN= { <t,ods_list: { ods_id1, . . . , ods_idn}> }
Finally, the inverted index IN is stored into the local database
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 22 / 31
Indexer
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 23 / 31
Query Engine
Query Engine exploits the inverted
index IN to retrieve the list of data
sets relevant for the query (that is
for each neighbor query).
The picture shows how the query
engine works:
It receives the query and
generates the set Q of queries
to process
Then, by exploiting the catalog
and the inverted index IN, it
determines relevant data sets
and then provides (as output)
the items within those data sets
that satisfy the selection
condition nq.sc of a query
nq ∈ Q
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 24 / 31
Query Engine
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 25 / 31
Case Study - Experimental Evaluation
and application
We implemented our technique within the Hammer prototype so as
to evaluate the effectiveness of the technique and study performance
Africa Open Data portal: 2301 Data Sets, non homogeneous data
sets, different structures for same topics, about 5 millions rows
We assembled a testbed for evaluating the Hammer prototype: a
(small) Hadoop
On local PCs and it was used for development and tuning
On Google Cloud Platform and it was used to evaluate performance
in an industrial settings
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 26 / 31
Case Study - Experimental Evaluation
and application
We have executed 9 blind queries on our Testbed and have collected
performance metrics for each phase
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 27 / 31
Case Study - Query
q1:<dn=nutrition, P= {county, underweight, stuting, wasting},
sc= (county = "Monbasa" OR county = "Nairobi" OR county =
"Turkana") >
243 neighbour queries / 73 selected neighbour queries
3 relevant data sets / 3 rows found
about 1.5 hours to perform the query and create the data visualization
https://mauropelucchiunimib.carto.com/builder/d9a7b6b4-8605-4d61-9929-ee9197f466cf/embed
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 28 / 31
Conclusions
In this work we faced the challenge of parallelizing many parts of
the prototype by applying the Map-Reduce approach, in order to
evaluate the feasibility of this Big Data approach in the area of
Big Open Data
We demonstrated that the adoption of modern standard
approach and technology specifically designed for big data
management, such as Apache Hadoop and MongoDB, can be
effective in this context
We discovered that, although Apache Solr behaves quite well,
our technique is capable of better focusing on data sets of
interest; furthermore, it extracts only items of interest (while
Apache Solr does not in a classic configuration)
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 29 / 31
Future Developments
Improve performance
Replace native Hadoop with Spark (10x faster)
Introduce Multilanguage search
Extend query language: spatial joins
Query disambiguation: improve the generation of neighbour
queries
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 30 / 31
Thanks for your attention
M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 31 / 31

More Related Content

Similar to The Challenge of Using Map-Reduce to Query Open Data

L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseMohaiminur Rahman
 
Workload-aware materialization for efficient variable elimination on Bayesian...
Workload-aware materialization for efficient variable elimination on Bayesian...Workload-aware materialization for efficient variable elimination on Bayesian...
Workload-aware materialization for efficient variable elimination on Bayesian...Cigdem Aslay
 
Text mining and machine learning
Text mining and machine learningText mining and machine learning
Text mining and machine learningJisc RDM
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyijpla
 
Ontologies Ontop Databases
Ontologies Ontop DatabasesOntologies Ontop Databases
Ontologies Ontop DatabasesMartín Rezk
 
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...Advanced-Concepts-Team
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overviewdgarijo
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewVahid Mirjalili
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.pptImXaib
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.pptRajeshT305412
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsYONG ZHENG
 
Query expansion_group42_ire
Query expansion_group42_ireQuery expansion_group42_ire
Query expansion_group42_ireKovidaN
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
Exploiting Semantic Information for Graph-based Recommendations of Learning R...
Exploiting Semantic Information for Graph-based Recommendations of Learning R...Exploiting Semantic Information for Graph-based Recommendations of Learning R...
Exploiting Semantic Information for Graph-based Recommendations of Learning R...Mojisola Erdt née Anjorin
 
Ectel sem_info_rec_learning_resources_v6.0_20120921_ma
Ectel  sem_info_rec_learning_resources_v6.0_20120921_maEctel  sem_info_rec_learning_resources_v6.0_20120921_ma
Ectel sem_info_rec_learning_resources_v6.0_20120921_maMojisola Erdt née Anjorin
 

Similar to The Challenge of Using Map-Reduce to Query Open Data (20)

L4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics CourseL4 cluster analysis NWU 4.3 Graphics Course
L4 cluster analysis NWU 4.3 Graphics Course
 
Workload-aware materialization for efficient variable elimination on Bayesian...
Workload-aware materialization for efficient variable elimination on Bayesian...Workload-aware materialization for efficient variable elimination on Bayesian...
Workload-aware materialization for efficient variable elimination on Bayesian...
 
Text mining and machine learning
Text mining and machine learningText mining and machine learning
Text mining and machine learning
 
final_presentation
final_presentationfinal_presentation
final_presentation
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracy
 
Ontologies Ontop Databases
Ontologies Ontop DatabasesOntologies Ontop Databases
Ontologies Ontop Databases
 
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin...
 
Visualization of Crisp and Rough Clustering using MATLAB
Visualization of Crisp and Rough Clustering using MATLABVisualization of Crisp and Rough Clustering using MATLAB
Visualization of Crisp and Rough Clustering using MATLAB
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overview
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
Query expansion_group42_ire
Query expansion_group42_ireQuery expansion_group42_ire
Query expansion_group42_ire
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
Exploiting Semantic Information for Graph-based Recommendations of Learning R...
Exploiting Semantic Information for Graph-based Recommendations of Learning R...Exploiting Semantic Information for Graph-based Recommendations of Learning R...
Exploiting Semantic Information for Graph-based Recommendations of Learning R...
 
Ectel sem_info_rec_learning_resources_v6.0_20120921_ma
Ectel  sem_info_rec_learning_resources_v6.0_20120921_maEctel  sem_info_rec_learning_resources_v6.0_20120921_ma
Ectel sem_info_rec_learning_resources_v6.0_20120921_ma
 

Recently uploaded

定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Excelmac1
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleanscorenetworkseo
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 

Recently uploaded (20)

定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...Blepharitis inflammation of eyelid symptoms cause everything included along w...
Blepharitis inflammation of eyelid symptoms cause everything included along w...
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
Elevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New OrleansElevate Your Business with Our IT Expertise in New Orleans
Elevate Your Business with Our IT Expertise in New Orleans
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 

The Challenge of Using Map-Reduce to Query Open Data

  • 1. The Challenge of Using Map-Reduce to Query Open Data M. Pelucchi1 , G.Psaila2 , M.Toccu3 1 University of Bergamo, Italy (current affiliation Tabulaex srl, Milan Italy) 2 University of Bergamo, Italy 3 University of Milano Bicocca, Italy M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 1 / 31
  • 2. Open Data (The Open Definition) “Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness)” (http://opendefinition.org) Open data refers to the information collected and held by Public Administration that are made available to everyone through the Internet M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 2 / 31
  • 3. Open Data - Distribution size http://opendatamonitor.eu/ M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 3 / 31
  • 4. Benefits of Open Data Better decisions Increase administrative transparency Improve the quality of public services Lead Innovation Development of new services or new products (creation of new economy) M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 4 / 31
  • 5. Problem for re-use Open Data Quality of Open Data Lack of Standardization Variety of formats and structures Heterogeneous and federated data sources Dimension Unstructured data sets Availability of Open Data Incorrect or not updated metadata M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 5 / 31
  • 6. Hammer prototype: motivations Public Administrations openly publish many data sets concerning citizens and territories Open Data corpora has become so huge that it is impossible to deal with them by hand; as a consequence, it is necessary to use tools that include innovative techniques able to query them The need for a centralized query engine arises, able to select single data items from within a mess of heterogeneous open data sets M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 6 / 31
  • 7. Open Data Value Chain M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 7 / 31
  • 8. Goals Developing a technique: for blindly querying a corpus of open data from within a corpus published by a portal of open data Distinctive features: finding data sets with specific items (rows or objects) Blindly Querying approach M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 8 / 31
  • 9. Blindly Querying Open Data Portals Users of Open Data Portals wishing to retrieve useful pieces of information from within the mess of published data sets, have to face several issues Only keyword-based search: Search engines usually provide a simple keyword-based retrieval, thinking about data sets as documents without structure Large data sets: Traditional search engines return the entire data set, instead of returning only the items of interest Heterogeneity of data sets: The structure of data sets is not coherent, even though two data sets concern the same topic Schema Unawareness: the user cannot be aware of the real schema of thousands of data sets M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 9 / 31
  • 10. Preliminary Definition Open Data Set ods: ods: <ods_id, dataset_name,schema, metadata> ods_id: is a unique identifier dataset_name: is the name of the data set (not unique) schema is a set of fields names, metadata is a set of pairs (label, value), which are additional meta-data associated with the data set Open Data Corpus C: the collection of Open Data Sets published by an Open Data Portal M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 10 / 31
  • 11. Query q: <dn, P, sc> dn: Data Set Name P = { pn1, pn2, . . . } set of field names (properties) of interest sc: selection condition on field values q1:<dn=nutrition, P= {county, underweight, stuting, wasting}, sc= (county = "Monbasa" OR county = "Nairobi" OR county = "Turkana") > M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 11 / 31
  • 12. Problem Definition Relevance Measure rm(ods)∈ [0, 1] Minimum Threshold th∈ [0, 1] A data set ods is relevant if rm(ods)> 0 and rm(ods)≥th The result set of a query q RS = {o1, o2, . . . } the set of items (rows or JSON objects) oi such that odsj is relevant oi ∈ Instance(odsj ) and oi satisfies the selection condition q.sc M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 12 / 31
  • 13. Processing Steps M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 13 / 31
  • 14. Step 1 - Query Assessment q1:<dn=nutrition, P= {county, underweight, stuting, wasting}, sc= (county = "Monbasa" OR county = "Nairobi" OR county = "Turkana") > Terms in the query are searched within the catalogue A missing term t is replaced with term t, the term in the meta-data catalog with the highest similarity score w.r.t. t Similarity measure: Jaro-Winkler We obtain the assessed query q M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 14 / 31
  • 15. Step 2 - Keywords Selection Relevant Terms for q: the most representative/informative terms for finding potentially relevant data sets M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 15 / 31
  • 16. Step 3 - Neighbour Queries For each selected keyword, a pool of similar terms in the catalog are identified From the assessed query q, Neighbour Queries is derived (queries which are similar to q) Q: the set of queries to process (q and neighbour queries) M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 16 / 31
  • 17. Step 4 - Data Set Retrieval For each query in Q, data sets are retrieved: Vector Space Model approach Relevant keywords for each query are used M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 17 / 31
  • 18. Step 5 - Schema Fitting The full set of field names in each query in Q is compared with the schema of each selected data set Data sets whose schema better fits the query will be more relevant than other data sets Relevance Measure of data set ods w.r.t. query nq For a data set ods and a query nq∈Q rm(ods, nq)= +(α − 1) × krm(ods, nq) + α × sfd(ods, nq) krm(ods, nq): Keyword-based Relevance Measure (VSM) sfd(ods) Schema Fitting Degree α: balance factor M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 18 / 31
  • 19. Step 6 - Instance Filtering For each neighbour query nqi ∈ Q and each data set odsj relevant for nqi items (rows or JSON objects) in odsJ that satisfy the selection condition are selected and inserted into the result set RS. M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 19 / 31
  • 20. Map-Reduce in the Hammer Prototype The final perspective application of the Hammer prototype is to build a centralized query engine for many open data portals and making available the blindly queries technique, in order to further simplify the work of open data users. Three main services provided by the Hammer prototype: Open Data Portal Crawling: portal is queried to get the list of open data sets they publish, getting their schema and meta-data, so that a local catalog of data sets is locally built Indexing: the inverted index to perform the technique (VSM Dataset Retrieval) is built Querying: queries are actually executed and evaluated on the actual content of data sets M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 20 / 31
  • 21. Crawler The Crawler: Gets information about the data sets contained in the corpus published by an open data portal and builds the local catalog Has been designed to connect with several platforms usually adopted to publish open data portals through several connector (CKAN Connector, Socrata Connector, ...) Implement with Map-Reduce Splits the computation by the end point URL; an end point is assigned to a single worker; during the input split phase the Crawler queries the open data portals and retrieve the list of all data sets Map is used to download a single data set; for each data set, retrieve the metadata and the first row to obtain the schema Reduce merges the data sets by their unique identifier and store the data on MongoDB data lake M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 21 / 31
  • 22. Indexer The Indexer: Extracts terms and labels from schema and meta-data of each open data set in the corpus C and builds the Inverted Index IN Within Indexer we adopted the Map-Reduce technique. For each data set and for each label in the schema and in the meta-data: The map primitive generates a pair <t, ods_id> Reduce transforms the set of pairs into a nested structure IN= { <t,ods_list: { ods_id1, . . . , ods_idn}> } Finally, the inverted index IN is stored into the local database M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 22 / 31
  • 23. Indexer M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 23 / 31
  • 24. Query Engine Query Engine exploits the inverted index IN to retrieve the list of data sets relevant for the query (that is for each neighbor query). The picture shows how the query engine works: It receives the query and generates the set Q of queries to process Then, by exploiting the catalog and the inverted index IN, it determines relevant data sets and then provides (as output) the items within those data sets that satisfy the selection condition nq.sc of a query nq ∈ Q M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 24 / 31
  • 25. Query Engine M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 25 / 31
  • 26. Case Study - Experimental Evaluation and application We implemented our technique within the Hammer prototype so as to evaluate the effectiveness of the technique and study performance Africa Open Data portal: 2301 Data Sets, non homogeneous data sets, different structures for same topics, about 5 millions rows We assembled a testbed for evaluating the Hammer prototype: a (small) Hadoop On local PCs and it was used for development and tuning On Google Cloud Platform and it was used to evaluate performance in an industrial settings M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 26 / 31
  • 27. Case Study - Experimental Evaluation and application We have executed 9 blind queries on our Testbed and have collected performance metrics for each phase M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 27 / 31
  • 28. Case Study - Query q1:<dn=nutrition, P= {county, underweight, stuting, wasting}, sc= (county = "Monbasa" OR county = "Nairobi" OR county = "Turkana") > 243 neighbour queries / 73 selected neighbour queries 3 relevant data sets / 3 rows found about 1.5 hours to perform the query and create the data visualization https://mauropelucchiunimib.carto.com/builder/d9a7b6b4-8605-4d61-9929-ee9197f466cf/embed M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 28 / 31
  • 29. Conclusions In this work we faced the challenge of parallelizing many parts of the prototype by applying the Map-Reduce approach, in order to evaluate the feasibility of this Big Data approach in the area of Big Open Data We demonstrated that the adoption of modern standard approach and technology specifically designed for big data management, such as Apache Hadoop and MongoDB, can be effective in this context We discovered that, although Apache Solr behaves quite well, our technique is capable of better focusing on data sets of interest; furthermore, it extracts only items of interest (while Apache Solr does not in a classic configuration) M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 29 / 31
  • 30. Future Developments Improve performance Replace native Hadoop with Spark (10x faster) Introduce Multilanguage search Extend query language: spatial joins Query disambiguation: improve the generation of neighbour queries M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 30 / 31
  • 31. Thanks for your attention M. Pelucchi, G.Psaila, M.Toccu The Challenge of Using Map-Reduce to Query Open Data 31 / 31