Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
1. Trying Not to Die Benchmarking using LITMUS
Harsh Thakkar (1), Yashwant Keswani (2), Mohnish Dubey (1), Jens Lehmann (1,3), Sören Auer (4)
(1) University of Bonn, Bonn, Germany
(2) DA-IICT, Gandhinagar, India
(3) Fraunhofer IAIS, St. Augustin, Germany
(4) TIB, Hannover, Germany
Semantics 2017 - Amsterdam - Nederland - September 13
2. Semantics 2017 - Amsterdam - Nederland - September 13 - Trying Not to Die Benchmarking... - Harsh Thakkar - University of Bonn
Outline
● Motivation
● Problem Statement
● State of the Art
● Approach - LITMUS Benchmark Suite
● Challenges
● Evaluation Plan
● Next Steps
3. Motivation
Ocean of Data + Sea of Tools
● Data: Real, Synthetic (LOD Cloud 2017)
● Tools: K-V stores, Graph stores, Doc-oriented stores, RDF stores (e.g. RDF-3X), Wide column stores
http://lod-cloud.net/versions/2017-02-20/lod.png
4. Motivation
• Domain specific applications: i.e. perspectives
• Choice Overload!
• Vendors
• Researchers
• Users
https://steemit.com/philosophy/@l0k1/subjectivity-and-truth-how-blockchains-model-consensus-building
5. Benchmarking
● Tedious!
● Needs domain-specific expertise
● Lack of standardization (single focus)
○ Open software, system configuration settings, etc.
● Near-zero reusability
● Guaranteeing a fair benchmark is difficult!
● Choosing the right performance metrics is cumbersome and subjective
● Visualising benchmark results
[6] http://2.bp.blogspot.com/-TkUb0TPN7IA/VewUHm_jVaI/AAAAAAAABgM/vZILnZNJv5A/s1600/2012-10-16-subjective-objective.jpg
6. Problem Statement
“How can diverse cross-domain data management solutions (DMSs) be benchmarked in an automated, established*, standard# environment?”
7. State of the Art
Benchmark Effort Relational DMSs RDF DMSs Graph DMSs
TPC [H,C,E,DS] [13]
XGDBench [6]
HPC [7]
Graph 500 [12]
DBPSB [11]
LUBM [9]
IGUANA [19]
WatDiv [1]
SP2Bench [20]
BSBM [4]
Pandora*
Graphium [8]
LDBC [2]
HOBBIT**
Single domain Benchmarks
Cross domain Benchmarks
*http://pandora.ldc.usb.ve/
**https://project-hobbit.eu/
8. LITMUS Benchmark Suite
9. The LITMUS architecture
[Figure: the LITMUS architecture]
● Data Facet (F1): Dataset 1, Dataset 2, Dataset 3, ..., Dataset N; Data integration module
● Query Facet (F2): Queryset 1, Queryset 2, Queryset 3, ..., Queryset M; Query conversion module
● System Facet (F3): System configuration & integration module; RDF stores, Graph stores, Relational DBs, Wide Column stores, Key value stores
● User Interface (F4): User
● Benchmarking Core: Controller & Tester, Profiler, Analyzer
Thakkar, Harsh. "Towards an Open Extensible Framework for Empirical Benchmarking of Data Management Solutions: LITMUS." ESWC, 2017.
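To make the role of the Benchmarking Core (Controller & Tester, Profiler, Analyzer) more concrete, here is a minimal sketch of a controller loop that runs every queryset against every system and records mean wall-clock time. It is not the LITMUS implementation: `run_benchmark`, the `QueryRunner` adapter type and the stub systems are hypothetical names introduced only for illustration.

```python
import statistics
import time
from typing import Callable, Dict, List

# Hypothetical adapter type: a function that runs one query string on one DMS.
QueryRunner = Callable[[str], object]


def run_benchmark(systems: Dict[str, QueryRunner],
                  querysets: Dict[str, List[str]],
                  repetitions: int = 3) -> Dict[str, Dict[str, float]]:
    """Run each queryset on each system; return mean seconds per query execution."""
    results: Dict[str, Dict[str, float]] = {}
    for system_name, run_query in systems.items():
        results[system_name] = {}
        for queryset_name, queries in querysets.items():
            timings = []
            for query in queries:
                for _ in range(repetitions):
                    start = time.perf_counter()
                    run_query(query)
                    timings.append(time.perf_counter() - start)
            # Assumes every queryset holds at least one query.
            results[system_name][queryset_name] = statistics.mean(timings)
    return results


# Stand-in "systems" for illustration only; real adapters would call a DMS.
demo_systems = {
    "rdf-store-stub": lambda q: len(q),
    "graph-store-stub": lambda q: q.upper(),
}
demo_querysets = {"queryset-1": ["q1", "q2"], "queryset-2": ["q3"]}
demo_results = run_benchmark(demo_systems, demo_querysets, repetitions=2)
```

A real controller would also need warm-up runs, cold versus warm cache handling and per-DMS configuration, all of which this sketch leaves out.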
10. Challenges
● Core challenges in developing such an open, extensible, FAIR framework?
○ C1 - Data Conversion
○ C2 - Query Translation
○ C3 - Key Performance Indicators (KPIs)
http://media.thinkadvisor.com/lifehealthpro/article/2015/02/24/challenge.jpg
11. C1 - Data Conversion
● Different data models
○ RDF Graph
○ Property Graph
● A fair benchmark requires converting data between these models
● Each DMS works best with its natively supported data model
● Lots of data (real and synthetic), as RDF graphs and property graphs
RQ1 - What are the methods to convert RDF into the Property Graph data model?
12. RDF Data Model
● RDF is a triple-based graph model, where:
○ Subject: URI, Blank node
○ Predicate: URI -> property
○ Object: URI, Literal, Blank node
URI = Uniform Resource Identifier, analogous to ISBN for books (IRIs*)
Literals = data values
Blank nodes = descriptions of entities that don’t need to be named
Representation:
@prefix ex: <http://example.org>
ex:Person ex:speaker ex:Event
ex:Person ex:name “Harsh”
ex:Person ex:place ex:Bonn
ex:Person ex:age “27”
ex:Event ex:name “Graph Day”
ex:Event ex:Year “2017”
Interpretation:
[Figure: example RDF graph. Nodes: ex:Person, ex:Event, ex:AMS, ex:Bonn, “Harsh”, “27”, “Semantics”, “2017”, “30”. Edge labels: ex:speaker, ex:name, ex:age, ex:place, ex:year, ex:stime]
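As a rough illustration of the triple model, the example triples from this slide can be held as plain Python tuples and indexed by subject. This is a sketch only: the `uri`/`literal` helpers are hypothetical, and the trailing slash in the `ex:` namespace is an assumption.

```python
from collections import defaultdict

EX = "http://example.org/"  # assumed namespace behind the ex: prefix


def uri(local_name):
    """Tag a term as a URI (hypothetical helper for this sketch)."""
    return ("uri", EX + local_name)


def literal(value):
    """Tag a term as a literal data value (hypothetical helper)."""
    return ("literal", value)


# The six example triples from the slide, as (subject, predicate, object).
triples = [
    (uri("Person"), uri("speaker"), uri("Event")),
    (uri("Person"), uri("name"), literal("Harsh")),
    (uri("Person"), uri("place"), uri("Bonn")),
    (uri("Person"), uri("age"), literal("27")),
    (uri("Event"), uri("name"), literal("Graph Day")),
    (uri("Event"), uri("Year"), literal("2017")),
]


def outgoing_edges(triples):
    """Index triples by subject: every triple is one labelled, directed edge."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


edges = outgoing_edges(triples)
person_edges = edges[uri("Person")]
event_edges = edges[uri("Event")]
```

Note that literals such as “Harsh” are graph nodes here as well, which is why every attribute costs one extra triple.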
13. RDF Graphs (RDFGs)
● Edge-labelled, directed multi-graphs (with entity URIs, blank nodes, literals)
● Going from information to knowledge using OWL (DLs) and ontologies (RDFS, RDFa, etc.)
● Bulky
○ Everything is a node-edge-node triple (edges don't have properties)
○ More relationships per node → more triples in total!
14. Property Graph Data Model
● Edge-labelled, directed, attributed multi-graph
● Vertices and edges both have properties
● Main components:
○ Vertices, edges (source, destination), properties (key-value pairs), labels (strings)
● Super neat (compact), super cute
● Easier to add weighted, reified edges
● Query Languages - Cypher, Gremlin, PGQL, etc.
[Figure: example property graph. Person vertex (Name: Harsh, Age: 27, Place: Bonn), Event vertex (Name: Semantics, Year: 2017, Place: AMS), connecting edge (Role: speaker, Time: 30)]
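The components listed on this slide can be sketched as a small in-memory structure. This is an illustration under assumptions, not any particular DMS's API: the class names are made up, and the edge label `participates_in` is hypothetical (the slide only gives the edge properties Role and Time).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Vertex:
    vertex_id: int
    label: str
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[int, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vertex_id, label, **properties):
        self.vertices[vertex_id] = Vertex(vertex_id, label, properties)

    def add_edge(self, source, target, label, **properties):
        # A list allows parallel edges between the same pair (multi-graph).
        self.edges.append(Edge(source, target, label, properties))


graph = PropertyGraph()
graph.add_vertex(1, "Person", Name="Harsh", Age=27, Place="Bonn")
graph.add_vertex(2, "Event", Name="Semantics", Year=2017, Place="AMS")
# "participates_in" is a hypothetical label; Role and Time come from the slide.
graph.add_edge(1, 2, "participates_in", Role="speaker", Time=30)
```

Two vertices and one edge carry what the RDF version needs a separate triple per attribute to express.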
15. Mapping RDF → PG
● Initial Results:
○ Intra-conversion of graph data models (mapping problem)
○ PoC implementation ready (see GitHub)
● Work in progress:
○ Conversion of properties, blank nodes, etc.
○ Using e.g. Reification, Singleton Property, Hypergraphs, etc.
○ Use case: DBpedia 2016-10 (mapping from .owl & data)
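One naive way to picture the mapping problem is sketched below: triples whose object is a literal become vertex properties, and triples whose object is a URI become edges. This is not the PoC implementation mentioned above; it ignores blank nodes, edge properties and reification, and the function name and the 4-tuple input format are assumptions made for the example.

```python
def rdf_to_property_graph(triples):
    """Naive mapping. triples: iterable of (subject, predicate, obj, obj_is_literal)."""
    vertices = {}  # vertex id -> {property key: [values]}
    edges = []     # (source id, edge label, target id)
    for subject, predicate, obj, obj_is_literal in triples:
        vertices.setdefault(subject, {})
        if obj_is_literal:
            # Literal object: store as a (possibly multi-valued) vertex property.
            vertices[subject].setdefault(predicate, []).append(obj)
        else:
            # URI object: make sure the target vertex exists, then add an edge.
            vertices.setdefault(obj, {})
            edges.append((subject, predicate, obj))
    return vertices, edges


example_triples = [
    ("ex:Person", "ex:speaker", "ex:Event", False),
    ("ex:Person", "ex:name", "Harsh", True),
    ("ex:Person", "ex:place", "ex:Bonn", False),
    ("ex:Person", "ex:age", "27", True),
    ("ex:Event", "ex:name", "Graph Day", True),
    ("ex:Event", "ex:Year", "2017", True),
]
pg_vertices, pg_edges = rdf_to_property_graph(example_triples)
```

On these six triples the sketch yields three vertices and two edges; handling properties on edges is where approaches such as reification or singleton properties come in.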
16. C2 - Query Translation
● Yes, we are linguistically diverse, and so are DMSs!
● Each with its own dialect:
○ SPARQL, Cypher, Gremlin, etc.
● RDF - SPARQL (W3C ‘08)
● Graph - ??
http://cdn2.wpbeginner.com/wp-content/uploads/2015/02/multilingual-wordpress.jpg
17. Gremlin Traversal Language
http://www.datastax.com/wp-content/uploads/2015/09/many-to-many-mapping.png
http://www.datastax.com/wp-content/uploads/2015/09/gtm-dataflow.png
Gremlin’s Multi-Graph Query Language (GQL) support
18. Contd…
Multi-DMS & platform support
https://tinkerpop.apache.org/images/oltp-and-olap.png
RQ2 - What are the semantics-preserving methods/approaches for translating SPARQL queries to a graph query language such as Gremlin?
19.
https://opinionessoftheworld.files.wordpress.com/2013/04/game-of-thrones-daenerys-dragon.jpg
Gremlinator
Me
20. SPARQL → Gremlin
● C2: Gremlinator - the SPARQL-Gremlin translator (addressing RQ2)
○ Formalizing Gremlin traversals in Graph algebra [DEXA ‘17]
○ A novel translation mechanism that maps SPARQL queries to Gremlin pattern matching traversals [Planned submission - EDBT’18]
○ Nested queries are still a challenge (e.g. UNION)
Talk@Graph Day 2017
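To give a feel for mapping SPARQL triple patterns to a Gremlin pattern matching traversal, the sketch below turns a basic graph pattern into a `match()` string. It is a simplified illustration, not Gremlinator's algorithm: the function names are hypothetical, and the `e:` (edge) and `v:` (property value) predicate prefixes are a convention assumed here to tell edges from properties.

```python
def triple_to_step(subject, predicate, obj):
    """Translate one triple pattern into one Gremlin match() step (as text)."""
    kind, _, name = predicate.partition(":")
    s = subject.lstrip("?")
    is_variable = obj.startswith("?")
    o = obj.lstrip("?")
    if kind == "e" and is_variable:
        # Edge pattern: walk the labelled edge and bind the target vertex.
        return f"__.as('{s}').out('{name}').as('{o}')"
    if kind == "v" and is_variable:
        # Property pattern with a variable: bind the property value.
        return f"__.as('{s}').values('{name}').as('{o}')"
    if kind == "v":
        # Property pattern with a constant: filter on the value.
        return f"__.as('{s}').has('{name}', '{o}')"
    raise ValueError(f"unsupported triple pattern: {subject} {predicate} {obj}")


def bgp_to_gremlin(patterns, select_vars):
    """Join the steps into g.V().match(...).select(...)."""
    steps = ", ".join(triple_to_step(*pattern) for pattern in patterns)
    projection = ", ".join(f"'{var}'" for var in select_vars)
    return f"g.V().match({steps}).select({projection})"


# Roughly: SELECT ?n ?e WHERE { ?p v:name ?n . ?p e:speaker ?e . ?e v:Year "2017" }
gremlin_query = bgp_to_gremlin(
    [("?p", "v:name", "?n"), ("?p", "e:speaker", "?e"), ("?e", "v:Year", "2017")],
    ["n", "e"],
)
```

The sketch covers only conjunctive patterns with variable subjects; constructs such as UNION, OPTIONAL and FILTER need more than a step-by-step rewrite.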
21. C3 - Metrics/KPIs
RQ3 - What are the strengths and the limitations of the existing KPIs, and to what extent do they reflect the performance of a DMS?
● Type of Data: RDF graph, Property graph; Real, Synthetic
● Type of Query: Linear, Star shaped, Snowflake
● Query response time; Precision, Recall; DMS Index size; DMS configuration
[11] https://www.tutorialspoint.com/computer_fundamentals/images/primary_memory.jpg
[12] http://s.hswstatic.com/gif/microprocessor-250x150.jpg
22. Selection of KPIs
● CPU and Memory specific metrics:
○ Perf-tool - LITMUS v0.1 (supported)
■ TLB, LLC, instructions, L1 cache, page faults, etc. (18 supported currently)
● Dataset specific metrics:
○ |V|, |E|, Eccentricity, Clustering coefficient, Centrality, etc. (in progress)
● Query specific metrics:
○ Type, Length, Response time, Precision, Recall, F1, etc. (planned)
● DMS specific:
○ Load time, index time, index size (supported)
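The query specific metrics can be sketched in a few lines: precision, recall and F1 compare the returned result set against an expected one, and response time is wall-clock time around a single query call. This is an illustrative sketch; the function names are hypothetical and it is not how LITMUS computes its KPIs.

```python
import time


def precision_recall_f1(returned, expected):
    """Compare a returned result set against the expected (gold) result set."""
    returned, expected = set(returned), set(expected)
    true_positives = len(returned & expected)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def timed(run_query, query):
    """Return (result, elapsed seconds) for one call of a query-running function."""
    start = time.perf_counter()
    result = run_query(query)
    return result, time.perf_counter() - start


# Toy example: 2 of the 3 returned answers are among the 4 expected ones.
precision, recall, f1 = precision_recall_f1({"a", "b", "c"}, {"a", "b", "d", "e"})
answer, elapsed = timed(lambda q: sorted(q), "cba")
```

Here precision is 2/3, recall is 1/2 and F1 is 4/7; a single timing like `elapsed` is noisy, so repeated runs are usually averaged.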
23. Back in the bigger picture
[Figure: the LITMUS architecture, annotated with the challenges C1, C2, C3]
24. The LITMUS Test*
PLOTS
FILES
*Please visit our Poster & Demo for Hands on experience & more details in the paper!
25. Evaluation
● RQs: Publications
● Framework: Continuous integration (v0.1 released, v0.2 planned Dec ‘17)
○ Reproducing third-party benchmarks
○ Gathering user and expert feedback
○ Going live @Industry:
■ Gremlinator - Apache TinkerPop
■ Further collaboration… Adoption by other projects - LDBC, HOBBIT! :-)
26. Next Steps
● Framework - LITMUS v0.2 launch (Dec ‘17 - planned)
● DMS module - Adding two more DMSs each
● Dataset module - RDF → PG (Dec ‘17)
● Query module - Integrating Gremlinator
● GUI: Aesthetic GUI (maybe?)
27. Acknowledgements
Funding: H2020 WDAqua ITN (GA: 642795)
Supervisors & Mentors:
● Prof. Dr. Sören Auer (TIB, DE)
● Prof. Dr. Jens Lehmann (UBO, DE)
● Prof. Dr. Maria-Esther Vidal (TIB, DE)
● Dr. Marko Rodriguez (DataStax & Apache, USA)
28. Resources
http://wdaqua.eu/
https://github.com/LITMUS-Benchmark-Suite/sparql-to-gremlin
Code : https://github.com/LITMUS-Benchmark-Suite/
Web : https://litmus-benchmark-suite.github.io
Docker : https://hub.docker.com/r/litmusbenchmarksuite/litmus/
LITMUS Benchmark Suite
29. THANK YOU !
Harsh Thakkar
University of Bonn
Twitter: @harsh9t
LinkedIn: thakkarharsh
E-mail: harsh9t@gmail.com
Questions? Comments?
Insults? Injuries?
30. EXTRA STUFF
31. Experiments*
Northwind dataset
● PG - Vertices: 3209, Edges: 6177
● RDF - Triples: 33033
BSBM 1M dataset
● PG - Vertices: 92737, Edges: 238309
● RDF - Triples: 1000313
CPU: Intel® Xeon® CPU E5-2660 v3 (20 cores @2.60GHz),
RAM: 128 GB DDR3, HDD: 512 GB SSD, OS: Linux 4.2-generic (x86_64)
OpenLink Virtuoso v7.2.4, Apache TinkerGraph-Gremlin v3.2.3
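The dataset sizes above allow a quick back-of-the-envelope comparison of the two data models: how many RDF triples there are per property graph element (vertex or edge). The sketch below only does arithmetic on the numbers from this slide; the ratio itself is an illustrative measure chosen here, not one of the KPIs reported by the authors.

```python
# Dataset sizes copied from the slide.
datasets = {
    "Northwind": {"vertices": 3209, "edges": 6177, "triples": 33033},
    "BSBM 1M": {"vertices": 92737, "edges": 238309, "triples": 1000313},
}


def triples_per_pg_element(stats):
    """RDF triples divided by the number of property graph vertices plus edges."""
    return stats["triples"] / (stats["vertices"] + stats["edges"])


ratios = {name: triples_per_pg_element(stats) for name, stats in datasets.items()}
```

For both datasets the ratio comes out at about 3 to 3.5 triples per vertex or edge, in line with the earlier remark that RDF graphs are bulky because attributes become separate triples.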