PXML
A SAX-compliant parallel XML parser
A dissertation for the degree of MSc in Software Engineering
By David Kabongo Tshiany
Supervised by
Dr. Niki Trigoni
May 2013
Kellogg College
University of Oxford
I dedicate this work to my dear wife Maguy,
and to our children Gaël, Ryan and Nathan,
for their patience.
Declaration
Except as acknowledged, attributed and referenced, I declare that this dissertation is my own
unaided work.
Abstract
XML is a standardized and widely adopted markup language designed for data exchange and storage. To use data from an XML document, applications typically need an XML parser. The parser is responsible for reading the XML file or stream and providing the XML data and structure to the application. Many programming APIs and frameworks for processing XML files exist; among the most used are DOM and SAX.
In the last few years, the trend in the computer industry has been to increase the number of processors (or cores within a processor) in computers rather than to increase the processing speed. This marks a fundamental change in software design and application development: whenever applicable, software engineers should design programs to explicitly exploit the multiple processing resources available.
This dissertation presents the design and implementation of a SAX-compliant XML parser. The goal is faster parsing, but instead of relying on increases in sequential processing speed, the parser achieves an overall speedup by using multiple threads to read the same XML document concurrently.
Contents
1 Introduction
1.1 Motivation
1.1.1 Computer design trend
1.1.2 The need for faster XML parsing
1.1.3 Parallel parsing at the rescue
1.2 Objectives
1.2.1 Throughput
1.2.2 Concurrency support
1.2.3 Scalability
1.3 Challenges
1.3.1 Synchronization and speedup
1.3.2 Programming languages support of concurrency
1.4 Organization of the thesis
2 Background
2.1 The extensible markup language (XML)
2.1.1 XML document representations
2.1.2 XML text and logical structure
2.1.3 XML tree concepts
2.1.4 Well-formedness and validation constraints
2.1.5 XML standard and compliance
2.1.6 Character's encoding, BOM and Unicode standards
2.2 XML processing
2.2.1 The Document Object Model (DOM)
2.2.2 A Simple API for XML processing (SAX)
2.2.3 SAX specification and language binding
2.3 Elements of program optimization
2.3.1 Computer organization and its evolution
2.3.2 Computer program performance goals
2.3.3 Branch optimization
2.3.4 Cache optimization
2.3.5 Principle of Locality and Common case
2.4 Elements of program concurrency
2.4.1 Threads and cores
2.4.2 Synchronization and concurrent objects
2.4.3 Lock-free and lock-based synchronization
2.4.4 Speedup and thread concurrency
2.5 Design patterns
2.5.1 Strategy pattern
2.5.2 Observer pattern
2.5.3 Active object pattern
2.5.4 Monitor pattern
2.5.5 Thread pool pattern
2.5.6 Thread-local storage pattern
2.6 Putting it all together
3 Design and implementation
3.1 Introduction
3.2 Fundamental concept
3.2.1 Scanner types and chunk allocation
3.2.2 Parsing properties and parsing modes
3.2.3 From bytes to SAX events
3.3 Class relationship and interaction
3.3.1 SAX classes
3.3.2 PXML classes
3.3.3 Concurrency classes
3.3.4 Class interaction and scanning loops
3.4 Implementation of SAX classes
3.4.1 Characters, String and C++ binding
3.4.2 XMLReaderImpl class
3.5 Implementation of PXML classes
3.5.1 XmlTranscoder class
3.5.2 TranscoderUtf8 class
3.5.3 XmlScanner class
3.5.4 ChunkingScanner and ParsingScanner classes
3.6 Implementation of concurrency classes
3.6.1 ChunkContext class
3.6.2 ThreadSafeQueue class
3.6.3 ThreadPool class
3.7 PXmlCount test program
4 Evaluation
4.1 Evaluation objectives
4.1.1 Speedup and elapsed time
4.1.2 Performance optimization metrics
4.2 Measurement collection
4.2.1 Profiler and test program
4.2.2 Accuracy and elapsed time
4.2.3 Test files
4.2.4 Test platforms
4.3 Measurement results
4.3.1 Observing parsing speed improvement with the PXmlCount program
4.3.2 Observing parsing speed improvement with the Intel VTune Amplifier
4.4 Speedup evaluation
4.4.1 EnhancementFraction
4.4.2 ImprovementRatio
4.4.3 Speedup
4.5 Hotspots and bottlenecks location
4.5.1 Hotspots
4.5.2 Synchronization bottleneck
4.5.3 Memory and processor bottlenecks
4.6 Effects of parsing properties on performance
4.6.1 Pool configuration
4.6.2 Chunking depth
4.6.3 Siblings per chunk
5 Reflection and conclusions
5.1 PXML library integration to XML projects
5.2 Lessons learned
5.3 Conclusion
5.4 Further research directions
5.4.1 Dynamic reconfiguration of parsing properties
5.4.2 Parsing based on XML schema
5.4.3 Lock-free synchronization
6 Bibliography
7 Appendices
7.1 PXmlCount program
7.1.1 PXmlSpinLock.hpp
7.1.2 PXmlSpinLock.cpp
7.1.3 PXmlCountHandler.hpp
7.1.4 PXmlCountHandler.cpp (Part I)
7.1.5 PXmlCountHandler.cpp (Part II)
7.1.6 PXmlCount.cpp (Part I)
7.1.7 PXmlCount.cpp (Part II)
7.2 PXML Character and String
7.2.1 XmlChar class
7.2.2 XmlBuffer class
7.3 XmlScanner states enumeration
7.4 ThreadJoiner and ChunkTask classes
7.5 Other XMLReaderImpl methods
1 Introduction
This chapter discusses the motivation behind this work; it presents the thesis objectives and explains the challenges around them. Finally, it describes the organization of the remaining chapters.
1.1 Motivation
1.1.1 Computer design trend
Traditionally, computer performance gains depended mostly on increases in CPU clock speed, execution optimization and refinements in memory organization [1 p. 665]. A faster clock speed, new CPU optimization techniques or a better memory model meant a 'de facto' performance increase for a computer system and all its programs.
Today that free benefit is over¹; for a decade now, a fundamental change in the computer industry has pushed processor designers towards a new approach to improving computer performance: placing multiple processors on the same chip [2 p. 344].
One of the reasons for this turnaround was that designers could not increase the processor clock speed further due to physical limitations: principally the high density of micro-components preventing power dissipation, and the interconnect wires causing RC delay (the resistance R increases because the wires shrink in size, and the capacitance C increases because they get closer to each other) [2 p. 19].
Although recent research suggests that these limitations could be overcome by better alternatives to silicon, such as graphene [3], there is an ultimate limit to the processor clock speed: the speed of light [4].
The evolution of the memory hierarchy and caching techniques has led to placing part of the computer memory right next to the processor [1 p. 674]. The CPU and the memory now influence each other so much that optimizing the performance of one of them in isolation has become impractical; typically, the memory speed lags behind the processor speed. This phenomenon, called the memory wall, is another reason that convinced designers to adopt multi-core processors [5].
As predicted by Moore's Law, the number of transistors on a single chip continued to grow exponentially while their price decreased, leading to more complex but cheaper CPUs. Multi-core organization quickly became the only way to build better-performing computers, and parallelism therefore became the most cost-effective way to achieve better program throughput.
¹ This refers to the widely cited essay "The free lunch is over" by Herb Sutter, who was among the first to describe the change needed in the software world to exploit parallel hardware [61].
This change has brought a number of improvements but has also introduced new kinds of complexity into program design. The industry has provided a solution for performance increase, but actually obtaining that performance is now the programmer's burden: the programmer has to target the multiple cores explicitly in order to take advantage of them.
That has turned out to be a difficult exercise.
Multiple problems arise, most of them related to the learning curve introduced by these new concepts. The proliferation of concurrency counterparts of traditional design patterns [6][7] is one sign that programmers need help to achieve better-performing concurrent constructs.
Today, more than ever, awareness of multi-core computer architecture and of concurrency theory and practice is crucial for software engineers, system designers and application programmers.
This thesis illustrates this fact. The early chapters immerse the reader in the complex and still-evolving area of multi-threading, alongside a review of recent advances in processor and memory optimization.
1.1.2 The need for faster XML parsing
XML is a markup language widely used today to store and exchange business-critical information. One of the reasons behind its wide adoption is its simplicity: the language is self-describing and is both human- and machine-readable.
XML is quite verbose, mostly in order to remain human-readable. For instance, in the XML document extract below, the number of characters used for the tags (see § 2.1.2), which give contextual meaning to the "content", exceeds the number of characters in the "content" itself.
                       <tag>content</tag>          
In a large XML document, the markup represents a significant part of the overall size and introduces a substantial burden when that document is processed by machines.
Web Services and SOAP are examples of applications and protocols that fail to reach a satisfactory level of their most important performance requirement, the response time, because of their use of XML [8] [9] [10]. When large XML data is transferred using web services, an XML parser needs to process the data on the client side. The time required to process the data affects the customer's experience of the service as soon as the data reaches a certain size.
In this regard, JSON is a data interchange format repeatedly cited as a better alternative to XML [11] [12]. The equivalent JSON representation of the above XML extract contains fewer characters, and at large scale the difference becomes noticeable.
Another overhead introduced in XML processing is the validation step. An XML document that requires conformance with a defined schema needs to be validated, on top of being well-formed (see § 2.1.4). Validation constitutes a considerable burden for XML processing; many XML parser libraries, especially those claiming to be 'fast' (see Table 1-1, validation support), simply do not include a validation feature.
Generalizing the traditional relational database model to XML is widely considered today [13 p. 240], and XML databases and XML-related database technologies are gaining wide adoption (XQuery, Oracle XML DB [14]). However, they all have efficiency concerns directly or indirectly linked to the overhead introduced by XML processing. A long-standing objection to using XML in database technologies has been the high cost of processing it [15].
1.1.3 Parallel parsing at the rescue
Many researchers have considered increasing XML parsing speed through concurrency, and many have identified the arrival of multi-core processors as an opportunity to achieve a spectacular improvement in XML processing [16], [17], [18], [19].
The authors of [19], among others, already suggested concurrently parsing pre-divided (pre-parsed) chunks of an XML document, in a divide-and-conquer fashion, in order to increase the parsing speed. They focused on obtaining a DOM-like skeleton of the full document (see § 2.2.1 for DOM) in a pre-parsing step, before processing the produced chunks in parallel; in this dissertation the pre-parsing takes place in parallel too.
Surprisingly, to the best of our knowledge, none of the suggested techniques has affected the world of XML processing; among the most widely used parser libraries today, none has apparently adopted any of the above-referenced technologies or methods.
Today's parser libraries remain inherently single-threaded, with no direct support for concurrency. Some of them (see Table 1-1, concurrency support) offer limited support for adding concurrency, but leave the entire burden to the programmer.
Table 1-1 below lists some of the popular XML parser libraries, their compliance with the XML, SAX and DOM specifications, and their support for concurrency and validation.
XML parser library | Language  | Style and features       | XML       | SAX | DOM           | Validation support | Concurrency support
Xerces [20]        | Java, C++ | The most compliant       | 1.0 & 1.1 | 2.0 | Up to Level 3 | Yes                | No direct [21]
Libxml2 [22]       | C, C++    | Partial SAX and DOM      | 1.0       | No  | No            | Yes                | No direct [23]
RapidXML [24]      | C++       | DOM-like, fast (in situ) | Partial   | No  | No            | No                 | No
Expat [25]         | C         | SAX-like, popular        | 1.0       | No  | No            | No                 | No
TinyXML [26]       | C++       | DOM-like, small size     | Partial   | No  | No            | No                 | No
Table 1-1 Popular XML parser libraries (the XML, SAX and DOM columns indicate compliance with the respective specifications)
1.2 Objectives
The objective of this work is the design and implementation of a SAX-compliant XML parser library, PXML, which uses parallel programming techniques to increase processing speed. The library aims to let its users take advantage of the multiple processors available when parsing XML documents.
The developed algorithm consists of cutting an XML document into chunks and parsing the chunks concurrently, with the cutting (or 'chunking') and the parsing themselves occurring in parallel. During the parsing, SAX events become available to the library user concurrently, and the user may need appropriate synchronization techniques to consume those events, as sketched below.
The main aims are to improve throughput, provide concurrency support and offer scalability.
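The sketch below shows how a user of such a parser might synchronize event consumption: a handler that counts elements while SAX callbacks may arrive from several threads. It is illustrative only, with assumed callback names, and is not the PXmlCount code listed in the appendices.

// Illustrative sketch of synchronized event consumption; the callback name and
// class are assumed for the example and are not the PXML appendix code.
#include <mutex>
#include <string>

class CountingHandler {
public:
    void startElement(const std::string& /*name*/) {
        std::lock_guard<std::mutex> lock(mutex_);   // serialize concurrent callbacks
        ++elementCount_;
    }
    unsigned long elementCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return elementCount_;
    }
private:
    std::mutex mutex_;
    unsigned long elementCount_ = 0;
};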
1.2.1 Throughput
Computer architecture has improved very much lately; it is not possible to perform advanced program optimization without knowledge of concepts such as branching or caching and of their impact on a program's performance.
Ignoring them can drastically decrease the performance of an application, without debugging or traditional troubleshooting being of any help in finding the cause. Conversely, mastering them can bring spectacular improvements to programs.
This work achieved faster parsing of XML documents thanks to knowledge of computer architecture and organization, primarily of the factors influencing program performance.
1.2.2 Concurrency support
The PXML library complies with the SAX specification for parsing XML documents; it adds concurrency support as a set of properties (as suggested by the SAX standard [27]) on top of the specification.
The library gives the programmer the opportunity to choose, among the three modes of operation below, the one that best fits the application domain:
1. Single-threaded
2. Multi-threaded manual (the user explicitly sets the number of threads to use)
3. Multi-threaded automatic (the PXML library chooses how many threads to use according to the concurrency capability of the platform).
The library abstracts the hard concepts of concurrency internally but still gives the programmer the opportunity to tune the parser's behaviour at will. The library is easy to reason about; it combines concurrency concepts and XML processing in a natural way.
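As an illustration of how such a mode could be selected through a SAX-style parsing property, the sketch below uses a hypothetical property name and setProperty() signature; the actual PXML parsing properties are defined in § 3.2.2.

// Minimal sketch; the property name and setProperty() signature are hypothetical.
#include <string>
#include <thread>

// Stand-in for the library's XMLReader interface, for illustration only.
struct XMLReader {
    virtual void setProperty(const std::string& name, int value) = 0;
    virtual ~XMLReader() = default;
};

// Choose between the three modes of operation described above.
void configureThreads(XMLReader& reader, int requestedThreads) {
    if (requestedThreads == 1) {
        reader.setProperty("pxml/thread-count", 1);                 // 1. single-threaded
    } else if (requestedThreads > 1) {
        reader.setProperty("pxml/thread-count", requestedThreads);  // 2. multi-threaded manual
    } else {
        // 3. multi-threaded automatic: use the concurrency the platform reports
        int cores = static_cast<int>(std::thread::hardware_concurrency());
        reader.setProperty("pxml/thread-count", cores > 0 ? cores : 1);
    }
}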
1.2.3 Scalability
The number of cores ranges from two to eight in today's personal computers and up to 32 in servers, and this number is predicted to increase. A program designed for 4-core processors may need a redesign when run on a 16-core computer, or when 16-core processors become the standard in personal computers.
The designed parser aims to be scalable, meaning it can increase its parsing speed seamlessly with the number of cores available on the platform where it is running, without the programmer writing any additional code.
The PXML parser library achieves that capability through the multi-threaded automatic mode: for the same parsing properties, the parsing improvement is higher on a computer with more CPU cores.
1.3 Challenges
There are a number of challenges in this project, the primary ones being the difficulty of applying concurrency correctly without destabilizing the system, and the question of the real benefit obtained when it is applied successfully.
In addition, the fact that concurrency support is not a trivial matter in most programming languages does not make the task any easier.
1.3.1 Synchronization and speedup
A concurrent program is a program made up of several entities that cooperate towards a common goal [28 p. vi]. In doing so they have to access shared resources on the computer, and in order to keep these resources in a consistent state, the entities have to synchronize their access to them. Many agree that this is hard to achieve; M. Herlihy and N. Shavit refer to exploiting parallelism as one of the outstanding challenges of modern computer science [29 p. 1].
The expected performance increase is in most cases a faster program execution or an increased throughput². However, the improvement brought by concurrency is a potential rather than a guaranteed benefit, because the overall performance increase (how much faster the program runs) depends more on the way concurrency has been applied than simply on the number of additional processors available.
Because both applying concurrency and benefiting from it are difficult, this thesis considers synchronization and speedup to be the two fundamental concepts to master when adopting concurrency.
Synchronization (see § 2.4.2) is what one does to ensure that program objects remain in a consistent state. However, it has the disadvantage of hindering the overall throughput of the application, due to the relatively high cost of implementing it.
Speedup (see § 2.4.4) is a measure of the overall improvement brought to the program by concurrency. Whether or not the program was successfully made parallel, a speedup equal to 1 means there was no improvement.
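For reference, the usual textbook definitions (standard formulations, not quoted from the thesis) are

\text{Speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}},
\qquad
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{r}}

where, in Amdahl's law on the right, f is the fraction of the execution that benefits from the enhancement and r is the improvement ratio of that fraction; these resemble the EnhancementFraction and ImprovementRatio metrics used in chapter 4.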
1.3.2 Programming languages support of concurrency
Another challenge, common to all those trying to dig into concurrency, and especially to programmers, is that concurrency support in the major programming languages was slow to become effective; it is still in perpetual evolution today.
Java added concurrency utilities (with the java.util.concurrent and other packages) only in Java 5 [30]; the recent Java 8 introduced further concurrency support with libraries for parallel operations and concurrent accumulators [31]. Similarly, the Microsoft .Net framework (with the C# programming language) introduced concurrency support only in its version 4.
² Another goal of applying concurrency lies in the domain of "separation of concerns", where each core or thread is dedicated to a specific task not tightly related to the others, for instance in GUI programming. This aspect of concurrency is not discussed in this dissertation.
The C++ programming language ignored even the existence of threads and atomic operations until the latest edition of the C++ standard [32], where concurrency support was added at both the language level and the library level; further features are expected to be incorporated in the coming version [33].
Today there is comprehensive support for concurrency in the major programming languages, but the learning curve remains steep and adoption is still slow. Some important languages such as JavaScript do not contain any threading mechanism at the core level, and limited support has only started to appear with 'web workers' [34 p. 322].
Some languages have built-in support for concurrency. One of the most widely used is the Erlang programming language, which uses a message-passing concurrency model and claims to be easier to reason about and more robust in its implementation [35 pp. 1-14]. Unfortunately, this concurrency model does not fit lower-level tasks like processing XML documents.
Java has traditionally been the language of choice for XML and the SAX specification, but C++ is a lower-level language well suited to processing files and streams, and it provides finer-grained control of memory. C++ is the development language of the major web browsers' rendering engines, which also process XML (Gecko, Blink, Trident and WebKit). Table 1-1 shows that C++ is the favoured language for XML parser libraries.
With the recently added library support for concurrency, C++ offers most of the concurrency concepts essential to the realization of this thesis, within both the language and the standard library, and was therefore the chosen language for this project.
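As an illustration of that standard-library support, the generic C++11 example below exercises threads, a mutex and an atomic counter; it is not code from the PXML library.

// Generic C++11 example of the standard concurrency facilities; not PXML code.
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter{0};      // lock-free atomic operations
    std::mutex io_mutex;              // lock-based synchronization

    std::vector<std::thread> workers; // library-level threads
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&counter, &io_mutex, i] {
            counter.fetch_add(1, std::memory_order_relaxed);
            std::lock_guard<std::mutex> lock(io_mutex);
            std::cout << "worker " << i << " done\n";
        });
    }
    for (auto& t : workers) t.join();
    std::cout << "total: " << counter.load() << '\n';
}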
1.4 Organization of the thesis
The organization of the remaining chapters is as follows:
Chapter 2 (Background) begins with an introduction to XML and SAX concepts. For XML, it discusses the textual and tree representations of an XML document, the XML specifications, XML validation, XML characters and Unicode support. For SAX, it presents the API, primarily in comparison to the DOM model, then focuses on the SAX specification and language binding.
The chapter continues with a description of selected topics in program optimization and in concurrency, discussing principally the concepts, methods and techniques relevant to this work. Finally, it briefly describes some design patterns used in the design and implementation of the proposed parser.
Chapter 3 (Design and implementation) introduces the concept of chunking and parsing scanners, the basis of the proposed PXML parser library. It then discusses in detail the design and implementation of the following parts of the parser:
- The SAX classes, which ensure conformance to the SAX specification
- The PXML classes, building blocks of the central library concepts
- The XMLReader implementation, which contains the parsing algorithm
- The concurrency classes, essential constructs of the concurrency support
Chapter 4 (Evaluation) presents measurements of the primary aspects of the PXML parser's performance, essentially the speedup, together with the metrics used to quantify performance improvement. The chapter discusses hotspots, thread concurrency, synchronization, CPU usage and memory issues. It ends with a review of the parsing properties and their effect on the parser's performance, supported by measurement results.
Chapter 5 (Reflection and conclusions) discusses some aspects of the realization of this dissertation, such as the integration of the proposed library into XML projects and the application of principles and algorithms from this thesis in a broader context. It also concludes the thesis.
Bibliography and references are in chapter 6.
The appendices are in chapter 7. The first section presents the PXmlCount program, a complete program based on the PXML library and used in this work as a test program to count the number of elements and characters in an XML document.
2 Background
2.1 The extensible markup language (XML)
XML (eXtensible Markup Language) is a framework for defining markup languages. It is a vast subject, and a complete introduction is impractical in the context of this work. A full introduction to XML is available in [36], while [13] provides a more concise description of XML and related technologies.
The XML standard [37] provides the complete XML specification and compliance requirements, but it is difficult to assimilate; for a comprehensive understanding of XML concepts, this work recommends an XML course such as the one given in the Software Engineering Programme [38] of the University of Oxford.
There are thousands of technologies and applications related to XML. This thesis mentions exclusively those related to XML document definition (DTD, XML Schema) and those related to XML processing (XPath, XSLT, XQuery), discussing only the concepts essential to the understanding of this thesis.
2.1.1 XML document representations
An XML document, in its textual representation (subsequently referred to in this dissertation as XML text), is made of a sequence of balanced and properly nested markup and text fragments. Conceptually, however, an XML document is equivalent to a hierarchical tree structure called the XML tree.
Listing 2-1 shows an XML text. It is a modified version of the TourAgency.xml document found in the XML module of the SEP in Oxford [38 p. Exercises]. The XML text includes annotations (the notes at the right of key lines) identifying key markups.
Figure 2-1 is another representation of the same XML text, in its conceptual form, as an XML tree.
2.1.2 XML text and logical structure
In the textual representation of the XML document, one can readily identify tags,
constituted of a name between an opening (<) and closing (>) bracket. Such tags
typically come in two flavours, a start-tag such as <hotel> and its corresponding end-
tag, that differs from the start-tag by the presence of a slash just after the opening
bracket, such as </hotel>.
An XML element includes the start-tag, its matching end-tag, and content between the
two. A particular type of element is the empty-element-tag such as <flat/>.
<?xml version="1.0" encoding="UTF‐8"?>                 XML declaration
<!DOCTYPE MyTourAgency SYSTEM "MyTourAgency.dtd">      Document type declaration 
 
<MyTourAgency>                                          Root element  
      
    <rating stars="2">                                  Element (rating) 
        <pool>true</pool>       
        <room_service>true</room_service>   
    </rating> 
 
    <rating stars="3">      
        <pool>true</pool>       
        <sauna>true</sauna>         
    </rating>  
    
    <country name="Bulgaria">             
        <resort name="Borovet">               
            <hotel name="Rila">500</hotel>        
            <flat>200</flat>                       
            <lowSeasonRent>         
                <nbrDay>6</nbrDay>                      
                <banner>The whole week!</banner>            
            </lowSeasonRent>      
        </resort> 
    </country> 
 
    <country name="Andorra">                   
        <resort name="Pas De La Casa">                   Start‐tag (resort) 
            <hotel name="Bovit">300</hotel>     
            <flat/>                                      Empty‐element‐tag (flat) 
        </resort>                                        End‐tag (resort) 
        <resort name="Soldeu / El tartar"> 
            <info><![CDATA[Best restaurant]]></info>     CDATA section 
             <info>Serial id is                           Character Data 
                               3163&lt;6475</info>       Entity reference 
            <hotel rate="2">500</hotel> 
            <flat>200</flat>   
        </resort> 
    </country>  
                                                         Ignorable white space 
    <!‐‐ countries and rating ‐‐>                        Comment 
    <?php printf("starting with ratings")?>              Processing instruction  
       
</MyTourAgency> 
Listing 2-1 Textual representation of an XML document (with annotations)
Elements may have simple name/value pairs associated with them, called attributes. These usually identify or give more information about the element.
An XML production specifies a sequence of markups or other productions upon which substitution can be recursively performed to generate new markup sequences³.
The entire XML text is just a production called document, defined in the XML standard as:
document ::= prolog element Misc*
³ The term 'production' comes from the production rules used in grammar generation, such as context-free grammars.
The prolog is the first production of an XML document. It contains the XML declaration and the document type declaration productions (see the annotations in Listing 2-1). The XML declaration specifies the XML version to which the document conforms (see § 2.1.5) and the character encoding being used (see § 2.1.6). The document type declaration belongs to the built-in schema language (see § 2.1.4).
Anything between an element's start-tag and end-tag is its content. The content of an element consists of intermingled character data (CharData) and any of the element, processing instruction (PI), comment, entity reference, character reference or CDATA section (CDSect) productions. The XML standard defines the content as
content ::= CharData? ((element | EntityRef | CharRef |
                        CDSect | PI | Comment) CharData?)*
CDSect (also referred to as a CDATA section) is used to escape blocks of text between the string '<![CDATA[' and the string ']]>'; anything in between is pure text and must not be processed as markup by the parser.
An entity reference allows the representation of characters that have a special meaning in XML, using an escape sequence that prevents the parser from interpreting them as markup. For instance, to represent the opening bracket '<' within content, one places the entity name for this character between the characters '&' and ';': the sequence '&lt;' is then interpreted by the parser as a single '<' character.
A character reference allows the representation of a character by its code point, written in decimal between '&#' and ';' or in hexadecimal between '&#x' and ';'. For instance, the sequence '&#60;' is replaced by '<' within XML content.
Comments and processing instructions are not part of the XML data structure but are intended respectively for the human reader and for the XML processor. They are XML productions that can themselves be part of the Misc production (see Listing 2-1).
2.1.3 XML tree concepts
In the tree representation, a node is the counterpart of the element in the XML text, and
the root node corresponds to the first element in the textual representation, the root
element.
Tree theory defines a path as a sequence of nodes connected by edges [39 p. 10]. Note that path here means the shortest path, i.e. a path that does not repeat nodes.
The depth of a node is the number of edges on its path to the root node; it is its 'distance' to the root, making the root element of depth 0. In the XML text (Listing 2-1), the depth of an element roughly corresponds to its indentation level; in the XML tree (Figure 2-1), all nodes with the same depth are grouped between horizontal dotted lines.
[Figure: the XML tree corresponding to Listing 2-1, with the root node MyTourAgency at depth 0, its rating and country children, and descendant nodes grouped by depth down to depth 4; annotations mark the root node, siblings, leaves, an edge, a path, and the descendants of 'Andorra'.]
Figure 2-1 Tree representation of XML document
Two nodes are referred to as parent and child if the path between them does not contain another node, the parent being the node closest to the root. The child's depth is always the parent's depth plus one.
A group of nodes are siblings when they have the same depth and a common parent. Sibling nodes that are side by side in the tree refer to each other as preceding-sibling (left-hand) and following-sibling (right-hand) [13 p. 62].
The descendants of a node are the set of nodes that have this node on their paths to
the root node. The ancestors of a node are the nodes found in the path from that node
to the root node.
2.1.4 Well-formedness and validation constraints
The World Wide Web Consortium defined the XML standard in terms of productions and constraints. The specification describes two lists of constraints: well-formedness constraints (WFC) and validation constraints (VC).
A text document qualifies as a well-formed XML document (or simply an XML document) when its textual representation matches the document production (see § 2.1.2) and it satisfies all the well-formedness constraints.
An XML document is a valid XML document if it satisfies all the validation constraints (with respect to a given schema) in addition to being well-formed.
An XML language is a particular family of XML documents complying with additional syntactic and semantic rules. A schema is a formal definition of the syntax and semantics of such an XML language, and a schema language is a formal language for expressing schemas [13 p. 92]. Validating an XML document is equivalent to establishing its conformance to the syntax and semantics expressed by a schema. The most popular schema languages are DTD and XML Schema.
Document Type Definition (DTD) is XML's built-in schema language; its definition is part of the XML specification. The document type declaration, which optionally specifies a type and the location of a document containing further rules for validating an XML document, is part of the DTD language.
XML Schema is another popular schema language, considered much more elaborate than DTD. A number of limitations have been identified in DTD [13 p. 112], and XML Schema has been specifically designed to overcome them and bring further improvements. An XML Schema document is itself an XML document and thus has the advantage of being self-describing.
XML Schema and DTD encouraged the creation of validating parsers, but the validation step introduces a non-negligible overhead to the overall processing of an XML document. That overhead is the reason many XML parsers do not incorporate a validation feature.
2.1.5 XML standard and compliance
The W3C standard for XML is today at version 1.1, but the previous version 1.0 is still widely used and even recommended for general use. The main additions in version 1.1 concern compliance with later versions of the Unicode standard [40].
Many XML libraries limit their compliance to version 1.0, as it is enough for most applications. The proposed PXML parser also conforms to version 1.0 of the XML specification, excluding the validation constraints.
2.1.6 Character's encoding, BOM and Unicode standards
Unicode [41] is an international standard for the representation of characters, text and symbols. The standard assigns a unique integer value to each character so that its use is unambiguous across languages. The binary representation of a character's Unicode number is its encoding. The encoding of an XML text is the Unicode encoding form to use when converting it from binary to textual form.
The XML specification does not allow all existing Unicode characters within XML documents; it defines a set of valid XML characters and excludes the remaining ones, as their presence would affect the document's interpretation. Inside an XML text, only subsets of the valid XML characters may appear within specific markup. For instance, the character data production allows a larger set of characters than the element production.
UTF-8 and UTF-16 [42] are the most popular Unicode encoding forms. UTF-8 is the most widely used on the web, among other reasons because of its compatibility with the old ASCII encoding; UTF-16 is widely adopted by programming languages (Java) and operating systems (Windows).
The first step in parsing an XML file is to transform the sequence of bytes forming the document data into a sequence of characters in the specified encoding. Because the byte ordering (or endianness) of a file or stream differs from one machine type to another, XML parsers must take it into account in order not to produce incorrect output. Some files or streams start with a byte order mark (BOM) as their first character, to help XML and text processors use the right endianness when reading them.
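For example, a transcoder can inspect the first bytes of the input for a BOM before decoding; the following is a minimal, stand-alone sketch (the enumeration and function names are illustrative, not the PXML classes described in § 3.5).

// Minimal BOM-detection sketch; names are illustrative only.
#include <cstddef>

enum class Encoding { Unknown, Utf8, Utf16BE, Utf16LE };

Encoding detectBom(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding::Utf8;        // UTF-8 BOM
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Encoding::Utf16BE;     // UTF-16 big-endian BOM
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Encoding::Utf16LE;     // UTF-16 little-endian BOM
    return Encoding::Unknown;         // no BOM: fall back to the XML declaration
}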
2.2 XML processing
XML processing languages allow the extraction of XML content and structure. XML languages such as XPath, XSLT or XQuery are the most popular for processing XML documents.
Although these processing languages fulfil most needs, some processing may require a particular way of parsing; XML programming APIs let programmers define their own processing of XML documents.
DOM and SAX are two of the most widely used such APIs. They are intrinsically different in the way they process XML documents and are the prime representatives of the two principal models of XML programming.
2.2.1 The Document Object Model (DOM)
DOM is a language-independent API for XML (and HTML) documents defined by the W3C and freely available [43]. It defines the logical structure of XML documents and allows programs to read, navigate, modify and create them.
When parsing an XML document, a DOM parser first reads the full document to build an in-memory representation of it, on which it performs all subsequent operations.
The DOM representation usually mirrors the XML document itself and has the form of a tree. The API defines a set of interfaces, procedures and methods to navigate over the document's elements or to create and modify them.
The number one problem with DOM is the need to read the full document before any further processing, because this incurs an extra delay and consumes system memory. The DOM processing model therefore does not fit some needs, such as when parsing speed is crucial or when parsing large documents.
On the other hand, DOM is very stable: once the DOM processor has loaded the XML document into memory, the programmer can freely go back and forth throughout the document, which is not possible in SAX-based processing.
2.2.2 A Simple API for XML processing (SAX)
There are circumstances where it is not necessary to build the full structure of the XML
document in advance before processing it; the document can be processed while being
read.
SAX is an API for processing XML documents that provides an alternative to the DOM mechanism [44]. It is an event-based API (also referred to as a "push parser"), which operates by reading the XML file or stream and triggering SAX events for each XML construct it recognizes as part of the SAX specification. It is a serial access mechanism that processes each markup sequentially and only once.
An important consequence of SAX parsing is that it is stateless: once it has processed a markup, the parser may discard any information about it before proceeding to the next one.
This is both an advantage and a disadvantage for SAX users. It is a disadvantage because the extra burden of saving state information falls on the user, with the possible introduction of errors into the processing. It is an advantage because the user is free to focus only on the parts of the XML document of interest, avoiding reading the full document into memory.
        <resort name="Pas De La Casa">            startElement(resort) 
            <flat>                                startElement(flat) 
                 300                              characters(“300”) 
            </flat>                               endElement(flat) 
        </resort>                                 endElement(resort)  
Listing 2-2 XML portion and corresponding SAX events
Given the XML document portion in Listing 2-2, a SAX parser reading it generates the five indicated events (see annotations) corresponding to the XML markup it has recognized.
SAX usually suits XML processing that focuses on information retrieval; when it comes to manipulating the document structure, a DOM parser is often more appropriate.
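As a concrete illustration, the sketch below shows a content handler receiving the events of Listing 2-2. The callback signatures are simplified and assumed for the purpose of the example; the actual C++ binding used by PXML is described in § 3.4.

// Simplified sketch of a SAX-style content handler; signatures are assumed,
// not those of the PXML binding described in § 3.4.
#include <iostream>
#include <string>

class PrintingHandler {
public:
    void startElement(const std::string& name) { std::cout << "startElement(" << name << ")\n"; }
    void characters(const std::string& text)   { std::cout << "characters(\"" << text << "\")\n"; }
    void endElement(const std::string& name)   { std::cout << "endElement(" << name << ")\n"; }
};

int main() {
    // The calls a SAX parser would make while reading the <resort> fragment of Listing 2-2.
    PrintingHandler handler;
    handler.startElement("resort");
    handler.startElement("flat");
    handler.characters("300");
    handler.endElement("flat");
    handler.endElement("resort");
}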
2.2.3 SAX specification and language binding
Unlike DOM, SAX does not come from the W3C; it was developed on the XML-Dev mailing list with the participation of many contributors [45]. Because the first implementation of SAX used the Java programming language, and no formal specification has existed since then, the Java implementation of the SAX API is the 'de facto' standard [46].
The SAX specification, currently at version 2.0, is an ensemble of Java classes and interfaces that implementations need to extend to make a SAX parser, or that library users need to implement to use the parser. These classes and interfaces fall into two important groups:
- Parser designer interfaces: XMLReader and Attributes
- Parser user interfaces, or handlers: ContentHandler, ErrorHandler, DTDHandler and EntityResolver
There are other classes that this thesis does not discuss. Among them are SAXException and SAXParseException for exception handling, InputSource for processing XML document streams, and the LexicalHandler class, part of the SAX 2 Extensions [47], which provides lexical information about an XML document, such as comments and CDATA section boundaries.
An implementation of a SAX parser is required to extend the parser designer interfaces (XMLReader and Attributes) and to leave the implementation of the handlers to library users. The XMLReader has methods for parsing and for setting features, properties and handlers. The proposed PXML parser provides concurrency support as implementation-specific parsing properties (see § 3.2.2).
Users of a SAX library have to implement the handlers' callback functions in order to use the parser.
Because programming languages differ in semantics, an implementation of SAX in a language other than Java may provide its own language binding, that is, its equivalent of the SAX Java classes and interfaces.
The case of Java and C++ is notable, principally because of a fundamental difference in their memory management styles (garbage collection for Java; RAII and smart pointers for C++). The C++ implementation therefore focuses principally on providing an efficient mechanism for string creation, destruction and manipulation (see § 3.4.1).
2.3 Elements of program optimization
The evolution of computer technology has greatly influenced the techniques of program performance optimization. Nowadays more than ever, knowledge of computer organization and architecture is a prerequisite for successful program optimization.
2.3.1 Computer organization and its evolution
Computers traditionally consisted of four main structural elements: a central processing unit (CPU), a main memory (M), input-output components (I/O) for data movement between the computer and its external environment, and the system interconnection linking them [48 p. 28].
Today the hardware revolution and a number of modern functional requirements have brought waves of new technologies that have modified the organization of the computer, mainly around the CPU and the memory.
Computers now have multiple processors or cores on the same chip or socket. The memory hierarchy and its management have been further improved, and one level of that hierarchy, the cache, now plays a role of utmost importance.
The cache memory is made of several levels of memory blocks decreasing in size as they get closer to the core. The upper-level blocks are each dedicated to one core, while all the cores share the lower-level blocks.
Figure 2-2 Intel core i7 block diagram (from [1 p. 56])
2.3.2 Computer program performance goals
The main function of a computer is to execute programs. A program is a set of instructions that the processor executes. In its simplest form, the execution of a single instruction consists of a fetch stage and an execute stage, together constituting the instruction cycle [48 p. 31]. The clock rate is the speed at which the processor steps through these cycles.
In real implementations, one instruction involves many clock cycles; the number of cycles per instruction (CPI) is an important metric of processor execution performance, as it influences the CPU time, the overall time spent by the processor to run a program.
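In the classic textbook formulation (a standard relation, not quoted from this thesis), these quantities combine as

\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}

so reducing the CPI or the cycle time (i.e. raising the clock rate) directly reduces the time needed to run a program.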
Besides the data processing operations inside the arithmetic logic unit and the control instructions, most of the execute stage and the entire fetch stage consist of memory access operations. CPU access to memory has been identified as a relatively expensive operation and has been the central bottleneck for computer performance improvement.
These facts and observations have led the techniques and technologies for program performance improvement to focus on two principal goals:
- Increase throughput: decrease the number of cycles per instruction and increase the processor clock rate, for a lower CPU time
- Decrease latency: obtain optimal CPU-to-memory access speed throughout the program execution
2.3.3 Branch optimization
Branch prediction is a technique used within processors to improve the flow of instructions, with the greatest impact on pipelined processors (pipelining is a technique that exploits the capability of the processor to evaluate multiple instructions in parallel [2 pp. 147, 261]). Branch prediction occurs when a program reaches a conditional instruction (if-then-else or switch).
The processor tries to identify the branch that is most likely to be taken, pre-fetches its code and speculatively executes it, discarding the result if that turns out not to be the branch actually taken by the program.
Although prediction techniques have often proven successful and have resulted in improved performance, there are cases where the chosen branch is the wrong one (a branch misprediction) and the delay or penalty incurred by the program is considerable.
Branch optimization is the practice of reducing branch mispredictions. In general, performing this optimization before design and implementation is considered premature optimization.
2.3.4 Cache optimization
Trade-offs in the cost, performance and size of memory technologies have led to the appearance of the memory hierarchy [2 p. 72], in which the cache memory plays an important role in performance improvement. The cache is a relatively fast, small and expensive memory placed on the same chip as the processor, thus offering reduced latency when accessed by the CPU.
[Figure: the memory hierarchy pyramid, from registers (smallest, fastest, most expensive) through caches and main memory down to disk storage (largest, slowest, cheapest), with indicative sizes, access times and relative prices.]
Figure 2-3 Trade-offs in the cost-performance-size of memory
Optimizing cache performance often brings more benefit to programs than other
optimization techniques. There are many such cache optimization techniques [2 p. 78],
but they usually all act on a few important cache metrics.
The hit ratio is one such metric: it is the number of memory references that hit the
cache divided by the total number of memory references.
The miss rate is the complementary metric, i.e. miss rate = 1 − hit ratio. One important
class of misses is the compulsory miss, which happens at the very first access to a
memory block, because the block has not previously been referenced.
An LLC miss is a miss that occurs in the last-level cache; it is the kind of miss that
causes the most performance degradation.
Knowing a processor's number of cache levels, cache sizes and cache layout can help
tailor a program to run efficiently on computers with that processor. The problem with
this approach is that the program might be inefficient when running on a different type
of processor.
Cache-oblivious optimization refers to the practice of cache optimization based on
general principles of caches such as the principle of locality (see § 2.3.5) or other
techniques such as divide-and-conquer, rather than based on a particular cache
configuration or size.
2.3.5 Principle of Locality and Common case
Programs tend to reuse data and instructions they have used recently [2 p. 45]. This
observation comes from measurements showing that a program usually spends 90% of its
execution time in only 10% of its code (the 90/10 rule); that 10% is the program hotspot.
There are two types of locality: temporal locality concerns code or data accessed
recently, and spatial locality concerns data or code whose addresses are near one another.
Cache optimization techniques rely heavily on this property, and an informed programmer
can increase a program's performance simply by exploiting it appropriately.
One such situation arises with loop-based structures. Because a program with a loop
will likely return to the same portion of code many times, appropriate use of the principle
of locality can bring significant performance improvements, as the sketch below illustrates.
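As an illustration of spatial locality (a generic sketch, not code from the PXML library), the function below sums a matrix stored row by row in one contiguous block; because the inner loop walks adjacent addresses, most accesses hit the cache, whereas iterating column by column would stride across memory and typically raise the miss rate.

#include <cstddef>
#include <vector>

// Sums an nRows x nCols matrix stored row-major in one contiguous vector.
long long sum_row_major(const std::vector<int>& m, std::size_t nRows, std::size_t nCols) {
    long long sum = 0;
    for (std::size_t i = 0; i < nRows; ++i)        // outer loop over rows
        for (std::size_t j = 0; j < nCols; ++j)    // inner loop walks adjacent addresses
            sum += m[i * nCols + j];               // cache-friendly: consecutive accesses share cache lines
    return sum;                                    // swapping the loops would stride by nCols elements
}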
A related principle, referred to as the common case in [2 p. 45], is a design principle that
favours the frequent case over the infrequent case when deciding on performance
trade-offs. When optimizing an algorithm, it is often more beneficial to identify the frequent
case (such as a branch often taken in a switch statement) and focus on optimizing that
case first.
2.4 Elements of program concurrency
2.4.1 Threads and cores
In a computer, a unit of sequential processing can be a thread or a process. The process
is the more 'robust' of the two because it benefits from operating system support in terms of
security: it encapsulates and protects all its internal structures and executes within its own memory
space, whereas a thread executes in a memory space shared with other threads.
A program is a multi-process or concurrent program if it allows more than one process
or thread to execute in parallel. Because the program works towards a single goal,
these processes or threads usually have to communicate. Communication among
processes is typically achieved through message passing, while threads communicate
through shared memory, a memory location they can all access and use as a medium for
their communication.
This thesis focuses on threads and shared memory, and the term concurrent program
will be preferred to multi-process program.
A computer is a multiprocessor or multi-core computer when it has more than one
processor. Concurrent programs fit well on multiprocessor computers, but they can also
run on a single-processor computer; in that case, the operating system arranges for the
processor to serve the threads successively, in turn.
A context switch occurs when the operating system switches the processor from one
thread to another. Context switches are expensive (in terms of CPU time) and often
contribute considerably to overall performance degradation. With many cores, context
switches may still occur, although with lower probability; their impact can be
considerably reduced if the program is written so that threads are evenly distributed
among the processors and their cache memories.
On multiprocessor computers, access to data becomes hard to manage in the presence
of concurrent programs and is often the cause of hurdles and bottlenecks. Access
contention is one such issue: it happens when data written by one thread is read by
another thread on another core; its impact is reflected in the cache memory and causes
problems such as false sharing.
A typical case of false sharing is when two algorithms running in parallel on different
cores use two logically separate variables that are inadvertently placed at nearby
memory locations. The caching hardware, following the principle of spatial locality
(see § 2.3.5), will always try to treat them together, forcing the corresponding cache line to
move from one dedicated cache to another and thus increasing the miss rate [2 p. 366].
2.4.2 Synchronization and concurrent objects
In a concurrent program, two threads competing or cooperating towards a common goal
may need to access the same space, the shared memory; inappropriate access affects
the program's integrity or consistency and leads to a hazardous situation called a race
condition.
This problem, identified as the mutual exclusion problem, was solved many years ago
by E. W. Dijkstra [49], who provided synchronization as its solution.
Synchronization is a set of rules and mechanisms that allow the specification and
implementation of concurrent programs whose executions are guaranteed to be
correct [28 p. 5], or to have a degree of correctness expressed as liveness or progress
conditions [28 p. 137].
At a certain level of abstraction, a program is made of elements that participate in its
execution; in some programming paradigms, these are referred to as objects. A
concurrent object is an object that can safely be accessed concurrently by several
threads without requiring explicit synchronization; such an object is said to be
thread-safe.
A mutex is an example of such a concurrent object. It defines lock and unlock methods
that can be called by many threads. Once a thread calls the lock method of a mutex, it
has acquired that mutex; any other thread trying to acquire it will block (its execution is
suspended) until the mutex is released by the thread that locked it, through a call to its
unlock method.
The mutex can be used to ensure that only one thread at a time enters the area of
program code between its lock and unlock calls; that region is referred to as a
critical section.
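A minimal sketch of this usage in C++ (illustrative only; the shared counter and its names are not part of PXML): the region guarded by the lock is the critical section just described.

#include <mutex>

std::mutex counter_mutex;   // the lock shared by all threads
long shared_counter = 0;    // data accessed by several threads

void increment_counter() {
    std::lock_guard<std::mutex> guard(counter_mutex); // acquires the mutex (lock)
    ++shared_counter;                                 // critical section: one thread at a time
}                                                     // the guard's destructor releases the mutex (unlock)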
A condition variable is also a concurrent object. Its C++ specification defines the methods
wait, notify_one and notify_all, which can be called by one or many threads. Any thread that
calls a condition variable's wait method is blocked. Threads blocked on a condition
variable are said to be waiting, as they can be unblocked only when a defined condition
is met. The condition can be either that a predicate previously associated with the
condition variable becomes true, or that a release notification is sent to the
condition variable from another thread by calling its notify_one method (to unblock only
one of the waiting threads) or its notify_all method (to unblock all the waiting threads).
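A minimal sketch of this wait/notify_one protocol (the names ready, cv and m are illustrative, not taken from the PXML code):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool ready = false;   // predicate associated with the condition variable

void waiting_thread() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });   // blocks until notified and the predicate holds
    // ... proceed once the condition is met ...
}

void notifying_thread() {
    { std::lock_guard<std::mutex> lock(m); ready = true; }
    cv.notify_one();   // unblocks one waiting thread (notify_all would unblock them all)
}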
Mutexes and condition variables belong to a type of concurrent objects called
synchronization primitives, as they can be used to build synchronization constructs
(see § 2.4.3 below).
The C++ language provides classes to work with concurrent objects (std::mutex and
std::condition_variable) and threads (std::thread).
The std::thread class allows the creation and manipulation of threads. One of its methods
is join, used to synchronize the execution of threads: in practice, it makes one thread
wait for another thread to complete before continuing its own execution
(see the ThreadJoiner class in § 3.6.3).
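A minimal sketch of std::thread and join (the worker function is hypothetical): the creating thread blocks on join() until the worker has finished, which is the behaviour the ThreadJoiner class builds upon.

#include <thread>

void parse_chunk() { /* hypothetical work performed by the worker thread */ }

void run_and_wait() {
    std::thread worker(parse_chunk);   // the worker thread starts executing immediately
    // ... the creating thread may do other work here ...
    worker.join();                     // block until the worker has completed
}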
2.4.3 Lock-free and lock-based synchronization
Synchronization can be implemented in terms of concurrent objects. There are two types
of synchronization, depending on the concurrent objects used to implement them, and
each type has a set of related progress conditions.
Lock-based synchronization consists of providing a synchronization object called a
lock that allows a zone of code to be bracketed so as to guarantee that a single thread
at a time can execute it. It is based on mutexes and their critical sections (see § 2.4.2).
When a thread is blocked for synchronization reasons (or for other reasons such as
memory access latency), its execution is suspended; this creates an idle period called
wait time, which is usually undesirable because it wastes processing time.
There are cases where the wait time itself consumes CPU time. That is the case with a
spin lock, where a thread repeatedly tries to acquire the lock and so remains in a
busy-wait state. Spin locks have proven to be more efficient than traditional locks in cases
where critical-section exclusivity is required only for a short time.
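A minimal spin-lock sketch using std::atomic_flag (an illustration, not the dissertation's implementation): the acquiring thread busy-waits, repeatedly retrying instead of being suspended, which pays off only when the critical section is very short.

#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;   // cleared = unlocked
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* busy wait */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};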
Two progress conditions (see § 2.4.2) can be associated with lock-based
synchronization: deadlock-freedom and starvation-freedom. In other words, deadlock
and starvation are the two main issues of lock-based synchronization.
Lock-free synchronization is based on atomic registers or hardware-provided primitive
operations (e.g. compare-and-swap). The following progress conditions can be associated
with lock-free synchronization: obstruction-freedom, non-blocking progress and wait-freedom,
the last being the highest level of correctness a synchronization technique can achieve.
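A minimal lock-free sketch based on compare-and-swap, illustrating the hardware primitive mentioned above (this is not code from the PXML library):

#include <atomic>

std::atomic<int> counter{0};

void lock_free_increment() {
    int expected = counter.load();
    // Retry until the compare-and-swap succeeds: no thread ever blocks,
    // but an individual thread may have to repeat its attempt.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // on failure, 'expected' is refreshed with the current value; loop and retry
    }
}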
2.4.4 Speedup and thread concurrency
Concurrency makes programs run faster by improving their throughput. The speedup
tells how much faster a program will run with concurrency applied as opposed to its
single-threaded version [2 p. 46]; it allows estimating the benefit of applying concurrency
to a program. Let us consider two parameters:
- EnhancementFraction. The enhancement fraction, which is the fraction of the single-threaded
version that can be converted to run with multiple threads
- ImprovementRatio. The improvement ratio, which is the ratio of the time taken by the
single-threaded program to the time taken by the same program using multiple threads.
The speedup formula is:
Speedup = 1 / ((1 − EnhancementFraction) + EnhancementFraction / ImprovementRatio)
The speedup formula is derived from Amdahl's law. The law states that the
performance improvement to be gained from using some faster mode of execution of a
program is limited by the fraction of the time the faster mode can be used [2].
For a concurrent program, the best ImprovementRatio is obtained when the maximum
number of threads is effectively running in parallel and all threads are performing
useful work.
EnhancementFraction plays an important role in the speedup equation: a small value limits
the possible improvement for any given ImprovementRatio, while a value approaching 1
increases the potential for a better speedup.
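As an illustrative calculation (not taken from the dissertation's measurements), suppose 90% of the single-threaded parser can be made concurrent (EnhancementFraction = 0.9) and that fraction runs four times faster on four threads (ImprovementRatio = 4); the speedup is then 1 / (0.1 + 0.9/4) ≈ 3.1, noticeably less than the ideal factor of 4 because of the remaining sequential fraction.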
2.5 Design patterns
The multithreading revolution has led to the identification of new design patterns [6], [7]. Some
are concurrency counterparts of the traditional design patterns described by the Gang of Four
[50]; others are simply new design patterns specific to parallel computing.
This section describes two traditional design patterns and four concurrency design patterns
used in this work.
2.5.1 Strategy pattern
The intent of the strategy pattern is to define a family of algorithms and make them
interchangeable; one of its motivations is that different algorithms may be appropriate
at different times [50 p. 315].
For example, consider a Context that needs to scan some text but requires a different
variant of the scanning algorithm at different times. Different scan strategies can be
implemented (ScanStrategy1, ScanStrategy2 and ScanStrategy3), and the Context can
use them interchangeably through the ScanStrategy interface.
Figure 2-4 The strategy pattern (the Context's scan(ScanStrategy sc) method delegates to sc->scan(); ScanStrategy1, ScanStrategy2 and ScanStrategy3 each implement the scan() method of the ScanStrategy interface)
A remarkable benefit of the strategy pattern is that it provides an alternative to
conditional statements [50 p. 315]: the conditional statement can be replaced by a
strategy assignment, with each branch moved into its own strategy.
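A minimal C++ sketch of this arrangement, using the class names of Figure 2-4 (the method bodies are illustrative placeholders, not the PXML implementation):

#include <string>

class ScanStrategy {                          // Strategy interface of Figure 2-4
public:
    virtual ~ScanStrategy() = default;
    virtual void scan(const std::string& text) = 0;
};

class ScanStrategy1 : public ScanStrategy {   // one interchangeable concrete algorithm
public:                                       // (ScanStrategy2 and ScanStrategy3 would follow the same shape)
    void scan(const std::string&) override { /* variant 1 of the scanning algorithm */ }
};

class Context {                               // delegates to whichever strategy it is given
public:
    void scan(ScanStrategy* sc, const std::string& text) { sc->scan(text); }  // no conditional: the branch is the strategy
};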
2.5.2 Observer pattern
The observer pattern is needed when changes to an object, called the subject, need to be
watched by other objects (called observers).
The interaction between the subject and the observers is known as publish-subscribe:
the subject is the publisher of notifications, and any number of observers can subscribe
to receive them [50 p. 294].
Figure 2-5 The observer pattern (the Subject holds a List<Observer> and offers subscribe(Observer), unsubscribe(Observer) and notify(); the Observer interface defines update(UpdateData data), implemented by ConcreteObserver; ConcreteSubject specializes Subject)
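A minimal sketch of this publish-subscribe interaction, following the names in Figure 2-5 (the payload and implementation details are illustrative, and unsubscribe is omitted for brevity):

#include <vector>

struct UpdateData { int value = 0; };              // payload carried by each notification

class Observer {                                   // Observer interface
public:
    virtual ~Observer() = default;
    virtual void update(const UpdateData& data) = 0;
};

class Subject {                                    // the publisher
    std::vector<Observer*> obsList;                // subscribed observers
public:
    void subscribe(Observer* o) { obsList.push_back(o); }
    void notify(const UpdateData& data) {
        for (Observer* o : obsList) o->update(data);   // publish to every subscriber
    }
};

class ConcreteObserver : public Observer {         // one subscriber
public:
    int lastValue = 0;
    void update(const UpdateData& data) override { lastValue = data.value; }
};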
2.5.3 Active object pattern
The intent of the active object pattern is to have objects whose methods can be invoked
asynchronously in one thread while their execution takes place in a different thread;
the pattern decouples method invocation from method execution [51].
An active object can be an object that resides and runs in its own thread (the thread of
execution), independently of the thread that created it (the thread of creation).
Implementations of an active object usually provide a means for the thread of execution
to communicate the results or outcome of the execution back to the thread of creation.
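One minimal way to sketch this decoupling in C++ (an illustration based on std::async and std::future, not the pattern's implementation in PXML): the call is issued in the creating thread, the work runs in another thread, and the future carries the outcome back.

#include <future>

int parse_chunk_count() { return 42; }   // hypothetical long-running task

void invoke_asynchronously() {
    // Invocation happens in the creating thread; execution happens in another thread.
    std::future<int> result = std::async(std::launch::async, parse_chunk_count);
    // ... the creating thread is free to do other work here ...
    int elements = result.get();         // the future brings the outcome back to the creator
    (void)elements;
}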
2.5.4 Monitor pattern
The intent of the monitor pattern is to synchronize concurrent method execution so that
only one method at a time runs within an object [51]. In addition to being
mutually exclusive within the object, method executions are also pre-conditioned on
some predicate being verified.
The monitor is a higher-level concurrency construct that requires a number of
synchronization primitives to participate in its construction. An essential participant is a
mutex, a synchronization object providing mutual exclusion.
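A minimal monitor sketch in C++ (the class and the guarded value are illustrative): a mutex makes the methods mutually exclusive within the object, and a condition variable pre-conditions execution on a predicate.

#include <condition_variable>
#include <mutex>

class BoundedCounter {
    std::mutex m;
    std::condition_variable cv;
    int value = 0;
public:
    void increment() {
        std::lock_guard<std::mutex> lock(m);   // only one method runs within the object at a time
        ++value;
        cv.notify_one();
    }
    void wait_until_positive() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return value > 0; });   // execution pre-conditioned on a predicate
    }
};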
2.5.5 Thread pool pattern
Multithreading allows multiple tasks to run in parallel, each task running within one
thread. The thread pool organization is needed when the number of tasks to run in
parallel is much higher than the number of available threads.
A thread pool mechanism typically consists of inserting the tasks into an internal data
structure such as a queue or a stack [29 pp. 223, 245], then letting each thread fetch a task
from the queue, run it, and proceed with another task until the data structure is empty.
Because threads competing for tasks may cause race conditions, the thread pool requires
synchronization mechanisms that allow thread-safe insertion and retrieval of tasks into
and from the queue, as sketched below.
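A minimal thread-pool sketch (illustrative only, not the PXML ThreadPool): the queue of tasks is protected by a mutex, and each worker repeatedly fetches a task and runs it until the queue is drained.

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

void run_pool(std::queue<std::function<void()>> tasks, unsigned nthreads) {
    std::mutex m;                        // protects the shared task queue
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < nthreads; ++i) {
        workers.emplace_back([&] {
            for (;;) {
                std::function<void()> task;
                {
                    std::lock_guard<std::mutex> lock(m);   // thread-safe retrieval
                    if (tasks.empty()) return;             // no work left: this worker finishes
                    task = std::move(tasks.front());
                    tasks.pop();
                }
                task();                                    // run the task outside the lock
            }
        });
    }
    for (auto& w : workers) w.join();    // wait for every worker to complete
}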
2.5.6 Thread-local storage pattern
In a multi-threaded environment, threads share the same memory space and need
synchronization to access a shared memory location. There are situations where each
thread requires the same functionality, which can be implemented using the same
program variable, but where that variable does not need to be shared with other threads.
The thread-local storage pattern allows multiple threads to access a single
definition of an object; but instead of having the object instance shared by all the
threads, it arranges for each thread to have its own copy of the object instance, kept internal
to the thread's execution context.
Although the object definition appears to be global, any reference to it is a
reference to a unique, local version internal to the thread accessing it.
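In C++ this can be sketched with the thread_local keyword (the counter below is illustrative, not the PXML ChunkContext): the definition is single and global, yet every thread that touches the variable works on its own copy.

thread_local int events_seen = 0;   // one global definition, one independent copy per thread

void handle_event() {
    ++events_seen;                  // no synchronization needed: each thread updates its own copy
}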
2.6 Putting it all together
The operational mode of a SAX parser is similar to that of a pushdown automaton or
PDA [39 p. 109]. The PDA has a number of states defining its stable conditions (PDA
states) and another set of states arranged in an internal stack (stack states).
An input to the PDA causes a transition if the combination of the input with the
current PDA state and the top stack state has a corresponding pair formed by a
PDA state and a stack state. The relationship between every possible input set (the input,
the PDA state and the stack state) and its corresponding pair (a PDA state and a
stack state) is the transition function or transition table.
The implementation of the parser will thus inevitably make use of conditional
statements to define transitions, comparing the current state, the stack state and the
input against the transition table for a possible match.
The proposed PXML parser will use the strategy pattern in its implementation to
eliminate a number of these conditional statements and replace them with strategy
algorithms. The implemented strategy pattern will achieve both a performance goal,
the reduction of branch mispredictions (see § 2.5.1), and an
organizational goal, dynamic polymorphism, since strategies are
interchangeable at runtime.
As discussed in § 1.1.3, this thesis adopts a divide-and-conquer strategy in which the
parser cuts the XML document into multiple parts, or chunks, and parses them in parallel.
The number of chunks should be in reasonable proportion to the number of resources
available for parsing them in parallel, that is, the number of processor cores and the
number of threads. However, the number of chunks will typically be higher than the
number of cores or threads. The thread pool pattern will allow a balanced distribution of
chunks over the available resources (see § 2.5.5).
The parser will have to process each chunk within its own dedicated thread; for each chunk,
the thread function and the chunk details constitute an active object, as it executes in a
thread different from its thread of creation (see § 2.5.3). The active objects will need to
interact, as they all participate in the common goal of parsing an XML document; the
monitor pattern will allow synchronized interaction between them (see § 2.5.4).
The most important requirement of this parser is the support of concurrency.
In a SAX parser, events come in sequential order, as they appear in the document being
parsed. For the proposed parser, events will come in a non-sequential order. The parser
will therefore have to provide contextual information for each chunk to help the user reorder the
events for meaningful use. The thread-local storage pattern will help the library keep this
contextual information local to its chunk, so that it is safely accessible within the chunk
without requiring synchronization (see § 2.5.6).
3 Design and implementation
3.1 Introduction
The SAX parser that this thesis proposes has two functional requirements:
- Conformance to the SAX specification and
- Support of concurrency
Conformance to the SAX specification is the easier part. The SAX API is simple and
already has a fully compliant Java implementation; the work in this thesis therefore amounts
to providing a C++ binding for it.
The support of concurrency needs to be implemented without affecting conformance to
the SAX specification. That is, any modification of the existing SAX implementation classes
made to achieve concurrency must remain within the extent allowed by the standard, and
the classes added for the sake of parallel parsing should be properly encapsulated.
This thesis groups the classes of this design and implementation into three groups,
according to their contribution to the requirements:
- SAX classes: all classes coming from the SAX specification. They implement
the 'visible' part of the SAX parser, in contrast to the other classes, which are
encapsulated. These classes address one of the functional requirements,
conformance to SAX. See § 3.3.1 and § 3.4 for design and
implementation.
- PXML classes: classes that implement the concepts of the proposed
parser, such as the chunking and parsing scanners, and the algorithms used to
achieve parallel parsing, such as the chunking and parsing loops. See § 3.3.2 and §
3.5 for design and implementation.
- Concurrency classes: additional classes added for concurrency support; they keep
the whole system consistent with respect to the introduced multi-threading.
They address non-functional requirements such as synchronization, thread safety
and thread concurrency. See § 3.3.3 and § 3.6 for design and implementation.
This chapter presents the design and implementation of the essential classes of each of the
above-mentioned groups; it brings out the relationships among them and their interaction in
achieving the proposed objective.
The SAX specification and concurrency principles were introduced at length in the previous
chapters; this chapter therefore begins with the presentation of the fundamental concept of the
proposed PXML parser.
3.2 Fundamental concept
The proposed PXML parser divides the XML document into chunks so that it can parse
them in parallel using multiple threads, and so increase the XML parsing speed. The
“chunking” algorithm is not directly based on physical properties such as chunk size,
although the aim is to obtain chunks of similar sizes; it is rather a markup-aware
chunking, based on the logical structure of the XML document (see § 2.1.2).
As far as markup is concerned, different parts of an XML document require different
processing effort. For instance, within an element tag no comment or CDATA section can
occur, and the characters used in element names belong to a limited set of allowed
characters; within content, on the other hand, the vast majority of markups and productions
can occur, including other elements. This difference means the parser needs more effort
to parse a chunk that is part of content than a chunk that is part of an element,
even if they have the same physical size.
At byte level, however, size is what matters most. Because the file is read byte by
byte from memory (or in groups of bytes if a buffer is used), the reading effort is always
directly proportional to the size of the chunk, the logical structure of the document being
meaningless at this level.
The markup-aware chunking tries to strike the right balance: it performs the chunking
according to the logical structure of the document, while striving to produce chunks of
equal size so as to obtain a balanced distribution of the parsing effort.
Because XML documents store data, they usually consist of a sequence of elements
that, like database records, contain similar content. Considering this fact a
“common case” (see § 2.3.5), the proposed PXML parser bases its chunking algorithm on
the following assumption:
For most XML documents, there is a depth at which elements start repeating in
similar shapes, hence in similar sizes.
Once the parser identifies that depth, it performs the chunking of the XML document solely
on the basis of its logical structure. It counts on the natural, 'common-case' fact that the
structure of the XML document will be a repetition of elements of similar size, and so
obtains chunks that are balanced both in size and in markup.
3.2.1 Scanner types and chunk allocation
The parser has to process the XML document to some extent in order to identify the right
depth and define the chunking locations. Since the parser then needs to parse the
produced chunks as well, there is a double processing. However, the two
processing passes have different goals, hence different parsing algorithms and thus
different speeds.
The proposed PXML parser uses two scanners, each dedicated to one of these
processing passes.
The chunking scanner scrutinizes the file in order to identify chunk locations. The
element is the unit on which the scanner divides chunks, to minimize the need for any
communication between chunks; for a given depth, the chunking scanner's mission is
to identify the XML elements at that depth.
The parsing scanner receives chunk information (start-tag locations) from the
chunking scanner and properly parses the chunks, following the full set of XML rules.
Multiple parsing scanners can parse chunks in parallel.
The success of the PXML algorithm relies on the chunking scanner being faster than
the parsing scanners. The chunking scanner has a much smaller set of XML rules to
comply with, does not need to parse the internal content of elements and, most importantly,
does not trigger any external event. Because the chunking scanner completes its job
earlier, the concurrent parsing of chunks compensates for the double-scanning overhead.
Figure 3-1 Chunks allocation to different scanners (the prolog scanner is allocated the prolog, the chunking scanner scans the XML document to locate chunk boundaries, and the resulting chunks are distributed to parsing scanners running in threads #1 to #4)
Figure 3-1 shows a possible allocation of chunks per scanner. Notice the presence
of another scanner, the prolog scanner, to which the prolog is allocated. Because the
prolog has specific rules that differ from the remainder of the file, it requires a
different parsing algorithm and thus a different type of scanner.
3.2.2 Parsing properties and parsing modes
Three PXML parser properties help the user control the parsing: the pool
configuration, the chunking depth and the siblings per chunk.
The chunking depth property is the depth of the elements at which the
parser starts cutting the XML document into chunks. Once the parser identifies the first
element at the chunking depth, the chunking scanner starts cutting at all its following-sibling
elements (see § 2.1.3) until it reaches the closing tag of the root element.
Figure 3-2 Chunks allocation in the single-threaded parsing mode (the PrologScanner handles the prolog and a single ParsingScanner parses the entire document as one chunk)
The siblings per chunk property is the number of sibling elements included in each
chunk. Because the number of elements is typically much higher than the
number of threads, the siblings per chunk property indirectly gives fine-grained control
over the number of chunks.
The pool configuration property determines the parsing mode of the parser (see §
1.1.2). Depending on its value, the pool configuration property sets the parser in the
corresponding parsing mode:
- single-threaded parsing mode (pool configuration equals -1)
- multi-threaded automatic parsing mode (pool configuration equals 0)
- multi-threaded manual parsing mode (pool configuration > 0)
In the single-threaded mode, the parser does not need multi-threading and does not use
the chunking scanner; it uses the parsing scanner with a single thread to parse the
whole document, treating the root element as the only chunk, as shown in Figure 3-2 above.
This parsing mode is typically appropriate for small files.
In the multi-threaded mode, the pool configuration controls the thread concurrency of
the parser by directly setting the number of threads the parser will use. The multi-
threaded mode can be manual (number of threads controlled by the user) or automatic
(number of threads controlled by the PXML library).
3.2.3 From bytes to SAX events
From bytes to characters and from characters to SAX events, the parsing is
accomplished through the composition of two pushdown automata, or PDAs (see § 2.6):
the transcoder and the scanner.
The Transcoder consumes bytes and recognizes valid XML characters (or, more
precisely, code points). Its transition function is a subset of the encoding specification
the parser is using, such as UTF-8 or UTF-16 (because not every UTF-8 or UTF-16
character is a valid XML character).
The Scanner is a PDA that operates at a higher level than the transcoder; it consumes
the characters produced by the transcoder and recognizes valid markups or productions.
Here the transition function is the XML specification itself.
The Reader is a filter that selects only the markups and productions that comply with the SAX
specification, in order to create and trigger SAX events. For instance, the comment
production triggers an event only if a LexicalHandler is available (see § 2.2.3). The
reader has access to the SAX handler classes in order to trigger their callback methods
as SAX events.
Figure 3-3 Main parser components and their responsibility (the Transcoder, a PDA driven by the UTF-8/UTF-16 encoding and XML specifications, turns bytes into characters; the Scanner, a PDA driven by the XML specification rules, turns characters into markups such as <element/>; the Reader, a filter driven by the SAX specification rules, turns markups into SAX events such as startElement())
3.3 Class relationship and interaction
The UML class diagram in Figure 3-4 (page 40) shows the most important PXML classes,
and the relationship between them; the Figure 3-5 (page 42) represents the PXML classes’
interaction during the multi-threaded parsing.
3.3.1 SAX classes
The PXML parser provides XMLReaderImpl as the implementation of the XMLReader
interface and AttributesImpl as the implementation of the Attributes interface; it leaves the
implementation of the SAX handlers to the library users. PXmlCountHandler is an
example SAX handler implementing the ContentHandler interface; it is not a library
class but part of the PXmlCount test program.
Taken in isolation, the SAX classes constitute an observer pattern (see the observer pattern
in § 2.5.2), with XMLReader (Subject), XMLReaderImpl (ConcreteSubject),
ContentHandler (Observer) and PXmlCountHandler (ConcreteObserver) as
participants.
The XMLReaderImpl is a central class of the parser concept. It contains the parse and
the concurrentParse methods that define the chunking algorithm and the concurrent
parsing algorithm respectively. Its C++ implementation is discussed in § 3.4.2.
3.3.2 PXML classes
The abstract classes XmlTranscoder and XmlScanner are generalizations of the
central PXML concepts of transcoder and scanner (see § 3.2.3).
The XmlTranscoder defines an interface for converting from bytes to characters. Its
subclasses TranscoderUtf8 and TranscoderUtf16 implement the convert_bytes
method.
The XmlScanner defines a family of algorithms for converting from characters to
markups. Its subclasses PrologScanner, ChunkingScanner and ParsingScanner
provide their implementations of the consume_char method.
The relationship between these classes is visible in the UML diagram as a cascade of
two strategy patterns (see the strategy pattern in § 2.5.1).
The strategy pattern for byte conversion has XMLReaderImpl (Context),
XmlTranscoder (Strategy), TranscoderUtf8 (ConcreteStrategy) and TranscoderUtf16
(ConcreteStrategy) as participants.
The strategy pattern for scanning has XmlTranscoder (Context), XmlScanner
(Strategy), PrologScanner (ConcreteStrategy), ChunkingScanner (ConcreteStrategy)
and ParsingScanner (ConcreteStrategy) as participants.
Figure 3-4 PXML class diagram (showing the SAX classes XMLReader, XMLReaderImpl, ContentHandler, PXmlCountHandler, Attributes and AttributesImpl; the PXML classes XmlTranscoder, TranscoderUtf8, TranscoderUtf16, XmlScanner, PrologScanner, ChunkingScanner and ParsingScanner; and the concurrency classes ThreadPool, ThreadSafeQueue, ChunkTask, ThreadJoiner and ChunkContext, with their associations)
3.3.3 Concurrency classes
The ThreadPool's responsibility is to create and maintain an appropriate number of
threads, each used for parsing chunks of the XML document; for this it uses the
classes ThreadSafeQueue, ChunkTask and ThreadJoiner. The UML class diagram in
Figure 3-4 shows that a composition aggregation links them: the ThreadPool
owns the other classes.
The ThreadSafeQueue is the internal container used to store the chunking data of each
chunk the parser creates. Chunking data means all the information needed to parse
that chunk independently; ChunkTask is the class that gathers this
information. The ThreadSafeQueue is thus a parameterized, thread-safe container of
ChunkTask objects.
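A minimal sketch of what a thread-safe queue with the push / wait_and_pop interface of Figure 3-4 can look like, built from a mutex and a condition variable; this is a standard construction given for illustration, not necessarily the exact PXML implementation.

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class ThreadSafeQueue {
    std::queue<T> data;
    mutable std::mutex m;
    std::condition_variable cv;
public:
    void push(T value) {
        { std::lock_guard<std::mutex> lock(m); data.push(std::move(value)); }
        cv.notify_one();                                  // wake one thread waiting for a task
    }
    void wait_and_pop(T& value) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !data.empty(); });  // block until a task is available
        value = std::move(data.front());
        data.pop();
    }
};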
The ThreadJoiner class is used to ensure cooperation between the thread containing
the chunking scanner and the parsing scanner threads.
The ChunkContext does not participate in the thread pool organization; it is used to
provide chunk information to the library user. It is created by the ParsingScanner class,
and both are attached to a particular chunk. It is a member of the ContentHandler
so that it is available to the handler callback functions, which are accessible to the library
users.
3.3.4 Class interaction and scanning loops
Figure 3-5 represents the interactions between the PXML classes during parsing in
multi-threaded mode with two threads.
The sequence diagram starts with an instance of ContentHandler and an instance of
XMLReaderImpl, the reader. Upon the call of its parse method, the reader creates
XmlTranscoder and PrologScanner instances, then sets the scanner to receive
characters from the transcoder.
The transcoder-scanner tandem performs a prolog loop, which consists of the
transcoder calling convert_bytes and the prolog scanner calling consume_char in a
loop until the scanner recognizes the root element and notifies the reader by setting
isRootElementFound to true.
Because the parsing mode is multi-threaded (pool_config = 2), the reader creates the
ThreadPool, which creates two threads and waits for the reader to submit ChunkTask
instances for parallel parsing.
The reader continues with a ChunkingScanner, which replaces the prolog scanner in
the transcoder, forming what is now a chunking loop. Whenever the scanner recognizes
a chunk position, according to the parsing properties, it notifies the reader by setting
isChunkPosition to true. The reader then collects the chunking information as a
ChunkTask and submits it to the ThreadPool.
Figure 3-5 PXML sequence diagram (the parse sequence for the parsing properties pool_config = 2, chunking_depth = 1 and siblings_per_chunk = 1; the convert_bytes() and consume_char() operations are iterative within their loops, the loop fragments being omitted for diagram clarity)
The reader continues with the chunking loop until it reaches the end of the document
element, then waits for the ThreadPool to finish its tasks before acknowledging the end of
the document to the user with the SAX endDocument callback.
When a thread within the ThreadPool receives a ChunkTask, it creates XmlTranscoder
and ParsingScanner instances to form a transcoder-scanner tandem performing a
parsing loop. The loop consists of the transcoder calling convert_bytes and the parsing
scanner calling consume_char in a loop until the scanner reaches the end of the chunk
and notifies the thread by setting isEndOfChunk to true.
PXML considers the prolog loop, the chunking loop and the parsing loop to be
specializations of the general concept of a scanning loop, each having its own loop breaker
(isRootElementFound, isChunkPosition and isEndOfChunk respectively).
The sequence diagram illustrates the definition of the PXML algorithm:
The PXML algorithm consists of a prolog loop, then a chunking loop, and one or
many parsing loops running in parallel.
3.4 Implementation of SAX classes
3.4.1 Characters, String and C++ binding
Because the PXML implementation of the SAX parser is not in Java, the library has to
provide a binding, i.e. the C++ equivalent of the objects used in Java.
The main difference lies in the string type. The C++ language provides an equivalent of the
Java String (std::string), but in order to optimize the performance of the
parser and, most importantly, to provide the right representation of the Unicode
encodings (not handily provided by C++), the library provides its own string type.
PXML defines the following type and classes:
- XmlCh: an XML character type, or more precisely an XML code point [42]
- XmlChar: a class for XML character or code point manipulation
- XmlBuffer: a class that represents a dynamic string
UTF-8 and UTF-16 allow every code point to be represented in at most four bytes of storage:
up to four 8-bit units for UTF-8, and either one 16-bit unit or a pair of 16-bit units for UTF-16 [42].
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 

Recently uploaded (20)

Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 

Contents

1 Introduction  7
1.1 Motivation  7
1.1.1 Computer design trend  7
1.1.2 The need for faster XML parsing  8
1.1.3 Parallel parsing at the rescue  9
1.2 Objectives  10
1.2.1 Throughput  10
1.2.2 Concurrency support  11
1.2.3 Scalability  11
1.3 Challenges  11
1.3.1 Synchronization and speedup  12
1.3.2 Programming languages support of concurrency  12
1.4 Organization of the thesis  13
2 Background  15
2.1 The extensible markup language (XML)  15
2.1.1 XML document representations  15
2.1.2 XML text and logical structure  15
2.1.3 XML tree concepts  17
2.1.4 Well-formedness and validation constraints  18
2.1.5 XML standard and compliance  19
2.1.6 Character's encoding, BOM and Unicode standards  19
2.2 XML processing  20
2.2.1 The Document Object Model (DOM)  20
2.2.2 A Simple API for XML processing (SAX)  21
2.2.3 SAX specification and language binding  22
2.3 Elements of program optimization  23
2.3.1 Computer organization and its evolution  23
2.3.2 Computer program performance goals  24
2.3.3 Branch optimization  24
2.3.4 Cache optimization  25
2.3.5 Principle of Locality and Common case  26
2.4 Elements of program concurrency  26
2.4.1 Threads and cores  26
2.4.2 Synchronization and concurrent objects  27
2.4.3 Lock-free and lock-based synchronization  28
2.4.4 Speedup and thread concurrency  29
2.5 Design patterns  30
2.5.1 Strategy pattern  30
2.5.2 Observer pattern  31
2.5.3 Active object pattern  31
2.5.4 Monitor pattern  31
2.5.5 Thread pool pattern  32
2.5.6 Thread-local storage pattern  32
2.6 Putting it all together  32
3 Design and implementation  34
3.1 Introduction  34
3.2 Fundamental concept  35
3.2.1 Scanner types and chunk allocation  35
3.2.2 Parsing properties and parsing modes  37
3.2.3 From bytes to SAX events  38
3.3 Class relationship and interaction  39
3.3.1 SAX classes  39
3.3.2 PXML classes  39
3.3.3 Concurrency classes  41
3.3.4 Class interaction and scanning loops  41
3.4 Implementation of SAX classes  43
3.4.1 Characters, String and C++ binding  43
3.4.2 XMLReaderImpl class  44
3.5 Implementation of PXML classes  50
3.5.1 XmlTranscoder class  50
3.5.2 TranscoderUtf8 class  50
3.5.3 XmlScanner class  51
3.5.4 ChunkingScanner and ParsingScanner classes  52
3.6 Implementation of concurrency classes  55
3.6.1 ChunkContext class  55
3.6.2 ThreadSafeQueue class  56
3.6.3 ThreadPool class  58
3.7 PXmlCount test program  60
4 Evaluation  61
4.1 Evaluation objectives  61
4.1.1 Speedup and elapsed time  61
4.1.2 Performance optimization metrics  61
4.2 Measurement collection  62
4.2.1 Profiler and test program  62
4.2.2 Accuracy and elapsed time  63
4.2.3 Test files  64
4.2.4 Test platforms  64
4.3 Measurement results  65
4.3.1 Observing parsing speed improvement with the PXmlCount program  65
4.3.2 Observing parsing speed improvement with the Intel VTune Amplifier  67
4.4 Speedup evaluation  69
4.4.1 EnhancementFraction  69
4.4.2 ImprovementRatio  69
4.4.3 Speedup  70
4.5 Hotspots and bottlenecks location  71
4.5.1 Hotspots  71
4.5.2 Synchronization bottleneck  72
4.5.3 Memory and processor bottlenecks  73
4.6 Effects of parsing properties on performance  74
4.6.1 Pool configuration  74
4.6.2 Chunking depth  75
4.6.3 Siblings per chunk  76
5 Reflection and conclusions  78
5.1 PXML library integration to XML projects  78
5.2 Lessons learned  78
5.3 Conclusion  79
5.4 Further research directions  80
5.4.1 Dynamic reconfiguration of parsing properties  80
5.4.2 Parsing based on XML schema  81
5.4.3 Lock-free synchronization  81
6 Bibliography  82
7 Appendices  85
7.1 PXmlCount program  85
7.1.1 PXmlSpinLock.hpp  85
7.1.2 PXmlSpinLock.cpp  86
7.1.3 PXmlCountHandler.hpp  87
7.1.4 PXmlCountHandler.cpp (Part I)  88
7.1.5 PXmlCountHandler.cpp (Part II)  89
7.1.6 PXmlCount.cpp (Part I)  90
7.1.7 PXmlCount.cpp (Part II)  91
7.2 PXML Character and String  92
7.2.1 XmlChar class  92
7.2.2 XmlBuffer class  93
7.3 XmlScanner states enumeration  94
7.4 ThreadJoiner and ChunkTask classes  95
7.5 Other XMLReaderImpl methods  96
1 Introduction

This chapter discusses the motivation behind this work, presents the thesis objectives and explains the challenges around them. Finally, it describes the organization of the remaining chapters.

1.1 Motivation

1.1.1 Computer design trend

Traditionally, computer performance depended mostly on increases in CPU clock speed, on execution optimization and on refinements in memory organization [1 p. 665]. A faster clock, new CPU optimization techniques or a better memory model meant a 'de facto' performance increase for a computer system and all of its programs. Today that free benefit is over¹; for a decade now a fundamental change in the computer industry has pushed processor designers towards a new approach to improving computer performance: placing multiple processors on the same chip [2 p. 344].

¹ This refers to the widely cited essay "The free lunch is over" by Herb Sutter, who was among the first to describe the change required of the software world to exploit parallel hardware [61].

One reason for this turnaround was that designers could no longer increase the processor clock speed because of physical limitations, principally the high density of micro-components preventing power dissipation, and the interconnect wires causing RC delay (the resistance R increases because the wires shrink in size, and the capacitance C increases because they get closer to each other) [2 p. 19]. Although recent research suggests that these limitations could be overcome with better alternatives to silicon, such as graphene [3], there is an ultimate limit to the processor clock speed: the speed of light [4].

The evolution of the memory hierarchy and of caching techniques has led to part of the computer memory being placed right next to the processor [1 p. 674]. The CPU and the memory now influence each other so much that optimizing the performance of one of them in isolation has become impractical; typically, the memory speed lags behind the processor speed. This phenomenon, called the memory wall, is another reason that convinced designers to adopt multi-core processors [5].

As predicted by Moore's Law, the number of transistors on a single chip continued to grow exponentially as their price decreased, leading to more complex but cheaper CPUs. Multi-core organization quickly became the only way to build better-performing computers, and parallelism therefore became the most cost-effective way to achieve better program throughput.

This change has brought a number of improvements but has also introduced new kinds of complexity in program design. The industry has delivered a solution for increasing performance, but actually obtaining that performance is now the programmer's burden: the multiple cores must be programmed explicitly to take advantage of them. That has turned out to be a difficult exercise. Multiple problems arise, most of them related to the learning curve introduced by these new concepts; the proliferation of concurrent counterparts of traditional design patterns [6][7] is one sign that programmers need help to build better-performing concurrent constructs.

Today, more than ever, awareness of multi-core computer architecture and of concurrency theory and practice is crucial to software engineers, system designers and application programmers. This thesis illustrates this fact: the early chapters immerse the reader in the complex and still evolving area of multi-threading, alongside an exploration of recent advances in processor and memory optimization.

1.1.2 The need for faster XML parsing

XML is a markup language widely used today to store and exchange business-critical information. One of the reasons behind its popular adoption is its simplicity, claimed mostly because the language is self-describing and is in a format readable by both humans and machines.

The XML language is highly verbose, largely in order to provide that human readability. For instance, in the XML document extract below, the number of characters used for the tags (see § 2.1.2 for tags), which give a contextual meaning to the "content", is greater than the number of characters in the "content" itself.

    <tag>content</tag>

In a large XML document the markup represents an important part of the overall size and introduces a substantial burden when that document is processed by machines. Web Services and SOAP are examples of applications and protocols that fail to reach a satisfactory level of their most important performance requirement, the response time, because of their use of XML [8] [9] [10]. When large XML data is transferred using web services, an XML parser needs to process the data on the client side. The time required to process the data affects the customer's experience of the service as soon as the data reaches a certain size. In this regard, JSON is a data interchange format repeatedly cited as a better alternative to XML [11] [12]; the equivalent JSON form of the above XML extract contains fewer characters, and at large scale the difference becomes noticeable.

Another overhead introduced in XML processing is the validation step. An XML document that requires conformance with a defined schema needs to be validated, on top of being well-formed (see § 2.1.4). The validation step constitutes a considerable burden for XML processing; many XML parser libraries, especially those claiming to be 'fast' (see Table 1-1, validation support), simply do not include a validation feature.

Generalizing the traditional relational database model to XML is widely considered today [13 p. 240], and XML databases and XML-related database technologies are seeing large adoption (XQuery, Oracle XML DB [14]). However, they all have efficiency concerns directly or indirectly linked to the overhead of processing XML; one long-standing objection to using XML in database technologies has been that higher processing overhead [15].

1.1.3 Parallel parsing at the rescue

Many researchers have considered increasing XML parsing speed through concurrency, and many have identified the arrival of multi-core processors as an opportunity to achieve a spectacular improvement in XML processing [16], [17], [18], [19].

The authors of [19], among others, already suggested concurrently parsing pre-divided (pre-parsed) chunks of an XML document, in a divide-and-conquer fashion, in order to increase the parsing speed. They focussed on obtaining the DOM-like skeleton of the full document (see § 2.2.1 for DOM) in the pre-parsing step before processing the produced chunks in parallel, whereas in this dissertation the pre-parsing itself also takes place in parallel.

Surprisingly, to the best of our knowledge, none of the suggested technologies has affected the world of XML processing: among the most widely used parser libraries today, none has apparently adopted any of the above-referenced technologies or methods. Today's parser libraries remain inherently single-threaded, with no direct support for concurrency. Some of them (see Table 1-1, concurrency support) offer limited support for adding concurrency capability, but leave the entire burden to the programmer.

Table 1-1 below lists some popular XML parser libraries, their compliance with the XML, DOM and SAX specifications, and their support for concurrency and validation.
Library         Language    Style and features          XML        SAX   DOM             Validation   Concurrency
Xerces [20]     Java, C++   The most compliant          1.0 & 1.1  2.0   Up to Level 3   Yes          No direct [21]
Libxml2 [22]    C, C++      Partial SAX and DOM         1.0        No    No              Yes          No direct [23]
RapidXML [24]   C++         DOM-like, fast (in situ)    Partial    No    No              No           No
Expat [25]      C           SAX-like, popular           1.0        No    No              No           No
TinyXML [26]    C++         DOM-like, small size        Partial    No    No              No           No

Table 1-1 Popular XML parser libraries (the XML, SAX and DOM columns indicate compliance with the respective specifications)

1.2 Objectives

The objective of this work is the design and implementation of a SAX-compliant XML parser library, PXML, which uses parallel programming techniques to increase processing speed. The library aims to help its users take advantage of the multiple processors available for parsing XML documents.

The developed algorithm consists of cutting an XML document into chunks and parsing the chunks concurrently, with the cutting (or 'chunking') and the parsing occurring in parallel. During the parsing, SAX events become available to the library user concurrently, and the user may (but need not) use appropriate synchronization techniques in order to consume those events. The main aims are to improve throughput, provide concurrency support and offer scalability.

1.2.1 Throughput

Computer architecture has improved very much lately; it is not possible to perform advanced program optimization without knowledge of concepts such as branching or caching and their impact on a program's performance. Ignoring them can drastically decrease the performance of an application, without debugging or traditional troubleshooting being of any help in finding the cause. Conversely, mastering them may bring spectacular improvements to programs.
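As a small, generic illustration of the kind of effect meant here (a sketch written for this discussion, not code taken from the PXML sources), consider classifying bytes while scanning a document. A chain of comparisons compiles into several data-dependent conditional branches, which the processor may mispredict on irregular input; a 256-entry lookup table replaces them with a single access to a table that easily stays in cache.

    #include <array>
    #include <cstdint>
    #include <initializer_list>

    // Branch-heavy classification: several data-dependent branches per byte.
    inline bool isDelimiterBranchy(unsigned char c) {
        return c == '<' || c == '>' || c == '&' || c == '"' || c == '\'';
    }

    // Table-driven classification: one lookup in a small, cache-resident table.
    inline bool isDelimiterTable(unsigned char c) {
        static const std::array<std::uint8_t, 256> table = [] {
            std::array<std::uint8_t, 256> t{};
            for (unsigned char d : {'<', '>', '&', '"', '\''}) t[d] = 1;
            return t;
        }();
        return table[c] != 0;
    }

Which version wins depends on the input data and on the compiler, which is precisely the point: such effects are invisible in the source code and only show up once the generated branches and memory accesses are taken into account (see § 2.3.3 and § 2.3.4).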
This work achieved faster parsing of XML documents thanks to knowledge of computer architecture and organization, primarily of the factors influencing program performance.

1.2.2 Concurrency support

The PXML library complies with the SAX specification for parsing XML documents; it adds concurrency support as a set of properties (as suggested by the SAX standard [27]) on top of the specification. The library lets the programmer choose, among the three modes of operation below, the one that best fits the application domain:

1. Single-threaded
2. Multi-threaded manual (the user explicitly sets the number of threads to use)
3. Multi-threaded automatic (the PXML library chooses how many threads to use according to the concurrency capability of the platform)

The library abstracts the hard concepts of concurrency internally but still gives the programmer the opportunity to tune the parser's behaviour at will. The library is easy to reason about; it combines concurrency concepts and XML processing in a natural way.

1.2.3 Scalability

The number of cores ranges from two to eight in today's personal computers and up to 32 in servers, and this number is predicted to increase. A program made for 4-core processors may need a redesign when it has to run on a 16-core computer, or when 16-core processors become the standard in personal computers.

The designed parser aims to be scalable, meaning able to increase its parsing speed seamlessly with the number of cores available on the platform where it runs, without the programmer doing any additional coding. The PXML parser library offers that capability thanks to the multi-threaded automatic mode: for the same parsing properties, the parsing improvement will be higher on a computer with more CPU cores.

1.3 Challenges

There are a number of challenges in this project, the primary one being the difficulty of applying concurrency correctly without affecting the system, and of obtaining a real benefit when it is applied successfully.
In addition, the fact that support for concurrency is not a trivial matter in most programming languages does not make this task any easier.

1.3.1 Synchronization and speedup

A concurrent program is a program made up of several entities that cooperate towards a common goal [28 p. vi]. In doing so they have to access shared resources on the computer, and in order to keep these resources in a consistent state, the entities have to synchronize their accesses to them. Many agree that this is hard to achieve; M. Herlihy and N. Shavit refer to exploiting parallelism as one of the outstanding challenges of modern computer science [29 p. 1].

The performance increase expected for programs is in most cases a faster execution or an increased throughput². However, the improvement brought by concurrency is a potential rather than a guaranteed benefit, because the overall performance increase (how much faster the program runs) depends more on the way concurrency has been applied than simply on the number of additional processors available.

² Another goal when applying concurrency is "separation of concerns", where each core or thread is dedicated to a specific task not tightly related to the others, for instance in GUI programming. This aspect of concurrency is not discussed in this dissertation.

Because both applying concurrency and benefiting from it are difficult, this thesis considers synchronization and speedup the two fundamental concepts to master when going for concurrency. Synchronization (see § 2.4.2) is what one does to ensure that program objects remain in a consistent state; however, it has the disadvantage of reducing the overall throughput of the application, because of the relatively high cost of implementing it. Speedup (see § 2.4.4) is a measure of the overall improvement brought to the program by concurrency; whether or not the program was successfully made parallel, a speedup equal to 1 means there was no improvement.

1.3.2 Programming languages support of concurrency

Another challenge, common to all those trying to dig into concurrency, but especially to programmers, is that effective concurrency support in the major programming languages was slow to arrive and is still evolving today. Java added concurrency utilities (with java.util.concurrent and other packages) only in Java 5 [30]; the recent Java 8 introduced further support with libraries for parallel operations and concurrent accumulators [31]. Similarly, the Microsoft .NET framework (with the C# programming language) introduced concurrency support only in its version 4. The C++ programming language ignored even the existence of threads and atomic operations up to the latest edition of the C++ standard [32], where concurrency support was added at both the language level and the library level, and further features are expected in the coming version [33].

Today there is comprehensive support for concurrency in the major programming languages, but the learning curve remains significant and adoption is still slow. Some important languages, such as JavaScript, contain no threading mechanism at the core level, and limited support has only started to appear with 'web workers' [34 p. 322].

Some languages have built-in support for concurrency. One of the most widely used is the Erlang programming language, which uses a message-passing concurrency model and claims to be easier to reason about and more robust in its implementation [35 pp. 1-14]. Unfortunately, this concurrency model does not fit a lower-level task like processing XML documents.

Java has traditionally been the chosen language for XML and for the SAX specification, but C++ is a lower-level language better suited to processing files and streams, and it provides finer-grained control of memory. C++ is the development language of the major web browsers' rendering engines, which also process XML (Gecko, Blink, Trident and WebKit), and Table 1-1 shows that C++ is the favoured language for XML parser libraries. With the recently added library support for concurrency, C++ offers most of the concurrency concepts essential to the realization of this thesis, within both the language and the standard library, and it deserved to be the chosen language for this project.

1.4 Organization of the thesis

The organization of the remaining chapters is as follows:

Chapter 2 (Background) begins with an introduction to XML and SAX concepts. For XML, it discusses the textual and tree representations of an XML document, the XML specifications, XML validation, XML characters and Unicode support. For SAX, it presents the API, primarily in comparison to the DOM model, then focuses on the SAX specification and language binding. The chapter continues with a description of selected topics in program optimization and in concurrency, discussing principally the concepts, methods and techniques used in this work. Finally, it briefly describes some design patterns used in the design and implementation of the proposed parser.
Chapter 3 (Design and implementation) introduces the concept of chunking and parsing scanners, the basis of the proposed PXML parser library. It then discusses in detail the design and implementation of the following parts of the parser:

- the SAX classes, which ensure conformance to the SAX specification;
- the PXML classes, the building blocks of the central library concepts;
- the XML reader implementation, which contains the parsing algorithm; and
- the concurrency classes, the essential constructs of the concurrency support.

Chapter 4 (Evaluation) conducts measurements of the primary aspects of the PXML parser's performance, essentially the speedup, and presents the metrics used to assess the performance improvement. The chapter discusses hotspots, thread concurrency, synchronization, CPU usage and memory issues, and ends with a review of the parsing properties and their effect on the parser's performance, supported by measurement results.

Chapter 5 (Reflection and conclusions) discusses some aspects of the realization of this dissertation, such as the integration of the proposed library into XML projects and the application of the principles and algorithms of this thesis in a broader context. It also concludes the thesis.

The bibliography and references are in chapter 6. The appendices are in chapter 7; their first section presents the PXmlCount program, a full program based on the PXML library and used in this work as a test program to count the number of elements and characters of an XML document.
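To give an early flavour of what such a program looks like from the library user's side, the sketch below shows a simplified counting handler. It is only an illustration written for this summary: the callback names follow the SAX convention (startElement, characters), but the class and the exact signatures are hypothetical and do not reproduce the PXmlCountHandler code listed in the appendices. Because PXML may deliver events from several threads, the counters are kept in atomics.

    #include <atomic>
    #include <cstddef>
    #include <iostream>
    #include <string>

    // Illustrative SAX-style content handler that counts elements and characters.
    class CountingHandler {
    public:
        void startElement(const std::string& /*name*/) { ++elements_; }
        void characters(const char* /*text*/, std::size_t length) { chars_ += length; }
        void report() const {
            std::cout << "elements: " << elements_.load()
                      << ", characters: " << chars_.load() << '\n';
        }
    private:
        std::atomic<std::size_t> elements_{0};  // updated safely even if callbacks
        std::atomic<std::size_t> chars_{0};     // arrive concurrently from the pool
    };

The appendices list a PXmlSpinLock class alongside PXmlCountHandler, which suggests that the real test program protects its state with a spin lock instead; either approach addresses the point made in § 1.2 that the user may need to synchronize the consumption of concurrently delivered events.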
2 Background

2.1 The extensible markup language (XML)

XML (eXtensible Markup Language) is a framework for defining markup languages. It is a vast subject, and a full introduction to it is impractical in the context of this work. A complete introduction to XML is available in [36], while [13] provides a more concise description of XML and related technologies. The XML standard [37] provides the complete XML specification and compliance rules, but it is difficult to assimilate; for a comprehensive understanding of XML concepts, this work recommends an XML course such as the one given in the Software Engineering Programme [38] of the University of Oxford.

There are thousands of technologies and applications related to XML. This thesis mentions exclusively those related to XML document definition (DTD, Schema) and those related to XML processing (XPath, XSLT, XQuery), discussing only the concepts essential to the understanding of this thesis.

2.1.1 XML document representations

An XML document, in its textual representation (subsequently referred to in this dissertation as XML text), is made of a sequence of balanced and properly nested markups and text fragments. Conceptually, however, an XML document is equivalent to a hierarchical tree structure called an XML tree.

Listing 2-1 shows an XML text. It is a modified version of the TourAgency.xml document found in the XML module of the SEP in Oxford [38 p. Exercises]. The XML text includes annotations (the labels to the right of the markup) identifying key markups. Figure 2-1 is another representation of the same XML text, in its conceptual form as an XML tree.

2.1.2 XML text and logical structure

In the textual representation of an XML document, one can readily identify tags, constituted of a name between an opening (<) and a closing (>) bracket. Such tags typically come in two flavours: a start-tag such as <hotel>, and its corresponding end-tag, which differs from the start-tag by the presence of a slash just after the opening bracket, such as </hotel>. An XML element includes the start-tag, its matching end-tag, and the content between the two. A particular kind of element is the empty-element-tag, such as <flat/>.

    <?xml version="1.0" encoding="UTF-8"?>                  XML declaration
    <!DOCTYPE MyTourAgency SYSTEM "MyTourAgency.dtd">       Document type declaration
    <MyTourAgency>                                          Root element
      <rating stars="2">                                    Element (rating)
        <pool>true</pool>
        <room_service>true</room_service>
      </rating>
      <rating stars="3">
        <pool>true</pool>
        <sauna>true</sauna>
      </rating>
      <country name="Bulgaria">
        <resort name="Borovet">
          <hotel name="Rila">500</hotel>
          <flat>200</flat>
          <lowSeasonRent>
            <nbrDay>6</nbrDay>
            <banner>The whole week!</banner>
          </lowSeasonRent>
        </resort>
      </country>
      <country name="Andorra">
        <resort name="Pas De La Casa">                      Start-tag (resort)
          <hotel name="Bovit">300</hotel>
          <flat/>                                           Empty-element-tag (flat)
        </resort>                                           End-tag (resort)
        <resort name="Soldeu / El tartar">
          <info><![CDATA[Best restaurant]]></info>          CDATA section
          <info>Serial id is                                Character data
                3163&lt;6475</info>                         Entity reference
          <hotel rate="2">500</hotel>
          <flat>200</flat>
        </resort>
      </country>
                                                            Ignorable white space
      <!-- countries and rating -->                         Comment
      <?php printf("starting with ratings")?>               Processing instruction
    </MyTourAgency>

    Listing 2-1 Textual representation of an XML document (with annotations)

Elements may have simple name/value pairs associated with them, called attributes. Attributes usually identify or give more information about the element.

An XML production specifies a sequence of markups or other productions upon which substitution can be performed recursively to generate new markup sequences³. The entire XML text is just a production called document, defined in the XML standard as:

    document ::= prolog element Misc*

³ The term 'production' comes from the production rules used for grammar generation, for example in context-free grammars.
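For orientation, the tag-related productions of the XML specification (quoted here without their attached constraints) show how the pieces identified above fit together:

    element      ::= EmptyElemTag | STag content ETag
    STag         ::= '<' Name (S Attribute)* S? '>'
    Attribute    ::= Name Eq AttValue
    ETag         ::= '</' Name S? '>'
    EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'

In Listing 2-1, <flat/> matches EmptyElemTag, while <resort name="Pas De La Casa"> ... </resort> matches STag content ETag, with name="Pas De La Casa" matching Attribute.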
The prolog is the first production of an XML document. It contains the XML declaration and the document type declaration productions (see the annotations in Listing 2-1). The XML declaration specifies the XML version to which the document conforms (see § 2.1.5) and the character encoding being used (see § 2.1.6). The document type declaration belongs to the built-in schema language (see § 2.1.4).

Anything between an element's start-tag and end-tag is its content. The content of an element consists of intermingled character data (CharData) and any of the element, processing instruction (PI), comment, entity reference, character reference or CDATA section (CDSect) productions. The XML standard defines content as:

    content ::= CharData? ((element | EntityRef | CharRef | CDSect | PI | Comment) CharData?)*

CDSect (also referred to as a CDATA section) is used to escape blocks of text between the strings '<![CDATA[' and ']]>'; anything in between is pure text and must not be processed as markup by the parser.

An entity reference allows the representation of characters that have a meaning in XML by an escape sequence, preventing the parser from interpreting them as markup. For instance, to represent the opening bracket '<' within content, one places the entity name for this character between the characters '&' and ';', so the sequence '&lt;' is interpreted by the parser as a single '<' character.

A character reference represents a character by the decimal (or, with an 'x' prefix, hexadecimal) representation of its code point between '&#' and ';'. For instance, the sequence '&#60;' is replaced by '<' within XML content.

Comments and processing instructions are not part of the XML structure; they are destined respectively for the human reader and for the XML processor. They are XML productions that can themselves be part of the Misc production (see Listing 2-1).

2.1.3 XML tree concepts

In the tree representation, a node is the counterpart of an element in the XML text, and the root node corresponds to the first element of the textual representation, the root element.

Tree theory defines a path as a sequence of nodes connected by edges [39 p. 10]. Note that path here means the shortest path, that is, a path that does not repeat nodes. The depth of a node is the number of edges on its path to the root node; it is its 'distance' to the root element, which makes the root element itself of depth 0. In the XML text (Listing 2-1), the depth of an element roughly corresponds to its indentation level; in the XML tree, all nodes with the same depth are grouped between horizontal dotted lines.

    Figure 2-1 Tree representation of XML document (diagram omitted: the element nodes of the MyTourAgency document arranged by depth 0 to 4, with the root node, sibling groups, leaves, edges and paths labelled, and the descendants of 'Andorra' highlighted)

Two nodes are parent and child of each other when the path between them contains no other node, the parent being the one closest to the root node. The child's depth is always the parent's depth plus one. A group of nodes are siblings when they have the same depth and a common parent; sibling nodes that are side by side in the tree refer to each other as preceding-sibling (left-hand) and following-sibling (right-hand) [13 p. 62]. The descendants of a node are the set of nodes that have this node on their path to the root node. The ancestors of a node are the nodes found on the path from that node to the root node.

2.1.4 Well-formedness and validation constraints

The World Wide Web Consortium defined the XML standard in terms of productions and constraints. The specification describes a list of constraints called well-formedness constraints (WFC) and validation constraints (VC).
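Before turning to the constraints themselves, a short sketch shows how the depth and nesting notions of § 2.1.3 surface during a single left-to-right pass over the XML text. This is a generic illustration, not the PXML algorithm: it assumes the tags have already been recognized and simply tracks them with a stack, whose size at any moment is the depth of the element being opened.

    #include <stack>
    #include <string>
    #include <vector>

    struct Tag { bool isStart; std::string name; };

    // Returns true when every end-tag matches the most recently opened start-tag
    // and no element is left open, i.e. the tags are properly nested.
    bool properlyNested(const std::vector<Tag>& tags) {
        std::stack<std::string> open;
        for (const Tag& t : tags) {
            if (t.isStart) {
                // open.size() is the depth of this element (0 for the root)
                open.push(t.name);
            } else if (open.empty() || open.top() != t.name) {
                return false;              // end-tag without a matching start-tag
            } else {
                open.pop();
            }
        }
        return open.empty();               // no unclosed start-tag remains
    }

Empty-element tags such as <flat/> can simply be skipped here, since they open and close at once. Running this over the tags of Listing 2-1, the stack holds four names when the start-tag of banner is reached, matching the depth-4 position of banner and nbrDay in Figure 2-1.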
A text document qualifies as a well-formed XML document (or simply an XML document) when its textual representation matches the document production (see § 2.1.2) and it satisfies all the well-formedness constraints. An XML document is a valid XML document if, in addition to being well-formed, it satisfies all the validation constraints with respect to a given schema.

An XML language is a particular family of XML documents complying with additional syntactic and semantic rules. A schema is a formal definition of the syntax and semantics of such an XML language, and a schema language is a formal language for expressing schemas [13 p. 92]. Validating an XML document amounts to establishing its conformance to the syntax and semantics expressed in a schema. The most popular schema languages are DTD and XML Schema.

Document Type Definition (DTD) is the XML built-in schema language; its definition is part of the XML specification. The document type declaration, which optionally specifies a type and the location of a document containing further validation rules, belongs to the DTD language.

XML Schema is another popular schema language, considered much more elaborate than DTD. A number of limitations have been identified in DTD [13 p. 112], and XML Schema was specially designed to overcome them and bring other improvements. An XML Schema document is itself an XML document and thus has the advantage of being self-describing.

XML Schema and DTD encouraged the creation of validating parsers, but the validation step introduces a non-negligible overhead to the overall processing of an XML document. That overhead is a reason many XML parsers do not incorporate a validation feature.

2.1.5 XML standard and compliance

The W3C standard for XML is today in its version 1.1, but the previous version 1.0 is still widely used and even recommended for general use. The main additions in version 1.1 concern compliance with later versions of the Unicode standard [40]. Many XML libraries limit their compliance to version 1.0, as it is sufficient for most applications. The proposed PXML parser also conforms to version 1.0 of the XML specification, excluding the validation constraints.

2.1.6 Character's encoding, BOM and Unicode standards

Unicode [41] is an international encoding standard for the representation of characters, texts and symbols. The standard assigns a unique integer value to each character so that its use is unambiguous across multiple languages. The binary representation of a character's Unicode number is its encoding, and the encoding of an XML text is the Unicode encoding specification to use when converting it between binary and textual form.

The XML specification does not allow all existing Unicode characters within XML documents; it defines a set of valid XML characters and excludes the rest, since their introduction would affect the interpretation of the document. Inside the XML text, only subsets of the valid XML characters may appear within specific markups; for instance, the character data production allows a larger set of characters than the element production.

UTF-8 and UTF-16 [42] are the most popular Unicode encoding specifications. UTF-8 is the most widely used on the web, among other reasons because of its similarity to the old ASCII encoding; UTF-16 is widely adopted by many programming languages (Java) and operating systems (Windows).

The first step in parsing an XML file is to transform the sequence of bytes forming the document into a sequence of characters in the specified encoding. Because the byte ordering (or endianness) of a file or stream differs from one machine type to another, XML parsers must take it into account in order not to produce incorrect output. Some files or streams carry a byte order mark (BOM) as their first character, to help XML and text processors use the right endianness when reading them.

2.2 XML processing

XML processing languages allow the extraction of XML content and structure. XML languages such as XPath, XSLT and XQuery are the most popular for processing XML documents. Although these processing languages fulfil most needs, some processing may require a particular way of parsing; XML programming APIs let programmers define their own processing mechanism for XML documents. DOM and SAX are two of the most used of these APIs. They are intrinsically different in the way they process XML documents and are the prime representatives of the two principal models of XML programming.

2.2.1 The Document Object Model (DOM)

DOM is a language-independent API for XML (and HTML) documents defined by the W3C consortium and freely available [43]. It defines the logical structure of XML documents and allows reading, manipulating, modifying and creating them programmatically.
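Once such an in-memory tree has been built (the next paragraph describes how the DOM parser builds it), DOM-style processing is ordinary tree traversal. The fragment below is a deliberately simplified sketch with a hypothetical node type; it does not use the W3C DOM interfaces or any particular library.

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical in-memory node, standing in for a DOM element node.
    struct Node {
        std::string name;
        std::string text;
        std::vector<Node> children;
    };

    // Visit every node of the tree and print the text of the 'hotel' elements.
    void printHotels(const Node& n) {
        if (n.name == "hotel")
            std::cout << n.text << '\n';
        for (const Node& child : n.children)
            printHotels(child);
    }

Because the whole tree is available, the traversal can be repeated, reordered or combined with modifications at will; this random access is exactly what the stream-oriented SAX model of § 2.2.2 gives up.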
When parsing an XML document, a DOM parser first reads the full document to build an in-memory representation of it, on which it performs all subsequent operations. The DOM representation is usually similar to the XML document's own structure and has the form of a tree. The API defines a set of interfaces, procedures and methods to navigate over the XML document's elements and to create or modify them.

The number-one problem with DOM is the necessity of reading the full document before any further processing, because this incurs an extra delay and consumes system memory. The DOM processing model therefore does not fit some needs, such as when parsing speed is crucial or when parsing large documents. On the other hand, DOM is very stable: once the DOM processor has loaded the XML document into memory, the programmer can freely go back and forth throughout the document, which is not possible in SAX-based processing.

2.2.2 A Simple API for XML processing (SAX)

There are circumstances where it is not necessary to build the full structure of the XML document before processing it; the document can be processed while it is being read. SAX is an API for processing XML documents that provides an alternative to the DOM mechanism [44]. It is an event-based API (also referred to as a "push parser"), which operates by reading the XML file or stream and triggering SAX events for each XML entity it recognizes as part of the SAX specification. It is a serial access mechanism that processes each markup sequentially and only once.

An important consequence of SAX parsing is that it is stateless: once the parser has processed a markup, it may discard any information about it before proceeding with the next markup. This is both an advantage and a disadvantage for SAX users. It is a disadvantage because the burden of saving state information falls on the user, with the possible introduction of errors into the processing. It is an advantage because the user is free to focus only on the parts of the XML document of interest, avoiding reading the full document into memory.

    <resort name="Pas De La Casa">        startElement(resort)
      <flat>                              startElement(flat)
        300                               characters("300")
      </flat>                             endElement(flat)
    </resort>                             endElement(resort)

    Listing 2-2 XML portion and corresponding SAX events
• 22. Parallel XML parser - 22 -
Given the XML document portion on the previous page (Listing 2-2), a SAX parser reading it will generate the five indicated events (see annotations) corresponding to the XML markups it has recognized.
SAX is usually well suited to XML processing that focuses on information retrieval; when it comes to manipulating the document structure, the DOM parser is often more appropriate.
2.2.3 SAX specification and language binding
Unlike DOM, SAX does not come from the W3C consortium; it has been developed by the XML-Dev mailing list, with the participation of many contributors [45]. Because the first implementation of SAX uses the Java programming language, and no formal specification exists, the Java implementation of the SAX API is the 'de facto' standard [46].
The SAX specification, currently in its version 2.0, is an ensemble of Java classes and interfaces that implementations need to extend to make a SAX parser, or that library users need to implement to use the parser. These classes and interfaces fall into two important groups:
- Parser designer interfaces: XMLReader and Attributes
- Parser user interfaces or handlers: ContentHandler, ErrorHandler, DTDHandler and EntityResolver
There are other classes, but this thesis does not discuss them. Among them are SAXException and SAXParseException for exception handling, InputSource for processing XML document streams, and the LexicalHandler class, part of the SAX 2 Extensions [47], used to provide lexical information about an XML document, such as comments and CDATA section boundaries.
An implementation of a SAX parser is required to extend the parser designer interfaces (XMLReader and Attributes) and leave the implementation of handlers to the library users. The XMLReader has methods for parsing and for setting features, properties and handlers. The proposed PXML parser will provide concurrency support as an implementation-specific parsing property (see § 3.2.2). Users of a SAX library have to implement the parser handlers' callback functions in order to use the parser.
Because programming languages differ in semantics, an implementation of SAX using a language other than Java may provide its language binding, which is its equivalent of the SAX Java classes and interfaces. The case of Java and C++ is notable principally because of a fundamental difference in their memory management styles (garbage collection for Java, RAII and smart pointers for C++). The C++ implementation will principally focus on providing an efficient mechanism for string creation, destruction and manipulation (see § 3.4.1).
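To make the notion of a language binding concrete, the sketch below shows what a minimal C++ counterpart of the ContentHandler callbacks could look like for the events of Listing 2-2. It is an illustration only: the signatures are simplified assumptions, and std::string stands in for the dedicated string types that the actual binding defines (see § 3.4.1).

    #include <iostream>
    #include <string>

    // Minimal sketch of a C++ SAX-style content handler; the interface mirrors
    // the Java ContentHandler callbacks in a heavily simplified form.
    class ContentHandler {
    public:
        virtual ~ContentHandler() = default;
        virtual void startElement(const std::string& qName) = 0;
        virtual void characters(const std::string& text) = 0;
        virtual void endElement(const std::string& qName) = 0;
    };

    // Handler that simply prints each event it receives.
    class PrintingHandler : public ContentHandler {
    public:
        void startElement(const std::string& qName) override {
            std::cout << "startElement(" << qName << ")\n";
        }
        void characters(const std::string& text) override {
            std::cout << "characters(\"" << text << "\")\n";
        }
        void endElement(const std::string& qName) override {
            std::cout << "endElement(" << qName << ")\n";
        }
    };

    int main() {
        PrintingHandler handler;
        // Simulate the five events of Listing 2-2 as a SAX parser would emit them.
        handler.startElement("resort");
        handler.startElement("flat");
        handler.characters("300");
        handler.endElement("flat");
        handler.endElement("resort");
        return 0;
    }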
• 23. Parallel XML parser - 23 -
2.3 Elements of program optimization
The evolution of computer technology has greatly influenced the techniques of program performance optimization. Nowadays more than ever, knowledge of computer organization and architecture is a prerequisite for successful program optimization.
2.3.1 Computer organization and its evolution
Computers traditionally consisted of four main structural elements: a central processing unit (CPU), a main memory (M), input-output components (I/O) for data movement between the computer and its external environment, and a system interconnection providing communication among them [48 p. 28]. Today the hardware revolution and a number of modern functional requirements have driven new waves of technologies that have modified the organization of a computer, mainly in the CPU and memory. Computers now have multiple processors or cores available on the same chip or socket. The memory hierarchy and its management have been further improved. One level of the memory hierarchy, the cache, now plays a role of uppermost importance. The cache memory is made of different levels of memory blocks, decreasing in size as they get closer to the core. Upper-level memory blocks are each dedicated to one core, and all the cores share the lower-level memory blocks.
Figure 2-2 Intel core i7 block diagram (from [1 p. 56])
• 24. Parallel XML parser - 24 -
2.3.2 Computer program performance goals
The main function of a computer is to execute programs. A program is a set of instructions that the processor executes. In its simplest form, the execution of a single instruction consists of a fetch stage and an execution stage, constituting the instruction cycle [48 p. 31]. The clock rate is the speed at which the processor executes instructions. In real implementations, executing one instruction involves many clock cycles; the number of cycles per instruction (CPI) is an important metric of processor execution performance, as it influences the CPU time, the overall time spent by the processor to run a program.
Besides data processing operations inside the arithmetic logic unit and control instructions, most instructions in the execution stage, and the entire fetch stage, are memory access operations. Access from the CPU to the memory has been identified as a relatively expensive operation and has been the central bottleneck for computer performance improvement.
The above facts and observations have led techniques and technologies for program performance improvement to focus on two principal goals:
- Increase throughput: decrease the number of cycles per instruction and increase the processor clock rate, for a decreased CPU time
- Decrease latency: obtain an optimal access speed from the CPU to the memory throughout the program execution
2.3.3 Branch optimization
Branch prediction is a technique used within processors to improve the flow of instructions, with a more positive impact when used with a pipelined processor (pipelining is a technique that exploits the capability of the processor to evaluate multiple instructions in parallel [2 pp. 147, 261]). Branch prediction occurs when a program reaches a conditional instruction (if-then-else or switch). The processor tries to identify the branch that is most likely to be taken, pre-fetches its code and speculatively executes it, eventually discarding the work when the prediction turns out not to be the branch effectively taken by the program.
Although prediction techniques have often proven successful and resulted in improved performance, there will be cases where the chosen branch is the wrong one (branch misprediction) and the delay or penalty incurred by the program will be considerable.
• 25. Parallel XML parser - 25 -
Branch optimization is a technique for reducing branch misprediction. In general, performing this optimization before design and implementation is considered a premature optimization.
2.3.4 Cache optimization
Trade-offs in the cost-performance-size of memory technologies have led to the appearance of the memory hierarchy [2 p. 72], and the cache memory plays an important role in performance improvement. The cache is a relatively fast, small and expensive memory that is placed on the same chip as the processor, thus reducing latency when accessed by the CPU.
[Figure: the memory hierarchy trade-offs, from registers and caches (small, under 1 KB for registers; fast, around 5 ns; expensive) down to main memory and disk storage (large, over 1 GB; slower, around 5 ms for disk; cheaper)]
Figure 2-3 Trade-offs in the cost-performance-size of memory
Optimization of cache performance often brings more benefit to programs than other optimization techniques. There are many such cache optimization techniques [2 p. 78], but they all usually influence a few important cache optimization metrics.
The hit ratio is a cache optimization metric. It represents the number of memory references that hit the cache over the total number of memory references. The miss rate is the equivalent metric, i.e. miss rate = 1 - hit ratio. One important class of miss is the compulsory miss, which happens at the very first access to a memory block, because the block has not previously been referenced. An LLC miss is a miss that occurs in the last-level layer of the cache memory; it is the one incurring the most performance degradation.
Knowing the number of cache levels, the cache size, and the cache layout of a processor can help tailor a program that will run efficiently on computers with such a processor.
• 26. Parallel XML parser - 26 -
The problem with this approach is that the program might be inefficient when running on a different type of processor. Cache-oblivious optimization refers to the practice of cache optimization based on general principles of caches, such as the principle of locality (see § 2.3.5), or on other techniques such as divide-and-conquer, rather than on a particular cache configuration or size.
2.3.5 Principle of Locality and Common case
Programs tend to reuse data and instructions they have used recently [2 p. 45]. This affirmation comes from observations showing that a program usually spends 90% of its execution time in only 10% of its code (the 90/10 rule); that 10% is the program hotspot. There are two types of locality: temporal locality concerns code or data accessed recently, and spatial locality (or locality of reference) concerns data or code addresses that are near one another. Cache optimization techniques rely heavily on this property.
It is possible for an informed programmer to increase the performance of a program simply by exploiting this valuable property appropriately. One such situation arises with loop-based structures. Because a program with a loop will likely return to the same code portion many times, appropriate use of the principle of locality will often yield significant performance improvements (a small illustration follows at the end of this section).
A similar principle, referred to as the common case by [2 p. 45], is a design principle that favours the frequent case over the infrequent case when deciding about performance trade-offs. When optimizing an algorithm, it is often more beneficial to identify the frequent case (such as a branch often taken in a switch conditional statement) and focus on optimizing that case first.
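As an illustration of spatial locality (not taken from the thesis), the sketch below sums the same row-major matrix twice; only the traversal order changes, but it determines how well the memory accesses follow the layout that the cache exploits.

    #include <cstddef>
    #include <vector>

    // Row-by-row traversal visits consecutive addresses (good spatial locality).
    double sum_row_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
        double total = 0.0;
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                total += m[r * cols + c];      // consecutive addresses
        return total;
    }

    // Column-by-column traversal jumps 'cols' elements at each step; on large
    // matrices this tends to cause many more cache misses.
    double sum_column_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
        double total = 0.0;
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                total += m[r * cols + c];      // strided addresses, poor locality
        return total;
    }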
• 27. Parallel XML parser - 27 -
2.4 Elements of program concurrency
2.4.1 Threads and cores
In a computer, a unit of sequential processing can be a thread or a process. The process is more 'robust' because it benefits from operating system support in terms of security; it encapsulates and protects all its internal structures and executes within its own memory space, while a thread executes in a memory space shared with other threads.
A program is a multi-process or concurrent program if it allows execution of more than one process or thread in parallel. As such a program aims at a unique goal, these processes or threads usually have to communicate. Communication among processes is typically achieved through message passing, while threads communicate through shared memory, a memory location they can all access, used as a medium for their communication. This thesis will focus on threads and shared memory, and the term concurrent program will be preferred for designating a multi-process program.
A computer is a multiprocessor or multi-core computer when it has more than one processor. Concurrent programs fit well on multiprocessor computers, but they can also run on a single-processor computer; in such a case, the operating system arranges for the processor to serve the threads successively, in turn. A context switch occurs when the operating system switches the processor from one thread to another. Context switches are expensive (in terms of CPU time) and so often contribute considerably to overall performance degradation. With many cores, a context switch may still occur, although with less probability; its impact can be considerably reduced if the program is written so that threads are evenly distributed among processors and their cache memories.
On multiprocessor computers, access to data becomes hard to manage in the presence of concurrent programs and is often the cause of hurdles and bottlenecks. Access contention, for instance, is one of the issues; it happens when data written by one thread is read by another thread on another core. The impact of access contention is reflected on the cache memory and causes problems such as false sharing. A typical case of false sharing is when two algorithms running in parallel on different cores use two variables that are logically separate but inadvertently placed in memory locations near one another. The caching algorithm, by the principle of spatial locality (see § 2.3.5), will always try to treat them together, forcing them to move from one dedicated cache to another, thus increasing the miss rate [2 p. 366].
2.4.2 Synchronization and concurrent objects
In a concurrent program, two threads competing or cooperating towards a common goal may need to access the same space, called shared memory, but an inappropriate access will affect the program's integrity or consistency and lead to a hazardous situation called a race condition. This problem, identified as the mutual exclusion problem, was solved decades ago by E. W. Dijkstra [49], who provided synchronization as its solution. Synchronization is a set of rules and mechanisms that allows the specification and implementation of concurrent programs whose executions are guaranteed to be correct [28 p. 5], or to have a degree of correctness called liveness or progress conditions [28 p. 137].
• 28. Parallel XML parser - 28 -
At a certain level of abstraction, a program is made of elements that participate in its execution; in some programming paradigms, they are referred to as objects. A concurrent object is an object that can be safely accessed concurrently by several threads without requiring explicit synchronization; the object is said to be thread safe.
A mutex is an example of such a concurrent object. It defines lock and unlock methods that can be called by many threads. Once a thread calls the lock method of a mutex, it has acquired that mutex; any other thread trying to acquire it will block (its execution is suspended) until the mutex is released by the thread that locked it, by calling its unlock method. The mutex can be used to ensure that only one thread enters the area of program code between its lock and unlock calls, making that region what is referred to as a critical section.
A condition variable is also a concurrent object. Its C++ specification defines the methods wait, notify_one and notify_all, which can be called by one or many threads. Any thread that calls a condition variable's wait method is blocked. Threads blocked by a condition variable are said to be waiting, as they can be unblocked only upon a defined condition being met. The condition can be either that a predicate previously assigned to the condition variable becomes true, or that a notification for release is sent to the condition variable from another thread, by calling its notify_one method (to unblock only one of the waiting threads) or its notify_all method (to unblock all the waiting threads).
Mutexes and condition variables belong to a type of concurrent objects called synchronization primitives, as they can be used to build synchronization constructs (see § 2.4.3 below). The C++ language provides classes to work with these concurrent objects (std::mutex and std::condition_variable) and with threads (std::thread). The std::thread class allows the creation and manipulation of threads. One of its methods is the join method, used to synchronize the execution of threads. In practice this method is used to make a thread wait for another thread to complete before continuing its own execution (see the ThreadJoiner class in § 3.6.3).
2.4.3 Lock-free and lock-based synchronization
Synchronization can be implemented in terms of concurrent objects. There are two types of synchronization, depending on the concurrent objects used to implement it, both types having a set of related progress conditions.
Lock-based synchronization consists of providing a synchronization object called a lock that allows a zone of the code to be bracketed, guaranteeing that a single thread at a time can execute it. It is based on mutexes and their critical sections (see § 2.4.2).
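The following minimal sketch (illustrative only, not PXML code) shows the primitives just described working together: a mutex protecting a critical section, a condition variable blocking a waiting thread until a predicate holds, and join used to wait for both threads to complete.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex m;
    std::condition_variable cv;
    bool ready = false;

    void waiter() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return ready; });   // blocks until the predicate holds
        std::cout << "waiter released\n";
    }

    void notifier() {
        {
            std::lock_guard<std::mutex> lock(m);  // critical section
            ready = true;
        }
        cv.notify_one();                          // unblock one waiting thread
    }

    int main() {
        std::thread t1(waiter);
        std::thread t2(notifier);
        t1.join();   // wait for t1 to complete before continuing
        t2.join();
        return 0;
    }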
• 29. Parallel XML parser - 29 -
When a thread is blocked for synchronization reasons (or for other reasons such as memory access latency), its execution is suspended; this creates an idle time called wait time, which is usually undesirable as it is wasted processor time. There are cases where the wait time consumes CPU time. That is the case in the implementation of a spin lock, where an object tries to acquire the mutex repeatedly and so remains in a busy-wait state. Spin locks have proven to be more efficient than traditional locks in cases where critical-section exclusivity is required only for a short time.
Two progress conditions (see § 2.4.2) can be associated with lock-based synchronization: deadlock-freedom and starvation-freedom. In other words, deadlock and starvation are the two main issues of lock-based synchronization.
Lock-free synchronization is based on atomic registers or hardware-provided primitive operations (e.g. compare & swap). The following progress conditions can be associated with lock-free synchronization: obstruction-freedom, non-blocking and wait-freedom, the latter being the highest level of correctness a synchronization technique can achieve.
2.4.4 Speedup and thread concurrency
Concurrency makes programs run faster by improving their throughput. The speedup tells how much faster a program will run with concurrency applied, as opposed to its single-threaded version [2 p. 46]; it allows estimating the benefit of applying concurrency to a program. Let us consider two parameters:
- EnhancementFraction: the enhancement fraction, which is the fraction of the single-threaded version that can be converted to run with multiple threads
- ImprovementRatio: the improvement ratio, which is the time taken by the single-threaded version of that fraction divided by the time spent running it with multiple threads.
The speedup formula is

  Speedup = 1 / ((1 - EnhancementFraction) + EnhancementFraction / ImprovementRatio)

The speedup formula is derived from Amdahl's law. The law states that the performance improvement to be gained from using some faster mode of execution of a program is limited by the fraction of the time the faster mode can be used [2].
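A small, illustrative computation of this formula is shown below; the numbers are arbitrary and only serve to show how a limited EnhancementFraction caps the achievable speedup.

    #include <iostream>

    // Speedup as defined above (Amdahl's law): fraction is the enhanced
    // (parallelisable) fraction of the program, ratio the improvement ratio.
    double speedup(double fraction, double ratio) {
        return 1.0 / ((1.0 - fraction) + fraction / ratio);
    }

    int main() {
        // Illustrative numbers only: if 80% of the work can run with an
        // improvement ratio of 4, the overall speedup is 2.5x. Even with an
        // infinite improvement ratio it could never exceed 1 / (1 - 0.8) = 5x.
        std::cout << speedup(0.8, 4.0) << "\n";   // prints 2.5
        return 0;
    }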
• 30. Parallel XML parser - 30 -
For a concurrent program, one obtains the best ImprovementRatio when the maximum number of threads is effectively running in parallel and all threads are performing useful work. EnhancementFraction plays an important role in the speedup equation, as a small value limits the possible improvement for any given ImprovementRatio, while a value approaching 1 increases the potential for a better speedup.
2.5 Design patterns
The multithreading revolution has led to the identification of new design patterns [6], [7]. Some are concurrency counterparts of the traditional design patterns described by the Gang of Four [50]; others are simply new design patterns specific to parallel computing. This section describes two traditional design patterns and four concurrency design patterns used in this work.
2.5.1 Strategy pattern
The intent of the strategy pattern is to define a family of algorithms and make them interchangeable; one of the motivations is that different algorithms may be appropriate at different times [50 p. 315]. For example, consider a Context that needs to scan some text, but needs different variants of the scanning algorithm at different times. Different scan strategies can be implemented (ScanStrategy1, ScanStrategy2 and ScanStrategy3) and the Context can use them interchangeably through the ScanStrategy interface.
[Figure: UML, the Context holds a ScanStrategy and delegates scan() to it; ScanStrategy1, ScanStrategy2 and ScanStrategy3 each implement scan()]
Figure 2-4 The strategy pattern
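A minimal C++ sketch of this pattern, using the names of Figure 2-4, could look as follows; the method signatures are assumptions made for illustration only.

    #include <iostream>
    #include <memory>
    #include <string>

    class ScanStrategy {
    public:
        virtual ~ScanStrategy() = default;
        virtual void scan(const std::string& text) = 0;
    };

    class ScanStrategy1 : public ScanStrategy {
    public:
        void scan(const std::string& text) override {
            std::cout << "variant 1 scanning: " << text << "\n";
        }
    };

    class ScanStrategy2 : public ScanStrategy {
    public:
        void scan(const std::string& text) override {
            std::cout << "variant 2 scanning: " << text << "\n";
        }
    };

    class Context {
    public:
        // The strategy is interchangeable at runtime: no conditional statement
        // is needed to select the scanning algorithm.
        void setStrategy(std::unique_ptr<ScanStrategy> s) { strategy_ = std::move(s); }
        void scan(const std::string& text) { if (strategy_) strategy_->scan(text); }
    private:
        std::unique_ptr<ScanStrategy> strategy_;
    };

    int main() {
        Context ctx;
        ctx.setStrategy(std::make_unique<ScanStrategy1>());
        ctx.scan("<flat>300</flat>");          // uses the first variant
        ctx.setStrategy(std::make_unique<ScanStrategy2>());
        ctx.scan("<flat>300</flat>");          // switched at runtime
        return 0;
    }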
• 31. Parallel XML parser - 31 -
A remarkable benefit of the strategy pattern is that it represents an alternative to conditional statements [50 p. 315]. The conditional statement can be replaced by the strategy assignment, with each branch moved into its own strategy.
2.5.2 Observer pattern
The observer pattern is needed when a change to an object, called the subject, needs to be watched by other objects (called observers). The interaction between the subject and the observers is known as publish-subscribe: the subject is the publisher of notifications, and any number of observers can subscribe to receive them [50 p. 294].
[Figure: UML, the Subject keeps a List<Observer> and offers subscribe(Observer), unsubscribe(Observer) and notify(); Observer declares update(UpdateData data), implemented by ConcreteObserver; ConcreteSubject specializes Subject]
Figure 2-5 The observer pattern
2.5.3 Active object pattern
The intent of the active object pattern is to have objects with methods that can be invoked asynchronously in one thread and executed in a different thread. The pattern decouples method invocation from method execution [51]. An active object can be an object that resides and runs in its own thread (thread of execution) independently of the thread that created it (thread of creation). Implementations of an active object usually provide a means for the thread of execution to communicate the results or outcome of the execution back to the thread of creation.
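A crude illustration of this decoupling (not PXML code) can be written with std::thread, std::promise and std::future: the invocation returns immediately in the creating thread, the work runs in a dedicated thread, and the outcome travels back through the future. The class and method names are invented for the example, and for brevity it supports a single invocation.

    #include <future>
    #include <iostream>
    #include <string>
    #include <thread>

    class ActiveScanner {
    public:
        // Invocation: returns at once with a future holding the eventual result.
        std::future<std::size_t> countChars(std::string text) {
            std::promise<std::size_t> promise;
            std::future<std::size_t> result = promise.get_future();
            worker_ = std::thread(
                [p = std::move(promise), t = std::move(text)]() mutable {
                    p.set_value(t.size());   // execution in the worker thread
                });
            return result;
        }
        ~ActiveScanner() { if (worker_.joinable()) worker_.join(); }
    private:
        std::thread worker_;
    };

    int main() {
        ActiveScanner scanner;
        auto result = scanner.countChars("<flat>300</flat>");
        // The creating thread could do other work here, then collect the outcome.
        std::cout << "scanned " << result.get() << " characters\n";
        return 0;
    }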
• 32. Parallel XML parser - 32 -
2.5.4 Monitor pattern
The intent of the monitor pattern is to synchronize concurrent method executions to ensure that only one method at a time runs within an object [51]. In addition to being mutually exclusive within the object, the method executions are also pre-conditioned on some predicate being verified.
The monitor is a higher-level concurrency construct that requires a certain number of synchronization primitives to participate in its construction. An essential participant is a mutex, a synchronization object providing mutual exclusion.
2.5.5 Thread pool pattern
Multithreading allows running multiple tasks in parallel, each task running within one thread. The thread pool organization is needed when the number of tasks to run in parallel is much higher than the number of available threads. A thread pool mechanism typically consists of inserting the tasks into an internal data structure such as a queue or stack [29 pp. 223, 245], then letting each thread fetch a task from the queue, run it and proceed with another task until the data structure is empty. As threads competing for tasks may cause race conditions, the thread pool requires synchronization mechanisms that allow thread-safe insertion and retrieval of tasks into and from the queue.
2.5.6 Thread-local storage pattern
In a multi-threaded environment, threads share the same memory space and need synchronization to access a shared memory location. There are situations where each thread requires the same functionality, which can be implemented using the same program variable, but which does not need to be shared with other threads. The thread-local storage pattern allows multiple threads to have access to a single definition of an object; but instead of having the object instance shared by all the threads, it arranges for each thread to have its own copy of the object instance, kept internal to the thread's execution context. Although the object definition appears to be global, any reference to it will be a reference to a unique, local version internal to the thread accessing it.
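In C++ this pattern is directly supported by the thread_local storage duration; the short sketch below (illustrative only) shows two threads incrementing what looks like a single global counter while each actually updates its own copy, so no synchronization is needed.

    #include <iostream>
    #include <string>
    #include <thread>

    // Each thread sees its own copy of counter, despite the global-looking definition.
    thread_local int counter = 0;

    void work(int id) {
        for (int i = 0; i < 1000; ++i) ++counter;   // no lock required
        std::cout << ("thread " + std::to_string(id) +
                      " counted " + std::to_string(counter) + "\n");  // 1000 each
    }

    int main() {
        std::thread t1(work, 1), t2(work, 2);
        t1.join();
        t2.join();
        return 0;
    }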
• 33. Parallel XML parser - 33 -
2.6 Putting it all together
The operational mode of a SAX parser is similar to that of a pushdown automaton or PDA [39 p. 109]. The PDA has a number of states defining its stable conditions (PDA states) and another set of states arranged in an internal stack (stack states). An input to the PDA can cause a transition if the combination of the input with the current PDA state and the top stack state has a corresponding pair formed by a PDA state and a stack state. The relationship between all possible input sets (the input, the PDA state and the stack state) and their corresponding pairs (a PDA state and a stack state) is the transition function or transition table.
The implementation of the parser will thus inevitably make use of conditional statements to define transitions, by comparing the current state, the stack state and the input for a possible match in the transition table. The proposed PXML parser will use the strategy pattern in its implementation to eliminate a number of these conditional statements and replace them with a strategy algorithm. The implemented strategy pattern will achieve both a performance optimization goal, which is the reduction of branch misprediction (see § 2.5.1), and an organizational goal, which is achieving dynamic polymorphism, as strategies will be interchangeable at runtime.
As discussed in § 1.1.3, this thesis adopts a divide-and-conquer strategy that consists of the parser cutting the XML document into multiple parts or chunks and parsing them in parallel. The number of chunks should be in reasonable proportion to the number of resources available for parsing them in parallel, that is, the number of processor cores and the number of threads. However, the number of chunks will typically be higher than the number of cores or threads. The thread pool pattern will allow a balanced distribution of chunks to the available resources (see § 2.5.5).
The parser will have to process each chunk within its dedicated thread; for each chunk, the thread function and the chunk details constitute an active object, as it will be evolving in a multi-processing environment different from its thread of creation (see § 2.5.3). The active objects will need to interact, as they are participating in the common goal of parsing an XML document; the monitor pattern will allow synchronized interaction between them (see § 2.5.4).
The most important requirement of this parser will be the support of concurrency. In a SAX parser, events come in sequential order, as they appear in the document being parsed. For the proposed parser, events will come in a non-sequential order. The parser will have to provide contextual information for each chunk to help the user reorder the events for meaningful use. The thread-local storage pattern will help the library make the contextual information local to its chunk, so that it is accessible safely within the chunk without requiring synchronization (see § 2.5.6).
• 34. Parallel XML parser - 34 -
3 Design and implementation
3.1 Introduction
The SAX parser that this thesis proposes has two functional requirements:
- Conformance to the SAX specification, and
- Support of concurrency
Conformance to the SAX specification is the easier part. The SAX API is simple and already has a fully compliant implementation in Java. The work in this thesis is about providing a C++ binding for the Java implementation.
The support of concurrency needs to be implemented without affecting the conformance to the SAX specification. That is, the modification of the existing SAX implementation classes in order to achieve concurrency must stay within the extent allowed by the standard, and classes added for the sake of achieving parallel parsing should be properly encapsulated.
This thesis suggests grouping the classes of this design and implementation into three groups, according to their contribution to the requirements:
- SAX classes: all classes coming from the SAX specification. They are used to implement the 'visible' part of the SAX parser, in contrast to other classes that will be encapsulated. These classes address one of the functional requirements, namely conformance to SAX. See § 3.3.1 and § 3.4 for design and implementation.
- PXML classes: classes that implement the concepts of the proposed parser, such as the chunking and parsing scanners, and the algorithms used for achieving parallel parsing, such as the chunking and parsing loops. See § 3.3.2 and § 3.5 for design and implementation.
- Concurrency classes: additional classes added for concurrency support; they keep the whole system consistent with regard to the introduced multi-processing. They address non-functional requirements such as synchronization, thread safety and thread concurrency. See § 3.3.3 and § 3.6 for design and implementation.
This chapter presents the design and implementation of the essential classes of each of the above-mentioned groups; it brings out the relationships among them and their interaction in achieving the proposed objective. The SAX specification and concurrency principles were introduced at length in previous chapters; this chapter begins with a presentation of the fundamental concept of the proposed PXML parser.
• 35. Parallel XML parser - 35 -
3.2 Fundamental concept
The proposed PXML parser will divide the XML document into chunks so that it can parse them in parallel using multiple threads, and so increase the XML parsing speed. The "chunking" algorithm is not directly based on physical properties such as chunk size, although the aim is to obtain chunks of similar sizes; it is rather a markup-aware chunking, based on the logical structure of the XML document (see § 2.1.2).
As far as markups are concerned, different parts of an XML document require different processing effort. For instance, within an element no comment or CDATA section can be expected, and the characters used for elements belong to a limited set of allowed characters; within content, on the other hand, the vast majority of markups and productions can be expected, including other elements. This difference means the parser needs more effort to parse a chunk that is part of content than a chunk that is part of an element, even if they have the same physical size.
At byte level, however, size is what matters most. Because the file is read byte by byte (or in groups of bytes if a buffer is used), the reading effort will always be directly proportional to the size of the chunk, the logical structure of the document being meaningless at this level. The markup-aware chunking will try to determine the right balance: it performs the chunking taking into account the logical structure of the chunks, while striving to make them of equal size to obtain a balanced repartition of the parsing effort.
Because XML documents store data, they are usually made of a sequence of elements that, like database records, contain similar content. Considering this fact as a "common case" (see § 2.3.5), the proposed PXML parser bases its chunking algorithm on the following assumption:
For most XML documents, there is a depth at which elements start repeating in similar shapes, hence in similar sizes.
Once the parser identifies that depth, it performs the chunking of the XML document solely based on its logical structure. It counts on the natural, 'common case' fact that the structure of the XML document will be a repetition of elements of similar size, and so obtains chunks that are balanced both in size and in markups.
3.2.1 Scanner types and chunk allocation
The parser has to process the XML document to some extent in order to identify the right depth and define the chunking locations. If, in addition, the parser needs to parse the produced chunks again, there is a double processing. However, the two passes have different goals, hence different parsing algorithms and thus different speeds.
• 36. Parallel XML parser - 36 -
The proposed PXML parser will use two scanners, each dedicated to a particular processing. The chunking scanner will scrutinize the file in order to identify chunk locations. The element is the basis on which the scanner divides chunks, to minimize the need for any communication between chunks. For a given depth, the chunking scanner's mission will be to identify the XML elements at that depth. The parsing scanner will receive chunk information (start-tag locations) from the chunking scanner and properly parse the chunks, following the full set of XML rules. Multiple parsing scanners can parse chunks in parallel.
The success of the PXML algorithm relies on the fact that the chunking scanner is faster than the parsing scanners. The chunking scanner has a much smaller set of XML rules to comply with, does not need to parse the internal content of elements and, most importantly, does not trigger any external event. Because the chunking scanner completes its job earlier, the concurrent parsing of chunks compensates for the double-scanning overhead.
[Figure: a sample XML document (MyTourAgency) whose prolog is handled by the prolog scanner and whose elements are allocated by the chunking scanner to four parsing scanner threads (#1 to #4)]
Figure 3-1 Chunks allocation to different scanners
Figure 3-1 shows a possible allocation of chunks per scanner. Notice the presence of another scanner, the prolog scanner, to which the prolog is allocated. Because the prolog has specific rules that differ from the remainder of the file, it requires a different parsing algorithm, thus a different type of scanner.
• 37. Parallel XML parser - 37 -
3.2.2 Parsing properties and parsing modes
Three PXML parser properties help the user control the parsing: the pool configuration, the chunking depth and the siblings per chunk.
The chunking depth property represents the depth of the element from which the parser will start cutting the XML document into chunks. Once the parser identifies the first element at the chunking depth, the chunking scanner starts cutting all its following-sibling elements (see § 2.1.3) until it reaches the closing tag of the root element.
[Figure: the same sample XML document with the prolog handled by the PrologScanner and the whole root element handled by a single ParsingScanner]
Figure 3-2 Chunks allocation in the single-threaded parsing mode
The siblings per chunk property represents the number of sibling elements to be part of each chunk. Because the number of elements will typically be much higher than the number of threads, the siblings per chunk indirectly allows fine-grained control over the number of chunks.
The pool configuration property determines the parsing mode of the parser (see § 1.1.2). For a given value, the pool configuration property sets the parser in the corresponding parsing mode:
- single-threaded parsing mode (pool configuration equals -1)
- multi-threaded automatic parsing mode (pool configuration equals 0)
- multi-threaded manual parsing mode (pool configuration > 0)
In the single-threaded mode, the parser does not need multi-threading and does not use the chunking scanner; it uses the parsing scanner with only one thread to parse the whole document, with the root element as the only chunk, as shown in Figure 3-2 above. This parsing mode is typically suitable for small files.
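The mapping from the pool configuration value to the parsing mode can be summarized by the small helper below; the function and enumeration names are hypothetical, only the -1 / 0 / greater-than-0 rule comes from the text above.

    // Hypothetical helper illustrating how the pool configuration value
    // selects the parsing mode; names are assumptions, not PXML API.
    enum class ParsingMode { SingleThreaded, MultiThreadedAutomatic, MultiThreadedManual };

    ParsingMode modeFromPoolConfiguration(int poolConfiguration) {
        if (poolConfiguration == -1) return ParsingMode::SingleThreaded;
        if (poolConfiguration == 0)  return ParsingMode::MultiThreadedAutomatic;
        return ParsingMode::MultiThreadedManual;   // poolConfiguration > 0
    }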
• 38. Parallel XML parser - 38 -
In the multi-threaded mode, the pool configuration controls the thread concurrency of the parser by directly setting the number of threads the parser will use. The multi-threaded mode can be manual (number of threads controlled by the user) or automatic (number of threads controlled by the PXML library).
3.2.3 From bytes to SAX events
From bytes to characters, and from characters to SAX events, the parsing is accomplished through the composition of two pushdown automata or PDAs (see § 2.6): the transcoder and the scanner.
The Transcoder consumes bytes and recognizes valid XML characters (or, more precisely, code points). Its transition function is a subset of the encoding specification that the parser is using, such as UTF-8 or UTF-16 (because not every UTF-8 or UTF-16 character is a valid XML character).
The Scanner is a PDA that operates at a higher level than the transcoder; it consumes the characters produced by the transcoder and recognizes valid markups or productions. Here the transition function is the XML specification itself.
The Reader is a filter that selects only markups and productions that comply with the SAX specification in order to create and trigger SAX events. For instance, the comment production triggers an event only if a LexicalHandler is available (see § 2.2.3). The reader has access to the SAX handler classes in order to trigger their callback methods as SAX events.
[Figure: the pipeline bytes → Transcoder (PDA, UTF-8/UTF-16 encoding and XML specifications) → characters → Scanner (PDA, XML specification rules) → markups → Reader (filter, SAX specification rules) → SAX events such as startElement()]
Figure 3-3 Main parser components and their responsibility
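The sketch below illustrates this composition with simplified C++ interfaces. The method names convert_bytes and consume_char come from the PXML design (see § 3.3.2), but the parameter lists, the XmlCh alias and the ASCII-only decoding are assumptions made to keep the example short.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    using XmlCh = char32_t;   // one XML code point (simplified)

    class XmlScanner {
    public:
        virtual ~XmlScanner() = default;
        virtual void consume_char(XmlCh c) = 0;   // driven by the transcoder
    };

    class XmlTranscoder {
    public:
        explicit XmlTranscoder(XmlScanner& scanner) : scanner_(scanner) {}
        // Decode bytes into code points and feed them to the scanner. Only the
        // ASCII subset of UTF-8 is handled here; a real transcoder also decodes
        // multi-byte sequences and rejects invalid XML characters.
        void convert_bytes(const std::uint8_t* bytes, std::size_t size) {
            for (std::size_t i = 0; i < size; ++i)
                scanner_.consume_char(static_cast<XmlCh>(bytes[i]));
        }
    private:
        XmlScanner& scanner_;
    };

    // Toy scanner: merely counts the '<' characters it receives.
    class TagCountingScanner : public XmlScanner {
    public:
        void consume_char(XmlCh c) override { if (c == U'<') ++tags_; }
        std::size_t tags() const { return tags_; }
    private:
        std::size_t tags_ = 0;
    };

    int main() {
        const char* doc = "<flat>300</flat>";
        TagCountingScanner scanner;
        XmlTranscoder transcoder(scanner);
        transcoder.convert_bytes(reinterpret_cast<const std::uint8_t*>(doc), 16);
        std::cout << scanner.tags() << " '<' characters seen\n";   // prints 2
        return 0;
    }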
• 39. Parallel XML parser - 39 -
3.3 Class relationship and interaction
The UML class diagram in Figure 3-4 (page 40) shows the most important PXML classes and the relationships between them; Figure 3-5 (page 42) represents the PXML classes' interaction during multi-threaded parsing.
3.3.1 SAX classes
The PXML parser provides XMLReaderImpl as the implementation of the XMLReader interface and AttributesImpl as the implementation of the Attributes interface; it leaves the implementation of the SAX handlers to the library users. PXmlCountHandler is an example SAX handler implementation of the ContentHandler interface; it is not a library class but is part of the PXmlCount test program.
Taken in isolation, the SAX classes constitute an observer pattern (see the observer pattern in § 2.5.2), with XMLReader (Subject), XMLReaderImpl (ConcreteSubject), ContentHandler (Observer) and PXmlCountHandler (ConcreteObserver) as participants.
The XMLReaderImpl is a central class of the parser concept. It contains the parse and concurrentParse methods, which define the chunking algorithm and the concurrent parsing algorithm respectively. Its C++ implementation is discussed in § 3.4.2.
3.3.2 PXML classes
The abstract classes XmlTranscoder and XmlScanner are generalizations of the central PXML concepts of transcoder and scanner (see § 3.2.3). The XmlTranscoder defines an interface for converting from bytes to characters. Its subclasses TranscoderUtf8 and TranscoderUtf16 implement the convert_bytes method. The XmlScanner defines a family of algorithms for converting from characters to markups. Its subclasses PrologScanner, ChunkingScanner and ParsingScanner provide their implementations of the consume_char method. The relationship between these classes is visible in the UML diagram as a cascade of two strategy patterns (see the strategy pattern in § 2.5.1).
The strategy pattern for conversion from bytes has XMLReaderImpl (Context), XmlTranscoder (Strategy), TranscoderUtf8 (ConcreteStrategy) and TranscoderUtf16 (ConcreteStrategy) as participants.
• 40. Parallel XML parser - 40 -
The strategy pattern for scanning has XmlTranscoder (Context), XmlScanner (Strategy), PrologScanner (ConcreteStrategy), ChunkingScanner (ConcreteStrategy) and ParsingScanner (ConcreteStrategy) as participants.
[Figure: UML class diagram of the main PXML classes:
XMLReader: - ContentHandler* handler; + setContentHandler(handler), + getContentHandler(), + parse()
XMLReaderImpl: + parse(), - concurrentParse()
ContentHandler: - ChunkContext* context; + startElement(…, Attributes attr)
PXmlCountHandler: + startElement(…)
Attributes: + getValue(), + getQName()
AttributesImpl: + addAttribute()
XmlTranscoder: - XmlScanner* scanner; + setXmlScanner(scanner), + getXmlScanner(), + convert_bytes()
TranscoderUtf8, TranscoderUtf16: + convert_bytes()
XmlScanner: - XMLReader* reader, - DocumentState docState, - InternalState intState; + consume_char()
PrologScanner, ChunkingScanner, ParsingScanner: + consume_char()
ThreadPool: + threadFunction(), + submitTask()
ThreadSafeQueue<T>: + push(), + wait_and_pop()
ChunkTask, ThreadJoiner
ChunkContext: + thread_index, + chunk_index, + get_current_depth()]
Figure 3-4 PXML class diagram
• 41. Parallel XML parser - 41 -
3.3.3 Concurrency classes
The ThreadPool's responsibility is to create and maintain an appropriate number of threads, each used for parsing chunks of the XML document; for this it uses the classes ThreadSafeQueue, ChunkTask and ThreadJoiner. The UML class diagram in Figure 3-4 shows that a composition aggregation links them, where the ThreadPool owns the other classes.
The ThreadSafeQueue is the internal container used to store the chunking data for each chunk that the parser creates. Chunking data means all the information needed to perform the parsing of that chunk independently; ChunkTask is the class that gathers this information. So the ThreadSafeQueue is a parameterized, thread-safe container of ChunkTask instances. The ThreadJoiner class is used to ensure cooperation between the thread containing the chunking scanner and the parsing scanner threads.
The ChunkContext does not participate in the thread pool organization; it is used to provide chunk information to the library user. It is created by the ParsingScanner class, and both are attached to a particular chunk. It is a member of the ContentHandler so that it is available to the handler callback functions, which are accessible to the library users.
3.3.4 Class interaction and scanning loops
Figure 3-5 represents the interactions between PXML classes during parsing in multi-threaded mode with two threads. The sequence diagram starts with an instance of ContentHandler and an instance of XMLReaderImpl, the reader. Upon the call of its parse method, the reader creates XmlTranscoder and PrologScanner instances, and then sets the scanner to receive characters from the transcoder. The transcoder-scanner tandem performs a prolog loop, which consists of the transcoder calling convert_bytes and the prolog scanner calling consume_char in a loop until the scanner recognizes the root element and notifies the reader by setting isRootElementFound to true.
Because the parsing mode is multi-threaded (pool_config = 2), the reader creates the ThreadPool, which creates two threads and waits for the reader to submit ChunkTask instances for parallel parsing. The reader continues with a ChunkingScanner, which replaces the prolog scanner in the transcoder, forming what is now a chunking loop. Any time the scanner recognizes a chunk position, according to the parsing properties, it notifies the reader by setting isChunkPosition to true. The reader then collects the chunking information as a ChunkTask and submits it to the ThreadPool.
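A minimal sketch of such a parameterized thread-safe queue, with the push and wait_and_pop operations shown in Figure 3-4, is given below; the internals (a mutex and a condition variable, following the monitor pattern of § 2.5.4) are an assumption for illustration, not the actual PXML implementation.

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    template <typename T>
    class ThreadSafeQueue {
    public:
        void push(T value) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                queue_.push(std::move(value));
            }
            cond_.notify_one();                 // wake one waiting consumer
        }

        // Block until an element is available, then remove and return it.
        T wait_and_pop() {
            std::unique_lock<std::mutex> lock(mutex_);
            cond_.wait(lock, [this] { return !queue_.empty(); });
            T value = std::move(queue_.front());
            queue_.pop();
            return value;
        }

    private:
        std::queue<T> queue_;
        std::mutex mutex_;
        std::condition_variable cond_;
    };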
• 42. Parallel XML parser - 42 -
[Figure: UML sequence diagram of a multi-threaded parse. The handler (ContentHandler) and the reader (XMLReaderImpl) interact first with an XmlTranscoder and a PrologScanner, then with a ChunkingScanner and the ThreadPool; each of the two pool threads runs its own concurrentParse with its own XmlTranscoder and parsing scanner, looping on convert_bytes()/consume_char(), receiving submitTask() calls and raising the SAX callbacks startDocument(), startElement(), endElement() and endDocument(); the guards isRootElementFound, isChunkPosition and isEndOfChunk break the loops.
Parsing properties: pool_config = 2, chunking_depth = 1, siblings_per_chunk = 1.
Remark: the convert_bytes() and consume_char() operations are marked with * to denote that they are iterative operations within a loop, but the loop fragment is not represented for the sake of diagram clarity.]
Figure 3-5 PXML sequence diagram
• 43. Parallel XML parser - 43 -
The reader continues with the chunking loop until it reaches the end of the document element, then waits for the ThreadPool to finish its tasks before acknowledging the end of the document to the user with the SAX endDocument callback.
When a thread within the ThreadPool receives a ChunkTask, it creates XmlTranscoder and ParsingScanner instances to form a transcoder-scanner tandem performing a parsing loop. The loop consists of the transcoder calling convert_bytes and the parsing scanner calling consume_char in a loop until the scanner reaches the end of the chunk and notifies the thread by setting isEndOfChunk to true.
PXML considers the prolog loop, the chunking loop and the parsing loop as specializations of the general concept of a scanning loop, each having a loop breaker (isRootElementFound, isChunkPosition and isEndOfChunk respectively). The sequence diagram illustrates the definition of the PXML algorithm:
The PXML algorithm consists of a prolog loop, then a chunking loop, and one or many parsing loops running in parallel.
3.4 Implementation of SAX classes
3.4.1 Characters, String and C++ binding
Because the PXML implementation of the SAX parser is not in Java, the library has to provide a binding, the C++ equivalent of the objects used in Java. The main difference concerns the string type. The C++ language provides an equivalent of the Java String (std::string), but in order to optimize the performance of the parser and, most importantly, to provide the right representation of the Unicode encoding (not handily provided by C++), the library provides its own string types. PXML defines the following type and classes:
- XmlCh: an XML character type, or more precisely an XML code point [42]
- XmlChar: a class for XML character or code point manipulation
- XmlBuffer: a class that represents a dynamic string
UTF-8 and UTF-16 allow representation of all their symbols using at most 4 bytes of memory storage, or a pair of 2-byte storage units [42].