Table of Contents
Smackdown.............................................................................................................................................3
Introduction ............................................................................................................................................5
Search features: 1:1 comparison ...........................................................................................................6
Feature 1: Getting Started ......................................................................................................................7
Feature 2: Operations and Management ...............................................................................................9
2.1 Backup...........................................................................................................................................9
2.2 System upgrades and patch management .................................................................................10
2.3 Re-indexing .................................................................................................................................11
Feature 3: Monitoring...........................................................................................................................13
Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export ...........................................14
4.1 Schema management .................................................................................................................14
4.2 Dynamic fields.............................................................................................................................14
4.3 Data types ...................................................................................................................................15
4.4 Data import & export..................................................................................................................16
Feature 5: Search and Indexing features..............................................................................................18
5.1 Analyzers, Tokenizers and Token filters......................................................................................18
5.2 Faceting.......................................................................................................................................20
5.3 Auto Suggestion..........................................................................................................................21
5.4 Highlighting.................................................................................................................................23
Feature 6: Multilingual support............................................................................................................24
Feature 7: Protocol & API Support........................................................................................................26
7.1 Request and Response formats ..................................................................................................26
7.2 External Integrations...................................................................................................................26
7.3 Protocols Support .......................................................................................................................26
Feature 8: High Availability...................................................................................................................27
8.1 Replication ..................................................................................................................................27
8.2 Failover........................................................................................................................................29
Feature 9: Scaling..................................................................................................................................32
Feature 10: Customization....................................................................................................................34
Feature 11: More..................................................................................................................................35
11.1 Client libraries...........................................................................................................................35
Feature 12: Cost....................................................................................................................................36
Conclusion.............................................................................................................................................37
Amazon CloudSearch Comparison Report
PAGE 3 of 39
8K Miles 2007-2015 | 1-855-8KMILES (855-856-4537) | info@8kmiles.com
Smackdown

| Feature | Apache Solr | Elasticsearch | Amazon CloudSearch |
|---|---|---|---|
| **Admin Operations** | | | |
| Backup | Replication / custom handler / custom scripts | Snapshot API / custom scripts | Fully managed |
| Patch management | Manual / automated via custom scripts | Manual / automated via custom scripts | Fully managed |
| Re-indexing | Manual | Manual | Fully managed; manual option available from management console |
| Monitoring | If hosted on EC2, Amazon CloudWatch; SaaS monitoring tools like New Relic, Stackdriver, Datadog | If hosted on EC2, Amazon CloudWatch; SaaS monitoring tools like New Relic, Stackdriver, Datadog | CloudSearch default metrics |
| Maintenance | External managed service | External managed service | Fully managed |
| **API** | | | |
| Client library | Java, PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, JavaScript | Java, Groovy, JavaScript, .NET, PHP, Perl, Python, Ruby | Amazon SDK |
| HTTP RESTful API | Yes | Yes | Yes |
| Request format | XML, JSON, CSV | XML, JSON | XML, JSON |
| Response format | XML, JSON, CSV | XML, JSON | XML, JSON |
| Third-party integrations | Available for commercial and open source | Available for commercial and open source | Amazon Web Services integrations available |
| **Search Functions** | | | |
| Schema | Schema and schema-less | Schema and schema-less | Schema |
| Dynamic fields support | Yes | Yes | Yes |
| Synonyms | Yes | Yes | Yes |
| Multiple indexes | Yes | Yes | No |
| Faceting | Yes | Yes | Yes |
| Rich documents support | Yes | Yes | No |
| Auto suggest | Yes | Yes | Yes |
| Highlighting | Yes | Yes | Yes |
| Query parser | Standard, DisMax, Extended DisMax, other parsers | Standard, query_string, DisMax, match, multi_match | Simple, structured, Lucene, or DisMax |
| Geosearch | Yes | Yes | Yes |
| Analyzers, tokenizers and token filters | Default / custom | Default / custom | Default |
| Fuzzy logic | Yes | Yes | Yes |
| Did you mean | Default / custom | Default / custom | No |
| Stopwords | Yes | Yes | Yes |
| Customization | Yes | Yes | No |
| **Advanced** | | | |
| Cluster management | ZooKeeper | In-built | Fully managed |
| Scaling | Vertical / horizontal scaling | Vertical / horizontal scaling | Fully managed horizontal scaling |
| Replication | Yes | Yes | Yes |
| Sharding | Yes | Yes | Yes |
| Failover | Yes, if set up in cluster replica mode | Yes, if set up in cluster replica mode | Fully managed |
| Fault tolerant | Yes, if set up in cluster mode | Yes, if set up in cluster mode | Fully managed |
| **Import and Export** | | | |
| Data import | Default import handlers, custom import handlers | Rivers modules, Logstash input plugins, custom programs | Batch upload |
| Data export | Default export handlers, custom export handlers | Snapshot API | Custom program |
| **Others** | | | |
| Web interface | Solr Admin | Sense | AWS Management Console |
Introduction
In today's world of vast and readily available information, a good search experience is central to a good user experience. Delivering effective search tools has therefore become a key goal for software products, marketplaces, e-commerce websites, and content management systems. Developers looking to deliver a premium search experience to their users should be aware of two broad trends:
1) Open source and platform-based search engines are replacing proprietary search engines because
of better licensing models and community support.
2) The cloud delivery model is winning over the on-premise delivery model because of its scalability, high availability, and operating-expense pricing.
In light of the above trends, the choice of leading candidates for search technology boils down to
three: Apache Solr, Elasticsearch, and Amazon CloudSearch. At 8KMiles, our clients often ask us how
these three choices compare relative to each other. This report aims to make it easy for developers
to pick the right technology for their application by presenting a comprehensive framework for
evaluation of the three options. We have also applied our framework to the top feature sets that are critical to any search workload, broken them down into granular features, and compared each of the three search engines on them.
In this report, we summarize our conclusions and present them in a smackdown-style summary card. We encourage our readers to run a more in-depth evaluation for their specific use cases.
Search features: 1:1 comparison
This section discusses the search features in detail and how they are implemented in Apache Solr, Elasticsearch, and Amazon CloudSearch.
The list below covers the features that most influenced our assessment of the search engines. These features are identified and grouped based on the various operations of a search application.
• Getting started: server setup, search engine installation and configuration
• Operations: backup, patches, re-indexing, monitoring
• Indexing, Search and Query: schema management, field types, dynamic fields, data import/export, analyzers, did-you-mean, facets, autocomplete, spatial search
• High Availability: replicas, failover, self-healing clusters
• Scaling: read scaling, write scaling, partitioning options, replication options
• Protocol & API Support: request and response formats, protocols supported, external integrations
• Customization: data field types, functions
• Others: supported programming languages, administration interface
• Cost: infrastructure cost, support cost, on-going management cost, licensing cost, talent cost
Feature 1: Getting Started
‘Getting Started’ is the first step an engineer takes to understand the basics and major features of a product. In this section, we will see how each of the three search engines facilitates ‘Getting Started’.
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch require end users to spend significant time understanding and setting up the respective search engines. The “Getting Started” manuals of both assume the end user has at least a minimal knowledge of search engines and their related functions and architecture.
The installation processes for Apache Solr and Elasticsearch include tasks such as:
• Server setup
• Search engine download
• Dependent software installations
• Setup of environmental requirements
• Understanding of basic server commands
• Administrative access
Apache Solr and Elasticsearch ship with test examples that allow users to do “warm-up” search and indexing operations. While the default test schema in Apache Solr is sufficient for the user to get started, Elasticsearch’s schema-less design allows the user to send document requests without defining any schema.
Amazon CloudSearch
If you already have an Amazon Web Services (AWS) account, you can create a CloudSearch domain in a few clicks using the AWS Management Console. The console guides administrators through a step-by-step process, requesting input for:
• Instance type
• High availability options
• Replication options
• Schema definitions
• Access policies
Among these options, it is important to note that Amazon CloudSearch does not require all of this information up front: the CloudSearch domain name and engine type are sufficient to create a CloudSearch instance.
The other configurations such as schema, instance type, access policies, and high availability
options can be modified at a later time based on the application requirements.
Users are abstracted from hardware provisioning, software installation, configuration, cluster setup, and other administration activities.
Each domain receives two regional endpoints: a search endpoint and a document endpoint. Both can be accessed using the RESTful API or the AWS Software Development Kit (SDK) with Identity and Access Management (IAM) credentials.
Another important note: CloudSearch’s default access policies for the document service and search service endpoints block all IP addresses. Developers must explicitly authorize the IP addresses that are allowed to access the endpoints.
CloudSearch also provides a sample dataset of IMDb movies that can be used to test-drive the service. The CloudSearch developer documentation walks through the steps to launch a test domain using this sample dataset.
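As a sketch of how little is needed programmatically, the snippet below uses Python with boto3 (an assumption on our part; the helper names and the name-validation rule are illustrative, so verify the exact naming constraints in the CloudSearch documentation) to create a domain:

```python
import re

def valid_domain_name(name):
    # Hypothetical pre-check: CloudSearch domain names use lowercase
    # letters, digits, and hyphens, 3-28 characters long (verify the
    # exact rule in the CloudSearch documentation).
    return bool(re.fullmatch(r"[a-z0-9][a-z0-9-]{2,27}", name))

def create_search_domain(name):
    # Requires boto3 and configured AWS credentials; everything else
    # (instance type, schema, access policies) can be set later.
    import boto3
    if not valid_domain_name(name):
        raise ValueError("invalid CloudSearch domain name: %r" % name)
    return boto3.client("cloudsearch").create_domain(DomainName=name)
```

As the section notes, only the domain name is mandatory at creation time; the remaining configuration can follow once the domain exists.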
Conclusion
Apache Solr and Elasticsearch expect users to have basic practical knowledge of the search engine and to complete a few significant tasks as part of ‘Getting Started’.
With Amazon CloudSearch, the ‘Getting Started’ activities are easier, and end users can have a CloudSearch instance up and running in a few clicks and a few minutes.
Feature 2: Operations and Management
In this section, we’ll discuss some important administrative operations such as
• Index backup
• Patch management
• Re-indexing and recovery
2.1 Backup
Data backup is a routine operation carried out on a defined schedule. It is essential for recovering data quickly from failures such as hardware crashes, data corruption, or related events.
Apache Solr
Apache Solr provides a feature called ‘ReplicationHandler’. Its main purpose is to replicate index data to slave servers, but it can also be used to maintain a backup copy. A replication slave node can be configured against the Solr master and dedicated purely as a backup server, with no other operations taking place on it.
Solr’s built-in replication support exposes ReplicationHandler as an API, with optional parameters such as the backup location, snapshot name, and number of backups to retain. The backup API is bound to storing snapshots on a local disk; any other storage option requires customization.
If you need to store backups in a different location, such as Amazon Simple Storage Service (S3), a local storage server, or a remote data center, ReplicationHandler has to be customized. The Solr core libraries are open source, which allows for such customization.
Elasticsearch
Elasticsearch provides an advanced option called the ‘Snapshot API’ for backing up an entire cluster. The API backs up the current cluster state and its data to a shared repository. The first backup is a complete copy of the data; subsequent backups store only the delta between the fresh data and previous snapshots.
Elasticsearch prompts end users to create a repository, whose type can be a shared file system or one of:
• Amazon S3
• Hadoop Distributed File System (HDFS)
• Azure Cloud
This integration gives developers greater flexibility in managing their backups.
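Registering a repository is a single API call. The Python sketch below builds the registration request; the settings keys follow the S3 repository plugin's documented format, while the endpoint URL, repository name, and bucket are placeholders:

```python
import json

def s3_repository_body(bucket, region):
    # Settings keys follow the S3 repository plugin's documented
    # format; the bucket and region values are placeholders.
    return {"type": "s3", "settings": {"bucket": bucket, "region": region}}

def register_repository(es_url, repo_name, body):
    # A PUT to /_snapshot/<name> registers the repository with the cluster.
    import urllib.request
    req = urllib.request.Request(
        "%s/_snapshot/%s" % (es_url, repo_name),
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)
```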
Backup Process
The backup options in Apache Solr and Elasticsearch can be executed manually or automated. To automate the entire backup process, one has to write custom scripts that call the relevant API or handler. Most engineering teams follow this model of writing custom scripts for backup automation.
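Such a script can be quite small. On the Solr side, a sketch might build and issue the ReplicationHandler backup request described above; here the server address, core name, backup location, and snapshot name are all placeholders:

```python
from urllib.parse import urlencode

def solr_backup_url(base_url, core, location, name):
    # Build a ReplicationHandler backup request for one core; the
    # parameters mirror the optional backup parameters described above.
    params = urlencode({"command": "backup", "location": location, "name": name})
    return "%s/solr/%s/replication?%s" % (base_url, core, params)

def run_backup(url):
    # Issue the request; a scheduler (e.g. cron) would call this nightly.
    import urllib.request
    return urllib.request.urlopen(url).read()
```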
Backups also involve maintenance of the latest snapshots and archives, including key tasks like snapshot retrieval, archival, and expiration.
In an alternative approach, if the Solr or Elasticsearch cluster is set up in replication mode, one of the slave nodes can be designated as the backup server. Automating backups from that slave node again requires a script written by the developer.
Amazon CloudSearch
Amazon CloudSearch inherently takes care of the data that is stored and indexed, leaving a lighter load for engineering and operations teams. CloudSearch manages all data backups itself; the backups are maintained behind the scenes. In the event of a hardware failure or other problem, Amazon CloudSearch restores from backup automatically, and this process is invisible to end users.
Conclusion
The default option in Apache Solr is to back up only to a local disk; it does not offer the other storage options that Elasticsearch does. However, engineers can write their own handlers to manage the backup process.
Elasticsearch is packaged with multiple storage-option plugins, which is an added advantage for engineers.
Amazon CloudSearch relieves users of the intricacies of backup and its management. IT operations or managed-service teams have a smaller role in the backup process, as the entire operation is handled behind the scenes by CloudSearch.
2.2 System upgrades and patch management
Patch management and system upgrades, such as OS patches and fixes, are inevitable in operations and administration. For any system there will always be version upgrades, OS maintenance, and hardware or software changes.
Rolling Restarts
Apache Solr and Elasticsearch both recommend ‘rolling restarts’ for patch management, operating system upgrades, and other fixes. A rolling restart stops and starts each node in the cluster sequentially, allowing the cluster to continue serving search requests while each node is updated with the latest code, fixes, or patches. Rolling restarts are the approach of choice when high availability is mandatory and downtime is not allowed.
Sometimes a rolling restart requires intelligent decisions based on cluster topology: if a cluster consists of shards and replicas, the order in which nodes are restarted must be chosen carefully.
Apache Solr
Apache ZooKeeper runs as a stand-alone service and is not upgraded automatically when Apache Solr is upgraded; it should be upgraded manually at the same time.
Elasticsearch
Elasticsearch recommends disabling shard allocation while a node restarts. This tells Elasticsearch not to start re-balancing missing shards, which the cluster would otherwise begin immediately on node loss.
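The disable/restart/re-enable cycle can be scripted. The following Python sketch only builds the settings payloads and the per-node sequence; the setting key is Elasticsearch's documented `cluster.routing.allocation.enable`, the helper names are ours, and issuing the actual HTTP calls is left out:

```python
def allocation_setting(enabled):
    # 'none' pauses shard rebalancing while a node is down; 'all'
    # restores normal allocation once the node rejoins.
    value = "all" if enabled else "none"
    return {"transient": {"cluster.routing.allocation.enable": value}}

def rolling_restart_steps(nodes):
    # One disable/restart/re-enable cycle per node, sequentially, so the
    # cluster keeps serving search requests throughout the upgrade.
    steps = []
    for node in nodes:
        steps.append(("PUT /_cluster/settings", allocation_setting(False)))
        steps.append(("restart", node))
        steps.append(("PUT /_cluster/settings", allocation_setting(True)))
    return steps
```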
Amazon CloudSearch
Amazon CloudSearch internally manages all patches and upgrades to its operating system. When new features are rolled out to the managed service, upgrades are self-managed and immediately available to all customers without any action on their part.
Conclusion
Patch management in Apache Solr and Elasticsearch has to be carried out manually using rolling restarts; customers typically automate the process by developing custom scripts for system upgrades and patch management.
Patch management in Amazon CloudSearch is transparent to customers. Upgrades and patches applied to Amazon CloudSearch are regularly announced in the ‘What’s New’ section of the CloudSearch documentation.
2.3 Re-indexing
Any business application changes over its lifetime as the business running it changes. These changes have a direct effect on the data structure of the system’s persistent information store, and the search engine, as a secondary or alternate store, will eventually have to change its data structure as well. Any change to the search engine’s data structure requires re-indexing the data.
Example: a product company starts collecting ‘feedback’ from its customers for a given product. The text from the new ‘feedback’ field needs to be added to the search schema, which may require re-indexing.
If the search data is not re-indexed after a structural change, the data that has already been indexed can become inaccurate, and search results may behave differently than expected.
Re-indexing thus becomes necessary as the application grows. It is a common and mandatory admin operation, executed periodically based on application requirements.
Apache Solr
Apache Solr recommends re-indexing whenever your schema definitions change. The options below are widely used by the Apache Solr user community.
• Create a fresh index with the new settings and copy all of the documents from the old index to the new one.
• Configure the Data Import Handler with ‘SolrEntityProcessor’, which imports data from Solr instances or cores for a given search query. SolrEntityProcessor can only copy fields that are stored in the source index.
• Configure the Data Import Handler with the original data source and push the data freshly to the new index.
Elasticsearch
Elasticsearch proposes several approaches for re-indexing data. The following approaches are usually combined:
• Use Elasticsearch’s Scan and Scroll and Bulk APIs to fetch data and push it into the new index.
• Update or create an index alias with the old index name (pointing to the new index) and delete the old index.
• Use open-source Elasticsearch plugins that can extract all data from the cluster and re-index it. Most of these plugins internally use the Scan and Scroll and Bulk APIs mentioned above, which reduces development time.
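The fetch-and-push step can be sketched as a small helper that formats documents into Bulk API bodies aimed at the new index. This is an illustrative Python sketch: the chunk size and the `_id` key convention are our assumptions, and on older Elasticsearch versions the action line would also carry a type:

```python
import json

def bulk_index_body(docs, new_index, chunk_size=500):
    # Yield newline-delimited bulk bodies that index each document into
    # the new index, chunk_size documents at a time. Each doc is assumed
    # to carry its identifier under an '_id' key (an illustrative choice).
    for start in range(0, len(docs), chunk_size):
        lines = []
        for doc in docs[start:start + chunk_size]:
            doc = dict(doc)
            doc_id = doc.pop("_id")
            lines.append(json.dumps({"index": {"_index": new_index, "_id": doc_id}}))
            lines.append(json.dumps(doc))
        yield "\n".join(lines) + "\n"
```

Each yielded body would be POSTed to the cluster's `_bulk` endpoint while a scroll query drains the old index.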
Amazon CloudSearch
Amazon CloudSearch requires the index to be rebuilt when index fields are added or modified, and expects an indexing request to be issued after any configuration change. Whenever there is a configuration change, the CloudSearch domain status changes to ‘NEEDS INDEXING’. During the rebuild, the domain’s status changes to ‘PROCESSING’, and upon completion it returns to ‘ACTIVE’.
Amazon CloudSearch can continue to serve search requests during the indexing process, but the configuration changes are not reflected in search results until it completes. The time re-indexing takes is directly proportional to the volume of data in your index.
Amazon CloudSearch also allows document uploads while indexing is in progress, but updates can become slower if there is a large volume of document updates. In that scenario, uploads or updates can be throttled or paused until the Amazon CloudSearch domain returns to an ‘ACTIVE’ state.
Customers can initiate re-indexing by issuing the index-documents command using the RESTful API, the AWS Command Line Interface (CLI), or an AWS SDK. They can also initiate re-indexing from the CloudSearch management console.
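A minimal automation sketch, assuming boto3 and configured AWS credentials (the `RequiresIndexDocuments` flag is what the describe-domains response uses to signal the ‘NEEDS INDEXING’ state; the helper names are ours):

```python
def needs_reindex(domain_status):
    # The describe-domains response carries a RequiresIndexDocuments
    # flag when the domain is in the 'NEEDS INDEXING' state.
    return bool(domain_status.get("RequiresIndexDocuments"))

def reindex_if_needed(domain_name):
    # Requires boto3 and configured AWS credentials.
    import boto3
    client = boto3.client("cloudsearch")
    status = client.describe_domains(DomainNames=[domain_name])["DomainStatusList"][0]
    if needs_reindex(status):
        return client.index_documents(DomainName=domain_name)
    return None
```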
Conclusion
Re-indexing in Apache Solr and Elasticsearch is mostly a manual process because it requires decisions that factor in data size, current request load, and offline hours.
Amazon CloudSearch manages the re-indexing process itself and leaves much less to administrators. The re-indexing time period is abstracted away and not disclosed to administrators, but Amazon CloudSearch runs the process according to the best practices mentioned above.
Feature 3: Monitoring
Monitoring server health is an essential daily task for operations and administration. In this section,
we will describe the built-in monitoring capabilities for all three search engines.
Apache Solr
Apache Solr has a built-in web console for monitoring indexes, performance metrics, information about index distribution and replication, and all threads currently running in the Java Virtual Machine (JVM).
For more detailed monitoring, Java Management Extensions (JMX) can be configured with Solr to expose runtime statistics as MBeans. The Solr JVM container has built-in instrumentation that enables monitoring via JMX.
Elasticsearch
Elasticsearch has a management and monitoring plugin called ‘Marvel’. Marvel includes an interactive console called ‘Sense’ that helps users interact easily with Elasticsearch nodes. Elasticsearch has a diverse set of built-in APIs that emit heap usage, garbage-collection stats, file descriptors, and more. Marvel is tightly integrated with these APIs: it polls them periodically, collects statistics, and stores the data back in Elasticsearch. Marvel’s interactive graph dashboard allows administrators to query and aggregate historical statistics.
Amazon CloudSearch
Amazon CloudSearch recently introduced Amazon CloudWatch integration. The Amazon
CloudSearch metrics can be used to make scaling decisions, troubleshoot issues, and manage
clusters.
Amazon CloudSearch publishes four metrics into Amazon CloudWatch: SuccessfulRequests, SearchableDocuments, IndexUtilization, and Partitions (the partition count).
The CloudWatch metrics can be configured to set alarms, which can notify administrators through
Amazon Simple Notification Service.
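As an illustration, a CloudWatch query for one of these metrics can be parameterized as below. This is a Python sketch: the `AWS/CloudSearch` namespace and the `DomainName`/`ClientId` dimensions follow the CloudWatch integration as we understand it, and the domain name and account ID passed in would be your own:

```python
from datetime import datetime, timedelta

def cloudsearch_metric_query(domain, client_id, metric, hours=24):
    # Parameter dict for CloudWatch get_metric_statistics; pass it as
    # keyword arguments to a boto3 'cloudwatch' client.
    now = datetime.utcnow()
    return {
        "Namespace": "AWS/CloudSearch",
        "MetricName": metric,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain},
            {"Name": "ClientId", "Value": client_id},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 3600,           # one data point per hour
        "Statistics": ["Maximum"],
    }
```

An alarm on, say, IndexUtilization built from such a query can then notify administrators through Amazon Simple Notification Service as described above.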
Conclusion
Apache Solr and Elasticsearch integrate with both built-in and external plugins, and can also use SaaS-based monitoring tools or custom plugins developed by customers.
CloudSearch’s integration with CloudWatch already exposes some useful metrics, and more are expected in the future.
Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export
4.1 Schema management
Schema: a schema is a definition of the fields and field types the search system uses to organize data within the documents it indexes.
Schema definition is the foremost task in designing a search data structure. It is important that the schema caters to all business requirements and suits the application.
Apache Solr and Elasticsearch
Both Elasticsearch and Apache Solr can run a search application in ‘schema-less’ or ‘schema’ mode. Schema mode is suitable for application development and production environments. Schema-less mode is a very good option for newcomers getting started: after server setup, users can start the application without a schema structure and let field definitions be created at indexing time. However, for a production-grade application, a proper schema structure becomes mandatory.
Amazon CloudSearch
Amazon CloudSearch also allows users to set up search domains without any index fields. Index fields can be added at any time, but must exist before any document indexing or search request. In addition, the CloudSearch management console integrates with AWS services like S3 and DynamoDB, and can read from a local machine, so an existing schema can be imported directly into a CloudSearch domain. After the import, CloudSearch allows the user to edit fields or add new ones. This is convenient when a pre-built schema is being migrated to a CloudSearch domain.
Conclusion
Apache Solr and Elasticsearch can be started without any schema, but cannot be put into production use that way. Amazon CloudSearch allows creating domains without any index fields, but a schema must be created before index and search requests can be served.
The general best practice in schema management is to design and rehearse a schema suited to the application’s requirements before finalizing the search structure. The underlying schema concept of all three search engines is consistent with this practice.
4.2 Dynamic fields
Dynamic fields are like regular field definitions, but support wildcard matching. They allow documents to be indexed without knowing in advance which fields they contain. A dynamic field is defined using a wildcard pattern, with the asterisk (*) as the first, last, or only character. Undefined fields are checked against the dynamic field rules, and a field that matches a pattern is indexed with that dynamic field’s indexing options.
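The matching rule can be illustrated with a small sketch (Python; this is a simplified first-match resolution, and real engines apply their own precedence rules when several patterns match):

```python
def matches_dynamic_field(pattern, field_name):
    # The wildcard may be the first, last, or only character of the pattern.
    if pattern == "*":
        return True
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    return field_name == pattern

def resolve_field(patterns, field_name):
    # An undefined field falls through the dynamic field rules until
    # one pattern matches; None means no rule applies.
    for pattern in patterns:
        if matches_dynamic_field(pattern, field_name):
            return pattern
    return None
```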
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch allow end users to set up dynamic fields and rules using the RESTful API and schema configuration.
Amazon CloudSearch
In Amazon CloudSearch, dynamic fields can be configured using the indexing options in the CloudSearch management console, the RESTful API, or the AWS SDK.
Conclusion
If you are unsure about the schema structure or exact field names, dynamic fields come in handy. Amazon CloudSearch, Apache Solr, and Elasticsearch all allow the flexibility to configure dynamic fields. This helps the application development team catch any field definitions omitted from the schema document.
4.3 Data types
There are a variety of data types supported by these search engines. The table below illustrates the
data field types supported by each search engine.
| Data type | Solr | Elasticsearch | CloudSearch |
|---|---|---|---|
| String / Text | Yes | Yes | Yes |
| Number types | integer, double, float, long | byte, short, integer, long, float, double | integer, double |
| Date types | Yes | Yes | Yes |
| Enum fields | Yes | Yes | No |
| Currency | Yes | No | No |
| Geo location (latitude-longitude) | Yes | Yes | Yes |
| Boolean | Yes | Yes | No |
| Array types | Yes | Yes | Yes |
Conclusion
The most important data types (string, date, and number types) are supported by all three search engines. The geo-location data type, which is now regularly used by modern applications, is also supported by all three.
Engineers and developers may use an alternate data type if a particular type is not supported by their chosen search engine. For example, the ‘currency’ data type supported in Solr is not available in Elasticsearch or CloudSearch; in such cases, engineers use a number type as an alternative.
4.4 Data import & export
The most important task in search application development is migrating data from its source of origin to the search engine. The source data can live in a database, a file system, or another persistent store; to seed the search data set, the full data set must be migrated or imported from its origin into the search engine.
Likewise, extracting data from a search engine and exporting it to a different destination is also a crucial, though occasional, task.
Apache Solr
Apache Solr has in-built handler called Data import handler (DIH). The DIH provides a tool for
migrating and/or importing data from the origin store. The DIH can index data from data sources
such as
• Relational Database Management System (RDBMS)
• Email
• HTTP URL end point
• Feeds like RSS and ATOM
• Structured XML files
The DIH has more advanced features like Apache Tika integration, delta import, and transformers
to quickly migrate the data.
The Apache Solr export handler can export query result data in JavaScript Object Notation
(JSON) or comma-separated values (CSV) format. An export query expects sort and filter query
parameters and returns only the stored fields. Users also have the option of developing a custom
export handler and incorporating it with the Solr core libraries.
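As a rough illustration (not Solr's official client API), an export request can be assembled as a plain URL; the export handler requires an explicit sort and a field list. The collection and field names below are hypothetical:

```python
from urllib.parse import urlencode

def build_export_url(base_url, collection, query, sort_field, fields):
    """Build a request URL for Solr's /export handler.

    The export handler requires an explicit sort and a field list ('fl')
    naming the fields to return."""
    params = {
        "q": query,
        "sort": f"{sort_field} asc",
        "fl": ",".join(fields),
    }
    return f"{base_url}/{collection}/export?{urlencode(params)}"

# Hypothetical collection and fields:
url = build_export_url("http://localhost:8983/solr", "products",
                       "category:laptop", "id", ["id", "name", "price"])
```

The resulting URL can be fetched with any HTTP client to stream the sorted result set.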
Elasticsearch
Elasticsearch 'Rivers' is a pluggable service which runs inside the Elasticsearch cluster. It can
be configured to pull or push the data to be indexed into the cluster.
Some of the popular Elasticsearch Rivers modules are CouchDB, Dropbox, DynamoDB, FileSystem,
Java Database Connectivity (JDBC), Java Messaging Service (JMS), MongoDB, neo4j, Redis, Solr,
Twitter, and Wikipedia.
However, 'Rivers' is deprecated in newer releases of Elasticsearch, which recommend using the
official client libraries built for popular programming languages. Alternatively, the Logstash
input plugins are another recommended way to ship data into Elasticsearch.
For data export, an Elasticsearch snapshot can capture individual indices or an entire cluster
into a remote repository. This is discussed in detail in the section 'Operations and Management -
Backup'.
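As an illustrative sketch (the repository name, filesystem path, and index names are hypothetical), the JSON bodies for registering a shared-filesystem snapshot repository and then snapshotting selected indices might look like this:

```python
import json

# Register a shared-filesystem ("fs") snapshot repository; the location
# must be a path mounted on every node (hypothetical path).
repo_body = {"type": "fs", "settings": {"location": "/mnt/es_backups"}}

# Snapshot only selected indices instead of the entire cluster.
snapshot_body = {"indices": "products,orders", "ignore_unavailable": True}

print(json.dumps(repo_body))
print(json.dumps(snapshot_body))
```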
Amazon CloudSearch
Amazon CloudSearch recommends uploading documents to a CloudSearch domain in batches. A batch is
a collection of add, update, and delete operations described in
JSON or XML format.
Amazon CloudSearch limits a single batch upload to 5 MB, but allows running parallel batch
uploads to reduce the time needed for a full data upload. The number of parallel batch uploads
depends on the CloudSearch instance type: larger instance types have a higher upload capacity,
while smaller ones have lower capacity. Batch upload programs should therefore throttle uploads
based on instance capacity.
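The batching logic above can be sketched as follows. This is a minimal illustration, not the AWS SDK; the document shape and the slightly conservative size accounting are assumptions:

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # CloudSearch's 5 MB per-batch limit

def make_batches(docs, max_bytes=MAX_BATCH_BYTES):
    """Split 'add' operations into batches that stay under the size limit.

    Each document becomes an 'add' operation; a new batch is started
    whenever adding another operation would exceed max_bytes."""
    batches, current, size = [], [], 2       # 2 bytes for the enclosing [ ]
    for doc in docs:
        op = {"type": "add", "id": doc["id"], "fields": doc["fields"]}
        op_size = len(json.dumps(op)) + 2    # +2 for the ", " separator
        if current and size + op_size > max_bytes:
            batches.append(current)
            current, size = [], 2
        current.append(op)
        size += op_size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch can then be serialized with json.dumps and uploaded, with the degree of parallelism throttled to the search instance type.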
Conclusion
Apache Solr has good handlers for importing and exporting data. If the available options are not
viable, Apache Solr allows one to develop a new custom handler, or to customize an existing
handler, for data import and export.
Elasticsearch integrates with popular data sources in the form of 'River' modules or plugins.
However, future versions of Elasticsearch strongly recommend using Logstash input plugins, or
developing and contributing new ones, since Elasticsearch allows plugins to be customized.
Amazon CloudSearch does not have elaborate options like the other two search engines. However, by
combining custom programs with Amazon CloudSearch's bulk upload recommendations, customers can
successfully migrate data into CloudSearch.
Feature 5: Search and Indexing features
In this section, we evaluate the 'Search and Indexing' features present in the search engines
under evaluation. This is a very important feature set, as these features are widely used by
search application engineers.
5.1 Analyzers, Tokenizers and Token filters
Generally speaking, a search engine prepares text strings for indexing and searching using
analyzers, tokenizers, and filters. These components are configured as libraries for indexing and
searching the data, and most of the time they are composed into a sequential chain.
• During indexing and querying, the analyzer assesses the field text and tokenizes each block
of text into individual terms. Each token is a sub-sequence of the characters in the text.
• Each token filter processes the tokens in the stream sequentially and applies its filter
functionality.
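The chain described above can be sketched with a toy pipeline; the tokenizer and filters here are simplified stand-ins for the real library implementations:

```python
import re

def whitespace_tokenizer(text):
    """Split the field text into tokens on whitespace."""
    return re.findall(r"\S+", text)

def lowercase_filter(tokens):
    """Normalize every token to lower case."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"a", "an", "the", "of"})):
    """Drop common stopwords from the token stream."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, tokenizer, filters):
    """Run the tokenizer, then apply each token filter in sequence."""
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(analyze("The Art of Search", whitespace_tokenizer,
              [lowercase_filter, stop_filter]))
# prints ['art', 'search']
```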
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch have rich built-in libraries of analyzers, tokenizers, and token
filters. These libraries are packaged with the search engine installation and can be configured
for indexing and searching.
Although analyzers can be configured for both indexing and querying, the same chain of libraries
need not be used for both operations. Indexing and searching can be configured with different
tokenizers and filters, as their goals can differ.
Apache Solr
Tokenizers: Standard, Classic, Keyword, Letter, Lower Case, N-Gram, Edge N-Gram, ICU, Path
Hierarchy, Regular Expression Pattern, UAX29 URL Email, White Space
Filters: ASCII Folding, Beider-Morse, Classic, Common Grams, Collation Key, Daitch-Mokotoff
Soundex, Double Metaphone, Edge N-Gram, English Minimal Stem, Hunspell Stem, Hyphenated Words,
ICU Folding, ICU Normalizer 2, ICU Transform, Keep Words, KStem, Length, Lower Case, Managed
Stop, Managed Synonym, N-Gram, Numeric Payload Token, Pattern Replace, Phonetic, Porter Stem,
Remove Duplicates Token, Reversed Wildcard, Shingle, Snowball Porter, Stemmer, Standard, Stop,
Suggest Stop, Synonym, Token Offset Payload, Trim, Type As Payload, Type Token, Word Delimiter
Elasticsearch
Tokenizers: Standard, Edge NGram, Keyword, Letter, Lowercase, NGram, Whitespace, Pattern, UAX
Email URL, Path Hierarchy, Classic, Thai
Filters: Standard Token, ASCII Folding Token, Length Token, Lowercase Token, Uppercase Token,
NGram Token, Edge NGram Token, Porter Stem Token, Shingle Token, Stop Token, Word Delimiter
Token, Stemmer Token, Stemmer Override Token, Keyword Marker Token, Keyword Repeat Token, KStem
Token, Snowball Token, Phonetic Token, Synonym Token, Compound Word Token, Reverse Token, Elision
Token, Truncate Token, Unique Token, Pattern Capture Token, Pattern Replace Token, Trim Token,
Limit Token Count Token, Hunspell Token, Common Grams Token, Normalization Token, CJK Width
Token, CJK Bigram Token, Delimited Payload Token, Keep Words Token, Keep Types Token, Classic
Token, Apostrophe Token
Amazon CloudSearch
Amazon CloudSearch uses an analysis scheme configuration to analyze text data during indexing.
Analysis schemes basically control:
• Text field content processing
• Stemming
• Stopword and synonym handling
• Tokenization (Japanese)
• Bigrams (Chinese, Japanese, and Korean)
The following analysis options are applied when text fields are configured with an analysis scheme:
1. Algorithmic stemming: Level of algorithmic stemming (minimal, light, and heavy) to
perform. The stemming levels vary depending on the analysis scheme language.
2. Stemming dictionary: A dictionary to override the results of the algorithmic stemming.
3. Japanese Tokenization Dictionary: A dictionary which specifies how particular characters
should be grouped into words (only for Japanese language).
4. Stopwords: A set of terms that should be ignored both during indexing and at search.
5. Synonyms: A dictionary of words that have the same meaning in the text data.
Before processing the analysis scheme, Amazon CloudSearch tokenizes and normalizes the text
data. During tokenization, the text data is split into multiple tokens; this is common behavior in all
search engine text processing. During normalization, upper case characters are converted to lower
case, and more formatting is applied.
After the tokenization and normalization processes are completed, stemming, stopwords, and
synonyms are applied.
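As a hedged sketch of how these options fit together, an analysis scheme definition passed to the CloudSearch configuration API might look roughly like the following (the scheme name and option values are illustrative):

```python
import json

# Illustrative analysis scheme definition; stopwords and synonyms are
# passed as JSON strings inside the configuration.
scheme = {
    "AnalysisSchemeName": "my_english_scheme",   # hypothetical name
    "AnalysisSchemeLanguage": "en",
    "AnalysisOptions": {
        "AlgorithmicStemming": "light",
        "Stopwords": json.dumps(["a", "an", "the"]),
        "Synonyms": json.dumps({"groups": [["phone", "smartphone"]]}),
    },
}
```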
Conclusion
Apache Solr and Elasticsearch are packaged with a wide variety of analyzer, tokenizer, and filter
libraries with distinct functions. These libraries can also be customized, which gives developers
greater flexibility.
Amazon CloudSearch doesn’t carry sophisticated tokenizers or filter libraries like Apache Solr or
Elasticsearch, but it has simplified the configuration. Amazon CloudSearch tokenizers and filters
cover most common search requirements and use cases. This is ideal for developers who want to
quickly integrate search functionality into their application stack.
5.2 Faceting
Faceting is the arrangement of search results into categories or groups based on indexed terms.
It allows search results to be categorized into sub-groups, which can serve as the basis for
filters or further searches, and it enables efficient computation of result counts per facet. For
example, facets for 'Laptop' search results can be 'Price', 'Operating System', 'RAM' or
'Shipping Method'.
Faceting is a popular function that helps consumers filter through search results easily and
effectively.
Apache Solr
Apache Solr has advanced faceting options, ranging from simple to very sophisticated faceting
behavior.
The table below details the parameters used for faceting. They can be grouped into field value,
date, range, pivot, multi-select, and interval faceting.
Field value parameters: facet.field, facet.prefix, facet.sort, facet.limit, facet.offset,
facet.mincount, facet.missing, facet.method, facet.enum.cache.minDf, facet.threads
Date faceting parameters: facet.date, facet.date.start, facet.date.end, facet.date.gap,
facet.date.hardend, facet.date.other, facet.date.include
Range faceting parameters: facet.range, facet.range.start, facet.range.end, facet.range.gap,
facet.range.hardend, facet.range.other, facet.range.include
Pivot: facet.pivot, facet.pivot.mincount
Interval: facet.interval, facet.interval.set
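For instance, a field-value facet combined with a range facet can be expressed as query parameters; the query and field names below are hypothetical:

```python
from urllib.parse import urlencode

# Field-value facet on operating_system plus a range facet on price,
# using parameters from the table above.
params = [
    ("q", "laptop"),
    ("facet", "true"),
    ("facet.field", "operating_system"),
    ("facet.range", "price"),
    ("facet.range.start", "0"),
    ("facet.range.end", "2000"),
    ("facet.range.gap", "500"),
    ("facet.mincount", "1"),
]
query_string = urlencode(params)
```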
Elasticsearch
Elasticsearch has deprecated facets and announced that they will be removed in a future release.
The Elasticsearch team felt that their facet implementation was not designed from the ground up
to support complex aggregations. Elasticsearch will be replacing facets with aggregations in their
next release.
Elasticsearch says “An aggregation can be seen as a unit-of-work that builds analytic information
over a set of documents. The context of the execution defines what this document set is (for
example, a top-level aggregation executes within the context of the executed query/filters of the
search request).”
Elasticsearch strongly recommends migrating from facets to aggregations. The aggregations are
classified into two main families, Bucketing and Metric.
The following table lists the aggregations available in Elasticsearch.
Elasticsearch aggregators:
Min Aggregation, Max Aggregation, Sum Aggregation, Avg Aggregation, Stats
Aggregation, Extended Stats Aggregation, Value Count Aggregation, Percentiles
Aggregation, Percentile Ranks Aggregation, Cardinality Aggregation, Geo Bounds
Aggregation, Top hits Aggregation, Scripted Metric Aggregation, Global Aggregation,
Filter Aggregation, Filters Aggregation, Missing Aggregation, Nested Aggregation,
Reverse nested Aggregation, Children Aggregation, Terms Aggregation, Significant Terms
Aggregation, Range Aggregation, Date Range Aggregation, IPv4 Range Aggregation,
Histogram Aggregation, Date Histogram Aggregation, Geo Distance Aggregation,
GeoHash grid Aggregation
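As an illustration of the bucketing/metric split, the request body below nests a metric (avg) aggregation inside a bucketing (terms) aggregation; the field names are hypothetical:

```python
import json

body = {
    "size": 0,  # return only aggregation results, no search hits
    "aggs": {
        "by_os": {
            "terms": {"field": "operating_system"},              # bucketing
            "aggs": {"avg_price": {"avg": {"field": "price"}}},  # metric
        }
    },
}
print(json.dumps(body))
```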
Amazon CloudSearch
Amazon CloudSearch simplifies facet configuration when defining indexing options. These facets
are targeted at common use cases like e-commerce, online travel, and classifieds. Any field with
a date, literal, or numeric data type can be facet-enabled; this is done during CloudSearch
domain configuration. Amazon CloudSearch also allows defining buckets to calculate facet counts
for particular subsets of the facet values.
The facet information can be retrieved in two ways:
Sort: returns facet information sorted either by facet counts or by facet values.
Buckets: returns facet information for particular facet values or ranges.
During searching, facet information can be fetched for any facet-enabled field by specifying the
“facet.FIELD” parameter in the search request (‘FIELD’ is the name of a facet-enabled field).
Amazon CloudSearch does allow multiple facets, which help refine search results further. See the
example below.
Example: "q=poet&facet.genres={}&facet.rating={}&facet.year={}&return=_no_fields"
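The same kind of request can be assembled programmatically; a facet option object may be passed instead of the empty {} to control sorting or size (the field names follow the example above):

```python
from urllib.parse import urlencode

params = [
    ("q", "poet"),
    ("facet.genres", "{}"),                         # default facet behavior
    ("facet.rating", '{"sort":"count","size":5}'),  # top 5 values by count
    ("return", "_no_fields"),
]
query_string = urlencode(params)
```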
Conclusion
All three search engines allow users to perform faceting with minimal effort. However, in terms of
an advanced complex implementation, the approaches are different for each search engine.
5.3 Auto Suggestion
When a user types a search query, suggestions relevant to the query input are presented, and as
the user types more characters, refined suggestions appear. This feature is called auto-suggest.
Auto-suggest is an appealing and useful requirement and is employed in many search user
interfaces.
This feature can be implemented at the Search Engine level or at the Search Application level. Below
are some options available in these three search engines.
Apache Solr
Apache Solr has native support for the auto-suggest feature. It can be implemented using
NGramFilterFactory, EdgeNGramFilterFactory, or the TermsComponent. Usually, this Apache Solr
feature is used in conjunction with jQuery or other asynchronous client libraries to create a
powerful auto-suggest user experience in front-end applications.
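The effect of an edge n-gram filter can be illustrated directly; every gram is anchored at the start of the token, which is what makes prefix-style suggestion matches cheap at query time:

```python
def edge_ngrams(term, min_gram=2, max_gram=5):
    """Generate front-anchored n-grams, as an EdgeNGram filter would."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]

print(edge_ngrams("search"))
# prints ['se', 'sea', 'sear', 'searc']
```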
Elasticsearch
Elasticsearch also supports edge n-grams, which are easy to set up, flexible, and fast. In
addition, Elasticsearch introduced a data structure called the Finite State Transducer (FST),
which resembles a big graph. The FST is held in memory, which makes it much faster than a
term-based query could be. Elasticsearch recommends using edge n-grams when the query input and
its word ordering are less predictable.
Amazon CloudSearch
Amazon CloudSearch offers 'Suggesters' to achieve auto-suggest. A CloudSearch suggester is
configured against a particular text field. When a suggester is queried with a search string,
CloudSearch lists all documents in which the value of the suggester field begins with that search
string. Suggesters can be configured to find matches for the exact query, or to perform fuzzy
matching to correct the query string; fuzzy matching can be set to a fuzziness level of Low,
High, or Default.
Suggesters can also be configured with a SortExpression, which computes a score for each
suggestion. Note that the domain must be re-indexed when a new suggester is configured:
suggestions will not be reflected until all of the documents are indexed.
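A suggest request is an ordinary HTTP GET against the domain's search endpoint; the domain endpoint and suggester name below are hypothetical:

```python
from urllib.parse import urlencode

# 'title_suggester' is a hypothetical suggester configured on the domain.
params = {"q": "sma", "suggester": "title_suggester", "size": 8}
url = ("https://search-mydomain.us-east-1.cloudsearch.amazonaws.com"
       "/2013-01-01/suggest?" + urlencode(params))
```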
Conclusion
Amazon CloudSearch provides a simple yet powerful suggester implementation, which is sufficient
for most applications. If you are looking for advanced options or further customization of
suggestions, Apache Solr and Elasticsearch offer some good options.
5.4 Highlighting
Highlighting gives formatting cues to end users in the search results. It is a valuable feature
in which front-end search applications highlight matching snippets of text from each search
result, conveying to end users why a result document matched their query. In this section, we
describe the options present in all three search engines.
Apache Solr
Apache Solr includes matching document text fragments in the query response. These text fragments
are returned as a highlighted section that search clients use as a cue for presentation. Apache
Solr is packaged with good highlighting components which give control over the text fragments,
fragment size, fragment formatting, and so on. These highlighting components can be incorporated
with Solr query parsers and request handlers.
Apache Solr comes with three highlighting utilities:
• Standard Highlighter
• FastVector Highlighter
• Postings Highlighter
Standard Highlighter is most commonly used by search engineers because it is a good choice for a
wide variety of search use-cases. The FastVector Highlighter is ideal for large documents and
highlighting text in a variety of languages. The Postings Highlighter works well for full-text
keyword search.
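With the Standard Highlighter, highlighting is driven by request parameters; the sketch below shows a typical set (the field name is hypothetical):

```python
from urllib.parse import urlencode

params = [
    ("q", "smart phone"),
    ("hl", "true"),               # enable highlighting
    ("hl.fl", "description"),     # field(s) to highlight
    ("hl.snippets", "2"),         # fragments per field
    ("hl.fragsize", "120"),       # fragment size in characters
    ("hl.simple.pre", "<em>"),    # markup placed around matched terms
    ("hl.simple.post", "</em>"),
]
query_string = urlencode(params)
```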
Elasticsearch
Elasticsearch also allows highlighting search results on one or more fields. The implementation
uses Lucene-based highlighters: the plain highlighter, the fast-vector-highlighter, and the
postings-highlighter. In Elasticsearch, the highlighter can be configured in the query to force a
specific highlighter type, a very flexible option that lets developers choose the highlighter
that suits their requirements.
The three highlighters present in Elasticsearch behave the same way as their Apache Solr
counterparts, because both are inherited from the Lucene family.
Amazon CloudSearch
Amazon CloudSearch simplifies highlighting: specify the highlight.FIELD parameter in the search
request, and Amazon CloudSearch returns excerpts with the search results to show where the search
terms occur within a particular field of a matching document.
For example, the search term 'smartphone' is highlighted within the description field:
"highlights": {"description": "A *smartphone* is a mobile phone with an advanced mobile
operating system. They typically combine the features of a cell phone with those of other popular
mobile devices, such as personal digital assistant (PDA), media player and GPS navigation unit. A
*smartphone* has a touchscreen user interface and can run third-party apps, and are camera
phones."}
Amazon CloudSearch also provides controls such as the number of search-term occurrences within an
excerpt and how they should be highlighted (plain text or HTML).
Conclusion
From a development perspective, all three search engines provide easy and simple highlighting
implementations. If you are looking for different and more advanced highlighting options, Apache
Solr and Elasticsearch have some good features.
Feature 6: Multilingual support
Multilingual support is a very important feature for global applications which cater to
non-English speaking geographies. A survey by a leading information measurement company reveals
that search engines built with multilingual features are emerging and successful because they
support users' native languages and focus on their cultural backgrounds.
Business impact: multilingual search is an effective marketing strategy for getting consumers'
attention. In e-commerce, presenting content in the customer's native tongue creates a platform
for more business.
Apache Solr
Apache Solr is packaged with multilingual support for the most common languages. Apache Solr
carries many language-specific tokenizer and filter libraries which can be configured during
indexing and querying.
Apache Solr engineering forums recommend using a multi-core architecture where each core manages
one language. Solr also supports language detection using its Tika and LangDetect detection
features, which helps map text data to language-specific fields during indexing.
Elasticsearch
Elasticsearch has incorporated a vast collection of language analyzers for most commonly spoken
languages. The primary role of the language analyzer is to split, stem, filter, and apply required
transformations specific to the language.
Elasticsearch also allows a user to define a custom analyzer that can be a base extension of
another analyzer.
Amazon CloudSearch
Amazon CloudSearch has strong support for language-specific text processing, with pre-defined
default analysis schemes for 34 languages. Amazon CloudSearch processes text and text-array
fields based on the configured language-specific analysis scheme.
Amazon CloudSearch also allows a user to define a new analysis scheme that can be an extension
of the default language analysis scheme.
Conclusion
All three search engines have ample and effective support features for widely spoken
international languages.
Language support
The table below lists the languages supported by each search engine.
Search engine Languages supported
Apache Solr
Arabic, Brazilian Portuguese, Bulgarian, Catalan, Chinese, Simplified Chinese, CJK,
Czech, Danish, Dutch, Finnish, French, Galician, German, Greek, Hebrew, Lao,
Myanmar, Khmer, Hindi, Indonesian, Italian, Irish, Japanese, Latvian, Norwegian,
Persian, Polish, Portuguese, Romanian, Russian, Scandinavian, Serbian, Spanish,
Swedish, Thai and Turkish
Elasticsearch
Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish,
Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian,
Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian,
Portuguese, Romanian, Russian, Spanish, Swedish, Thai and Turkish
Amazon
CloudSearch
Arabic, Armenian, Basque, Bulgarian, Catalan, Chinese - Simplified, Chinese -
Traditional, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek,
Hindi, Hebrew, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian,
Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and
Thai
Feature 8: High Availability
All three search engines are architected for
• High availability (HA)
• Replication
• Scaling design principles
In this section, we discuss the high availability options present in these three search engines.
8.1 Replication
Replication is copying or synchronizing the search index from master nodes to slave nodes to
manage the data efficiently.
Replication is a key design principle for high availability and scaling. From a high availability
perspective, replication enables failover from master nodes (shards or leaders) to slave nodes
(replicas). From a scaling perspective, replication is used to scale out the slave or replica
nodes when request traffic increases.
Apache Solr
Apache Solr supports two models of replication: legacy mode and SolrCloud. In legacy mode, the
replication handler copies data from the master node's index to the slave nodes. The master
server manages all index updates, and the slave nodes handle read queries. This separation of
master and slave allows Solr clusters to scale and deliver high-volume loads.
Apache SolrCloud is an advanced distributed cluster setup of Solr nodes designed for high
availability and fault tolerance. Unlike legacy mode, there is no explicit concept of
'master/slave' nodes. Instead, the search cluster is split into leaders and replicas, and each
leader is responsible for ensuring that its replicas hold the same data it stores. Apache Solr
has a configuration called 'numShards' which defines the number of shards (leaders). During
start-up, the core index is split across the 'numShards' shards, which are designated leaders.
Nodes that join the Solr cluster after the initial 'numShards' are automatically assigned as
replicas for the leaders.
Elasticsearch
Elasticsearch follows a concept similar to SolrCloud. In brief, an Elasticsearch index can be
split into multiple shards, and each shard can be replicated onto any number of nodes
(0, 1, 2, ..., n). Once replication is in place, the index has primary shards and replica shards.
The number of shards and replicas is defined at index creation time; the number of replicas can
be changed dynamically, but the shard count cannot.
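A sketch of the corresponding settings: the shard count is fixed in the index-creation body, while the replica count can be changed later with a settings update (the values are illustrative):

```python
import json

# Body for creating an index with 3 primary shards and 2 replicas each.
create_body = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 2}
}

# Later, only the replica count can be changed dynamically:
update_body = {"index": {"number_of_replicas": 4}}

print(json.dumps(create_body))
```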
Apache Solr and Elasticsearch
Both Apache Solr and Elasticsearch support synchronous and asynchronous replication models. If
replication is configured in 'synchronous' mode, the primary (leader) shard waits for successful
responses from the replica shards before acknowledging the commit. In 'asynchronous' mode, the
response is returned to the client as soon as the request has executed on the primary or leader
shard, and the request is forwarded to the replicas asynchronously.
The diagram below depicts the replication concept which is followed in Solr and Elasticsearch.
Replication handling in Apache Solr and Elasticsearch (diagram legend):
S1 (Node 1): Shard 1 of the cluster
S2 (Node 2): Shard 2 of the cluster
R1 (Node 3): Replica 1 of Shard 1
R2 (Node 4): Replica 1 of Shard 2
R3 (Node 5): Replica 2 of Shard 1
R4 (Node 6): Replica 2 of Shard 2
Amazon CloudSearch
Amazon CloudSearch is simple and refined when it comes to handling replication and streamlines
the job of search engineers and administrators. During the configuration of scaling, Amazon
CloudSearch prompts for the desired replication count which should be based on load
requirements.
Amazon CloudSearch will automatically scale up and scale down the replicas for a domain based on
the requests traffic and data volume, but not below the desired replication count. In Amazon
CloudSearch, the replication scaling option can be changed at any time. If the scale requirement is
temporary, (for example, anticipated spikes because of a seasonal sale) the desired replication
count of the domain can be pre-scaled, and then the changes reverted after the requests volume
returns to a steady state. Modifying the replication count does not require any index rebuilding,
but replica sync completion depends on the size of the search index.
The benefits of the Amazon CloudSearch replication model include:
• Search instance capacity is automatically replicated and load is distributed, so the search
layer is robust and highly available at all times.
• Improved fault tolerance. If any one of the replicas is down, the other replica(s) will
continue to handle requests while the failed replica is in recovery mode.
• The entire process of scaling and distribution is automated and avoids manual intervention
and support.
Conclusion
All three search engines have a good foundation for the replication feature. Apache Solr and
Elasticsearch allow you to define your own replication topology, configured for synchronous or
asynchronous replication, and clusters can be scaled manually or automatically, based on
application requirements, by writing custom programs. However, substantial managed-service
operations are required if cluster replication is set up at enterprise scale.
Amazon CloudSearch fully manages replication, handling scaling, load distribution, and fault
tolerance. This simplicity saves operations costs for enterprises and companies.
8.2 Failover
Failover is a back-end operation that switches to secondary or standby nodes in the event of a
primary server failure. Failover is an important fault tolerance function for systems with low or
zero downtime requirements.
Apache Solr and Elasticsearch
When an Apache Solr or Elasticsearch cluster is built with shards and replicas, the cluster
inherently becomes fault-tolerant and automatically supports failover.
During any failure, a cluster is expected to support the operations while the failed node is put into
recovery state. Both the Apache Solr and Elasticsearch documentation strongly recommend a
distributed cluster setup to protect user experience from application or infrastructure failure.
If all of the nodes storing a shard and its replicas fail, client requests will also fail. If the
shards are configured to be tolerant, partial results can be returned from the available shards.
This behavior is expected in both Apache Solr and Elasticsearch.
The representation below depicts how failover is handled in a cluster. This flow applies to both
Solr and Elasticsearch.
Cluster layout (node, first replica, second replica):
Shard 1: Node 1; first replica: Node 3; second replica: Node 5
Shard 2: Node 2; first replica: Node 4; second replica: Node 6
The table below illustrates failure scenarios in a search cluster.
Scenario A: If Shard 1 fails, one of its replica nodes, either Node 3 or Node 5, is chosen as
leader.
Scenario B: If Shard 2 fails, one of its replica nodes, either Node 4 or Node 6, is chosen as
leader.
Scenario C: If Shard 1 Replica 1 fails, Shard 1 Replica 2 continues to support replication as
well as serve requests.
Scenario D: If Shard 2 Replica 1 fails, Shard 2 Replica 2 continues to support replication as
well as serve requests.
Elasticsearch uses its internal Zen Discovery to detect failures: if the node holding a primary
shard dies, a replica is promoted to primary. Apache Solr uses Apache ZooKeeper for coordination,
failure detection, and leader voting; ZooKeeper initiates the leader election process among
replicas when a leader or primary shard fails.
Amazon CloudSearch
Amazon CloudSearch has built-in failover support. Amazon CloudSearch recommends scaling
options and availability options to increase fault tolerance in the event of a service disruption or
node failures.
When Multi-AZ is turned on, Amazon CloudSearch provisions the same number of instances in
your search domain in the second availability zone within that region. The instances in the primary
and secondary zones are capable of handling a full load in the event of any failure.
In the event of a service disruption or failure in one availability zone, the traffic requests are
automatically redirected to the secondary availability zone. In parallel, Amazon CloudSearch self-
heals the cluster in failure, and Multi-AZ restores the nodes without any administrative
intervention. During this switch, in-flight queries might fail and will need to be retried from
the front-end application side.
Increasing the partitions and replicas in the Amazon CloudSearch scaling options improves
failover support: if one of the replicas or partitions fails, the other nodes (replica or
partition) handle requests while the failed node is recovered.
Amazon CloudSearch is very sophisticated in terms of handling failure, as the node health is
continuously monitored. In the event of infrastructure failures, the nodes are automatically
recovered or replaced.
Conclusion
Failover can be architected by applying techniques like replication, sharding, service discovery,
and failure detection. Apache Solr and Elasticsearch advocate building your search system in
'cluster mode' to address failover. They undertake that responsibility through service discovery,
which detects unhealthy nodes, maintains the cluster information, and rebalances the search
cluster when node failures are detected.
Amazon CloudSearch supports failover for single-node as well as cluster-mode deployments. Behind
the scenes, CloudSearch continuously monitors the health of the search instances and manages them
automatically during failures.
Feature 9: Scaling
The ability to scale in terms of computing power, memory, or data volume is essential in any
data- and traffic-bound application. Scaling is a significant design principle employed to
improve performance, load balancing, and high availability.
Over time, the search cluster is expected to be scaled horizontally (scale out) or vertically (scale up)
depending upon the needs.
Scale-up is the process of moving from a small server to a large server. Scale-out is the process of
adding multiple servers to handle the load. The scaling strategy should be selected based on
application requirements.
Apache Solr and Elasticsearch
Scaling an Apache Solr or Elasticsearch application involves manual processes. These can include a
simple server addition task or advanced tasks like cluster topology changes, storage changes, or
infrastructure upgrades.
If vertical scaling takes place, the search cluster needs to go through processes like new setup
and configuration, downtime, node restarts, etc. If scaling is horizontal, the process may involve
re-sharding, rebalancing, or cache warming.
While a search cluster system can benefit from powerful hardware, vertical scaling has its own
limitations. Upgrading or increasing the infrastructure specifications on the same server can
involve tasks like:
• New setup
• Backup
• Downtime
• Application re-testing
Scaling out is generally considered an easier task than scaling up.
An expert search administrator (Apache Solr or Elasticsearch) is usually assigned to keep a close
watch on the performance of the search servers. Infrastructure and search metrics play a key role
in the administrator's decision making.
When these metrics increase beyond the threshold of a particular server and start affecting
overall performance, new server(s) have to be manually spawned. The scale-out task can also
extend to index partitioning, auto-warming, caching, and re-routing/distribution of search
queries to the new instances. It requires a Solr expert on your team to identify and execute this
activity periodically.
Sharding and Replication
Though scaling up, scaling out, and scaling down involve manual work, technology-driven
companies automate this process by developing custom programs. These programs
continuously monitor the cluster and decide when to scale elastically, an outcome
quite similar to what AWS Auto Scaling offers.
In terms of administration functionality, both Apache Solr and Elasticsearch offer scaling
techniques called Sharding and Replication.
Sharding (i.e., partitioning) is a method in which a search index is split into multiple
logical units called "shards". Administrators recommend sharding when the indexed documents
exceed what a single node can physically hold. When sharding is enabled, search requests are
distributed to every shard in the collection; results are collected individually and then merged.
Another scaling technique, replication (see 8.1 Replication, discussed in detail), allows adding
new servers with redundant copies of your index data to handle higher concurrent query loads by
distributing requests across multiple nodes.
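The scatter-gather pattern that sharding implies can be sketched in a few lines of Python. The shard contents and the scoring scheme below are hypothetical stand-ins for a real inverted index:

```python
def search_sharded(shards, query, top_k=3):
    """Scatter a query to every shard, gather per-shard hits,
    then merge them into one globally ranked result list."""
    hits = []
    for shard in shards:
        # each shard scores only the documents it holds
        hits.extend((score, doc) for doc, score in shard.items()
                    if query in doc)
    # merge step: sort all partial results by score, keep top k
    hits.sort(key=lambda h: h[0], reverse=True)
    return [doc for _, doc in hits[:top_k]]

# two shards holding disjoint slices of the index: doc -> score
shard_a = {"cloud search": 0.9, "cloud backup": 0.4}
shard_b = {"cloud scaling": 0.7, "local search": 0.2}
print(search_sharded([shard_a, shard_b], "cloud"))
# → ['cloud search', 'cloud scaling', 'cloud backup']
```

A real engine does the same thing with network calls to each shard and relevance scores from its ranking function, but the collect-then-merge structure is identical.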
Amazon CloudSearch
Amazon CloudSearch is a fully managed search service; it scales up and down seamlessly as the
amount of data or query volume changes. CloudSearch can be scaled based on the data or on
request traffic. When the search data volume increases, CloudSearch can be scaled from a
smaller search instance type to a larger one. If the capacity of the largest search instance
type is also exceeded, then CloudSearch partitions the search index across multiple search
instances (the sharding technique).
When traffic and concurrency grow, Amazon CloudSearch deploys additional search
instances (replicas) to support the load. This automation eases the complexity and manual labour
required in the scaling-out process. Conversely, when traffic drops, Amazon CloudSearch
scales down your search domain by removing the additional search instances in order to minimize
costs.
The Amazon CloudSearch management console allows users to configure the desired partition
count and the desired replication count. The AWS console also allows changing the instance
type (scaling up) at any time. This inherent elastic-scaling behavior is one of the most
important points in favor of Amazon CloudSearch.
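Those same knobs are exposed programmatically through the CloudSearch UpdateScalingParameters API. A sketch assuming the boto3 SDK; the domain name and instance type below are illustrative:

```python
def scaling_parameters(instance_type, partitions, replicas):
    """Build the ScalingParameters payload that CloudSearch's
    UpdateScalingParameters API expects."""
    return {
        "DesiredInstanceType": instance_type,   # scale up / down
        "DesiredPartitionCount": partitions,    # sharding
        "DesiredReplicationCount": replicas,    # replication
    }

params = scaling_parameters("search.m1.large", 2, 2)

# With boto3 (not executed here), the call would look like:
#   import boto3
#   boto3.client("cloudsearch").update_scaling_parameters(
#       DomainName="my-domain", ScalingParameters=params)
```

CloudSearch treats these values as desired minimums and still scales above them automatically when data or traffic demands it.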
Conclusion
Scaling in search is implemented in the form of sharding and replication. All three search engines
have strong scaling support for setting up their search tier in ‘cluster mode’.
Scaling in Apache Solr and Elasticsearch often requires administration, as there is no hard
and fast rule. Techniques like elastic scaling can be implemented only up to a point; when the
cluster grows further, manual intervention and planning are required. Vertical scaling in Apache
Solr and Elasticsearch is even more delicate: it requires individual management of the nodes in the
cluster and is executed using techniques like ‘rolling restarts’ and custom scripts.
Amazon CloudSearch takes away all the operational intricacies from the administrators. The
desired partition count and desired replication count options in CloudSearch automatically scale
the domain up and down based on the volume of data and request traffic. This saves a lot of effort
and cost on operations and management.
Feature 10: Customization
At times, the search system or its software may not support a specific feature or offer built-in
integration with other systems. In such cases, most open source software allows developers to
customize and extend the desired features as plugins, extensions, or modules. Often, the developer
community shares extension libraries that address practical needs. These libraries can be
customized and integrated with the system.
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch share the same open source lineage, allowing customizations of:
• Analyzers
• Filters
• Tokenizers
• Language analysis
• Field types
• Validators
• Fallback query analysis
• Alternate query custom handlers
Since both products are open source, developers can customize or extend the libraries to fit
the required feature modifications through plugins and libraries. The build and deployment then
become the developer's responsibility after extending the code base.
Apache Solr and Elasticsearch have many plugin extensions that allow developers to add
custom functionality for a variety of purposes. These plugins are packaged as separate libraries
and are referenced from the application through configuration mappings.
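As a concrete illustration of configuration-driven customization, Elasticsearch lets you compose a custom analyzer entirely through index settings; a plugin-provided tokenizer or filter would be referenced by name in exactly the same way. The settings below use only stock components, and the analyzer name is made up:

```python
import json

# Index settings registering a custom analyzer built from a stock
# tokenizer plus two token filters.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folded_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}

# Serialized, this is the request body sent when creating the index.
body = json.dumps(index_settings)
```

Solr achieves the same effect declaratively in its schema and config files, mapping analyzer chains to field types.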
Amazon CloudSearch
Amazon CloudSearch does not allow any customization. The search features in Amazon
CloudSearch are offered by AWS after careful consideration and collective feedback from
customers. The Amazon CloudSearch team continually evaluates new features and rolls them out
proactively.
Conclusion
Amazon CloudSearch has a highly capable feature set for building search systems. However, if you
anticipate heavy customization of your search functionality, Apache Solr or Elasticsearch are
better choices, as their core search libraries are open source. It is also important to note that any
customization of the core libraries leaves the build and deployment responsibility to the
developer. The customization also needs to be maintained for every version upgrade or newer
release of your search engine.
Feature 11: More
11.1 Client libraries
Client libraries are required for communicating with search engines. They are essential for
developers: they encapsulate the details of connecting to the search engine and allow applications
to interact with it through high-level methods.
Apache Solr
Apache Solr has open source API clients for interacting with Solr using simple high-level
methods. Client libraries are available for PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, and
JavaScript.
Elasticsearch
Elasticsearch provides official clients for Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby.
There are other community-provided client libraries that can be integrated with Elasticsearch.
RESTful API
Besides the official and open source client APIs, Elasticsearch and Apache Solr can be integrated
using their RESTful APIs. The REST client can be a typical web client developed in the
preferred programming language, or even invoked from the command line.
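For example, a REST query against Solr's select handler needs nothing more than the standard library to construct; the host, core name, and field below are assumptions for illustration:

```python
from urllib.parse import urlencode

# Build a query URL for Solr's select handler; any HTTP client
# (curl, a browser, urllib) can then fetch it.
params = {"q": "title:matrix", "rows": 10, "wt": "json"}
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
```

The same URL could be fetched from a shell with `curl "$url"`, which is what makes the REST interface so convenient for quick inspection.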
Amazon CloudSearch
Amazon CloudSearch exposes a RESTful API for configuration, document service and search.
• The configuration API can be used for CloudSearch domain creation, its configuration and
end to end management.
• The document service API enables the user to add, replace, or delete documents in your
Amazon CloudSearch domain.
• The search API is used for search or suggestion requests to your Amazon CloudSearch
domain.
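The document service consumes batches of JSON operations. A minimal sketch of building such a batch, following the 2013-01-01 CloudSearch batch format (the document ids and fields are invented):

```python
import json

# A CloudSearch document batch: "add" operations carry fields,
# "delete" operations carry only the document id.
batch = [
    {"type": "add", "id": "movie-1",
     "fields": {"title": "The Matrix", "year": 1999}},
    {"type": "delete", "id": "movie-2"},
]

payload = json.dumps(batch)
# The payload is POSTed to the domain's document service endpoint
# (via the AWS SDK or any HTTP client) with content type
# application/json.
```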
Alternatively, AWS also provides a downloadable SDK package, which simplifies coding. The SDK
is available for popular languages like Java, .NET, PHP, Python, and more. The SDK APIs cover
most Amazon Web Services, including Amazon S3, Amazon EC2, CloudSearch, DynamoDB, and
more. The SDK package includes the AWS library, code samples, and documentation.
Feature 12: Cost
From an overall perspective, cost is a very important factor, and companies constantly look for
ways to reduce Total Cost of Ownership (TCO). In this section, we examine the cost components
of these three search engines.
Apache Solr and Elasticsearch
The cost of Apache Solr and Elasticsearch includes infrastructure resource costs, managed
services costs, and people costs. For any type of deployment, server and engineering costs are
unavoidable. The commitment to continuous admin operations depends on application
requirements and criticality.
Amazon CloudSearch
Amazon CloudSearch's cost components likewise include server and engineering costs, which are
essential for any search deployment, as with the other two. Being a fully managed service,
Amazon CloudSearch covers managed services as part of the server costs. Also, Amazon
CloudSearch does not charge upfront at the start of service usage but bills at the end of the
month based on CloudSearch usage.
Conclusion
The net operating costs are essentially the same across all three search engines, but people costs
will be 30% more for self-managed Apache Solr or Elasticsearch compared to Amazon
CloudSearch.
For example, a highly important, critical search application will require 24x7 support and
managed services. This cost is incurred as a separate managed-services expense in Apache Solr
and Elasticsearch deployments.
A detailed TCO Analysis between Apache Solr, Elasticsearch and Amazon CloudSearch can be read
here.
Link:
Conclusion
Search is an indispensable feature in most business applications.
Apache Solr and Elasticsearch are time-proven solutions. Many larger organizations have used
Apache Solr and Elasticsearch for years, but are now looking for greater operational efficiency and
cost effectiveness. At the same time, companies are looking for innovative ways to grow their
businesses and provide value. In recent years, a huge number of technology companies have
started to reap the benefits of cloud-based search services, mainly in terms of getting started
quickly and then accommodating growth without the need to switch vendors. When scalability,
cost, and speed-to-market are primary concerns, we recommend using some form of cloud
service. And if you want to enjoy the benefits of a cloud solution built on the architecture of
Apache Solr, we recommend Amazon CloudSearch.
About the Authors
Dwarak is a Principal Architect at 8KMiles with more than a decade of hands-on experience in
Cloud Computing, Big Data, Web technologies, and Product Management. He has varied and
progressive experience in architecting distributed Web and Enterprise systems and products.
He also has deep domain knowledge in the banking, finance, retail, and e-commerce industries.
At present, he oversees technology consulting, architecture, delivery, and end-to-end customer
transformation programs at 8KMiles.
Dwarakanath Ramachandran
Harish is the Chief Technology Officer (CTO) and Co-Founder of 8KMiles. Harish has more than a
decade of experience in architecting and developing cloud computing, e-commerce, and mobile
application systems. He has also built large Internet banking solutions that catered to the
needs of millions of users, where security and authentication were critical factors. He is
responsible for the overall technology direction of the 8KMiles products and services in the
Cloud, Big Data, and Mobility space. Harish is a thought leader in cloud-related technologies,
an advisor, and a widely followed blogger.
Harish Ganesan
About 8KMiles
8KMiles is a solutions company that is focused on helping organizations of all sizes
to integrate Cloud, Identity, and Big Data into their IT and business strategies.
8KMiles’ team of experts, located in North America and India, offer a host of services
and solutions such as Cloud, Federated Identity Consulting, Cloud Engineering,
Migration, Big Data services, and Managed Services on Amazon Web Services.
8KMiles offers specialized expertise in mature verticals such as Pharma, Retail,
Media, Travel, and Healthcare. Visit us at www.8kmiles.com