8KMiles Software Services, Inc
Amazon CloudSearch Comparison Report
TABLE OF CONTENTS
Smackdown
Introduction
Search features 1 - 1 comparison
Feature 1: Getting Started
Feature 2: Operations and Management
2.1 Backup
2.2 System upgrades and patch management
2.3 Re-indexing
Feature 3: Monitoring
Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export
4.1 Schema management
4.2 Dynamic fields
4.3 Data types
4.4 Data import & export
Feature 5: Search and Indexing features
5.1 Analyzers, Tokenizers and Token filters
5.2 Faceting
5.3 Auto Suggestion
5.4 Highlighting
Feature 6: Multilingual support
Feature 7: Protocol & API Support
7.1 Request and Response formats
7.2 External Integrations
7.3 Protocols Support
Feature 8: High Availability
8.1 Replication
8.2 Failover
Feature 9: Scaling
Feature 10: Customization
Feature 11: More
11.1 Client libraries
Feature 12: Cost
Conclusion
8K Miles 2007-2015 1-855-8KMILES (855-856-4537) info@8kmiles.com
Smackdown
Feature | Apache Solr | Elasticsearch | Amazon CloudSearch

Admin Operations
Backup | Replication/Custom handler/Custom scripts | Snapshot API/Custom scripts | Fully-managed
Patch Management | Manual/Automated via custom scripts | Manual/Automated via custom scripts | Fully-managed
Re-indexing | Manual | Manual | Fully-managed; manual option available from management console
Monitoring | If hosted on EC2, Amazon CloudWatch; SaaS monitoring tools like NewRelic, Stackdriver, Datadog | If hosted on EC2, Amazon CloudWatch; SaaS monitoring tools like NewRelic, Stackdriver, Datadog | CloudSearch default metrics
Maintenance | External managed service | External managed service | Fully-managed

API
Client Library | Java, PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, JavaScript | Java, Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby | Amazon SDK
HTTP RESTful API | Yes | Yes | Yes
Request Format | XML, JSON, CSV | XML, JSON | XML, JSON
Response Format | XML, JSON, CSV | XML, JSON | XML, JSON
Third party Integrations | Available for Commercial and Open source | Available for Commercial and Open source | Amazon Web Services Integrations available

Search Functions
Schema | Schema and Schema-less | Schema and Schema-less | Schema
Dynamic fields support | Yes | Yes | Yes
Synonyms | Yes | Yes | Yes
Multiple indexes | Yes | Yes | No
Faceting | Yes | Yes | Yes
Rich documents support | Yes | Yes | No
Auto Suggest | Yes | Yes | Yes
Highlighting | Yes | Yes | Yes
Query parser | Standard, DisMax, Extended DisMax, Other parsers | Standard, query_string, DisMax, match, multi_match | Simple, structured, Lucene, or DisMax
Geosearch | Yes | Yes | Yes
Analyzers, Tokenizers and Token filters | Default/Custom | Default/Custom | Default
Fuzzy Logic | Yes | Yes | Yes
Did you mean | Default/Custom | Default/Custom | No
Stopwords | Yes | Yes | Yes
Customization | Yes | Yes | No

Advanced
Cluster management | ZooKeeper | in-built | Fully-managed
Scaling | Vertical scaling/Horizontal scaling | Vertical scaling/Horizontal scaling | Fully-managed horizontal scaling
Replication | Yes | Yes | Yes
Sharding | Yes | Yes | Yes
Failover | Yes, if set up in Cluster Replica mode | Yes, if set up in Cluster Replica mode | Fully-managed
Fault tolerant | Yes, if set up in Cluster mode | Yes, if set up in Cluster mode | Fully-managed

Import and Export
Data import | Default import handlers, custom import handlers | Rivers modules, Logstash input plugins, custom programs | Batch upload
Data export | Default export handlers, custom export handlers | Snapshot API | Custom program

Others
Web Interface | Solr Admin | Sense | AWS Management Console
Introduction
In today's world of vast and available information, a good search experience is central to a good user
experience. Hence, delivering effective search tools has become a key goal of software
products, marketplaces, e-commerce websites, and content management systems. Developers
looking to deliver a premium search experience to their users should be aware of two broad
trends:
1) Open source and platform-based search engines are replacing proprietary search engines because
of better licensing models and community support.
2) The cloud delivery model is succeeding over the on-premise delivery model because of scalability,
high availability and operating expense.
In light of the above trends, the choice of leading candidates for search technology boils down to
three: Apache Solr, Elasticsearch, and Amazon CloudSearch. At 8KMiles, our clients often ask us how
these three choices compare with each other. This report aims to make it easy for developers
to pick the right technology for their application by presenting a comprehensive framework for
evaluating the three options. We applied our framework to the top feature sets that are
critical to any search workload, broke them down further into granular features, and
compared each of the three search engines on them.
In this report, we summarize our conclusions and present them in a smackdown-style summary
card. We encourage our readers to run a more in-depth evaluation for their specific use cases.
Search features 1 - 1 comparison
This section discusses the search features in detail and how each is supported in Apache Solr,
Elasticsearch, and Amazon CloudSearch.
The table below lists the most influential features in our assessment of the three search
engines. These features are identified and grouped according to the operations of a search
application.
Feature group | Features covered
Getting started | Server setup, search engine installation and configuration
Operations | Backup, patches, re-indexing, monitoring
Indexing, Search and Query | Schema management, field types, dynamic fields, data import/export, analyzers, did you mean, facets, auto complete, spatial
High Availability | Replica, Failover, Self-healing clusters
Scaling | Read scaling, write scaling, partitioning options, replication options
Protocol & API Support | Request and response formats and support, protocols supported, external integrations
Customization | Data field types, functions
Others | Supported programming languages, administration interface
Cost | Infrastructure cost, support cost, on-going management cost, licensing cost, talent cost
Feature 1: Getting Started
‘Getting Started’ is the first step an engineer takes to understand the basics of a product's major
features. In this section, we look at how each of the three search engines facilitates ‘Getting
Started’.
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch require end users to spend considerable time understanding and
setting up the respective search engines. The “Getting Started” manuals of Apache Solr and
Elasticsearch assume that the end user has at least a basic knowledge of search engines, their related
functions and architecture.
The installation processes for Apache Solr and Elasticsearch include tasks such as:
• Server setup
• Search engine download
• Dependent software installations
• Setup of environmental requirements
• Understanding of basic server commands
• Administrative access
Apache Solr and Elasticsearch ship with test examples that allow users to run “warm up”
search and indexing operations. The default test schema in Apache Solr is sufficient for the
user to get started, while Elasticsearch’s schema-less design allows the user to send document
requests without defining any schema.
Amazon CloudSearch
If you already have an Amazon Web Services (AWS) account set up, you can create a CloudSearch
domain in a few clicks using the AWS Management console. The AWS CloudSearch Management
console guides administrators through a step-by-step process, requesting the user input:
• Instance type
• High availability options
• Replication options
• Schema definitions
• Access policies
Note that Amazon CloudSearch does not require all of this information up front. The
CloudSearch domain name and engine type are enough to create a CloudSearch instance.
The other configurations such as schema, instance type, access policies, and high availability
options can be modified at a later time based on the application requirements.
Hardware provisioning, software installation, configuration, cluster setup and other
administration activities are abstracted away from users.
Users receive two administrative regional endpoints: a search endpoint and a document endpoint.
Both endpoints can be accessed using the RESTful API or the AWS Software Development Kit (SDK) with
Identity and Access Management (IAM) credentials.
Another important note is that the default CloudSearch access policies for the document service and
search service endpoints are configured to block all IP addresses. Developers should authorize the
IP addresses that need access to the CloudSearch endpoints.
CloudSearch also provides a sample dataset, the IMDB movies, which can be used to test drive the
CloudSearch service. The CloudSearch developer documentation walks through the steps to
launch a test domain using the sample IMDB dataset.
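As a rough illustration of the two endpoints described above, the sketch below only builds the request URLs for the search and document services; nothing is sent over the network. The endpoint host names passed in are placeholders, and the `2013-01-01` version segment and the `/search` and `/documents/batch` paths are assumptions about the CloudSearch API layout rather than details taken from this report.

```python
from urllib.parse import urlencode

API_VERSION = "2013-01-01"  # assumed CloudSearch API version segment


def search_url(search_endpoint: str, query: str, size: int = 10) -> str:
    """Build a GET URL for a simple query against the search endpoint."""
    params = urlencode({"q": query, "size": size})
    return f"https://{search_endpoint}/{API_VERSION}/search?{params}"


def batch_url(doc_endpoint: str) -> str:
    """Build the POST URL used for document batch uploads."""
    return f"https://{doc_endpoint}/{API_VERSION}/documents/batch"
```

A caller would pass the search and document endpoint host names shown for the domain in the management console.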
Conclusion
Apache Solr and Elasticsearch expect users to have basic practical knowledge of the search engine
and to complete several significant tasks before the first step, ‘Getting Started’, is accomplished.
In Amazon CloudSearch, the ‘Getting Started’ activities are easier, and end users can have a
CloudSearch instance up and running with a few clicks, in a few minutes.
Feature 2: Operations and Management
In this section, we’ll discuss some important administrative operations, such as:
• Index backup
• Patch management
• Re-indexing and recovery
2.1 Backup
Data backup is a routine operation, carried out on a defined schedule. It is an
essential task for recovering data quickly from failures such as hardware crashes, data corruption
or related events.
Apache Solr
Apache Solr provides a feature called ‘ReplicationHandler’. The main objective of
ReplicationHandler is to replicate index data to slave servers, but it can also be used to maintain a
backup copy. A replication slave node can be attached to the Solr master and dedicated to serving
as a backup server, with no other operations taking place on that node.
Solr’s built-in support for replication also allows ReplicationHandler to be used as a backup API.
The API has optional parameters like location, name of snapshot, and number of backups. The backup
API can only store snapshots on a local disk; any other storage option requires
customization.
If you are required to store the backups in a different location like Amazon’s Simple Storage
Service (S3), a local storage server, or a remote data center, ReplicationHandler has to be
customized. The Solr core libraries are open source, which allows for any such customization.
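The backup call described above can be sketched as a URL builder. This is a minimal sketch, assuming the ReplicationHandler is mounted at `/replication` on the core and accepts `command=backup` with optional `location`, `name` and `numberToKeep` parameters; the host, core name and paths are hypothetical.

```python
from urllib.parse import urlencode


def solr_backup_url(base_url, core, location=None, name=None, number_to_keep=None):
    """Build the ReplicationHandler URL that triggers an index backup."""
    params = {"command": "backup"}
    if location is not None:
        params["location"] = location          # directory on the local disk
    if name is not None:
        params["name"] = name                  # snapshot name
    if number_to_keep is not None:
        params["numberToKeep"] = number_to_keep  # how many backups to retain
    return f"{base_url.rstrip('/')}/{core}/replication?{urlencode(params)}"
```

A scheduled script could request this URL with any HTTP client and then copy the resulting snapshot directory to S3 or another store, which is the customization step the text mentions.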
Elasticsearch
Elasticsearch provides an advanced option called ‘Snapshot API’ for backing up the entire cluster.
The API will back up the current cluster state and related data and save it to a shared repository.
The first backup is a complete copy of the data. Subsequent backups are incremental: each one
stores only the delta between the current data and the previous snapshots.
Elasticsearch requires end users to register a repository, whose type can be a shared
file system or one of the following:
• Amazon S3
• Hadoop Distributed File System (HDFS)
• Azure Cloud
This integration gives developers greater flexibility in managing their backups.
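The two Snapshot API steps (register a repository, then take a snapshot) can be sketched as plain request descriptions. The repository name, snapshot name and location below are hypothetical, and the `fs` repository type and `wait_for_completion` flag are stated here as assumptions about the Elasticsearch REST API; the functions only describe the requests and do not send them.

```python
import json


def register_repository(repo, location):
    """Return (method, path, body) for registering a shared-file-system repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return "PUT", f"/_snapshot/{repo}", json.dumps(body)


def create_snapshot(repo, snapshot, wait=True):
    """Return (method, path, body) for snapshotting the cluster into a repository."""
    path = f"/_snapshot/{repo}/{snapshot}"
    if wait:
        # Ask the server to respond only after the snapshot has finished.
        path += "?wait_for_completion=true"
    return "PUT", path, ""
```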
Backup Process
The backup options present in Apache Solr and Elasticsearch can be executed manually or can be
automated. To automate the entire backup process, one has to write custom scripts that call the
relevant API or handler. Most engineering companies follow this model of writing custom scripts
for backup automation.
Backup also involves maintaining the latest snapshots and archives. The management
work includes key tasks like snapshot retrieval, archival, and expiration.
In an alternate approach, if the Solr or Elasticsearch cluster is set up in a cluster replication mode,
one of the slave nodes can be designated as the backup server. Automating backups from that
node still requires a script written by the developer.
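The snapshot maintenance tasks above (retention, archival, expiration) are the part most custom scripts end up implementing. The following is a minimal sketch of such a policy; the thresholds and the keep/archive/expire categories are illustrative assumptions, not a policy recommended by either search engine.

```python
from datetime import datetime, timedelta


def plan_retention(snapshots, now, keep_latest=3, archive_after_days=7, expire_after_days=30):
    """Classify snapshots (name -> creation time) as keep, archive or expire."""
    # Newest first, so that rank 0 is the most recent snapshot.
    ordered = sorted(snapshots.items(), key=lambda item: item[1], reverse=True)
    plan = {"keep": [], "archive": [], "expire": []}
    for rank, (name, created) in enumerate(ordered):
        age = now - created
        if rank < keep_latest or age < timedelta(days=archive_after_days):
            plan["keep"].append(name)      # recent: keep on fast storage
        elif age < timedelta(days=expire_after_days):
            plan["archive"].append(name)   # older: move to cheaper storage
        else:
            plan["expire"].append(name)    # too old: delete
    return plan
```

A scheduler (for example cron) would call this after each backup and then act on the returned lists with whatever storage API is in use.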
Amazon CloudSearch
Amazon CloudSearch takes care of the data that is stored and indexed, leaving a
lighter load for engineering and operations teams. Amazon CloudSearch manages all data
backups itself, and the backups are maintained behind the scenes. In the
event of a hardware failure or other problem, Amazon CloudSearch restores the backup
automatically, and this process is not exposed to end users.
Conclusion
By default, Apache Solr can only back up to a ‘local disk’ store; it does not offer the
other storage options that Elasticsearch does. However, engineers can write their own handlers
to manage the backup process.
Elasticsearch is packaged with plugins for multiple storage options, which is an added advantage for
engineers.
Amazon CloudSearch relieves users of the intricacies of backup and its management.
The IT operations or managed service team has a smaller role in the backup process, as
the entire operation is managed behind the scenes by CloudSearch.
2.2 System upgrades and patch management
Patch management and system upgrades, such as OS patches and fixes, are inevitable in operations and
administration. Every system eventually needs version upgrades, OS maintenance, and
hardware or software changes.
Rolling Restarts
Apache Solr and Elasticsearch both recommend using ‘Rolling Restarts’ for patch management,
operating system upgrades and other fixes. A Rolling Restart involves stopping and starting each
node in the cluster sequentially. Each node is updated in turn with the latest code, fixes, or
patches, while the rest of the cluster continues to serve search
requests. Rolling Restarts are adopted when high availability is mandatory and downtime is not
acceptable.
Sometimes, Rolling Restarts require careful decision making based on the cluster
topology. If a cluster consists of shards and replicas, the order in which the nodes are restarted
must be chosen deliberately.
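The Rolling Restart procedure can be sketched as a small driver loop. This is a generic sketch, not a tool shipped with either search engine: `restart` and `is_healthy` are hypothetical callbacks supplied by the operator, and the node order is assumed to have been chosen beforehand from the cluster topology.

```python
def rolling_restart(nodes, restart, is_healthy, max_checks=10):
    """Restart nodes one at a time, waiting for each to report healthy."""
    restarted = []
    for node in nodes:
        restart(node)
        # A real script would sleep between health checks; omitted here.
        for _ in range(max_checks):
            if is_healthy(node):
                break
        else:
            # Stop the rollout so that the remaining nodes keep serving traffic.
            raise RuntimeError(f"{node} did not recover; aborting rollout")
        restarted.append(node)
    return restarted
```

Aborting on the first unhealthy node is the design choice that preserves availability: at most one node is ever out of service because of the rollout.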
Apache Solr
Apache’s ZooKeeper service runs as a stand-alone application and is not upgraded
automatically when Apache Solr is upgraded; it should be upgraded manually at the same time.
Elasticsearch
Elasticsearch recommends disabling the ‘shard allocation’ configuration during a node restart. This
tells Elasticsearch not to re-balance missing shards; otherwise the cluster immediately
starts reacting to the node loss by re-allocating its shards.
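Disabling and re-enabling shard allocation amounts to sending a small settings document to the cluster before and after each node restart. The sketch below builds that document; the `cluster.routing.allocation.enable` setting name and the `transient` scope are assumptions about the Elasticsearch cluster settings API and should be checked against the version in use.

```python
import json


def allocation_setting(enabled: bool) -> str:
    """Body for a cluster-settings update that toggles shard allocation."""
    value = "all" if enabled else "none"
    # "transient" settings are assumed not to survive a full cluster restart.
    return json.dumps({"transient": {"cluster.routing.allocation.enable": value}})
```

A restart script would send `allocation_setting(False)` before stopping a node and `allocation_setting(True)` once the node has rejoined the cluster.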
Amazon CloudSearch
Amazon CloudSearch internally manages all patches and upgrades related to its operating system.
Because CloudSearch is a managed search service, new features and upgrades are rolled out by
Amazon and become immediately available to all customers without any
action on their part.
Conclusion
Patch management in Apache Solr and Elasticsearch has to be carried out manually using
Rolling Restarts. Customers can automate this process by developing custom scripts for
system upgrades and patch management.
Patch management in Amazon CloudSearch is transparent to customers. The upgrades and
patches applied to Amazon CloudSearch are announced in the ‘What’s New’ section of the
CloudSearch documentation.
2.3 Re-indexing
Any business application changes over its lifetime, as the business running it changes. The business
change has a direct effect on the data structure of the system’s persistent information store. The
search engine, which is seen as a secondary or alternate store, will eventually have to change its data
structure when required. Any changes to the search engine data structure will require a re-indexing
of the data.
Example: A product company starts collecting ‘feedback’ from its customers for a given product.
The text string from the new field ‘feedback’ needs to be added to the search schema, and may
require re-indexing.
If the search data is not re-indexed after a structural change, the data that has already been indexed
could become inaccurate and search results may differ from what is expected.
Re-indexing becomes a necessary process over time as the application grows. It is a
common and mandatory admin operation, executed periodically based on application
requirements.
Apache Solr
Apache Solr recommends re-indexing if there is a change in your schema definitions. The options
below are widely used by the Apache Solr user community.
• Create a fresh index with new settings. Copy all of the documents from the old index to
the new one.
• Configure Data import handler with ‘SolrEntityProcessor’. The SolrEntityProcessor imports
data from Solr instances or cores for a given search query. The SolrEntityProcessor has a
limitation where it can only copy fields that are stored in the source index.
• Configure Data import handler with the original data source and push the data
afresh to the new index.
Elasticsearch
Elasticsearch proposes several approaches for data re-indexing. The following approaches are
usually combined:
• Use Elasticsearch’s Scan and Scroll and Bulk APIs to fetch and push data into the new
index.
• Update or create an index alias with the old index name and delete the old index.
• Use open source Elasticsearch plugins that can extract all data from the cluster and re-index
the data. Most of these plugins internally use the Scan and Scroll and Bulk API (as
mentioned above) which reduces development time.
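The first two approaches above can be sketched as the request bodies a re-indexing script would produce: a newline-delimited Bulk body for the documents read from the old index, and an alias update that repoints the alias at the new index. Index, type and alias names are hypothetical, and the `_type` field reflects the Elasticsearch versions contemporary with this report; it is an assumption that may not hold for later releases.

```python
import json


def bulk_body(index, doc_type, docs):
    """Render (doc_id, source) pairs as a newline-delimited Bulk API body."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))   # action line
        lines.append(json.dumps(source))   # document line
    return "\n".join(lines) + "\n"         # the body must end with a newline


def alias_swap(alias, old_index, new_index):
    """Body for an alias update that moves an alias to the new index."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}
```

The script would page through the old index with Scan and Scroll, send each page through `bulk_body`, and apply `alias_swap` only after the last page has been indexed.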
Amazon CloudSearch
Amazon CloudSearch recommends rebuilding the index when index fields are added or modified.
Amazon CloudSearch expects users to issue an indexing request after a configuration change. Whenever
there is a configuration change, the CloudSearch domain status changes to ‘NEEDS INDEXING’.
During the index rebuilding, the domain's status changes to ‘PROCESSING’, and upon completion
the status is changed to ‘ACTIVE’.
Amazon CloudSearch can continue to serve search requests during the indexing process, but the
configuration changes are not reflected in the search results until indexing completes. The time
the re-indexing process takes is directly proportional to the volume of data
in your index.
Amazon CloudSearch also allows document uploads while indexing is in progress, but the updates
can become slower if there is a large volume of document updates. In such a scenario, the
uploads or updates can be throttled or paused until the Amazon CloudSearch domain returns to
an ‘ACTIVE’ state.
Customers can initiate re-indexing by issuing the index-documents command using RESTful API,
AWS command line interface (CLI), or AWS SDK. They can also initiate re-indexing from the
CloudSearch management console.
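The status flow described above (‘NEEDS INDEXING’, ‘PROCESSING’, ‘ACTIVE’) can be sketched as a small helper around an SDK client. This sketch assumes a client object shaped like the AWS SDK for Python CloudSearch client, with `describe_domains` and `index_documents` methods, and assumes the domain status carries `RequiresIndexDocuments` and `Processing` flags; these names are assumptions and should be verified against the SDK documentation.

```python
def domain_state(status):
    """Map domain status flags to the console states named in the text."""
    if status.get("RequiresIndexDocuments"):
        return "NEEDS INDEXING"
    if status.get("Processing"):
        return "PROCESSING"
    return "ACTIVE"


def reindex_if_needed(client, domain_name):
    """Issue an index-documents request only when the domain needs one."""
    response = client.describe_domains(DomainNames=[domain_name])
    state = domain_state(response["DomainStatusList"][0])
    if state == "NEEDS INDEXING":
        client.index_documents(DomainName=domain_name)
    return state
```

Passing the client in as a parameter keeps the helper testable with a stand-in object and leaves credential handling to the caller.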
Conclusion
Re-indexing in Apache Solr and Elasticsearch is mostly a manual process because it requires a
decision that factors in data size, current request load, and offline hours.
Amazon CloudSearch manages the re-indexing process itself and leaves much less to
administrators. The duration of re-indexing is abstracted and not disclosed to administrators, but
Amazon CloudSearch runs the re-indexing process based on the best practices mentioned above.
Feature 3: Monitoring
Monitoring server health is an essential daily task for operations and administration. In this section,
we will describe the built-in monitoring capabilities for all three search engines.
Apache Solr
Apache Solr has a built-in web console for monitoring indexes, performance metrics, information
about index distribution and replication, and information on all threads running in the Java Virtual
Machine (JVM) at the time.
For more detailed monitoring, Java Management Extensions (JMX) can be configured with Solr
to expose runtime statistics as MBeans. The Apache Solr JVM container has built-in
instrumentation that enables monitoring using JMX.
Elasticsearch
Elasticsearch has a management and monitoring plugin called ‘Marvel’. Marvel has an interactive
console called ‘Sense’ that helps users to interact easily with Elasticsearch nodes. Elasticsearch has
a variety of built-in APIs that emit heap usage, garbage collection stats, file descriptors, and more.
Marvel is tightly integrated with these APIs: it periodically polls them, collects statistics
and stores the data back in Elasticsearch. Marvel’s interactive graph report dashboard allows
administrators to query and aggregate historical stats data.
Amazon CloudSearch
Amazon CloudSearch recently introduced Amazon CloudWatch integration. The Amazon
CloudSearch metrics can be used to make scaling decisions, troubleshoot issues, and manage
clusters.
Amazon CloudSearch publishes four metrics into Amazon CloudWatch: SuccessfulRequests,
SearchableDocuments, IndexUtilization, and Partitions.
Alarms can be set on the CloudWatch metrics to notify administrators through
Amazon Simple Notification Service.
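An alarm of the kind described above can be sketched as the keyword arguments one would pass to a CloudWatch "put metric alarm" call. The `AWS/CloudSearch` namespace, the `IndexUtilization` metric name, the `DomainName` and `ClientId` dimensions, and the threshold values are all assumptions made for illustration; the topic ARN is a placeholder.

```python
def index_utilization_alarm(domain_name, client_id, topic_arn, threshold=80.0):
    """Build alarm parameters that notify an SNS topic on high index utilization."""
    return {
        "AlarmName": f"{domain_name}-index-utilization",
        "Namespace": "AWS/CloudSearch",       # assumed namespace
        "MetricName": "IndexUtilization",     # assumed metric name
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},
        ],
        "Statistic": "Average",
        "Period": 300,                         # seconds per datapoint
        "EvaluationPeriods": 3,                # consecutive breaches required
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],           # SNS topic to notify
    }
```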
Conclusion
Apache Solr and Elasticsearch can be monitored through built-in and external plugins. They also
support SaaS-based monitoring tools and custom plugins developed by customers.
CloudSearch’s integration with CloudWatch exposes a useful set of metrics, and it is expected to offer
newer ones in the future.
Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export
4.1 Schema management
Schema: A schema is a definition of fields and field types used by the search system to organize data
within the document files it indexes.
Schema definition is the foremost task in the search data structure design. It is important that the
schema definition caters to all business requirements and is designed to suit the application.
Apache Solr and Elasticsearch
Both Elasticsearch and Apache Solr can run the search application in ‘Schema-less’ and ‘Schema’
mode. Schema mode is suitable for application development and for production environments.
Schema-less mode is a very good option for newcomers to get started. After server setup, users can
start the application without a schema structure and have field definitions created during indexing.
However, a production-grade application requires a proper schema structure, so defining the
schema becomes a necessity.
Amazon CloudSearch
Amazon CloudSearch also allows users to set up search domains without any index fields. The
index fields can be added at any time, but must exist before any document is indexed or any search
request is served.
In addition, the CloudSearch management console integrates with Amazon Web Services like
S3 and DynamoDB, and can also read from a local machine, so that a schema can be imported directly
into a CloudSearch domain. After the schema import, CloudSearch allows the user to edit the fields or
add new fields. This is a convenient feature for a pre-built schema that is to be migrated to a
CloudSearch domain.
Conclusion
Apache Solr and Elasticsearch can be started without any schema, but they are not ready for
production use in that mode. Amazon CloudSearch allows creating domains without any index fields,
but the schema must be created before any index or search request can be served.
The general best practice in schema management is to rehearse and design the schema to suit the
application requirements before finalizing the search structure. The underlying schema concept of
all three search engines is consistent with this practice.
4.2 Dynamic fields
Dynamic fields are like regular field definitions, except that they support wildcard matching. They
allow the indexing of documents without knowing in advance which fields they contain. A dynamic field
is defined using a wildcard (*) as the first, last, or only character of the field name. All undefined
fields go through the dynamic field rules; a field whose name matches a pattern is indexed with that
dynamic field's indexing options.
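The wildcard rule above can be sketched in a few lines. This is an illustrative matcher, not the implementation used by any of the three engines; in particular, preferring the longest matching pattern is an assumption, and the field names and option values are made up.

```python
def matches(pattern, field_name):
    """True if field_name matches a pattern whose '*' is first, last or alone."""
    if pattern == "*":
        return True
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    return pattern == field_name


def resolve_field(field_name, declared, dynamic):
    """Return indexing options for a field, or None if nothing matches."""
    if field_name in declared:
        return declared[field_name]  # explicitly declared fields win
    # Assumption: the most specific (longest) matching pattern is preferred.
    for pattern in sorted(dynamic, key=len, reverse=True):
        if matches(pattern, field_name):
            return dynamic[pattern]
    return None
```

For example, with dynamic rules `*_i`, `attr_*` and `*`, a document field named `count_i` resolves to the options configured for `*_i`.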
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch allow end users to set up dynamic fields and rules using RESTful API
and schema configuration.
Amazon CloudSearch
In Amazon CloudSearch, dynamic fields can be configured using indexing options in the
CloudSearch management console, the CloudSearch RESTful API, or the AWS SDK.
Conclusion
If you are unsure about the schema structure or exact field names, dynamic fields come in handy.
Amazon CloudSearch, Apache Solr, and Elasticsearch all allow the flexibility to configure dynamic
fields. This helps the application development team to describe any omitted field definitions in the
schema document.
4.3 Data types
There are a variety of data types supported by these search engines. The table below illustrates the
data field types supported by each search engine.
Data type | Solr | Elasticsearch | CloudSearch
String / Text | Yes | Yes | Yes
Number types | integer, double, float, long | byte, short, integer, long, float, double | integer, double
Date types | Yes | Yes | Yes
Enum fields | Yes | Yes | No
Currency | Yes | No | No
Geo location / Latitude – Longitude | Yes | Yes | Yes
Boolean | Yes | Yes | No
Array types | Yes | Yes | Yes
Conclusion
The most important data types like string, date, and number types are supported by all three
search engines. Geo location data type, which is now regularly used by modern applications, is also
supported by all search engines.
Engineers and developers may use an alternate data type if a particular data type is not supported
by their chosen search engine. For example, the ‘currency’ data type supported in Solr is not
available in Elasticsearch and CloudSearch. In such cases, engineers use a number type as an
alternative data type for ‘Currency’.
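One common way to use a number type in place of ‘Currency’ is to index amounts as integer minor units (for example cents), which avoids floating-point rounding. The sketch below illustrates that convention; the two-digit exponent and the rounding mode are assumptions that depend on the currency and the application.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Convert a decimal amount string to integer minor units for indexing."""
    scaled = Decimal(amount) * (10 ** exponent)
    return int(scaled.quantize(Decimal(1), rounding=ROUND_HALF_UP))


def from_minor_units(value: int, exponent: int = 2) -> Decimal:
    """Convert an indexed integer back to a decimal amount for display."""
    return Decimal(value) / (10 ** exponent)
```

The currency code itself would be stored in a separate string field, since the integer carries no unit.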
4.4 Data import & export
The most important task in search application development is data migration from the originating
source to the search engine. The originating data source can be a database, a file
system or another persistent store. To build the initial search data set, the
full data set must be migrated or imported from its origin to the search engine.
Likewise, extracting data from a search engine and exporting it to a different destination is
also a crucial task, though it is executed only occasionally.
Apache Solr
Apache Solr has a built-in handler called the Data import handler (DIH). The DIH provides a tool for
migrating and/or importing data from the origin store. The DIH can index data from data sources
such as:
• Relational Database Management System (RDBMS)
• Email
• HTTP URL end point
• Feeds like RSS and ATOM
• Structured XML files
The DIH also offers more advanced features, such as Apache Tika integration, delta import, and
transformers, to migrate data quickly.
The Apache Solr export handler can export query results in JavaScript Object Notation (JSON) or
comma-separated values (CSV) format. An export query expects sort and filter query parameters and
returns only stored fields. Users also have the option of developing a custom export handler and
incorporating it into the Solr core libraries.
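As a rough illustration of the delta import mentioned above, the sketch below builds the request URL that triggers a DIH command. It assumes the handler is registered at the conventional `/dataimport` path; the host, port, and the core name `products` are placeholders rather than values from this report.

```python
from urllib.parse import urlencode


def dih_command_url(base_url, core, command, clean=False, commit=True):
    """Build a Data Import Handler request URL for one Solr core."""
    params = {
        "command": command,            # e.g. "full-import" or "delta-import"
        "clean": str(clean).lower(),   # whether to clear the index before importing
        "commit": str(commit).lower(), # whether to commit once the import finishes
    }
    # Assumption: the handler is mapped to /dataimport in solrconfig.xml.
    return f"{base_url}/{core}/dataimport?{urlencode(params)}"


url = dih_command_url("http://localhost:8983/solr", "products", "delta-import")
print(url)
# http://localhost:8983/solr/products/dataimport?command=delta-import&clean=false&commit=true
```

Keeping `clean` off for a delta import avoids wiping documents that have not changed since the last run.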
Elasticsearch
Elasticsearch ‘Rivers’ is a pluggable service that runs inside the Elasticsearch cluster. It can be
configured to pull or push the data that is indexed into the cluster.
Some of the popular river modules are CouchDB, Dropbox, DynamoDB, FileSystem, Java Database
Connectivity (JDBC), Java Message Service (JMS), MongoDB, Neo4j, Redis, Solr, Twitter, and
Wikipedia.
However, ‘Rivers’ will be deprecated in a newer release of Elasticsearch, which instead recommends
using the official client libraries built for popular programming languages. Alternatively, Logstash
input plugins can also be used to ship data into Elasticsearch.
For data export, an Elasticsearch snapshot can copy individual indices or an entire cluster into a
remote repository. This is discussed in detail in the section ‘Operations and Management -
Backup’.
Amazon CloudSearch
Amazon CloudSearch recommends uploading documents to a CloudSearch domain in batches. A batch is a
collection of add, update, and delete operations described in JSON or XML format.
Amazon CloudSearch limits a single batch to 5 MB, but allows batches to be uploaded in parallel to
reduce the time needed for a full data upload. The number of parallel uploads a domain can sustain
depends on the CloudSearch instance type: larger instance types have a higher upload capacity and
smaller instance types a lower one. Batch upload programs should therefore throttle their uploads
according to instance capacity.
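The 5 MB batch limit described above can be respected with a small helper that groups operations by encoded size. This is a minimal sketch, not an official client: the function name and the sample documents are hypothetical, and the upload call itself is left out.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5 MB per batch, as stated above


def split_into_batches(operations, max_bytes=MAX_BATCH_BYTES):
    """Group add/delete operations into JSON batches no larger than max_bytes."""
    batches, current, current_size = [], [], 2  # 2 bytes for the enclosing "[]"
    for op in operations:
        # Compact encoding; +1 accounts for the comma separating operations.
        op_size = len(json.dumps(op, separators=(",", ":")).encode("utf-8")) + 1
        if 2 + op_size > max_bytes:
            raise ValueError("a single operation exceeds the batch size limit")
        if current and current_size + op_size > max_bytes:
            batches.append(current)
            current, current_size = [], 2
        current.append(op)
        current_size += op_size
    if current:
        batches.append(current)
    return [json.dumps(batch, separators=(",", ":")) for batch in batches]


# Hypothetical documents; a tiny limit is used so the split is visible.
ops = [{"type": "add", "id": str(i), "fields": {"title": "x" * 50}} for i in range(10)]
small_batches = split_into_batches(ops, max_bytes=300)
print(len(small_batches), "batches")
```

The resulting batches can then be handed to a bounded pool of workers, with the pool size chosen according to the capacity of the domain's instance type.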
Conclusion
Apache Solr has good handlers for importing and exporting data. If the available options are not
suitable, Apache Solr allows one to develop a new custom handler or customize an existing handler
for data import and export.
Elasticsearch integrates with popular data sources in the form of ‘River’ modules or plugins.
However, for future versions Elasticsearch strongly recommends using Logstash input plugins, or
developing and contributing a new Logstash input, since plugins can be customized.
Amazon CloudSearch does not have options as elaborate as those of the other two search engines.
However, by combining custom programs with the bulk upload recommendations for Amazon CloudSearch,
customers can successfully migrate data into CloudSearch.
Feature 5: Search and Indexing features
In this section, we evaluate the ‘Search and Indexing’ features of the three search engines. This
feature set is very important because search application engineers use it widely.
5.1 Analyzers, Tokenizers and Token filters
Generally speaking, a search engine prepares text strings for indexing and searching using
analyzers, tokenizers, and filters. These components are configured for indexing and for searching
the data, and most of the time they are composed into a sequential chain.
• During indexing and querying, the analyzer examines the field text and the tokenizer splits
each block of text into individual terms (tokens). Each token is a sub-sequence of the
characters in the text.
• Each token filter processes the tokens in the stream sequentially and applies its filter
functionality.
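The chain described above can be pictured as a tokenizer followed by an ordered list of token filters. The following is a toy sketch in plain Python, not the Lucene implementation; all function names and the stopword list are illustrative.

```python
import re


def whitespace_tokenizer(text):
    """Split text into tokens on runs of whitespace."""
    return re.findall(r"\S+", text)


def lowercase_filter(tokens):
    """Normalize every token to lower case."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"a", "an", "the", "of"})):
    """Drop tokens that appear in the (illustrative) stopword set."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, tokenizer, filters):
    """Run the tokenizer, then apply each filter to the token stream in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Price of a Laptop", whitespace_tokenizer,
              [lowercase_filter, stop_filter]))
# ['price', 'laptop']
```

Filter order matters: here the stop filter only works because lowercasing runs first.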
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch have extensive built-in libraries of analyzers, tokenizers, and
token filters. These libraries are packaged with the search engine installation and can be
configured for indexing and searching.
Although analyzers can be configured for both indexing and querying, the same chain does not need
to be used for both operations. Indexing and searching can be configured with different tokenizers
and filters, as their goals can differ.
Tokenizers and filters by search engine
Apache Solr
Tokenizers: Standard, Classic, Keyword, Letter, Lower Case, N-Gram, Edge N-Gram, ICU, Path
Hierarchy, Regular Expression Pattern, UAX29 URL Email, White Space
Filters: ASCII Folding, Beider-Morse, Classic, Common Grams, Collation Key, Daitch-Mokotoff
Soundex, Double Metaphone, Edge N-Gram, English Minimal Stem, Hunspell Stem, Hyphenated Words,
ICU Folding, ICU Normalizer 2, ICU Transform, Keep Words, KStem, Length, Lower Case, Managed
Stop, Managed Synonym, N-Gram, Numeric Payload Token, Pattern Replace, Phonetic, Porter Stem,
Remove Duplicates Token, Reversed Wildcard, Shingle, Snowball Porter, Stemmer, Standard, Stop,
Suggest Stop, Synonym, Token Offset Payload, Trim, Type As Payload, Type Token, Word Delimiter
Elasticsearch
Tokenizers: Standard, Edge NGram, Keyword, Letter, Lowercase, NGram, Whitespace, Pattern, UAX
Email URL, Path Hierarchy, Classic, Thai
Filters: Standard Token, ASCII Folding Token, Length Token, Lowercase Token, Uppercase Token,
NGram Token, Edge NGram Token, Porter Stem Token, Shingle Token, Stop Token, Word Delimiter
Token, Stemmer Token, Stemmer Override Token, Keyword Marker Token, Keyword Repeat Token, KStem
Token, Snowball Token, Phonetic Token, Synonym Token, Compound Word Token, Reverse Token,
Elision Token, Truncate Token, Unique Token, Pattern Capture Token, Pattern Replace Token, Trim
Token, Limit Token Count Token, Hunspell Token, Common Grams Token, Normalization Token, CJK
Width Token, CJK Bigram Token, Delimited Payload Token, Keep Words Token, Keep Types Token,
Classic Token, Apostrophe Token
Amazon CloudSearch
Amazon CloudSearch analysis scheme configuration is used for analyzing text data during indexing.
The analysis schemes basically control:
• Text field content processing
• Stemming
• Inclusion of stopwords and synonyms
• Tokenization (Japanese language)
• Bigrams (Chinese, Japanese, Korean languages)
The following analysis options are applied when text fields are configured with an analysis scheme:
1. Algorithmic stemming: The level of algorithmic stemming to perform (minimal, light, or full).
The available stemming levels vary depending on the analysis scheme language.
2. Stemming dictionary: A dictionary that overrides the results of the algorithmic stemming.
3. Japanese tokenization dictionary: A dictionary that specifies how particular characters
should be grouped into words (Japanese language only).
4. Stopwords: A set of terms that should be ignored both during indexing and at search time.
5. Synonyms: A dictionary of words that have the same meaning in the text data.
Before processing the analysis scheme, Amazon CloudSearch tokenizes and normalizes the text
data. During tokenization, the text data is split into multiple tokens; this is common behavior in all
search engine text processing. During normalization, upper case characters are converted to lower
case, and more formatting is applied.
After the tokenization and normalization processes are completed, stemming, stopwords, and
synonyms are applied.
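To make the options above concrete, here is a hedged sketch of what an analysis scheme definition might look like when prepared in Python. The field names follow the CloudSearch configuration API as we understand it, where each dictionary option is passed as a JSON-encoded string; the scheme name and all dictionary contents are invented for illustration.

```python
import json

# Hypothetical scheme; only the structure is meant to be illustrative.
analysis_scheme = {
    "AnalysisSchemeName": "english_products",  # assumed name
    "AnalysisSchemeLanguage": "en",
    "AnalysisOptions": {
        "AlgorithmicStemming": "light",
        # Overrides the algorithmic stemmer for specific terms.
        "StemmingDictionary": json.dumps({"running": "run"}),
        # Terms ignored during indexing and at search time.
        "Stopwords": json.dumps(["a", "an", "the"]),
        # Groups of terms treated as equivalent.
        "Synonyms": json.dumps({"groups": [["laptop", "notebook"]]}),
    },
}

print(json.dumps(analysis_scheme, indent=2))
```

A scheme like this only takes effect for the text fields that reference it, and the domain needs to be re-indexed after the scheme changes.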
Conclusion
Apache Solr and Elasticsearch are packaged with a wide variety of analyzer, tokenizer, and filter
libraries, each with a distinct function. These libraries can also be customized, which gives
developers greater flexibility.
Amazon CloudSearch does not carry tokenizer or filter libraries as sophisticated as those of Apache
Solr or Elasticsearch, but it has simplified the configuration. Amazon CloudSearch tokenizers and
filters cover the most common search requirements and use cases. This is ideal for developers who
want to quickly integrate search functionality into their application stack.
5.2 Faceting
Faceting is the arrangement of search results into categories or groups, based on indexed terms.
Faceting divides search results into sub-groups, which can be used as the basis for filters or
further searches, and it also returns a count of matching results for each facet value. For
example, facets for ‘Laptop’ search results can be ‘Price’, ‘Operating System’, ‘RAM’, or ‘Shipping
Method’.
Faceting is a popular function that helps consumers filter through search results easily and
effectively.
Apache Solr
Apache Solr offers faceting options that range from simple to very advanced behavior.
The table below details the parameters used during faceting. They can be grouped by field value,
date, range, pivot, multi-select, and interval.
Facet grouping: Parameters
Field value parameters: facet.field, facet.prefix, facet.sort, facet.limit, facet.offset,
facet.mincount, facet.missing, facet.method, facet.enum.cache.minDf, facet.threads
Date faceting parameters: facet.date, facet.date.start, facet.date.end, facet.date.gap,
facet.date.hardend, facet.date.other, facet.date.include
Range faceting parameters: facet.range, facet.range.start, facet.range.end, facet.range.gap,
facet.range.hardend, facet.range.other, facet.range.include
Pivot: facet.pivot, facet.pivot.mincount
Interval: facet.interval, facet.interval.set
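A few of the parameters listed above can be combined into a single request. The sketch below only assembles the query string; the field names (`os`, `ram`, `price`) are hypothetical, and `facet=true` is the standard switch that enables faceting even though it does not appear in the table.

```python
from urllib.parse import urlencode, parse_qs


def solr_facet_query(q, fields, numeric_range=None, mincount=1, limit=10):
    """Build a Solr query string with field facets and an optional range facet."""
    params = [
        ("q", q),
        ("facet", "true"),
        ("facet.mincount", mincount),
        ("facet.limit", limit),
    ]
    # facet.field may be repeated, once per faceted field.
    params += [("facet.field", field) for field in fields]
    if numeric_range is not None:
        field, start, end, gap = numeric_range
        params += [
            ("facet.range", field),
            ("facet.range.start", start),
            ("facet.range.end", end),
            ("facet.range.gap", gap),
        ]
    return urlencode(params)


query_string = solr_facet_query("laptop", ["os", "ram"], ("price", 0, 2000, 500))
print(query_string)
```

The string can be appended to a core's `/select` endpoint; each returned facet then lists values with their document counts.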
Elasticsearch
Elasticsearch has deprecated facets and announced that they will be removed in a future release.
The Elasticsearch team felt that the facet implementation was not designed from the ground up to
support complex aggregations, so facets will be replaced by aggregations in the next release.
Elasticsearch says “An aggregation can be seen as a unit-of-work that builds analytic information
over a set of documents. The context of the execution defines what this document set is (for
example, a top-level aggregation executes within the context of the executed query/filters of the
search request).”
Elasticsearch strongly recommends migrating from facets to aggregations. The aggregations are
classified into two main families, Bucketing and Metric.
The following table lists the aggregations available in Elasticsearch.
Elasticsearch aggregators:
Min Aggregation, Max Aggregation, Sum Aggregation, Avg Aggregation, Stats
Aggregation, Extended Stats Aggregation, Value Count Aggregation, Percentiles
Aggregation, Percentile Ranks Aggregation, Cardinality Aggregation, Geo Bounds
Aggregation, Top hits Aggregation, Scripted Metric Aggregation, Global Aggregation,
Filter Aggregation, Filters Aggregation, Missing Aggregation, Nested Aggregation,
Reverse nested Aggregation, Children Aggregation, Terms Aggregation, Significant Terms
Aggregation, Range Aggregation, Date Range Aggregation, IPv4 Range Aggregation,
Histogram Aggregation, Date Histogram Aggregation, Geo Distance Aggregation,
GeoHash grid Aggregation
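The two families combine naturally: a bucketing aggregation groups documents and a metric aggregation computes a value inside each bucket. Below is a minimal sketch of such a request body, built as a Python dictionary; the field names `operating_system` and `price` are assumptions.

```python
import json


def terms_with_avg(group_field, metric_field, size=10):
    """Request body: bucket by a field (Terms), then average a metric per bucket (Avg)."""
    return {
        "size": 0,  # return only aggregation results, no search hits
        "aggs": {
            "by_" + group_field: {
                "terms": {"field": group_field, "size": size},
                "aggs": {
                    "avg_" + metric_field: {"avg": {"field": metric_field}}
                },
            }
        },
    }


body = terms_with_avg("operating_system", "price")
print(json.dumps(body, indent=2))
```

Nesting the metric under the bucket is what facets could not express, which is the motivation the Elasticsearch team gives for the migration.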
Amazon CloudSearch
Amazon CloudSearch simplifies facet configuration: facets are enabled when defining indexing
options during CloudSearch domain configuration. They are targeted at common use cases like
e-commerce, online travel, and classifieds. Any field with a date, literal, or numeric data type
can be facet-enabled. Amazon CloudSearch also allows buckets to be defined so that facet counts are
calculated for particular subsets of the facet values.
The facet information can be retrieved in two ways:
Sort: returns facet information sorted either by facet counts or by facet values.
Buckets: returns facet information for particular facet values or ranges.
During searching, facet information can be fetched for any facet-enabled field by specifying the
“facet.FIELD” parameter in the search request (‘FIELD’ is the name of a facet-enabled field).
Amazon CloudSearch also allows multiple facets in one request, which helps to refine search
results further, as in the following example.
Example: "q=poet&facet.genres={}&facet.rating={}&facet.year={}&return=_no_fields"
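The empty `{}` values in the example request accept options. The following sketch fills them in with the sort and buckets options described above; the bucket ranges and the `size` value are made up, and only the query string is built.

```python
import json
from urllib.parse import urlencode, parse_qs


def cloudsearch_facet_query(q, facets):
    """Build a search query string with one facet.FIELD parameter per facet."""
    params = [("q", q)]
    for field, options in facets.items():
        # Each facet.FIELD value is a small JSON object of options.
        params.append(("facet." + field, json.dumps(options, separators=(",", ":"))))
    params.append(("return", "_no_fields"))
    return urlencode(params)


facet_qs = cloudsearch_facet_query("poet", {
    "genres": {"sort": "count", "size": 5},                # top values by count
    "year": {"buckets": ["[1970,1979]", "[1980,1989]"]},   # explicit ranges
})
print(facet_qs)
```

Sort and buckets are alternatives for a given field, so each facet in the request uses one or the other.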
Conclusion
All three search engines allow users to perform faceting with minimal effort. However, in terms of
an advanced complex implementation, the approaches are different for each search engine.
5.3 Auto Suggestion
When a user types a search query, suggestions relevant to the input are presented, and as the user
types more characters, the suggestions are refined. This feature is called auto-suggest. It is an
appealing and useful feature that is employed in many search user interfaces.
Auto-suggest can be implemented at the search engine level or at the search application level.
Below are some of the options available in the three search engines.
Apache Solr
Apache Solr has native support for the auto-suggest feature. It can be implemented using
NGramFilterFactory, EdgeNGramFilterFactory, or TermsComponent. Usually, this Apache Solr feature
is used in conjunction with jQuery or other asynchronous client libraries to create a responsive
auto-suggest experience in front-end applications.
Elasticsearch
Elasticsearch also supports edge n-grams, which are easy to set up, flexible, and fast.
Elasticsearch introduced a new data structure, the Finite State Transducer (FST), which resembles
a large graph. This data structure is held in memory, which makes it much faster than a term-based
query could be. Elasticsearch recommends using edge n-grams when the query input and its word
ordering are less predictable.
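Edge n-grams are simply the leading prefixes of each term, indexed so that a partial query becomes an exact lookup. A toy sketch in plain Python, independent of either engine's implementation (the function names and sample titles are illustrative):

```python
def edge_ngrams(term, min_gram=1, max_gram=10):
    """Return the prefixes of term whose lengths run from min_gram to max_gram."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_prefix_index(titles, min_gram=1, max_gram=10):
    """Map every edge n-gram of every word to the titles that contain that word."""
    index = {}
    for title in titles:
        for word in title.lower().split():
            for gram in edge_ngrams(word, min_gram, max_gram):
                index.setdefault(gram, set()).add(title)
    return index


prefix_index = build_prefix_index(["Laptop Bag", "Lamp"])
print(edge_ngrams("laptop", 1, 3))   # ['l', 'la', 'lap']
print(sorted(prefix_index["la"]))    # ['Lamp', 'Laptop Bag']
```

Because every word is expanded separately, a query can match any word in a title, which is why the approach tolerates unpredictable word order at the cost of a larger index.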
Amazon CloudSearch
Amazon CloudSearch offers ‘Suggesters’ to achieve auto-suggest. A Suggester is configured on a
particular text field. When a Suggester is queried with a search string, CloudSearch lists all
documents whose value in the Suggester field begins with that search string. Suggesters can be
configured to find matches for the exact query, or to perform fuzzy matching to correct the query
string. ‘Fuzzy Matching’ can be defined with a fuzziness level of Low, High, or Default.
Suggesters can also be configured with a SortExpression, which computes a score for each
suggestion. It is important to re-index the domain when a new Suggester is configured, because
suggestions will not be returned until all of the documents are indexed.
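The prefix-matching behavior described above can be mimicked in a few lines. This is a toy model for intuition only, not the CloudSearch implementation; the `score` callable stands in for a SortExpression, and the documents are invented.

```python
def suggest(docs, field, query, size=5, score=None):
    """Return up to `size` field values that begin with the query string."""
    prefix = query.lower()
    matches = [doc for doc in docs if doc[field].lower().startswith(prefix)]
    if score is not None:
        # Highest score first, analogous to sorting suggestions by an expression.
        matches.sort(key=score, reverse=True)
    return [doc[field] for doc in matches[:size]]


docs = [
    {"title": "Smart phone case", "sales": 40},
    {"title": "Smart watch", "sales": 90},
    {"title": "Laptop", "sales": 70},
]
print(suggest(docs, "title", "sma", score=lambda doc: doc["sales"]))
# ['Smart watch', 'Smart phone case']
```

Fuzzy matching would relax the `startswith` test to tolerate small edits in the query; that part is deliberately omitted here.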
Conclusion
Amazon CloudSearch provides a simple yet powerful ‘Suggest’ implementation, which is sufficient
for most applications. If you are looking for advanced options or further customization of
suggestions, Apache Solr and Elasticsearch offer some good options.
5.4 Highlighting
Highlighting is a way of giving formatting clues to end users in the search results. It is a valuable
feature, where the front-end search applications highlight search snippets of text from each search
result. This function conveys to end users why the result document matched their query. In this
section, we will describe the options present in all three search engines.
Apache Solr
Apache Solr includes in the query response the fragments of document text that matched the query.
These fragments are returned in a highlighted section of the response, which search clients use as
a cue for presentation. Apache Solr is packaged with good highlighting components that give
control over the text fragments, fragment size, fragment formatting, and so on. These highlighting
components can be used with Solr query parsers and request handlers.
Apache Solr comes with three highlighting utilities:
• Standard Highlighter
• FastVector Highlighter
• Postings Highlighter
Standard Highlighter is most commonly used by search engineers because it is a good choice for a
wide variety of search use-cases. The FastVector Highlighter is ideal for large documents and
highlighting text in a variety of languages. The Postings Highlighter works well for full-text
keyword search.
Elasticsearch
Elasticsearch also allows highlighting search results on one or more fields. The implementation
offers a Lucene-based standard highlighter, a fast vector highlighter, and a postings highlighter.
In Elasticsearch, the query can force a specific highlighter type, which gives developers the
flexibility to choose the highlighter that suits their requirements.
The three highlighters in Elasticsearch behave the same way as those in Apache Solr, because both
sets are inherited from Lucene.
Amazon CloudSearch
Amazon CloudSearch simplifies the highlighting by specifying the highlight.FIELD parameter in the
search request. Amazon CloudSearch returns excerpts with the search results to show where the
search terms occur within a particular field of a matching document.
For example, matches for the search terms ‘Smart Phone’ are highlighted in the description field:
"Highlights": {"description": "A *smartphone* is a mobile phone with an advanced mobile
operating system. They typically combine the features of a cell phone with those of other popular
mobile devices, such as personal digital assistant (PDA), media player and GPS navigation unit. A
*smartphone* has a touchscreen user interface and can run third-party apps, and are camera
phones."}
Amazon CloudSearch also provides controls such as the number of search term occurrences to
highlight within an excerpt, how they should be marked, and whether the excerpt is returned as
plain text or HTML.
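The controls mentioned above are passed as a JSON object in the `highlight.FIELD` parameter. A small sketch that builds such a request follows; the option names (`format`, `max_phrases`, `pre_tag`, `post_tag`) reflect our reading of the CloudSearch search API, and the `*` markers mirror the excerpt shown earlier.

```python
import json
from urllib.parse import urlencode, parse_qs


def cloudsearch_highlight_query(q, field, max_phrases=2, fmt="text",
                                pre_tag="*", post_tag="*"):
    """Build a search query string that requests highlights for one field."""
    options = {
        "format": fmt,               # "text" or "html"
        "max_phrases": max_phrases,  # occurrences to highlight per excerpt
        "pre_tag": pre_tag,          # marker placed before each match
        "post_tag": post_tag,        # marker placed after each match
    }
    return urlencode([
        ("q", q),
        ("highlight." + field, json.dumps(options, separators=(",", ":"))),
    ])


highlight_qs = cloudsearch_highlight_query("smart phone", "description")
print(highlight_qs)
```

For an HTML front end, the tags could be set to something like `<em>` and `</em>` with `fmt="html"`.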
Conclusion
From a development perspective, all three search engines provide easy and simple highlighting
implementations. If you are looking for different and more advanced highlighting options, Apache
Solr and Elasticsearch have some good features.
Feature 6: Multilingual support
Multilingualism is a very important feature for global applications which cater to non-English
speaking geographies. A leading information measurement company’s survey reveals that search
engines built with multilingual features are emerging and successful because of native language
support, and focus on the cultural background of the users.
Business Impact: Multilingual search is an effective marketing strategy for getting the attention
of consumers. In e-commerce, presenting the platform in the customer's native language creates
more opportunities to do business.
Apache Solr
Apache Solr is packaged with multilingual support for the most common languages. It carries many
language-specific tokenizer and filter libraries, which can be configured for indexing and
querying.
Apache Solr engineering forums recommend using multi-core architecture where each core
manages one language. Solr also supports language detection using Tika and LangDetect detection
features. This helps to map the text data to language-specific fields during indexing.
Elasticsearch
Elasticsearch has incorporated a vast collection of language analyzers for most commonly spoken
languages. The primary role of the language analyzer is to split, stem, filter, and apply required
transformations specific to the language.
Elasticsearch also allows a user to define a custom analyzer that can be a base extension of
another analyzer.
Amazon CloudSearch
Amazon CloudSearch has strong support for language-specific text processing. It provides
predefined default analysis schemes for 34 languages. Amazon CloudSearch processes text and
text-array fields based on the configured language-specific analysis scheme.
Amazon CloudSearch also allows a user to define a new analysis scheme that can be an extension
of the default language analysis scheme.
Conclusion
All three search engines have ample and effective support features for widely spoken
international languages.
Languages Support
The table below lists the languages supported by each search engine
Search engine Languages supported
Apache Solr
Arabic, Brazilian Portuguese, Bulgarian, Catalan, Chinese, Simplified Chinese, CJK,
Czech, Danish, Dutch, Finnish, French, Galician, German, Greek, Hebrew, Lao,
Myanmar, Khmer, Hindi, Indonesian, Italian, Irish, Japanese, Latvian, Norwegian,
Persian, Polish, Portuguese, Romanian, Russian, Scandinavian, Serbian, Spanish,
Swedish, Thai and Turkish
Elasticsearch
Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish,
Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian,
Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian,
Portuguese, Romanian, Russian, Spanish, Swedish, Thai and Turkish
Amazon CloudSearch
Arabic, Armenian, Basque, Bulgarian, Catalan, Chinese - Simplified, Chinese -
Traditional, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek,
Hindi, Hebrew, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian,
Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and
Thai
Feature 7: Protocol & API Support
7.1 Request and Response formats
Search engine Request formats Response formats
Apache Solr XML, JSON, CSV JSON, XML, CSV
Elasticsearch JSON XML, JSON
Amazon CloudSearch XML, JSON XML, JSON
7.2 External Integrations
Search engine Integrations available
Apache Solr
Drupal, Magento, Django, ColdFusion, Wordpress, OpenCMS, Plone, Typo3, ez
Publish, Symfony2, Riak, DataStax Enterprise Search, Cloudera Search, Hortonworks
Data Platform, MapR
https://wiki.apache.org/solr/IntegratingSolr#Integrating_Solr_With_Other_.28Non_Search.29_Applications
Elasticsearch
Drupal, Django, Symfony2, Wordpress, CouchBase, SearchBlox, Hortonworks Data
Platform, MapR
http://www.Elasticsearch.org/guide/en/Elasticsearch/client/community/current/integrations.html
Amazon CloudSearch
7.3 Protocols Support
Search engine Protocols support
Apache Solr HTTP, HTTPS
Elasticsearch HTTP, HTTPS
Amazon CloudSearch HTTP, HTTPS
Feature 8: High Availability
All three search engines are architected for:
• High availability (HA)
• Replication
• Scaling design principles
In this section, we discuss the high availability options present in the three search engines.
8.1 Replication
Replication is the copying or synchronizing of the search index from master nodes to slave nodes
so that the data can be managed efficiently.
Replication is a key design principle for both high availability and scaling. From a high
availability perspective, replication enables failover from master nodes (shards or leaders) to
slave nodes (replicas). From a scaling perspective, replication is used to add slave or replica
nodes when request traffic increases.
Apache Solr
Apache Solr supports two models of replication: legacy mode and SolrCloud. In legacy mode, the
replication handler copies data from the master node's index to the slave nodes. The master
server manages all index updates and the slave nodes handle read queries. This separation of
master and slaves allows Solr clusters to scale to heavy loads.
SolrCloud is an advanced distributed cluster setup of Solr nodes designed for high availability
and fault tolerance. Unlike legacy mode, there is no explicit concept of "master/slave" nodes.
Instead, the search cluster is split into leaders and replicas. The leader is responsible for
ensuring that its replicas are updated with the same data it stores. Apache Solr has a setting
called ‘numShards’, which defines the number of shards (leaders). During start-up, the core index
is split across ‘numShards’ shards, and the shards are represented as leaders. Nodes that join
the Solr cluster after the initial ‘numShards’ nodes are automatically assigned as replicas of
the leaders.
Elasticsearch
Elasticsearch follows a similar concept to SolrCloud. In brief, an Elasticsearch index can be split
into multiple shards, and each shard can be replicated to any number of nodes (0, 1, 2 … n). Once
replication is complete, the index has primary shards and replica shards. The number of shards and
the number of replicas are defined when the index is created. The number of replicas can be
changed dynamically, but the shard count cannot.
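The distinction above shows up directly in the index settings: the shard count is supplied once at creation, while the replica count can be updated later. The sketch below builds the two request bodies and computes the resulting number of shard copies; how the bodies are sent (client library or plain HTTP) is left open.

```python
def index_settings(shards, replicas):
    """Settings body supplied when the index is created."""
    return {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}


def replica_update(replicas):
    """Settings body for changing the replica count on an existing index."""
    return {"index": {"number_of_replicas": replicas}}


def total_shard_copies(shards, replicas):
    """Each primary shard plus its replicas counts as one copy apiece."""
    return shards * (1 + replicas)


print(index_settings(2, 2))
print(replica_update(3))
print(total_shard_copies(2, 2))  # 6
```

With two primary shards and two replicas each there are six shard copies, which is why a fully spread-out cluster of that shape needs six nodes.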
Apache Solr and Elasticsearch
Both Apache Solr and Elasticsearch support synchronous and asynchronous replication models. If
replication is configured in ‘synchronous’ mode, the primary (leader) shard waits for successful
responses from the replica shards before acknowledging the commit. If the model is
‘asynchronous’, the response is returned to the client as soon as the request is executed on the
primary or leader shard, and the request is forwarded to the replicas asynchronously.
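The difference between the two modes is when the acknowledgement is returned relative to the replica writes. The following is a toy, single-process model (no networking or threads), with class and method names invented for illustration.

```python
class Replica:
    """Holds a copy of the documents written to the primary."""

    def __init__(self):
        self.docs = []

    def apply(self, doc):
        self.docs.append(doc)


class PrimaryShard:
    def __init__(self, replicas):
        self.docs = []
        self.replicas = replicas
        self.pending = []  # writes acknowledged but not yet forwarded (async mode)

    def write(self, doc, mode="sync"):
        self.docs.append(doc)
        if mode == "sync":
            # Synchronous: every replica applies the write before the ack.
            for replica in self.replicas:
                replica.apply(doc)
        else:
            # Asynchronous: ack immediately, forward to replicas later.
            self.pending.append(doc)
        return "ack"

    def flush(self):
        """Forward queued writes to the replicas (stands in for background replication)."""
        while self.pending:
            doc = self.pending.pop(0)
            for replica in self.replicas:
                replica.apply(doc)


replica = Replica()
primary = PrimaryShard([replica])
primary.write({"id": 1}, mode="sync")
primary.write({"id": 2}, mode="async")
print(len(replica.docs))  # 1: the async write has not reached the replica yet
primary.flush()
print(len(replica.docs))  # 2
```

The window between the acknowledgement and `flush()` is the trade-off: asynchronous mode answers faster, but a primary failure in that window can lose writes the client believes were stored.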
The diagram below depicts the replication concept which is followed in Solr and Elasticsearch.
Replication handling in Apache Solr and Elasticsearch
S1 Node 1 Shard 1 of the cluster
S2 Node 2 Shard 2 of the cluster
R1 Node 3 Replica 1 of Shard 1
R2 Node 4 Replica 1 of Shard 2
R3 Node 5 Replica 2 of Shard 1
R4 Node 6 Replica 2 of Shard 2
Amazon CloudSearch
Amazon CloudSearch handles replication in a simple way that streamlines the job of search
engineers and administrators. When scaling is configured, Amazon CloudSearch prompts for the
desired replication count, which should be based on load requirements.
Amazon CloudSearch automatically scales the replicas for a domain up and down based on request
traffic and data volume, but never below the desired replication count. The replication scaling
option can be changed at any time. If the scaling requirement is temporary (for example, an
anticipated spike during a seasonal sale), the desired replication count of the domain can be
raised in advance and then reverted after the request volume returns to a steady state. Modifying
the replication count does not require rebuilding the index, but the time needed for replica sync
to complete depends on the size of the search index.
The Amazon CloudSearch replication model has the following benefits:
• Search instance capacity is automatically replicated and load is distributed, so the search
layer is robust and highly available at all times.
• Fault tolerance is improved. If any one of the replicas is down, the other replica(s)
continue to handle requests while the failed replica is in recovery mode.
• The entire process of scaling and distribution is automated, which avoids manual intervention
and support.
Conclusion
All three search engines provide a good foundation for replication. Apache Solr and Elasticsearch
allow you to define your own replication topology, which can be configured for synchronous or
asynchronous replication. They can be scaled manually, or automatically by writing custom
programs, based on application requirements. However, substantial operational effort is required
when cluster replication is set up at enterprise scale.
Amazon CloudSearch fully manages replication by handling scaling, load distribution, and fault
tolerance. This simplicity saves operations costs for enterprises.
8.2 Failover
Failover is a back-end operation that switches to secondary or standby nodes in the event of a
primary server failure. It is an important fault tolerance function for systems with low or zero
downtime requirements.
Apache Solr and Elasticsearch
When an Apache Solr or Elasticsearch cluster is built with shards and replicas, the cluster
inherently becomes fault-tolerant and supports failover automatically.
During a failure, the cluster is expected to keep supporting operations while the failed node is
put into a recovery state. Both the Apache Solr and Elasticsearch documentation strongly recommend
a distributed cluster setup to protect the user experience from application or infrastructure
failure.
If all the nodes storing a shard and its replicas fail, client requests will also fail. If the
shards are set to a tolerant configuration, partial results can be returned from the available
shards. This behavior is expected in both Apache Solr and Elasticsearch.
The representation below depicts how failover is handled in a cluster. This flow applies to both
Solr and Elasticsearch.
Node, Replica 1, Replica 2
Node: SHARD 1 – Node Number 1. Replica 1: SHARD 1 FIRST REPLICA – Node Number 3. Replica 2:
SHARD 1 SECOND REPLICA – Node Number 5.
Node: SHARD 2 – Node Number 2. Replica 1: SHARD 2 FIRST REPLICA – Node Number 4. Replica 2:
SHARD 2 SECOND REPLICA – Node Number 6.
The table below illustrates the failure scenarios in a search cluster.
Scenario A: If SHARD 1 fails, one of its replica nodes, either Node Number 3 or Node Number 5,
is chosen as leader.
Scenario B: If SHARD 2 fails, one of its replica nodes, either Node Number 4 or Node Number 6,
is chosen as leader.
Scenario C: If SHARD 1 REPLICA 1 fails, Shard 1 Replica 2 continues to support replication and
also serves requests.
Scenario D: If SHARD 2 REPLICA 1 fails, Shard 2 Replica 2 continues to support replication and
also serves requests.
Elasticsearch uses its internal Zen Discovery module to detect failures. If the node holding a
primary shard dies, a replica is promoted to the role of primary. Apache Solr uses Apache
ZooKeeper for coordination, failure detection, and leader voting. ZooKeeper initiates a leader
election process among the replicas when a leader (primary shard) fails.
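The failure scenarios can be walked through with a toy model of the six-node cluster. This sketch simply promotes the first surviving replica; real clusters elect a leader through ZooKeeper (Solr) or Zen Discovery (Elasticsearch), so the selection rule here is an assumption made for brevity.

```python
def fail_node(cluster, node):
    """Remove a failed node; promote a replica if the failed node was a leader."""
    for shard, group in cluster.items():
        if group["leader"] == node:
            if not group["replicas"]:
                raise RuntimeError(f"{shard} has no replica left to promote")
            # Simplification: the first surviving replica becomes the leader.
            group["leader"] = group["replicas"].pop(0)
        elif node in group["replicas"]:
            group["replicas"].remove(node)
    return cluster


cluster = {
    "shard1": {"leader": 1, "replicas": [3, 5]},
    "shard2": {"leader": 2, "replicas": [4, 6]},
}
fail_node(cluster, 1)  # Scenario A: the shard 1 leader fails
fail_node(cluster, 4)  # Scenario D: the first replica of shard 2 fails
print(cluster)
# {'shard1': {'leader': 3, 'replicas': [5]}, 'shard2': {'leader': 2, 'replicas': [6]}}
```

Once a shard has lost its leader and every replica, the model raises an error, which corresponds to the case where client requests for that shard fail.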
Amazon CloudSearch
Amazon CloudSearch has built-in failover support. Amazon CloudSearch recommends scaling
options and availability options to increase fault tolerance in the event of a service disruption or
node failures.
When Multi-AZ is turned on, Amazon CloudSearch provisions the same number of instances in
your search domain in the second availability zone within that region. The instances in the primary
and secondary zones are capable of handling a full load in the event of any failure.
In the event of a service disruption or failure in one availability zone, request traffic is
automatically redirected to the secondary availability zone. In parallel, Amazon CloudSearch
self-heals the failed cluster, and Multi-AZ restores the nodes without any administrative
intervention. During this switch, in-flight queries might fail and will need to be retried from
the front-end application.
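Retrying failed in-flight queries from the application side is commonly done with a bounded number of attempts and an increasing delay. A minimal sketch, assuming the search call raises `ConnectionError` on failure; the `send` callable is a placeholder for whatever client call the application uses.

```python
import time


def search_with_retry(send, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call send(); on ConnectionError, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 0.2 s, 0.4 s, 0.8 s, ...


# Demonstration with a fake search call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("request failed during failover")
    return {"hits": 42}


print(search_with_retry(flaky_search, sleep=lambda seconds: None))
# {'hits': 42}
```

Injecting `sleep` keeps the helper easy to test; production code would leave the default and might add jitter to the delay.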
Failover support can be improved by increasing the partitions and replicas in the Amazon
CloudSearch scaling options. If one of the replicas or partitions fails, the other nodes
(replicas or partitions) handle requests while the failed node is being recovered.
Amazon CloudSearch is very sophisticated in terms of handling failure, as the node health is
continuously monitored. In the event of infrastructure failures, the nodes are automatically
recovered or replaced.
Conclusion
Failover can be architected by applying techniques like replication, sharding, service discovery,
and failure-detection services. Apache Solr and Elasticsearch advocate building your search system
in ‘Cluster mode’ to address failover. They undertake that responsibility by employing service
discovery which can detect unhealthy nodes. The service discovery maintains the cluster
information and rebalances the search cluster when node failures are detected.
Amazon CloudSearch supports failover for single-node as well as cluster deployments. Behind the
scenes, CloudSearch continuously monitors the health of the search instances and automatically
recovers or replaces them when they fail.
Feature 9: Scaling
The ability to scale in terms of computing power, memory, or data volume is essential in any
data- and traffic-bound application. Scaling is a significant design principle employed to improve
performance, load balancing, and high availability.
Over time, the search cluster is expected to be scaled horizontally (scale out) or vertically (scale up)
depending upon the needs.
Scale-up is the process of moving from a small server to a large server. Scale-out is the process of
adding multiple servers to handle the load. The scaling strategy should be selected based on
application requirements.
Apache Solr and Elasticsearch
Scaling an Apache Solr or Elasticsearch application involves manual processes. These can include a
simple server addition task or advanced tasks like cluster topology changes, storage changes, or
infrastructure upgrades.
With vertical scaling, the search cluster has to go through steps such as new setup and
configuration, downtime, and node restarts. With horizontal scaling, the process may involve
re-sharding, rebalancing, or cache warming.
While a search cluster system can benefit from powerful hardware, vertical scaling has its own
limitations. Upgrading or increasing the infrastructure specifications on the same server can
involve tasks like:
• New setup
• Backup
• Downtime
• Application re-testing
Scaling out is generally considered easier than scaling up.
An expert search administrator (Apache Solr or Elasticsearch) is usually assigned to keep a close
watch on the performance of the search servers. Infrastructure and search metrics play a key role
in the administrator's decision making.
When these metrics exceed the threshold of a particular server and start affecting
overall performance, new servers have to be spawned manually. The scaling task can also
expand to include index partitioning, auto-warming, caching, and re-routing or distribution of
search queries to the new instances. It requires a Solr or Elasticsearch expert on your team to
identify and execute this activity periodically.
Sharding and Replication
Though scaling up, scaling out, and scaling down involve manual work, technology-driven
companies automate this process by developing custom programs. These programs
continuously monitor the cluster and decide when to scale elastically. The result is
quite similar to what AWS Auto Scaling offers.
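As an illustration of what such a custom program decides, the sketch below compares cluster metrics with thresholds and returns a scaling action. Everything in it is hypothetical: the `ClusterMetrics` fields, the threshold values, and the node limits are assumptions chosen for the example, not values recommended by Apache Solr, Elasticsearch, or AWS.

```python
from dataclasses import dataclass


@dataclass
class ClusterMetrics:
    avg_cpu_percent: float       # average CPU utilisation across search nodes
    avg_query_latency_ms: float  # average search latency
    node_count: int              # nodes currently serving queries


def scaling_decision(m, cpu_high=75.0, cpu_low=25.0, latency_high_ms=500.0,
                     min_nodes=2, max_nodes=10):
    """Return +1 (add a node), -1 (remove a node) or 0 (no change)."""
    overloaded = (m.avg_cpu_percent > cpu_high
                  or m.avg_query_latency_ms > latency_high_ms)
    if overloaded and m.node_count < max_nodes:
        return 1
    underused = (m.avg_cpu_percent < cpu_low
                 and m.avg_query_latency_ms <= latency_high_ms)
    if underused and m.node_count > min_nodes:
        return -1
    return 0
```

A monitoring loop would call `scaling_decision` at a fixed interval and hand the result to whatever provisions or retires servers. In a self-managed cluster, the steps that follow (rebalancing shards, warming caches) remain the operator's job.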
In terms of administration functionality, both Apache Solr and Elasticsearch offer scaling
techniques called Sharding and Replication.
Sharding (that is, partitioning) is a method in which a search index is split into multiple
logical units called "shards". Administrators recommend sharding when the indexed documents
outgrow the physical capacity available to the collection. When sharding is enabled, search
requests are distributed to every shard in the collection, and the results are collected from
each shard and then merged.
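The distribute-and-merge step can be sketched in a few lines. This is a toy model only: each shard is an in-memory dictionary with precomputed scores, and `search_shard` and `distributed_search` are hypothetical names, whereas real shards are Lucene indexes queried over the network.

```python
import heapq


def search_shard(shard, term, k):
    """Return the shard's top-k (score, doc_id) hits for term, best first.

    Each shard is modelled as {doc_id: {term: score}}.
    """
    hits = [(scores[term], doc_id)
            for doc_id, scores in shard.items() if term in scores]
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Send the query to every shard, then merge the partial results."""
    partial = [search_shard(shard, term, k) for shard in shards]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _score, doc_id in merged]


shards = [
    {"a": {"solr": 0.9}, "b": {"solr": 0.2}},
    {"c": {"solr": 0.7}, "d": {"lucene": 0.8}},
]
print(distributed_search(shards, "solr", k=2))  # ['a', 'c']
```

Each shard only needs to return its own top k hits, because no document outside a shard's top k can appear in the global top k.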
Another scaling technique, replication (discussed in detail in 8.1 Replication), adds
new servers with redundant copies of your index data to handle higher concurrent query loads by
distributing the requests across multiple nodes.
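One simple way to distribute requests across replicas is round-robin selection, sketched below. The `ReplicaRouter` class and the node names are hypothetical. Both Apache Solr and Elasticsearch route queries to replicas internally, so this only illustrates the idea and does not show how either engine implements it.

```python
import itertools


class ReplicaRouter:
    """Spread read-only queries across replicas in round-robin order."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self._cycle = itertools.cycle(list(replicas))

    def next_replica(self):
        """Return the replica that should serve the next query."""
        return next(self._cycle)


router = ReplicaRouter(["node-1", "node-2", "node-3"])
print([router.next_replica() for _ in range(5)])
# ['node-1', 'node-2', 'node-3', 'node-1', 'node-2']
```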
Amazon CloudSearch
Amazon CloudSearch is a fully managed search service; it scales up and down seamlessly as the
amount of data or query volume changes. CloudSearch can be scaled based on the data or based
on the request traffic. When the search data volume increases, CloudSearch scales from a
smaller search instance type to a larger one. If the capacity of the largest search instance
type is also exceeded, CloudSearch partitions the search index across multiple search
instances (the sharding technique).
When traffic and concurrency grow, Amazon CloudSearch deploys additional search
instances (replicas) to support the traffic load. This automation reduces the complexity and
manual labour required in the scaling-out process. Conversely, when traffic drops, Amazon
CloudSearch scales down your search domain by removing the additional search instances in order
to minimize costs.
The Amazon CloudSearch management console allows users to configure the desired partition
count and the desired replication count. The console also allows changing the instance
type (scaling up) at any time. This built-in elastic scaling is one of the most
important points in favor of Amazon CloudSearch.
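The same scaling options can also be set programmatically. The sketch below assumes the AWS SDK for Python (boto3) and its CloudSearch `update_scaling_parameters` call. The helper names, the domain name, the region default, and the instance type string are illustrative assumptions, so check the valid values for your API version before using them.

```python
def build_scaling_parameters(instance_type=None, replication_count=None,
                             partition_count=None):
    """Build the ScalingParameters structure, skipping options left unset."""
    params = {}
    if instance_type is not None:
        params["DesiredInstanceType"] = instance_type
    if replication_count is not None:
        params["DesiredReplicationCount"] = replication_count
    if partition_count is not None:
        params["DesiredPartitionCount"] = partition_count
    return params


def apply_scaling(domain_name, params, region="us-east-1"):
    """Send the desired scaling options to CloudSearch.

    Requires boto3 and AWS credentials; not executed in this sketch.
    """
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("cloudsearch", region_name=region)
    return client.update_scaling_parameters(
        DomainName=domain_name, ScalingParameters=params)


# Illustrative values only: ask for a larger instance type and two replicas.
params = build_scaling_parameters("search.m3.large", replication_count=2)
print(params)
```

These values are desired settings rather than fixed sizes. The service continues to scale the domain on its own as data volume and traffic change.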
Conclusion
Scaling in search is implemented in the form of sharding and replication. All three search engines
have strong scaling support for setting up their search tier in ‘cluster mode’.
Scaling in Apache Solr and Elasticsearch often requires administration, as there is no
hard-and-fast rule. Techniques like elastic scaling can be implemented only up to a limit; when
the cluster grows further, manual intervention and careful planning are required. Vertical scaling
in Apache Solr and Elasticsearch is even more delicate. It requires individual management of the
nodes in the cluster and is executed using techniques like ‘rolling restarts’ and custom scripts.
Amazon CloudSearch takes away all the operational intricacies from the administrators. Once the
desired partition count and desired replication count are set, CloudSearch automatically scales up
and down based on the volume of data and request traffic. This saves a lot of effort and cost on
operations and management.
Feature 10: Customization
At times, the search system or its software may not have support for a specific feature or built-in
integration with other systems. In such cases, most open source software allows developers to
customize and extend it with the features they need as plugins, extensions, or modules. Often, the
developer community shares extension libraries that address practical needs. These libraries can be
customized and integrated with the system.
Apache Solr and Elasticsearch
Apache Solr and Elasticsearch are both open source and built on Apache Lucene, and allow customization of:
• Analyzers
• Filters
• Tokenizers
• Language analysis
• Field types
• Validators
• Fall back query analysis
• Alternate query custom handlers
Since both products are open source, developers can customize or extend the libraries to fit
the required feature modifications through plugins and libraries. Build and deployment
become the developer's responsibility after extending the code base.
Apache Solr and Elasticsearch have many plugin extensions that allow developers to add
custom functionality for a variety of purposes. These plugins are packaged as libraries and
wired into the application through configuration mapping.
Amazon CloudSearch
Amazon CloudSearch does not allow for any customizations. The search features in Amazon
CloudSearch are offered by AWS after much careful thought and collective feedback from the
customers. The Amazon CloudSearch team continually evaluates new features and rolls them out
proactively.
Conclusion
Amazon CloudSearch has a highly capable feature set for developing search systems. However, if you
anticipate heavy customization of your search functionality, Apache Solr or Elasticsearch are
better choices, as their core search libraries are open source. It is also important to note that
any customization of the core libraries leaves responsibility for the build and deployment process
with the developer. The customization also needs to be maintained for every version upgrade or
newer release of your search engine.
Feature 11: More
11.1 Client libraries
Client libraries are used for communicating with search engines. They handle the details of
connecting to the search engine and let applications interact with it through high-level
methods.
Apache Solr
Apache Solr has an open source API client to interact with Solr using simple high-level methods.
The client libraries are available for PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, and
JavaScript.
Elasticsearch
Elasticsearch provides official clients for Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby.
There are other community-provided client libraries that can be integrated with Elasticsearch.
Open source
Besides the official and open source client APIs, Elasticsearch and Apache Solr can be integrated
using their RESTful APIs. The REST API can be called from any HTTP client written in your
preferred programming language, or even from the command line.
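As a sketch of calling these HTTP APIs without a client library, the helpers below build a Solr `/select` URL and an Elasticsearch `_search` URL and fetch the JSON response using only the Python standard library. The helper names, host names, ports, and the `products` collection are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_select_url(base_url, collection, query, rows=10):
    """Build a Solr /select URL that returns JSON."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"


def elasticsearch_search_url(base_url, index, query, size=10):
    """Build an Elasticsearch URI search (_search?q=...) URL."""
    params = urlencode({"q": query, "size": size})
    return f"{base_url}/{index}/_search?{params}"


def fetch_json(url, timeout=5):
    """GET the URL and decode the JSON response body (needs a running server)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(solr_select_url("http://localhost:8983", "products", "title:laptop", rows=5))
print(elasticsearch_search_url("http://localhost:9200", "products", "laptop"))
```

The same URLs work from the command line with a tool such as `curl`.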
Amazon CloudSearch
Amazon CloudSearch exposes a RESTful API for configuration, document service and search.
• The configuration API can be used for CloudSearch domain creation, its configuration and
end to end management.
• The document service API enables you to add, replace, or delete documents in your
Amazon CloudSearch domain.
• The search API is used for search or suggestion requests to your Amazon CloudSearch
domain.
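To illustrate the document service and search APIs, the sketch below builds a JSON document batch and a search URL in the style of the 2013-01-01 API version. The helper names, the sample documents, and the endpoint host are hypothetical. A real batch is sent with an HTTP POST to the `/2013-01-01/documents/batch` path of the domain's document endpoint.

```python
import json
from urllib.parse import urlencode


def build_document_batch(adds, delete_ids=()):
    """Serialize add and delete operations as one JSON document batch.

    adds maps document id -> dict of field values; an "add" for an id that
    already exists replaces that document.
    """
    batch = [{"type": "add", "id": doc_id, "fields": fields}
             for doc_id, fields in adds.items()]
    batch += [{"type": "delete", "id": doc_id} for doc_id in delete_ids]
    return json.dumps(batch)


def search_url(search_endpoint, query, size=10):
    """Build a search request URL for a domain's search endpoint."""
    params = urlencode({"q": query, "size": size})
    return f"https://{search_endpoint}/2013-01-01/search?{params}"


# Hypothetical documents and endpoint, for illustration only.
print(build_document_batch({"tt1": {"title": "Star Wars"}}, delete_ids=["tt2"]))
print(search_url("search-movies-abc123.us-east-1.cloudsearch.amazonaws.com",
                 "star wars", size=5))
```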
Alternatively, AWS also shares a downloadable SDK package, which simplifies coding. The SDK is
available for popular languages like Java, .NET, PHP, Python, and more. The SDK APIs are built for
most Amazon Web services, including Amazon S3, Amazon EC2, CloudSearch, DynamoDB, and
more. The SDK package includes the AWS library, code samples, and documentation.
Feature 12: Cost
From an overall perspective, cost is a very important factor, and companies always look for ways to
reduce Total Cost of Ownership (TCO). In this section, we look at the cost components of these
three search engines.
Apache Solr and Elasticsearch
The cost of Apache Solr and Elasticsearch includes infrastructure costs, managed
services costs, and people costs. For any type of deployment, server costs and
engineering costs are unavoidable. The commitment to continuous admin operations depends on
the application's requirements and its criticality.
Amazon CloudSearch
The cost of Amazon CloudSearch includes server costs and engineering costs, which are
unavoidable for any search deployment, as with the two engines above. Because Amazon CloudSearch
is a fully managed service, the managed services are covered as part of the server costs. Also,
Amazon CloudSearch does not charge upfront; it bills at the end of the month based on
CloudSearch usage.
Conclusion
The net operating costs are essentially the same across all three search engines, but people costs
will be 30% more for self-managed Apache Solr or Elasticsearch compared to Amazon
CloudSearch.
For example, a highly important and critical search application will require 24x7 support and
managed services. This is incurred as a managed services cost, which is an additional cost in
Apache Solr and Elasticsearch deployments.
A detailed TCO Analysis between Apache Solr, Elasticsearch and Amazon CloudSearch can be read
here.
Link:
Conclusion
Search is an indispensable feature in most business applications.
Apache Solr and Elasticsearch are time-proven solutions. Many larger organizations have used
Apache Solr and Elasticsearch for years, but are now looking for greater operational efficiency and
cost effectiveness. At the same time, companies are looking for innovative ways to grow their
businesses and provide value. In recent years, a large number of technology companies have started
to take advantage of cloud-based search services, mainly in terms of getting started and then
accommodating growth without the need to switch vendors to do so. When scalability, cost, and
speed-to-market are primary concerns, we recommend using some form of cloud service. And if you
want to enjoy the benefits of a cloud solution built on the architecture of Apache Solr, we
recommend Amazon CloudSearch.
About the Authors
Dwarak is a Principal Architect at 8KMiles with more than a decade of hands-on experience in Cloud
Computing, Big Data, Web technologies and Product Management. He has varied and
progressive experience in architecting distributed Web and Enterprise systems and products.
He also has deep domain knowledge in the banking, finance, retail, and e-commerce
industries. At present, he oversees technology consulting, architecture, delivery and
end-to-end customer transformation programs at 8KMiles.
Dwarakanath Ramachandran
Harish is the Chief Technology Officer (CTO) and Co-Founder of 8KMiles. Harish has more than
a decade of experience in architecting and developing cloud computing, e-commerce and mobile
application systems. He has also built large Internet banking solutions that catered to the
needs of millions of users, where security and authentication were critical factors. He is
responsible for the overall technology direction of the 8KMiles products and services in the
Cloud, Big Data and Mobility space. Harish is a thought leader in Cloud-related technologies, an
advisor, and a widely followed blogger.
Harish Ganesan
About 8KMiles
8KMiles is a solutions company that is focused on helping organizations of all sizes
to integrate Cloud, Identity, and Big Data into their IT and business strategies.
8KMiles’ team of experts, located in North America and India, offers a host of services
and solutions such as Cloud, Federated Identity Consulting, Cloud Engineering,
Migration, Big Data services, and Managed Services on Amazon Web Services.
8KMiles offers specialized expertise in mature verticals such as Pharma, Retail,
Media, Travel, and Healthcare. Visit us at www.8kmiles.com

Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...Amazon Web Services
 
Auto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADBAuto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADBDatabricks
 
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013Amazon Web Services
 
Practical Machine Learning
Practical Machine LearningPractical Machine Learning
Practical Machine LearningLynn Langit
 
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...Amazon Web Services
 
Machine Learning in Autonomous Data Warehouse
 Machine Learning in Autonomous Data Warehouse Machine Learning in Autonomous Data Warehouse
Machine Learning in Autonomous Data WarehouseSandesh Rao
 
PoolParty Semantic Suite - Release 5.5
PoolParty Semantic Suite - Release 5.5PoolParty Semantic Suite - Release 5.5
PoolParty Semantic Suite - Release 5.5Semantic Web Company
 

Similar to Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11 (20)

Amazon CloudSearch TCO Analysis
Amazon CloudSearch TCO AnalysisAmazon CloudSearch TCO Analysis
Amazon CloudSearch TCO Analysis
 
(DAT311) Large-Scale Genomic Analysis with Amazon Redshift
(DAT311) Large-Scale Genomic Analysis with Amazon Redshift(DAT311) Large-Scale Genomic Analysis with Amazon Redshift
(DAT311) Large-Scale Genomic Analysis with Amazon Redshift
 
How to analyze and tune sql queries for better performance vts2016
How to analyze and tune sql queries for better performance vts2016How to analyze and tune sql queries for better performance vts2016
How to analyze and tune sql queries for better performance vts2016
 
Your practical reference guide to build an stream analytics solution
Your practical reference guide to build an stream analytics solutionYour practical reference guide to build an stream analytics solution
Your practical reference guide to build an stream analytics solution
 
Tx2014 Feature and Highlights
Tx2014 Feature and Highlights Tx2014 Feature and Highlights
Tx2014 Feature and Highlights
 
Explain the explain_plan
Explain the explain_planExplain the explain_plan
Explain the explain_plan
 
Asha_Resume
Asha_ResumeAsha_Resume
Asha_Resume
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Testing data warehouse applications by Kirti Bhushan
Testing data warehouse applications by Kirti BhushanTesting data warehouse applications by Kirti Bhushan
Testing data warehouse applications by Kirti Bhushan
 
Oracle Sql Tuning
Oracle Sql TuningOracle Sql Tuning
Oracle Sql Tuning
 
Data Mining _ Weka
Data Mining _ WekaData Mining _ Weka
Data Mining _ Weka
 
Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
 
Auto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADBAuto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADB
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
 
Practical Machine Learning
Practical Machine LearningPractical Machine Learning
Practical Machine Learning
 
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
 
Data day2017
Data day2017Data day2017
Data day2017
 
Machine Learning in Autonomous Data Warehouse
 Machine Learning in Autonomous Data Warehouse Machine Learning in Autonomous Data Warehouse
Machine Learning in Autonomous Data Warehouse
 
PoolParty Semantic Suite - Release 5.5
PoolParty Semantic Suite - Release 5.5PoolParty Semantic Suite - Release 5.5
PoolParty Semantic Suite - Release 5.5
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Amazon cloud search_vs_apache_solr_vs_elasticsearch_comparison_report_v11

  • 1. 8KMiles Software Services, Inc Amazon CloudSearch Comparison Report
  • 2. TABLE OF CONTENTS
    Smackdown, p. 3
    Introduction, p. 5
    Search features 1 - 1 comparison, p. 6
    Feature 1: Getting Started, p. 7
    Feature 2: Operations and Management, p. 9
      2.1 Backup, p. 9
      2.2 System upgrades and patch management, p. 10
      2.3 Re-indexing, p. 11
    Feature 3: Monitoring, p. 13
    Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export, p. 14
      4.1 Schema management, p. 14
      4.2 Dynamic fields, p. 14
      4.3 Data types, p. 15
      4.4 Data import & export, p. 16
    Feature 5: Search and Indexing features, p. 18
      5.1 Analyzers, Tokenizers and Token filters, p. 18
      5.2 Faceting, p. 20
      5.3 Auto Suggestion, p. 21
      5.4 Highlighting, p. 23
    Feature 6: Multilingual support, p. 24
    Feature 7: Protocol & API Support, p. 26
      7.1 Request and Response formats, p. 26
      7.2 External Integrations, p. 26
      7.3 Protocols Support, p. 26
    Feature 8: High Availability, p. 27
      8.1 Replication, p. 27
      8.2 Failover, p. 29
    Feature 9: Scaling, p. 32
    Feature 10: Customization, p. 34
    Feature 11: More, p. 35
      11.1 Client libraries, p. 35
    Feature 12: Cost, p. 36
    Conclusion, p. 37
  • 3. Amazon CloudSearch Comparison Report PAGE 3 of 39 8K Miles 2007-2015 1-855-8KMILES (855-856-4537) info@8kmiles.com
    Smackdown

    Feature | Apache Solr | Elasticsearch | Amazon CloudSearch
    --- | --- | --- | ---
    Admin Operations | | |
    Backup | Replication/Custom handler/Custom scripts | Snapshot API/Custom scripts | Fully-managed
    Patch Management | Manual/Automated via custom scripts | Manual/Automated via custom scripts | Fully-managed
    Re-indexing | Manual | Manual | Fully-managed; manual option available from management console
    Monitoring | Amazon CloudWatch (if hosted on EC2); SaaS monitoring tools like NewRelic, Stackdriver, Datadog | Amazon CloudWatch (if hosted on EC2); SaaS monitoring tools like NewRelic, Stackdriver, Datadog | CloudSearch default metrics
    Maintenance | External managed service | External managed service | Fully-managed
    API | | |
    Client Library | Java, PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, JavaScript | Java, Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby | Amazon SDK
    HTTP RESTful API | Yes | Yes | Yes
    Request Format | XML, JSON, CSV | XML, JSON | XML, JSON
    Response Format | XML, JSON, CSV | XML, JSON | XML, JSON
    Third party Integrations | Available for Commercial and Open source | Available for Commercial and Open source | Amazon Web Services Integrations available
    Search Functions | | |
    Schema | Schema and Schema-less | Schema and Schema-less | Schema
    Dynamic fields support | Yes | Yes | Yes
    Synonyms | Yes | Yes | Yes
    Multiple indexes | Yes | Yes | No
    Faceting | Yes | Yes | Yes
    Rich documents support | Yes | Yes | No
    Auto Suggest | Yes | Yes | Yes
    Highlighting | Yes | Yes | Yes
    Query parser | Standard, DisMax, Extended DisMax, Other parsers | Standard, query_string, DisMax, match, multi_match | Simple, structured, Lucene, or DisMax
    Geosearch | Yes | Yes | Yes
    Analyzers, Tokenizers and Token filters | Default/Custom | Default/Custom | Default

  • 4. Smackdown, continued

    Feature | Apache Solr | Elasticsearch | Amazon CloudSearch
    --- | --- | --- | ---
    Fuzzy Logic | Yes | Yes | Yes
    Did you mean | Default/Custom | Default/Custom | No
    Stopwords | Yes | Yes | Yes
    Customization | Yes | Yes | No
    Advanced | | |
    Cluster management | ZooKeeper | In-built | Fully-managed
    Scaling | Vertical scaling/Horizontal scaling | Vertical scaling/Horizontal scaling | Fully-managed horizontal scaling
    Replication | Yes | Yes | Yes
    Sharding | Yes | Yes | Yes
    Failover | Yes, if set up in Cluster Replica mode | Yes, if set up in Cluster Replica mode | Fully-managed
    Fault tolerant | Yes, if set up in Cluster mode | Yes, if set up in Cluster mode | Fully-managed
    Import and Export | | |
    Data import | Default import handlers, custom import handlers | Rivers modules, Logstash input plugins, custom programs | Batch upload
    Data export | Default export handlers, custom export handlers | Snapshot API | Custom program
    Others | | |
    Web Interface | Solr Admin | Sense | AWS Management Console
  • 5. Introduction
    In today's world of vast and readily available information, a good search experience is central to a good user experience. Delivering effective search tools has therefore become a key goal for software products, marketplaces, e-commerce websites, and content management systems. Developers looking to deliver a premium search experience to their users should be aware of two broad trends:
    1) Open source and platform-based search engines are replacing proprietary search engines because of better licensing models and community support.
    2) The cloud delivery model is succeeding over the on-premise delivery model because of scalability, high availability, and operating expense.
    In light of these trends, the leading candidates for search technology boil down to three: Apache Solr, Elasticsearch, and Amazon CloudSearch. At 8KMiles, our clients often ask us how these three choices compare with one another. This report aims to make it easy for developers to pick the right technology for their application by presenting a comprehensive framework for evaluating the three options. We applied the framework to the top feature sets that are critical to any search workload, broke those down into granular features, and compared the three search engines on each. In this report, we summarize our conclusions and present them in a smackdown-style summary card. We encourage our readers to run a more in-depth evaluation for their specific use cases.
  • 6. Search features 1 - 1 comparison
    This section discusses the search features in detail and how each is supported in Apache Solr, Elasticsearch, and Amazon CloudSearch. The list below shows the most influential features in our assessment of the search engines. These features are identified and grouped by the operations of a search application.
    • Getting started: Server setup, search engine installation and configuration
    • Operations: Backup, patches, re-indexing, monitoring
    • Indexing, Search and Query: Schema management, field types, dynamic fields, data import/export, analyzers, did you mean, facets, auto complete, spatial
    • High Availability: Replica, failover, self-healing clusters
    • Scaling: Read scaling, write scaling, partitioning options, replication options
    • Protocol & API Support: Request and response formats and support, protocols supported, external integrations
    • Customization: Data field types, functions
    • Others: Supported programming languages, administration interface
    • Cost: Infrastructure cost, support cost, on-going management cost, licensing cost, talent cost
  • 7. Feature 1: Getting Started
    ‘Getting Started’ is the first step an engineer takes to understand the basics of a product's major features. In this section, we look at how each of the three search engines facilitates ‘Getting Started’.

    Apache Solr and Elasticsearch
    Apache Solr and Elasticsearch require end users to spend considerable time understanding and setting up the respective search engine. The “Getting Started” manuals of Apache Solr and Elasticsearch assume that the end user has at least minimal knowledge of search engines, their related functions, and their architecture. The installation processes for Apache Solr and Elasticsearch include tasks such as:
    • Server setup
    • Search engine download
    • Dependent software installations
    • Setup of environmental requirements
    • Understanding of basic server commands
    • Administrative access
    Apache Solr and Elasticsearch ship with test examples that allow users to run “warm up” search and indexing operations. While the default test schema in Apache Solr is sufficient for the user to get started, Elasticsearch’s schema-less design allows the user to send a document request without defining any schema.

    Amazon CloudSearch
    If you already have an Amazon Web Services (AWS) account set up, you can create a CloudSearch domain in a few clicks using the AWS Management Console. The console guides administrators through a step-by-step process that requests the following inputs:
    • Instance type
    • High availability options
    • Replication options
    • Schema definitions
    • Access policies
    It is important to note that Amazon CloudSearch does not require all of this information up front. The CloudSearch domain name and engine type are sufficient to create a CloudSearch instance.
    The other configurations, such as schema, instance type, access policies, and high availability options, can be modified later based on the application requirements. Users are abstracted from hardware provisioning, software installation, configuration, cluster setup, and other administration activities. Users receive two regional endpoints for the domain: a search endpoint and a document endpoint. Both endpoints can be accessed through the RESTful API or the AWS Software Development Kit (SDK) with Identity and Access Management (IAM) credentials.
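As a rough illustration of how the two endpoints are used over the RESTful API, the sketch below builds a search URL and a document batch payload without sending anything. It assumes the 2013-01-01 CloudSearch API version; the endpoint host names, document IDs, and field names are hypothetical placeholders, and request signing with IAM credentials is left out.

```python
import json
from urllib.parse import urlencode

API_VERSION = "2013-01-01"  # assumed CloudSearch API version in the URL path


def search_url(search_endpoint: str, query: str, size: int = 10) -> str:
    """Build a GET URL for a simple text query against the search endpoint."""
    params = urlencode({"q": query, "size": size})
    return f"https://{search_endpoint}/{API_VERSION}/search?{params}"


def batch_url(document_endpoint: str) -> str:
    """Build the POST URL for uploading a document batch."""
    return f"https://{document_endpoint}/{API_VERSION}/documents/batch"


def document_batch(adds: dict, deletes: list) -> str:
    """Serialize add and delete operations into a JSON document batch."""
    ops = [{"type": "add", "id": doc_id, "fields": fields}
           for doc_id, fields in adds.items()]
    ops += [{"type": "delete", "id": doc_id} for doc_id in deletes]
    return json.dumps(ops)


if __name__ == "__main__":
    # Hypothetical endpoint host names, for illustration only.
    print(search_url("search-movies-example.us-east-1.cloudsearch.amazonaws.com",
                     "star wars"))
    print(batch_url("doc-movies-example.us-east-1.cloudsearch.amazonaws.com"))
    print(document_batch({"tt0076759": {"title": "Star Wars"}}, ["tt0000001"]))
```

The batch body would be sent with a `Content-Type: application/json` header; in practice the AWS SDK wraps both calls and handles signing.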
  • 8. Another important note is that the default CloudSearch access policies for the document service and search service endpoints are configured to block all IP addresses. Developers must explicitly authorize the IP addresses that are allowed to access the CloudSearch endpoints. CloudSearch also provides a sample dataset, the IMDB movies, which can be used to test drive the service. The CloudSearch developer documentation walks through the steps to launch a test domain using the sample IMDB dataset.

    Conclusion
    Apache Solr and Elasticsearch expect users to have basic practical knowledge of the search engine and to complete several significant tasks before the ‘Getting Started’ step is done. In Amazon CloudSearch, the ‘Getting Started’ activities are easier, and end users can have a CloudSearch instance up and running with a few clicks in a few minutes.
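The IP authorization step mentioned for CloudSearch can be sketched as an IAM-style access policy document. The helper below is a minimal illustration, not the report's own procedure: the function name is hypothetical and the CIDR block is a documentation-range placeholder. It only builds the JSON; applying it to a domain is done through the console or the configuration API.

```python
import json


def ip_access_policy(cidr_blocks,
                     actions=("cloudsearch:search", "cloudsearch:document")):
    """Return a policy document allowing the given actions from the listed IPs."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": list(actions),
            # Requests are allowed only when the caller's IP matches a block.
            "Condition": {"IpAddress": {"aws:SourceIp": list(cidr_blocks)}},
        }],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # 192.0.2.0/24 is a reserved documentation range, used here as a placeholder.
    print(ip_access_policy(["192.0.2.0/24"]))
```

Passing only `("cloudsearch:search",)` as `actions` would open the search endpoint while leaving document uploads blocked.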
Feature 2: Operations and Management

In this section, we’ll discuss some important administrative operations such as:
• Index backup
• Patch management
• Re-indexing and recovery

2.1 Backup

Data backup is a routine operation, carried out at defined intervals. It is essential for recovering data promptly after failures such as a hardware crash, data corruption, or related events.

Apache Solr

Apache Solr provides a feature called ‘ReplicationHandler’. The main objective of ReplicationHandler is to replicate index data to slave servers, but it can also be used to maintain a backup copy. A replication slave node can be configured against the Solr master and dedicated solely to backup, with no other operations taking place on that slave node.

Solr’s built-in support for replication also allows ReplicationHandler to be used as an API. The API has optional parameters such as the location, the snapshot name, and the number of backups to keep. The backup API can only store snapshots on a local disk; any other storage option requires customization. If you are required to store the backups in a different location such as Amazon Simple Storage Service (S3), a local storage server, or a remote data center, ReplicationHandler has to be customized. The Solr core libraries are open source, which allows for such customization.

Elasticsearch

Elasticsearch provides an advanced option called the ‘Snapshot API’ for backing up the entire cluster. The API backs up the current cluster state and related data and saves it to a shared repository. The initial backup is a complete copy of the data. Subsequent backups store only the delta between the current data and the previous snapshots.

Elasticsearch requires end users to create a repository, whose type can be chosen from:
• Shared file system
• Amazon S3
• Hadoop Distributed File System (HDFS)
• Azure Cloud

This integration gives developers greater flexibility in managing their backups.

Backup Process

The backup options present in Apache Solr and Elasticsearch can be executed manually or automated. To automate the entire backup process, one has to write custom scripts that call the relevant API or handler. Most engineering companies follow this model of writing custom scripts for backup automation.
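As a rough illustration of what such a custom backup script has to assemble, the sketch below builds the HTTP requests for a Solr ReplicationHandler backup and for an Elasticsearch snapshot. The helper names, base URLs, core name, and repository names are hypothetical, and the sketch only constructs the requests; sending them, scheduling, and snapshot expiration are left out.

```python
import json
from urllib.parse import urlencode


def solr_backup_url(base_url, core, location, name, number_to_keep):
    """Build a ReplicationHandler backup URL for one Solr core.

    The core name and base URL are placeholders chosen by the caller.
    """
    query = urlencode({
        "command": "backup",
        "location": location,          # local disk path for the snapshot
        "name": name,                  # snapshot name
        "numberToKeep": number_to_keep,
    })
    return f"{base_url}/solr/{core}/replication?{query}"


def es_snapshot_requests(base_url, repository, fs_location, snapshot):
    """Return (method, url, body) tuples for an Elasticsearch snapshot.

    Step 1 registers a shared-file-system repository; step 2 takes a snapshot.
    """
    register_body = json.dumps(
        {"type": "fs", "settings": {"location": fs_location}}
    )
    snapshot_url = (
        f"{base_url}/_snapshot/{repository}/{snapshot}"
        "?wait_for_completion=true"
    )
    return [
        ("PUT", f"{base_url}/_snapshot/{repository}", register_body),
        ("PUT", snapshot_url, None),
    ]
```

A scheduler such as cron could call these helpers with a date-stamped snapshot name and pass the results to any HTTP client.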
The backup process also involves maintaining the latest snapshots and archives. The key management tasks are snapshot retrieval, archival, and expiration.

Alternatively, if the Solr or Elasticsearch cluster is set up in replication mode, any one of the slave nodes can be designated as the backup server. Automating backups from that node still needs a script written by the developer.

Amazon CloudSearch

Amazon CloudSearch inherently takes care of the data that is stored and indexed, leaving a lighter load for engineering and operations teams. Amazon CloudSearch manages all data backups itself. The backups are maintained internally, behind the scenes. In the event of a hardware failure or other problem, Amazon CloudSearch restores the backup automatically, and this process is not exposed to end users.

Conclusion

The default option in Apache Solr is only to back up to a ‘local disk’ store; it does not offer the other storage options that Elasticsearch does. However, engineers can write their own handlers to manage the backup process. Elasticsearch is packaged with plugins for multiple storage options, which is an added advantage for engineers.

Amazon CloudSearch relieves users of the intricacies of backup and its management. The IT operations or managed service team has a smaller role in the backup process, as the entire operation is managed behind the scenes by CloudSearch.

2.2 System upgrades and patch management

Patch management and system upgrades such as OS patches and fixes are inevitable in operations and administration. Every system eventually needs a version upgrade, OS maintenance, or hardware or software changes.

Rolling Restarts

Apache Solr and Elasticsearch both recommend using ‘Rolling Restarts’ for patch management, operating system upgrades, and other fixes. A Rolling Restart involves stopping and starting each node in the cluster sequentially. This allows each node to be updated with the latest code, fixes, or patches while the cluster as a whole continues to serve search requests. Rolling Restarts are adopted when high availability is mandatory and downtime is not acceptable.

Sometimes, Rolling Restarts require intelligent decision making based on the cluster topology. If a cluster consists of shards and replicas, the order in which the nodes are restarted has to be chosen deliberately.
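The sequential nature of a Rolling Restart can be sketched as follows. This is a minimal outline, not a Solr or Elasticsearch tool: the `restart` and `is_healthy` callables are hypothetical hooks that the operator would supply (for example, an SSH command and a health-check request), and the caller is assumed to pass the nodes in an order that suits the shard and replica topology.

```python
import time


def rolling_restart(nodes, restart, is_healthy, max_checks=30, poll_seconds=5.0):
    """Restart nodes one at a time, waiting for each to report healthy.

    Stops at the first node that fails to recover, so the rest of the
    cluster keeps serving requests on the old version.
    """
    restarted = []
    for node in nodes:
        restart(node)
        for _ in range(max_checks):
            if is_healthy(node):
                break
            time.sleep(poll_seconds)  # give the node time to rejoin
        else:
            raise RuntimeError(
                f"{node} did not recover; already restarted: {restarted}"
            )
        restarted.append(node)
    return restarted
```

Halting on the first unhealthy node is a deliberate choice here: it keeps at most one node out of service at any time.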
Apache Solr

Apache ZooKeeper runs as a stand-alone application and is not upgraded automatically when Apache Solr is upgraded; it should be upgraded manually at the same time.

Elasticsearch

Elasticsearch recommends disabling the ‘shard allocation’ configuration during a node restart. This tells Elasticsearch not to re-balance the missing shards; otherwise the cluster starts reallocating shards as soon as it detects the loss of a node.

Amazon CloudSearch

Amazon CloudSearch internally manages all patches and upgrades related to its operating system. As a managed search service, Amazon CloudSearch also rolls out new features itself; upgrades are self-managed and immediately available to all customers without any action on their part.

Conclusion

Patch management in Apache Solr and Elasticsearch has to be carried out manually using Rolling Restarts. Customers automate this process by developing custom scripts for system upgrades and patch management.

Patch management in Amazon CloudSearch is transparent to customers. The upgrades and patches applied to Amazon CloudSearch are regularly announced in the ‘What’s New’ section of the CloudSearch documentation.

2.3 Re-indexing

Any business application changes over its lifetime, as the business running it changes. A business change has a direct effect on the data structure of the system’s persistent information store. The search engine, which is seen as a secondary or alternate store, will eventually have to change its data structure as well. Any change to the search engine data structure requires the data to be re-indexed.

Example: A product company started collecting ‘feedback’ from its customers for a given product. The text from the new ‘feedback’ field needs to be added to the search schema, and this may require re-indexing.

If the search data is not re-indexed after a structural change, the data that has already been indexed could become inaccurate and the search results may behave differently than expected.

Re-indexing becomes necessary over time as the application grows. It is a common and mandatory admin operation, executed periodically based on application requirements.
Apache Solr

Apache Solr recommends re-indexing if there is a change in your schema definitions. The options below are widely used by the Apache Solr user community.
• Create a fresh index with new settings. Copy all of the documents from the old index to the new one.
• Configure the Data Import Handler with ‘SolrEntityProcessor’. The SolrEntityProcessor imports data from Solr instances or cores for a given search query. It has the limitation that it can only copy fields that are stored in the source index.
• Configure the Data Import Handler with the original data source and push the data afresh to the new index.

Elasticsearch

Elasticsearch proposes several approaches for data re-indexing. The following approaches are usually combined:
• Use Elasticsearch’s Scan and Scroll and Bulk APIs to fetch and push data into the new index.
• Update or create an index alias with the old index name and delete the old index.
• Use open source Elasticsearch plugins that can extract all data from the cluster and re-index the data. Most of these plugins internally use the Scan and Scroll and Bulk APIs (as mentioned above), which reduces development time.

Amazon CloudSearch

Amazon CloudSearch recommends rebuilding the index when index fields are added or modified. Amazon CloudSearch expects customers to issue an indexing request after a configuration change. Whenever there is a configuration change, the CloudSearch domain status changes to ‘NEEDS INDEXING’. During the index rebuild, the domain's status changes to ‘PROCESSING’, and upon completion the status changes to ‘ACTIVE’.

Amazon CloudSearch can continue to serve search requests during the indexing process, but the configuration changes are not reflected in the search results until it completes. Re-indexing can take some time; the time taken is directly proportional to the volume of data in your index.

Amazon CloudSearch also allows document uploads while indexing is in progress, but the updates can become slower if there is a large volume of document updates. In such a scenario, the uploads or updates can be throttled or paused until the Amazon CloudSearch domain returns to an ‘ACTIVE’ state.

Customers can initiate re-indexing by issuing the index-documents command using the RESTful API, the AWS command line interface (CLI), or an AWS SDK. They can also initiate re-indexing from the CloudSearch management console.

Conclusion

Re-indexing in Apache Solr and Elasticsearch is mostly a manual process because it requires a decision that factors in data size, current request volume, and offline hours.

Amazon CloudSearch manages the re-indexing process inherently and leaves much less to administrators. The re-indexing time period is abstracted and not disclosed to administrators, but Amazon CloudSearch runs the re-indexing process based on the best practices mentioned above.
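The CloudSearch status flow described above can be sketched with an AWS SDK client. The example assumes a client object that exposes `describe_domains` and `index_documents` in the style of the boto3 `cloudsearch` client, and that the domain status carries the boolean flags `RequiresIndexDocuments` and `Processing`; the helper names and the state labels are ours, chosen to mirror the states named in this section.

```python
def domain_state(status):
    """Map domain status flags to the states named in this section."""
    if status.get("Processing"):
        return "PROCESSING"
    if status.get("RequiresIndexDocuments"):
        return "NEEDS INDEXING"
    return "ACTIVE"


def reindex_if_needed(client, domain_name):
    """Issue index-documents only when the domain reports that it needs it.

    Returns (state_before, request_issued).
    """
    response = client.describe_domains(DomainNames=[domain_name])
    status = response["DomainStatusList"][0]
    state = domain_state(status)
    if state == "NEEDS INDEXING":
        client.index_documents(DomainName=domain_name)
        return state, True
    return state, False
```

With boto3 installed, the client would typically be created with `boto3.client("cloudsearch")`; an upload job could call `domain_state` in the same way to decide when to throttle or pause.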
Feature 3: Monitoring

Monitoring server health is an essential daily task for operations and administration. In this section, we will describe the built-in monitoring capabilities of all three search engines.

Apache Solr

Apache Solr has a built-in web console for monitoring indexes, performance metrics, index distribution and replication, and all threads running in the Java Virtual Machine (JVM) at the time. For more detailed monitoring, Java Management Extensions (JMX) can be configured with Solr to share runtime statistics as MBeans. The Apache Solr JVM container has built-in instrumentation that enables monitoring using JMX.

Elasticsearch

Elasticsearch has a management and monitoring plugin called ‘Marvel’. Marvel has an interactive console called ‘Sense’ that helps users interact easily with Elasticsearch nodes. Elasticsearch has a diverse set of built-in APIs that emit heap usage, garbage collection stats, file descriptors, and more. Marvel is tightly integrated with these APIs: it periodically polls them, collects statistics, and stores the data back in Elasticsearch. Marvel’s interactive graph report dashboard allows administrators to query and aggregate historical stats data.

Amazon CloudSearch

Amazon CloudSearch recently introduced Amazon CloudWatch integration. The Amazon CloudSearch metrics can be used to make scaling decisions, troubleshoot issues, and manage clusters. Amazon CloudSearch publishes four metrics into Amazon CloudWatch: SuccessfulRequests, Searchable Documents, Index Utilization, and Partition Count. Alarms can be set on the CloudWatch metrics to notify administrators through Amazon Simple Notification Service.

Conclusion

Apache Solr and Elasticsearch integrate with both built-in and external plugins. They can also support SaaS-based monitoring plugins or custom plugins developed by customers. CloudSearch’s integration with CloudWatch shares some good metrics, and it is expected to offer newer ones in the future.
Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export

4.1 Schema management

Schema: A schema is a definition of fields and field types used by the search system to organize data within the document files it indexes.

Schema definition is the foremost task in the search data structure design. It is important that the schema definition caters to all business requirements and is designed to suit the application.

Apache Solr and Elasticsearch

Both Elasticsearch and Apache Solr can run the search application in ‘Schema-less’ and ‘Schema’ mode. Schema mode is suitable for application development and for any production environment. Schema-less mode is a very good option for newcomers getting started. After server setup, users can start the application without a schema structure and create the field definitions during indexing. However, to have a production-grade application running, a proper schema structure becomes mandatory and the schema definition is a necessity.

Amazon CloudSearch

Amazon CloudSearch also allows users to set up search domains without any index fields. The index fields can be added at any time, but they must exist before any valid document indexing or search request. In addition, the CloudSearch management console integrates with Amazon Web Services such as S3 and DynamoDB, and can also access a local machine, from which the schema can be imported directly into a CloudSearch domain. After the schema import, CloudSearch allows the user to edit the fields or add new fields. This is a convenient feature when a pre-built schema is to be migrated to a CloudSearch domain.

Conclusion

Apache Solr and Elasticsearch can be started without any schema, but they cannot be put into production use without one. Amazon CloudSearch allows creating domains without any index fields, but the schema has to be created before any index or search requests can be served.

The general best practice in schema management is to rehearse and design the schema to suit application requirements before finalizing the search structure. The underlying schema concept of all three search engines is consistent with this practice.

4.2 Dynamic fields

Dynamic fields are like regular field definitions, but they support wildcard matching. They allow documents to be indexed without knowing in advance which fields they contain.

A dynamic field is defined using a wildcard (*) as the first, last, or only character of the field name. All undefined fields go through the dynamic field rules, which match the field name against the configured patterns and apply the matching dynamic field's indexing options.
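The wildcard matching behind dynamic fields can be illustrated with a small, engine-independent sketch. The field names, type labels, and the "longest pattern wins" tie-break are assumptions made for illustration; each search engine defines its own precedence rules.

```python
from fnmatch import fnmatchcase


def resolve_field(name, explicit_fields, dynamic_rules):
    """Return the type for a field name, or None if nothing matches.

    Explicitly defined fields win; undefined fields fall through to the
    dynamic rules, which are wildcard patterns such as "*_i" or "*".
    """
    if name in explicit_fields:
        return explicit_fields[name]
    # Assumption: try longer (more specific) patterns before shorter ones.
    ordered = sorted(dynamic_rules.items(), key=lambda kv: -len(kv[0]))
    for pattern, field_type in ordered:
        if fnmatchcase(name, pattern):
            return field_type
    return None
```

For example, with the rules `{"*_i": "int", "*": "literal"}`, an undefined field `price_i` resolves to `int`, while any other undefined field falls back to `literal`.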
Apache Solr and Elasticsearch

Apache Solr and Elasticsearch allow end users to set up dynamic fields and rules using the RESTful API and schema configuration.

Amazon CloudSearch

In Amazon CloudSearch, dynamic fields can be configured using indexing options in the CloudSearch management console, or using the CloudSearch RESTful API or AWS SDK.

Conclusion

If you are unsure about the schema structure or exact field names, dynamic fields come in handy. Amazon CloudSearch, Apache Solr, and Elasticsearch all offer the flexibility to configure dynamic fields. This helps the application development team handle any field definitions omitted from the schema document.

4.3 Data types

These search engines support a variety of data types. The table below lists the data field types supported by each search engine.

Data type                             Solr                            Elasticsearch                               CloudSearch
String / Text                         Yes                             Yes                                         Yes
Number types                          integer, double, float, long    byte, short, integer, long, float, double   integer, double
Date types                            Yes                             Yes                                         Yes
Enum fields                           Yes                             Yes                                         No
Currency                              Yes                             No                                          No
Geo location / Latitude – Longitude   Yes                             Yes                                         Yes
Boolean                               Yes                             Yes                                         No
Array types                           Yes                             Yes                                         Yes

Conclusion

The most important data types, such as string, date, and number types, are supported by all three search engines. The geo location data type, which is now regularly used by modern applications, is also supported by all three. Engineers and developers may use an alternate data type if a particular data type is not supported by their chosen search engine. For example, the ‘currency’ data type supported in Solr is not available in Elasticsearch and CloudSearch. In such cases, engineers use a number type as an alternative data type for ‘Currency’.

4.4 Data import & export

The most important task in search application development is data migration from the originating source to the search engine. The originating source can be a database, a file system, or another persistent store. To seed a search data set, the full data set must be migrated or imported from its origin into the search engine. Likewise, extracting data from a search engine and exporting it to a different destination is also a crucial task, though one executed only occasionally.

Apache Solr

Apache Solr has a built-in handler called the Data Import Handler (DIH). The DIH provides a tool for migrating and/or importing data from the origin store. The DIH can index data from data sources such as:
• Relational Database Management System (RDBMS)
• Email
• HTTP URL end point
• Feeds like RSS and ATOM
• Structured XML files

The DIH has more advanced features such as Apache Tika integration, delta import, and transformers to quickly migrate the data.

The Apache Solr export handler can export query result data in JavaScript Object Notation (JSON) or comma-separated values (CSV) format. The export query expects sort and filter query parameters and returns only the stored fields. Users also have the option of developing a custom export handler and incorporating it with the Solr core libraries.

Elasticsearch

Elasticsearch ‘Rivers’ is an elegant pluggable service which runs inside the Elasticsearch cluster. This service can be configured for pulling or pushing the data that is indexed into the cluster. Some of the popular Elasticsearch Rivers modules are CouchDB, Dropbox, DynamoDB, FileSystem, Java Database Connectivity (JDBC), Java Messaging Service (JMS), MongoDB, neo4j, Redis, Solr, Twitter, and Wikipedia.
However, ‘Rivers’ will be deprecated in a newer release of Elasticsearch, which recommends using the official client libraries built for popular programming languages instead. Alternatively, the Logstash input plugin is one of the identified tools that can be used to ship data into Elasticsearch.

For data export, an Elasticsearch snapshot can be used to copy individual indices or an entire cluster into a remote repository. This is discussed in detail in the section ‘Operations and Management - Backup’.

Amazon CloudSearch

Amazon CloudSearch recommends sending documents in batches to upload them to a CloudSearch domain. A batch is a collection of add, update, and delete operations, described in JSON or XML format.

Amazon CloudSearch limits a single batch upload to 5 MB, but allows batches to be uploaded in parallel to reduce the time needed for a full data upload. The number of parallel batch uploads a domain can sustain depends on the CloudSearch instance type: larger instance types have a higher upload capacity, while smaller instance types have a lower one. The batch upload programs should therefore throttle their uploads intelligently based on instance capacity.

Conclusion

Apache Solr has good handlers to export and import data. If the available options are not viable, Apache Solr allows one to develop a new custom handler or customize an existing handler for data import and export.

Elasticsearch integrates with popular data sources in the form of ‘River’ modules or plugins. However, future versions of Elasticsearch strongly recommend using Logstash input plugins or developing and contributing new Logstash inputs, as plugin customization is allowed in Elasticsearch.

Amazon CloudSearch does not have elaborate options like the other two search engines. However, by combining custom programs with the bulk upload recommendations for Amazon CloudSearch, customers can successfully migrate data into CloudSearch.
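A custom upload program has to keep each batch under the 5 MB limit mentioned above. The sketch below shows one way to do that for JSON batches; it assumes each operation is a dictionary such as `{"type": "add", "id": ..., "fields": {...}}` or `{"type": "delete", "id": ...}`, and the function name and the slightly conservative size accounting are our own choices. Uploading the batches and throttling parallel uploads are left out.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5 MB batch limit cited in this section


def make_batches(operations, max_bytes=MAX_BATCH_BYTES):
    """Yield JSON array strings whose UTF-8 size stays within max_bytes."""
    current, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for op in operations:
        encoded = json.dumps(op)
        op_size = len(encoded.encode("utf-8")) + 1  # +1 for a comma
        if op_size + 2 > max_bytes:
            raise ValueError("a single operation exceeds the batch limit")
        if current and size + op_size > max_bytes:
            yield "[" + ",".join(current) + "]"
            current, size = [], 2
        current.append(encoded)
        size += op_size
    if current:
        yield "[" + ",".join(current) + "]"
```

Because the function is a generator, a large export can be streamed through it without holding every batch in memory at once.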
Feature 5: Search and Indexing features

In this section, we will evaluate the ‘Search and Indexing’ features present in the search engines we are comparing. This is a very important feature set, as these features are widely used by search application engineers.

5.1 Analyzers, Tokenizers and Token filters

Generally speaking, the search engine prepares text strings for indexing and searching using analyzers, tokenizers, and filters. These tools are implemented as libraries that are configured for indexing and searching the data. Most of the time, the libraries are chained in sequence.
• During indexing and querying, the analyzer assesses the field text and tokenizes each block of text into individual terms. Each token is a sub-sequence of the characters in the text.
• The token filter processes each token in the stream sequentially and applies its filter functionality.

Apache Solr and Elasticsearch

Apache Solr and Elasticsearch have multifaceted built-in libraries for analyzers, tokenizers, and token filters. These libraries are packaged with the search engine installation and can be configured during indexing and searching. Although analyzers can be configured for both indexing and querying, the same series of libraries doesn’t need to be used for both operations. The indexing and searching operations can be configured with different tokenizers and filters, as their goals can be different.
Search Engine: Apache Solr
Tokenizers: Standard, Classic, Keyword, Letter, Lower Case, N-Gram, Edge N-Gram, ICU, Path Hierarchy, Regular Expression Pattern, UAX29 URL Email, White Space
Filters: ASCII Folding, Beider-Morse, Classic, Common Grams, Collation Key, Daitch-Mokotoff Soundex, Double Metaphone, Edge N-Gram, English Minimal Stem, Hunspell Stem, Hyphenated Words, ICU Folding, ICU Normalizer 2, ICU Transform, Keep Words, KStem, Length, Lower Case, Managed Stop, Managed Synonym, N-Gram, Numeric Payload Token, Pattern Replace, Phonetic, Porter Stem, Remove Duplicates Token, Reversed Wildcard, Shingle, Snowball Porter, Stemmer, Standard, Stop, Suggest Stop, Synonym, Token Offset Payload, Trim, Type As Payload, Type Token, Word Delimiter

Search Engine: Elasticsearch
Tokenizers: Standard, Edge NGram, Keyword, Letter, Lowercase, NGram, Whitespace, Pattern, UAX Email URL, Path Hierarchy, Classic, Thai
Filters: Standard Token, ASCII Folding Token, Length Token, Lowercase Token, Uppercase Token, NGram Token, Edge NGram Token, Porter Stem Token, Shingle Token, Stop Token, Word Delimiter Token, Stemmer Token, Stemmer Override Token, Keyword Marker Token, Keyword Repeat Token, KStem Token, Snowball Token, Phonetic Token, Synonym Token, Compound Word Token, Reverse Token, Elision Token, Truncate Token, Unique Token, Pattern Capture Token, Pattern Replace Token, Trim Token, Limit Token Count Token, Hunspell Token, Common Grams Token, Normalization Token, CJK Width Token, CJK Bigram Token, Delimited Payload Token, Keep Words Token, Keep Types Token, Classic Token, Apostrophe Token

Amazon CloudSearch

The Amazon CloudSearch analysis scheme configuration is used for analyzing text data during indexing. The analysis schemes basically control:
• Text field content processing
• Stemming
• Inclusion of stopwords and synonyms
• Tokenization (Japanese language)
• Bigrams (Chinese, Japanese, Korean languages)

The following analysis options are executed when text fields are configured with an analysis scheme:
1. Algorithmic stemming: Level of algorithmic stemming (minimal, light, and heavy) to perform. The stemming levels vary depending on the analysis scheme language.
2. Stemming dictionary: A dictionary to override the results of the algorithmic stemming.
3. Japanese Tokenization Dictionary: A dictionary which specifies how particular characters should be grouped into words (only for Japanese language).
4. Stopwords: A set of terms that should be ignored both during indexing and at search.
5. Synonyms: A dictionary of words that have the same meaning in the text data.

Before processing the analysis scheme, Amazon CloudSearch tokenizes and normalizes the text data. During tokenization, the text data is split into multiple tokens; this is common behavior in all search engine text processing. During normalization, upper case characters are converted to lower case, and more formatting is applied. After the tokenization and normalization processes are completed, stemming, stopwords, and synonyms are applied.
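The processing order described above (tokenize, normalize, then apply stemming, stopwords, and synonyms) can be sketched as a toy pipeline. This is not CloudSearch's implementation: the plural-stripping stemmer is a deliberately crude stand-in for algorithmic stemming, and the relative order of the last three steps is an assumption made for the example.

```python
import re


def light_stem(token):
    """Toy stemmer: strip a trailing plural 's' (stand-in for real stemming)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text, stopwords, synonyms, stem_overrides):
    """Run text through tokenization, normalization, and analysis options."""
    tokens = re.findall(r"\w+", text)                   # tokenization
    tokens = [t.lower() for t in tokens]                # normalization
    tokens = [t for t in tokens if t not in stopwords]  # stopwords
    # A stemming dictionary entry overrides the algorithmic stemmer.
    tokens = [stem_overrides.get(t, light_stem(t)) for t in tokens]
    expanded = []
    for token in tokens:                                # synonyms
        expanded.append(token)
        expanded.extend(synonyms.get(token, []))
    return expanded
```

The `stem_overrides` argument plays the role of the stemming dictionary in item 2 above: an entry such as `{"mice": "mouse"}` takes precedence over the algorithmic result.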
Conclusion

Apache Solr and Elasticsearch are packaged with a variety of libraries for analyzers, tokenizers, and filters, each with distinct functions. These libraries can also be customized, which gives developers greater flexibility.

Amazon CloudSearch doesn’t carry sophisticated tokenizer or filter libraries like Apache Solr or Elasticsearch, but it has simplified the configuration. The Amazon CloudSearch tokenizers and filters cover the most common search requirements and use cases. This is ideal for developers who want to quickly integrate search functionality into their application stack.
5.2 Faceting

Faceting is the arrangement of search results into categories or groups, based on indexed terms. Faceting allows search results to be categorized into sub-groups, which can be used as the basis for filters or further searches. Faceting also supports efficient computation of search results by facet.

For example, facets for ‘Laptop’ search results can be ‘Price’, ‘Operating System’, ‘RAM’, or ‘Shipping Method’.

Faceting is a popular function that helps consumers filter through search results easily and effectively.

Apache Solr

Apache Solr has extensive faceting options, ranging from simple to very advanced faceting behavior. The table below details the parameters used during faceting. They can be grouped by field value, date, range, pivot, multi-select, and interval.

Facet grouping              Parameters
Field value parameters      facet.field, facet.prefix, facet.sort, facet.limit, facet.offset, facet.mincount, facet.missing, facet.method, facet.enum.cache.minDf, facet.threads
Date faceting parameters    facet.date, facet.date.start, facet.date.end, facet.date.gap, facet.date.hardend, facet.date.other, facet.date.include
Range faceting parameters   facet.range, facet.range.start, facet.range.end, facet.range.gap, facet.range.hardend, facet.range.other, facet.range.include
Pivot                       facet.pivot, facet.pivot.mincount
Interval                    facet.interval, facet.interval.set

Elasticsearch

Elasticsearch has deprecated facets and announced that they will be removed in a future release. The Elasticsearch team felt that their facet implementation was not designed from the ground up to support complex aggregations. Elasticsearch will be replacing facets with aggregations in its next release.

Elasticsearch says “An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents. The context of the execution defines what this document set is (for example, a top-level aggregation executes within the context of the executed query/filters of the search request).”

Elasticsearch strongly recommends migrating from facets to aggregations. The aggregations are classified into two main families, Bucketing and Metric.

The following table lists the aggregations available in Elasticsearch.
Elasticsearch Aggregators: Min Aggregation, Max Aggregation, Sum Aggregation, Avg Aggregation, Stats Aggregation, Extended Stats Aggregation, Value Count Aggregation, Percentiles Aggregation, Percentile Ranks Aggregation, Cardinality Aggregation, Geo Bounds Aggregation, Top hits Aggregation, Scripted Metric Aggregation, Global Aggregation, Filter Aggregation, Filters Aggregation, Missing Aggregation, Nested Aggregation, Reverse nested Aggregation, Children Aggregation, Terms Aggregation, Significant Terms Aggregation, Range Aggregation, Date Range Aggregation, IPv4 Range Aggregation, Histogram Aggregation, Date Histogram Aggregation, Geo Distance Aggregation, GeoHash grid Aggregation

Amazon CloudSearch

Amazon CloudSearch simplifies facet configuration: facets are enabled when defining indexing options during CloudSearch domain configuration. These facets are targeted at common use cases such as e-commerce, online travel, and classifieds. A facet can be any field of type date, literal, or numeric.

Amazon CloudSearch also allows buckets to be defined in order to calculate facet counts for particular subsets of the facet values. The facet information can be retrieved in two ways:
Sort: returns facet information sorted either by facet counts or facet values.
Buckets: returns facet information for particular facet values or ranges.

During searching, facet information can be fetched for any facet-enabled field by specifying the “facet.FIELD” parameter in the search request (‘FIELD’ is the name of a facet-enabled field).

Amazon CloudSearch allows multiple facets in one request, which helps to refine search results further, as in the example below.

Example: "q=poet&facet.genres={}&facet.rating={}&facet.year={}&return=_no_fields"

Conclusion

All three search engines allow users to perform faceting with minimal effort. However, for advanced, complex implementations, the approach is different for each search engine.

5.3 Auto Suggestion

When a user types a search query, suggestions relevant to the query input are presented, and as the user types more characters, refined suggestions are presented. This feature is called auto-suggest. Auto-suggest is an appealing and useful feature, employed in many search user interfaces.

This feature can be implemented at the search engine level or at the search application level. Below are some of the options available in these three search engines.
Apache Solr

Apache Solr has native support for the auto-suggest feature. It can be implemented using NGramFilterFactory, EdgeNGramFilterFactory, or TermsComponent. Usually, this Apache Solr feature is used in conjunction with jQuery or asynchronous client libraries to create a powerful auto-suggestion user experience in front-end applications.

Elasticsearch

Elasticsearch also supports edge n-grams, which are easy to set up, flexible, and fast. Elasticsearch introduced a new data structure, the Finite State Transducer (FST), which resembles a big graph data structure. This data structure is managed in memory, which makes it much faster than a term-based query could be. Elasticsearch also recommends using edge n-grams when the query input and its word ordering are less predictable.

Amazon CloudSearch

Amazon CloudSearch offers ‘Suggesters’ to achieve auto-suggest. CloudSearch Suggesters are configured on a particular text field. When a Suggester is queried with a search string, CloudSearch lists all documents where the value of the Suggester field begins with that search string. Suggesters can be configured to find matches for the exact query, or to perform fuzzy matching to correct the query string. ‘Fuzzy Matching’ can be defined with a fuzziness level of Low, High, or Default. Suggesters can also be configured with a SortExpression, which computes a score for each suggestion.

It’s important to re-index the domain when a new Suggester is configured. Suggestions will not be reflected until all of the documents are indexed.

Conclusion

Amazon CloudSearch provides a simple yet powerful ‘Suggest’ implementation, which is sufficient for most applications. If you are looking for advanced options or further customization of ‘Suggestions’, Apache Solr and Elasticsearch offer some good options.
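The edge n-gram idea mentioned for Apache Solr and Elasticsearch can be shown with a small in-memory sketch. It is an illustration of the technique only, not either engine's implementation: every prefix of a phrase is indexed at build time, so a suggestion lookup becomes a single dictionary access. The function names and the `max_len` and `limit` defaults are assumptions.

```python
from collections import defaultdict


def edge_ngrams(term, min_len=1, max_len=10):
    """Return the prefixes of term, e.g. 'lamp' -> l, la, lam, lamp."""
    term = term.lower()
    longest = min(len(term), max_len)
    return [term[:n] for n in range(min_len, longest + 1)]


def build_suggest_index(phrases):
    """Map every edge n-gram to the set of phrases that start with it."""
    index = defaultdict(set)
    for phrase in phrases:
        for gram in edge_ngrams(phrase):
            index[gram].add(phrase)
    return index


def suggest(index, typed, limit=5):
    """Return up to `limit` phrases that begin with the typed characters."""
    return sorted(index.get(typed.lower(), ()))[:limit]
```

The trade-off is the usual one for edge n-grams: a larger index in exchange for very cheap lookups as each character is typed.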
  • 23. Amazon CloudSearch Comparison Report PAGE 23 of 39 8K Miles 2007-2015 1-855-8KMILES (855-856-4537 info@8kmiles.com 5.4 Highlighting Highlighting is a way of giving formatting clues to end users in the search results. It is a valuable feature, where the front-end search applications highlight search snippets of text from each search result. This function conveys to end users why the result document matched their query. In this section, we will describe the options present in all three search engines. Apache Solr Apache Solr includes document text fragments, which are matched in the query response. These text fragments are included in the response as a highlighted section that is used as a cue by search clients for representation. Apache Solr is packaged with good highlighting collections which give control over the text fragments, fragment size, fragment formatting, and so on. These highlighting collections can be incorporated with Solr Query parsers and Request Handlers. Apache Solr comes with three highlighting utilities • Standard Highlighter • FastVector Highlighter • Postings Highlighter Standard Highlighter is most commonly used by search engineers because it is a good choice for a wide variety of search use-cases. The FastVector Highlighter is ideal for large documents and highlighting text in a variety of languages. The Postings Highlighter works well for full-text keyword search. Elasticsearch Elasticsearch also allows for highlighting search results on one or more fields. The implementation uses a Lucene based highlighter, fast-vector-highlighter and postings-highlighter. In Elasticsearch, the highlighter can be configured in the query to force a specific highlighter type. This is a very flexible option for developers to choose a specific highlighter to suit their requirements. 
The three highlighters in Elasticsearch behave the same way as their Solr counterparts, because both sets are inherited from the Lucene family.

Amazon CloudSearch
Amazon CloudSearch simplifies highlighting: you specify the highlight.FIELD parameter in the search request. Amazon CloudSearch then returns excerpts with the search results to show where the search terms occur within a particular field of a matching document. For example, the search term ‘Smart Phone’ is highlighted in the description field:

"Highlights": {"description": "A *smartphone* is a mobile phone with an advanced mobile operating system. They typically combine the features of a cell phone with those of other popular mobile devices, such as personal digital assistant (PDA), media player and GPS navigation unit. A *smartphone* has a touchscreen user interface and can run third-party apps, and are camera phones."}

Amazon CloudSearch also provides controls such as the number of search term occurrences within an excerpt, how the terms should be highlighted, whether the excerpt is plain text or HTML, and so on.

Conclusion
From a development perspective, all three search engines provide easy and simple highlighting implementations. If you are looking for different and more advanced highlighting options, Apache Solr and Elasticsearch have some good features.
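To make the highlighting controls concrete, here is a minimal sketch in Python of what a highlighter does to an excerpt. It is not the implementation used by any of the three engines; the function `highlight` and its parameters `pre_tag`, `post_tag`, and `max_phrases` are hypothetical names chosen for this illustration.

```python
import re


def highlight(text, terms, pre_tag="*", post_tag="*", max_phrases=None):
    """Wrap occurrences of any term in pre_tag/post_tag, case-insensitively.

    max_phrases caps how many occurrences are marked; None marks all.
    """
    if not terms or max_phrases == 0:
        return text
    # Longest terms first so a longer term wins over its own prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    count = 0 if max_phrases is None else max_phrases  # 0 means "all" for re.sub
    return pattern.sub(
        lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text, count=count
    )


excerpt = "A smartphone is a mobile phone. A Smartphone has a touchscreen."
print(highlight(excerpt, ["smartphone"]))
# A *smartphone* is a mobile phone. A *Smartphone* has a touchscreen.
print(highlight(excerpt, ["smartphone"], "<em>", "</em>", max_phrases=1))
# A <em>smartphone</em> is a mobile phone. A Smartphone has a touchscreen.
```

Switching the tags from `*` to `<em>` is the plain-text versus HTML choice mentioned above; capping `max_phrases` limits how many occurrences are marked in one excerpt.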
  • 24. Feature 6: Multilingual support
Multilingual support is a very important feature for global applications that cater to non-English-speaking geographies. A survey by a leading information measurement company reveals that search engines built with multilingual features are emerging and successful because of their native language support and their focus on the cultural background of the users.

Business Impact: Multilingual search is an effective marketing strategy for getting the attention of consumers. In e-commerce, presenting content in the customer's native language creates a platform for more business.

Apache Solr
Apache Solr is packaged with multilingual support for the most common languages. It carries many language-specific tokenizer and filter libraries, which can be configured for indexing and querying. Apache Solr engineering forums recommend a multi-core architecture in which each core manages one language. Solr also supports language detection using the Tika and LangDetect detection features, which helps map text data to language-specific fields during indexing.

Elasticsearch
Elasticsearch incorporates a vast collection of language analyzers for the most commonly spoken languages. The primary role of a language analyzer is to split, stem, filter, and apply the transformations specific to the language. Elasticsearch also allows a user to define a custom analyzer that extends another analyzer.

Amazon CloudSearch
Amazon CloudSearch has strong support for language-specific text processing. It provides predefined default analysis schemes for 34 languages, and processes text and text-array fields according to the configured language-specific analysis scheme.
Amazon CloudSearch also allows a user to define a new analysis scheme that can be an extension of the default language analysis scheme. Conclusion All three search engines have ample and effective support features for widely spoken international languages.
  • 25. Languages Support
The table below lists the languages supported by each search engine.

Search engine: Languages supported

Apache Solr: Arabic, Brazilian Portuguese, Bulgarian, Catalan, Chinese, Simplified Chinese, CJK, Czech, Danish, Dutch, Finnish, French, Galician, German, Greek, Hebrew, Lao, Myanmar, Khmer, Hindi, Indonesian, Italian, Irish, Japanese, Latvian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Scandinavian, Serbian, Spanish, Swedish, Thai and Turkish

Elasticsearch: Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai and Turkish

Amazon CloudSearch: Arabic, Armenian, Basque, Bulgarian, Catalan, Chinese - Simplified, Chinese - Traditional, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hebrew, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and Thai
  • 26. Feature 7: Protocol & API Support

7.1 Request and Response formats
Search engine: Request formats; Response formats
Apache Solr: XML, JSON, CSV; JSON, XML, CSV
Elasticsearch: JSON; XML, JSON
Amazon CloudSearch: XML, JSON; XML, JSON

7.2 External Integrations
Search engine: Integrations available
Apache Solr: Drupal, Magento, Django, ColdFusion, Wordpress, OpenCMS, Plone, Typo3, ez Publish, Symfony2, Riak, DataStax Enterprise Search, Cloudera Search, Hortonworks Data Platform, MapR
https://wiki.apache.org/solr/IntegratingSolr#Integrating_Solr_With_Other_.28Non_Search.29_Applications
Elasticsearch: Drupal, Django, Symfony2, Wordpress, CouchBase, SearchBlox, Hortonworks Data Platform, MapR
http://www.Elasticsearch.org/guide/en/Elasticsearch/client/community/current/integrations.html
Amazon CloudSearch: (none listed)

7.3 Protocols Support
Search engine: Protocols support
Apache Solr: HTTP, HTTPS
Elasticsearch: HTTP, HTTPS
Amazon CloudSearch: HTTP, HTTPS
  • 27. Feature 8: High Availability
All three search engines are architected around these design principles:
      • High availability (HA)
      • Replication
      • Scaling
This section discusses the high availability options present in the three search engines.

8.1 Replication
Replication is the copying or synchronizing of the search index from master nodes to slave nodes so that the data can be managed efficiently. It is a key design principle for both high availability and scaling. From a high availability perspective, replication supports failover from master nodes (shards or leaders) to slave nodes (replicas). From a scaling perspective, replication is used to add slave or replica nodes when request traffic increases.

Apache Solr
Apache Solr supports two models of replication: legacy mode and SolrCloud. In legacy mode, the replication handler copies data from the master node index to the slave nodes. The master server manages all index updates and the slave nodes handle read queries. This separation of master and slave allows Solr clusters to scale to heavy loads.

Apache SolrCloud is an advanced distributed cluster setup of Solr nodes designed for high availability and fault tolerance. Unlike legacy mode, there is no explicit concept of "master/slave" nodes. Instead, the search cluster is split into leaders and replicas. The leader is responsible for ensuring that the replicas are updated with the same data stored in the leader. Apache Solr has a configuration setting called ‘numShards’ which defines the number of shards (leaders). During start-up, the core index is split across ‘numShards’ shards, and the shards are represented as leaders. Nodes that join the Solr cluster after the initial ‘numShards’ nodes are automatically assigned as replicas for the leaders.
Elasticsearch
Elasticsearch follows a similar concept to SolrCloud. In brief, an Elasticsearch index can be split into multiple shards, and each shard can be replicated to any number of nodes (0, 1, 2 … n). When replication is complete, the index has primary shards and replica shards. The number of shards and replicas is defined during index creation. The number of replicas can be changed dynamically, but the shard count cannot.

Apache Solr and Elasticsearch
Both Apache Solr and Elasticsearch support synchronous and asynchronous replication models. If replication is configured in ‘synchronous’ mode, the primary (leader) shard waits for successful responses from the replica shards before acknowledging the commit. If the model is ‘asynchronous’, the response is returned to the client as soon as the request is executed on the primary or leader shard, and the request is forwarded to the replicas asynchronously. The diagram below depicts the replication concept followed in Solr and Elasticsearch.
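The difference between the two replication models can be sketched with a small in-memory simulation in Python. This is a toy model under stated assumptions, not Solr or Elasticsearch code: the classes `Primary` and `Replica`, the `mode` parameter, and the `flush` method (which stands in for background forwarding) are all hypothetical.

```python
from collections import deque


class Replica:
    def __init__(self, name):
        self.name = name
        self.docs = {}

    def apply(self, doc_id, doc):
        """Store the document and return an acknowledgement."""
        self.docs[doc_id] = doc
        return True


class Primary:
    def __init__(self, replicas, mode="sync"):
        if mode not in ("sync", "async"):
            raise ValueError("mode must be 'sync' or 'async'")
        self.docs = {}
        self.replicas = replicas
        self.mode = mode
        self.pending = deque()  # writes not yet forwarded (async only)

    def index(self, doc_id, doc):
        """Write locally, then replicate according to the configured mode."""
        self.docs[doc_id] = doc
        if self.mode == "sync":
            # Wait for every replica to acknowledge before answering the client.
            return all(r.apply(doc_id, doc) for r in self.replicas)
        # Async: answer immediately and forward to the replicas later.
        self.pending.append((doc_id, doc))
        return True

    def flush(self):
        """Forward queued writes to the replicas."""
        while self.pending:
            doc_id, doc = self.pending.popleft()
            for r in self.replicas:
                r.apply(doc_id, doc)


sync_replica = Replica("R1")
Primary([sync_replica], mode="sync").index("1", {"title": "a"})
print("1" in sync_replica.docs)   # True: replica updated before the reply

async_replica = Replica("R1")
async_primary = Primary([async_replica], mode="async")
async_primary.index("1", {"title": "a"})
print("1" in async_replica.docs)  # False: reply sent before forwarding
async_primary.flush()
print("1" in async_replica.docs)  # True
```

The window between the asynchronous reply and the flush is where a replica can serve stale results, which is the trade-off for lower write latency.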
  • 28. Replication handling in Apache Solr and Elasticsearch (diagram legend)
      • S1: Node 1, Shard 1 of the cluster
      • S2: Node 2, Shard 2 of the cluster
      • R1: Node 3, Replica 1 of Shard 1
      • R2: Node 4, Replica 1 of Shard 2
      • R3: Node 5, Replica 2 of Shard 1
      • R4: Node 6, Replica 2 of Shard 2

Amazon CloudSearch
Amazon CloudSearch handles replication in a simple and refined way that streamlines the job of search engineers and administrators. During scaling configuration, Amazon CloudSearch prompts for the desired replication count, which should be based on load requirements. Amazon CloudSearch automatically scales the replicas for a domain up and down based on request traffic and data volume, but never below the desired replication count.

In Amazon CloudSearch, the replication scaling option can be changed at any time. If the scale requirement is temporary (for example, anticipated spikes because of a seasonal sale), the desired replication count of the domain can be pre-scaled and then reverted after the request volume returns to a steady state. Modifying the replication count does not require any index rebuilding, but the time to complete the replica sync depends on the size of the search index.

The Amazon CloudSearch replication model has the following benefits:
      • The search instance capacity is automatically replicated and load is distributed, so the search layer is robust and highly available at all times.
      • Improved fault tolerance. If any one of the replicas is down, the other replica(s) continue to handle requests while the failed replica is in recovery mode.
      • The entire process of scaling and distribution is automated, which avoids manual intervention and support.
  • 29. Conclusion
All three search engines have a good base to support the ‘replication’ feature. Apache Solr and Elasticsearch allow you to define your own replication topology, which can be configured for synchronous or asynchronous replication. They can be scaled manually, or automatically by writing custom programs, based on application requirements. However, substantial managed service operations are required if cluster replication is set up at enterprise scale. Amazon CloudSearch fully manages replication by handling scaling, load distribution, and fault tolerance. This simplicity saves operations costs for enterprises.

8.2 Failover
Failover is a back-end operation that switches to secondary or standby nodes in the event of a primary server failure. It is an important fault tolerance function for systems with low or zero downtime requirements.

Apache Solr and Elasticsearch
When an Apache Solr or Elasticsearch cluster is built with shards and replicas, the cluster inherently becomes fault-tolerant and supports failover automatically. During any failure, the cluster is expected to keep supporting operations while the failed node is put into a recovery state. Both the Apache Solr and Elasticsearch documentation strongly recommend a distributed cluster setup to protect the user experience from application or infrastructure failure.

If all nodes storing a shard and its replicas fail, client requests will also fail. If the shards are set to a tolerant configuration, partial results can be returned from the available shards. This behavior is expected in both Apache Solr and Elasticsearch. The representation below depicts how failover is handled in a cluster. This flow applies to both Solr and Elasticsearch.
  • 30. Cluster layout (node, first replica, second replica):
      • SHARD 1: Node Number 1; first replica: Node Number 3; second replica: Node Number 5
      • SHARD 2: Node Number 2; first replica: Node Number 4; second replica: Node Number 6

The table below illustrates the failure scenarios in a search cluster.
      • Scenario A: If SHARD 1 fails, one of its replica nodes (Node Number 3 or Node Number 5) is chosen as leader.
      • Scenario B: If SHARD 2 fails, one of its replica nodes (Node Number 4 or Node Number 6) is chosen as leader.
      • Scenario C: If SHARD 1 REPLICA 1 fails, Shard 1 Replica 2 continues to support replication and to serve requests.
      • Scenario D: If SHARD 2 REPLICA 1 fails, Shard 2 Replica 2 continues to support replication and to serve requests.

Elasticsearch uses its internal Zen Discovery to detect failures. If the node holding a primary shard dies, a replica is promoted to the role of primary. Apache Solr uses Apache ZooKeeper for coordination, failure detection, and leader voting. ZooKeeper initiates a leader election among the replicas when a leader/primary shard fails.

Amazon CloudSearch
Amazon CloudSearch has built-in failover support. It offers scaling options and availability options to increase fault tolerance in the event of a service disruption or node failure. When Multi-AZ is turned on, Amazon CloudSearch provisions the same number of instances for your search domain in a second availability zone within the same region. The instances in either the primary or the secondary zone are capable of handling the full load in the event of a failure.

In the event of a service disruption or failure in one availability zone, traffic is automatically redirected to the secondary availability zone.
In parallel, Amazon CloudSearch self-heals the failed cluster, and Multi-AZ restores the nodes without any administrative intervention. During this switch, in-flight queries might fail and will need to be retried from the front-end application side.

Failover support can be improved by increasing the partitions and replicas in the Amazon CloudSearch scaling options. If one of the replicas or partitions fails, the other nodes (replicas or partitions) handle requests while it is being recovered. Amazon CloudSearch handles failure in a sophisticated way, because node health is continuously monitored. In the event of infrastructure failures, the nodes are automatically recovered or replaced.

Conclusion
Failover can be architected by applying techniques like replication, sharding, service discovery, and failure-detection services. Apache Solr and Elasticsearch advocate building your search system in ‘Cluster mode’ to address failover. They take on that responsibility by employing service discovery, which can detect unhealthy nodes. The service discovery maintains the cluster information and rebalances the search cluster when node failures are detected.
  • 31. Amazon CloudSearch supports failover for a single node as well as for cluster mode. Behind the scenes, CloudSearch continuously monitors the health of the search instances and manages them automatically during failures.
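The failure scenarios A through D described in section 8.2 for the six-node cluster can be sketched as a small Python model. This is an illustration only: the class `Shard` and the function `fail_node` are hypothetical, and promoting the first healthy replica is a simplifying assumption that stands in for the real leader election performed through ZooKeeper or Zen Discovery.

```python
class Shard:
    def __init__(self, name, leader, replicas):
        self.name = name
        self.leader = leader            # node number currently leading
        self.replicas = list(replicas)  # healthy replica node numbers

    def fail(self, node):
        """Mark `node` as failed and promote a replica if it was the leader."""
        if node == self.leader:
            # Assumption: promote the first healthy replica; None if none remain.
            self.leader = self.replicas.pop(0) if self.replicas else None
        elif node in self.replicas:
            self.replicas.remove(node)

    def available(self):
        """A shard can serve requests as long as it still has a leader."""
        return self.leader is not None


def fail_node(cluster, node):
    """Apply a node failure to every shard in the cluster."""
    for shard in cluster:
        shard.fail(node)


# Layout from the text: shard 1 on nodes 1, 3, 5; shard 2 on nodes 2, 4, 6.
cluster = [Shard("SHARD 1", 1, [3, 5]), Shard("SHARD 2", 2, [4, 6])]

fail_node(cluster, 1)  # Scenario A: the shard 1 leader fails
print(cluster[0].leader, cluster[0].replicas)  # 3 [5]

fail_node(cluster, 4)  # Scenario D: shard 2, replica 1 fails
print(cluster[1].leader, cluster[1].replicas)  # 2 [6]
```

If every node holding a shard fails, `available()` returns False for that shard, which corresponds to the case where client requests fail or only partial results can be returned.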
  • 32. Feature 9: Scaling
The ability to scale in terms of computing power, memory, or data volume is essential in any data-bound or traffic-bound application. Scaling is a significant design principle employed to improve performance, load balancing, and high availability. Over time, the search cluster is expected to be scaled horizontally (scale out) or vertically (scale up) depending on need. Scale-up is the process of moving from a small server to a large server. Scale-out is the process of adding multiple servers to handle the load. The scaling strategy should be selected based on application requirements.

Apache Solr and Elasticsearch
Scaling an Apache Solr or Elasticsearch application involves manual processes. These can range from a simple server addition to advanced tasks like cluster topology changes, storage changes, or infrastructure upgrades. With vertical scaling, the search cluster goes through steps like new setup and configuration, downtime, and node restarts. With horizontal scaling, the process may involve re-sharding, rebalancing, or cache warming.

While a search cluster can benefit from powerful hardware, vertical scaling has its own limitations. Upgrading or increasing the infrastructure specifications of the same server can involve tasks like:
      • New setup
      • Backup
      • Downtime
      • Application re-testing

Scaling out is regarded as a relatively easier task than scaling up. An expert search administrator (Apache Solr or Elasticsearch) is usually assigned to keep a close watch on the performance of the search servers. Infrastructure and search metrics play a key role in the administrator's decision making. When these metrics exceed the threshold of a particular server and start affecting overall performance, new servers have to be spawned manually.
The scaling task can also extend to index partitioning, auto-warming, caching, and re-routing or distribution of the search queries to the new instances. It requires a Solr or Elasticsearch expert on your team to identify and execute this activity periodically.

Sharding and Replication
Though scaling up, scaling out, and scaling down involve manual work, technology-driven companies automate this process by developing custom programs. These programs continuously monitor the cluster and make elastic scaling decisions. The result is quite similar to what AutoScaling offers. In terms of administration functionality, both Apache Solr and Elasticsearch offer scaling techniques called Sharding and Replication.
  • 33. Sharding (which means partitioning) is a method in which a search index is split into multiple logical units called "shards". Administrators recommend sharding when the indexed documents exceed the collection’s physical size. When sharding is enabled, search requests are distributed to every shard in the collection, and the results are collected individually and then merged. The other scaling technique, replication (discussed in detail in 8.1 Replication), adds new servers with redundant copies of your index data to handle higher concurrent query loads by distributing the requests across multiple nodes.

Amazon CloudSearch
Amazon CloudSearch is a fully managed search service; it scales up and down seamlessly as the amount of data or query volume changes. CloudSearch can be scaled based on data or based on request traffic. When the search data volume increases, CloudSearch scales from a smaller search instance type to a larger one. If the capacity of the largest search instance type is also exceeded, CloudSearch partitions the search index across multiple search instances (the sharding technique).

When traffic and concurrency grow, Amazon CloudSearch deploys additional search instances (replicas) to support the traffic load. This automation removes the complexity and manual labour required in the scaling-out process. Conversely, when traffic drops, Amazon CloudSearch scales down your search domain by removing the additional search instances in order to minimize costs.

The Amazon CloudSearch management console allows users to configure the desired partition count and the desired replication count. The AWS console also allows the instance type to be changed (scaling up) at any time.
This inherent elastic scaling behavior is one of the most important points in favor of Amazon CloudSearch.

Conclusion
Scaling in search is implemented in the form of Sharding and Replication. All three search engines have strong scaling support for setting up the search tier in ‘cluster mode’. Scaling in Apache Solr and Elasticsearch often requires administration, as there is no hard and fast rule. Techniques like elastic scaling can be implemented only up to a limit; when the cluster grows further, manual intervention and careful thought are required. Vertical scaling in Apache Solr and Elasticsearch is even more delicate. It requires individual management of the nodes in the cluster and is carried out using techniques like ‘rolling restarts’ and custom scripts.

Amazon CloudSearch takes all of these operational intricacies away from the administrators. With the desired partition count and desired replication count options set, CloudSearch automatically scales up and down based on the volume of data and request traffic. This saves a lot of effort and cost on operations and management.
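The sharding flow described in this section (distribute the request to every shard, collect results individually, then merge) can be sketched in a few lines of Python. This is a toy model rather than how any of the three engines route or score documents: the hash-based routing with `zlib.crc32`, the term-count scoring, and the function names `shard_for`, `index_docs`, `search_shard`, and `search` are all assumptions made for illustration.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard with a stable hash so the same id always lands together."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_docs(docs, num_shards):
    """Split {doc_id: text} across num_shards in-memory shards."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search_shard(shard, term):
    """Score each document on one shard by how often the term occurs."""
    term = term.lower()
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((doc_id, score))
    return hits


def search(shards, term, top_n=10):
    """Scatter the query to every shard, then merge the partial results."""
    partial = [hit for shard in shards for hit in search_shard(shard, term)]
    # Highest score first; ties broken by document id for a stable order.
    partial.sort(key=lambda hit: (-hit[1], hit[0]))
    return partial[:top_n]


docs = {  # sample data only
    "d1": "solr search search",
    "d2": "elasticsearch search",
    "d3": "cloud storage",
}
shards = index_docs(docs, num_shards=2)
print(search(shards, "search"))  # [('d1', 2), ('d2', 1)]
```

The merged result is the same for any shard count, which is what lets an index be re-partitioned without changing what the client sees.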
  • 34. Feature 10: Customization
At times, the search system or its software may not support a specific feature or have built-in integration with other systems. In such cases, most open source software allows developers to customize and extend it with the features they need as plugins, extensions, or modules. Often, the developer community shares extension libraries that serve a practical purpose. These libraries can be customized and integrated with the system.

Apache Solr and Elasticsearch
Apache Solr and Elasticsearch share the same source lineage and allow customization of:
      • Analyzers
      • Filters
      • Tokenizers
      • Language analysis
      • Field types
      • Validators
      • Fallback query analysis
      • Alternate custom query handlers

Since both products are open source, developers can customize or extend the libraries to fit the required feature modifications through plugins and libraries. The build and deployment become the developer’s responsibility after extending the code base. Apache Solr and Elasticsearch have many plugin extensions that allow developers to add custom functionality for a variety of purposes. These plugins are configured as special libraries and are referenced by the application through configuration mapping.

Amazon CloudSearch
Amazon CloudSearch does not allow any customizations. The search features in Amazon CloudSearch are offered by AWS after careful thought and collective feedback from customers. The Amazon CloudSearch team continually evaluates new features and rolls them out proactively.

Conclusion
Amazon CloudSearch has a highly capable feature set for developing search systems. However, if you anticipate heavy customization of your search functionality, Apache Solr or Elasticsearch are better choices because their core search libraries are open source.
It is also important to note that any customization of the core libraries leaves responsibility for the build and deployment process with the developer. The customization also needs to be maintained for every version upgrade or new release of your search engine.
  • 35. Feature 11: More

11.1 Client libraries
Client libraries are used to communicate with search engines. They handle the details of connecting to the search engine and let applications interact with it through high-level methods.

Apache Solr
Apache Solr has open source API clients for interacting with Solr using simple high-level methods. Client libraries are available for PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, and JavaScript.

Elasticsearch
Elasticsearch provides official clients for Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby. There are also community-provided client libraries that can be integrated with Elasticsearch.

Open source
Besides the official and open source client APIs, Elasticsearch and Apache Solr can be integrated using their RESTful APIs. The REST API can be called from a typical web client written in any preferred programming language, or even from the command line.

Amazon CloudSearch
Amazon CloudSearch exposes a RESTful API for configuration, document service, and search.
      • The configuration API is used for CloudSearch domain creation, configuration, and end-to-end management.
      • The document service API enables the user to add, replace, or delete documents in an Amazon CloudSearch domain.
      • The search API is used for search or suggestion requests to an Amazon CloudSearch domain.

Alternatively, AWS also provides a downloadable SDK package, which simplifies coding. The SDK is available for popular languages like Java, .NET, PHP, Python, and more. The SDK APIs cover most Amazon Web Services, including Amazon S3, Amazon EC2, CloudSearch, DynamoDB, and more. The SDK package includes the AWS library, code samples, and documentation.
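As a hedged sketch of what talking to the document service and search APIs involves without an SDK, the Python below only builds the request payload and URL; it sends nothing over the network. The endpoint host is hypothetical, the helper names `build_batch` and `search_url` are invented for this example, and the `2013-01-01` path layout and the add/delete batch shape reflect the CloudSearch API as understood here and should be checked against the AWS documentation before use.

```python
import json
from urllib.parse import urlencode

API_VERSION = "2013-01-01"  # assumed CloudSearch API version


def build_batch(adds, deletes):
    """Build a document batch: `adds` is {id: fields}, `deletes` is a list of ids."""
    batch = [
        {"type": "add", "id": doc_id, "fields": fields}
        for doc_id, fields in adds.items()
    ]
    batch += [{"type": "delete", "id": doc_id} for doc_id in deletes]
    return json.dumps(batch)


def search_url(endpoint, query, size=10):
    """Build a search request URL; `endpoint` is the domain's search host."""
    params = urlencode({"q": query, "size": size})
    return f"https://{endpoint}/{API_VERSION}/search?{params}"


# Hypothetical endpoint, for illustration only.
endpoint = "search-example.us-east-1.cloudsearch.amazonaws.com"
print(search_url(endpoint, "smart phone"))
print(build_batch({"p1": {"title": "Smart Phone"}}, ["p0"]))
```

An SDK wraps the same requests behind method calls and adds request signing and retries, which is the coding simplification the SDK package is meant to provide.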
  • 36. Feature 12: Cost
Cost is a very important factor overall, and companies always look for ways to reduce Total Cost of Ownership (TCO). This section looks at the cost components of the three search engines.

Apache Solr and Elasticsearch
The cost of Apache Solr and Elasticsearch includes infrastructure resources, managed services, and people. Server costs and engineer costs are essential for any type of deployment. The commitment to continuous admin operations depends on the application's requirements and its criticality.

Amazon CloudSearch
The Amazon CloudSearch cost components include server costs and engineer costs, which are essential for any search deployment, as with the two engines above. Because Amazon CloudSearch is a fully managed service, the managed services are covered as part of the server costs. Also, Amazon CloudSearch does not charge at the beginning of service usage; it charges at the end of the month based on CloudSearch usage.

Conclusion
The net operating costs are essentially the same across all three search engines, but people costs will be 30% more for self-managed Apache Solr or Elasticsearch compared to Amazon CloudSearch. For example, a highly important and critical search application will require 24 x 7 support and managed services. This cost is incurred as part of managed services, which is an additional cost in Apache Solr and Elasticsearch deployments. A detailed TCO analysis of Apache Solr, Elasticsearch, and Amazon CloudSearch can be read here. Link:
  • 37. Conclusion
Search is an indispensable feature in most business applications. Apache Solr and Elasticsearch are time-proven solutions. Many larger organizations have used Apache Solr and Elasticsearch for years, but are now looking for greater operational efficiency and cost effectiveness. At the same time, companies are looking for innovative ways to grow their businesses and provide value. In recent years, a large number of technology companies have started to take advantage of cloud-based search services, mainly for ease of getting started and for accommodating growth without the need to switch vendors. When scalability, cost, and speed-to-market are primary concerns, we recommend using some form of cloud service. And if you want to enjoy the benefits of a cloud solution built on the architecture of Apache Solr, we recommend Amazon CloudSearch.
  • 38. About the Authors

Dwarakanath Ramachandran
Dwarak is a Principal Architect at 8KMiles with more than a decade of hands-on experience in Cloud Computing, Big Data, Web technologies, and Product Management. He has varied and progressive experience in architecting distributed Web and Enterprise systems and products. He also has deep domain knowledge in the banking, finance, retail, and e-commerce industries. At present, he oversees technology consulting, architecture, delivery, and end-to-end customer transformation programs at 8KMiles.

Harish Ganesan
Harish is the Chief Technology Officer (CTO) and Co-Founder of 8KMiles. He has more than a decade of experience in architecting and developing cloud computing, e-commerce, and mobile application systems. He has also built large Internet banking solutions that catered to the needs of millions of users, where security and authentication were critical factors. He is responsible for the overall technology direction of the 8KMiles products and services in the Cloud, Big Data, and Mobility space. Harish is a thought leader in Cloud-related technologies and an Advisor, and his blogs have many followers.
  • 39. About 8KMiles
8KMiles is a solutions company focused on helping organizations of all sizes integrate Cloud, Identity, and Big Data into their IT and business strategies. 8KMiles’ team of experts, located in North America and India, offers a host of services and solutions such as Cloud, Federated Identity Consulting, Cloud Engineering, Migration, Big Data services, and Managed Services on Amazon Web Services. 8KMiles offers specialized expertise in mature verticals such as Pharma, Retail, Media, Travel, and Healthcare. Visit us at www.8kmiles.com