2. Igneous Systems - Introduction
• Team DNA
• Started October 2013
• Based in Seattle
• 12 patents to date
• In production today
3. Who wants to be (or rely on) this guy?
SETUP, PROVISION & SCALE
MANAGE & MONITOR
FAILURES
TROUBLESHOOT & UPGRADE
Client/App
4. What did we want to do
• Provide IAAS on prem
• “I” means: stuff people need, but hate dealing with
• “AAS” means: customer doesn’t deal with it
• Means: we don’t want to touch it if we can help it
5. Single Server
Problems
- I/O bottlenecks
- Zero Server level protection
True Cloud for Local Data
SATA
Traditional Concepts
Features
- Drive failure? No big deal
- Simple for small scale
/mnt/fileserver
Tech
- Direct Attached Storage
- RAID/Parity
6. Dual Head
SAS
True Cloud for Local Data
Traditional Concepts
Problems
- I/O bottlenecks
- Scales only so far…
- Mountpoint sprawl
Features
- Drive failure? No big deal
- Single server protection
- Simple to scale capacity
/mnt/fileserver
/mnt/fileserver2 /mnt/fileserver3
/mnt/fileserver4
/mnt/fileserver5
/mnt/fileserver6 /mnt/fileserver7
/mnt/fileserver..100
Tech
- SAS & FC
- RAID/Parity/NVRAM
7. Traditional Concepts
Scale out/Clustered
Infiniband
Problems
- One node part fail = node fail
- Node rebuild is painful (weeks)
Features
- Distributed protection & perf
- Drive failure? Who cares
- Protection against node failure
- Scale big on perf & capacity
- One mountpoint
/mnt/fileserver
Tech
- Node Based (CPU/RAM/DISK/NIC/Ctrlr)
- InfiniBand Interconnect
- Erasure Coding
8. What makes a server/node?
CPU RAM NIC Controller OS
Not redundant
No Hot Swap
Disks
Hot Swap
Highly redundant
14. Hot Swap Power supplies
Hot Swap Ethernet Switches (IOMs)
Chassis == Rack
15. Igneous HW Architecture
–Each 4U Chassis contains
▪ 60 NanoServers, each attached to a 6TB SATA drive
▪ 212TB of Usable Capacity per Chassis
–Start with one, scale out from there
▪ Add Chassis non-disruptively
Basic Architecture
16. Top & Bottom half
• 1:1:1 – CPU : Disk : I/O
• No Impact Fail
• Erasure Encoding with Distributed Repair
• Small Fault Domain
• Rack in a Box Redundancy
Stateful services
X86 compute
ARM Nano-servers
Stateless Services
18. Data Path Architecture
Object Layer
Disk Layer
Resilient Store
Access control
Namespaces
Metadata
Layout decisions
Erasure coding
Repair/rebalance
Data placement
Device health
19. Distributed Erasure Encoding Prioritized Repair
- 65% space efficiency (20+8)
- Prioritized repair
D D D D L D D D D L D D D D L D D D D L D D D D L G G G
D = Data block
L = Local parity
G = Global parity
20. Protocol Services
Journal
Indexing Services
Resilient Stores
Disk Layer
Ingest Services
Extensible Data Path
Zero-Touch Infrastructure
Workflow Hooks
Deep Inspection
Indexing
Extensible Data Engine
Primary Data Path
22. That’s all folks..
But btw…
We are hiring @ igneous.io/culture-and-careers
@andypern
@IgneousIO
andy@igneous.io
Editor's Notes
Igneous is an IAAS startup based in Seattle, the cloud capital.
Our team hails from a mix of infrastructure and cloud companies. Many of our core engineers were instrumental in building out the filesystem layers at Isilon and NetApp, which helps as we build out a new storage system.
The life of a sysadmin is thankless, especially if they’re managing storage. Users and business units want agility, but existing storage architectures take time and planning to deploy. The result is that the IT girl or guy is left to juggle priorities, and often has to tell users ‘no, can’t do that’ or ‘you’ll have to wait XX weeks or months for that’.
Simple…
In the beginning…
The dawn of the filer. NetApp made this an entire industry. It worked quite well for a while, but that was before unstructured data exploded.
Not being able to scale beyond a certain capacity or performance point meant that IT departments had to spin up more and more filers. Doing so not only introduced management headaches, but also forced users and applications to spread their data across a dizzying array of systems and silos.
The scale-out or clustered NAS architecture was a major departure from the controller-based architecture. Simply put, collapsing all layers of storage into a node, and making that node a member of a peer-to-peer, scalable cluster, brought huge advantages. Both performance and capacity could be scaled, all without creating additional mountpoints or points of management. Drive failures could be recovered from quickly, and flexible erasure coding allowed for much better protection of data.
However, as customers demanded better and better density and cost savings, vendors such as Isilon were forced to pack more and more high-capacity disks into their nodes. With the largest nodes containing as many as 60 drives, and drive sizes reaching 10TB each, you can now have nodes holding hundreds of TB of data. Losing such a large node is a very stressful event, and rebuilding such a densely packed node takes so long that most vendors will instead send an empty chassis and perform surgery (known as disk tango) to keep the customer up and running.
There had to be a better way….
So in order to solve the problems of the past, we had to go back to first principles and come up with a new way of looking at failure, and at how systems are built. First off, nodes have CPU, RAM, networking, some disks, controllers, and an OS. However, not all components are created equal. Consider a storage server’s hard drives: you expect them to fail, largely because they are the components most likely to have moving parts. Therefore, vendors have always ensured that both the software and the hardware surrounding drives are resilient to failure.
On the flip side, CPU, memory and other components are not put into a server with the intention of hot-swapping, yet if even ONE CPU or DIMM has issues, it can take the entire system down or make it unstable (DIMMs and CPUs can cause kernel panics, etc). When one of those components fails, it takes all of the storage in that server offline as well. Not so redundant, eh?
To tackle this problem of large failure domains, we thought: let’s shrink the fault domain down as far as we can possibly go. We started with a design in which we assembled a PCB with CPU, memory, boot flash and connectors. We can attach the SATA connector to any standard SATA drive, be it spinning or flash/SSD. We can re-use the SAS connection points to run Ethernet TX/RX signaling, and voila, we have a server that can attach to every single drive!
A standard chassis contains 60 of these nano-servers, and each of them has 2 gigabit connections, one to each of two internal switches.
Let’s compare one of our chassis with a typical datacenter rack. Both are made of metal, and both have slots in which to slide servers (or nano-servers). Both have power supplies (in our case, they can be hot-swapped with minimal effort), and both typically have some top-of-rack (TOR) switching. In our case those switches are hot-swappable as well.
Racks are also NOT SMART. That means they don’t run their own software, and neither does our chassis! It’s all running on the nano-servers. Why is this good? It means that we’ve truly shrunk the fault domain down to a single drive, and you’ll only lose access to an entire chassis if you lose 100% of power or networking to your rack.
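To put rough numbers on the fault-domain comparison, here is a back-of-the-envelope sketch in Python. It uses only figures mentioned in these slides and notes (a dense scale-out node with 60 drives of 10TB each, versus a nano-server attached to a single 6TB drive). The function name is illustrative, and the result is raw capacity; actual exposure depends on how full the drives are.

```python
def fault_domain_tb(drives_per_server: int, drive_tb: float) -> float:
    """Raw capacity that goes offline when one server (CPU/RAM/NIC/OS) fails."""
    return drives_per_server * drive_tb

# Dense scale-out node, using the "60 drives, 10TB each" figures from the notes.
dense_node = fault_domain_tb(drives_per_server=60, drive_tb=10)

# Nano-server: one CPU per drive, 6TB drive as on the hardware slide.
nano_server = fault_domain_tb(drives_per_server=1, drive_tb=6)

print(f"dense node:  {dense_node:.0f} TB offline per server failure")
print(f"nano-server: {nano_server:.0f} TB offline per server failure")
print(f"ratio:       {dense_node / nano_server:.0f}x")
```

Under these assumptions a single server failure takes 600 TB offline in the dense-node case and 6 TB in the nano-server case, a 100x difference in the amount of data that has to be repaired or waited on.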
Here’s a recap of how we put these together. Keep in mind that we can scale these chassis out. Unlike most clustered systems, you can start with just one and still have great resiliency!
One more thing: the nano-servers provide a great place to run all the mission-critical and stateful components of a platform, which is what we’ll call the bottom half. In order to have a flexible, scalable top half, we ship 2 or more x86-based servers which run all the stateless services. They handle protocol requests, aggregate log data and metrics, and run additional applications, but they do NOT store data or even configuration data. All that stuff lives behind the safety of our nano-server army.
So how does this manifest itself in a customer data center? First, we land our systems on the ground, and allow the customer to access them using the S3 API, the de facto standard for object storage. Many tools, SDKs and 3rd party applications now support S3 natively, including backup & archive, as well as lots of next-generation tools such as Spark, Hadoop and others.
The secret to making sure that customers are able to use this system without managing it lies in our separation of the ‘data plane’ from the ‘control plane’. We keep the data on the ground, in our protected nano-server farm. Our x86 servers collect all the metrics and log information from these systems and push them up to our control plane in the cloud, over SSL on port 443 (outbound only). This is where we are able to monitor, deal with failure, and push system updates. We’ve built this platform to be rock solid and available through all of this. In fact, we can and do push updates to our production customer systems once per week, which is unheard of in the storage industry! Our nano-server technology lets us perform rolling, non-disruptive upgrades across our entire fleet, all without any customer interaction or maintenance windows.
The key point to take away here is that we’ve created an optimal way to protect data. We start with ‘your typical’ erasure coding scheme, which is a 20+8 stripe, represented by the data blocks and local parity blocks. However, we take it a step further and create 4 GLOBAL parity blocks, which can be selectively used to repair any failed block. This gives us a great mean-time-to-data-loss metric, far better than we could get with local parity alone.
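As a rough illustration of why local parity makes single-block repair cheap, here is a minimal Python sketch. It follows the layout drawn on slide 19 (five groups of four data blocks, each with one local parity block, plus three global parity blocks). It assumes plain XOR for the local parity, which the source does not confirm, and it does not compute global parity at all; real systems typically use Reed-Solomon-style codes for that. All names are hypothetical.

```python
from functools import reduce

GROUPS = 5           # local groups per stripe, as drawn on slide 19
DATA_PER_GROUP = 4   # data blocks per local group
GLOBAL_PARITY = 3    # global parity blocks drawn on slide 19 (not computed here)

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode_group(data_blocks):
    """Local parity for one group (assumption: simple XOR)."""
    return xor_blocks(data_blocks)

def repair_in_group(surviving_blocks, local_parity):
    """Rebuild the single missing data block of a group from its survivors."""
    return xor_blocks(surviving_blocks + [local_parity])

# Build one stripe: 5 groups x 4 data blocks of 8 bytes each.
stripe = [[bytes([g * DATA_PER_GROUP + i] * 8) for i in range(DATA_PER_GROUP)]
          for g in range(GROUPS)]
local_parity = [encode_group(group) for group in stripe]

# Lose block 2 of group 3, then repair it using only that group's blocks.
lost = stripe[3][2]
survivors = [b for i, b in enumerate(stripe[3]) if i != 2]
rebuilt = repair_in_group(survivors, local_parity[3])
assert rebuilt == lost

total_blocks = GROUPS * (DATA_PER_GROUP + 1) + GLOBAL_PARITY
print(f"blocks per stripe: {total_blocks}")
print(f"blocks read for a local repair: {len(survivors) + 1}")
print(f"data fraction of stripe: {GROUPS * DATA_PER_GROUP / total_blocks:.1%}")
```

In this layout a single lost block is rebuilt by reading 4 blocks from its own group rather than 20 from the whole stripe, which is what makes prioritized, distributed repair practical. The raw data fraction here is 20/28, about 71%; the 65% figure on the slide is lower, and the source does not say what additional overhead it accounts for.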
By asynchronously sending metadata updates from the journal out to our extensible data path, we’ve removed that work from the I/O path, which ensures that it doesn’t impact performance.
Internally, we use the extensible data path, or event stream, to do things like:
Metadata indexing
Asynchronous, continuous replication (to another Igneous system or to the public cloud)
Auditing
Customers can also use this event stream to build their own applications, and we’ve built plugins to allow it to work with message systems such as Kafka.
Basically, every object put or delete results in an event, which also gives us an infinite, time-series log of everything that has happened.
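The pattern described here can be sketched in a few lines of Python: every put or delete appends an event to an append-only journal and hands it to consumers through a queue, so the writer never waits on indexing. This is only an illustration of the idea; the class and function names are hypothetical and are not the Igneous API.

```python
import queue
import threading
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    seq: int          # position in the journal
    op: str           # "put" or "delete"
    key: str          # object key
    timestamp: float  # wall-clock time of the operation

class Journal:
    """Append-only event log; writers never wait on consumers."""
    def __init__(self):
        self.log = []
        self.outbox = queue.Queue()  # unbounded, so put() does not block
        self._lock = threading.Lock()

    def record(self, op, key):
        with self._lock:
            event = Event(len(self.log), op, key, time.time())
            self.log.append(event)
        self.outbox.put(event)
        return event

def indexer(journal, index):
    """Consumer running off the I/O path: maps live keys to their last put."""
    while True:
        event = journal.outbox.get()
        if event is None:  # sentinel: stop consuming
            break
        if event.op == "put":
            index[event.key] = event.seq
        else:
            index.pop(event.key, None)

journal = Journal()
index = {}
worker = threading.Thread(target=indexer, args=(journal, index))
worker.start()

journal.record("put", "photos/a.jpg")
journal.record("put", "photos/b.jpg")
journal.record("delete", "photos/a.jpg")

journal.outbox.put(None)  # shut the consumer down
worker.join()

print(index)
print([(e.seq, e.op, e.key) for e in journal.log])
```

The index ends up reflecting only the live object, while the journal keeps all three events in order. Replication or auditing consumers could read the same stream in the same way.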
Today, we’ve built out our content store, as well as our event stream. As we move forward, we’re extending our services to allow customers to run more and more applications, including full-blown Docker containers.