As a disclaimer, we are a Nimble Storage partner, and hence most of our real-world experience is with Nimble and not Nutanix. Additionally, our company has been hesitant to fully buy into much of the buzz around hyper-convergence, and this article may reflect some of that hesitation. However, at our core we are technology enthusiasts, and we stay in business by continuing to recommend the best solutions for our customers. In this article we strove for an unbiased vantage point and a complete grasp of the key differences between these architectures. As such, we welcome feedback, technical corrections and differences of opinion.
I have wanted to write an article comparing the design, density and costs of hyper-convergence against SAN/blade designs for a while. While doing some research on Nutanix performance in general, I came across some material that presented me with the opportunity to make the comparison. Specifically, I am referring to the Nutanix design published by Michael Webster that supports 1.4 million Exchange 2013 mailboxes http://longwhiteclouds.com/2014/09/28/exchange-best-practices-and-1-4million-mailboxes-on-nutanix-nx-8150/ and comparing it to a Nimble design that supports 100,000 Exchange 2013 mailboxes https://www.nimblestorage.com/blog/technology/new-nimble-array-runs-100000-microsoft-exchange-mailboxes-in-a-single-system/. Although these articles were published a year ago, I believe the comparison is still relevant.
Keep in mind that this is a thought experiment, one that leverages the published data from each vendor to try to make some comparisons and draw conclusions. It is not endorsed by either vendor, and it isn't perfect, as it requires some assumptions. However, even though it is only a thought experiment, I believe the conclusions you can draw (if you can get to the end) are insightful and will help a technical reader gain a much better understanding of each system.
The second point to bring up is that this is not a performance comparison between Nutanix and Nimble Storage, as we don't have latency and IOPS numbers. Performance is simply deemed to be in acceptable ranges for the workload due to sufficient Jetstress results. I wanted to name this article "Performance Review: Nutanix vs Nimble" but that simply wouldn't be accurate. In addition, Nutanix has made some interesting (and, in my book, accurate) claims that traditional methods for judging performance don't translate directly into the hyper-convergence world. The traditional approach of using IOMeter with various block sizes mixed with different read/write ratios is not really appropriate unless you have a single VM that will actually use that much performance.
I will assume that you have read the actual articles from Nutanix and Nimble, so don't expect me to re-explain what is already in them. The summary table is included at the top of the post to give an overview of our findings, but don't let it keep you from the technical meat of the article.
Items | Nutanix | Nimble with HP Blades | Advantage/Notes |
CPU | 576 CPUs | 560 CPUs | Tie |
SATA Disks (per Datacenter) | 2880 SATA disks | 168 SATA disks | Advantage Nimble – assuming both provide an equal user experience |
SSD Disks (per Datacenter) | 576 SSD disks | 56 SSD disks | Advantage Nimble – assuming both provide an equal user experience |
Networking | 22 switches | 7 switches | Major advantage to Nimble and Blades |
Racks (per Datacenter) | 10 racks per datacenter | 7 racks per datacenter | Advantage Nimble, but it is not clear whether Nutanix would switch to their higher-density models if density were a requirement. Additional food for thought: the Nimble/Blade solution could use fewer racks if you were willing to fit 4 blade chassis in a single rack. |
Rack Units (total) | 616 | 446 | See above |
Power | ??? | ??? | I don't know how to get a handle on real-world power usage, and simply comparing power supply wattage isn't a good way to do it. Logic tells me they would be about the same because the CPU counts are about the same, but I'm not entirely sure that is an accurate way to view the differences. |
Cost | Won't publish | Won't publish | Advantage Nimble and Blades – I won't publish pricing, but I have seen pricing from both vendors (although not for the 8150 node used in this design and not at the scale of this project). The pricing I saw was aggressive from both vendors, yet the Nimble and blade solution was materially less expensive than Nutanix. That was about a year ago, and of course pricing changes. |
Labor (Time) | ??? | ??? | It is not clear to me that Nutanix requires measurably less labor, although that's a claim Nutanix makes with its "invisible infrastructure" marketing. |
Labor (Skill Set) | ??? | ??? | It is not clear to me that one solution has an edge over the other here. Lots of theoretical arguments on each side, but nothing measurable. |
Centralized Management | Prism claims one pane of glass | No centralized management | Advantage Nutanix – I can't verify how well Prism works because I don't have much experience with it, but the absence of a cohesive solution for Blades and Nimble gives Nutanix the advantage. |
VMware Cluster Design | Nutanix has an elegant design | Nimble/Blades isn't as elegant | Nutanix has the edge, though it's not clear how much extra value it provides. |
LUN Management | Nutanix has an elegant design | Nimble/Blades isn't as elegant | Nutanix has the edge. Determining a strategy for which array to create LUNs on and which hosts to present those LUNs to is not trivial. VVols is a technology that could help solve this in the future. |
Simplicity | Nutanix claims much simpler total management | Nimble claims simple SAN management | No clear winner – Nutanix claims to be simpler, but I'm not fully convinced. The Nimble/blade model isn't that complex, and even if it doesn't sound as elegant as Nutanix, does that make it that much more complex? |
Not Fair:
There are some things that aren't fair, i.e. there are aspects that keep this from being an apples-to-apples comparison, and I want to address them before I get a call from Nimble or Nutanix. Despite these issues, I still believe it is worth our time to proceed with the comparison, especially because we are not looking at actual performance (IOPS and latency) and assume both solutions provide an acceptable performance level for the workload.
- Nutanix was responding to an RFI, which means it designed the system to stand up to the real world, not just Jetstress. As such, it may not be completely fair to compare it to Nimble, who was just trying to show off its mailbox-per-disk density via a synthetic test.
- Nimble Storage's real goal in its paper was to demonstrate SAN performance; it wasn't really trying to optimize the server side of things, because the wow factor for them is SAN performance in a small footprint. Because of this, we aren't sure whether Nimble could use fewer than the 20 servers from its paper.
- Nutanix has other models outside the NX-8150 node (which was used in the Nutanix paper). These other nodes achieve higher densities. It's not clear whether Nutanix would prefer other models for the sake of the comparison, although it would be strange to use the 8150 in the RFI yet prefer another model for comparison.
- 1.4 million Exchange mailboxes is not exactly a normal workload. Just because Nimble or Nutanix looks better for this bizarre workload doesn't mean the advantage carries over to all workloads.
Assumptions:
- Similar to the Nutanix design, we will add Exchange lagged database copies (LAGs) to the Nimble design and have them reside at the secondary datacenter with the DAG. The Nimble paper did not call for this, but we believe it is an appropriate assumption that the hardware can support it.
- These articles were written at similar points in time in 2014, so even though we know CPU cores and SSD drive sizes have increased, we won't give either vendor the benefit of Moore's law for the purposes of the comparison. Obviously, one year later the new Intel CPUs have come out and SSD drives have gotten larger, which could impact things.
- Nimble did not try to design a system to meet 1.4 million mailboxes; instead, we took the 100,000-mailbox design and replicated it 14 times. So although it is fairly certain that this would work, it's not clear that it is optimal. If Nimble designed specifically for 1.4 million mailboxes, they might use a different design. However, we can assume that replicating 14 times provides a worst-case solution and hence a valid basis for comparison.
Nutanix Design:
Nutanix spells out its design pretty well so no need to repeat it.
Nimble and Blade Design:
To scale the Nimble design to 1.4 million mailboxes and enable it to be comparable to what Nutanix proposed in the RFI we have to specify the design a little more than Nimble’s paper does.
We will use 4 all-SAN racks per site. Each all-SAN rack will have 4 x CS700, 4 x all-flash shelves and 4 x capacity shelves, except the 4th rack, which will have 2 of each. That is a total of 14 Nimble systems. These could either be managed as 14 autonomous systems, or you could cluster the systems in each rack together into one logically managed unit (scale-out in Nimble speak), but you can't manage all 14 Nimbles as one.
For the compute side of this design we could use either HP blades or SuperMicro Twin servers. Since I know HP blades best, I will use those. We are going to use 10 active servers and 10 passive servers per 100,000 mailboxes as the scaling factor, i.e. 20 servers per 100,000 mailboxes (this was specified in the Nimble paper). The main datacenter will have the active servers and the secondary datacenter will have the passive ones (DAG and LAG). To achieve the compute required for 1.4 million mailboxes, the blade solution requires that the main datacenter and secondary datacenter each have 3 blade racks. Each blade rack will have 3 blade systems. Each blade system will have 16 blades, except one blade system on the last rack, which will have 12 blades. NOTE: You can only link up to 7 blade chassis into a single logical unit for management, so you would not be able to manage all blade systems as one.

Each blade system would have two IO modules leveraging Virtual Connect technology. Each IO module would have two LACP groups (Shared Uplink Sets, or SUS groups, in HP Virtual Connect speak): one SUS group for storage and one SUS group for VM networks, vMotion and management. If you want more specifics about the design, you can read HP's best practices with blades and Virtual Connect -> http://h10032.www1.hp.com/ctg/Manual/c0261617. In terms of networking, you would need 3 x 10GbE for the storage SUS group and 3 x 10GbE for the VMnet SUS group (Link Aggregation Group) per IO module, for a total of 12 x 10GbE ports per blade system. In addition, we need 4 x 10GbE ports per Nimble system.
Calculating the total networking requirements per datacenter: we need 9 blade chassis x 12 10GbE ports per blade system = 108 10GbE ports, plus 14 Nimble systems x 4 10GbE ports per system = 56 10GbE ports, for a grand total of 164 10GbE ports. To deliver this number of 10GbE ports, I am partial to Juniper's Virtual Chassis technology. It provides a high-speed backplane via something that resembles stacking, as well as LACP across switches without MC-LAG. We can use 6 x Juniper EX4600-40F switches, which provide 144 fixed 10GbE SFP+ ports. In addition, each switch has 12 x 40GbE QSFP+ ports. I am not going to diagram the spine-and-leaf design, but basically you can see that the fixed ports provide almost enough 10GbE capacity, and a few 10GbE modules can provide the rest. For the uplinks, each of the 5 leaves can have 120Gb/s of uplink to each of the spines.

The network switches can be placed in any of the racks; however, the design that comes to mind is that each blade rack has a single leaf switch servicing it, and one of the SAN racks has 1 leaf switch and 2 spine switches. Each blade system will connect to one leaf switch in its rack as well as 1 leaf switch in another rack. We will also use some of the spine switches' 10GbE ports. The blade servers themselves will be diskless and boot VMware from SD cards.
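To make the arithmetic easy to check, here is a minimal Python sketch that recomputes the server count and the per-datacenter 10GbE port requirement from the numbers above (the figures come straight from this design, not from either vendor's paper):

```python
# Sanity-check the Nimble/blade design math from the numbers above.

MAILBOXES = 1_400_000
SERVERS_PER_100K = 20          # 10 active + 10 passive (from the Nimble paper)
servers_total = MAILBOXES // 100_000 * SERVERS_PER_100K  # 280 (140 per datacenter)

BLADE_CHASSIS_PER_DC = 9       # 3 blade racks x 3 chassis; 8 x 16 blades + 1 x 12 = 140
PORTS_PER_CHASSIS = 12         # 2 IO modules x (3 storage + 3 VMnet) 10GbE uplinks
NIMBLE_ARRAYS_PER_DC = 14      # 4 all-SAN racks: 4 + 4 + 4 + 2 arrays
PORTS_PER_ARRAY = 4

chassis_ports = BLADE_CHASSIS_PER_DC * PORTS_PER_CHASSIS  # 108
array_ports = NIMBLE_ARRAYS_PER_DC * PORTS_PER_ARRAY      # 56
total_ports = chassis_ports + array_ports                 # 164

print(f"servers per datacenter: {servers_total // 2}")    # 140
print(f"10GbE ports per datacenter: {total_ports}")       # 164
print(f"fixed SFP+ ports on 6 x EX4600-40F: {6 * 24}")    # 144; modules cover the rest
```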
So in summary, at the primary datacenter we have:
4 x SAN racks with a total of 14 x CS700 Nimble units, each with an all-flash shelf and a capacity shelf, occupying 126 RU.
3 x blade racks with a total of 9 x C7000 chassis containing 140 blade servers, occupying 90 RU.
7 x EX4600-40F switches mixed into the 7 racks, occupying 7 RU.
We will simply double this for the secondary datacenter.
Component Differences: I thought it would be good to compare the relevant components for each system.
CPU and Memory: Nutanix has 144 nodes at the primary datacenter and Nimble has 140. The 4 extra nodes required by Nutanix are probably more a rounding error than any meaningful difference.
Network: Nutanix requires 20 switches and Nimble/blades requires 7 switches. The switches appear to be of comparable quality (and cost) and leverage a similar leaf-and-spine design. The lower switch count for the Nimble/blades solution provides a big advantage in total networking cost. I think the Virtual Connect technology used in the blades actually keeps the blade design simple, but I have heard from others who don't feel this way. Accordingly, Nutanix's ability to plug each block directly into the network, although it requires more ports, might be considered simpler by some.
SATA Hard Drives: Nutanix provides 2880 SATA disks per datacenter = 20 disks per node x 144 nodes. Nimble provides 168 SATA disks = 12 SATA disks per unit x 14 units. This means Nutanix has roughly 17 times more SATA disks.
SSD Hard Drives: Nutanix provides 576 SSD disks = 4 SSD disks per node x 144 nodes. Nimble provides 56 SSD disks = 4 SSDs per unit x 14 units. This means Nutanix has roughly 10 times more SSDs.
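These summary-table counts fall straight out of the per-node and per-array figures. A quick sketch, assuming two CPU sockets per server/node (which is what the 576 and 560 CPU totals across both datacenters imply):

```python
# Reproduce the component counts above from the per-node/per-array figures.
# Assumes two CPU sockets per server/node, matching the summary-table totals.
NUTANIX_NODES_PER_DC = 144
NIMBLE_SERVERS_PER_DC = 140
NIMBLE_ARRAYS_PER_DC = 14
DATACENTERS = 2

nutanix_cpus = NUTANIX_NODES_PER_DC * 2 * DATACENTERS  # 576
nimble_cpus = NIMBLE_SERVERS_PER_DC * 2 * DATACENTERS  # 560

nutanix_sata = NUTANIX_NODES_PER_DC * 20  # 2880 per datacenter
nimble_sata = NIMBLE_ARRAYS_PER_DC * 12   # 168 per datacenter
nutanix_ssd = NUTANIX_NODES_PER_DC * 4    # 576 per datacenter
nimble_ssd = NIMBLE_ARRAYS_PER_DC * 4     # 56 per datacenter

print(f"CPUs: {nutanix_cpus} vs {nimble_cpus}")
print(f"SATA ratio: {nutanix_sata / nimble_sata:.1f}x")  # ~17.1x
print(f"SSD ratio:  {nutanix_ssd / nimble_ssd:.1f}x")    # ~10.3x
```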
Power Usage Differences:
I am not the person to calculate or explain real-world differences in power draw between the two solutions. It would seem to me that since CPU dominates power usage, and both solutions are running approximately the same workload on the same CPUs, actual power usage would be similar. The power for the additional Nimble units, weighed against the power to run the extra hard disks in the Nutanix servers, is probably small next to the 576 CPUs running across the two datacenters. If someone wants to help with this comparison, especially if they can demonstrate a material difference for either solution, I would be happy to have the help and update this aspect of the comparison.
COGS Observation:
The quantity of disk components is ~15 times higher for Nutanix than for Nimble. But what is the cost impact of that on the entire hardware system for the manufacturer (Nutanix)? To answer this question, I went to ThinkMate to quote a SuperMicro system. I was able to get a 2U unit with 2 x 3.0GHz ten-core processors and 256GB RAM with NO DISKS for ~$10,000. To represent Nutanix's cost increase from the additional SSD and SATA, I added 3 x SSDs @ $700 and 8 x 1TB SATA disks @ $250 to the ThinkMate quote, which brings the system price to ~$14,000. This means Nutanix has a COGS disadvantage of ~40% ($4,000/$10,000) compared to the Nimble/blade solution, based on the additional hard drive requirements. This is Nutanix's cost, not your cost; your cost is made up of Nutanix's hardware COGS plus margin plus whatever they charge for their software. Nutanix probably gets a discount compared to street pricing, but my experience with SuperMicro is that the reseller discount isn't that large compared to street price. An additional point is that HP usually charges a premium for its hardware, which reduces the actual margin difference.
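For clarity, here is the back-of-the-napkin COGS math as a sketch; the prices are the rough ThinkMate street prices quoted above, not actual BOM costs for either vendor:

```python
# Rough per-node COGS delta implied by the ThinkMate quote above.
base_node = 10_000    # 2U, 2 x 3.0GHz ten-core CPUs, 256GB RAM, no disks
extra_ssd = 3 * 700   # SSDs added to approximate a Nutanix-style node
extra_sata = 8 * 250  # 1TB SATA disks added
loaded_node = base_node + extra_ssd + extra_sata  # ~$14,100

premium = (loaded_node - base_node) / base_node
print(f"disk premium per node: ${loaded_node - base_node:,} (~{premium:.0%})")
# ~$4,100 per node, i.e. roughly the ~40% COGS disadvantage cited above.
```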
Keep in mind this COGS model isn't necessarily a fair way of looking at it, because the cost to the customer for the Nutanix solution is more than just hardware COGS plus margin; there is also the software Nutanix developed. However, the main point here is that somebody has to pay for the fact that the hyper-convergence solution uses MATERIALLY more components, and right now it is the VCs (venture capitalists) who believe this Hyper-C model WILL provide enough value to disrupt the technology landscape and sell sufficient units to make up the difference. This concept (hardware COGS) is also why Nutanix is trying to decide whether it is a software company or a hardware company. At this point Nutanix is playing both fields: if the value of the Nutanix solution is integrated management and buying compute and storage from one vendor, they are a hardware company; but if the market shifts to valuing just their software, then they are a software company. C-levels want one throat to choke, geeks want to pick the hardware they are most comfortable with, and Nutanix wants the option to sell to both buyers until its value in the market crystallizes.
Centralized Management:
You can see from this example that the centralized management capabilities of HP Blades are being pushed beyond their limits. The same is true for Nimble Storage. vCenter is the glue that holds the management of the SAN and hardware together by linking each of the VMware clusters. Said another way, the single pane of glass for day-to-day management is achieved through vCenter; HOWEVER, a single pane of glass for storage and hardware for non-day-to-day tasks (creating new volumes on the SAN, expanding volumes, deleting volumes, iLO into physical servers) is not available with the HP and Nimble solution at this scale. The Nutanix solution claims a true single pane of glass through Prism, and I have no reason to believe they can't deliver on this promise. At this scale, is that single pane of glass a material advantage that offsets any additional acquisition and support cost for hyper-convergence?
VMware Cluster Design with Nimble/Blade Solution:
I haven't gotten into the details of the VMware cluster design, but the general concept is that each rack would have 2 VMware clusters with members from each blade chassis in each cluster. Depending on the exact resiliency requirements, you could create clusters across racks like Nutanix does. I didn't spend much time on this concept because it doesn't seem to add much value, and no matter how you slice it, it is kind of clunky at this scale. Nutanix, however, has an elegant model in their design, with clusters spanning racks, and I don't believe you can achieve a solution as elegant with blades.
Density:
The Nimble and blade solution takes up 446 rack units vs the Nutanix solution's 616 rack units. That gives the Nimble/blade solution an apparent edge; however, Nutanix has denser models than the NX-8000, so if density were a requirement it might be able to lower the rack unit count. The Nimble/blade solution could also be made denser by using SuperMicro Twin servers rather than blades, which would further reduce the rack units but increase the networking requirements to something closer to what Nutanix has. The main point is that density is important, but it's not clear whether Nimble/blades really has an advantage, and even if it does for this workload, it might not for others.
Labor:
Not having managed the physical infrastructure for a 1.4 million mailbox environment, this one is difficult to comment on. It seems logical that Nutanix would take slightly less labor, but enough to justify the price difference? My experience with Nimble and HP blades is that once you set them up, you don't spend much time on them; it is not as time-intensive as managing a legacy SAN environment. Personally, I'm not convinced that Nutanix could save a tremendous number of man-hours compared to Nimble and blades, but at the same time I'm not so naive as to think it is impossible. It's just hard to say.
An additional point Nutanix might make is that the infrastructure is a little easier to understand, which allows a lower-level admin to handle basic moves, adds and changes, but I think this is a myth. If anything, the blade model means you never touch physical networking, since blades slide out for things like memory and motherboard issues, whereas Nutanix requires you to handle physical networking like a traditional server, which leads me to believe there is a higher chance of human error. Ultimately, it is just not clear what kind of FTE savings exist, both in quantity of man-hours and in the skill set of the employee. Given the higher capital outlay over 5 years that I have seen with Hyper-C solutions, this labor cost savings will be the most important factor to prove, not just claim, or bottom-line-driven buyers won't bite. On a related note, one of my main hesitations preventing me from fully buying into Hyper-C solutions is that they tend to show TCO justifications using labor cost savings derived from comparisons to grandma's SAN solutions (EMC), not modern, easy-to-use SAN solutions.
Wish List:
I wish I could easily add other Hyper-C vendors like Atlantis or SimpliVity to the mix. I would also like to see how an all-flash array like Pure Storage stacks up.
Nutanix Model Issue:
I find it odd that Nutanix, who typically touts density as a value proposition, used the NX-8150 in their design, as it is the least dense of their systems. This is the reason it took Nutanix 20 racks vs the 14 for Nimble/blades. If you switched the Nutanix solution to the SuperMicro Twin that Nutanix uses in the NX-3000 series, or even the NX-6000 series, you would get much better densities. I think it is safe to assume that Nutanix used the NX-8000 in this design for one of three reasons: they needed the better CPU/memory offered by the NX-8000; they needed the extra SSD/SATA disks provided by the NX-8000 to handle the performance or capacity challenges inherent in 1.4 million mailboxes; or they needed to hit a lower price point for the RFI (I'm not sure whether the NX-8000 hits a lower price point for these requirements than designing with the NX-3000 series). This brings up a major issue with Hyper-C in general: optimizing the compute layer and SAN layer jointly (by picking the right-size blocks) rather than independently (optimizing the SAN design for storage and the blades for compute) can add unneeded cost.
Scale Up/Down:
Nutanix claims to have an edge when it comes to simplicity as you scale. Adding a node is as simple as plugging 4 x 10GbE ports into the network. In addition, the single pane of glass offered by Prism likely provides some value, especially where the scale of the system has pushed past the limits of the tools for managing the SAN and blade hardware (i.e. Nimble can only manage 4 SANs together and HP can only manage 7 blade chassis together). My personal opinion is that scaling a Nimble SAN and HP blades is not very difficult, but it is certainly not as simple as scaling the Nutanix system via Prism. On the flip side, Nutanix is at a cost disadvantage when scaling compute or storage, because you are required to use a Nutanix building block that comes with both. In addition to the extra hardware, you also have to pay support each time you add compute with Nutanix, whereas traditionally support for compute is cheap.
Is the value of the simplicity that Nutanix claims is inherent in their solution enough to offset the higher acquisition and maintenance costs? Without truly understanding the labor burden differences and the exact acquisition cost differences, it is hard to say.
Also, I am sure HP provides some ridiculously complex software to manage the hardware and VMware as one, but my experience is that HP always misses the target when it comes to making their software simple, so I am purposely ignoring it as a possible single pane of glass solution.
Real World Questions:
I believe that if a Nimble/blade solution were really trying to hit 1.4 million mailboxes, the design would differ depending on whether the objective was density or cost. For example, could we use additional SSDs in the all-flash shelves and avoid another CS700 unit? The blades themselves could be specced with lower-cost, lower-GHz CPUs to optimize cost, or alternatively with higher-cost, higher-GHz CPUs to optimize density. Without specific design objectives and an understanding of rack space and power costs, it is difficult to say that this Nimble/blade design is optimal.
Asset Lifecycle Questions:
We typically amortize server costs over 4 years and SAN costs over 5. How does this difference in amortization work with Hyper-C, where storage and compute are combined? Should we amortize Nutanix over 4 years like a server, or over 5 like a SAN? Is it even a material difference across the entire solution once you factor in other costs like labor and VMware/Microsoft licensing? These are questions that haven't really been addressed by any of the Hyper-C players, and they are too new to have any real data.
Getting Lucky Three Times:
This is an interesting concept I believe is worth sharing, although it is not directly related to the comparison. The idea is that Hyper-C works well only if you get lucky three times with Moore's law: Hyper-C needs advancements in storage performance, storage capacity and compute that roughly match each other. This allows the vendor to build blocks (nodes) that match real-world needs. Prior to flash, Hyper-C could not have existed, because even though storage capacity and compute were following Moore's law, storage performance wasn't. So Hyper-C depends on SSD capacity, SATA capacity and CPU cores making similar advancements year over year.
Automation:
It is difficult to comment on this because I don't know the exact requirements, but it is hard to imagine managing 1.4 million mailboxes without some sort of automation. I believe Nutanix can use both PowerShell and HTTP REST functions for automation, which is a step up; VMware and Nimble would use PowerShell only (mostly). The Nimble/blade solution may also require a little extra code, since ensuring storage creation happens on the right Nimble SAN and attaching datastores to the right hosts involves slightly more complex logic. VMware also has an orchestration tool called vRealize that could be used with either solution for infrastructure automation, and if used it may make the differences I brought up moot.
If I were coding the automation, my preference would be Nutanix, because I imagine a unified command set that can be leveraged by both PowerShell and other languages that easily support HTTP REST, like Python. It would also allow me to ignore some of the complexity of creating the right LUN on the right SAN and mapping that LUN to the right hosts. However, this difference might be negligible depending on the exact requirements, and once the automation is coded it isn't a problem anymore. An additional consideration is that infrastructure automation is not as important as application automation, which would be done through the Exchange PowerShell API; automation at that layer has little to do with the underlying infrastructure automation.
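To make that "extra code" concrete, below is a minimal Python sketch of the placement logic the Nimble/blade path would need: pick which of the 14 arrays should get a new volume and which hosts it should be presented to. The array inventory and the provisioning calls are hypothetical placeholders, not the actual Nimble or Nutanix APIs.

```python
# Hypothetical sketch of the volume-placement logic the 14-array Nimble
# design would need. The inventory and provisioning calls below are
# illustrative placeholders, NOT real vendor APIs.
from dataclasses import dataclass

@dataclass
class Array:
    name: str
    rack: str
    free_tb: float
    hosts: list[str]  # blade hosts connected to this array

def pick_array(arrays: list[Array], size_tb: float) -> Array:
    """Place the new volume on the connected array with the most free space."""
    candidates = [a for a in arrays if a.free_tb >= size_tb]
    if not candidates:
        raise RuntimeError("no array has enough free capacity")
    return max(candidates, key=lambda a: a.free_tb)

def provision(arrays: list[Array], vol_name: str, size_tb: float) -> None:
    target = pick_array(arrays, size_tb)
    # In real automation these prints would be the vendor's PowerShell or
    # REST calls; the host mapping must match whichever array was chosen.
    print(f"create {vol_name} ({size_tb}TB) on {target.name} in {target.rack}")
    print(f"present to hosts: {', '.join(target.hosts)}")

arrays = [
    Array("nimble-r1-a", "san-rack-1", 42.0, ["blade-r1-01", "blade-r1-02"]),
    Array("nimble-r2-a", "san-rack-2", 18.5, ["blade-r2-01", "blade-r2-02"]),
]
provision(arrays, "exch-db-042", 4.0)
```

With Nutanix, the equivalent step would collapse to a single call against one cluster-wide API, which is exactly the unified-command-set advantage described above.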
To summarize: how important is a single unified infrastructure automation API, which Nutanix has and Nimble and HP blades don't? The answer is that it depends on the project. Is it a strong advantage for Nutanix? Probably not for most customers we work with, but it highlights the vision of Hyper-C and Nutanix in general.
Simplicity:
Nutanix sends a pretty consistent message that they are much simpler as a whole, which means less risk for the organization and potentially less labor cost. But reading the article http://www.storagereview.com/why_we_don_t_have_a_nutanix_nx8150_review doesn't make it sound simple. Nutanix sounds extremely picky about software versions, VMware versions and general configuration to maximize performance. Reading that article makes it sound like Nutanix is still a technology with technology challenges, not a silver bullet. Glancing through the Nutanix Bible, you can see a whole architecture that you need to learn to understand how Nutanix works behind the curtain, and it is not simple. So the real question is: can Nutanix abstract away the details of its inner workings, or do you still need someone on your team who understands these complexities? Personally, I'm not completely convinced it can abstract them away to the point that a customer doesn't need anyone on their team who cares about these details.
Solving Problems:
Do not underestimate the power of solving real-world customer problems. The fact that both solutions can scale to handle 1.4 million Exchange mailboxes is a feat in itself. Nutanix has happy customers because they solve problems for their customers, and the same is true of Nimble. Moving an organization forward from where it was is what technology advancement is all about, and sometimes the geeks comparing the nuts and bolts of a solution (like me) overlook this as the primary goal. Do not underestimate the power of a sales team that can find a customer stuck in the past with their technology and buying decisions. The first team that finds and educates that customer is typically the one that gets to put in its technology, and it's not always the cheapest one, nor always the one the geeks think is best! My point is that both solutions are currently out there solving customer problems and moving those customers further forward than their past solutions, and that's what is really most important.
Editorial:
I think the reason a number of startups in the storage industry have been successful (think Tegile, Nimble, Atlantis, PernixData, Tintri, Nutanix) is that the sales mentality of "if I can prove I can solve your problem for an agreed-upon price, will you buy?" has been simple to execute, because EMC and NetApp have been overcharging for years, and SSD technology paired with multicore advancements has allowed everyone, Hyper-C vendors and SAN startups alike, to deliver better price points and better performance than the old guard. This fact, combined with a new type of buyer (think SaaS companies and private cloud companies) whose organization treats IT as a profit center, creates opportunity where little existed before, as this type of buyer is much more likely to step outside the status quo if it drives revenue up or costs down. However, at some point the unhappy EMC or NetApp customer becomes more and more scarce. At that point, the technology vendor with a sustainable business model, cash in the bank and lots of customers will be the one that survives. Is Hyper-C and/or Nutanix a true technology advantage that will lead to a sustainable business model, or simply a cool idea riding the hype cycle?