Friday, October 31, 2008

LEARNING - Question of the WEEK

Question of the week:

One issue that has come up a couple of times is the ability of EMC and, I believe, Hitachi to identify and remove hot spots by dynamically migrating hot LUNs. How have you typically addressed this type of issue with EMC accounts?

Answer of the week:

Without wanting to sound too cavalier, this sounds like a red herring. I would like to take a few minutes to cover the basics, and then discuss the EMC LUN migration feature.

This question brings up the topic of “performance”. This topic is usually avoided by EMC, for good reason. If a customer were to ask me about this feature, I would respond by asking “does performance matter?” Since they brought up the topic, there is a very good chance they’ll say yes. With the door now open, the real question should be “What is the relative performance of the systems that we’re comparing?”

Not all storage arrays are the same. The basic architecture of the DS4000 products is very different from the EMC Clariion line, even though they may look the same on the outside. This architectural difference gives the DS4000 products a much higher performance curve than the EMC product line. Comparing SPC-1 (industry-standard, full-disclosure) benchmark results shows that the DS4700 delivers about 70% of the performance of a CX3-40 with only 41% of the drives (we used far fewer drives during the test). That works out to about 268 IOPS/drive for the DS4700, where the CX3-40 achieves only 161 IOPS/drive. Given those numbers, even if the automatic optimization techniques could achieve a 20% gain in performance (which would be amazing), the CX3-40 would still be about 70 IOPS/drive slower than the DS4700. So before you start chasing the red herring, make sure you understand the complete environment. From the empirical evidence (SPC benchmarks), I would say we have a clear advantage: your customer can either do the same work with much less investment (buy fewer drives), or get much better performance by purchasing the same number of drives.
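If you want to check that arithmetic yourself, here is a small Python sketch that uses only the figures quoted above; the 20% improvement is the hypothetical best case mentioned in the paragraph, not a measured result, and the small mismatch between the two ratios is just rounding in the quoted percentages.

```python
# Back-of-the-envelope check of the SPC-1 comparison quoted above.
# Only figures from the paragraph are used; nothing here is a new measurement.

ds4700_iops_per_drive = 268   # quoted DS4700 result, IOPS per drive
cx340_iops_per_drive = 161    # quoted CX3-40 result, IOPS per drive

# DS4700 delivered ~70% of the CX3-40's total IOPS with ~41% of its drives,
# which implies the per-drive ratio below (differences are rounding).
implied_ratio = 0.70 / 0.41
quoted_ratio = ds4700_iops_per_drive / cx340_iops_per_drive
print(f"implied per-drive ratio: {implied_ratio:.2f}, quoted: {quoted_ratio:.2f}")

# Even granting a hypothetical 20% gain from automatic hot-spot migration,
# the CX3-40 still trails by roughly 70 IOPS per drive.
cx340_optimized = cx340_iops_per_drive * 1.20
print(f"CX3-40 +20%: {cx340_optimized:.0f} IOPS/drive, "
      f"gap: {ds4700_iops_per_drive - cx340_optimized:.0f} IOPS/drive")
```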

The second topic is that of MetaLUNs. The IBM DS4000 (and the DS3000 and DS5000 as well) does not use MetaLUNs. Instead, our approach is to allocate contiguous blocks of storage for each logical drive and maintain that relationship. This means that we perform similar operations differently. Specifically:

Logical Drive Expansion - When a DS4000 logical drive is expanded, we will move other existing logical drives as needed to maintain the contiguous nature of the LUN being expanded. This may take a little longer, but it has the advantage of preserving the performance profile of each LUN. With MetaLUNs, you stitch together a bunch of little pieces. This introduces large seeks after expansion, which negatively impact performance. NOTE: To avoid this effect, you must manually perform a “LUN Migration” on the LUN that was just expanded. This has a negative impact on controller performance, and it requires extra capacity in a desirable location/RAID set.
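To make the seek argument concrete, here is an illustrative Python sketch; it is not DS4000 or Clariion code, and the extent sizes are made-up example numbers. It models a LUN as a list of (start LBA, block count) extents and shows how far apart the blocks of a stitched-together LUN can end up compared with a contiguous one.

```python
# Illustrative sketch only: model a LUN as (start_lba, block_count) extents
# and compare the physical span each layout forces the heads to cover.

def lba_span(extents):
    """Distance between the lowest and highest block the LUN touches."""
    starts = [s for s, _ in extents]
    ends = [s + n for s, n in extents]
    return max(ends) - min(starts)

# Contiguous expansion: the single extent simply grows in place.
contiguous = [(0, 400_000_000)]

# MetaLUN-style expansion: the original extent plus stitched-on pieces that
# happened to be free elsewhere on the back end (hypothetical locations).
stitched = [(0, 200_000_000),
            (900_000_000, 100_000_000),
            (1_700_000_000, 100_000_000)]

print("contiguous span:", lba_span(contiguous))   # span equals the LUN size
print("stitched span:  ", lba_span(stitched))     # span far exceeds it -> longer seeks
```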

RAID Set Expansion - When a DS4000 RAID set is expanded, the data is automatically restriped over all of the drives to maintain contiguous allocation. This also means the additional drives are used for existing data, giving better performance, as you would expect. This is not the case with the MetaLUN implementation: adding new drives to a RAID set only provides extra capacity. You must manually copy the existing LUNs around (sometimes several times) to take advantage of the new disks' performance.
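A quick bit of arithmetic shows why restriping matters; the drive counts and workload below are made-up example numbers, not benchmark results.

```python
# Illustrative arithmetic: per-drive load after adding drives to a RAID set.
# All numbers are hypothetical.

existing_drives = 8
added_drives = 4
workload_iops = 2000   # steady workload against the existing LUNs

# DS4000 behavior described above: data is restriped across all drives,
# so existing LUNs immediately benefit from the new spindles.
after_restripe = workload_iops / (existing_drives + added_drives)

# Without restriping, existing data stays on the original drives; the new
# drives contribute capacity only, until LUNs are manually copied around.
without_restripe = workload_iops / existing_drives

print(f"per-drive load after restripe:   {after_restripe:.0f} IOPS")
print(f"per-drive load without restripe: {without_restripe:.0f} IOPS")
```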

Human Error Recovery – Because we allocate contiguous blocks for each logical drive, the definition of that logical drive is simply its starting logical block address and its number of logical blocks. If someone accidentally deletes a logical drive definition, or a RAID array definition, or for that matter resets the entire storage subsystem, we can (usually) recover your data. We do this by turning off the background LUN binding and re-applying the storage subsystem's configuration. I have never heard of another storage vendor that can match this capability. With MetaLUNs, the pieces of a logical drive can be scattered all over the back end of the storage subsystem, which makes piecing a logical drive back together once it has been deleted nearly impossible. Although it may be possible to someday build this functionality into our competitors' systems, it will be much more difficult because of MetaLUNs.
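Here is a conceptual sketch of why that recovery is feasible when allocation is contiguous. Nothing in it is actual DS4000 firmware or a real recovery tool; the names and records are hypothetical. The point is simply that each logical drive is fully described by a tiny record, so a deleted definition can be re-applied without touching the data blocks themselves.

```python
# Conceptual sketch only: a contiguous logical drive is fully described by a
# small record, so re-applying a saved configuration can restore a deleted
# definition over data that is still sitting untouched on the disks.

from dataclasses import dataclass

@dataclass
class LogicalDriveDef:
    name: str
    array_id: int      # which RAID array the logical drive lives on
    start_lba: int     # first logical block of the contiguous region
    block_count: int   # length of the contiguous region

# A saved copy of the configuration (hypothetical example values).
saved_config = [
    LogicalDriveDef("db_lun", array_id=1, start_lba=0, block_count=100_000_000),
    LogicalDriveDef("log_lun", array_id=1, start_lba=100_000_000, block_count=20_000_000),
]

def reapply(config):
    # With background initialization/binding turned off, re-creating the same
    # definitions maps the LUNs back onto the untouched data blocks.
    for d in config:
        print(f"re-create {d.name}: array {d.array_id}, "
              f"LBA {d.start_lba}..{d.start_lba + d.block_count - 1}")

reapply(saved_config)
```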

Controller Overhead – With the implementation of MetaLUNs, the controllers are also tasked with the additional job of maintaining and managing MetaLUN addressing. This is overhead the DS4000 controllers don't have to waste cycles on.

Copying to a Different RAID Protection – One of the key performance gains that EMC talks about is migrating a LUN to a different RAID protection scheme. For their method, you'll need a large enough target area on a RAID set with the desired protection scheme, and that target RAID array had better not be too busy. Wow, good luck with that. With the DS4000, if you want to change the RAID protection, you change it in place on the existing drives. There is no need to impact the rest of your system; just make sure you have enough space (if going from RAID5 to RAID10) and perform the background task. I think our method is much simpler and less risky.
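The "make sure you have enough space" caveat is easy to check up front. The sketch below uses made-up drive counts, drive sizes, and data amounts, but it shows the basic RAID5-versus-RAID10 capacity math you would run before an in-place change.

```python
# Quick capacity check for an in-place RAID5 -> RAID10 change.
# Drive count, drive size, and the amount of data used are example numbers.

drives = 8
drive_gb = 300
used_gb = 1500                              # data currently on the RAID5 array

raid5_usable = (drives - 1) * drive_gb      # RAID5: one drive's worth of parity
raid10_usable = (drives // 2) * drive_gb    # RAID10: mirrored pairs, half the raw space

print(f"RAID5 usable:  {raid5_usable} GB")
print(f"RAID10 usable: {raid10_usable} GB")
# With these numbers the data would NOT fit after conversion, which is exactly
# why you check before starting the background task.
print(f"data fits after conversion: {used_gb <= raid10_usable}")
```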

Now, there are advantages to the MetaLUN implementation that I should mention. The only one I can think of is that logical drive and RAID set expansions happen much more quickly.

From:

James Latham
DS3000/DS4000
Product Specialist

Storage Solutions Engenio Storage Group

LSI Logic Corporation
