Emulex Blog: The Implementer's Blog

Blog Series Part 1: Can disk parameter Disk.DiskMaxIOSize make a difference with large I/Os in VMware vSphere® 5.1?

Posted April 24th, 2013 by Alex Amaya

This blog is the first in a two-part series that examines Fibre Channel over Ethernet (FCoE) implementations with VMware vSphere 5.1 using VMware’s software FCoE and a hardware FCoE adapter. These blogs are intended to share our findings regarding the relative performance of software and hardware FCoE adapters when working with large-block, sequential I/O – in particular, the impact of the Disk.DiskMaxIOSize setting on storage performance.

In recent lab tests with software FCoE and a few virtual machines (VMs), we encountered an unexpected drop in throughput (MB/s) with large block I/O. We were using sequential I/O through a single physical 10Gb Ethernet (10GbE) port. The VMs were running Microsoft Windows 2008 R2; each was configured with four virtual CPUs (vCPUs) and 8GB of memory. Two raw device mapping (RDM) disks were mapped to each host. We enabled the software FCoE driver that comes with the hypervisor and made appropriate LUN mappings.

The IOmeter software tool was used to test a range of block sizes (512B – 1MB) across all RDM drives, with two workers per VM – one set to test 50% reads and the other to test 50% writes for full duplex mode. The targets used in this case were four Linux-based storage memory emulators with four targets each, for a total of 16 targets.

Figure 1 shows the results for these sequential I/O tests when we used the default setting for Disk.DiskMaxIOSize. This figure represents the baseline performance for software FCoE.

Figure 1. I/Os with default Disk.DiskMaxIOSize setting using software FCoE.

With larger block sizes, the array was unable to perform any I/Os.

Figure 2 shows throughput during the same test of software FCoE and, in particular, the drop-off that occurred with larger block sizes. At this point, we theorized that the array became stressed with blocks that were 64KB or larger.

Figure 2. Throughput with default Disk.DiskMaxIOSize setting running software FCoE.

We also observed latency times using esxtop on the host to see if they might be a concern. Results are shown in Table 1, which provides average rather than median values.

For more information on storage performance in vSphere, refer to the VMware vSphere Blog.

Table 1. Average latency values with default setting

Block size    DAVG read    DAVG write
256K          16 ms        16 ms
512K          17 ms        43 ms
1M            19 ms        50 ms

Note that, with the default Disk.DiskMaxIOSize setting, no I/Os were taking place with larger block sizes, as demonstrated by Figures 1 and 2. DAVG represents the latency between the adapter and the target device. According to VMware, latencies of 20 ms or more are a major storage performance concern.

To address this storage performance issue with large block sizes, we turned to VMware KB article (kb:1003469), which suggests reducing the size of I/O requests passed to the storage device in order to enhance storage performance. You can achieve this size reduction by tuning the global parameter Disk.DiskMaxIOSize, which is found on the host under Configuration→Software→Advanced Settings→Disk. As shown in Figure 3, this parameter is defined as the Max Disk READ/WRITE I/O size before splitting (in KB); thus, larger blocks are split into multiples of the Disk.DiskMaxIOSize setting.

Kudos to Erik Zandboer, VMware expert and VMdamentals blogger, for bringing this article to our attention!

Figure 3. Displaying the default Disk.DiskMaxIOSize setting, which is 32MB

After reading this KB article, we decided to vary the setting of Disk.DiskMaxIOSize to determine if this would, indeed, enhance storage performance. Since we had noticed that performance was beginning to deteriorate with 64KB blocks, we restricted the maximum block size to 64KB, as shown in Figure 4.

Figure 4. Changing the Disk.DiskMaxIOSize setting
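
To make the effect of a 64KB limit concrete, here is a minimal Python sketch of the splitting arithmetic. It is a simplified model only: it assumes a request is cut into full-size pieces plus one remainder, and the function name split_io is ours for illustration, not anything from ESXi.

```python
def split_io(request_kb, max_io_kb):
    """Return the sizes (in KB) of the pieces a request would be split into.

    Simplified model: full-size pieces of max_io_kb plus one remainder piece.
    The real ESXi splitting logic may differ (this is an assumption).
    """
    if request_kb <= 0 or max_io_kb <= 0:
        raise ValueError("sizes must be positive")
    full, remainder = divmod(request_kb, max_io_kb)
    pieces = [max_io_kb] * full
    if remainder:
        pieces.append(remainder)
    return pieces


# Block sizes from the large-block end of the test range, limit set to 64KB
for block_kb in (64, 256, 512, 1024):
    pieces = split_io(block_kb, 64)
    print(f"{block_kb:>5} KB request -> {len(pieces):>2} I/O(s) of up to 64 KB")
```

Under this model a 1MB guest request reaches the target as sixteen 64KB requests, which is the behavior the tuned setting relies on.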

Next, we re-ran the test to see if there was any impact on IOPS, throughput and latency.

Note that we did not monitor CPU utilization, which should not be overlooked if you plan to tune Disk.DiskMaxIOSize.

Figure 5 shows that reducing the Disk.DiskMaxIOSize setting had little impact on read/write I/Os.

Figure 5. I/O performance with the new Disk.DiskMaxIOSize setting when running software FCoE

Figure 6 shows the throughput achieved with the new Disk.DiskMaxIOSize setting. Throughput was now able to approach line rate (2,300 MB/s) and, rather than crashing as before, only dropped slightly with large-block I/Os (512KB and 1MB).

Figure 6. Throughput with the new Disk.DiskMaxIOSize setting

Table 2 shows that, with the new Disk.DiskMaxIOSize setting, latency began to average out between read and write I/Os, with 33ms for 512KB blocks and 68ms – 72ms for 1MB blocks. However, these latency timings are still in the range of severe storage performance conditions.

Table 2. Average latency values with Disk.DiskMaxIOSize set to 64KB

Block size    DAVG read    DAVG write
256K          13 ms        13 ms
512K          33 ms        33 ms
1M            68 ms        72 ms
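
As a quick cross-check of the two latency tables against the 20 ms guidance mentioned earlier, the following Python sketch flags every DAVG value at or above that threshold. The numbers are copied from Tables 1 and 2; the helper name over_threshold is ours.

```python
THRESHOLD_MS = 20  # latency level the post cites as a major concern

# (block size, DAVG read ms, DAVG write ms), copied from Tables 1 and 2
default_setting = [("256K", 16, 16), ("512K", 17, 43), ("1M", 19, 50)]
tuned_64kb = [("256K", 13, 13), ("512K", 33, 33), ("1M", 68, 72)]


def over_threshold(rows, threshold_ms=THRESHOLD_MS):
    """Return (block size, direction, latency) for each value >= threshold."""
    flagged = []
    for block, read_ms, write_ms in rows:
        if read_ms >= threshold_ms:
            flagged.append((block, "read", read_ms))
        if write_ms >= threshold_ms:
            flagged.append((block, "write", write_ms))
    return flagged


for label, rows in (("default", default_setting), ("64KB limit", tuned_64kb)):
    print(label, over_threshold(rows))
```

With the default setting only the 512K and 1M writes are flagged; with the 64KB limit, reads and writes at both of those block sizes are flagged, which is why latency remains a concern even though throughput improved.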

Please note, these results are specific to our lab environment. You should perform your own tests to determine if changing the default Disk.DiskMaxIOSize setting would be beneficial in your particular environment. In addition, there may be trade-offs elsewhere in the storage stack that we are still investigating; we’ll also be comparing these software FCoE results with a hardware FCoE implementation.

So, do you really need to change the Disk.DiskMaxIOSize setting? We agree with Erik that you first need to determine the block size your VMs are executing and, if you are getting poor storage performance with large blocks, then tuning Disk.DiskMaxIOSize might be a consideration. Note that we performed these tests in order to validate that tuning Disk.DiskMaxIOSize would enhance storage performance in a lab environment with sequential reads and writes. However, in many real-world cases, traffic between ESX/ESXi hosts and the array tends to be more random.

Here are the key takeaways:

  • Software FCoE out of the box does not handle large block I/O requests well, resulting in lower throughput and latency outside the range recommended by VMware. Poor large block performance can negatively impact applications such as backup, streaming media and other large block applications.
  • Using VMware ESXi’s Disk.DiskMaxIOSize, we could change the performance dynamics. However, latency still measured outside the acceptable range.

In part two of this blog, we will repeat this testing to evaluate the impact of Disk.DiskMaxIOSize on storage performance with a hardware FCoE implementation. Note that hardware FCoE has many advantages, including better CPU efficiency. Stay tuned…

Emulex VMware vSphere® 5.1 Web Client plug-in and the missing step…

Posted January 18th, 2013 by Alex Amaya

Emulex recently announced support for the new VMware vSphere® 5.1 Web Client with Emulex OneCommand Manager plug-in for VMware vCenter™ version 1.4.10. So of course, I downloaded the plug-in and replaced my older version. I found out the original OneCommand Manager plug-in for the VMware vCenter desktop client works and installs the same way. But the Web Client is a bit different. I found out I needed an extra step to get this puppy working with my Web Client.

My intent in this blog is to inform you of a step that’s different in the configuration process for the plug-in. After trying a few times to get it to appear correctly, I gave in and searched VMware’s documentation. That’s right – I read the manual. In my case, I came across the VMware vSphere 5.1 API/SDK Documentation, which notes that, by default, the plug-in is disabled and does not show up in the Web Client. When you install the OneCommand Manager plug-in for VMware vCenter version 1.4.10, it includes the plug-in for the Web Client. If you are able to get the plug-in to work through the VMware vCenter desktop client, you should be able to install it for the Web Client. Of course, you must have VMware single sign-on working, the VMware vSphere 5.1 Web Client installed and working, your credentials all taken care of and the correct CIM providers installed to get the plug-in registered and running.

So here’s what we had to do to get the plug-in to appear under “Classic Solutions” for both cluster and host.

First, the file called webclient.properties in the VMware vSphere Web Client install directory needs to be unhidden. To do that, we need to unhide the %ProgramData% directory.

  1. Open windows explorer
  2. Select C drive
  3. Press the alt key to bring up the conventional menu bar
  4. Click tools
  5. Click folder options
  6. Click view
  7. Check show hidden files, folders and drives
  8. Click OK

By the way, if you don’t capitalize the ‘P’ in the scriptPlugin property name used in the steps that follow, the Web Client won’t launch.

To activate the plug-in in the Web Client, a properties file needs to be modified on the server where the vSphere Web Client is installed.

Steps:

  • Locate the webclient.properties file in the VMware vSphere Web Client install directory, typically %ProgramData%\VMware\vSphere Web Client, and add the following line.
    scriptPlugin.enabled = true
  • Save and close the file.
  • Restart the VMware vSphere Web Client service.
  • Once the above change is made, log back into the Web Client.
  • Select Host and Clusters from the Home Tab.
  • Open by clicking the right arrow pointing to the name of your VMware vCenter server in the domain to see your cluster and hosts.
  • Select the host or the cluster and at the top, you should see a security certificate error. That’s where the plug-in will be registered.
  • After registering the plug-in, you should now see a tab called “Classic Solutions” for either cluster or host. See image below.
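
If you would rather script the properties-file edit than make it by hand, the change can be sketched in Python as below. This is only an illustration: the function name enable_script_plugin is ours, it handles only simple key = value lines, and you still need to restart the Web Client service yourself afterwards.

```python
from pathlib import Path


def enable_script_plugin(properties_path):
    """Make sure 'scriptPlugin.enabled = true' is set; return True if changed.

    Only simple 'key = value' lines are handled (an assumption); comment
    lines starting with '#' are left alone.
    """
    path = Path(properties_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    key = "scriptPlugin.enabled"  # note the capital 'P'
    found = False
    changed = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name, _, value = stripped.partition("=")
        if name.strip() == key:
            found = True
            if value.strip() != "true":
                lines[i] = f"{key} = true"
                changed = True
    if not found:
        lines.append(f"{key} = true")
        changed = True
    if changed:
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return changed


# Example call (path taken from the steps above; adjust for your install):
# enable_script_plugin(r"C:\ProgramData\VMware\vSphere Web Client\webclient.properties")
```

The function is idempotent, so running it a second time leaves the file untouched and returns False.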

Can you run both of the plug-ins? Sure! See the image below.

VMware vSphere- 5 Web Plugin

The process to unlock and enable Fibre Channel over Ethernet (FCoE) capabilities with IBM BladeCenter HS23 using IBM’s Feature on Demand

Posted December 19th, 2012 by Alex Amaya

The purpose of this blog is to inform you of a new application note written by the Implementer’s Lab on how to enable FCoE with IBM’s Feature on Demand (FoD).

This past October, IBM announced an IBM BladeCenter HS23 which was one of the first IBM BladeCenter servers to offer four integrated LAN ports. The Emulex 10Gb Ethernet (10GbE) Virtual Fabric Adapter II (VFA II) for IBM BladeCenter HS23 LAN on Motherboard (LOM) is integrated into select IBM BladeCenter HS23 blade servers. It features two physical 10GbE ports and two physical 1GbE ports. This LOM solution can be configured to provide up to eight virtual ports (four per physical 10GbE port) each of which can operate at 100Mb – 10Gb with a maximum of 10Gb per physical port.

Emulex Technical Marketing received a pair of new IBM HS23 Blades to test with FCoE. Given the Emulex 10GbE VFA II capabilities and flexibility, two virtual ports can be configured for storage connectivity for iSCSI or FCoE. Because this capability is disabled out of the box and must be unlocked using IBM’s FoD, the Implementer’s Lab team decided to write an application note to help with the unlocking process. Please check out the application note here for more detail on how to enable FCoE with IBM BladeCenter HS23 using IBM FoD and let us know what you think!

Super Computing 2012 – Emulex NX solutions with FastStack DBL and low latency switches

Posted November 13th, 2012 by Alex Amaya

It’s important to have the correct network switch with Emulex OneConnect® 10Gb Ethernet (10GbE) Network Xceleration™ (NX) OCe12000-D adapters for low latency applications.

SC12 is in full swing this week, and I am sure we will be hearing all about the latest technology for High Frequency Trading (HFT), High Performance Computing (HPC), RDMA, low latency, InfiniBand (IB), 40GbE and a few others. In line with the low latency, Emulex has been testing the impact of having the correct network switch in HFT environments. In many cases, a network switch is just a network switch with a few bells and whistles. However, when it comes to having the correct network infrastructure to achieve a few more thousand trades than the competition, choosing the right switch can make a difference.

Emulex Technical Marketing engineers (TMEs) tested Emulex’s OCe12000-D low latency adapters on two servers connected back-to-back with no switch in between. Both User Datagram Protocol (UDP) and Transmission Control Protocol (TCP) were tested with and without transparent acceleration (TA) to show the difference between the two protocols. As expected, UDP should have lower latency when compared to TCP, but would a switch make a difference? UDP is small and lightweight so there’s no error correction, which is why it is so quick. There is some error-checking with UDP, but there’s no recovery option. Because UDP is a small lightweight protocol and sits on top of IP, there’s no ordering of messages and no tracking of connections. TCP, on the other hand, has to set up a connection before any data can be sent, and it also performs reliability checks and congestion control, adding to the overhead. TCP can guarantee the message arrives intact as it was meant to be sent. UDP, by contrast, offers no guarantee that the messages sent will arrive.

Emulex TMEs tested two switches: one a Gnodal GS4008, and the other a standard 10GbE network switch. Each server had a dual-port OCe12000-D adapter. One port was connected to the Gnodal switch and the second port was connected to the 10GbE switch. The same was done on the second server. We ran a simple utility to test for both UDP and TCP messages. The utility used is called tcp_pingpong, which is a simple point-to-point test showing basic functionality between one OCe12000-D sender adapter port and another OCe12000-D receiver adapter port. There was minimal port configuration on both switches, so these results will vary when the switches are tuned to the vendors’ recommendations.
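
For readers who have not used a ping-pong style test, the Python sketch below shows the general idea over loopback UDP: send a small message, wait for the echo, and record half the round-trip time. It is not the utility used in our tests and it will not produce comparable numbers (interpreter overhead dominates); the function names are ours for illustration.

```python
import socket
import threading
import time


def udp_echo_server(sock, count):
    """Echo `count` datagrams back to whoever sent them, then return."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def udp_pingpong(count=1000, payload=b"x" * 64):
    """Return half-round-trip times in microseconds over loopback UDP."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server.settimeout(5.0)
    worker = threading.Thread(
        target=udp_echo_server, args=(server, count), daemon=True
    )
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)
    samples = []
    try:
        for _ in range(count):
            start = time.perf_counter()
            client.sendto(payload, server.getsockname())
            client.recvfrom(2048)
            samples.append((time.perf_counter() - start) / 2 * 1e6)
    finally:
        worker.join()
        client.close()
        server.close()
    return samples


if __name__ == "__main__":
    results = sorted(udp_pingpong())
    print(f"min {results[0]:.1f} us, median {results[len(results) // 2]:.1f} us")
```

A TCP variant would add connection setup before the first message, which is part of the overhead difference described above.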

The latency test results for both of the switches were compared to the server back-to-back tests. The servers connected back-to-back were as low as 2.8µs for UDP and 3.8µs for TCP. Just as in the server back-to-back test, we used Emulex’s FastStack™ DBL to demonstrate TA by running the dblrun command in front of the tcp_pingpong command. The image below shows results of the Emulex OCe12000-D adapter when connected back-to-back, through a Gnodal low latency switch and through a regular 10GbE network switch. Having the right configuration in your infrastructure can result in more trade transactions than you would get with a standard 10GbE switch.

FastStack DBL

  • Speedometer 1: Highlights the lowest latency simulation scenario utilizing a connection of back-to-back servers with no switch in between
  • Speedometer 2: Highlights low latency numbers utilizing a Gnodal 10GbE low latency switch in between the servers
  • Speedometer 3: Highlights latency numbers utilizing a regular 10GbE switch in between the servers

We hope to see you at SC12 this week at our booth, #632, to see low latency demonstrations for the HFT market!

What’s an “Error 1327: Invalid Drive E”?

Posted October 25th, 2012 by Alex Amaya

Last week, I was trying to uninstall OneCommand Manager and VMware Update Manager from the same Windows 2008 server. I kept getting a pop-up window with the message “Error 1327: Invalid Drive E”. So, like almost everyone when something unknown pops up, I turned to the Internet. I saw several postings with regard to “Invalid Drive E:” and a few other drive letters. All seemed to relate to a system folder mapped to a network drive, a changed CD-ROM drive letter or a possibly corrupt registry key. I took a look at my registry key settings and all pointed to the correct path. I then picked one of the links from my search and used Adobe’s help forum. I basically followed solution 1 and it seems to have corrected the problem. Here is the link I used: http://helpx.adobe.com/creative-suite/kb/error-1327-invalid-drive-drive.html

Basically, go to a command prompt and use the DOS command called “subst” to map the missing drive letter to an existing path.

Command prompt

  1. Select Start » Run
  2. Type cmd and press Enter
  3. Type the command as shown in the image above – “C:\>subst E: C:\” -and press Enter
  4. Type exit to close the command window
  5. Attempt to uninstall the OneCommand Manager

In my case, both OneCommand Manager and VMware Update Manager successfully uninstalled from the server.

Get Ready for Emulex High-Performance 16GFC and I/O Management Solutions for VMware vSphere 5!

Posted August 27th, 2012 by Alex Amaya

Are you planning on going to VMworld 2012 at the Moscone Center in San Francisco this week? If you are, stop by the Emulex booth #2023 for a preview of some cool stuff from Emulex for VMware vSphere ESXi 5.1. With VMware’s recent announcement of vSphere 5.1, which includes support for 16Gb Fibre Channel (16GFC) Host Bus Adapters (HBAs) with Emulex in-box drivers, Emulex will be demonstrating the Emulex LightPulse® LPe16000 Series 16GFC HBAs in a side-by-side comparison with LightPulse LPe12000 Series 8GFC HBAs in two separate virtual machines (VMs). As Figure 1 below shows, by using the Emulex in-box driver, we are able to achieve 1600MB/s of read throughput for a single LPe16002 port.

Figure 1

There will be a few other interesting demos on display, such as the Emulex OneCommand® Manager virtual appliance (vAPP) and the Emulex OneCommand Manager plug-in for VMware vCenter Server for vSphere 5. Management is a key area of interest for many of our customers. Managing tasks such as driver versions, firmware updates and diagnostics – just to name a few – can be accomplished with our Emulex OneCommand Manager applications. For example, see Figure 2. If you are a VMware administrator who prefers to manage Emulex adapters through VMware vCenter Server, the Emulex OneCommand Manager plug-in for VMware vCenter Server provides single-pane-of-glass manageability without the need for an additional host to run the application.

But if your environment consists of VMware, Windows and Linux hosts, and you’d still like to use the OneCommand Manager application, there are two options available. One is the Emulex OneCommand Manager application itself. You will need to download the application and core kit for each of the operating systems (OS) and install the application, preferably on a management server of some kind. The second is the Emulex OneCommand Manager vAPP available from the Implementer’s Lab tools section. The folks from the Implementer’s Lab have taken all the necessary steps for deploying OneCommand Manager and placed it on a VM running CentOS 6.2 with Emulex OneCommand Manager. Deploy the VM in your vSphere 5 environment, and then install the appropriate CIM provider for each OS running on the host you wish to manage with an Emulex adapter. From within your VMware vCenter Server, you can now manage the VM and also run OneCommand Manager, even within the new web client from VMware.

Figure 2

We have a number of demos to show you at VMworld 2012, so please stop by booth #2023 and ask for a demo. The folks from the Implementer’s Lab will also be in attendance, so stop by for a free 8GB USB memory stick with the Emulex OneCommand Manager vAPP loaded for you to try. There will also be several booth theater presentations on the Solution Implementer’s Lab, covering resources and tools for your deployment needs. Stop by Aug. 26 at 4:30, Aug. 27 at 3:30 and 5:00, Aug. 28 at 11:00 and 2:30. We hope to see you there!

Black Hat USA 2012 – Emulex FastStack Sniffer10G Product Demo at the Emulex Booth

Posted July 23rd, 2012 by Mark Jones

With Scott Schweitzer, Myricom

If you’re planning on attending Black Hat USA 2012 at Caesar’s Palace in Las Vegas, be sure to stop by the Emulex booth to see a demonstration of FastStack Sniffer10G working with Suricata, at booth #141 at the show. And, we’re also giving away ten passes to the Gun Store for their Zombie package Thursday afternoon!

Of particular excitement for our Implementer’s Lab team is the demonstration that we built that highlights our new OneConnect® OCe12000 10Gb Ethernet (10GbE) Network Xceleration™ solution running FastStack Sniffer10G with Suricata (see our announcement here, for more information). This demo showcases the key performance benefit of moving to OneConnect Network Xceleration over using a standard network adapter.

FastStack Sniffer10G

In this demonstration, we will show server-efficient 10Gb bandwidth and 100 percent lossless performance of the OCe12000 adapter with FastStack Sniffer10G software. This solution can provide network traffic capture, injection and analysis for performance-sensitive and mission-critical market segments, such as network surveillance, monitoring and analysis, deep packet inspection (DPI), test and measurement, and distributed denial-of-service (DDoS) defense appliances. Our demonstration highlights the performance aspect required of these missions by showing maximum 10Gb Ethernet (10GbE) performance when passing typical enterprise-class traffic of more than 3.5 million packets per second, while not dropping a single packet. Generic 10GbE cards leveraging Suricata encountering this level of traffic will typically drop 70% of the incoming packets.

Suricata with FastStack Sniffer10G

To leverage the performance of FastStack Sniffer10G with Suricata, several things must be done in the proper order:

  1. Install Sniffer10G: This package includes both a firmware program for the Emulex NX adapter and a new device driver for both Linux and Windows. To obtain the code, you’ll need to log on to Myricom’s website and download the latest build of Sniffer10G for your Linux or Windows system. You’ll then need to install the code, confirm that the adapter is licensed to run Sniffer10G, and confirm that the driver is loaded properly. Sniffer10G also includes several utilities for testing both packet capture and generation; these can be used to confirm connectivity.
  2. Build Suricata with Sniffer10G: Suricata is designed to run with a number of adapters. Once you’ve downloaded the Suricata code, make sure that when you configure the build, prior to making the drivers, you include the necessary flags to utilize Sniffer10G’s libraries in the process.
  3. Tune Suricata: The configuration file is /etc/suricata/suricata.yaml and there are a number of changes that can be made that will greatly improve system performance.

Running Suricata with FastStack Sniffer10G

To run Suricata with Sniffer10G, you also need to pass in some environment variables that define the number of Sniffer10G buffers to set up and the flags that define how to connect those buffers to threads. Typically, these variables are: SNF_NUM_RINGS=16 and SNF_FLAGS=0x1
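
One way to pass these variables to a child process, sketched here in Python, is to build the environment explicitly and hand it to subprocess. The helper names snf_environment and run_with_snf are ours; the Suricata command line itself is left to you, since it depends on how your build was configured.

```python
import os
import subprocess
import sys


def snf_environment(num_rings=16, flags=0x1, base=None):
    """Return a copy of the environment with the two Sniffer10G variables set."""
    env = dict(os.environ if base is None else base)
    env["SNF_NUM_RINGS"] = str(num_rings)
    env["SNF_FLAGS"] = hex(flags)  # hex(0x1) -> "0x1"
    return env


def run_with_snf(command, **kwargs):
    """Run `command` (a list of arguments) with the Sniffer10G variables set."""
    return subprocess.run(command, env=snf_environment(**kwargs), check=True)


if __name__ == "__main__":
    # Stand-in child process that just echoes the variables it received;
    # replace this list with your own Suricata command line.
    run_with_snf([
        sys.executable, "-c",
        "import os; print(os.environ['SNF_NUM_RINGS'], os.environ['SNF_FLAGS'])",
    ])
```

Building the environment in one place keeps the ring count and flags next to each other, which helps when you experiment with different values.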

How to Test at 3.5 Million Packets per Second Using Real Traffic

The packet capture (pcap) file being played back contains 2,049 unique packets and SNF_REPLAY loops through this file 2500 times to generate a traffic stream of 5.12 million packets. It then injects these packets on the wire, in this case at wire rate, to achieve a packet rate of 3.58 million packets per second (Mpps) at a bandwidth of 9.279 Gbps. The difference between this bandwidth and 10Gbps is overhead, for example the inter-packet spacing on the wire.
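
The figures in this paragraph can be checked with a few lines of Python. The per-frame overhead of 20 bytes (8 bytes of preamble plus a 12-byte inter-packet gap) is the standard Ethernet value, but whether the quoted 9.279 Gbps counts any other per-frame bytes is an assumption on our part, so treat the last figure as an estimate.

```python
UNIQUE_PACKETS = 2049
LOOPS = 2500
RATE_PPS = 3.58e6        # packets per second, from the text
BANDWIDTH_BPS = 9.279e9  # bits per second, from the text

total_packets = UNIQUE_PACKETS * LOOPS
replay_seconds = total_packets / RATE_PPS
avg_packet_bytes = BANDWIDTH_BPS / RATE_PPS / 8

# Assumption: 8 bytes preamble + 12 bytes inter-packet gap per frame
PER_FRAME_OVERHEAD_BYTES = 20
overhead_bps = RATE_PPS * PER_FRAME_OVERHEAD_BYTES * 8

print(f"packets replayed:     {total_packets:,}")
print(f"replay duration:      {replay_seconds:.2f} s")
print(f"average packet size:  {avg_packet_bytes:.0f} bytes")
print(f"estimated wire gaps:  {overhead_bps / 1e9:.3f} Gbps")
```

The average packet size works out to roughly 324 bytes, and the preamble and inter-packet gap alone account for more than half a gigabit per second at this packet rate.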

Fig 2. Sniffer10G Replay tool usage

We will have this solution running live in our booth #141 at Black Hat USA in Las Vegas, Nevada. Please feel free to stop by our booth and ask us for a proper demonstration. We look forward to seeing you at Black Hat.

Interop 2012 – Emulex New Product Demos at the Emulex Booth

Posted May 8th, 2012 by Mark Jones

If you’re planning on attending Interop 2012 at the Mandalay Bay in Las Vegas, be sure to stop by the Emulex booth to see demonstrations of some of our newly announced products. You can find us at booth #1117 at the show, and it will be hard to miss since we will be displaying a Ducati motorcycle doing a wheelie in our booth, and we are giving it away! Of particular excitement for our Implementer’s Lab team are the demonstrations that we built that highlight our new OneConnect® OCe12000 10Gb Ethernet (10GbE) Network Xceleration™ solution line of products. These demos showcase the key performance benefits that each of the three new OneConnect Network Xceleration solutions have to offer.

FastStack DBL:
This demo showcases the low latency benefits of our new OCe12000 adapter combined with FastStack™ DBL™ software, which should be of interest to High Frequency Trading environments or anyone looking for the lowest possible Ethernet network latency. In our demonstration, we will be comparing the UDP and TCP latency of our network adapter when using the host network stack compared to FastStack DBL.

Fig 1. FastStack DBL Demo Screen

FastStack Sniffer10G
In this demonstration, we will show server-efficient 10Gb bandwidth and 100 percent lossless performance of the OCe12000 adapter with FastStack Sniffer10G software. This solution can provide network traffic capture, injection and analysis for performance-sensitive and mission-critical market segments, such as network surveillance, monitoring and analysis, deep packet inspection (DPI), test and measurement, and distributed denial-of-service (DDoS) defense appliances. Our demonstration highlights the performance aspect required of these missions by showing maximum 10GbE performance of more than 14 million packets per second, while only utilizing ~4.5% of the server CPU resources.

Fig 2. Sniffer10G Demo Screen

FastStack VideoPump
The third demo is a beta showing of our new FastStack VideoPump™ software that will be available later this summer. As the name implies, this product is targeted toward video streaming servers and appliances that require very high amounts of individual streams per adapter, while assuring predictable QoS. Our demonstration will showcase FastStack VideoPump’s extreme scalability and performance while maintaining low server CPU utilization. The demo uses 8 Network Interface Card (NIC) ports in a single server, communicating over 17,000 individual 3.5Mbit/sec traffic streams for an aggregate bandwidth of over 60Gb/s, all the while only using 25% of the server CPU resources.

Figure 3. FastStack VideoPump demo screen

If you would like a personal walk-through of these demos, please stop by the booth and ask to speak with me or anyone else from the Implementer’s Lab team. Also be sure to visit the Ethernet Alliance booth #2360 and ask to see Alex Amaya who is representing us in an industry-wide demonstration of various Ethernet technologies including our new OneConnect Network Xceleration solutions.

Are we up, or are we down?

Posted April 5th, 2012 by Alex Amaya

During our testing with HP’s ProLiant DL380 G7 server and HP’s 82E 8Gb Fibre Channel (8GFC) adapter, we encountered some connectivity issues with our internal infrastructure. With daily changes to our test lab infrastructure to accommodate the different tests we perform, there is always the possibility of something getting damaged along the way.

Deploying HP 8GFC adapters with VMware ESXi 5.0 is a straightforward install since our Emulex lpfc820 drivers are already inbox. However, we did experience intermittent problems with our LUNs disconnecting and then reconnecting. With the Emulex OneCommand® Manager vCenter Server plug-in, there is an option to track up and down link connectivity. This feature is not on by default. When we enabled it, we noticed our link status in the Tasks & Events tab from vCenter Server showing one of our ports disconnecting often. First, we tried replacing the SFP and we still experienced the intermittent disconnect. Next, we replaced the fibre cable and the problem was solved. The description in the Tasks & Events tab will provide the WWN of the Fibre Channel ports with a link down and up status. The image below illustrates the link up status after the cable was replaced.

For more information, check out the latest technical whitepaper from HP, which covers some of the features with ESXi 5.0. The deployment guide, entitled VMware vSphere 5.0: 8Gb/s Fibre Channel SANs with HP ProLiant DL380 G7 Servers and HP 3PAR Utility Storage, can be downloaded from the Implementer’s Lab.

Why do I need hardware offloads, I have CPUs to burn!

Posted March 7th, 2012 by Mark Jones

It wasn’t that long ago that enterprise x86 computing was performed on single processor cores of just a few megahertz (MHz). Getting data in and out of the computer was an expensive consumer of the processing resources. If you were serious about I/O, it made perfect sense to consider buying one of those fancy Host Bus Adapters (HBAs) that offloaded the I/O protocol processing to specialized processors made just for that, saving the computer processor to perform other general compute functions. But since then, processor technology has marched forward at a tremendous pace; processing speed has increased from a few MHz up to ~3GHz, which is now the practical limit due to power/thermal efficiency issues. Multithreading, multi-cores and increased processor cache have also been big news in computing to the point where we now can have a tremendous amount of compute power in a very small space in the data center.

This week, Intel announced availability of its new Xeon E5-2600 processor family. The platform, codenamed “Romley,” has a top model that will be offered by server manufacturers with 16 physical cores and a whole menu of other great technologies to improve performance and efficiency. So with all this new compute power, you may be thinking: “Why do I need hardware offloads? I have CPUs to burn!”

Wikipedia is the first place to look to put water on that fire. Moore’s Law (1) is famous for predicting the long-term relationship of the growth of compute power, basically the doubling of processor performance every 18 months. Related to this is Wirth’s law (2), which states that “software is getting slower more rapidly than hardware becomes faster,” or Gates’ law: “the speed of commercial software generally slows by 50% every 18 months.” So no matter how fast hardware gets, the data center will evolve to find a way to consume all its resources through software.

If you have worked in a data center during this technology march in recent years, you have noticed that this compute power is getting packed more densely, and it’s possible to get hundreds, if not a few thousand cores into a single rack. This has shifted the data center problem from performance capacity to power and cooling capacity. It’s not about how many servers can fit in a room, it’s about what the maximum power and cooling capacity of the room is. You do not have to look too closely at Intel’s Xeon Processor E5-2600 product announcement before you notice that much of what they promote are features that deliver performance at efficient power levels and features to lower power consumption when not needed. Turbo Boost Technology 2.0 raises CPU performance (increases power draw) only when needed and reduces it when not needed. We have noticed in the lab that these power efficiency features have a significant effect on the servers’ power consumption as measured at the AC power cord. For instance, in our Implementer’s Lab, we have measured a swing of more than 110 watts at the power cord between a server at idle and one that is highly loaded (~80% CPU usage).

Emulex HBAs and converged network adapters (CNAs) offload I/O with low power processors that are specifically designed to process I/O protocols far more efficiently than a generalized system processor, and are complementary to the new power/performance efficiency features of the Xeon E5-2600 product family. By offloading the protocol processing from the server operating software stack, we lower the CPU load significantly, which causes the CPU to use power-saving strategies that result in far lower system power usage.

An example of this is a server running VMware ESXi 5, comparing realistic virtual machine (VM) I/O workloads to storage devices over a Fibre Channel over Ethernet (FCoE) network. You have a choice of using software FCoE over a 10Gb Ethernet (10GbE) Network Interface Card (NIC) or using an Emulex CNA, which will offload the FCoE protocol processing. Our test used four VMs with an equal load to storage of 35k I/O transactions per VM. We measured both the CPU used on the hypervisor and the AC input power usage of the server and found that the server used 53% of its overall CPU resources while running the I/O using software FCoE and just 23% when using the offload CNA. Saving 30% of the server’s CPU resources is significant enough to trigger the server’s power-saving strategies to use less power, and this showed up in the server’s input power measurements. At idle with no I/O workload running, the server was drawing 110 watts. While running the I/O over software FCoE, the server was drawing 167 watts. When running over our CNAs with hardware FCoE, it measured 129 watts. The server used 38 fewer watts to perform at the same performance level, which is a significant power saving that can add up over time or when applied throughout the data center.
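
To see how the per-server difference adds up, here is a small Python sketch that works from the three wattage measurements above. It assumes the workload runs around the clock for a full year, which is our simplification and not something we measured; the function name annual_kwh_saved is ours.

```python
IDLE_W = 110
SOFTWARE_FCOE_W = 167
HARDWARE_FCOE_W = 129
HOURS_PER_YEAR = 24 * 365

# Power drawn above idle for each implementation
software_above_idle_w = SOFTWARE_FCOE_W - IDLE_W
hardware_above_idle_w = HARDWARE_FCOE_W - IDLE_W

# Difference between the two implementations under the same I/O load
saved_w = SOFTWARE_FCOE_W - HARDWARE_FCOE_W


def annual_kwh_saved(servers):
    """Estimated yearly saving in kWh, assuming a constant 24x7 workload."""
    return saved_w * HOURS_PER_YEAR * servers / 1000


for count in (1, 100, 1000):
    print(f"{count:>5} server(s): {annual_kwh_saved(count):,.0f} kWh per year")
```

Looked at another way, software FCoE added three times as much power above idle as the hardware offload did for the same I/O load.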
So the next time you get a new super-fast server and you are tempted to burn some of its CPU cycles on running software FCoE or software iSCSI, remember…it takes energy to cook!


(1) Gordon E. Moore, 1965, periodically updated by Intel: http://en.wikipedia.org/wiki/Moore%27s_law
(2) Niklaus Wirth, 1995: http://en.wikipedia.org/wiki/Wirth%27s_law
