Quality is obviously important for any software product, and storage systems have particularly demanding quality requirements. Customers expect to reliably retrieve the data they store. Unfortunately, software has bugs, and the consequence of a bug in a storage system ranges from minor inconvenience, to system downtime, to corrupted or lost user data. The recently published paper "A Study of Linux File System Evolution," by Lu et al., presented at the FAST '13 conference, describes the evolution of Linux storage systems. The researchers classified over 5,000 patches to six Linux file systems across 40 releases in the 2.6 kernel stream and found a steady rate of bug fixing even in very mature file systems like XFS and Ext3. All of the systems had crash bugs. All of the systems had data corruption bugs.
While the Panasas storage system doesn't use those file systems (it uses its own OSDFS code for low-level management, its own distributed system metadata service, and its own DirectFlow® file system client), we have a large code base, so bugs inevitably arise during development. The goal is to flush those bugs out before software is released to the field. To that end, we have focused on quality through automated testing since the very early days of development. This is a complex process in our environment because we can assemble many different configurations of our blade-based solutions, we need to support large numbers of clients in a high-performance networking environment, and we have to handle a wide array of faults in our distributed system, all while storing customer data safely.
The following note is from one of our engineers, Terrence Wong. Terrence has built a number of tools that help developers harness the automated testing frameworks we have evolved over time. Here is his perspective…
In my college programming courses, all of the time was spent on software development. If a program ran, didn't crash, and produced the correct output, it was considered a success. In addition, software assignments were narrowly focused, whereas in real life we are almost always contributing to a larger endeavor with significant cross-functional interdependencies.
During my first job interviews it became clear that my college experience bore no resemblance to real-world testing. I remember being confused by the racks and racks of servers and storage that I saw during company lab tours. I soon learned that this was not mere eye candy to get me excited about an employment opportunity, but that the hardware was actually performing a critical task: stressing the software to the breaking point. The typical college experience did not reflect anything close to the true test environment required to keep an application running day-in, day-out, for years at a time. Testing is a major component of the product development process, and it differentiates companies in a competitive field.
At Panasas, one of our primary goals for ActiveStor 14 was to deliver a high-quality software release to accompany that new hybrid storage appliance. A buggy release is painful for everyone: users of the system experience downtime, Customer Service works overtime, and engineering gets bogged down with firefighting. Given the criticality of software stability to product success, isn't it ironic that when software development is taught at the university level, quality is rarely given priority?
Product quality requires an investment in tools and infrastructure to ensure the product will stand up to high-stress workloads, but physical testing of all scenarios is not possible. At Panasas, one of the ways that we produce high-quality software is through test automation. Using a test automation framework developed in-house, we are able to efficiently execute release test plans, concurrently build all variants of a source tree, or test individual developer changes. Automated tests run 24 hours a day, which is definitely preferable to humans doing the work only during business hours.
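Our framework itself is proprietary, but the core idea of unattended, round-the-clock execution can be sketched in a few lines. Everything below — the suite names, the `run_suite` helper, the placeholder command — is hypothetical and only illustrates the shape of such a loop, not the actual Panasas framework.

```python
import subprocess
import time
from datetime import datetime

# Hypothetical suite list; a real framework would read these from a
# release test plan rather than a hard-coded list.
SUITES = ["metadata_stress", "client_failover", "osd_scrub"]

def run_suite(name):
    """Run one suite as a subprocess; return (name, passed, elapsed seconds)."""
    start = time.monotonic()
    # Placeholder command; a real harness would dispatch work to lab hardware.
    result = subprocess.run(["echo", f"running {name}"], capture_output=True)
    return name, result.returncode == 0, time.monotonic() - start

def nightly_cycle():
    """One unattended pass over every suite, logging timestamped results."""
    results = []
    for suite in SUITES:
        name, passed, elapsed = run_suite(suite)
        results.append((name, passed))
        print(f"{datetime.now().isoformat()} {name}: "
              f"{'PASS' if passed else 'FAIL'} ({elapsed:.1f}s)")
    return results
```

A scheduler (cron, or a long-running dispatcher) would invoke `nightly_cycle` continuously, which is what lets testing proceed around the clock without human attention.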
Since we first started developing our automated testing system in 2001, we have been continually thinking of new ways to improve testing and, as a result, product quality. One way to measure improvements in testing over time is to observe the number of unique test suites that are run. Each test suite probes some area of the product and can take from a few minutes to a few days to complete. The more areas of the product code that are tested, the higher the quality of the software release. Figure 1 provides an overview of the growth in automated testing that Panasas has implemented as part of its release process. The growth in number of tests, now in the thousands, reflects the increasing complexity of the software over time as well as Panasas’ emphasis on maintaining a solid feedback loop from prior releases to drive an even higher quality level.
Yet another way to measure improvements in testing is to consider the overall quantity of tests that are run against a product release. When more tests are run, we can have greater confidence in the conclusions we draw from their results. Figure 2 highlights the total number of automated tests (now in the hundreds of thousands) run by Panasas in any one year.
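The two metrics above — breadth (unique suites, as in Figure 1) and volume (total runs, as in Figure 2) — fall out naturally from the raw test records. As a hypothetical sketch (the record format and sample data here are invented for illustration):

```python
from collections import Counter

def summarize_runs(records):
    """Summarize (suite, passed) records as breadth, volume, and pass rate."""
    suites = {name for name, _ in records}            # distinct suites exercised
    passes = Counter(passed for _, passed in records)  # tally PASS vs FAIL
    return {
        "unique_suites": len(suites),   # breadth of coverage (cf. Figure 1)
        "total_runs": len(records),     # volume of testing (cf. Figure 2)
        "pass_rate": passes[True] / len(records) if records else 0.0,
    }

# Hypothetical records: (suite name, passed?)
records = [("metadata_stress", True), ("client_failover", True),
           ("metadata_stress", False), ("osd_scrub", True)]
summary = summarize_runs(records)
# unique_suites=3, total_runs=4, pass_rate=0.75
```

Tracking both numbers matters: breadth tells you how much of the product is exercised at all, while volume tells you how much statistical weight each conclusion carries.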
Contrasting this data with my college experience highlights the difference between a project and a product.