Quantum hard drives MTBF.
Quantum web site for MTBF information:
The HTML page is attached to the 'Doc Link' field. The following is a text excerpt from the page. Graphs and formulas can be viewed at the above-mentioned URL. When investigating hardware failures, it is recommended that site particulars such as power, grounding, and environment be investigated. Excessive vibration can also contribute to premature hard drive failure.
Hard Disk Drive Reliability
Corporate Reliability Engineering
The purpose of this paper is to address some of the concerns and questions that our customers may have with regard to certain claims of product MTBF and how those claims relate to product reliability and warranty. In addition, this paper intends to describe how Quantum determines a product's predicted MTBF and the relationship of MTBF to Annualized Failure Rate (AFR).
Reliability is defined as: 1. the probability that an item can perform its intended function for a specified interval under stated conditions, and 2. the probability that parts, components, products, or systems will perform their designed-for functions without failure in specified environments for desired periods at a given confidence level. [1,2] A measure that is frequently used as an indirect indicator of system reliability is MTBF (Mean Time Between Failures).
MTBF is a term that has been used throughout the disk drive industry as a measure of how reliable a drive is expected to be. MTBF is a predicted value and is meant as a gauge by which to measure competitive products. In no way is MTBF meant to imply a condition of warranty.
The term MTBF (Mean Time Between Failures) is often used as a basic measure of reliability for 'repairable' items such as a CPU board or a disk drive. However, in order to calculate MTBF, a system has to fail, be repaired, be returned to service, and then fail again. Disk drives are most often replaced once they fail, and therefore MTBF is not strictly the correct measure of reliability. What Quantum means by MTBF is Mean Time To First Failure (MTTFF) or, as it is sometimes called, Mean Time To Failure (MTTF). MTTF is also generally used to measure the reliability of 'non-repairable' items such as IC components or light bulbs.
When we say that a drive has a predicted or theoretical MTBF of 800,000 hours, this means that in a given large population of drives, the average time to failure will be 800,000 hours. MTBF (or MTTF) is not a means of predicting the life of an individual drive or a small group of drives. To achieve this number, a drive would run until it reaches its end-of-life period or fails, then be replaced by a new drive of similar reliability, and so on. In this case it is theoretically possible that 800,000 hours would elapse before a failure would occur in a large population.
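A quick way to see that MTBF describes a population average rather than an individual drive's lifetime is a small simulation. This is an illustrative sketch, not part of Quantum's methodology; it assumes a constant (exponential) failure rate, the same assumption used for the survival model later in this paper, and the drive count and seed are arbitrary:

```python
import random

def population_average_ttf(mtbf_hours, n_drives, seed=0):
    """Draw one failure time per drive from an exponential
    distribution with the given MTBF, then return the population
    average time to failure."""
    rng = random.Random(seed)
    failures = [rng.expovariate(1.0 / mtbf_hours) for _ in range(n_drives)]
    return sum(failures) / n_drives

# The population average converges toward the 800,000-hour MTBF
# even though a single drive's useful life is only a few years.
avg = population_average_ttf(800_000, 100_000)
```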
For a specific product, the MTBF is calculated using a budgeting technique: the PCBA (Printed Circuit Board Assembly) is modeled using Bellcore's prediction procedure for the parts stress/parts count method with Quantum's database (see Figure 1) [3,4], while the HDA (Head Disk Assembly) is modeled on historical data containing the failure analysis of field-returned drives of similar technology. The data from these models are combined with an analysis of the PCBA, HDA, and drive assembly processes to create a complete MTBF estimate.
MTBF Prediction Modeling
In the recent past, disk drives were treated as 'black boxes': they were tested under varying conditions, the failures were plotted, and an MTBF estimate was calculated. While this may have served the purpose for relatively low-MTBF systems, it relied exclusively on historical and test data and was therefore less than adequate for predicting how newer designs with different technologies would perform.
Quantum uses a budgeting technique to model disk drive reliability and arrive at a reasonable MTBF estimate. This is done in order to highlight those components or sub-assemblies that may require more concentrated efforts to achieve the overall reliability goal of the drive.
Modeling begins by dividing the entire drive assembly into its three major sub-assemblies: the Head Disk Assembly (HDA), containing the mechanical components; the control module, or Printed Circuit Board Assembly (PCBA), with its electronic devices; and the Flex assembly, which includes the drive's pre-amp.
The Flex assembly is modeled separately because it combines characteristics of both mechanical (the flex cable) and electrical (the pre-amp and other discrete) components. Another reason for modeling the Flex assembly separately is that it is contained within a unique environment. Because it is housed within the HDA, it does not experience the cooling effect of external air flow that the HDA and the exposed electronic components of the control module encounter. In addition, the ambient air within the HDA is typically 10°C or more above the air temperature surrounding the drive.
Finally, included in the model are the contributions of the assembly processes for both the PCBA and the drive itself. These are failures that typically fall into the 'early life' category.
The PCB surface-mount assembly process can induce faults not related to individual component functionality, such as weak solder joints, contamination, and component over-stress. Drive-level assembly can add its own set of potential faults, such as loose screws and missing parts, assembly-related media defects, and firmware-induced faults that could result in eventual mechanical or electrical failures.
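The budgeting approach above treats the sub-assemblies as a series system: the drive fails when any one of the HDA, PCBA, Flex assembly, or assembly-process contributions fails, so their failure rates simply add. A minimal sketch in Python; the budget values are hypothetical placeholders chosen only to illustrate the arithmetic, not Quantum's actual figures:

```python
# Hypothetical failure-rate budget in FIT (failures per 10^9
# device-hours); the values are illustrative, not Quantum's data.
budget_fit = {
    "HDA": 700,      # mechanics, modeled from historical field data
    "PCBA": 400,     # electronics, from the parts count/stress model
    "Flex": 100,     # flex cable plus pre-amp, modeled separately
    "assembly": 50,  # PCBA- and drive-level process contributions
}

# Series system: failure rates of independent sub-assemblies add.
total_fit = sum(budget_fit.values())     # 1,250 FIT
mtbf_hours = 1e9 / total_fit             # -> 800,000 hours
```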
We leverage previous products' performance to improve the relative accuracy of our prediction and analysis. An early parts count prediction is made on the basis of previous products for evolutionary designs, or on the basis of very little knowledge for revolutionary designs, to form a rational basis for evaluating the design. The predictions are used in conjunction with the reliability model to understand how each part's failure rate contributes to the whole design. The following tasks are part of Quantum's reliability prediction plan:
- Perform a parts count reliability prediction.
- Perform a theoretical thermal analysis.
- Perform a parts stress analysis reliability prediction.
- Perform a parts stress analysis reliability prediction with thermal analysis data.
Figure 1. Quantum's Reliability Database
Our efforts in the reliability prediction process are aimed at minimizing the gap between predicted MTBF and operational MTBF by incorporating our experience with product, process, and firmware/software failure rates, which have not traditionally been included in MTBF prediction techniques. Although the inclusion of these items in reliability prediction is relatively new in our program, we have begun verification of this approach on some of our products. We believe that this enhanced MTBF prediction technique is a significant advance.
Limitations of MTBF Prediction Modeling:
Both the parts count and parts stress models and techniques provide estimates and thus have their limitations:
- Cannot predict device, drive, or equipment design errors.
- Cannot predict unanticipated defects induced in manufacturing.
- Are limited to a mathematical model relationship based on basic laws of physics.
- Cannot predict the 'humanware' or 'human' element.
Annualized Failure Rate
Annualized Failure Rate (AFR) is a commonly used measure of field reliability performance in the drive industry and is probably the most useful measure of failure rates or trends for a group of units at a site. It can be applied more effectively to predict the reliability of units during a given time at a particular customer site. The AFR is based on the monthly total number of returned field-failure units divided by the total cumulative installed base, multiplied by 12 to annualize the failure rate. The MTBF value can then be estimated from the average annual drive operating hours, or assumed Power-On Hours (duty cycle), divided by the AFR. The MTBF derived from the AFR model can also be compared with the accumulated run-time MTBF model for accuracy.
A. Quantum assumes a 2-month lag for all (WSSG, DPSG, SSPG)* returned products. This is based on empirical data and allows time for shipment from our configuration center to major OEM customers, integration and testing in OEM customers' system boxes, and shipment and operation at the end user.
B. The Average Power-On Hours (POH) used for the AFR model is based on the drive duty cycle. The drive duty cycle is the average Power-On Hours per year in a system environment and may be defined in one of three categories:
- 100% duty cycle, POH 8,760 hours: primarily in workstation or system server environments.
- 71% duty cycle, POH 6,240 hours: primarily in a PC or a small network environment.
- 24% duty cycle, POH 2,080 hours: primarily in a portable PC system.
C. A three-month moving average is used to smooth the fluctuation of monthly failure rates and to track the drive's sustained field performance.
* WSSG: Workstation and System Storage Group, DPSG: Desktop PC Storage Group, SSPG: Specialty Storage Product Group
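The AFR bookkeeping described above (monthly returns over the cumulative installed base, annualized, then smoothed with a three-month moving average) can be sketched in a few lines of Python. The return counts and installed-base figures here are made-up illustrations, not Quantum field data:

```python
def monthly_afr_percent(returned_units, installed_base):
    """Monthly field returns over the cumulative installed base,
    multiplied by 12 to annualize, expressed as a percentage."""
    return returned_units / installed_base * 12 * 100

def three_month_moving_average(rates):
    """Smooth monthly AFR figures with a 3-point moving average."""
    return [sum(rates[i - 2 : i + 1]) / 3 for i in range(2, len(rates))]

# Illustrative monthly (returns, installed base) pairs.
history = [(90, 100_000), (110, 110_000), (100, 120_000)]
monthly = [monthly_afr_percent(r, base) for r, base in history]
smoothed = three_month_moving_average(monthly)
```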
The field reliability tracking results are used as follows:
- Provide empirical failure rates for updating Quantum's reliability database.
- Complete the 'closed loop' feedback on the Reliability Assessment Program.
- Identify opportunities for Design for Reliability (DFR) activities on future products.
- Provide input for life cycle costs, projected field returns, and reliability growth.
Frequently Asked Questions
Q. What is meant by 'predicted (Theoretical) MTBF' and 'Operational MTBF'?
Today's disk drives are extremely reliable and given the fact that many of them have specified MTBFs of 800,000 hours or more, it is impractical as well as very expensive to attempt to demonstrate this level of reliability.
In order to determine the reliability of new products during the early phases of design and development, mathematical models, as described earlier, are created to ascertain their reliability characteristics using empirical field data. The results derived from these models are the 'predicted (theoretical) MTBF.' Once volume production gets under way and a significant number of run-hours has accumulated, this field data is used to validate the model.
Since Quantum's prediction database is built from field-return assessment and failure analysis, the predicted (theoretical) MTBF does not include NTF (No Trouble Found) returns or other failures such as unanticipated design and manufacturing defects.
Technically, an operational failure is defined as a system shutdown caused by a customer-perceived drive failure that leads to a replacement. At Quantum, the operational MTBF is calculated by counting all field-returned drives as failures, excluding handling damage, upgrades, and drives returned for credit. Across different system platforms and applications, the operational MTBF is usually lower than the theoretical MTBF.
Q. What constitutes a failure, and what is meant by a normal random failure?
Any drive that has not reached the end of its warranty period and experiences an event that prevents it from performing any of its specified operations or where the soft or hard error rate exceeds that specified in the Product Specification can be considered a failure. This does not include failures due to mishandling or abuse, failures that occur outside the specified environmental conditions or those which occur beyond the warranty period.
A normal random failure is a random failure caused by manufacturing defects or inherent defects, under drive-specified conditions and normal usage, during the drive warranty period.
Q. Can I expect a drive with an 800,000-hour MTBF (92 years) to run error-free for that length of time?
No. Quantum drives have a warranty period of two to five years, with a useful life of a few years beyond that. In any event, the useful life of any given drive is approximately 5-7 years (45,000 to 60,000 hours). However, if the drive is replaced with another drive of similar reliability when it reaches end of life, and this process continues, it is theoretically possible for a large group of drives to accumulate 800,000 hours. The probability of any individual drive surviving to 800,000 hours is very low. As mentioned before, reliability is the probability that a drive can perform its intended function for a specified interval under stated conditions. The environmental conditions and the usage of a drive will impact its reliability. Generally, reliability decreases as temperature increases. In addition, reduced air flow, high seek rates, and excessive shock and vibration also degrade reliability.
Q. What is the relationship between MTBF (92 years) and Design Useful Life (5 years)?
The relationship can be interpreted as:
1) MTBF = the average time to failure of a population of units under specified conditions. (The specification = 92 years)
2) Design Useful Life = the expected operating time of the product. This is the actual mission (run) time for the product. (5 years)
R(t) = e^(-t / MTBF)

where R = probability of survival, assuming a constant failure rate model.
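Under that constant-failure-rate model, the survival probability over the 5-year design useful life can be checked directly. A short sketch assuming an 800,000-hour MTBF and a 100% duty cycle (8,760 POH per year):

```python
import math

def survival_probability(hours, mtbf_hours):
    """R(t) = exp(-t / MTBF), the constant-failure-rate survival model."""
    return math.exp(-hours / mtbf_hours)

# Probability a drive survives its 5-year design useful life.
r_design_life = survival_probability(5 * 8760, 800_000)  # about 0.947
```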
Q. How is AFR calculated?
AFR is often calculated as a percentage using the following formula to determine the expected number of returns: [7-10]
AFR = (POH per year / MTBF) × 100%   (1)
AFR can also be calculated from the number of failures received during a given month and the total number of units in the installed base. This can also be used to determine whether the number of failures meets or exceeds expectations.
AFR = (monthly returned units / cumulative installed base) × 12 × 100%   (2)
Because AFR is directly linked to the predicted MTBF, it can also be used to measure MTBF:
MTBF = (POH per year × 100%) / AFR   (3)
Note: Because AFR is calculated on a monthly basis and then multiplied by 12 to annualize it, with a limited number of units at a customer site the AFR is subject to large statistical variation and fluctuation. This is quite different from an annual failure rate computed from a full 12 months of data. Therefore, any AFR measurement will be subject to statistical variation; the degree of variation will depend on the number of drives included in the measurement. With more installed drives, less variation can be expected.
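The AFR-to-MTBF conversion described above inverts cleanly, which is easy to verify in a few lines. This sketch assumes a 100% duty cycle (8,760 POH per year); the function names are illustrative:

```python
def afr_percent(poh_per_year, mtbf_hours):
    """Expected annual failure rate as a percentage of the fleet."""
    return poh_per_year / mtbf_hours * 100

def mtbf_from_afr(poh_per_year, afr_pct):
    """Recover MTBF from an observed annualized failure rate."""
    return poh_per_year * 100 / afr_pct

afr = afr_percent(8_760, 800_000)   # about 1.095 %
mtbf = mtbf_from_afr(8_760, afr)    # recovers 800,000 hours
```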
Figure 2 shows a projected new target of MTBF 500K hours (AFR 1.75%), based on our empirical data and a reliability growth rate of 0.35.
** Graph not available **
Figure 2. New HDD Field MTBF/AFR Projection
** Graph not available **
Figure 3. DPSG Field MTBF Growth
Based on two years of field-return analysis and returned-failure-analysis Pareto reports covering more than twenty-five million drives in the field, our experience in tracking field MTBF growth and AFR shows that both DPSG and WSSG products depend heavily on ship volume and target MTBF. Figure 3 shows the DPSG field MTBF, which met its specification of 300K hours 4-6 months after volume shipment. Figure 4 shows the WSSG field MTBF, which met its specification of 500K hours 11 months after volume shipment. It can be seen that one of the major differences is the shipment volume: 6.3 million (DPSG) vs. 0.5 million (WSSG) drives. Figures 3 and 4 also show an example of the initial projection compared with the actual field MTBF measurement. The correlation is fairly good; therefore, one can conclude that this approach is indeed useful in early product development or early volume shipment. [7,8,9,10]
Q. If I purchase 10,000 drives with an 800,000-hour MTBF, how many can I expect to fail in the first year and then during the 5-year warranty period?
Using formula 1 from above and assuming that each failed drive is replaced with a new one having a similar MTBF and that the drives have a 100% duty cycle, then the failing percentage may be calculated as follows:
AFR = (8,760 / 800,000) × 100% ≈ 1.095%
** Graph not available **
Figure 4. WSSG Field MTBF Growth
The average number of failed units in each year would then be:
10,000 drives × 1.095% ≈ 110 units per year
According to Bellcore prediction theory and a field reliability growth rate of 0.35, the number of failures in the first year is estimated at approximately 110 to 169 units (an additional 59 failed units). This estimate is subject to statistical variation at various confidence levels. Over the five-year warranty, the total failures would be five times the average annual failed units plus the additional 59 units in the first year, which translates to approximately 609 units.
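The baseline point estimate above can be reproduced with the constant-failure-rate model; the roughly 59 additional first-year units come from the Bellcore early-life adjustment and sit on top of this baseline. A sketch, with the fleet size held constant by replacement:

```python
def expected_failures(n_drives, mtbf_hours, poh_per_year, years=1):
    """Constant-failure-rate point estimate. Each failed drive is
    assumed replaced by one of similar MTBF, keeping the fleet size
    constant at n_drives."""
    return n_drives * poh_per_year * years / mtbf_hours

baseline_year1 = expected_failures(10_000, 800_000, 8_760)    # 109.5
baseline_5yr = expected_failures(10_000, 800_000, 8_760, 5)   # 547.5
# Adding the ~59 early-life failures on top of the first year gives
# the 110-169 first-year range and the roughly 609-unit five-year
# total quoted in the text.
```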
The above point estimate is based on the assumption that the 10,000 drives are purchased and run simultaneously.