The data below is supplied to assist in Reliability, Maintainability, and Availability (RMA) engineering competency advancement and assist in the performance of R&M activities to support NASA missions.
For Method Application Planning, see the RMA Application Rationale page, or
provide feedback, comments, and corrections via the R&M Feedback Form
to enable the continued readiness and relevance of the data provided within this site.
The probability that a component or system will perform its intended function with no failures over a given period of time (mission time) when used under specified operating conditions (test environment or operating environment); equivalently, a component's or system's susceptibility to failure or to specific failure scenarios over that period.
This data set should not be assumed to include all R&M guidance available. Please see the SMA Toolbox on Reliability and Maintainability as well as the following links:
Raw data sharing may also be possible; please see the data repository list (Link TBD) and contact the appropriate TDT lead (R&M TDT Points of Contact) to request data.
- Discrete Reliability Value Estimates
- Physics of Failure
- Reliability Allocations
- Reliability Block Diagrams
- FMECA
- Worst Case Analysis
- PSA/EPSA
- Fault Tree
- PRA
- Sneak Circuit Analysis
- Single Event Effects Analysis
- Software Assessment
Reliability Values/Predictions
The primary purpose of a reliability value estimate or prediction is to provide guidance relative to the expected reliability for a product as compared to the customer's need, expressed or implied, for the product. The use of a prediction is a means of developing information for design analysis without actually testing and measuring the product capabilities.
Reliability prediction plays a major role in many reliability programs. Standards based reliability prediction relies on defining failure rates for the components of a system based on statistics or modeling (e.g., Physics of Failure Handbook or Physics of Failure Practitioner's section), depending on the types of components, the use environment, the way the components are connected and the reliability prediction standard. These component failure rates are then used to obtain an overall system failure rate.
Early predictions are strongly encouraged in the product Concept/Planning phase to support design trades or other analyses as necessary. This is when most decisions are made regarding redundancy, parts, materials, and processes. The first analysis should be performed as soon as the concepts are ready to be traded or initial design data is available. Predictions should be continued throughout the design process, being updated as more detailed design information becomes available. The later predictions evaluate stress conditions and life-limiting constraints, as well as the relative impacts of implementation issues. After launch, predictions should be updated with system performance/use data to support mission extension and disposal decision making, since pre-launch usage assumptions may no longer be valid.
Direct Practitioner guidance - Quantitative assessment analyses (Reliability Block/Logic Diagrams (RBD), FMECA, quantified FTAs) require that the most precise and realistic reliability estimate be developed for each system required for space asset functionality, so that life and risk decisions can be made accurately. The more precise the estimate, the more likely it is that over-design, premature mission termination, or erroneous mission execution/extension will be avoided. The appropriate statistical methods should therefore be applied given the data available; otherwise, efforts may need to fall back on less precise handbook data (e.g., MIL-HDBK-217 or FIDES). In either case, values should be routinely reassessed based on the performance of all configuration items, especially batteries, solar arrays/strings, mechanisms, optics/sensors, and propulsion systems.
Weibull/Weibayes – This method is best applied when sufficient on-orbit/test data is available from the system under analysis, similar systems, or historical systems, and that data demonstrates an increasing failure probability, aging, or increasing degradation. In this method, failure/anomaly data over time is fit to a corresponding Weibull distribution (or a mix of Weibull distributions) to postulate a new failure rate for the system from the start of its operations. Weibull distributions with β < 1 have a failure rate that decreases with time, also known as infantile or early-life failures. Weibull distributions with β close to or equal to 1 have a fairly constant failure rate, similar to handbook data but reflective of actual space/test performance. Weibull distributions with β > 1 have a failure rate that increases with time, also known as wear-out failures, which will likely be risk factors in disposal and/or extension decisions. If only historical failure data or a very small sample set is available, and/or the current system has not yet failed but similar systems are known to have degradation trends (e.g., performance, temperature, power loss, increased/decreased torque), it may be best to perform a Weibayes analysis (a Weibull with an assumed β) to generate a new failure rate, although a full Weibull analysis is preferred when the data supports it.
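As a minimal sketch of the Weibayes step, the characteristic life can be computed in closed form from unit operating times, the observed failure count, and an assumed β. All numbers below (unit hours, β = 2.0) are hypothetical, and the "use r = 1 when no failures have occurred" convention is one common conservative choice, not a requirement of this guidance:

```python
import math

def weibayes_eta(times, failures, beta):
    """Weibayes characteristic life: eta = (sum(t_i^beta) / r)^(1/beta).

    times    -- operating times (hours) of all units, failed or suspended
    failures -- observed failure count r
    beta     -- assumed Weibull shape parameter, taken from similar systems
    """
    r = max(failures, 1)  # conservative convention when no failures yet
    return (sum(t ** beta for t in times) / r) ** (1.0 / beta)

def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)^(beta-1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# Hypothetical on-orbit data: five units, one wear-out failure at 5000 hr,
# beta = 2.0 assumed from a similar mechanism family.
times = [8000.0, 8000.0, 8000.0, 8000.0, 5000.0]
eta = weibayes_eta(times, failures=1, beta=2.0)
rate_now = weibull_hazard(8000.0, beta=2.0, eta=eta)
```

Because β > 1 here, `rate_now` grows with time, which is the wear-out signature that feeds disposal/extension decisions.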
Bayesian Statistical Inference for updating failure rates – This is a classical statistical method best used when the on-orbit/test failure data for a component is insufficient to calculate a new failure rate outright, but there is enough success and failure data to update the failure rate currently assumed. It can be used prior to design, prior to operations, or in situ, and learns from data incrementally until convergence on a new failure rate is reached.
Any representative prior and distribution type (e.g., binomial for failures in n demands, Poisson for events in time, gamma for n failures in time) can be used; with enough data, each will converge on the same posterior distribution. Once a posterior is attained it should be routinely updated so that its distribution (most precise, if used) or a selected point estimate (e.g., the mean) can be used for the component in further system assessment, along with the good-as-new assumption.
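The gamma-Poisson case above has a closed-form conjugate update, sketched below. The prior parameters and the observed evidence are hypothetical placeholders; in practice the prior would come from the handbook or heritage rate actually assumed:

```python
def gamma_poisson_update(alpha_prior, beta_prior, failures, exposure_hours):
    """Conjugate Bayesian update of a constant failure rate.

    Prior:     lambda ~ Gamma(alpha_prior, beta_prior), beta in hours
    Evidence:  `failures` events observed over `exposure_hours` of operation
    Posterior: Gamma(alpha_prior + failures, beta_prior + exposure_hours)
    Returns (alpha_post, beta_post, posterior mean rate).
    """
    alpha_post = alpha_prior + failures
    beta_post = beta_prior + exposure_hours
    return alpha_post, beta_post, alpha_post / beta_post

# Hypothetical: a handbook rate of 1e-5/hr encoded as Gamma(1, 1e5),
# then 2 anomalies observed in 50,000 on-orbit hours.
a, b, mean_rate = gamma_poisson_update(1.0, 1.0e5, failures=2, exposure_hours=5.0e4)
```

Each new block of operating experience simply feeds the previous posterior back in as the next prior, which is the incremental learning the paragraph describes.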
Handbook/Database Data – In some cases it may be necessary to use a handbook, such as MIL-HDBK-217 (any version of the MIL handbook), or a generalized database of reference part failure rates, such as Telcordia, PRISM, RDF-2000, FIDES, Siemens SN29500, NTT Procedure, SAE PREL, or British Telecom HRD-5, to estimate a unit's reliability based on its constituent parts. In these cases, the limitations and uncertainty of those data must be considered and reflected in further system assessments. All of these handbooks assume the components of the system have constant failure rates, with modification factors to account for various quality, operating, and environmental conditions. [Pecht 2009] The issue is that these factors are too generalized to give the most accurate prediction of reliability, and their use may lead to misleading results. These handbooks may also not contain the exact part being used in the system, so a similar part will need to be substituted and may not characterize the part accurately. However, the good-as-new assumption is still plausible with this method.
In addition, if handbook data is selected using actual performance temperatures rather than predicted design upper limits, failure rate estimation accuracy will increase. If operating conditions (e.g., voltage, current, temperature, duty cycle) have changed, additional analysis (e.g., part stress, derating, trend analysis, and worst case) is warranted, which can either adjust the failure rate directly or support engineering judgement adjustments.
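A handbook-style parts-count prediction reduces to a weighted sum under the constant-failure-rate series assumption. The sketch below uses illustrative part counts, generic rates, and quality factors, not values from any actual handbook table:

```python
# (part count N, generic failure rate per 1e6 hr, quality factor pi_Q)
# -- all values are illustrative placeholders, not handbook data.
parts = {
    "resistor":  (120, 0.002, 1.0),
    "capacitor": (60,  0.004, 1.0),
    "ic":        (15,  0.05,  2.0),
}

def parts_count_rate(parts):
    """Parts-count prediction: lambda = sum(N_i * lambda_gi * pi_Qi),
    assuming constant failure rates and a series reliability model."""
    return sum(n * lam * piq for n, lam, piq in parts.values())

lam_total = parts_count_rate(parts)   # failures per 1e6 hours
mtbf_hours = 1.0e6 / lam_total
```

Real predictions add environment and stress factors (pi_E, pi_T, etc.) per the chosen handbook; the structure of the sum is the same.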
Engineering Judgement, Performance Indicators, and Assumptions – This method is best used for in-situ updating of probabilities, since it depends on having a previous prediction for the system. If a system shows no signs of degradation or wear based on performance data and diagnostic trends, then a good-as-new probability (1.0) can be assumed at the beginning of the mission or mission extension period, with either the original failure rate or an applicable statistically updated failure rate or distribution (e.g., Bayesian). However, if similar systems have shown degradation/wear over time in operations or in test (e.g., batteries, solar cells) but the system being assessed has not exhibited these symptoms, the original failure rate cannot be used with the good-as-new assumption; under this condition the system's underlying failure rate should be adjusted, for example with engineering judgement factors informed by the similar systems' degradation trends. Conversely, if the system has shown degradation, then the good-as-new assumption and engineering judgement adjustment factors should not be used.
Related Guidance/Literature Links: TBD
Physics of Failure (PoF)
Direct Practitioner guidance - Coming Soon
Until then see Physics of Failure Handbook
Related Guidance/Literature Links: TBD
Reliability Allocations
Reliability allocation involves setting reliability objectives/goals for a Program based on the Program objectives and mission profile. Once the Reliability Program goal/objective has been determined, the Reliability allocation to each Mission and System can be determined. This should occur in the initial development stages of design and prior to designing major system upgrades. Allocations can also be made to Elements and subsystems if desired, based on the procurement and design strategy. The simplest method for allocating reliability is to distribute the reliability objective uniformly among all the subsystems or components. While uniform allocation is easy to calculate, it is generally not the best way to allocate a reliability objective: the "best" allocation would take into account the cost or relative difficulty of improving the reliability of different subsystems or components.
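Both allocation approaches mentioned above can be sketched in a few lines. The uniform method gives each of n series subsystems the same goal; an ARINC-style weighting instead splits the system failure-rate goal in proportion to each subsystem's current predicted rate. The numeric inputs are hypothetical:

```python
def equal_apportionment(r_system, n_subsystems):
    """Uniform allocation for a series system: each subsystem gets
    R_i = R_sys^(1/n), so the product of the R_i recovers R_sys."""
    return r_system ** (1.0 / n_subsystems)

def arinc_allocation(lambda_current, lambda_goal):
    """ARINC-style weighting: allocate the system failure-rate goal in
    proportion to each subsystem's current (predicted) failure rate."""
    total = sum(lambda_current)
    return [lam / total * lambda_goal for lam in lambda_current]

# Hypothetical: R_sys goal of 0.95 across 4 series subsystems, and a
# system rate goal of 8e-6/hr split across three predicted rates.
r_sub = equal_apportionment(0.95, 4)
alloc = arinc_allocation([2e-6, 5e-6, 3e-6], 8e-6)
```

The weighted split leaves the hardest subsystems with proportionally larger budgets, which is the "relative difficulty" consideration the paragraph raises.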
Direct Practitioner guidance - Coming Soon
Related Guidance/Literature Links: TBD
Reliability Block Diagrams
RBDs permit the entire team to consider tradeoffs and impact of decisions on system reliability and balance priorities across various elements of a system.
Not every element of a system has to achieve high reliability when doing so makes no effective difference to the overall system reliability. Focus improvement activities on the elements that actually impact the system's ability to achieve the goal.
RBDs provide a vehicle for tradeoff analysis and decision making. For example, given constraints of weight and reliability and a set of options to improve a power supply's reliability, the best option may involve doubling the weight by using redundant power supplies, while a less reliable option may only use a different circuit design and components. The tradeoff becomes interesting when the lower-weight solution does not meet the power supply's allocated reliability goal.
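The power-supply trade above can be evaluated with the two basic RBD combination rules: series (all blocks must work) and active parallel redundancy (at least one must work). The block reliabilities below are hypothetical:

```python
def series(blocks):
    """Series path: all blocks must work; R = product(R_i)."""
    r = 1.0
    for ri in blocks:
        r *= ri
    return r

def parallel(blocks):
    """Active redundancy: at least one block works; R = 1 - product(1 - R_i)."""
    q = 1.0
    for ri in blocks:
        q *= (1.0 - ri)
    return 1.0 - q

# Hypothetical trade: one improved supply (R = 0.995) vs. two redundant
# lower-reliability supplies (R = 0.97 each), each option in series with
# the rest of the system (R = 0.99).
single_improved = series([0.995, 0.99])
redundant_pair = series([parallel([0.97, 0.97]), 0.99])
```

Here the redundant pair wins on reliability despite using worse individual units, which is exactly the weight-versus-reliability tension the paragraph describes.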
Direct Practitioner guidance - Coming Soon
Failure Mode and Effects Analysis / Critical Items List (FMEA/CIL)
The Failure Mode and Effects Analysis (FMEA) is performed to identify failure modes. As part of this process, critical failure modes that could lead to loss of life or loss of mission are also identified. These critical failure modes are then placed into a Critical Items List (CIL), which is carefully examined for programmatic control by implementing inspection requirements, test requirements and/or special design features or changes which would minimize the failure mode occurrence.
Failure Mode and Effects Analyses and resulting CILs can be used not only as a check of the design of systems for reliability, but also as main design drivers for the product or service. Reliability management is the activity involved in coordinating the reliability analyses of design, development, manufacturing, testing, and operations to obtain the proper performance of a given product under specified environmental conditions. Reliability management interfaces with the program management function, the design function, the manufacturing function, the test and inspection function, and the quality function.
Reliability management is approached through the formulation and preparation of reliability plans, the performance of specific product design analysis, the support of classical reliability analysis activities, and project/product team participation using concurrent engineering methodologies (see NASA Reliability Design Practice GD-ED-2204).
Failure Mode and Effects Analysis Criticality Analysis (FMECA)
Failure mode and effects analysis (FMEA) and failure modes, effects, and criticality analysis (FMECA) are methods used to identify ways a product or process can fail. The basic methodology is the same in both cases, but there are important differences between the processes.
Qualitative versus Quantitative: FMEA provides only qualitative information, whereas FMECA also provides limited quantitative information or information capable of being measured. FMEA is widely used in industry as a "what if" process. It is used by NASA as part of its flight assurance program for spacecraft. FMECA attaches a level of criticality to failure modes; it is used by the U.S. Army to assess mission critical equipment and systems.
Extension: FMECA is effectively an extension of FMEA. In order to perform FMECA, analysts must perform FMEA followed by critical analysis (CA). FMEA identifies failure modes of a product or process and their effects, while CA ranks those failure modes in order of importance, according to failure rate and severity of failure.
Critical Analysis: CA does not add information to FMEA. What it does, in fact, is limit the scope of FMECA to the failure modes identified by FMEA as requiring reliability centered maintenance (RCM).
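The CA ranking step can be sketched with the MIL-STD-1629A-style mode criticality number, Cm = β·α·λp·t. The part failure rate and the β/α splits below are hypothetical placeholders:

```python
def mode_criticality(beta, alpha, lam, t):
    """MIL-STD-1629A-style mode criticality: Cm = beta * alpha * lambda_p * t.

    beta  -- conditional probability the failure effect occurs given the mode
    alpha -- fraction of the part failure rate producing this mode
    lam   -- part failure rate (failures/hour)
    t     -- operating time (hours)
    """
    return beta * alpha * lam * t

# Hypothetical failure modes for one part over a 10,000-hr mission.
modes = {
    "short": mode_criticality(1.0, 0.3, 2e-6, 10000),
    "open":  mode_criticality(0.5, 0.4, 2e-6, 10000),
    "drift": mode_criticality(0.1, 0.1, 2e-6, 10000),
}
ranked = sorted(modes, key=modes.get, reverse=True)
```

Sorting by Cm produces the importance ordering that CA adds on top of the qualitative FMEA worksheet.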
Direct Practitioner guidance - Coming Soon
Related Guidance/Literature Links:
Worst-Case Analysis (WCA)
WCA is a design analysis used to verify compliance with performance requirements for flight equipment and to help prevent design flaws. WCA determines the performance margins in order to demonstrate proper functionality of the system throughout the mission life under combinations of adverse conditions. The analytical results with positive design margin will ensure proper operation of the equipment under the most unfavorable combination of realizable conditions. In the opposite case, negative margin will identify potential problems for very specific performance requirements. This technique helps to design in graceful degradation at the performance margin (avoiding catastrophic failure at the boundaries). In real applications, conditions rarely stack up in the worst possible combinations, and the equipment often functions beyond the nominal mission. Because of this, the WCA is an important part of the design and development of spacecraft and instruments. Although this handbook focuses on WCA of electrical circuits, the WCA techniques are applicable to designs at all levels, including component (such as hybrids, multichip modules [MCMs], etc.), circuit, assembly, subsystem, and system. Formal WCA usually is required of all electronic equipment and performed on critical circuits, at a minimum. The WCA is an often used, valuable design practice for other engineering disciplines/systems as well (hydraulic, mechanical, etc.), but it is not always a formal deliverable for those areas.
For electronic circuits, WCA is an extension of classical circuit analysis. In aerospace applications, it is intended to estimate the maximum range of performance of the equipment due to the effects of aging, radiation, temperatures, initial tolerances and any other factors that influence performance. The approach includes variations from both electronic parts and circuit interface conditions. The circuit analysis is performed and repeated for the worst combination of extreme values of part parameters and interface conditions to determine the minimum and maximum performance. To ensure reliable performance of spacecraft circuits, it is essential that variations in these parameters and conditions be addressed as the design is developed.
There are several types of WCA:
- Extreme Value Analysis (EVA) - an estimate of the most extreme limits of the circuit's components and function.
- Root Sum Square (RSS) - a statistical method that assumes most component values fall toward the middle of the tolerance zone rather than at the extreme ends.
- Monte Carlo analysis - parameters are randomly selected from a distribution and the circuit is simulated, anywhere from 1,000 to 100,000 times.
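The EVA and Monte Carlo approaches can be contrasted on a toy circuit. The sketch below analyzes a hypothetical resistive voltage divider with ±5% parts; the component values and tolerances are illustrative, not from any real design:

```python
import random

def divider_vout(vin, r1, r2):
    """Output of a resistive voltage divider: Vout = Vin * R2 / (R1 + R2)."""
    return vin * r2 / (r1 + r2)

def monte_carlo_wca(n=10000, tol=0.05, seed=1):
    """Monte Carlo WCA: sample each parameter from a uniform tolerance
    band and record the spread of the output."""
    rng = random.Random(seed)
    outs = []
    for _ in range(n):
        r1 = 10e3 * (1 + rng.uniform(-tol, tol))
        r2 = 10e3 * (1 + rng.uniform(-tol, tol))
        outs.append(divider_vout(5.0, r1, r2))
    return min(outs), max(outs)

# Extreme Value Analysis: stack both tolerances in the worst directions.
eva_lo = divider_vout(5.0, 10e3 * 1.05, 10e3 * 0.95)
eva_hi = divider_vout(5.0, 10e3 * 0.95, 10e3 * 1.05)
mc_lo, mc_hi = monte_carlo_wca()
```

The Monte Carlo spread always lies inside the EVA bounds, illustrating why EVA is the most conservative of the three methods while Monte Carlo reflects how rarely tolerances stack fully worst-case.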
Direct Practitioner guidance - Coming Soon
Part Stress Analysis (PSA)/Electrical PSA (EPSA)
PSA/EPSA (aka Derating Analysis) is a design analysis that evaluates the applied stresses (e.g., operating voltages, temperatures, radiation) on components/circuits to assess operational risks. PSA/EPSA determines the performance margins, or exceedances of manufacturer limits, in order to demonstrate proper functionality of the components/circuits throughout the mission conditions. Analytical results with positive design margin will ensure proper operation and expected reliability of components/circuits; conversely, negative margin will identify potential problems for very specific conditions or intervals. This technique helps to design in circuits and circuit protections as needed. The PSA/EPSA is an often used and valuable design practice for other engineering disciplines/systems as well (electrical, systems, etc.).
The approach includes variations from both electronic parts and circuit interface conditions, as well as derating (power rating) parameters.
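The core derating-margin check reduces to comparing the applied stress against the manufacturer rating scaled by a project derating factor. The 50% power derating and part values below are hypothetical, not drawn from any specific derating standard:

```python
def derated_limit(mfr_limit, derating_factor):
    """Project derated limit = manufacturer rating * derating factor."""
    return mfr_limit * derating_factor

def stress_margin(applied, mfr_limit, derating_factor):
    """Fractional margin to the derated limit; positive means the applied
    stress is below the derated limit, negative flags an exceedance."""
    limit = derated_limit(mfr_limit, derating_factor)
    return (limit - applied) / limit

# Hypothetical: a 0.25 W rated resistor, 50% power derating, 0.10 W applied.
margin = stress_margin(0.10, 0.25, 0.5)
```

A negative `margin` for any part/condition combination is the "negative margin" finding that drives a circuit change or added protection.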
Direct Practitioner guidance - Coming Soon
Related Guidance/Literature Links: TBD
Fault Tree Analysis (FTA)
A fault tree analysis is a top-down/deductive analysis that results in a graphical logic tree that shows all the pathways through which a system can end up in a foreseeable, undesirable state or event. This event can be a loss of a function. The chief focus of a fault tree is “failure” rather than “success”. Fault trees can be used at virtually any level: module, subassembly, system, or mission. The basic idea is that logic functions such as AND, OR can be used to connect events in such a way that the various pathways to the undesirable event are clearly established.
This technique helps to identify minimum paths to failure scenarios and can be performed qualitatively or quantitatively; it can also be used to evaluate safety/hazard scenarios.
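Quantitative FTA combines basic-event probabilities upward through the gates. A minimal sketch, assuming independent basic events and hypothetical probabilities:

```python
def or_gate(probs):
    """OR gate (independent events): P = 1 - product(1 - p_i)."""
    q = 1.0
    for p in probs:
        q *= (1.0 - p)
    return 1.0 - q

def and_gate(probs):
    """AND gate (independent events): P = product(p_i)."""
    p = 1.0
    for pi in probs:
        p *= pi
    return p

# Hypothetical top event: loss of power =
#   (primary bus fails AND backup bus fails) OR power controller fails.
p_top = or_gate([and_gate([1e-3, 5e-3]), 1e-4])
```

The nesting of calls mirrors the tree structure, and each AND-gate argument list corresponds to one minimal cut set of the top event.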
Direct Practitioner guidance - Coming Soon
Probabilistic Risk Assessment (PRA)
A PRA is a comprehensive, structured, and logical analysis methodology aimed at identifying and assessing risks in complex and time-based technological systems. It incorporates inputs from FTAs, FMECAs, and reliability predictions to model and assess the likelihood of event sequences.
The PRA process and its results define the events required for success and quantify the potential scenario outcomes (success and failure), including:
- the dependencies between events;
- how damage/failures impact the system (locally and globally);
- the validity and viability of monitoring and redundancy strategies;
- the risks to mission success and safety architectures.
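Event-sequence quantification, the arithmetic core of a PRA scenario model, can be sketched as an initiating-event frequency propagated down one branch of an event tree. The initiator and branch probabilities below are hypothetical:

```python
def sequence_probability(initiator_freq, branch_probs):
    """Probability (or frequency) of one event sequence: the initiating-event
    frequency times each branch probability along the path, assuming the
    branch probabilities are conditional on the preceding events."""
    p = initiator_freq
    for b in branch_probs:
        p *= b
    return p

# Hypothetical sequence: micrometeoroid strike (1e-3 per mission),
# then shielding fails to protect (0.1), then redundant sensor fails (0.05).
loss_of_mission = sequence_probability(1e-3, [0.1, 0.05])
```

A full PRA sums such sequence probabilities over every path that ends in the same undesired end state, with the branch probabilities themselves often supplied by fault trees.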
Direct Practitioner guidance - Coming Soon
Sneak Circuit Analysis (SCA)
Sneak Circuit Analysis is a vital part of the assurance of critical electronic and electro-mechanical system performance. Sneak conditions are defined as latent hardware, software, or integrated conditions that may cause unwanted actions or may inhibit a desired function, and are not caused by component failure.
Direct Practitioner guidance - Coming Soon
Related Guidance/Literature Links: TBD
Single Event Effects Analysis (SEEA)
In the space environment, spacecraft designers have to be concerned with two main causes of Single Event Effects (SEEs): cosmic rays and high-energy protons. For cosmic rays, SEEs are typically caused by their heavy-ion component. These heavy ions cause direct-ionization SEEs, i.e., if an ion traversing a device deposits sufficient charge, an event such as a memory bit flip or transient may occur. Cosmic rays may be galactic or solar in origin. The SEEA is a design analysis to verify hardware tolerance or susceptibility (failure modes) for:
Temporary loss of operation
Damage to one or more pieces of hardware
Corrupted memory and logic
Corrupted data
System malfunctions, unexpected interrupts
Void/corrupted displays
Undesirable S/W execution
Timing impacts
Momentary loss of communication
Total mission/catastrophic failures
and other system performance effects caused by transient radiation environments (e.g., high-energy- charged particles and radiation). The purpose is to show that inherent device immunity or circuit mitigation techniques are adequate to prevent destructive effects and that system and circuit effects of upsets (memory state change, etc.) and voltage transients produce mission-tolerable errors.
Direct Practitioner guidance - Coming Soon
Related Guidance/Literature Links: TBD
Software Assessment
Coming soon - Currently being formulated
Direct Practitioner guidance - Coming Soon
Related Guidance/Literature Links: TBD
Maintainability (M) - click title for analysis guidance
The prediction of repair and maintenance measures, such as MTTR (Mean Time To Repair), MTBR/MTTS (Mean Time Between Repair/Mean Time To Service), MCMT (Mean Corrective Maintenance Time), and MPMT (Mean Preventive Maintenance Time), for performing a successful repair/refurbishment action. In other words, maintainability analyses measure the ease and speed with which maintenance tasks and/or repairs can be completed (i.e., with which a mission system can be restored to operational status after a downing event), including diagnosis time, repair time, supply time, and any testing time as applicable. In space operations these efforts indicate the probability of return to service, successful servicing, preventive maintenance, or de-orbit, and support operations and project logistics/operations planning.
Availability (A) - click title for analysis guidance
The probability that a repairable item will perform its intended function at a given point in time (or over a specified period of time). In other words, availability is the probability of an item's mission readiness: an uptime state, with the likelihood of a recoverable downtime state. Its value lies in the interval [0, 1]. Note: It is availability, not reliability, that addresses downtime (e.g., time for maintenance, repair, and replacement activities). It is important to determine whether the management question or system requirement is limited to reliability (only uptime) or pertains to availability (uptime with recoverable downtime in the near term).
There are many types of availability: Point availability (A(t)), Average availability, Steady-state availability (A(∞)), Inherent availability (Ai), Operational availability (Ao), and Data availability (Ad). Each can be assessed using demonstrated (descriptive, based on actual achieved performance) or predictive (inferential, based solely on the failure distribution (reliability math model) and the repair distribution (maintainability math model)) measures of performance. Each has its own considerations and value to the metric consumer.
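Two of the availability types above have simple steady-state forms that illustrate the distinction: inherent availability counts only corrective repair time, while operational availability charges all downtime sources. The MTBF, MTTR, and downtime hours below are hypothetical:

```python
def inherent_availability(mtbf, mttr):
    """Ai = MTBF / (MTBF + MTTR): steady-state, corrective maintenance only."""
    return mtbf / (mtbf + mttr)

def operational_availability(uptime, downtime):
    """Ao = uptime / (uptime + downtime): includes all downtime sources
    (logistics delay, supply time, preventive maintenance, etc.)."""
    return uptime / (uptime + downtime)

# Hypothetical serviceable asset: MTBF 2000 hr, MTTR 8 hr, plus logistics
# and preventive-maintenance downtime reflected in the achieved hours.
ai = inherent_availability(2000.0, 8.0)
ao = operational_availability(1900.0, 148.0)
```

Ao is never better than Ai for the same system, since it adds downtime that the inherent figure ignores; comparing the two highlights how much availability is lost to logistics rather than to the design itself.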