RMA Engagement

NASA designs and operates complex missions and systems. In identifying the full set of expectations for a mission, a NASA engineering team is wise to engage various communities, such as those working in systems engineering, orbital debris, space asset protection, human systems integration, quality assurance, reliability engineering, systems safety, and design, to ensure that a complete set of expectations is captured. Early engagement (i.e., during concept formulation) helps prevent “surprise” features or performance risks from arising later in the life cycle and avoids sole reliance on the intuition or insights of a single manager or discipline.

Engaging Reliability Engineering at NASA allows a project/mission to better ensure that systems perform as required over their life cycles to satisfy mission objectives, including safety, reliability, maintainability, and quality assurance requirements. This is done by applying engineering knowledge and specialized techniques and mathematics to identify the likelihood or frequency of failures, and to identify the risks, causes, and options for prevention of failures and of availability, maintainability, and performance issues. As such, this discipline is best engaged during Pre-Phase A and can continue to aid a project well into operations and disposal.

Synopsis of RMA Methods, Timing, and Value

Reliability Engineering considers or analyzes Reliability, Maintainability, and Availability (RMA) to identify or determine Failure Risks, Recovery Strategies, Robust Design Options, Availability Risks, Maintenance Planning Needs, System Monitoring Optimization Options, Exigency Operations Planning Needs/Options, and Longevity/Life forecasts as shown below. If you have comments, feedback, improvements, or corrections, provide them via the R&M Feedback Form to enable the continued readiness and continuous relevance of the data provided within this site.

Failure Risk identification is done by the Reliability Engineer by synthesizing the results of Failure Modes, Effects, and Criticality Analyses (FMECA/FMEAs), Qualitative/Quantified Fault Trees, Part Stress Analysis (PSA), Worst Case Analysis (WCA), Sneak-Circuit Analysis, and/or Probabilistic Risk Assessments (PRAs) to generate candidate project risks. These risks could be technical, programmatic, institutional, acquisition, and/or safety related and would be handled and managed throughout the project's Risk-Informed Decision Making (RIDM) and Continuous Risk Management (CRM) processes. See NASA/SP-2011-3422 for more information on those practices.
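As a rough illustration of how a Quantified Fault Tree feeds a candidate risk, the sketch below combines basic-event probabilities through AND and OR gates. It assumes the basic events are statistically independent, and the event names and probabilities are hypothetical values chosen only for illustration, not data from any NASA analysis.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if every (independent) input event occurs."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one (independent) input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities over the mission (illustrative only).
p_primary_pump = 0.02
p_backup_pump = 0.05
p_controller = 0.001

# Top event: loss of cooling occurs if both pumps fail OR the shared controller fails.
p_both_pumps = and_gate([p_primary_pump, p_backup_pump])
p_loss_of_cooling = or_gate([p_both_pumps, p_controller])

print(f"P(both pumps fail)  = {p_both_pumps:.6f}")
print(f"P(loss of cooling)  = {p_loss_of_cooling:.6f}")
```

In this made-up case the shared controller contributes about as much to the top event as the redundant pump pair, which is the kind of finding that might be raised as a candidate risk.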

Since risk management is best done before new processes or projects are conceptualized, before design or process changes are implemented, or when new information concerning system failure risks becomes available, it is best to allow Reliability Engineers to contribute their identified risks to RIDM/CRM processes early in the design or decision-making cycle to avoid missteps and additional risks.

The value of Failure Risk identification support is:
  1. That risk-informed failure mitigation/avoidance plans can be developed prior to failure occurrence;
  2. That it identifies alternative solutions or appropriate constraints on system use;
  3. That known risks are less likely to impact mission performance since they can be actively managed;
  4. TBD
Recovery Strategy development by a Reliability Engineer uses the results of Failure Modes, Effects, and Criticality Analyses (FMECA/FMEAs) with mitigation and detection information, and sometimes Qualitative/Quantified Fault Trees or Probabilistic Risk Assessments (PRAs), to recommend effective safing and failure recovery strategies. These recommendations vary depending on the system being analyzed but tend to include part substitutions, redundancy changes, life/stress/functional testing, quality assurance measures, automated safing/detection/switching, and/or re-designs.

Since recovery strategy implementation can greatly impact the final system, recommendations are best shared early in the design phase and updated for design changes through launch by keeping supporting analyses current. After launch, Fault Detection, Isolation, and Recovery (FDIR) rules and systems should be informed by operational trends and updated analysis as warranted.

The value of Recovery Strategy development support is:
  1. That failure risks have planned mitigation/avoidance prior to occurrence;
  2. That redundancy is optimized but not excessive;
  3. That testing and operation plans (e.g., safe modes) are varied and risk based;
  4. That sensors and monitoring are optimized for survival;
  5. That systems will be protected from failure but may still be allowed to degrade gracefully;
  6. TBD
Assessing the robustness of a design is completed by assessing the vulnerabilities of the system or systems to failure. Vulnerability assessments or failure risk assessments are formulated by a Reliability Engineer through the use of Failure Modes, Effects, and Criticality Analyses (FMECA/FMEAs), Qualitative/Quantified Fault Trees, Part Stress Analysis (PSA), Worst Case Analysis (WCA), Sneak-Circuit Analysis, and/or Probabilistic Risk Assessments (PRAs).

These analyses use mission success criteria, configuration/interface data, and known failure susceptibilities/causes to generate plausible failure scenarios. They are therefore best performed early in the design phase and updated for design changes through launch to support trades, testing issue investigations, and Fault Recovery Strategies. After launch, the same analyses can be used to inform anomaly investigations and support operational-change evaluations.

In addition, probability estimation can be used to support developing a robust design based on design/redundancy trades and should be updated as designs change to support de-orbit or other analyses as necessary. As such, these estimations are best performed from early in the design phase through launch. After launch, probability estimation should be updated with system performance/use data to support mission extension and disposal decision making, since pre-launch usage assumptions may no longer be valid.
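A design/redundancy trade of the kind described above can be sketched with a k-out-of-n model. This is a minimal illustration that assumes identical, independent units with no common-cause failures; the unit reliability of 0.90 is a hypothetical value, and the function name is introduced here only for the example.

```python
from math import comb


def k_out_of_n_reliability(k, n, unit_reliability):
    """Probability that at least k of n identical, independent units survive."""
    return sum(
        comb(n, i) * unit_reliability**i * (1.0 - unit_reliability) ** (n - i)
        for i in range(k, n + 1)
    )


unit_r = 0.90  # hypothetical mission reliability of one unit

# Candidate architectures for the trade (k units required out of n installed).
options = {
    "single string (1 of 1)": k_out_of_n_reliability(1, 1, unit_r),
    "dual redundant (1 of 2)": k_out_of_n_reliability(1, 2, unit_r),
    "triple redundant (1 of 3)": k_out_of_n_reliability(1, 3, unit_r),
    "two of three required": k_out_of_n_reliability(2, 3, unit_r),
}

for name, reliability in options.items():
    print(f"{name:28s} R = {reliability:.4f}")
```

Under these assumptions each added unit buys a smaller reliability gain than the one before it, which is one way to judge when redundancy stops being worth its mass, cost, and complexity.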

The value of Robust Design assessment support is:
  1. That failure risks are known and can be mitigated prior to occurrence;
  2. That analyses identify interface dependencies and potentially missed requirements;
  3. That redundancy is optimized but not excessive;
  4. That monitoring and detection systems can be optimized to assist in avoiding failures;
  5. That operation plans are risk based;
  6. That results inform Fault Detection, Isolation, and Recovery (FDIR) planning;
  7. TBD
Reliability Engineering enables Availability risk determination through Availability Analysis. In an Availability Analysis the Reliability Engineer evaluates the potential planned and unplanned down-times to determine the system's or data generating function's availability for operations over a specific period of time (Mission Life). This may mean the evaluation of both ground and space assets and may involve multiple missions.

Availability Analysis is best performed when concepts of operations are being formulated and it is wise to continue to monitor and forecast as operations continue.
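The planned and unplanned down-time evaluation described above can be sketched as a simple ratio of up-time to total time over the Mission Life. The five-year period and the downtime figures below are hypothetical, and a real Availability Analysis would typically model the downtime contributors in more detail.

```python
def operational_availability(period_hours, planned_down_hours, unplanned_down_hours):
    """Fraction of the evaluation period in which the system is up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    downtime = planned_down_hours + unplanned_down_hours
    if downtime > period_hours:
        raise ValueError("total downtime cannot exceed the evaluation period")
    return (period_hours - downtime) / period_hours


mission_hours = 5 * 8760      # hypothetical five-year Mission Life, in hours
planned_down = 120.0          # e.g., scheduled calibrations and software uploads
unplanned_down = 310.0        # e.g., forecast safe-mode entries and recoveries

availability = operational_availability(mission_hours, planned_down, unplanned_down)
print(f"Forecast availability over mission life: {availability:.4f}")
```

Re-running the same calculation with observed downtime after launch is one way to keep the forecast current as operations continue.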

The value of Availability determination is:
  1. That sparing is optimized but not excessive;
  2. That operational downtime is planned for;
  3. TBD
Maintenance Planning increases in effectiveness and fiscal soundness when informed by Reliability Engineering Maintainability Analysis (MA). MA determines the probability of performing a successful repair action within a given time, considering consumables. In other words, maintainability measures the ease and speed with which maintenance tasks and/or repairs can be performed (i.e., how readily a mission system can be restored to operational status after a downing event occurs), including diagnosis time, repair time, supply time, and any testing time as applicable. In space operations these efforts would indicate the probability of return to service, successful servicing, preventive maintenance, or de-orbit, and would support operations and project logistics/operations planning.
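The probability of completing a repair within a given time can be sketched as follows. The example assumes restore times are exponentially distributed around a mean built from the diagnosis, supply, repair, and testing times named above; that distribution is a common simplifying assumption rather than something this page prescribes, and all durations are hypothetical.

```python
import math


def mean_time_to_restore(diagnosis_h, supply_h, repair_h, test_h):
    """Mean restore time as the sum of its contributing activities (hours)."""
    return diagnosis_h + supply_h + repair_h + test_h


def maintainability(allowed_hours, mttr_hours):
    """P(restore completed within allowed_hours), assuming exponential restore times."""
    return 1.0 - math.exp(-allowed_hours / mttr_hours)


# Hypothetical activity durations for one downing event, in hours.
mttr = mean_time_to_restore(diagnosis_h=2.0, supply_h=6.0, repair_h=3.0, test_h=1.0)

for allowed in (12.0, 24.0, 48.0):
    print(f"M({allowed:4.0f} h) = {maintainability(allowed, mttr):.3f}")
```

Because supply time dominates the hypothetical mean here, the sketch suggests that sparing decisions would move the result more than faster repair procedures would.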

MA is best performed when systems trades are in progress and maintenance

The value of MA is:
  1. That sparing is optimized but not excessive;
  2. That failure prevention operations are planned for and scheduled to avoid operational downtime;
  3. That maintenance is optimized;
  4. TBD
Monitoring Optimization is achieved when just enough sensors are in place to facilitate proactive mitigation of failures by reacting to appropriate pre-failure symptoms. A Reliability Engineer can recommend effective sensor optimization by testing simulated sensor configurations for recognition of failure scenarios (previously identified using Failure Modes, Effects, and Criticality Analyses (FMECA/FMEAs) with mitigation and detection information, and sometimes Qualitative/Quantified Fault Trees or Probabilistic Risk Assessments (PRAs)).
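One simplified way to picture "just enough sensors" is as a coverage problem: given which failure scenarios each candidate sensor can recognize, pick a small set that recognizes them all. The sketch below uses a greedy set-cover heuristic in place of the simulation-based testing described above; greedy selection is not guaranteed to find the smallest possible set, and the sensor and scenario names are hypothetical.

```python
def select_sensors(detects, scenarios):
    """Greedy set cover: repeatedly add the sensor that recognizes the most
    failure scenarios that are still undetected.

    Returns (chosen sensors in selection order, scenarios left undetected).
    """
    uncovered = set(scenarios)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(detects), key=lambda s: len(detects[s] & uncovered))
        gained = detects[best] & uncovered
        if not gained:
            break  # no candidate sensor recognizes the remaining scenarios
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical detection map: candidate sensor -> failure scenarios it recognizes.
detects = {
    "bus_voltage": {"battery_cell_short", "solar_array_string_loss"},
    "wheel_current": {"reaction_wheel_bearing_wear"},
    "wheel_temperature": {"reaction_wheel_bearing_wear", "heater_stuck_on"},
    "panel_temperature": {"heater_stuck_on"},
}
scenarios = set().union(*detects.values())

chosen, undetected = select_sensors(detects, scenarios)
print("Selected sensors:", chosen)
print("Undetected scenarios:", sorted(undetected))
```

Any scenarios returned as undetected would point to a monitoring gap that needs an additional sensor, a software check, or an accepted risk.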

Since monitoring optimization requires sensor or sensor-software implementation, it can greatly impact the final system. Recommendations are therefore best shared early in the design phase and updated for design changes through launch by keeping supporting analyses current. After launch, Fault Detection, Isolation, and Recovery (FDIR) rules and systems should be informed by operational trends and updated analysis as warranted.

The value of Monitoring Optimization support is:
  1. That failures may be avoided before they occur;
  2. That sensors and sensor processing are optimized but not excessive;
  3. That testing and operation plans (e.g., safe modes) are varied and risk based;
  4. That maintenance and preventive maintenance can be triggered not just scheduled;
  5. TBD
Exigency Operations Planning defines Contingency Operations Operating Procedures/Processes (COOPs). COOP or Exigency planning is very similar to Recovery Strategy development in that a Reliability Engineer uses the results of Failure Modes, Effects, and Criticality Analyses (FMECA/FMEAs) with mitigation and detection information, and sometimes Qualitative/Quantified Fault Trees or Probabilistic Risk Assessments (PRAs), to recommend effective strategies for continued operations. The difference is that, instead of returning the system to a nominal/recovered state, operations continue with the failure in place.

Since COOPs are best defined after the system or systems are well defined, this work is best done just prior to launch readiness. After launch, COOPs should be kept current with operational configuration changes and updated analysis as warranted.

The value of Exigency Operations Planning or COOP development support is:
  1. That systems can be returned to operations without fixing the failure;
  2. That operation plans are kept current with existing operational state;
  3. TBD
Longevity Forecasting is completed by assessing the probability of failure of the system or systems and/or consumable availability projections. The probability estimation (i.e., Probability of Success (Ps) values) can be formulated by a Reliability Engineer through the use of Reliability Block/Logic Diagrams, Quantified Fault Trees, or Probabilistic Risk Assessment. The consumable availability projection, in turn, determines when wear limits will be reached or finite supplies will be depleted such that the mission can no longer continue viably.
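The two inputs to a longevity forecast can be combined in a small sketch: the duration for which Ps stays above a chosen threshold, and the time until a consumable runs out, with the shorter of the two limiting the mission. The example assumes a constant system failure rate (an exponential model) and a constant consumable usage rate; the failure rate, Ps threshold, and propellant figures are hypothetical.

```python
import math


def probability_of_success(failure_rate_per_year, years):
    """Ps for a constant failure rate (exponential model)."""
    return math.exp(-failure_rate_per_year * years)


def years_above_threshold(failure_rate_per_year, min_ps):
    """Longest duration for which Ps stays at or above min_ps."""
    return -math.log(min_ps) / failure_rate_per_year


def years_until_depleted(remaining_kg, usage_kg_per_year):
    """Time until a finite supply is exhausted at a constant usage rate."""
    return remaining_kg / usage_kg_per_year


failure_rate = 0.03  # hypothetical system failure rate, per year
reliability_limit = years_above_threshold(failure_rate, min_ps=0.80)
consumable_limit = years_until_depleted(remaining_kg=60.0, usage_kg_per_year=7.5)

forecast_years = min(reliability_limit, consumable_limit)
print(f"Ps >= 0.80 for about {reliability_limit:.2f} years")
print(f"Propellant lasts about {consumable_limit:.2f} years")
print(f"Forecast viable life: about {forecast_years:.2f} years")
```

After launch, the same calculation could be repeated with observed failure and usage data, since the pre-launch rates may no longer hold.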

Probability estimation is best performed from early in the design phase through launch to support trades and should be updated for design changes. Similarly, consumable availability or viability will determine the length of a mission, so it is best assessed in the design phase as well to establish mission expectations. After launch, probability estimation and consumable availability should be updated to support mission extension and disposal decision making, since pre-launch usage assumptions may no longer be valid.

The value of Longevity Forecasting support is:
  1. That redundancy is optimized but not excessive;
  2. That operation plans are risk and viability based;
  3. TBD
Reliability Engineering considers or analyzes Reliability, Maintainability, and Availability (RMA) from gathered data and modelling to aid in making decisions concerning system failure, operational readiness and success, maintenance and service requirements, and the collection and organization of information from which the effectiveness of a system can be evaluated and improved. 'RMA findings can be considered to be characteristics of a system in the sense that they can be estimated in design, controlled in manufacturing, measured during testing, sustained in the field, and decommissioning.' (Pecht, 1995)

The application of RMA methods is best started early in concept formulation and continued through operations and includes engineering design, manufacturing, testing, and analysis to answer specific mission questions with results/findings.

Overall, engaging Reliability Engineering reduces life-cycle cost by:
  1. Efficiently and effectively identifying limitations within a system that may cause a failure before the intended lifetime;
  2. Identifying unreliable systems that may pose a safety or health hazard;
  3. Providing specific Reliability requirements for component procurement;
  4. Identifying potentially wasted efforts and/or investments (hardware/software/testing) that were intended to improve Reliability, Availability, and/or Maintainability but are providing little value;
  5. Improving systems' operational use by increasing their design life;
  6. Enabling the elimination of failures and safety risks, or the reduction of their likelihood, through identification;
  7. Enabling reduced downtime (maintenance), thereby increasing available operating time; and
  8. Ensuring mission success and continued advancement of earth/space science and discovery.

Definitions:

Reliability (R)

The probability or likelihood that a component or system will perform its intended function with no failures over a given period of time (mission time) when used under specific operating conditions (test environment or operating environment). It may also be expressed inversely, as the likelihood of failure or failure scenarios, or as a component's or system's susceptibility to failures.

Maintainability (M)

Maintainability is defined as the probability of performing a successful repair action within a given time, considering consumables. In other words, maintainability measures the ease and speed with which maintenance tasks and/or repairs can be performed (i.e., how readily a mission system can be restored to operational status after a downing event occurs), including diagnosis time, repair time, supply time, and any testing time as applicable. In space operations these efforts would indicate the probability of return to service, successful servicing, preventive maintenance, or de-orbit, and would support operations and project logistics/operations planning.

Availability (A)

The probability that a system will perform its intended function at a given point in time or over a specified period of time when operated and maintained in a prescribed manner. Thus, availability is a function of reliability and maintainability.
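As one common illustration of that dependence (an assumption here, not a formula stated on this page), steady-state inherent availability is often written in terms of mean time between failures (MTBF), a reliability measure, and mean time to repair (MTTR), a maintainability measure:

```latex
A_i = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
```

For example, a hypothetical MTBF of 990 hours and MTTR of 10 hours would give an inherent availability of 0.99. This form ignores planned downtime and logistics delays, so it tends to be an upper bound on what is achieved in operations.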


For Performance, see the Guidance and Reference Data page or the Reliability section of the KSC Integrated Design and Assurance System (IDAS).






