Levels of malfunction: Analysis of malfunctions

Several different conditions can lead to a microcomputer malfunction. We classify these conditions into levels. The table below orders them by the strength of the electrostatic discharge (ESD) or other disturbance that causes the malfunction.

Levels Description
Level 0 Normal: In this condition, the CPU is running without any problem.
Level 1 Spurious interrupt: The CPU receives an interrupt request even though no peripheral generated an interrupt. Because the interrupt detection circuit is edge-sensitive, this is the easiest error to provoke; an ESD of less than 100 V can cause it.
Level 2-1 Modified peripheral settings: Peripherals are connected through the pins of the microcomputer's IC package to external circuits such as sensors and drivers. Noise can enter the peripheral circuits through these signal lines and disturb the flip-flops that make up the peripheral control and data registers. A fundamental premise of FUJIMI is that register contents are not always reliable. For this reason, one FUJIMI option is to rewrite the peripheral settings incrementally, so that all of them are completely rewritten every few interrupt cycles.
Level 2-2 Single RAM word corrupted: If electrical noise enters the microcomputer while the CPU is in a RAM write cycle, the word being written may be corrupted. Whether this happens depends critically on timing. To counter this error, a FUJIMI system should store each important parameter in three separate locations, so that majority logic can compare the three values and select the one on which at least two copies agree. It is also recommended to check each parameter's range just before it is used; if the value is out of range, the program can substitute a default or other suitable value.
Level 3 CPU error: FUJIMI was developed to handle this level of error. Consider what happens if the CPU's Program Counter is erroneously incremented by one: depending on the case, either one instruction is skipped or an operand is read and executed as an instruction. Either problem can leave the whole system in an abnormal state indefinitely. A second kind of CPU error arises between the sequencers inside the CPU, which handle bus access, ALU operation, pipeline operation, and so on. If noise causes a cue from one sequencer to another to be lost, the second sequencer keeps waiting for the cue while the first sequencer has already completed its operation. In this case, the only way to resume CPU operation is a reset.
Level 4 RAM contents extensively corrupted: If many bytes of RAM are altered, there is no way to resume the system quickly; even with FUJIMI, the whole system must be re-initialized. This problem is often caused by severe noise reaching the power supply.
Level 5 Latch-up: If an IC chip is subjected to extreme noise, its multilayer (PNPN) structure may begin operating as an SCR (thyristor), and the chip will draw as much current as the supply can deliver. Once this has happened, the only way to recover is to shut off power, and this must be done very quickly; otherwise the silicon can become very hot and even melt, permanently destroying the IC. FUJIMI cannot handle this level, because the FUJIMI circuit itself may be latched up and cannot power off its own silicon. The power-off must be performed by an external circuit.
Remark: Even when a microcomputer's IC chip is latched up, the system almost always appears to operate normally, so latch-up is very difficult to detect from the system's behavior alone.
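The Level 2-2 recommendation (three copies, majority logic, range check just before use) can be sketched in C as follows. This is a minimal illustration under stated assumptions, not FUJIMI code: the type `triple_u16` and the functions `triple_write()`, `triple_read()` and `triple_get_checked()` are hypothetical names, and the three copies are kept in one struct only for brevity.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical triple-redundant storage for one 16-bit parameter.
   In a real system the three copies would ideally live in separate
   RAM areas so that one corrupted write cycle can hit at most one. */
typedef struct {
    uint16_t copy[3];
} triple_u16;

/* Store the same value in all three locations. */
static void triple_write(triple_u16 *p, uint16_t value)
{
    p->copy[0] = value;
    p->copy[1] = value;
    p->copy[2] = value;
}

/* Majority logic: if at least two copies agree, return that value and
   repair the disagreeing copy. Returns false when all three differ. */
static bool triple_read(triple_u16 *p, uint16_t *out)
{
    uint16_t a = p->copy[0], b = p->copy[1], c = p->copy[2];
    uint16_t v;

    if (a == b || a == c) {
        v = a;
    } else if (b == c) {
        v = b;
    } else {
        return false;                 /* no majority */
    }
    triple_write(p, v);               /* restore the corrupted copy */
    *out = v;
    return true;
}

/* Range check just before use: fall back to a default when there is no
   majority or the voted value lies outside [min, max]. */
static uint16_t triple_get_checked(triple_u16 *p, uint16_t min,
                                   uint16_t max, uint16_t fallback)
{
    uint16_t v;

    if (!triple_read(p, &v) || v < min || v > max) {
        triple_write(p, fallback);    /* re-establish a known-good value */
        return fallback;
    }
    return v;
}
```

Voting on whole words matches the text's "compare the three data values"; a bitwise vote, `(a & b) | (a & c) | (b & c)`, is a common alternative that can still recover a value when every copy has a different single-bit error.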

FUJIMI is not only a method of resuming system operation but also a set of overall system requirements for highly reliable systems.
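One such requirement, the Level 2-1 option of rewriting peripheral settings incrementally, might look like the C sketch below. It is hypothetical: `reg_setting`, `refresher`, `refresh_step()` and `REGS_PER_TICK = 3` are assumptions, and `reg` points at ordinary memory here so the logic can run on a host; on a microcomputer it would point at the device's memory-mapped registers.

```c
#include <stdint.h>
#include <stddef.h>

#define REGS_PER_TICK 3u   /* assumption: registers rewritten per interrupt */

/* One entry of a hypothetical shadow table: a register and the value
   it is supposed to hold. */
typedef struct {
    volatile uint32_t *reg;
    uint32_t           value;
} reg_setting;

/* Refresh state: the table and the index of the next register to rewrite. */
typedef struct {
    const reg_setting *table;
    size_t             count;
    size_t             next;
} refresher;

/* Called once per periodic interrupt. Rewrites the next REGS_PER_TICK
   registers from the table, wrapping around, so every register is
   rewritten at least once every ceil(count / REGS_PER_TICK) calls. */
static void refresh_step(refresher *r)
{
    for (size_t i = 0; i < REGS_PER_TICK && r->count > 0; i++) {
        const reg_setting *s = &r->table[r->next];
        *s->reg = s->value;                   /* overwrite any noise-induced change */
        r->next = (r->next + 1) % r->count;
    }
}
```

Only registers that can be rewritten without side effects (typically configuration registers, not status, data or FIFO registers whose writes trigger actions) belong in such a table; which registers qualify depends on the device.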

【Solution】

All microcomputer systems will malfunction given a sufficient level of noise, and other methods cannot completely prevent this. With the new FUJIMI High Resilience system technology, the CPU core is reset again and again, hundreds of times per second. Because the reset rate is so high, many types of malfunction are cleared almost as soon as they occur, and both the disruption and the recovery are effectively invisible to the system's users.

We are often asked whether such frequent resets and their handling routines reduce the CPU's performance. They do so only slightly. Only the CPU core, which plays the role of the system's brain, is reset; because most RAM contents and peripheral settings are kept as they are, a reset and its handler take very little time. The loss in CPU performance is therefore only around one to two percent, similar to the overhead of other interrupts in common microcomputer systems.
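As a rough check of the one-to-two-percent figure, the fraction of CPU time spent on resets is the reset rate multiplied by the time taken by one reset and its handler. The rate of 500 resets per second and the 30 µs handling time below are illustrative assumptions, not measured FUJIMI values:

```latex
\[
\eta = f_{\mathrm{reset}} \, t_{\mathrm{reset}}
\]
\[
f_{\mathrm{reset}} = 500\ \mathrm{s^{-1}}, \quad t_{\mathrm{reset}} = 30\ \mu\mathrm{s}
\;\Rightarrow\;
\eta = 500\ \mathrm{s^{-1}} \times 30 \times 10^{-6}\ \mathrm{s} = 0.015 = 1.5\,\%
\]
```

Read the other way, at 500 resets per second an overhead of 1 to 2 percent leaves a budget of 20 to 40 µs for each reset and its handler.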

【Evolution of FUJIMI technology】

In many embedded systems, the microcomputer is equipped with a dedicated timer that periodically generates interrupts. This is called a “Real Time Interrupt”, and many common operations, such as software timers, can be implemented with it. Our first modification was to have this interrupt reset the entire system (a “Real Time Reset”), and it worked. However, in a Real Time Reset system, every application software module must complete within a limited execution time, which was difficult to achieve. Furthermore, other interrupts could not be allowed, because an interrupt and its handler would add execution time, causing an application module to exceed its time limit and the system to malfunction. To avoid these and other problems, we decided to use a pair of interrupts and a CPU-core-only reset. With this combination, application software modules have no time limit, and interrupts can be used normally. Because the CPU core is still reset periodically, we can guarantee that the CPU appears to function normally over any interval longer than the timer period. In addition, almost all existing application software can be used unmodified.
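The software timer mentioned above as a typical use of the Real Time Interrupt can be sketched in C. This is a generic, hypothetical example, not part of FUJIMI: `rti_handler()` stands for the handler the timer interrupt would call once per tick, and `NUM_SW_TIMERS`, `sw_timer_start()` and `sw_timer_expired()` are illustrative names.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SW_TIMERS 4u   /* assumption: number of software timers */

/* Remaining ticks per timer; 0 means idle or expired.
   volatile because rti_handler() runs in interrupt context. */
static volatile uint32_t sw_timer[NUM_SW_TIMERS];

/* Start timer `id` so that it expires after `ticks` RTI periods.
   On an 8- or 16-bit MCU a 32-bit access is not atomic, so a real port
   would disable the RTI around this write (and around reads). */
static void sw_timer_start(unsigned id, uint32_t ticks)
{
    sw_timer[id] = ticks;
}

/* True once the timer has counted down to zero. */
static bool sw_timer_expired(unsigned id)
{
    return sw_timer[id] == 0u;
}

/* Real Time Interrupt handler: decrement every running timer by one tick. */
static void rti_handler(void)
{
    for (unsigned i = 0; i < NUM_SW_TIMERS; i++) {
        if (sw_timer[i] > 0u) {
            sw_timer[i]--;
        }
    }
}
```

For example, with a 1 ms tick (an assumed period), `sw_timer_start(0, 100)` makes `sw_timer_expired(0)` become true after 100 interrupts, about 100 ms later.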

This new technology is already patented in Japan and we have filed for patents in many other nations.

【Interrupt routine protocol】

In FUJIMI, when control passes to the FUJIMI interrupt routine, the routine should examine whether the application software that was executing when the interrupt occurred was behaving normally.

If it was, the routine should make multiple copies of the return information and set the Save-OK flag. This flag is among the first things checked by the reset routine. If the flag is valid, the reset routine should go on to examine the system parameters, and it may perform optional application routines and partial peripheral initialization. Then, using the saved return information, program flow returns to the application software.

In this way, the FUJIMI interrupt routine also handles the CPU-core-only reset, and the reset becomes transparent to the application and the user.
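The save/resume handshake described above can be modeled in portable C so its logic can be exercised on a host. This is our reading of the protocol, not the FUJIMI implementation: `fujimi_ctx`, `fujimi_interrupt()` and `fujimi_reset()` are hypothetical names, the return information is reduced to a single program-counter value, the "behaving normally" check is passed in by the caller, and the CPU-core reset itself, which is hardware-specific, is not modeled. Voting on the copies follows the Level 2-2 majority-logic recommendation.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical context surviving the CPU-core-only reset (kept in RAM). */
typedef struct {
    uint32_t ret_addr[3];   /* multiple copies of the return information */
    bool     save_ok;       /* Save-OK flag */
} fujimi_ctx;

/* Interrupt routine: save return information only if the interrupted
   application was behaving normally (verdict supplied by the caller). */
static void fujimi_interrupt(fujimi_ctx *c, uint32_t interrupted_pc,
                             bool app_normal)
{
    c->save_ok = false;                  /* invalidate before rewriting copies */
    if (!app_normal) {
        return;                          /* flag stays clear: reset will not resume */
    }
    for (int i = 0; i < 3; i++) {
        c->ret_addr[i] = interrupted_pc;
    }
    c->save_ok = true;                   /* set only after all copies are written */
}

/* Reset routine: check the Save-OK flag first. On success, store the
   address to resume at and return true; return false when the system
   must instead be recovered by other means (e.g. initialization). */
static bool fujimi_reset(const fujimi_ctx *c, uint32_t *resume_pc)
{
    if (!c->save_ok) {
        return false;
    }
    /* System-parameter checks, optional application routines and partial
       peripheral initialization would run here. */
    uint32_t x = c->ret_addr[0], y = c->ret_addr[1], z = c->ret_addr[2];
    if (x == y || x == z) { *resume_pc = x; return true; }
    if (y == z)           { *resume_pc = y; return true; }
    return false;                        /* copies disagree: do not trust them */
}
```

Setting the flag last and clearing it first means that a disturbance in the middle of saving leaves the flag clear, so the reset routine never resumes from half-written return information.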

If the interrupt routine finds a problem, the programmer has many options for resuming or recovering the system, including going to a fail-safe state, failing a specific task, continuing to run, full initialization, and so on. Because of this flexibility, we call it a “High Resilience system”, and it was awarded First Prize by the Award Committee at Embedded Technology 2011.
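One way a programmer might map the interrupt routine's findings to the recovery options listed above is sketched below. Everything here is hypothetical: the `check_result` fields, the limit `MAX_REPAIRABLE_PARAMS`, and the priority order are illustrative choices, not part of FUJIMI. The rule that many bad parameters force initialization mirrors the Level 4 case, where extensive RAM corruption requires re-initialization.

```c
#include <stdbool.h>

#define MAX_REPAIRABLE_PARAMS 2u   /* assumption: beyond this, treat as Level 4 */

/* Recovery options named in the text. */
typedef enum {
    RECOVER_CONTINUE,     /* continue running */
    RECOVER_FAIL_TASK,    /* fail one specific task */
    RECOVER_FAIL_SAFE,    /* drive the system to a fail-safe state */
    RECOVER_INITIALIZE    /* full initialization */
} recovery_action;

/* Hypothetical summary of what the interrupt routine's checks found. */
typedef struct {
    unsigned bad_params;        /* parameters that failed voting or range checks */
    bool     task_stalled;      /* one task stopped making progress */
    bool     return_info_valid; /* saved return information is usable */
    bool     outputs_unsafe;    /* an output is in a hazardous state */
} check_result;

/* One possible policy; the order of the tests sets the priority. */
static recovery_action choose_recovery(const check_result *r)
{
    if (r->outputs_unsafe)                      return RECOVER_FAIL_SAFE;
    if (!r->return_info_valid)                  return RECOVER_INITIALIZE;
    if (r->bad_params > MAX_REPAIRABLE_PARAMS)  return RECOVER_INITIALIZE;
    if (r->task_stalled)                        return RECOVER_FAIL_TASK;
    return RECOVER_CONTINUE;   /* a few bad parameters were replaced by defaults */
}
```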