IT systems are complex, made of many collaborating chains. Reliability is of course a key concern, but resilience should also be considered a key factor in improving it. If potential failures are part of the initial design, resilience, and therefore reliability, will be greatly improved.
Very often, IT systems are designed as if every component worked perfectly. This may be nearly true for each individual system, but it is rarely true globally! Perfection is out of reach, and this fact should be taken into account from the design phase.
Systems that are not designed to behave gracefully in case of failure create all sorts of issues for downstream systems.
Let's illustrate this with a very common pitfall: a piece of data (a file or a single record) is missing. What should we do? Block the entire process and produce nothing, or skip the missing data, process whatever we can, and perform a partial rerun once the data becomes available?
Of course, the second option is more sensible, yet it is rarely implemented, even as a manual procedure.
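As a rough sketch of that second option, the batch below processes whatever inputs are present, records the missing ones, and can later be rerun on just that list. The folder name, file names and the `process` function are illustrative, not taken from any particular system:

```python
from pathlib import Path
import json

INPUT_DIR = Path("incoming")        # hypothetical drop folder for upstream files
MISSING_LOG = Path("missing.json")  # list of inputs to retry later

def process(path: Path) -> None:
    """Placeholder for the real per-file computation."""
    print(f"processed {path.name}")

def run(expected: list[str]) -> None:
    missing = []
    for name in expected:
        path = INPUT_DIR / name
        if path.exists():
            process(path)
        else:
            missing.append(name)        # skip it, do not block the whole run
    MISSING_LOG.write_text(json.dumps(missing))

def partial_rerun() -> None:
    """Re-process only the inputs that were missing last time."""
    still_missing = json.loads(MISSING_LOG.read_text()) if MISSING_LOG.exists() else []
    run(still_missing)

run(["sales_eu.csv", "sales_us.csv"])   # first pass: process what is there
partial_rerun()                         # later: retry only what was missing
```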
Just to highlight this point, a little anecdote: once, as an accounting system was computing millions of transactions for the end-of-month results, it got blocked because a currency value was undefined. Fortunately, as the process was considered sensitive for the company, a manual check was performed throughout the chain. The failure was detected during the night, and the on-call manager, who had not the faintest idea of the correct value, decided to set it to €1. The process could resume, and the next morning the accountants manually fixed the impacted transaction. Of course the run was not perfect, but more than 99.99% of the goal was reached!
Even more common is the "cron" syndrome. Because open-systems developers usually have little experience in managing batches, they are not used to the capabilities of enterprise-wide schedulers. They implement batches with primitive tooling, which results in very rigid chains that are not flexible enough to adjust should a problem arise.
For example, batches are very often triggered at a fixed time rather than upon a condition being met. Any delay upstream then creates an issue for the whole downstream chain. The same applies, by the way, to return codes, which are not always fully implemented: chaining jobs becomes difficult when the exact status of the previous job is not known.
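A minimal sketch of the difference, assuming a hypothetical completion marker written by the upstream job: instead of firing at a fixed time regardless, the downstream step waits for the upstream's marker and returns a meaningful exit code so the scheduler knows exactly what happened.

```python
import sys
import time
from pathlib import Path

READY_FLAG = Path("/data/upstream/DONE")   # hypothetical marker written by the upstream job

def wait_for_upstream(timeout_s: int = 3600, poll_s: int = 60) -> bool:
    """Trigger on a condition (upstream finished), not on a wall-clock time."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if READY_FLAG.exists():
            return True
        time.sleep(poll_s)
    return False

def main() -> int:
    if not wait_for_upstream():
        print("upstream never completed", file=sys.stderr)
        return 2            # distinct return code so the scheduler can react accordingly
    # ... real processing would go here ...
    return 0                # explicit success code for the next job in the chain

if __name__ == "__main__":
    sys.exit(main())
```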
Regarding batches, it is very common to see a black-and-white approach: either the batch runs or it does not. If it fails, nothing is produced, and the error has to be fixed before rerunning the complete batch. This does not work for large batches that take more than a few minutes to complete. To increase their resilience, restart points have to be defined within the job logic, so that a rerun only performs the missing computation, not the complete job.
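As a sketch of restart points, assuming the work can be split into named steps, the job below records each completed step in a checkpoint file and skips it on rerun; the step names and checkpoint file are illustrative.

```python
import json
from pathlib import Path

CHECKPOINT = Path("batch.checkpoint")   # hypothetical checkpoint file

def load_done() -> set[str]:
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def mark_done(done: set[str], step: str) -> None:
    done.add(step)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def run_batch(steps: dict) -> None:
    """Run each step once; a rerun after a failure resumes where it stopped."""
    done = load_done()
    for name, func in steps.items():
        if name in done:
            continue                    # already computed in a previous run
        func()
        mark_done(done, name)
    CHECKPOINT.unlink(missing_ok=True)  # clean up once the whole batch has succeeded

# Illustrative steps only
run_batch({
    "extract": lambda: print("extract"),
    "transform": lambda: print("transform"),
    "load": lambda: print("load"),
})
```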
Data quality is not a common concern either. Consolidation systems, for example, rely by design on inputs coming from potentially hundreds of systems. Statistically, they cannot be complete every day: some input will most likely be missing on any given day.
One way of managing such a situation is to implement a fallback mechanism that estimates the missing data, for example from the previous day's input, and to reflect this in a quality flag showing how reliable the computed figure is.
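A minimal sketch of that idea, with made-up field names: when today's figure is missing, fall back to yesterday's value and tag the result with a quality flag instead of failing the consolidation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Figure:
    value: float
    quality: str    # "ACTUAL" if reported today, "ESTIMATED" if carried over

def consolidate(today: Optional[float], yesterday: Optional[float]) -> Optional[Figure]:
    """Prefer today's input; otherwise estimate from yesterday's and flag it."""
    if today is not None:
        return Figure(today, "ACTUAL")
    if yesterday is not None:
        return Figure(yesterday, "ESTIMATED")   # fallback: previous day's figure
    return None                                 # no basis for an estimate at all

print(consolidate(None, 102.5))   # Figure(value=102.5, quality='ESTIMATED')
```

Downstream consumers can then decide whether an "ESTIMATED" figure is good enough for their purpose, rather than receiving nothing at all.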
In a nutshell, let's move away from the optimistic approach and build systems that are ready to fail, with the needed fallback mechanisms implemented from the design phase!