Recovering from the Loss of the Primary Host
In this procedure, a failed host means a host that is permanently lost. For example, a disk crash or other hardware failure leaves the host unable to restart and rejoin the grid.
This recovery procedure has the following prerequisites:
- A full backup of the system must be available to restore. The backup should include both the system files and the grid folder structure.
- The replacement host must have the same FQDN as the original host.
To recover from the loss of the primary host
Note that additional operations may be required, or this procedure may not apply, depending on the application that the grid was delivered with. Refer to the application documentation for further guidance.
If the primary host fails, the agent on the configured secondary host starts a secondary registry and an admin router on the secondary host. Because the ports configured for these two entities are reused on the secondary host, those ports must not already be occupied there. Therefore, the fixed ports chosen for the registry and the admin router must be outside the ephemeral port range; otherwise, another grid node may already have claimed them.
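The port requirement above can be checked on the secondary host before relying on failover. The following is a minimal sketch, not part of the product: it assumes a Linux host (where the ephemeral range is published in /proc/sys/net/ipv4/ip_local_port_range) and falls back to the IANA dynamic range 49152-65535 elsewhere. The function names and the port numbers in the usage note are hypothetical.

```python
import socket
from pathlib import Path

# Linux publishes the ephemeral (local) port range in this file.
LINUX_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")
# Assumption: fall back to the IANA dynamic port range if the file is unreadable.
FALLBACK_RANGE = (49152, 65535)


def ephemeral_port_range(range_file=LINUX_RANGE_FILE):
    """Return (low, high) for the host's ephemeral port range."""
    try:
        low, high = range_file.read_text().split()
        return int(low), int(high)
    except (OSError, ValueError):
        return FALLBACK_RANGE


def is_outside_ephemeral_range(port, port_range):
    """True if the port cannot be handed out as an ephemeral port."""
    low, high = port_range
    return not (low <= port <= high)


def is_port_free(port, host="0.0.0.0"):
    """True if a TCP socket can currently be bound to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True
```

For example, with hypothetical registry and admin router ports 5000 and 8000, calling `is_outside_ephemeral_range(5000, ephemeral_port_range())` and `is_port_free(5000)` on the secondary host indicates whether the port is a safe fixed choice and whether it is currently unoccupied. A free port inside the ephemeral range can still be claimed later, which is why both checks matter.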
The secondary host's agent usually detects and recovers from a primary host failure in less than a minute.
When the primary host is recovered and the primary registry is back online, the secondary registry and admin router shut down automatically.
If the primary host fails, the root certificate is unavailable because the primary host is the only place where it exists. Without this certificate, you cannot scale out the Grid to new hosts. To mitigate this, a router checks on start-up whether it has the Grid's root certificate on disk. If the certificate is not on disk, the router can copy it from a host where it is available. The recommended way to achieve this root certificate copying is to set up the Default Router to run on all hosts.
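The start-up check described above amounts to "use the local copy if present, otherwise copy from the first host that has one". The sketch below illustrates only that logic and is not the router's actual implementation: the function name is hypothetical, and plain file paths stand in for remote hosts, whereas the real router copies the certificate over the grid.

```python
import shutil
from pathlib import Path


def ensure_root_certificate(local_cert, peer_certs):
    """Make sure local_cert exists, copying it from a peer if needed.

    Returns None if the certificate was already on disk, or the peer path
    it was copied from. Raises FileNotFoundError if no peer has it.
    """
    local_cert = Path(local_cert)
    if local_cert.exists():
        return None  # Already on disk, nothing to copy.

    for candidate in map(Path, peer_certs):
        if candidate.exists():
            # Create the target directory if this host has never held the file.
            local_cert.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(candidate, local_cert)
            return candidate

    raise FileNotFoundError("root certificate not available on any known host")
```

The sketch also shows why running a router on every host helps: each additional host that holds a copy is one more candidate source if the primary host is lost.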
If a secondary host is configured, the Grid continues to function, although in a slightly limited mode, buying time to recover the primary host.
If the primary host is temporarily lost, for example because of a network failure on the primary host, failover to the secondary host keeps the Grid functioning until the primary host recovers from the temporary condition.