Handling transient hosts in a cloud environment
In most cloud Grid environments, hosts are more transient than in on-premises environments: they are added to and removed from the Grid configuration regularly. The Grid cannot tell whether a host is merely stopped or has been removed completely. In an on-premises environment, you remove a host from the Grid cluster by running the Grid uninstaller. In a cloud environment, you need to run the uninstaller automatically when a host is terminated. The uninstaller takes down the host and then notifies the rest of the Grid that the host has been removed completely. It also archives log files for later reference.
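One way to run the uninstaller automatically is a small process that waits for the termination signal and then invokes it. The following is a minimal sketch, not a supported implementation. It assumes a Unix host that delivers SIGTERM on shutdown, and the uninstaller path `/opt/grid/bin/uninstall` is a hypothetical placeholder, as are the function names.

```python
import signal
import subprocess
import sys

# Hypothetical path: replace with the actual location of the Grid uninstaller.
UNINSTALLER_CMD = ["/opt/grid/bin/uninstall"]


def run_uninstaller(cmd=UNINSTALLER_CMD, runner=subprocess.run, timeout=300):
    """Run the uninstaller and return True if it exits with status 0."""
    try:
        result = runner(cmd, timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command is missing, not executable, or took too long.
        print(f"uninstaller could not complete: {exc}", file=sys.stderr)
        return False
    return result.returncode == 0


def handle_termination(signum, frame):
    """Signal handler: unregister the host, then exit."""
    succeeded = run_uninstaller()
    sys.exit(0 if succeeded else 1)


def main():
    # Register the handler and sleep until a signal arrives (Unix only).
    signal.signal(signal.SIGTERM, handle_termination)
    signal.pause()
```

Calling `main()` from the entry point of a service that the init system stops on shutdown would trigger the uninstaller at that moment. The timeout matters because cloud providers typically allow only a limited grace period before an instance is forcibly terminated.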
If the uninstaller cannot be executed, for example because an instance fails, you can configure the Grid to automatically remove hosts that are down in the Grid configuration but are not available in the cloud environment. Note that when the Grid performs this cleanup, no log files are archived. In addition, if a host disappears without the uninstaller being executed, the Grid cannot determine whether the host has been removed completely or is only unreachable because of a temporary network glitch. The Grid logs proxy warnings that the host is unexpectedly unresponsive. These warnings stop after the host has been cleaned up, but existing warnings remain.
In some Grid environments, there are hosts that should never be removed from the configuration. If automatic cleanup is enabled, you can add these hosts to a whitelist so that they are excluded from the cleanup. Make sure that the uninstaller is not executed when these hosts are stopped.
Considerations to keep the Grid configuration up to date from a host perspective
- Create a service that will run the Grid uninstaller when a host is shut down or terminated. This will ensure that the host is unregistered from the Grid.
- Set the Grid property grid.host.cleanupLostHosts to true to enable the Grid to monitor for lost hosts. Stopped hosts in the Grid whose corresponding cloud instance is not available are removed from the configuration.
- Add any hosts that should not be automatically removed from the configuration to the Grid property grid.host.cleanupWhitelist.
If neither the uninstaller service nor the automatic cleanup is in place, it is your responsibility to remove hosts that have been lost, either manually or by scripting against the Grid REST API and the cloud API.
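If you script the removal yourself, the selection rule described above can be sketched as follows: a host is a candidate only if the Grid reports it as down, its cloud instance is no longer listed, and it is not on the list of protected hosts. This is an illustration under assumptions, not the Grid's own logic. `GridHost`, `find_lost_hosts`, and `reconcile` are hypothetical names, and the three callables stand in for the Grid REST and cloud API calls, whose actual endpoints are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class GridHost:
    name: str         # host name as registered in the Grid
    instance_id: str  # ID of the backing cloud instance (assumed to be known)
    is_down: bool     # True if the Grid reports the host as stopped


def find_lost_hosts(
    grid_hosts: Iterable[GridHost],
    cloud_instance_ids: Set[str],
    protected: Set[str],
) -> List[GridHost]:
    """Return hosts that are down, have no cloud instance, and are not protected."""
    lost = []
    for host in grid_hosts:
        if not host.is_down:
            continue  # running hosts are never candidates
        if host.name in protected:
            continue  # whitelisted hosts are never removed
        if host.instance_id in cloud_instance_ids:
            continue  # instance still exists; the host may only be stopped
        lost.append(host)
    return lost


def reconcile(
    list_grid_hosts: Callable[[], Iterable[GridHost]],
    list_cloud_instance_ids: Callable[[], Iterable[str]],
    remove_host: Callable[[GridHost], None],
    protected: Set[str],
) -> List[str]:
    """Remove every lost host and return the names of the removed hosts."""
    lost = find_lost_hosts(
        list_grid_hosts(), set(list_cloud_instance_ids()), protected
    )
    for host in lost:
        remove_host(host)
    return [host.name for host in lost]
```

Passing the API calls in as functions keeps the selection rule testable without a live Grid or cloud account. Because a temporary network glitch can look like a lost host, a real script would probably want to require several consecutive checks before removing anything.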