In many situations, the PBX is an important service that should be available whenever possible. In the overwhelming majority of cases, the underlying server infrastructure is reliable enough to operate the PBX without adding anything specific to the PBX server.
Choosing Suitable Hardware
Hardware reliability is the foundation for a reliable service. Customers running the PBX should choose hardware that provides a stable base for a reliable PBX service.
When renting servers in the cloud, the service provider usually picks suitable hardware that works well and automatically switches to a different server in case of a failure.
When running the PBX on your own hardware, it is highly recommended to use SSDs, which also speed up disk access. Redundant power supplies, multiple NICs and a controlled environment (humidity, temperature) also help to reduce the risk of a hardware failure.
Because IP addresses can change during a failover, it is important to use DNS addresses for the PBX. This way, if an IP address change is necessary, clients will find the new instance without manual intervention. This is especially important for the provisioning address of VoIP phones, which can be hard to change after the rollout. DNS addresses are also a requirement for using secure communications.
No matter if the failover is automatic or manual, there should be a backup. The PBX stores all information in the working directory. This makes it very easy to take periodic snapshots. This can be achieved by using a cloud file system or by running a backup script e.g. at midnight. It is important that the backup is outside of the server, and possibly also outside the data center.
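The periodic snapshot described above could be sketched as a small script that a scheduler runs, for example at midnight. This is a minimal sketch, not a Vodia tool: the function name, the archive naming scheme and the directory paths are assumptions chosen for illustration, and the destination is assumed to be storage that is mounted from outside the server.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_working_directory(working_dir: str, backup_dir: str) -> Path:
    """Pack working_dir into a timestamped .tar.gz archive inside backup_dir.

    Both paths are hypothetical; backup_dir should live outside the server,
    for example on a mounted remote file system.
    """
    source = Path(working_dir)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # A UTC timestamp in the name keeps successive snapshots apart.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = target_dir / f"pbx-backup-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # Store everything under one fixed top-level folder named "pbx".
        tar.add(source, arcname="pbx")
    return archive
```

A scheduler entry would call `backup_working_directory()` with the actual working directory of the installation and a destination outside the server. Copying files while the PBX is writing to them may produce an inconsistent snapshot, so a file system level snapshot can be the safer choice where one is available.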
Restoring a backup is easy: after installing the PBX on a new server, stop the PBX service and replace the content of the working directory on the new server with the content from the old server. Then restart the new server. Depending on your license, you might have to reset the license on the Vodia portal.
Using an automatic, periodic backup service is easy and cheap. It provides the means to restore the system after unforeseen events such as a hardware failure. It is also very useful in other events, such as an accidental major misconfiguration of the PBX, or when the server has been hacked or has become the victim of a ransomware attack. In any case, it is imperative that the backup is stored outside of the server and the data center.
In a virtualized environment, the host software may take care of hardware failures. The PBX has a relatively small footprint; periodic snapshots of the virtual machine can be restored in a short amount of time on secondary hardware, where the PBX resumes operation. If this is done within a second, it is possible to keep existing TCP connections alive and ongoing calls connected. This kind of setup can achieve great resilience against hardware failures and even makes it possible to swap out hardware while the PBX is running.
The virtualization solution does not require any specific setup change on the PBX itself.
Virtualization also makes it easy to create snapshots of the server as a whole. This can be done after important changes, but also on a periodic basis. We still recommend taking periodic backups of the working directory as a precautionary step, e.g. in case the data center becomes unavailable.
External Failover Software
There are different external solutions available that can take care of hardware failover for Linux, Windows and other operating systems. Those services essentially take care of starting up the secondary server when the first one becomes unavailable. Using such software can be a good solution to increase the uptime of the service and reduce the failover time to a few minutes.
Using the PBX Failover Feature
When the hosting provider's data center becomes unavailable because of a fatal event and there is still a need to bring the PBX service back within a few minutes, the PBX can assist in automatically starting a secondary server. The PBX failover feature is used for automatic failover from a primary server (physical or virtual) to a standby server, typically in a different data center in a different region. This process takes a few minutes and will drop all calls on the primary server.
Setting up the failover from one PBX to a PBX in a remote location is complex and requires significant effort, training and infrastructure to work properly. For example, the Department of Defense (DoD) Cybersecurity Reference Architecture provides a framework for adding meaningful additional reliability in case of catastrophic events that can be used as a checklist of tasks to be completed.
As with the framework mentioned, it needs to be taken into consideration that the PBX is just one component in the setup. Other potential single points of failure raise questions such as:
- Are the DNS servers ready for failover?
- What happens when the SIP trunk becomes unavailable (inbound/outbound)?
- Are customers still able to access the PBX service when their internet connection fails?
- Are certificates being renewed automatically, and is there a monitoring service in place for the case that a renewal runs out of time?
- How are software updates handled without triggering a failover?
- Does storing the data outside of the data center break compliance requirements?
- Are organizational procedures in place, for example to ensure that administrators are not able to take down systems by accident or deliberately?
- What happens if payments fail?
The alternative to an automatic failover into a different data center is the manual procedure to restore a backup. While this typically takes longer than a few minutes, it is much more flexible and cheaper than the automatic failover. Especially when major outages are happening, having a recent backup is the most valuable component in the failover setup.
How it works
When the PBX is starting up, it can delay the process of reading the configuration information. This is useful for a secondary server that should start up only when the primary server goes down. This way the secondary server can start with the last configuration that the primary was using, including up-to-date call records and mailbox messages.
The PBX uses a special path to store the failover information, which can be set with the command line option --serverdir <dir>. The filename itself is pbxctrl-failover.xml. The directory path tells the PBX where to read and store the information related to the failover. This way, the complete working directory can be kept in sync with the primary server. If the file system synchronization can make exceptions for the files, the PBX can also store the information in the working directory of the PBX.
The following image shows which options are available.
The current state may be one of the following states:
- Starting: In this state the PBX is testing if the primary server is operating yet. Unless the primary server has been working before, the secondary server will not start counting failures.
- Waiting: Once the primary server has been found to respond, the secondary server starts polling for failures.
- Verifying: After a failure of the primary server has been found, the secondary PBX needs to verify that it is itself still operational by checking the connectivity to a web server that is supposed to be up all the time.
- Failover: In this state the secondary PBX is operating as the failover PBX.
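The four states can be read as a small state machine. The following is a minimal sketch of that reading, not the PBX implementation: the class and method names are hypothetical, and three details are assumptions that the text above does not specify, namely that the failure counter resets after a successful poll, that the transition to Verifying happens once the counter reaches the tolerated number, and that a failed verification sends the standby back to Waiting.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    STARTING = "Starting"
    WAITING = "Waiting"
    VERIFYING = "Verifying"
    FAILOVER = "Failover"


@dataclass
class Standby:
    """Hypothetical model of the standby server's failover states."""
    tolerated_failures: int
    state: State = State.STARTING
    failures: int = 0

    def step(self, primary_ok: bool, validation_ok: bool) -> State:
        """Advance by one polling round and return the resulting state."""
        if self.state is State.STARTING:
            # Failures are not counted until the primary has been seen working.
            if primary_ok:
                self.state = State.WAITING
        elif self.state is State.WAITING:
            if primary_ok:
                self.failures = 0  # assumption: a success resets the counter
            else:
                self.failures += 1
                if self.failures >= self.tolerated_failures:
                    self.state = State.VERIFYING
        elif self.state is State.VERIFYING:
            if validation_ok:
                # The standby itself has connectivity, so the primary is down.
                self.state = State.FAILOVER
            else:
                # Assumption: the standby is the isolated one, so keep waiting.
                self.failures = 0
                self.state = State.WAITING
        # In FAILOVER the standby keeps operating as the failover PBX.
        return self.state
```

In this sketch, `primary_ok` stands for the result of polling the primary server and `validation_ok` for the result of checking the always-on web server.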
Failover detection parameters
Interval for checking server availability. The secondary server needs to poll the primary server periodically for its availability. This setting controls how many seconds it should wait between the tests. Shorter intervals result in faster failover detection, but also put a higher load on the primary server.
Number of tolerated failures. Sometimes the checks fail for reasons that are not fatal. This setting controls how many failures the secondary PBX tolerates in order to avoid false alarms. Multiplied by the interval, this gives the time that it takes to detect a failure. Making this time too short results in false alarms.
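As a rough worked example of how the two settings combine, the detection time is the polling interval multiplied by the number of tolerated failures. The helper below is a hypothetical illustration only; it ignores request timeouts and the additional verification step, so the real time may be somewhat longer.

```python
def detection_time_seconds(interval_seconds: float, tolerated_failures: int) -> float:
    """Approximate time until a failure of the primary server is detected."""
    return interval_seconds * tolerated_failures


# Compare a few hypothetical combinations of the two settings.
for interval, failures in [(5, 3), (10, 3), (30, 5)]:
    estimate = detection_time_seconds(interval, failures)
    print(f"interval {interval} s x {failures} tolerated failures = about {estimate} s")
```

For instance, an interval of 10 seconds with 3 tolerated failures gives a detection time of roughly 30 seconds.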
URL of the primary server. This setting contains the URL of the primary server (e.g. http://192.168.1.2). It does not matter what content is returned, as long as the primary PBX returns a 200 OK. In order to reduce the stress caused by the polling, using http is recommended. It is a good idea to use an IP address instead of a DNS address to make sure that a DNS server does not become the point of failure.
URL for validating the failover event. When the failure of the primary server is detected, the secondary server will use this URL to validate the event. It should point to a web site that is considered to be always on (e.g. http://vodia.com). This server needs to return a 200 OK; the page content is irrelevant.
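Both URLs are evaluated the same way: only the status code matters, not the page content. A check of this kind could look like the following minimal sketch, which uses the Python standard library. The function name and the default timeout are assumptions for illustration and do not reflect how the PBX performs the request internally.

```python
import http.client
import urllib.error
import urllib.request


def returns_200(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url ends with status 200.

    The response body is ignored. Connection errors, timeouts and
    non-200 status codes all count as a failed check.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # HTTP error codes are raised as HTTPError, a subclass of URLError.
        return False
```

The same function can be used to try out a candidate primary URL or validation URL from the standby server before entering it in the settings.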
Action URL in the event of a failover
When the secondary PBX makes the state transition to failover, it can issue a web request to an outside server. This can be used to trigger events, typically a change of the DNS address for the server. If there are multiple domains involved, it can make sense to use CNAME records for the domains and change just the one DNS record for the PBX.
When using the Action URL to change the DNS address of the system, you should make sure that you provision the domain name instead of the IP address of the server. See Outbound Proxy Provisioning for more information.
Action URLs are described on a separate page.
The servers that are involved in the failover setup should use the same license activation code. The license must list all IP addresses that will be used by the primary and the secondary server. The license file can be part of the file system replication because it contains exactly the same content for all involved servers.
Servers that are on standby will not affect the metering for the license. Only servers that are active will be counted in the hosted PBX metering. A failover will not impact the readout data.