Kamailio configuration to provide load balancing and failover for media services

In many setups Kamailio is used as a PROXY server that takes care of routing calls to servers providing voice services, e.g. voicemail, IVR or conference calls.

There are a few things an administrator must keep in mind.

1. When routing among multiple media servers, the proxy can load balance between them. The administrator needs to think carefully about which algorithm decides where a call from a given user should go. It may be necessary to route calls from a specific user to a specific media server. The most common requirement is to route a call to the server with the minimum load. The administrator also has to consider how to route calls when the proxy is stateless (a naive, haphazard distribution is obviously not a good solution).

2. If one or more media servers become unreachable (for example due to a network or configuration error), the proxy needs to know as soon as possible so it can stop routing calls to that destination. Availability should be checked periodically. There are basically two ways of implementing this "keepalive" mechanism. The first is to configure the media servers to REGISTER with the proxy instance and maintain the registration through the normal registration expiry. The other is to ping the media servers from the proxy using OPTIONS requests.

This article describes possible ways of configuring a single Kamailio instance to provide these features. Please keep in mind that it covers load balancing calls to a group of media servers. It does not cover distributing the signaling load among multiple proxy instances. After the configuration the proxy still remains a single point of failure, so this article should not be taken as a high availability configuration example.

Kamailio offers two modules that can do load balancing.

PATH module

DISPATCHER module

The path module uses "Path" headers to force the SIP message to go to a selected destination. This module offers very basic functionality and no keepalive mechanism for the media servers. When using it, the "registration based keepalive" mechanism can be used: before a message is routed to the destination, the proxy needs to check whether the destination is "online" or not. A way of distinguishing the media servers should also be deployed. For example, it is good practice to register every media server under a different SIP URI to avoid parallel forking (forking can also simply be turned off, but that may be undesirable for the users). If one media server should be preferred over the others, that node can register with a "q" value in the Contact header, indicating that it has a higher priority. Node selection can then happen based on this value (for more information see the documentation of the TM module). The q parameter should appear in the Contact header like this:

Contact: "Mr. Watson" <sip:watson@worcester.bell-telephone.com>;q=0.7

Note: the q parameter needs to be included AFTER the "< >" section, not inside it. Some media servers may put it inside when registering, so take care.
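The q-based selection described above can be sketched with the TM module's serial forking functions t_load_contacts() and t_next_contacts(). This is only a minimal sketch, not a complete configuration: the route names MEDIA_BY_Q and MEDIA_NEXT_Q and the AVP names are hypothetical, and it assumes that all media servers register contacts under the address being looked up and that the installed TM version needs the two AVP parameters set explicitly.

# assumption: AVP names are arbitrary; some TM versions require them to be set
modparam("tm", "contacts_avp", "tm_contacts")
modparam("tm", "contact_flows_avp", "tm_contact_flows")

route[MEDIA_BY_Q] {
        # fetch every contact registered for the requested media service
        if (!lookup("location")) {
                send_reply("404", "No media server registered");
                exit;
        }
        # order the contacts by q value and store them for serial forking
        if (!t_load_contacts()) {
                send_reply("500", "Server Internal Error");
                exit;
        }
        # use the contact(s) with the highest q value first
        t_next_contacts();
        t_on_failure("MEDIA_NEXT_Q");
        t_relay();
        exit;
}

failure_route[MEDIA_NEXT_Q] {
        if (t_is_canceled()) {
                exit;
        }
        # try the contact(s) with the next lower q value, if any are left
        if (t_next_contacts()) {
                t_on_failure("MEDIA_NEXT_Q");
                t_relay();
                exit;
        }
}

With this approach a server that stops refreshing its registration simply disappears from the lookup result once the registration expires, which is the "registration based keepalive" mentioned earlier.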

The dispatcher module, on the other hand, is more powerful. It offers load balancing capabilities (with several destination selection algorithms), failover routes (when a destination fails) and a fully automatic keepalive mechanism. The following section describes how to configure Kamailio with the dispatcher module and what usage options the module offers.

Changes made to kamailio.cfg:

First we define a directive (in the "defined values" section) that will easily allow us to turn on/off the dispatcher capabilities.

#!define WITH_LOADBALANCE

In the "modules section", we load the module:

#!ifdef WITH_LOADBALANCE
loadmodule "dispatcher.so"
#!endif

In the "module specific parameters" section, we set the parameters of the module:

#!ifdef WITH_LOADBALANCE
modparam("dispatcher", "db_url", DBURL)
modparam("dispatcher", "table_name", "dispatcher")
modparam("dispatcher", "flags", 2)
modparam("dispatcher", "dst_avp", "$avp(AVP_DST)")
modparam("dispatcher", "grp_avp", "$avp(AVP_GRP)")
modparam("dispatcher", "cnt_avp", "$avp(AVP_CNT)")
#set next two parameters if you want to enable balance alg. no. 10
#modparam("dispatcher", "dstid_avp", "$avp(dsdstid)")
#modparam("dispatcher", "ds_hash_size", 8)
modparam("dispatcher", "ds_ping_interval", 20)
modparam("dispatcher", "ds_ping_from", "sip:kamailio1@awesomedomain.com")
#modparam("dispatcher", "ds_ping_method", "INFO")
modparam("dispatcher", "ds_probing_mode", 1)
modparam("dispatcher", "ds_probing_threshhold", 1)
#configure codes or classes of SIP replies to list only allowed replies (e.g. when temporarily unavailable=480)
modparam("dispatcher", "ds_ping_reply_codes", "class=2;code=480;code=404")
#!endif

Let's go over the configuration a little bit. The first two parameters (db_url and table_name) simply tell the module how to access the database to read the information about the configured media servers (note: the destination list can also be read from a local file on disk; it does not have to be a database). The flags parameter is a bit mask that influences the behaviour of the module. The second least significant bit (value 2) configures the module to store all possible destinations in an AVP so that, if the selected destination fails, the next one can be selected from the list. This is what enables failover. The next three parameters (dst_avp, grp_avp and cnt_avp) tell the module the names of the AVPs that store the destination list, the set ID and the destination count.
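As a rough illustration of the file-based alternative mentioned in the note, the destinations can be read through the module's list_file parameter instead of db_url. The file path and the IP addresses below are assumptions chosen for the example (the addresses come from a documentation range), not values from this setup.

#!ifdef WITH_LOADBALANCE
# db_url and table_name are left out; the destinations come from a text file
# assumption: the path is only an example, use whatever suits your installation
modparam("dispatcher", "list_file", "/etc/kamailio/dispatcher.list")
#!endif

The file itself holds one destination per line, in the order set ID, destination URI and optional flags:

# setid destination flags
0 sip:192.0.2.11:5060 2
0 sip:192.0.2.12:5060 2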

The rest of the parameters (starting with ds_ping_interval) tell the module how to treat the media servers when it comes to the keepalive mechanism. In this case, we configured the module to ping the servers every 20 seconds and to use the "sip:kamailio1@awesomedomain.com" URI in the From header. If we wanted to use the INFO method instead of the default OPTIONS method, we could set ds_ping_method. Probing mode determines which destinations receive keepalive messages. Value 1 means that all destinations are probed. Probing keeps the keepalive mechanism running for a destination, enabling the proxy to see the gateway as soon as it comes back online. By default, this is done only for destinations configured with the probing flag in the database table. The probing threshold is the number of keepalive messages that can fail before the destination is considered down.

The last parameter, ds_ping_reply_codes, is very important: it specifies which replies to the keepalive do not cause the media server to be considered down. For example, if the server is overloaded, it can respond with "480 Temporarily unavailable". In this case the destination should not be considered down. The proxy simply needs to try another destination from the set, if there is one. A single code is specified by "code=". You can also specify a whole class of replies: "class=2" means all replies from 200 to 299 are accepted.
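To make the syntax concrete, here is a hedged variant of the same parameter that accepts two whole classes plus two single codes. The particular codes are only an illustration; which replies really mean "alive" depends on how your media servers answer the ping.

# accept every 2xx and 3xx reply, plus 403 and 488, as a sign that the server is alive
# assumption: the chosen codes are examples, adjust them to your media servers
modparam("dispatcher", "ds_ping_reply_codes", "class=2;class=3;code=403;code=488")

Any reply not covered by the list counts as a failed ping and is measured against the probing threshold.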

For a complete list of configuration options, refer to the documentation of the dispatcher module.

Next, we do some changes to the routing logic. We start at the main request_route:

#!ifdef WITH_LOADBALANCE
#you can customize the condition however you need. For example, request uri checking, specific header checking, etc.
if (<CALL IS DESTINED FOR THE MEDIA SERVICES>)
{
  #we go to the load balancer route
  route(LOADBALANCE);
}
else
{
  #we perform normal usrloc lookup for the call
  route(LOCATION);
}
#!else
# user location service
  route(LOCATION);
#!endif
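As one possible way of filling in the placeholder condition, the check below matches on the user part of the request URI. It is only a sketch: the rule that media service numbers start with the digit 8 is a hypothetical dial plan, so replace the expression with whatever identifies media calls in your setup.

#!ifdef WITH_LOADBALANCE
# hypothetical dial plan: numbers starting with 8 belong to the media services
if (is_method("INVITE") && $rU =~ "^8[0-9]+$")
{
  #we go to the load balancer route
  route(LOADBALANCE);
}
else
{
  #we perform normal usrloc lookup for the call
  route(LOCATION);
}
#!else
  route(LOCATION);
#!endif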

Next we define the new LOADBALANCE route.

#!ifdef WITH_LOADBALANCE
route[LOADBALANCE] {
        #ds_select_dst(destination_set, algorithm) chooses the destination for the call. Several algorithms are available.
        #Alg. 0 is the default one; it hashes over the Call-ID
        #Alg. 4 is round-robin
        #Alg. 10 chooses the destination with the minimum load of all destinations
        if(!ds_select_dst("0", "4"))
        {
                #if we are here, no destination is available. We notify the user with a 404 and exit the script.
                xlog("L_NOTICE", "No destination available!\n");
                send_reply("404", "No destination");
                exit;
        }
        xlog("L_DEBUG", "Routing call to <$ru> via <$du>\n");
        #set the no-reply-received timeout to 2 seconds ... adjust the value to your needs
        #note: the first value "0" is the INVITE timeout .. we do not need to change it
        #This means that if the selected media server fails to respond within 2 seconds, the failure_route "MANAGE_FAILURE" is called
        #note: this implies that all the signaling from the media servers on the way back to the user goes through the proxy as well
        t_set_fr(0,2000);
        t_on_failure("MANAGE_FAILURE");
        return;
}
#!endif

Next we modify the failure_route so it looks like this:

# manage failure routing cases
failure_route[MANAGE_FAILURE] {
        route(NATMANAGE);
        if (t_is_canceled()) {
                exit;
        }

#!ifdef WITH_LOADBALANCE
        xlog("L_NOTICE", "Media server $du failed to answer, selecting other one!\n");
        # next DST - only for 500 reply or local timeout (set by t_set_fr())
        if (t_check_status("500") || t_branch_timeout() || !t_branch_replied())
        {
                #we mark the destination Inactive and Probing
                ds_mark_dst("ip");
                #select the new destination
                if(ds_next_dst())
                {
                        #again set local timeout for reply
                        t_set_fr(0,2000);
                        t_on_failure("MANAGE_FAILURE");
                        route(RELAY);
                        exit;
                }
                else
                {
                        #last available node failed to reply, no other destinations available
                        send_reply("404", "No destination");
                        exit;
                }
        }
#!endif

...

This means that if the next selected destination fails to reply as well, the one after it is selected. If even the last of the destinations fails, the user is notified and the script ends.

And that's it for the configuration file. All that remains is to fill the database with the destinations we want to route to.

As we specified in the configuration of the module, the table name is "dispatcher".

This is an example of using two media servers:

INSERT INTO dispatcher(setid, destination, flags, description) values (0, 'sip:<IP_ADDRESS_OF_ONE>:<PORT>', 2, 'mediaServer1');
INSERT INTO dispatcher(setid, destination, flags, description) values (0, 'sip:<IP_ADDRESS_OF_TWO>:<PORT>', 2, 'mediaServer2');

This creates two servers in destination set "0" (remember how we used the function ds_select_dst() in the routing logic?). Destination sets are logical groups of media servers. If you need more groups, you can create them with a unique set ID.

Each of the servers was loaded with flag "2", which means it is enabled for probing (this is not really important here, because we configured the module to probe all destinations regardless of this flag).

Note: if you plan to use algorithm 10, you need to specify a unique ID for each destination in the database. This is done by inserting "duid=<My_UID>" into the attrs column of each row. You have to replace "<My_UID>" with your own value.
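Besides the unique IDs, algorithm 10 also needs the two parameters that are commented out in the module configuration, and the script has to tell the module when a call starts and ends. The fragments below are a minimal sketch of that bookkeeping using the dispatcher functions ds_load_update() and ds_load_unset(); the reply route name MEDIA_LOAD is hypothetical and the exact placement inside your routing blocks is an assumption.

#!ifdef WITH_LOADBALANCE
modparam("dispatcher", "dstid_avp", "$avp(dsdstid)")
modparam("dispatcher", "ds_hash_size", 8)
#!endif

# in request_route, before the request is relayed:
# a BYE or CANCEL releases the load that the INVITE put on the destination
if (is_method("BYE|CANCEL")) {
        ds_load_update();
}

# in route[LOADBALANCE], select by call load and arm the reply route:
if(!ds_select_dst("0", "10"))
{
        send_reply("404", "No destination");
        exit;
}
t_on_reply("MEDIA_LOAD");

# hypothetical reply route that confirms or releases the load
onreply_route[MEDIA_LOAD] {
        if (is_method("INVITE")) {
                if (status =~ "2[0-9][0-9]") {
                        # call answered: the load is confirmed
                        ds_load_update();
                } else if (status =~ "[3-7][0-9][0-9]") {
                        # call rejected: remove the load again
                        ds_load_unset();
                }
        }
}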

Once Kamailio is started with the new configuration, it loads the destinations from the database and starts to ping them. The current status of the destinations can be seen using the kamctl command.

# kamctl dispatcher dump
SET_NO:: 1
SET:: 0
        URI:: sip:<IP_1>:<port> flags=AP priority=0 attrs=
        URI:: sip:<IP_2>:<port> flags=IP priority=0 attrs=

In this example we see that the first destination is considered active (A flag) and will be put into the probing state if it fails (P flag). The second destination is inactive (I flag) and in the probing state (P flag).

Another useful enhancement is that the dispatcher module allows the administrator to define event routes in the routing logic that are called when a destination changes state (UP/DOWN).

The following is an example of a "dst-down" route:

event_route[dispatcher:dst-down] {
    # DO SOME LOGGING HERE.. MAYBE NOTIFY SNMP SERVER
}

This can also be done for the "dst-up" route.
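A hedged sketch of such a "dst-up" route with simple logging might look as follows. It assumes, as in the example of the dispatcher module documentation, that $ru carries the URI of the affected destination inside these event routes.

event_route[dispatcher:dst-up] {
    # assumption: $ru holds the URI of the destination that came back
    xlog("L_NOTICE", "Destination is back up: $ru\n");
}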

This article described how to configure Kamailio to load balance between multiple media servers while providing failover in a simple manner (not an active one: if the media server fails, the call fails). Again, I would like to remind the reader that the proxy still remains a single point of failure.

Any comments on this article are welcome.