Revision 2 (nitrox, 2011-01-22 13:16) → Revision 3/4 (nitrox, 2012-08-11 10:42)

h1. Balancing 

 The goal is to provide a [[mod_balance|generic interface to balance between multiple backends]]; each balancer acts like a backend itself. 

 A balancer is an @action@ that selects a backend from a list and executes it via @action_enter@. If the backend fails for specific reasons (overload/timeout), the balancer gets called again through the @ActionBackendFail@ callback. 
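The select-and-retry loop described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual implementation: @BackendFail@, @balance@ and the random selection policy are hypothetical names and choices, the direct call stands in for @action_enter@, and the @except@ branch stands in for the @ActionBackendFail@ callback. Excluding a failed backend from the next selection is also an assumption.

```python
import random


class BackendFail(Exception):
    """Hypothetical signal that a backend failed for a balancer-relevant reason."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason  # e.g. "overload" or "timeout" in this sketch


def balance(backends, request, rng=random):
    """Select a backend, run it, and select again if it reports BackendFail.

    Returns the backend's result, or None once every backend has failed.
    """
    candidates = list(backends)
    while candidates:
        backend = rng.choice(candidates)  # placeholder selection policy
        try:
            return backend(request)       # stands in for action_enter
        except BackendFail:
            # Stands in for the ActionBackendFail callback: the balancer
            # is asked again and (by assumption) skips the failed backend.
            candidates.remove(backend)
    return None
```

Because the balancer has the same call shape as a backend here, one balancer could be passed as a backend of another, which mirrors the idea that each balancer acts like a backend itself.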

 What can go wrong with a backend that a balancer needs to care about? 
 * backend process died 
 * backend process cannot be spawned 
 * backend overloaded 
 * connect() timeout 
 * connect() reset (no backend listening, i.e. process down) 

 There are more problems, such as a connection dropping after data was sent, but it should not be the balancer's job to "fix" those. 

 The above problems can be classified into two categories: 
 * timeout/overloaded 
 * backend down 
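As an illustration, the classification could be expressed as a lookup table. The sketch below is in Python; the names (@FailureClass@, @FAILURE_CLASS@, @classify@) and the problem keys are hypothetical. The page only states the category explicitly for the connect() cases, so placing "process died" and "cannot be spawned" under "backend down" is an assumption.

```python
from enum import Enum


class FailureClass(Enum):
    OVERLOADED = "timeout/overloaded"
    DOWN = "backend down"


# Assumed mapping from the problems listed above to the two categories.
FAILURE_CLASS = {
    "process died": FailureClass.DOWN,
    "cannot be spawned": FailureClass.DOWN,
    "overloaded": FailureClass.OVERLOADED,
    "connect timeout": FailureClass.OVERLOADED,
    "connect reset": FailureClass.DOWN,  # no backend listening, i.e. process down
}


def classify(problem):
    """Return the category for one of the known problem keys."""
    return FAILURE_CLASS[problem]
```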

 Spawn problems, such as restarting a process after it died, should be handled elsewhere. 

 As balancers can be stacked like other actions, at most one balancer on any path through the "actions-tree" should have a backlog-queue. 

 Some ideas for handling the backlog-queue: 
 * a backend has the states: "alive", "overloaded", "down", "down-retry" 
 * queue has the states: "alive", "overloaded", "down" 
 * if not all backends are "down" or "down-retry", the queue stays in (or goes to) "alive" 
 * if the queue is "alive" and all backends are "down" or "down-retry", it goes to "overloaded" and starts a small timeout 
 * if the queue is "overloaded" and the overload-timeout is reached, it goes to "down" 
 * while the queue is "down" and all backends are "down", return "503 Service Unavailable" for all requests 
 * "overloaded" backends are tried again (switched to "alive") after a small timeout (e.g. 3 seconds) or when another request to that backend completes 
 * "down" backends are tried again (switched to "down-retry") after a small timeout (e.g. 1 second) 
 * the queue has a limit; once it is reached, return "503 Service Unavailable" 
 * requests have a timeout for finding a backend; return "504 Gateway Timeout" once it expires
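
The queue state transitions listed above can be sketched as a small state machine. This is a hedged Python illustration only: @BacklogQueue@, @update@ and @rejects_with_503@ are hypothetical names, the 5 second overload timeout is an assumed value, and time is passed in explicitly to keep the example deterministic. The queue limit, the per-backend retry timers and the 504 request timeout are not modelled.

```python
ALIVE, OVERLOADED, DOWN, DOWN_RETRY = "alive", "overloaded", "down", "down-retry"


class BacklogQueue:
    def __init__(self, overload_timeout=5.0):
        self.state = ALIVE
        self.overload_timeout = overload_timeout  # assumed value, in seconds
        self.overloaded_since = None

    def update(self, backend_states, now):
        """Apply the queue transitions for the given backend states."""
        all_down = all(s in (DOWN, DOWN_RETRY) for s in backend_states)
        if not all_down:
            # At least one backend is usable: stay in or go to "alive".
            self.state = ALIVE
            self.overloaded_since = None
        elif self.state == ALIVE:
            # Everything is down: go to "overloaded" and start the timeout.
            self.state = OVERLOADED
            self.overloaded_since = now
        elif (self.state == OVERLOADED
              and now - self.overloaded_since >= self.overload_timeout):
            self.state = DOWN
        return self.state

    def rejects_with_503(self, backend_states):
        """True while the queue is "down" and every backend is "down"."""
        return self.state == DOWN and all(s == DOWN for s in backend_states)
```

For example, a queue that sees all backends down at t=1 moves to "overloaded", moves to "down" once the timeout has elapsed, and returns to "alive" as soon as any backend leaves the "down"/"down-retry" states.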