Item13010: fcgi unstable when run under ProcManager

Priority: Urgent
Current State: Closed
Released In: 2.0.0
Target Release: major
Applies To: Extension
Component: FastCGIEngineContrib
Branches: Item13010 master
Reported By: MichaelDaum
Waiting For:
Last Change By: GeorgeClark
When using FastCGIEngineContrib in an nginx setup, the Foswiki engine has to run as a separate process controlled by an operating system init script (in /etc/init.d/foswiki). Other web servers such as Apache or lighttpd manage the life cycle of the foswiki.fcgi backends themselves. nginx does not: it acts as a proxy that delegates the real work to independently running daemon processes, wherever they are. In principle, FastCGIEngineContrib is set up to run that way by means of an FCGI::ProcManager implementation.

However, this mode of operation doesn't seem to be well tested: backends lock up under heavy load with a "bind failed, socket already in use" error message. Nginx then returns a "broken upstream" error because it can no longer talk to the foswiki.fcgi backends.

After investigation I've found out why this happens.

There is code in the Foswiki::Engine::FastCGI layer that reloads backends whenever LocalSite.cfg changes or after a certain number of requests. That causes problems when it happens under heavy load. In an FCGI::ProcManager setup, processes are created like this:

  • foswiki-fcgi-pm, the proc manager master process
  • a working set of foswiki-fcgi processes, which are called in turn to serve requests coming in from the web browser

The problem is that Foswiki::Engine::FastCGI will kill the master process foswiki-fcgi-pm, forcing it to reload LocalSite.cfg ... while there is an active foswiki-fcgi child process. Once that happens, communication between the two breaks, and the foswiki-fcgi child does not terminate but keeps hanging around forever. In effect it keeps the existing sockets occupied, so no new ones can be created, e.g. by the new foswiki-fcgi-pm that Foswiki::Engine::FastCGI re-executes. That call then aborts, and this is the point where the web server can no longer talk to an upstream service.
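The socket clash described above can be reproduced in isolation. The following is only a minimal sketch using core Perl modules, not the Foswiki code: a single script stands in for both the lingering foswiki-fcgi child and the re-executed foswiki-fcgi-pm, and it assumes a TCP listener on 127.0.0.1 purely for convenience (whether the real setup uses a TCP port or a Unix socket is not stated here).

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Stand-in for the lingering foswiki-fcgi child: it still holds the
# listening socket. Port 0 lets the OS pick a free port (assumption:
# a TCP listener; the real setup may differ).
my $stale = IO::Socket::INET->new(
    Listen    => 5,
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Proto     => 'tcp',
) or die "cannot create first listener: $!";
my $port = $stale->sockport;

# Stand-in for the re-executed foswiki-fcgi-pm: it tries to bind the
# same address while the old socket is still open.
my $fresh = IO::Socket::INET->new(
    Listen    => 5,
    LocalAddr => '127.0.0.1',
    LocalPort => $port,
    Proto     => 'tcp',
);
my $bind_error = $fresh ? '' : "$!";
print $fresh
  ? "second bind succeeded\n"
  : "second bind failed: $bind_error\n";

# Once the stale holder is gone, the same bind works again.
close $stale;
my $retry = IO::Socket::INET->new(
    Listen    => 5,
    LocalAddr => '127.0.0.1',
    LocalPort => $port,
    Proto     => 'tcp',
    ReuseAddr => 1,
);
print $retry
  ? "bind after close succeeded\n"
  : "bind after close failed: $!\n";
```

The second listener cannot be created for as long as the first one exists, which matches the observation that a child which never terminates prevents any new proc manager from starting.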

Basically, the reExec code in Foswiki::Engine::FastCGI is redundant, as there is a better FCGI::ProcManager class that comes with the Perl package and does a better job of housekeeping the working set of Foswiki backends.

This is the FCGI::ProcManager::Constrained class. This implementation does everything we want: it respawns after max_requests has been exceeded, but also, and much more valuably, after max_size has been exceeded. Respawning when a maximum memory usage has been reached makes more sense than limiting the number of requests a backend is allowed to serve. A backend is potentially okay to run forever as long as it does not eat up too much memory ... which FCGI::ProcManager::Constrained can monitor just fine; it will then kill off the grown client and spawn another.

Besides being more feature-complete than the Foswiki-only max_request code in Foswiki::Engine::FastCGI, it does its housekeeping more robustly and, in this specific setup, correctly, i.e. it does not kill off the master process on a max_request or max_size expiration.
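The recycling policy described above can be illustrated in a few lines of plain Perl. This is a sketch of the idea only, not the code of FCGI::ProcManager::Constrained: the limit names max_requests and max_size follow the text, while the helper should_recycle, the assumption that sizes are measured in KB, and all numbers are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any module: decide whether a worker
# should exit after finishing a request.
#   $served  - requests this worker has handled so far
#   $size_kb - current memory footprint of the worker in KB (assumed unit)
#   $limits  - hashref with max_requests and max_size (0 means no limit)
sub should_recycle {
    my ( $served, $size_kb, $limits ) = @_;
    my $max_requests = $limits->{max_requests} || 0;
    my $max_size     = $limits->{max_size}     || 0;

    return 'max_requests' if $max_requests && $served >= $max_requests;
    return 'max_size'     if $max_size     && $size_kb > $max_size;
    return '';    # keep serving
}

# Illustrative limits only.
my %limits = ( max_requests => 1000, max_size => 200_000 );

for my $case ( [ 10, 80_000 ], [ 1000, 80_000 ], [ 10, 250_000 ] ) {
    my ( $served, $size_kb ) = @$case;
    my $reason = should_recycle( $served, $size_kb, \%limits );
    printf "served=%d size=%dKB -> %s\n", $served, $size_kb,
      $reason ? "recycle ($reason)" : 'keep';
}
```

The point of the design is that only the worker that hits a limit exits; the manager process stays alive, keeps its listening socket, and forks a replacement.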

Removing these features from Foswiki::Engine::FastCGI and using the FCGI::ProcManager::Constrained class basically cures the instability problems on a production setup.

However, monitoring of LocalSite.cfg changes goes with it. Such changes would still require killing off the proc manager itself and respawning a new one, including the whole working pool of backends. So far I haven't found a way to do that reliably on a production system under heavy load. Which means: you will have to restart the Foswiki service at the operating system level, not as part of the web app. That seems to be the price to pay for more stability. Maybe there is a way to reliably refork the proc manager itself under heavy load.

Some other changes:

Checkins on branch Item13010

Beta release at FastCGIEngineContrib

-- MichaelDaum - 29 Aug 2014

Actually, the current patches still don't fully resolve the instabilities. I had to disable LocalSite.cfg (LSC) checking entirely to avoid "bad gateway" errors in nginx caused by foswiki.fcgi failing to reExec.

diff --git a/lib/Foswiki/Engine/FastCGI.pm b/lib/Foswiki/Engine/FastCGI.pm
index 4720aed..e62a106 100644
--- a/lib/Foswiki/Engine/FastCGI.pm
+++ b/lib/Foswiki/Engine/FastCGI.pm
@@ -115,7 +115,7 @@ sub run {
     die "LocalSite.cfg is not loaded - Check permissions or run configure\n"
       unless defined $localSiteCfg;
 
-    my $lastMTime = ( stat $localSiteCfg )[9];
+    #my $lastMTime = ( stat $localSiteCfg )[9];
 
     while ( $r->Accept() >= 0 ) {
         $manager && $manager->pm_pre_dispatch();
@@ -127,8 +127,8 @@ sub run {
             $this->finalize( $res, $req );
         }
 
-        my $mtime = ( stat $localSiteCfg )[9];
-        if ( $mtime > $lastMTime || $hupRecieved ) {
+        #my $mtime = ( stat $localSiteCfg )[9];
+        if ( $hupRecieved ) {
             $r->LastCall();
             if ($manager) {
                 kill SIGHUP, $manager->pm_parameter('MANAGER_PID');

-- MichaelDaum - 18 Nov 2014

Okay I think I've nailed it. Testers wanted. Note you need to switch to the Item13010 branch.

-- MichaelDaum - 19 Dec 2014

Made a release into Extensions/Testing

-- MichaelDaum - 14 Jan 2015
 
Topic revision: r9 - 06 Jul 2015, GeorgeClark