Item13010 (06 Jul 2015, GeorgeClark)

Item13010: fcgi unstable when run under ProcManager

Priority: Urgent
Current State: Closed
Released In: 2.0.0
Target Release: major

Applies To: Extension
Component: FastCGIEngineContrib
Branches: Item13010 master

Reported By: MichaelDaum
Waiting For:
Last Change By: GeorgeClark

When using FastCGIEngineContrib in an nginx setup, the Foswiki engine has to run as a separate process controlled by an init script of the operating system (in /etc/init.d/foswiki). Other web servers like apache or lighttpd manage the life cycle of the foswiki.fcgi backends on their own. Not so nginx: it only acts as a proxy and delegates the real work to daemon processes that run independently, wherever they are. In principle FastCGIEngineContrib is all set up to run that way by means of an FCGI::ProcManager implementation.

However, this mode of operation doesn't seem to be well tested: backends lock up under heavy load with a "bind failed, socket already in use" error message. Nginx then returns a "broken upstream" error message as it can't talk to the foswiki.fcgi backends anymore.

After investigation I've found out why this happens.

There is code in the Foswiki::Engine::FastCGI layer that reloads backends whenever LocalSite.cfg changes or after a certain number of requests. That causes problems when it happens under heavy load. In an FCGI::ProcManager setup, processes are created like this:

foswiki-fcgi-pm, the proc manager master process
a working set of foswiki-fcgi, these are called in turn to serve requests coming in via the web browser

The problem is that Foswiki::Engine::FastCGI will kill the master process foswiki-fcgi-pm, forcing it to reload LocalSite.cfg ... while there is an active foswiki-fcgi child process. Once that happens, communication between the two breaks and the foswiki-fcgi child does not terminate but keeps hanging around forever. In effect it keeps the open sockets occupied so that no new ones can be created, i.e. by a new foswiki-fcgi-pm being re-executed by Foswiki::Engine::FastCGI. That call then aborts, and that's where the web server stops being able to talk to an upstream service.

Basically, the reExec code in Foswiki::Engine::FastCGI is redundant, as there is an FCGI::ProcManager class coming with the perl package that does a better job of housekeeping the working set of Foswiki backends.

This is the FCGI::ProcManager::Constrained class. This implementation does everything we want: respawn after max_requests have been exceeded but also - and much more valuable - respawn after max_size has been exceeded. Respawning when a maximum memory usage has been reached makes more sense than limiting the number of requests a backend is allowed to serve. A backend is potentially okay to run forever as long as it does not eat up too much memory ... which FCGI::ProcManager::Constrained can monitor just fine; it will then kill off the grown client and spawn another.

Besides being more feature-complete than the Foswiki-only max_request code in Foswiki::Engine::FastCGI, it does its housekeeping more robustly and - in this specific setup - correctly, i.e. it does not kill off the master process on a max_request or max_size expiration.
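
To illustrate the policy described above, here is a minimal sketch in plain Perl of the per-worker decision: after each request a worker checks a request limit and a memory limit and retires only itself, leaving the master alone. This is not the code of FCGI::ProcManager::Constrained. The helper name worker_should_exit, its argument names, and the memory figures are assumptions made up for the illustration, and a limit of 0 is assumed to mean "no limit".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of FCGI::ProcManager::Constrained):
# decide whether a worker should exit after finishing a request.
# Returns the name of the limit that was hit, or '' to keep serving.
sub worker_should_exit {
    my (%arg) = @_;
    my $max_requests = $arg{max_requests} || 0;    # 0 = no limit (assumption)
    my $max_size     = $arg{max_size}     || 0;    # 0 = no limit (assumption)

    return 'max_requests'
      if $max_requests && $arg{requests_served} >= $max_requests;
    return 'max_size'
      if $max_size && $arg{size_kb} > $max_size;
    return '';
}

# Simulated worker loop. The sizes are made-up memory readings in KB,
# one per request. Only the worker retires; the manager is never signalled.
my @sizes_kb = ( 40_000, 55_000, 90_000, 130_000 );
my $served   = 0;
my $reason   = '';
for my $size_kb (@sizes_kb) {
    $served++;
    $reason = worker_should_exit(
        requests_served => $served,
        size_kb         => $size_kb,
        max_requests    => 100,
        max_size        => 100_000,
    );
    last if $reason;
}
print "worker retires after $served requests ($reason)\n";
```

The point of the sketch is that the check lives in the worker and ends only that worker, so the manager process and its listening socket stay in place while a replacement is spawned.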

Removing these features from Foswiki::Engine::FastCGI and using the FCGI::ProcManager::Constrained class basically cures the instability problems on a production setup.

However, with it goes monitoring LocalSite.cfg changes. These would still require killing off the proc manager itself and respawning a new one, including the whole working pool of backends. For now I haven't found a way to do that reliably on a production system under heavy load. Which means: you will have to restart the Foswiki service at the operating system level, not as part of the web app. That seems to be the price to pay for more stability. Maybe there is a way to reliably refork the proc manager itself under heavy load.
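
For reference, the change detection being given up here boils down to comparing the modification time of LocalSite.cfg between requests. A minimal standalone sketch follows, assuming a temporary file in place of the real LocalSite.cfg. The helper name config_changed is hypothetical and is not part of Foswiki::Engine::FastCGI.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns 1 if $file has a newer mtime than
# $last_mtime, and 0 otherwise (including when the file cannot be stat'ed).
sub config_changed {
    my ( $file, $last_mtime ) = @_;
    my $mtime = ( stat $file )[9];
    return ( defined($mtime) && $mtime > $last_mtime ) ? 1 : 0;
}

# A temporary file stands in for LocalSite.cfg in this sketch.
my ( $fh, $cfg ) = tempfile( UNLINK => 1 );
close $fh;
my $last_mtime = ( stat $cfg )[9];

print "changed before edit: ", config_changed( $cfg, $last_mtime ), "\n";

# Simulate an edit by moving the modification time forward.
utime $last_mtime + 10, $last_mtime + 10, $cfg
  or die "utime failed: $!";
print "changed after edit: ", config_changed( $cfg, $last_mtime ), "\n";
```

Detecting the change is the easy part. As described above, the unsolved part is replacing the proc manager and its backends safely once a change has been seen.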

Some other changes:

removed taint mode flag when starting and respawning backends; taint mode is busted. See also RemoveTaintCheckingFromFoswiki

Checkins on branch Item13010

Beta release at FastCGIEngineContrib

-- MichaelDaum - 29 Aug 2014

Actually the current patches still don't fully resolve instabilities. I had to disable LSC checking totally to avoid "bad gateway" errors in nginx due to foswiki.fcgi failing to reExec.

diff --git a/lib/Foswiki/Engine/FastCGI.pm b/lib/Foswiki/Engine/FastCGI.pm
index 4720aed..e62a106 100644
--- a/lib/Foswiki/Engine/FastCGI.pm
+++ b/lib/Foswiki/Engine/FastCGI.pm
@@ -115,7 +115,7 @@ sub run {
     die "LocalSite.cfg is not loaded - Check permissions or run configure\n"
       unless defined $localSiteCfg;
 
-    my $lastMTime = ( stat $localSiteCfg )[9];
+    #my $lastMTime = ( stat $localSiteCfg )[9];
 
     while ( $r->Accept() >= 0 ) {
         $manager && $manager->pm_pre_dispatch();
@@ -127,8 +127,8 @@ sub run {
             $this->finalize( $res, $req );
         }
 
-        my $mtime = ( stat $localSiteCfg )[9];
-        if ( $mtime > $lastMTime || $hupRecieved ) {
+        #my $mtime = ( stat $localSiteCfg )[9];
+        if ( $hupRecieved ) {
             $r->LastCall();
             if ($manager) {
                 kill SIGHUP, $manager->pm_parameter('MANAGER_PID');

-- MichaelDaum - 18 Nov 2014

Okay I think I've nailed it. Testers wanted. Note you need to switch to the Item13010 branch.

-- MichaelDaum - 19 Dec 2014

Made a release into Extensions/Testing

-- MichaelDaum - 14 Jan 2015

ItemTemplate

Summary	fcgi unstable when run under ProcManager
ReportedBy	MichaelDaum
Codebase
SVN Range
AppliesTo	Extension
Component	FastCGIEngineContrib
Priority	Urgent
CurrentState	Closed
WaitingFor
Checkins	FastCGIEngineContrib:ed689e67f129 FastCGIEngineContrib:37d437cfe5b2 FastCGIEngineContrib:9594a731a706 FastCGIEngineContrib:1e648752dc9f FastCGIEngineContrib:9d9a8b0b1024 FastCGIEngineContrib:34636bac681b FastCGIEngineContrib:923c60255b9d distro:c59dd2375f74 distro:9000a3dae439 distro:767cfed05a95 distro:89943e3756ba distro:29477d23f76b distro:9373e87be146 distro:54d3add28e0b
TargetRelease	major
ReleasedIn	2.0.0
CheckinsOnBranches	Item13010 master
trunkCheckins
masterCheckins	FastCGIEngineContrib:ed689e67f129 FastCGIEngineContrib:37d437cfe5b2 FastCGIEngineContrib:9594a731a706 FastCGIEngineContrib:1e648752dc9f FastCGIEngineContrib:9d9a8b0b1024 FastCGIEngineContrib:34636bac681b FastCGIEngineContrib:923c60255b9d distro:c59dd2375f74 distro:9000a3dae439 distro:767cfed05a95 distro:89943e3756ba distro:29477d23f76b distro:9373e87be146 distro:54d3add28e0b
ItemBranchCheckins	FastCGIEngineContrib:ed689e67f129 FastCGIEngineContrib:37d437cfe5b2 FastCGIEngineContrib:9594a731a706 FastCGIEngineContrib:1e648752dc9f FastCGIEngineContrib:9d9a8b0b1024 FastCGIEngineContrib:34636bac681b FastCGIEngineContrib:923c60255b9d
Release01x01Checkins