gzip_cnc - an Apache handler for serving compressed content
What is this all about?
gzip_cnc is a CGI script written in Perl. It has to be embedded into the Apache web server as a handler and is then able to serve the requested static page content in gzip-compressed form to (sufficiently capable) web browsers.
The program
- uses Content Negotiation to find out whether the web browser is ready to receive gzip-encoded content and
- stores the most up-to-date version of the content for each served document in compressed form within its own cache tree.
From all these attributes the program's name is derived.
(A cryptic name may be neither easy to remember nor easy to pronounce - but as a unique term for search engines it still has its advantages ...)
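To make the mechanism concrete, here is a minimal Perl sketch of those two steps (this is not gzip_cnc's actual source; the cache location under the document root and the reduced error handling are simplifying assumptions):

    #!/usr/bin/perl -w
    # Minimal sketch of the idea behind gzip_cnc - not the original source.
    # Assumption: the cache tree lives under $DOCUMENT_ROOT/.gzip_cache
    # and its directories already exist.
    use strict;
    use Compress::Zlib;     # alternative: pipe the file through an external gzip

    my $original  = $ENV{'PATH_TRANSLATED'};   # static file Apache mapped the request to
    my $cachefile = $ENV{'DOCUMENT_ROOT'} . '/.gzip_cache'
                  . $ENV{'PATH_INFO'} . '.gz';

    # Content Negotiation: did the browser announce gzip support?
    if (($ENV{'HTTP_ACCEPT_ENCODING'} || '') =~ /\bgzip\b/) {
        # cache check: (re)compress only if the original is newer than the cache copy
        if (!-e $cachefile or (stat $original)[9] > (stat $cachefile)[9]) {
            local $/;                          # slurp the whole file
            open my $in, '<', $original or exit;
            my $gz = gzopen($cachefile, 'wb') or exit;
            $gz->gzwrite(scalar <$in>);
            $gz->gzclose();
        }
        print "Content-Type: text/html\n",     # the one (configurable) document type
              "Content-Encoding: gzip\n",
              "Vary: Accept-Encoding\n\n";
        open my $out, '<', $cachefile or exit;
        binmode $out; binmode STDOUT;
        print while <$out>;
    } else {
        # fallback: serve the uncompressed original, as Apache itself would
        print "Content-Type: text/html\n\n";
        open my $out, '<', $original or exit;
        print while <$out>;
    }

The real program additionally creates missing cache directories, can use an external gzip program instead of Compress::Zlib, and logs what it does.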
Why gzip_cnc - there is mod_gzip already?
Right. If you want your Apache web server to deliver page contents - especially dynamically created ones - in compressed form, then the Apache module mod_gzip is the best solution (known to me). So if you
- are a provider yourself, or at least
- use server hosting and can thus extend your web server with a third-party module like mod_gzip,
then mod_gzip will serve you best.
But what if you have a normal home page that mainly consists of static pages, hosted on ordinary web space with a handful of add-ons like CGI and .htaccess - and your provider isn't willing to install mod_gzip on this server? This is the scenario gzip_cnc is made for, typical of many smaller web sites.
Which requests can be compressed by gzip_cnc?
gzip_cnc can only compress requests for static content.
This is a fundamental restriction of the program itself: to compress dynamic content it would need to get its hands on that content after its creation but before it is served by the web server. mod_gzip can do this because it is an Apache module; gzip_cnc is 'only' an Apache handler, and it makes little sense to teach this handler to partially emulate the functions of other handlers (like CGI or SSI) that the Apache server itself handles much more efficiently.
Which file types can be handled by gzip_cnc?
The current version of gzip_cnc compresses only one (configurable) document type, defaulting to HTML documents - which make up the largest part of the files available on the Web.
The reason any restriction exists here at all is that gzip_cnc itself needs to know exactly what it is compressing (because it has to send this information to the browser as well) but, being 'only' a handler and not a module with direct access to the Apache configuration, is not told by the Apache server how it would classify the content in question according to that configuration.
It is easy to change this document type, and it would not be too difficult to implement more document types - it would amount to using a mapping table similar to the one the Apache server itself already uses. But you are free to run several instances of gzip_cnc simultaneously and configure each of them to handle a MIME type of your choice. And when configuring via environment variables this can even be done with a single gzip_cnc instance.
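To illustrate the multiple-instance approach, a second instance dedicated to CSS files might be declared in .htaccess roughly like this (the Action and AddHandler lines are standard Apache directives; the script path and the SetEnv variable name are purely hypothetical placeholders - the gzip_cnc documentation names the real settings):

    # route *.css requests to a second gzip_cnc instance ...
    AddHandler gzip-cnc-css .css
    Action gzip-cnc-css /cgi-bin/gzip_cnc_css.pl
    # ... and tell it which Content-Type to announce (hypothetical variable name)
    SetEnv GZIP_CNC_MIME_TYPE text/css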
Am I able to use gzip_cnc for my web space?
Using gzip_cnc calls for a number of prerequisites:
- availability of a Perl5 interpreter - as gzip_cnc is a Perl script.
- ability to execute your individual CGI applications - as gzip_cnc is one of them.
- some compression tool. There are several possibilities for this:
  - the Perl module Compress::Zlib installed on the server, or
  - a gzip compression program, which
    - for UNIX-type operating systems often is already available or may be built from the source code, and
    - for Windows-type operating systems may be installed as the program file gzip.exe.
- an Apache web server - as the way gzip_cnc hooks into the handling of the page requests to be served in compressed form is Apache-specific.
- the ability to extend the web server configuration for your own URL tree, probably by using .htaccess files - as gzip_cnc has to be registered in the Apache configuration as the program responsible for handling the page requests to be compressed (a sketch of such an entry follows after this list).
- the ability to use directives of the FileInfo class within .htaccess files - as the directive that registers gzip_cnc as the responsible handler belongs to this class. To enable this, AllowOverride FileInfo must have been specified for the corresponding web space in the central Apache configuration.
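To give an idea of what such an entry looks like: the following .htaccess sketch uses the standard Apache directives AddHandler and Action (both of the FileInfo class); the handler name and the script path are placeholders, not gzip_cnc's documented values.

    # declare a handler name for HTML files ...
    AddHandler gzip-cnc-handler .html
    # ... and route every request for such a file to the CGI script
    Action gzip-cnc-handler /cgi-bin/gzip_cnc.pl

With an entry of this kind Apache no longer serves matching files itself but invokes the script, passing it the requested file's path in PATH_TRANSLATED, among other variables.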
At first sight all this may sound awfully complicated, and this combination of requirements may look rather improbable. But UNIX (including Linux variants of any kind), Apache and Perl are a reliable team and available at most providers; support for individual CGI scripts and .htaccess files (the latter often used for access control to protected areas of the web space) are common add-on features, rarely available with free web space but normally included even in the smallest package offered by commercial providers.
The most likely problem seems to be a restriction of the directives allowed within .htaccess files - this should be clarified first, either by asking your provider or simply by testing.
How much additional load will the use of gzip_cnc put on the server?
You wouldn't want to use it on a commercial high-traffic server - simply because in that case you would rather use mod_gzip anyway.
Basically, you replace each static page access by a call to a CGI script that does little more than read a file's content and serve it to the browser, just like the web server would normally do.
The trick is that
- actually compressing data is being performed very rarely, due to the cache model, and
- serving compressed content whose size is only 20-40% of the uncompressed original must be faster than always serving the larger original files.
On the other hand, you pay the price of one CGI invocation for every page request - so you want a fast CGI execution model on your server. On my web space the Apache server uses mod_fastcgi, and even though the machine running this server is merely a Pentium 400 it takes
- no more than about 0.04 CPU seconds for each page being served (nearly independent of the document size) and
- between one and five times that for each page being compressed (compressing a 400 kb page down to 80 kb took 0.21 CPU seconds, which is the maximum value I have experienced on my domain).
The logging function of gzip_cnc will tell you the values that hold true for your machine.
My experience is that even though I changed some of the most frequently accessed pages of my web space about once a day during the development phase of gzip_cnc,
- still 97% of all accesses simply take compressed content from the cache (or serve the uncompressed original files) and
- only 3% lead to creating or updating cache file content
- and over time this rate should decrease further (depending on how often you change your documents' content).
So the costs of building up the cache are close to negligible, and only the costs of serving pages, i.e. of invoking the CGI script, really count.
How much additional disk space will gzip_cnc's cache use on the server?
A good estimate of the disk space occupied by the cache tree is one third of the disk space used by all the uncompressed original HTML files.
The cache tree will contain two types of objects:
- compressed files: small HTML files (10 kb and less) can be expected to compress by about 40-70% and larger files by about 60-90%, depending on the degree of redundancy inside these documents. Normal text files usually compress by a factor of three; HTML documents that repeatedly contain similar tag patterns, like tables, may well compress by a factor of 10 or even better.
- directories, which don't compress at all, as they hold about the same number of entries as their original counterparts in the web space.
So the exact amount of space your cache tree will use depends slightly on the number of directories you have but mainly on the degree of redundancy inside your documents.
And as a rule of thumb: if your cache needs one third of the original web space (including the overhead for the directories), then your visitors will be served one third of the original traffic volume (including the overhead for the HTTP headers) and thus experience transfer times about three times shorter than normal.
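A quick worked example with assumed round numbers: a 60 kb HTML page that compresses down to 20 kb travels over a 56 kbit/s modem line (roughly 6 kb of effective throughput per second) in about 3.3 seconds instead of about 10 - both the transferred volume and the transfer time shrink to one third.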
Your mileage may of course vary if your web space offers a large percentage of other file types than static HTML.
(Michael Schröpl, 2004-01-11)