
Savings by serving compressed data

Introduction

How much transfer volume can actually be saved by serving documents in compressed form? Which document types are worth compressing at all? And what does this mean for a real, existing domain? These and similar questions are to be covered by the present document.

An example of a traffic analysis

To illustrate the subject I want to offer a practical example based on my own domain.

The following calculations aren't required in this depth for every user - I simply wanted to know more precisely what really happens inside my domain and how efficiently gzip_cnc can work for it at all.

Therefore I analyzed the traffic over a sufficiently long period of time (still without the use of any compression tool), based on the results of a Webalizer report and a little post-processing (using standard spreadsheet software), and came to these results:

Document Type               Hits     pct  KBytes     pct  Traffic     pct     Size  Packaging
Static HTML pages           9853   16.30   72944   55.03    82797   42.90     8605      11.90
Images (GIF)               31568   52.24   18113   13.66    49681   25.74     1612      63.54
Dynamic HTML pages (SSI)    6373   10.55   23161   17.47    29534   15.30     4745      21.58
Cascading Style Sheets      6860   11.35    5386    4.06    12246    6.35     1828      56.02
Images (JPG)                3620    5.99    3927    2.96     7547    3.91     2135      47.97
ZIP Archives (Downloads)      26    0.04    5587    4.21     5613    2.91   221066       0.46
CGI Programs                 668    1.11    2191    1.65     2859    1.48     4383      23.36
JavaScript Code             1290    2.13    1123    0.85     2413    1.25     1915      53.46
Images (FavIcon)              57    0.09     107    0.08      164    0.08     2946      34.76
Text Files (robots.txt)      109    0.18       0    0.00      109    0.06     1024     100.00
Images (PNG)                   7    0.01      25    0.02       32    0.02     4681      21.88

(Hits and KBytes as taken from the Webalizer report, each followed by its share of the total in pct; Traffic in KBytes, also with its share in pct; Size = average bytes per request; Packaging = share of the traffic in pct.)

At first sight it may seem odd to add up 'Hits' and 'KBytes' to 'Traffic'.

But in fact this matches reality quite well. No matter how much usable data is actually transferred, each HTTP request involves an HTTP header sent by the browser and an HTTP header sent by the server in its response - each of them typically amounting to about 250 to 450 bytes. Now add a handful of bytes for packaging on the lower protocol layers (TCP/IP packets don't contain only usable data either), and the basic overhead of each HTTP request will easily reach the range of 800-1000 bytes.

So note that at the HTTP level there is no such thing as a 'small' GIF image of about 50 bytes: such a file requires twenty times its own size as packaging for request and response.
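The 'Traffic', 'Size' and 'Packaging' columns of the table can be reproduced from Webalizer's plain 'Hits' and 'KBytes' figures with a few lines of code. The following is a minimal sketch in Perl, assuming (as above) roughly 1 KB of packaging per request and using just two of the table rows as sample data:

  #!/usr/bin/perl -w
  # Minimal sketch: derive the 'Traffic', 'Size' and 'Packaging' columns of
  # the table above from Webalizer's 'Hits' and 'KBytes' figures, assuming
  # roughly 1 KB of packaging (HTTP headers plus TCP/IP overhead) per request.
  use strict;

  my %report = (                                # hits, usable KBytes (Webalizer)
      'Static HTML pages' => [  9853, 72944 ],
      'Images (GIF)'      => [ 31568, 18113 ],
  );

  foreach my $type (sort keys %report) {
      my ($hits, $kbytes) = @{ $report{$type} };
      my $traffic   = $kbytes + $hits;          # usable data plus ~1 KB per hit
      my $size      = $traffic * 1024 / $hits;  # average bytes per request
      my $packaging = 100 * $hits / $traffic;   # overhead share of the traffic
      printf "%-20s %6d KB traffic  %6.0f bytes/request  %5.2f pct packaging\n",
             $type, $traffic, $size, $packaging;
  }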

Obtainable gains

gzip_cnc can compress the contents of static HTML pages only. For my domain this amounts to 55 pct of the usable data and 43 pct of the transfer volume. On the other hand it means that the handler has to be activated for only one out of six requests (about 16 pct of all hits).

But the named quotas of the data volume don't shrink to zero by being served in compressed form. The obtainable savings per document depend upon numerous factors; the present pages of the gzip_cnc documentation itself may serve as reference values. These pages are between 10 and 20 kB in size (thus twice as voluminous as the average document of my domain according to the table above); their content is reduced by 65-72 pct, i. e. by at least two thirds. The most voluminous document by far, the gzip_cnc source code, has the best compression rate of 82 pct.

Overall an expectation of 70 pct for 'normal' HTML documents is reasonable - in case of bad code quality (like that produced by many code generators) and a high percentage of tables it may well be up to 80 pct or more. (In my domain I have a file that is reduced by no less than 96 pct, i. e. by a factor of 25, by compression ...)
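Whoever wants to estimate the obtainable compression rate for their own documents beforehand can do so without any server at all. The following minimal sketch uses the CPAN module Compress::Zlib to gzip a file in memory and print the saving (the file name is taken from the command line):

  #!/usr/bin/perl -w
  # Minimal sketch: estimate the obtainable compression rate for a document
  # by gzip-compressing it in memory and comparing the sizes.
  use strict;
  use Compress::Zlib qw(memGzip);

  my $file = shift or die "usage: $0 <file>\n";
  open my $fh, '<', $file or die "cannot read $file: $!\n";
  binmode $fh;
  my $data = do { local $/; <$fh> };            # slurp the whole file
  close $fh;
  length $data or die "$file is empty\n";

  my $compressed = memGzip($data);
  printf "%s: %d -> %d bytes (%.1f pct saved)\n",
         $file, length($data), length($compressed),
         100 * (1 - length($compressed) / length($data));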

For my own domain, saving 70 pct per HTML document would mean saving nearly 40 pct of all usable document contents (which providers may possibly use as the measure for traffic-limited web space) and nearly 30 pct of the real transfer volume (and thus more than 40 pct shorter response delays for my visitors). I consider this an obtainable gain for the majority of usual small to mid-size homepages.
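In numbers, derived from the table above: 70 pct of 55.03 pct gives about 38.5 pct of the usable data, and 70 pct of 42.90 pct gives about 30 pct of the total transfer volume.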

Comparing the efficiency to mod_gzip

The Apache module mod_gzip in particular would additionally be able to handle the output of dynamic HTML documents (CGI applications and Server Side Includes) and could therefore compress (considering HTML documents only) 74 pct of the usable data (55.03 + 17.47 + 1.65 pct for static HTML, SSI and CGI output) resp. 60 pct of the transfer volume (42.90 + 15.30 + 1.48 pct).

Assuming again a compression rate of 70 pct for HTML documents, savings of 53 pct of the document contents resp. 42 pct of the actual transfer volume (and thus more than 70 pct shorter response delays for the visitors) would be possible.

Document types other than HTML

The numbers of the example above show that the limitation of gzip_cnc to HTML documents restricts its efficiency only marginally.

As the Netscape 4.x browser (currently still in use) cannot correctly handle JavaScript code or Cascading Style Sheets served from separate files in compressed form (although it requests them in compressed form via its HTTP headers), it wouldn't be without risk with respect to the content to let gzip_cnc process these document types as well. But even if one did serve such files in compressed form, the quota of compressible requests in my domain would only rise from 55 pct to 60 pct (usable data) resp. from 43 pct to 51 pct (transfer volume).

Coping with dynamic HTML pages

The numbers listed in the example calculation above show that within my domain I am using a lot of SSI (no less than one quarter of all HTML content), which cannot profit from gzip_cnc.

Originally the present pages of the gzip_cnc documentation were SSI documents as well (dynamically including and conditionally computing the navigation bar on the server side). To be able to serve these pages (whose content doesn't change dynamically) in compressed form, I wrote a small Perl script that requests these SSI documents via HTTP (using the CPAN module LWP::Simple) and stores the results as static files on my local computer. These static files can subsequently be uploaded to my web space and then be served in compressed form by gzip_cnc.
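A minimal sketch of such a script - base URL, target directory and page list are placeholders for illustration only:

  #!/usr/bin/perl -w
  # Minimal sketch: fetch SSI documents via HTTP (so that the server expands
  # all includes) and store the resulting static files locally for upload.
  # Base URL, target directory and page list are placeholders only.
  use strict;
  use LWP::Simple qw(getstore is_success);

  my $base   = 'http://www.example.org/';              # hypothetical base URL
  my $target = './static';                             # hypothetical output dir
  my @pages  = qw(index.htm feature.htm install.htm);  # hypothetical page list

  foreach my $page (@pages) {
      my $status = getstore($base . $page, "$target/$page");
      is_success($status)
          or warn "failed to fetch $base$page (HTTP status $status)\n";
  }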

I mention this just as an example to show that in some cases it may even be reasonable to adapt the type of the existing documents to technical restrictions, in order to reduce the transfer volume and allow for a better response time for your visitors. For myself, I will thoroughly reconsider the use of SSI for static content in the future.

Coping with JavaScript and Cascading Style Sheets

Originally the CSS definitions of the present gzip_cnc documentation pages were contained in a common external CSS file and referenced using the HTML tag <link rel="stylesheet" href="..."> (i. e. loaded by the browser via an additional HTTP request). This allowed a browser to keep these CSS definitions separately inside its cache; it also meant that these CSS definitions could not be served in compressed form (out of consideration for poor old buggy Netscape 4).

In the meantime the CSS definitions are an integral part of each of these documents. Embedding them is performed - just like for the navigation bar - via Server Side Includes. Furthermore, the CSS definitions are now spread over several files, so that each document has to include nothing but the definitions it actually uses. Being part of an HTML document now, the CSS definitions can be served in compressed form as well; this shrinks them to about a third of their original size, and the result is already smaller than a pair of HTTP headers for request and response.

For JavaScript the same would apply analogously.

Browser handling of cache files

Not only for small documents (like JavaScript or Cascading Style Sheets) but also for files that are requested many times (JPGs are often used as common background images for a large number of pages), the average portion of 'packaging', i. e. the HTTP and TCP overhead for the requests, may well reach 50 pct and more.

Especially responsible for this are innumerable browser requests checking the validity of cached content. For a large part of all requests - up to 30-40 pct! - the dialog performed between browser and server runs roughly like this:

Browser: "I already have a copy of this document in my cache, dated X (If-Modified-Since) - has it changed since then?"
Server: "No, it hasn't (304 Not Modified)."

No document content is transferred in such an exchange, but the full packaging overhead for request and response headers occurs nevertheless.

Depending on the configured strategy, the browsers currently available will check the validity of their cache content on every visit to the page, once per browser session, or automatically, i. e. based upon expiry information supplied by the server.

Without doubt the last one of these settings would be the most reasonable one from the point of view of the page provider - if the browser understands what to do.

But to give the browser any clue at all whether to ask the server or not, this server has to send some information in advance (i. e. when serving the data that will be kept inside the browser cache) about how long these data should be considered valid. If the browser is able and willing to understand this information, it will indeed send no further (superfluous) requests to the server during the corresponding time interval.

And indeed, the most effective compression of a request is avoiding it entirely. Therefore gzip_cnc supplies at this point the function of a corresponding server configuration and sends appropriate HTTP header information to the browser, suggesting that it keep the served page in its cache for some (configurable) time interval (24 hours is the preselected value).
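What sending such expiry information from a CGI script can look like is shown by the following minimal sketch - this is not gzip_cnc's actual code, and the 24 hour interval merely mirrors the preselected value mentioned above:

  #!/usr/bin/perl -w
  # Minimal sketch: ask the browser to keep the response in its cache for
  # 24 hours by sending an Expires header. Not gzip_cnc's actual code.
  use strict;
  use HTTP::Date qw(time2str);     # part of the libwww-perl distribution

  my $lifetime = 24 * 60 * 60;     # cache lifetime in seconds (configurable)

  print "Content-Type: text/html\r\n";
  print "Content-Encoding: gzip\r\n";
  print "Expires: ", time2str(time + $lifetime), "\r\n";
  print "\r\n";
  # ... followed by the (gzip-compressed) document content ...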

The current version of gzip_cnc itself doesn't yet master the handling of a conditional GET (which normally would be handled by the Apache web server): currently the date submitted by the browser is ignored, and the content of the requested document is always served as the response. Such requests aren't sent frequently for HTML documents - they are much more likely for files that are referenced again and again in many other documents: images, CSS and JavaScript, none of which can be handled by gzip_cnc anyway at the time being.

Maybe some future version of the program will handle this as well ... it is nothing but a little boring calculation (the date value inside the HTTP header If-Modified-Since isn't exactly formatted for easy evaluation).
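For the sake of illustration, a minimal sketch of what such conditional GET handling could look like in a CGI environment - this is not part of gzip_cnc; the file name is a placeholder, and the CPAN module HTTP::Date takes care of the 'boring calculation':

  #!/usr/bin/perl -w
  # Minimal sketch of conditional GET handling (not part of gzip_cnc yet):
  # answer with '304 Not Modified' if the browser's cached copy is still
  # current, otherwise serve the document with a Last-Modified header.
  use strict;
  use HTTP::Date qw(str2time time2str);

  my $file  = 'index.htm';                    # hypothetical document
  my $mtime = (stat $file)[9]
      or die "cannot stat $file: $!\n";
  my $since = str2time($ENV{HTTP_IF_MODIFIED_SINCE} || '');

  if (defined $since and $mtime <= $since) {
      # The browser's copy is up to date: send headers only, no body.
      print "Status: 304 Not Modified\r\n\r\n";
  }
  else {
      print "Content-Type: text/html\r\n";
      print "Last-Modified: ", time2str($mtime), "\r\n";
      print "\r\n";
      # ... serve the (compressed) document content here ...
  }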

(Michael Schröpl, 2002-09-08)