Discussion:
Whitespace removal in html generated by cgi
Gregory Toomey
2003-11-16 13:22:37 UTC
A few weeks ago a question was asked in this group about removing whitespace from html, in particular from html generated by cgi.
Here's a simple technique I developed for Linux:


1. A sample cgi. Bash uses the <<'delimiter' construct to pass the input verbatim to Perl. The output of the cgi is piped to delspace.pl, our whitespace munger.

#!/bin/bash
/usr/bin/perl <<'EOFPERL' | ./delspace.pl
#your cgi goes here
use strict;
$|++;
print "Content-type:text/html\n\n";
print " <h1> This is a test </h1> \n";
print " some more text\n";

EOFPERL


2. Now here's delspace.pl, the whitespace remover. It may be a little buggy, but it seems to work for my simple html.

#!/usr/bin/perl
my $count=0;
while(<>){
# remove leading whitespace
s/^\s+//;

# remove trailing whitespace (including the newline)
s/\s+$//;

# change internal whitespace to single space
s/\s+/ /g;

# remove simple one line comments
s/<!--.*?-->//;

# another simple whitespace removal
s/> </></g;

#newlines are not needed
#except for Content-type-text/html\n\n
# which occurs at the start
print;
print "\n" if $count++<4;
}



gtoomey
Ben Morrow
2003-11-16 15:57:50 UTC
[please limit your line lengths to 72 characters]
[please make sure your blank lines are *actually* blank]
Post by Gregory Toomey
A few weeks ago a question was asked in this group about removing
whitespace from html, in particular from html generated by cgi.
1. A sample cgi. Bash uses the <<'delimiter' construct to pass the
input verbatim to Perl. The output of the cgi is piped to
delspace.pl, our whitespace munger.
#!/bin/bash
There is absolutely no need to use bash. If nothing better, use the
techniques described in perldoc perlipc "Safe Pipe Opens". Better, use
a tied filehandle or a PerlIO layer on STDOUT. Or simply generate the
thing without superfluous whitespace in the first place.
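
A minimal sketch of the tied-filehandle route, assuming a hypothetical
class BufferedOut and a hypothetical helper squeeze_body (neither name
comes from this thread). It captures everything the script prints,
then emits the headers untouched and the body with runs of blanks
collapsed, all without bash:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package BufferedOut;    # hypothetical class: buffers all printed output

sub TIEHANDLE { my ($class) = @_; my $buf = ''; return bless \$buf, $class }
# Note: this simple PRINT ignores the $, and $\ separators.
sub PRINT     { my $self = shift; $$self .= join('', @_); return 1 }
sub PRINTF    { my $self = shift; my $fmt = shift; $$self .= sprintf($fmt, @_); return 1 }

package main;

# Hypothetical helper: collapse runs of spaces and tabs, then drop the
# single blank left at the start or end of each line.
sub squeeze_body {
    my ($body) = @_;
    $body =~ s/[ \t]+/ /g;
    $body =~ s/^ | $//mg;
    return $body;
}

tie *STDOUT, 'BufferedOut';

# your cgi goes here
print "Content-type: text/html\n\n";
print "   <h1>  This is a test  </h1>   \n";

# Copy the buffer out, then release the tie so printing works normally.
my $captured = do { my $obj = tied *STDOUT; $$obj };
untie *STDOUT;

# The first blank line separates the HTTP headers from the body.
my ($head, $body) = split /\n\n/, $captured, 2;
print $head, "\n\n", squeeze_body(defined $body ? $body : '');
```

Buffering the whole response delays the first byte until the script
finishes, so this is a trade-off rather than a free improvement.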

<snip>
Post by Gregory Toomey
2. Now here's delspace.pl, the whitespace remover. It may be a
little buggy, but it seems to work for my simple html.
#!/usr/bin/perl
my $count=0;
while(<>){
# remove leading whitespace
s/^\s+//;
# remove trailing whitespace
s/\s+$//;
# change internal whitespace to single space
s/\s+/ /g;
# remove simple one line comments
s/<!--.*?-->//;
# another simple whitespace removal
s/> </></g;
You realise this changes the presentation of the HTML?
Post by Gregory Toomey
#newlines are not needed
#except for Content-type-text/html\n\n
# which occurs at the start
print;
print "\n" if $count++<4;
Why 4?
Post by Gregory Toomey
}
'A little buggy'? The whole idea's fundamentally flawed: you need to
start by separating the HTTP from the HTML from the data, which means
using an HTML parsing module. For instance, what about this:

<link
rel=stylesheet
type="text/css"
href="..."/>

Or this:

Status: 302 Found
Location: ...
Content-encoding: ...
Content-type: text/html
Content-length: ...

<html>...

Or this:

<pre>
#!/usr/bin/perl

use warnings;
use strict;

print "Hello world\n";
</pre>

Ben
--
I've seen things you people wouldn't believe: attack ships on fire off the
shoulder of Orion; I've watched C-beams glitter in the darkness near the
Tannhauser Gate. All these moments will be lost, in time, like tears in rain.
Time to die. |-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-| ***@morrow.me.uk
Gregory Toomey
2003-11-16 20:55:35 UTC
Post by Ben Morrow
[please limit your line lengths to 72 characters]
[please make sure your blank lines are *actually* blank]
Post by Gregory Toomey
A few weeks ago a question was asked in this group about removing
whitespace from html, in particular from html generated by cgi.
1. A sample cgi. Bash uses the <<'delimiter' construct to pass the
input verbatim to Perl. The output of the cgi is piped to
delspace.pl, our whitespace munger.
#!/bin/bash
There is absolutely no need to use bash. If nothing better, use the
techniques described in perldoc perlipc "Safe Pipe Opens". Better, use
a tied filehandle or a PerlIO layer on STDOUT. Or simply generate the
thing without superfluous whitespace in the first place.
The technique I described lets you take an existing cgi and change just 2 lines at the top and one at the bottom.
What you described will work, but it's more complicated.
Post by Ben Morrow
<snip>
Post by Gregory Toomey
2. Now here's delspace.pl, the whitespace remover. It may be a
little buggy, but it seems to work for my simple html.
#!/usr/bin/perl
my $count=0;
while(<>){
# remove leading whitespace
s/^\s+//;
# remove trailing whitespace
s/\s+$//;
# change internal whitespace to single space
s/\s+/ /g;
# remove simple one line comments
s/<!--.*?-->//;
# another simple whitespace removal
s/> </></g;
You realise this changes the presentation of the HTML?
Post by Gregory Toomey
#newlines are not needed
#except for Content-type-text/html\n\n
# which occurs at the start
print;
print "\n" if $count++<4;
Why 4?
Post by Gregory Toomey
}
'A little buggy'? The whole idea's fundamentally flawed: you need to
start by separating the HTTP from the HTML from the data, which means
It worked with all the cgis I've created.
It's just a simple, pragmatic way to solve a real-world problem.


gtoomey
Eric J. Roode
2003-11-16 22:02:33 UTC
Post by Gregory Toomey
A few weeks ago a question was asked in this group about removing
whitespace from html, in particular from html generated by cgi. Here's
What is the goal of this? Reducing the amount of data that is
transmitted to the client browser? If so, you would probably be better
off compressing the output with gzip -- all major browsers support gzip
compressed data.

[...]
Post by Gregory Toomey
#newlines are not needed
#except for Content-type-text/html\n\n
# which occurs at the start
print;
print "\n" if $count++<4;
Newlines are needed in <pre>...</pre> sections, and sometimes in
<textarea>...</textarea> sections.
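
A rough sketch of one way to leave those sections alone, assuming a
hypothetical helper squeeze_outside_pre. It uses a regex split rather
than a real HTML parser, so comments, <script> blocks and malformed
markup are not handled:

```perl
use strict;
use warnings;

# Hypothetical helper: collapse whitespace everywhere except inside
# <pre>...</pre> and <textarea>...</textarea> blocks.
sub squeeze_outside_pre {
    my ($html) = @_;

    # The single capturing group makes split return the protected
    # blocks as well; they land at the odd indexes of @parts.
    my @parts = split
        m{(<pre\b.*?</pre\s*>|<textarea\b.*?</textarea\s*>)}is,
        $html;

    my $i = 0;
    for my $part (@parts) {
        next if $i++ % 2;        # odd index: protected block, keep as-is
        $part =~ s/\s+/ /g;      # elsewhere: any whitespace run -> one space
    }
    return join '', @parts;
}

print squeeze_outside_pre(
    "<p>a   b</p>\n\n<pre>\n  x\n    y\n</pre>\n<p>c\n\nd</p>"
), "\n";
```

The indentation inside the <pre> block survives, while the text
around it is reduced to single spaces.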

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print
Jeff 'japhy' Pinyan
2003-11-16 22:13:24 UTC
Post by Eric J. Roode
Post by Gregory Toomey
#newlines are not needed
#except for Content-type-text/html\n\n
# which occurs at the start
print;
print "\n" if $count++<4;
Newlines are needed in <pre>...</pre> sections, and sometimes in
<textarea>...</textarea> sections.
Not to mention that, although most HTML renders multiple whitespace as a
SINGLE space, a SINGLE newline IS needed, because the browser will render
it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
string like "foo \n bar" is also just rendered as "foo bar".
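
A small sketch of that point, assuming a hypothetical helper
join_lines: each line is trimmed, but the lines are joined with one
space instead of nothing, so "foo\nbar" does not become "foobar":

```perl
use strict;
use warnings;

# Hypothetical helper: trim each line, then join the non-empty ones
# with a single space so adjacent words do not run together.
sub join_lines {
    my @lines = @_;
    for my $line (@lines) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
    }
    return join ' ', grep { length } @lines;
}

print join_lines("foo\n", "  bar \n"), "\n";    # prints "foo bar"
```
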
--
Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
Eric J. Roode
2003-11-17 00:41:40 UTC
Post by Jeff 'japhy' Pinyan
Not to mention that, although most HTML renders multiple whitespace as a
SINGLE space, a SINGLE newline IS needed, because the browser will render
it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
string like "foo \n bar" is also just rendered as "foo bar".
Ooh, good point.

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print
Gregory Toomey
2003-11-17 02:19:56 UTC
Post by Jeff 'japhy' Pinyan
Not to mention that, although most HTML renders multiple whitespace as a
SINGLE space, a SINGLE newline IS needed, because the browser will render
it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
string like "foo \n bar" is also just rendered as "foo bar".
Ooh, good point.
I tried it on a dozen cgis and it worked.

To make this foolproof you need to write an HTML parser - this is left as an exercise for the reader!

gtoomey
Gregory Toomey
2003-11-16 23:10:43 UTC
Post by Gregory Toomey
A few weeks ago a question was asked in this group about removing
whitespace from html, in particular from html generated by cgi. Here's
What is the goal of this? Reducing the amount of data that is
transmitted to the client browser?
Yes.
If so, you would probably be better
off compressing the output with gzip -- all major browsers support gzip
compressed data.
Yes, I use Apache with gzip, so that's another level of compression.

People hate waiting for pages to load, especially people on dialup.
[...]
Post by Gregory Toomey
#newlines are not needed
#except for Content-type-text/html\n\n
# which occurs at the start
print;
print "\n" if $count++<4;
Newlines are needed in <pre>...</pre> sections, and sometimes in
<textarea>...</textarea> sections.
- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print
Eric J. Roode
2003-11-17 00:43:22 UTC
Post by Gregory Toomey
People hate waiting for pages to load, especially people on dialup.
Have you verified that the extra time your CGI scripts take to execute is
less than the transfer time of the spaces you are eliminating? :-)

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print
Gregory Toomey
2003-11-17 02:15:16 UTC
Post by Gregory Toomey
People hate waiting for pages to load, especially people on dialup.
Have you verified that the extra time your CGI scripts take to execute is
less than the transfer time of the spaces you are eliminating? :-)
The server I use for cgi is about 2.6GHz and averages 20% CPU utilisation.
Running the script to remove whitespace takes under 1 second for 1000 lines of HTML, and does not increase the load to any discernable extent.

The database-driven cgi I use is disk IO bound, not CPU bound.

gtoomey
Chris Mattern
2003-11-17 15:03:36 UTC
Post by Gregory Toomey
Post by Gregory Toomey
People hate waiting for pages to load, especially people on dialup.
Have you verified that the extra time your CGI scripts take to execute is
less than the transfer time of the spaces you are eliminating? :-)
The server I use for cgi is about 2.6GHz and averages 20% CPU utilisation.
Running the script to remove whitespace takes under 1 second for 1000 lines of HTML,
and does not increase the load to any discernable extent.
The database-driven cgi I use is disk IO bound, not CPU bound.
Which doesn't answer the question. The question isn't "Are you overloading the
server?", the question is "Are your users waiting longer for you to remove the
whitespace than they would wait for the whitespace to download?" Assuming there
is ten bytes of removable whitespace per line (which would be rather a lot),
then the whitespace in 1000 lines takes less than two seconds to download on
a 56K modem. It would take a small fraction of a second with broadband. It
scarcely seems worth the effort.
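
The estimate can be checked with a few lines of Perl. This is a
back-of-the-envelope sketch: it assumes the nominal 56,000 bits per
second line rate and ignores protocol overhead and modem compression;
the helper name transfer_seconds is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: seconds needed to move $bytes over a link
# running at $bits_per_second (8 bits per byte, no overhead).
sub transfer_seconds {
    my ($bytes, $bits_per_second) = @_;
    return $bytes * 8 / $bits_per_second;
}

my $lines          = 1000;
my $bytes_per_line = 10;        # assumed removable whitespace per line
my $modem_bps      = 56_000;    # nominal 56K modem rate

printf "%.2f s\n",
    transfer_seconds($lines * $bytes_per_line, $modem_bps);   # prints "1.43 s"
```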

Chris Mattern
Louis Erickson
2003-11-17 23:55:55 UTC
Gregory Toomey <***@bigpond.com> wrote:

: It was a dark and stormy night, and Eric J. Roode managed to scribble:
:> What is the goal of this? Reducing the amount of data that is
:> transmitted to the client browser?

: Yes.

:>If so, you would probably be better
:> off compressing the output with gzip -- all major browsers support gzip
:> compressed data.

: Yes I use Apache with gzip so that's another level of compression.

If you're gzipping the output stream, then the removal of spaces isn't likely
to change your transmission size significantly, if at all. The compressor
will flatten them right out, without risking the content of the HTML.
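
A quick way to measure this, sketched with IO::Compress::Gzip (bundled
with Perl 5.10 and later). The sample page is synthetic, 1000 table
cells each indented by ten spaces, and the helper name gzipped_length
is an assumption, so treat the numbers it prints as illustrative only:

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical helper: size in bytes of $text after gzip compression.
sub gzipped_length {
    my ($text) = @_;
    my $out;
    gzip \$text => \$out or die "gzip failed: $GzipError\n";
    return length $out;
}

# Synthetic page: 1000 lines, each with ten leading spaces.
my $indented = join '', map { "          <td>cell $_</td>\n" } 1 .. 1000;

# The same page with the leading spaces removed.
(my $stripped = $indented) =~ s/^ +//mg;

printf "raw:  %d vs %d bytes\n",
    length($indented), length($stripped);
printf "gzip: %d vs %d bytes\n",
    gzipped_length($indented), gzipped_length($stripped);
```

The raw sizes differ by exactly 10,000 bytes; the gzipped sizes should
differ by far less, since repeated indentation compresses very well.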

Also note that if you have a CGI that sends back something besides HTML,
such as image or sound data, this will completely screw it up.
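
One possible guard, sketched under the assumption that the whole
response is available as a string: split the headers from the body at
the first blank line and only touch the body when the headers declare
text/html. The helper names split_response and is_html are made up
here, and the body handling is deliberately crude; a real HTML parser
(for example HTML::Parser from CPAN) would still be needed for the
markup itself:

```perl
use strict;
use warnings;

# Hypothetical helper: split a CGI response into header block and body.
sub split_response {
    my ($response) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $response, 2;
    return ($head, defined $body ? $body : '');
}

# Hypothetical helper: true only when the headers declare an HTML body.
sub is_html {
    my ($head) = @_;
    return $head =~ m{^Content-type:\s*text/html\b}mi ? 1 : 0;
}

my $response = join "\n",
    'Status: 302 Found',
    'Content-type: text/html',
    '',
    '<html>   <body>  hi  </body>   </html>',
    '';

my ($head, $body) = split_response($response);
$body =~ s/[ \t]+/ /g if is_html($head);    # headers are never rewritten
print $head, "\n\n", $body;
```

Image or sound data then passes through byte for byte, and multi-line
header blocks no longer depend on a fixed line count.
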
--
Louis Erickson - ***@rdwarf.com - http://www.rdwarf.com/~wwonko/

Andrea: Unhappy the land that has no heroes.
Galileo: No, unhappy the land that needs heroes.
-- Bertolt Brecht, "Life of Galileo"