Fog Creek Software
Discussion Board




HTML: CR/LFs vs LFs

I know HTML is supposed to have Carriage Return, Linefeed sequences on the end of each line.

I recently ran across a number of HTML files with just Linefeeds, and these seem to work fine in IE.

The question is: how safe is it to use just Linefeeds?

Say, for example, I want to save a few bytes in each HTML file, so I am considering replacing all CR/LF sequences with just LFs....

Is this safe?

Will other browsers handle this okay?

What about JavaScript embedded in the HTML?  Will it be affected?

What about .js files linked from the HTML? Is it okay to substitute LFs for CR/LFs in these too?

S. Tanna
Tuesday, February 24, 2004

I guess you have been reading Speed Up Your Website.

Li-fan Chen
Tuesday, February 24, 2004

I'm not convinced that "HTML is supposed to have Carriage Return, Linefeed sequences on the end of each line."

Indeed, I believe that the browser will generally ignore these characters as whitespace.

Adding these characters can make your source a *lot* easier to read, which may or may not be a good thing.

Steve Jones (UK)
Tuesday, February 24, 2004

HTTP header lines need CRLF endings.

HTML treats control-characters including CR and LF (and possibly TAB) as whitespace, unless inside a PRE chunk.

i like i
Tuesday, February 24, 2004

You don't need any of that in your HTML. I generate my HTML with xmlc, and it strips all of that out for me. My raw source is nice and readable, but when sent to the browser it's that much smaller.

Samo
Tuesday, February 24, 2004

Most web host configurations automatically gzip all your HTML before it's sent to the browser and the browser un-gzips it automatically when it gets to the other side.  It is pretty standard these days.  So pretty much anything you do to make your HTML smaller isn't going to have as much effect as you think.  And stripping out CR/LF's isn't going to change the rendering time (which is more of an issue in some cases).

Almost Anonymous
Tuesday, February 24, 2004

I just want to save space in the files

I couldn't care less about rendering time, bandwidth, etc.

S. Tanna
Tuesday, February 24, 2004

"I couldn't care less about rendering time, bandwidth, etc."
"I just want to save space in the files"

Why?

Almost Anonymous
Tuesday, February 24, 2004

I don't really know how else to say it: I'm only interested in saving storage space.

S. Tanna
Tuesday, February 24, 2004

Ahh.  Sorry.  Generally storage space is the least of all the issues!  Unfortunately, removing all the CRLFs won't save you a whole lot of space.  You could store all the content on the server in compressed form; that could save you a lot more space.

Almost Anonymous
Tuesday, February 24, 2004

The reason many HTML-pages only have LF instead of  CR LF is probably because they were made on a Unix or Unix-like system. For those who don't know, that's what Unix always uses for line endings.

Roel Schroeven
Tuesday, February 24, 2004

...and classic Mac OS only uses CR for line endings.

Almost Anonymous
Tuesday, February 24, 2004

Google's home page has no spaces or returns.

Li-fan Chen
Tuesday, February 24, 2004

HTML ignores CRs and LFs as whitespace.

HTTP does not. Note well what "i like i" said -- HTTP headers must have CRLF line-endings. Usually such headers would be generated by your web server but if you do them yourself, be aware.
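For anyone writing the headers by hand, here is a minimal sketch of the distinction, assuming a hypothetical helper named build_response (not part of any library). The header lines and the blank line that ends them use CRLF; the body is passed through untouched, so it can use bare LFs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a minimal HTTP/1.1 response into out.
 * Header lines end in CRLF, and an empty CRLF line ends the header block.
 * The body is copied unchanged, so it may use bare LF line endings.
 * Returns the number of bytes written, or -1 if out is too small. */
int build_response(char *out, size_t out_size, const char *body)
{
    assert(out != NULL && body != NULL);
    int n = snprintf(out, out_size,
                     "HTTP/1.1 200 OK\r\n"
                     "Content-Type: text/html\r\n"
                     "Content-Length: %zu\r\n"
                     "\r\n"
                     "%s",
                     strlen(body), body);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}
```

Note that Content-Length counts the body bytes as sent, so converting the body from CRLF to LF changes that number too.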

A similar example: you can crash Outlook Express by passing it an e-mail with LF's for headers instead of CRLFs (though some e-mail servers can correct this situation automatically).

Nate Silva
Tuesday, February 24, 2004

The canonical form of HTML documents does indeed involve both CR and LF. However, according to RFC 2616:
"When in canonical form, media subtypes of the "text" type use CRLF as the text line break. HTTP relaxes this requirement and allows the transport of text media with plain CR or LF alone representing a line break when it is done consistently for an entire entity-body. HTTP applications MUST accept CRLF, bare CR, and bare LF as being representative of a line break in text media received via HTTP."

Theoretically, some line breaks in HTML are supposed to go away under certain circumstances, but browsers by and large don't implement that properly, which is why you can sometimes see "whiskers" in people's image links; if there's a line break between the img tag and the a end-tag, that turns into a space that's part of the link, and is underlined.
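Given the RFC 2616 wording quoted above, the substitution itself is simple. A rough sketch, assuming the file has already been read into memory; the name crlf_to_lf is made up for illustration. It only collapses CR LF pairs and leaves a lone CR alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrites buf in place so every CR LF pair becomes a
 * single LF. A CR that is not followed by LF is left untouched.
 * Returns the new length; len minus that value is the number of bytes saved. */
size_t crlf_to_lf(char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t out = 0;
    for (size_t in = 0; in < len; in++) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;               /* drop the CR; the LF is copied next */
        buf[out++] = buf[in];
    }
    return out;
}
```

The saving is exactly one byte per CR LF line ending, so a 500-line file shrinks by 500 bytes.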

Chris Hoess
Tuesday, February 24, 2004

So you want to save diskspace?

For static pages, here is an idea:

Store all your pages gzipped.

Most browsers accept gzipped pages directly (because they want to speed-up download times).

If the browser does not accept gzipped content, then your webserver needs to unzip it on the fly and provide the client with an uncompressed stream.
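The decision between those two paths hinges on the request's Accept-Encoding header. A simplified sketch of that check, assuming a hypothetical accepts_gzip helper; a real server would also honour q-values, which this ignores.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the Accept-Encoding header value lists
 * the "gzip" coding, 0 otherwise. Simplification: q-values are ignored, so
 * "gzip;q=0" is still reported as accepted. */
int accepts_gzip(const char *accept_encoding)
{
    assert(accept_encoding != NULL);
    const char *p = accept_encoding;
    while (*p != '\0') {
        /* skip separators and leading whitespace */
        while (*p == ',' || *p == ' ' || *p == '\t')
            p++;
        /* measure the coding name: runs up to ',', ';', whitespace or end */
        size_t n = strcspn(p, ",; \t");
        if (n == 4 &&
            tolower((unsigned char)p[0]) == 'g' &&
            tolower((unsigned char)p[1]) == 'z' &&
            tolower((unsigned char)p[2]) == 'i' &&
            tolower((unsigned char)p[3]) == 'p')
            return 1;
        /* move to the next comma-separated element */
        p += strcspn(p, ",");
    }
    return 0;
}
```

If it returns 1, the stored .gz bytes can go out as they are with a Content-Encoding: gzip header; otherwise the server has to decompress first.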

Now the final caveat: I have never needed to look into actually doing this, so I don't know which webservers might work this way.

i like i
Wednesday, February 25, 2004

Talking about Classic MacOS and line endings, does OSX use LF just like Unix?  Do OSX apps still generate CR only?

The issue for me is counting lines that may end in CR, LF or CRLF.  For various technical reasons, the state required to distinguish CRLF from CR and count only one line is difficult to maintain.  Right now I am just counting LFs and the old Macs be screwed.  I am not supporting Mac at all at the moment.

How much of a problem is this likely to be in the future?

David Jones
Wednesday, February 25, 2004

No, I'm not interested in gzip.

I simply want to reduce the file size of HTML files on a disk.

I'm using a variety of techniques, only one of which is related to the CR/LF vs LF thing, and that is the only one I'm not sure is legitimate.

To those who say line breaks are not significant in HTML, a couple of trivial examples show them wrong: think about the PRE tag, or the SCRIPT tag which is followed by Javascript including lines containing // or <!--

S. Tanna
Wednesday, February 25, 2004

David Jones - something like:

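// Counts line endings in a NUL-terminated buffer.
// CR, LF, and CR LF each count as exactly one line ending.
int CountLines ( const char * szBuffer )
{
    char cPrevious = '\0' ;
    const char * pCurrent = szBuffer ;
    int nLineCount = 0 ;

    while ( (*pCurrent) != '\0' )
    {
        switch ( *pCurrent )
        {
            default:
                break ;

            case '\n':
                // An LF right after a CR belongs to a CR LF pair
                // that was already counted at the CR.
                if ( cPrevious != '\r' )
                {
                    nLineCount++ ;
                }
                break ;

            case '\r':
                nLineCount++ ;
                break ;
        } // switch

        cPrevious = (*pCurrent) ;
        pCurrent++ ;
    } // while

    return nLineCount ;
}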
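
The loop above assumes the whole text sits in one buffer. If the data arrives in pieces, the only state that has to survive between pieces is whether the last byte was a CR. A sketch of that variant, with made-up names (LineCounter, line_counter_feed):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state carried between chunks: the running count and whether
 * the previous byte seen was a CR. */
typedef struct {
    long lines;
    int  prev_was_cr;
} LineCounter;

/* Feeds one chunk to the counter. CR, LF, and CR LF each count as one
 * line ending, even when a CR LF pair is split across two chunks. */
void line_counter_feed(LineCounter *lc, const char *chunk, size_t len)
{
    assert(lc != NULL && (chunk != NULL || len == 0));
    for (size_t i = 0; i < len; i++) {
        if (chunk[i] == '\r') {
            lc->lines++;
            lc->prev_was_cr = 1;
        } else {
            if (chunk[i] == '\n' && !lc->prev_was_cr)
                lc->lines++;
            lc->prev_was_cr = 0;
        }
    }
}
```

Start with both fields set to zero and call line_counter_feed once per chunk; the total is in the lines field at the end.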

S. Tanna
Wednesday, February 25, 2004

David,

LF is the standard line-ending in OS X.

Nate Silva
Wednesday, February 25, 2004

"No, I'm not interested in gzip
I simply want to reduce the file size of HTML files on a disk."

But gzip would reduce the size of the files on disk...  and not alter the semantic properties of the file as removing whitespace can in some cases.

Of course, it might be a server configuration headache. 

How come you are so tight on disk space?

Almost Anonymous
Wednesday, February 25, 2004

I am distributing the HTML files, in HTML to be read in by (any) browser. I don't want to use as much disk space.

S. Tanna
Wednesday, February 25, 2004

Consider block sizes on disks. For NTFS, the default block size is 4K; on most FAT32 disks, it's going to be larger (even as large as 32K!).

Unless you cross one of those block-sized boundaries, the 10s or even 100s of bytes you think you're saving... aren't.
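
To put numbers on that, here is a small sketch of the rounding, assuming space is handed out in whole blocks and ignoring any other file-system details. The helper name size_on_disk is made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: bytes a file occupies on disk when space is
 * allocated in whole clusters of cluster_size bytes. */
unsigned long size_on_disk(unsigned long file_size, unsigned long cluster_size)
{
    assert(cluster_size > 0);
    /* round the cluster count up, then convert back to bytes */
    unsigned long clusters = (file_size + cluster_size - 1) / cluster_size;
    return clusters * cluster_size;
}
```

With 4K blocks, a 10,000-byte file and the same file trimmed to 9,800 bytes both occupy 12,288 bytes. Only a file that drops from, say, 8,300 to 8,192 bytes actually frees a block.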

Brad Wilson (dotnetguy.techieswithcats.com)
Wednesday, February 25, 2004

Brad is right.
Just try some HTML "compressor" (there must be freeware ones), and see if the output is correct. Then check the size occupied on disk.
Replacing CR-LF with LF will certainly do no harm.

Pakter
Wednesday, February 25, 2004

No, Brad is wrong

To quote myself:

> I'm using a variety of techniques, only one of which is related to CR/LF vs LF thing

The "variety of techniques" can reduce the typical large file size by 70% or more. If I can push it up more by replacing CR/LFs by LFs only, I want to.

S. Tanna
Wednesday, February 25, 2004

I'm curious to know how bad the HTML is that you can "compress" away 70% of it. What, there's 70% excess white space? Seriously?

Brad Wilson (dotnetguy.techieswithcats.com)
Wednesday, February 25, 2004

As explained above, except in boundary cases, it will be less efficient to compress your files, than to leave them uncompressed. This is the case because for most of us disk space is far cheaper (almost an order of magnitude) than the cost of maintaining compressed files. And you have already said you are not interested in saving bandwidth.

This parallels the classic programming advice against early optimization, as well as "constraint theory", which helps you avoid optimizing the wrong thing.

It is possible that your application *is* one of the boundary cases, where it can make a difference to pre-compress your files. If so, it would help to know more about your application so the people here can give more appropriate advice.

Nate Silva
Wednesday, February 25, 2004

Why is this so hard?  All I want to know is whether LF is a fair substitute for CR/LF in HTML files, in all cases.

S. Tanna
Thursday, February 26, 2004

To Brad:

> I'm curious to know how bad the HTML is that you can "compress" away 70% of it. What, there's 70% excess white space? Seriously?

I already answered this: The "variety of techniques" can reduce the typical large file size by 70% or more.

I said typical large file - not "bad HTML"

70% is a guess based on testing so far on a couple of dozen files (which are not especially filled with whitespace). I haven't done an accurate average measurement across a large number of files yet.  I just copied all the HTML files on my disk to a folder and used them for testing for now.

And no sorry, I'm not going to tell you how, except it's a "variety of techniques".


But the main question remains: is LF a fair substitute for CR/LF in HTML?

S. Tanna
Thursday, February 26, 2004

As far as I know, no one here has found any reason against using LF. So just give it a try.

Pakter
Thursday, February 26, 2004
