Fog Creek Software
Discussion Board




HTML -> TIFF generation

I'm doing some research at the moment to see if it's possible to take an HTML document and produce a rendered file, ideally as a TIFF.

Are there any third party libraries that cater for this? I can think of some hacks (such as grabbing a device context in IE) but I don't really want to go down that route.

Better than being unemployed...
Wednesday, February 26, 2003

I have not tried this myself, but http://www.americansys.com/psd.htm?From=Main should be able to do the job.

From the feature list:

- Capture active window
- Support for GIF and TIFF formats
- Auto Scroll. Allows you to capture an entire web page or other document when much of it is not visible.

Just me (Sir to you)
Wednesday, February 26, 2003

If you mean you want to do a screen capture of an HTML page in a browser then check out SnagIt:

http://www.techsmith.com/defaultflash.asp

It has a scroll feature so you grab the entire page even if it is off the screen. We used it for bug screen shots to include in our bug tracking database since "print screen & paste" only gets the visible part of the browser window.

If you mean you want to do this at runtime in code then this won't help you...

KenB
Wednesday, February 26, 2003

I know little about these things, but why not take a look at an open web rendering engine like Mozilla?
If it's anything like game code, there must be a point where the page becomes nothing more than a load of RGB triplets to be shipped off to video memory. Attach TIFF headers and write them to disk.

I'm just speculating...
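
The "attach TIFF headers" step might look roughly like the sketch below. It assumes you have already obtained the page as raw 8-bit RGB bytes from the rendering engine, which the thread does not show how to do. The function name write_rgb_tiff is made up for illustration, and the file it writes leaves out the resolution tags that strict baseline TIFF readers may expect.

```python
import struct

SHORT, LONG = 3, 4  # TIFF field type codes


def _entry(tag, type_, count, value):
    """Pack one 12-byte IFD entry (little-endian)."""
    if type_ == SHORT and count == 1:
        # A single SHORT is stored left-justified in the 4-byte value field.
        return struct.pack("<HHIHH", tag, type_, count, value, 0)
    return struct.pack("<HHII", tag, type_, count, value)


def write_rgb_tiff(path, width, height, rgb):
    """Write raw 8-bit RGB bytes as one uncompressed strip in a TIFF file."""
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must hold width * height * 3 bytes")

    pad = b"\x00" * (len(rgb) % 2)         # keep later offsets word-aligned
    bits_offset = 8 + len(rgb) + len(pad)  # BitsPerSample values go here
    ifd_offset = bits_offset + 6           # three SHORTs take 6 bytes

    entries = [                            # must be sorted by tag number
        _entry(256, LONG, 1, width),       # ImageWidth
        _entry(257, LONG, 1, height),      # ImageLength
        _entry(258, SHORT, 3, bits_offset),  # BitsPerSample (offset to 8,8,8)
        _entry(259, SHORT, 1, 1),          # Compression: none
        _entry(262, SHORT, 1, 2),          # PhotometricInterpretation: RGB
        _entry(273, LONG, 1, 8),           # StripOffsets: right after header
        _entry(277, SHORT, 1, 3),          # SamplesPerPixel
        _entry(278, LONG, 1, height),      # RowsPerStrip: whole image
        _entry(279, LONG, 1, len(rgb)),    # StripByteCounts
    ]

    with open(path, "wb") as f:
        f.write(struct.pack("<2sHI", b"II", 42, ifd_offset))  # 8-byte header
        f.write(rgb + pad)
        f.write(struct.pack("<HHH", 8, 8, 8))
        f.write(struct.pack("<H", len(entries)) + b"".join(entries))
        f.write(struct.pack("<I", 0))      # offset of next IFD: none
```

Calling write_rgb_tiff("page.tiff", 2, 2, bytes(12)) would write a 2x2 black image. Getting the pixels out of the renderer in the first place is the hard part, and this sketch does not address it.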

Eric DeBois
Wednesday, February 26, 2003

Thanks guys, but I'm afraid my problem is that I want to do this at runtime on a server that can process lots of HTML documents in the background.

Better than being unemployed...
Wednesday, February 26, 2003

Combining the above with something like http://www.pitrinec.com/toolsworks.htm could still save you boatloads of time. Yes, it's dirty. Yes, it's a hack, but hey, I guess this is not the real focus of your research, just the data collection part, so why not get it done in a day and spend time on the real meat instead?

Just me (Sir to you)
Wednesday, February 26, 2003

More details pls, I'll ask some Russian h*ckers, oh, sorry, programmers :)
This case looks like an interesting one :)

Slava
Wednesday, February 26, 2003

>>
I'm doing some research at the moment
<<
No you're not. Because I just looked on google and got the answer in 2 seconds.

google
Wednesday, February 26, 2003

Ok, that's their answer. "Don't spend too much time. Use IE API or Gecko rendering engine".
Another idea for context switching: "Use rendering engine to print something to Postscript device or custom device (yours)"

Slava
Wednesday, February 26, 2003

Okay, my best description of the problem is as follows.

We've currently got a product that takes in document data from a variety of bespoke databases (accounts packages seem to be popular), applies a styling template to it, and outputs some HTML. This is then uploaded or emailed out to an appropriate recipient.

A suitable application of this would be to send out a load of outstanding invoice reminders at the end of each month.

What we want to do is apply an extra stage in this process, so that instead of sending out HTML, we send it out rendered as a TIFF (or PDF, but the prefers TIFF for reasons I can't quite fathom at the moment).

The document server (which does the meat of this process) runs as an NT/2000/XP service in the background with no user interaction. Therefore any rendering has got to be done at runtime, and anything that involves any UI has got to be ruled out.

A COM component is great. A Win32 library is still good enough. As long as it's thread safe.

Any other information, let me know...

Better than being unemployed...
Wednesday, February 26, 2003

I would use the IE control. Load the page, then print to a PostScript printer to get PDF. I assume going from PDF to TIFF is fairly easy. There's no reason you can't do this from a service. I know, there's no place for the GUI to show up when you are running as a service, but it doesn't matter -- you can still instantiate the IE control, load a web page, and print it.

Joel Spolsky
Wednesday, February 26, 2003

Something from Google:
>>
Does Image Magick support HTML-to-TIFF conversion?
Yes - though we use some helper apps (html2ps and ghostscript).
<<

I don't know what this ImageMagick is, but it seems the idea of going from HTML to PostScript and then to TIFF has growing support :)

Slava
Wednesday, February 26, 2003

You could also use a TIFF printer driver to print directly to TIFF: http://www.informatik.com/tiffwork.html

Just me (Sir to you)
Wednesday, February 26, 2003

Around a year ago there was a link on the gotdotnet.com site with source code. You entered a URL and it returned a GIF of that page. I cannot find it right now.

na
Wednesday, February 26, 2003

The printer driver is actually the best idea of the bunch. Then your app doesn't need to know anything about TIFF - it just prints. Same as if you wanted PDF.

Chris Tavares
Wednesday, February 26, 2003

Install a dummy PostScript printer driver. Print the page to a temporary PS file using IE or Gecko. Use Ghostscript to convert the PS file into either TIFF or PDF. You can do all that on the server side.

Ghostscript also comes as a library.

The TIFF printer driver mentioned in another reply looks interesting too.
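
The Ghostscript step could be driven from code along these lines. This is only a sketch: it assumes the Ghostscript command-line program is installed and reachable as "gs" (on Windows the console executable usually has a different name, such as gswin32c), and the function names, the 300 dpi default and the choice of the bilevel tiffg4 output device are illustrative assumptions, not something from this thread.

```python
import subprocess


def ghostscript_tiff_command(ps_path, tiff_path, gs_exe="gs", dpi=300):
    """Build the argument list for a non-interactive PS -> TIFF conversion."""
    return [
        gs_exe,
        "-dBATCH",            # exit when the input has been processed
        "-dNOPAUSE",          # do not prompt between pages
        "-dSAFER",            # restrict file access from PostScript code
        "-sDEVICE=tiffg4",    # bilevel, Group 4 compressed TIFF (fax style)
        f"-r{dpi}",           # output resolution in dots per inch
        f"-sOutputFile={tiff_path}",
        ps_path,
    ]


def convert_ps_to_tiff(ps_path, tiff_path, **options):
    """Run Ghostscript and raise CalledProcessError if it fails."""
    subprocess.run(ghostscript_tiff_command(ps_path, tiff_path, **options),
                   check=True)
```

Because the caller names the output file itself, each worker thread can convert its own temporary PS file to its own TIFF without going through a shared print queue for this step.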

nitin
Wednesday, February 26, 2003

Most electronic faxing software uses print drivers that generate TIFF files, so you might see whether you can re-purpose a library intended for faxing.

Also, I know that Adobe offers a similar driver as part of a "print to your local printshop" feature. It, of course, generates PDFs. I don't have any more details on it, though.

Malcolm
Thursday, February 27, 2003

Slava,

ImageMagick is a set of tools and libraries for working with images. It ships with most Linux distributions. With ImageMagick you can convert, resize and recolor all the images in a directory with a single command line.

Yury
Thursday, February 27, 2003

Acrobat will export PDF to TIFF and import HTML to PDF, and I am pretty sure both can be done under programmer control.

However, why are you wasting time on this when there is off-the-shelf software which does this sort of thing? Have a look at Accelio, FormScape, StreamServe, Optio, etc.

Tony E
Thursday, February 27, 2003

Okay. I must be missing something here because so many people have recommended print drivers.

I've been burned by print drivers in the past, causing some severe bugs that have made customers very unhappy. This is how I understand it...

When you're doing batch processing, at the time of sending off the input file, you need to have some idea of where the output file is going to be, so you can tie the two together. This is particularly true if you've got a multithreaded server that processes many files at a time.

Most print drivers allow you to specify an output filename after you've sent something to be printed. Great. But you need some way of knowing exactly where the print job came from in the first place. Otherwise you'll just blindly write an output file, tell the server that you _think_ this is the file they just sent to be processed, but you can't be 100% sure.

This is not a contrived scenario. We used a fax printer driver with a similar mechanism, then got a complaint from a customer that all the faxes were going to the wrong place because they started accidentally printing to the fax printer driver from Word, and the server made some false assumptions about where the resulting temporary files were supposed to go.

What am I doing wrong here?

Better than being unemployed...
Thursday, February 27, 2003

Naming the file with a GUID would tie all the processes together, wouldn't it?

Just me (Sir to you)
Thursday, February 27, 2003

I'm not sure what you mean. The input file or the output file?

I'd say you need to name both with the same unique name. And that's what I can't figure out - how do you guarantee that when you've potentially got 5 print jobs happening simultaneously?

Better than being unemployed...
Thursday, February 27, 2003

I am assuming you can already differentiate your input files (if you cannot, then you can do it artificially by using some singleton naming process, e.g. a DB auto ID, or again a GUID). If you then have one structure (a DB table, for instance) that maps the unique input name to the unique output name, you are set. The number of threads doing this in parallel is irrelevant. The GUID generator guarantees no name clashes will ever occur.

Just me (Sir to you)
Thursday, February 27, 2003

Btbu,

Maybe I should explain GUID. A GUID is a Globally Unique IDentifier. It is usually (at least partly) implemented through a random number generator. The idea is to generate a very large (e.g. 128-bit) number so that the chance of that number ever being generated again is 0 for all practical purposes. This ensures that multiple identity-issuing processes can generate unique identities without ever having to coordinate.
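
To put a rough number on "0 for all practical purposes", here is a small sketch. It assumes fully random version 4 identifiers, which carry 122 random bits out of 128, and it uses the standard birthday-bound approximation; the function name and the figure of one billion documents are only illustrative.

```python
import uuid


def uuid4_collision_probability(n):
    """Approximate chance that any two of n random (version 4) GUIDs collide.

    Birthday-bound approximation: p ~ n * (n - 1) / 2 / 2**122.
    """
    random_bits = 122  # 128 bits minus 6 fixed version/variant bits
    return n * (n - 1) / 2 / 2 ** random_bits


# One freshly generated identifier, as 32 hexadecimal digits.
example = uuid.uuid4().hex.upper()

# Chance of any clash after issuing one billion identifiers.
p = uuid4_collision_probability(10 ** 9)
```

Even at a billion documents the estimate stays far below one in a billion billion, which is why independent processes can hand out names without talking to each other.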

Just me (Sir to you)
Thursday, February 27, 2003

Hmm. Let me try and explain my problem from another angle.

A print queue is, in effect, not thread safe.

Suppose I have three files, let's call them 001.html, 002.html and 003.html. I want to generate 001.tiff, 002.tiff and 003.tiff respectively.

Send off 001.html to the print queue. Then send 002.html to the print queue. Before we get a chance to send 003.html, however, rogue_app.exe decides to send its own print job to the print queue at the same time without telling us. Just after this happens, we send 003.html to the print queue.

So we now have four items in the print queue.
The first print job finishes. Well that'll be 001.tiff. Then the second one finishes. That's 002.tiff. Then the third print job finishes. Hmm, that's _not_ 003.tiff, but how do we know this?
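
The scenario above can be played through in a few lines. This is a toy simulation, not real print spooler code: the queue is just a list, and pair_by_completion_order is a made-up name for the naive "the next finished job must be my next file" assumption.

```python
from collections import deque


def pair_by_completion_order(submitted_inputs, queue_contents):
    """Naively match each finished job to the next input we submitted."""
    pending = deque(submitted_inputs)
    pairs = []
    for actual_job in queue_contents:   # jobs finish in queue order
        if not pending:
            break
        pairs.append((pending.popleft(), actual_job))
    return pairs


# What is really in the print queue, including the uninvited job.
queue = ["001.html", "002.html", "rogue_app.exe", "003.html"]

pairs = pair_by_completion_order(["001.html", "002.html", "003.html"], queue)

# Every pair where the server's assumption does not match reality.
mismatches = [(assumed, actual) for assumed, actual in pairs if assumed != actual]
```

Here mismatches comes out as [("003.html", "rogue_app.exe")]: the third finished job gets filed as 003.tiff even though it came from the rogue application.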

Better than being unemployed...
Thursday, February 27, 2003

Because the submitting process generated a GUID for 003.html (e.g. 878FA78499377202873AAAB7982C8321), stored the tuple (003, 878FA78499377202873AAAB7982C8321) in the DB, and told the printer driver to print to 878FA78499377202873AAAB7982C8321.tiff.
The receiving process that picks up 878FA78499377202873AAAB7982C8321.tiff finds by looking in the DB that it is associated with 003.html. The rogue process, or any other thread for that matter, has 0 chance of having picked the same name by coincidence, since that is in the nature of the GUID.
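
A minimal sketch of that bookkeeping, assuming the printer driver can be told which output file name to use. The table layout and the function names register_job and input_for_output are invented for illustration, and an in-memory SQLite database stands in for whatever DB the server really uses.

```python
import sqlite3
import uuid

# Single-connection, single-threaded stand-in for the real database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (guid TEXT PRIMARY KEY, input_name TEXT NOT NULL)")


def register_job(input_name):
    """Record a new job and return the output file name to print to."""
    guid = uuid.uuid4().hex.upper()
    db.execute("INSERT INTO jobs (guid, input_name) VALUES (?, ?)",
               (guid, input_name))
    return guid + ".tiff"


def input_for_output(output_name):
    """Look up which input produced an output file, or None if unknown."""
    guid = output_name.rsplit(".", 1)[0]
    row = db.execute("SELECT input_name FROM jobs WHERE guid = ?",
                     (guid,)).fetchone()
    return row[0] if row else None
```

A file the server never registered, such as one produced by a rogue print job, simply has no row in the table, so input_for_output returns None and the file can be ignored instead of being sent to the wrong recipient.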

Just me (Sir to you)
Thursday, February 27, 2003

Aha, got it. All well and good, except..

1) I can't tell the printer driver what output file I want to generate in advance.
2) When I get notified that a TIFF file should be generated, I don't get told which print job generated it.

Looks like I need a better printer driver...

Better than being unemployed...
Thursday, February 27, 2003

If you want to control the naming of print files, have a look at a product called Print Distributor at www.frogmorecs.com. The file name could be derived from the document name.

Tony E
Thursday, February 27, 2003
