Fog Creek Software
Discussion Board

Version control for Document-type files? (doc,etc)

Hi all!!

I've been pestering all of those who hang around the forum (again, thanks everyone for the _excellent_ tips provided!!), so I am back at it again :)

I just started at my new job (<plug>at </plug>), and my main role is documentation and some testing (I have to know what to write about! ;). I have to set up some sort of version control for the docs (apart from the classic "copy the folder and rename the files" ;) so I was wondering:

How do the different Source Control / VCS systems that people around here use work with non-textual files? I am referring, mostly, to .doc, .pdf, Visio files and the like.

I would like to (slowly) start using some sort of single-source publishing system in the company. I have been testing in my (scarce) free time the Help Tools recommended in previous threads, and they seem to be able to run on "HTML+" (augmented HTML) files, so that'd be fine with almost any source management system, but right now we have to get the docs for the new product ready, so the first priority is to get something out the door. That means available tools (Office, Visio, Acrobat, mainly). And that's the reason for my query. Any tools that work OK? Any that die horribly on binary files?

Thanks a lot to all!!

Monday, December 30, 2002

CVS can take binary files, but it doesn't like it. As for PDF's, well, what is the source you're running through Acrobat?

I recall reading that Office XP's format was XML. If this is correct, you may be able to throw it at any version control sys without too much worry.

Office's "Save as HTML" output leads to big files and ugly code, but a VCS shouldn't really care how nasty it is.

I know this post focuses more on avoiding binary files than on dealing with them, but this is all I know. Hope it helps.

Mike Swieton
Monday, December 30, 2002

You could try a simple (and free) document management system:

(no I don't work for them)

Monday, December 30, 2002

Javier -

My version control experience has been mostly with MS VSS and StarTeam, to a lesser degree with PVCS and CCC Harvest (also some with RCS and SCCS, but I never tried to feed them binary files, so I don't include them in my comments).

We never experienced any difficulties feeding StarTeam, VSS, PVCS or (at least to my knowledge, YMMV) CCC Harvest any binary files (i.e. MS Word Docs, Visio files, images, etc). We never had any problems with these tools versioning them, either.

What you are likely to find, however, is that because there really aren't any good tools out there (i.e. incorporated into these version control systems) for 'diffing' binary files, some of these tools may not internally retain only the difference between successive versions, and may simply retain copies of the files. This could be a diskspace issue for you if your VCS server is tight on diskspace. Again, though, this is internal to the VCS itself, it'll be transparent to you, but you won't be able to do diff's between versions of binary files. These tools trap attempts to run such diffs, so they don't puke if you try it, they just decline gracefully to do it.
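To illustrate the "decline gracefully" behaviour described above, here is a minimal Python sketch. It is not how StarTeam, VSS or PVCS actually implement it; the NUL-byte test is just a common heuristic, and the names `looks_binary` and `diff_or_decline` are made up for this example.

```python
import difflib

def looks_binary(data, sample_size=8192):
    """Heuristic (an assumption, not any VCS's real rule):
    content is treated as binary if its first bytes contain a NUL."""
    return b"\x00" in data[:sample_size]

def diff_or_decline(old, new):
    """Return unified-diff lines for text content, or None for binary."""
    if looks_binary(old) or looks_binary(new):
        return None  # decline gracefully instead of printing garbage
    old_lines = old.decode("utf-8", errors="replace").splitlines()
    new_lines = new.decode("utf-8", errors="replace").splitlines()
    return list(difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new", lineterm=""))
```

A caller would check for `None` and report something like "binary files differ", which is roughly what you see from the tools mentioned when you ask them to diff a Word or Visio file.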

In my current organization, we have been successful in migrating our visual designers and sales folks (well, most of them) over to using our VCS for their binary files ( photoshop files, flash, proposals, contracts, etc ). After some initial resistance from the 'non-tech' departments, they have admitted to liking it better than the old system of copying the file over with the date-time in the filename.

Short war story -- we had a project that was a bit unusual for us; a development project run out of our visual design section. After stressing to them that the artifacts for this effort needed to be treated just like source code--because it was--they blew me off with comments such as 'visual designers just don't work with VCS's', or 'we've done this for a lot of years and never lost a file yet', and another one was 'we're saving everything to the shared drive, that's backed up to tape, so we don't need to use the VCS.' Source, in this case, was itself binary, not plain text.

Well, I couldn't convince them to set their project code up like a real programming project, and lacked the authority to compel them to do so. Over time, after hundreds of bug reports, months of testing, dozens of builds, and final delivery of a 'gold disk', the head 'developer' from the visual design group left.

Since we were now facing a maintenance situation (and I had the backing of the senior sales executive on this--he was also confident future modifications were going to be requested) I challenged them to re-generate an exact copy of the final delivered build from the source, and the remaining folks were unable to do so -- we had had several builds subsequent to the 'gold disk' we sent the client (turned out we didn't use those last couple of builds--there were problems with them). Well, turns out the guy over-wrote the reference source files (it's not 'code' in the ordinary sense, so it can't just be de-compiled; the process of 'compilation' in this project is strictly a one-way process). So, the source for the version we sent out the door was gone.

They wouldn't listen to me when I told them what to do originally, but actually 'burning their hands on the hot stove' like this is what it took to make believers out of them. All resistance to using VCS vanished after that, so some good came out of the experience after all. (Lesson: Listen to your QA Manager! ;-)

Recommend you go right ahead and dump binary files into your VCS (so long as it will take them--the ones I mentioned will, though with the diffing limitations cited). It'll probably save your butt somewhere down the road. Irrespective of any reduced capability to diff binary files, you may also find as we have that there's considerable advantage to having all files relating to a project all together in one place--the VCS. We no longer have to look over on one shared drive outside the VCS for contractual, visual, and project management documents for a project, then for the technical artifacts in the VCS -- it's all together in one place. If your organization, like ours, has lots of projects and a considerable history, that ends up being a pretty big advantage when you're looking to reuse artifacts for other projects.

Regarding saving files such as office app files in text format -- if you can work it out, I'd say it's a great idea--you can gain meaningful diffing and more compact version storage, as well as fewer false status changes. I concur with other posters' comments, however, in that the HTML version in which MS Office saves files is beyond shitty; haven't had a chance to fiddle with XML versions much yet, though. Since we're not planning to go to Windows/Office XP any time soon (if ever), that may not be an option for us.

Personally, if it were up to me, I'd opt for creating as much documentation as possible in plain text format to begin with--substance is more important than packaging normally anyway. In practice, though, this often isn't practical/possible for diagrams, and MS Word's formatting capabilities do make lots of documents faster/easier to write and much more readable than simply using plain text. So, I haven't yet hit on a good solution for non-binary storage of most of our documentation; we just keep the binaries in our VCS.

Good luck to you,


Tuesday, December 31, 2002

I would certainly recommend using a VCS for documents and binary files.  Having the ability to diff would be nice but using the same system across the company is more important.  (And using a system that won't cost as much as a small car is often important to us small companies on the board.)  So I'd recommend just using what is already there, with one caution:

I would NOT recommend storing anything in a shared Visual SourceSafe database if VSS is using structured storage file format internally for its database and the document is also SSF.  Office 97 is SSF, and I suspect any Office since is, up to the change to XML format for XP.  Some people have never had problems with it, but I have, I'm not alone, and there's no way of knowing which camp you'd end up in.  And really, the last thing you want to deal with when you're in a hurry is losing doc files and having them take random bites out of your source code history.
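If you want to check which of your documents are actually in structured storage (OLE compound file) format before deciding where to put them, the files start with a fixed 8-byte signature. The following is only a rough Python sketch; the function name `is_structured_storage` is invented for this example, and it says nothing about how VSS itself stores things.

```python
# Signature found at the start of OLE compound (structured storage) files,
# such as Word 97 .doc files.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

def is_structured_storage(path):
    """Return True if the file at `path` begins with the OLE signature."""
    with open(path, "rb") as f:
        return f.read(len(OLE_MAGIC)) == OLE_MAGIC
```

Running this over a docs folder would give a quick list of the files that belong in the separate database suggested below.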

On the other hand, the next to last thing you want to do when in a hurry is try to set up CVS, or make a big purchase of software.  So, if you have VSS in the rest of the company, just make a separate VSS database for doc files for now, just in case.

Tuesday, December 31, 2002

Mikayla -

hmm. looks like I've been lucky to have dodged the VSS bullets you refer to.

so that I'm clear -- you're saying that you and others have experienced problems storing MS Office files in VSS?

can you tell me how, specifically, the problems manifest themselves? i.e. what symptoms would I watch for if I were experiencing the anomaly you describe?

thanks much,

Oh - and you're certainly right about the cost issues. We use $tarTeam, but if we didn't already have sufficient licenses of it from purchases made a couple of years ago, we certainly couldn't afford to buy it now; we'd be back with VSS (which is what we used before going to $T) as our VCS.

Tuesday, December 31, 2002

Regarding Visual SourceSafe corrupting Word files - this was a known possible problem in older versions of SourceSafe under some conditions (for example, a Word document was edited in Word 97 and saved in Word 95 format and then checked into SourceSafe). This particular problem was corrected in version 6 of Visual SourceSafe.

We store a variety of binary files (Office documents, EXE's, DLL's, PDF files, graphic image files, compiled help files, etc) in Visual SourceSafe (version 6) and have never had a problem with any of these files.

As noted above, the only drawback is that you cannot view the differences between 2 versions of a binary file in Visual SourceSafe (or most other source control systems). However, you can still get 2 different versions of the file from SourceSafe and compare them in some other way (such as, for Word files, the "Compare Documents" feature in Word).

Philip Dickerson
Tuesday, December 31, 2002

The problem with binaries in VSS is that you run up against the database size that causes corruption.

Perforce has an office enhancement specifically for doing revision control on Office docs.  Never used it, however.

Tuesday, December 31, 2002

Although it's a bit like suggesting changing your entire tool set for a nice shiny box that keeps everything that shape organised, if you use OpenOffice then you can version your files within the same structured XML file.

It's likely Office XP could do that but I've given up on Office.

Simon Lucy
Wednesday, January 1, 2003

Perforce can handle not only source code but all types of documents.  I suggest downloading the free 2 user version and trying it out, it can't hurt.

Wednesday, January 1, 2003

Look into Microsoft Sharepoint Whatever It's Called. There are a few different vaguely related versions, I think you want 'team services'. It might have versioning, though I forget. Probably comes with some Microsoft Office CD too (backoffice? something else?)

Word has its own versioning weirdness built in, you can sort of use that to track revisions, check in to any of the versioning systems which can handle binary files, and every once in a while blow away the revision history right after a checkin, then immediately check in the clean copy.

Word (and some other programs) can also do diff on binary files. But the source control won't generally understand it.

Giant XML messes may or may not survive much better in source control. They won't be mangled much (hope there's no keyword expansion), but they also won't have good history. Think, for example, of wordwrap. Or even lack of wordwrap. You'll need a special diff program there too.
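As a rough illustration of the wordwrap problem, one workaround is to flatten both versions to one line per element with whitespace collapsed, and only then run an ordinary line diff. This is a sketch using Python's standard library, not an existing XML diff tool; `flatten` and `xml_diff` are names made up here, and it ignores comments and processing instructions (the parser drops them).

```python
import difflib
import xml.etree.ElementTree as ET

def flatten(xml_text):
    """Turn XML into one line per element: path, sorted attributes, text.
    Runs of whitespace (including line breaks) collapse to single spaces."""
    lines = []

    def walk(elem, path):
        here = path + "/" + elem.tag
        attrs = " ".join('%s="%s"' % kv for kv in sorted(elem.attrib.items()))
        text = " ".join((elem.text or "").split())
        lines.append(" ".join(part for part in (here, attrs, text) if part))
        for child in elem:
            walk(child, here)
            # Text that follows a child element still belongs to the parent.
            tail = " ".join((child.tail or "").split())
            if tail:
                lines.append(here + " (text) " + tail)

    walk(ET.fromstring(xml_text), "")
    return lines

def xml_diff(old_xml, new_xml):
    """Line diff of the flattened forms; re-wrapping alone yields no output."""
    return list(difflib.unified_diff(
        flatten(old_xml), flatten(new_xml), lineterm="", n=0))
```

With this, a paragraph that was merely re-wrapped produces an empty diff, while a changed word shows up as one removed and one added line for that element. It would not help with an application that reorders or renames elements on every save.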

I imagine people are working on XML diff however, since the results would be shared across multiple application domains. Anyone know of this?

Friday, January 3, 2003
