The big import and not getting magic names

The site has 92 HTML files and about 300 other files.  It was created in 1999 and last modified last summer, and the webmaster is long gone.  All we have is FTP access to the host.

I expected too much of the "Import folder from disk" function.  CityDesk crashed during the import, but the import itself worked.  Then I realized I don't have the advantage of magic names.  I can edit the HTML files, but CityDesk won't display the graphics.  I know they are there, and I know CityDesk will publish them, but when I look at an article the image links appear broken.

If I can get through this and give the site the full CityDesk treatment, we'll be in great shape and I'll be able to give the site to the owner.  It's going to take a lot of work to get there though.

Any suggestions or something I've missed?

Friday, March 29, 2002

Sadly, the "magic fixups" only happen if you import a single page at a time.

And you would still have .html files (not articles).

I know that we have to make large scale imports work better and will be thinking hard about this soon.

It's very hard to figure out a general way to solve the problem of how to convert an HTML page (with the template baked in) into an article (with the template stripped away).

Joel Spolsky
Wednesday, April 03, 2002

I think this is a job for a CityDesk accessory, just to manage the assignment of magic names for imported folders.  I may be missing some key ideas though.

Generalizing how to grab part of an imported HTML page and place it into an article, automatically dealing with keywords, authors, publish dates, etc., is beyond me:  "Some assembly required."

Wednesday, April 03, 2002

Joel Spolsky wrote:

"It's very hard to figure out a general way to solve the problem of how to convert an HTML page (with the template baked in) into an article (with the template stripped away)."

My company demo'd a software product that read PDFs, identified articles, and spit them out in various ways. It worked off of a sort of regular expression. The way they explained it was something like this:

"You tell it where the article begins, and where the article ends. If every article begins with large red text for a headline, then you tell it 'Start at the large red text, the large red text itself is the headline, and end at the beginning of the next large red text.' The rules could be based on font size, color, and the words themselves."

In HTML you can add rules based on tags such as H1, H2, etc. Basically it's a user-defined regular expression, probably dumbed down so the user doesn't have to write the regex himself (à la Agent Ransack).
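
A minimal sketch of that kind of rule in Python, assuming the rule is "an article starts at each H1, the H1 text is the headline, and the article ends at the next H1." The function name split_articles and the H1 rule are illustrative assumptions, not anything CityDesk provides, and a regex like this will only cope with fairly regular hand-written HTML.

```python
import re

# Assumed rule: every article begins with an <h1> headline and runs
# until the next <h1> (or the end of the page).
HEADLINE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def split_articles(html):
    """Return a list of (headline, body) pairs, one per <h1> found."""
    matches = list(HEADLINE_RE.finditer(html))
    articles = []
    for i, match in enumerate(matches):
        # The body ends where the next headline starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        headline = TAG_RE.sub("", match.group(1)).strip()
        body = html[match.end():end].strip()
        articles.append((headline, body))
    return articles
```

Anything before the first H1 (typically the baked-in template header) is simply ignored by this rule, which is roughly the behavior described for the PDF tool.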

Dreamweaver has a nice templating system that I use on sites that don't require CityDesk's power. The thing I like about it is you can create a normal page and then tell it "this is the header," "this is the left nav," "this is the body - you can edit this part," etc.

If you can open up one HTML file that got copied & pasted to make all the others, you can just highlight swaths of HTML and tell it "this is not part of the article" "this is the sidebar" etc. Hopefully CityDesk could then open up each HTML file and recognize each part.
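
As a rough illustration of the idea, here is a Python sketch that assumes the user has highlighted two swaths in one sample page, a header block and a footer block, and that every other page contains those blocks character for character (plausible if the pages really were copied and pasted). The function name extract_body and its parameters are hypothetical.

```python
def extract_body(page, header, footer):
    """Return the HTML between a known header block and footer block.

    Returns None when either block is not found, so the caller can flag
    the page for manual cleanup instead of guessing.
    """
    start = page.find(header)
    if start == -1:
        return None
    start += len(header)
    end = page.find(footer, start)
    if end == -1:
        return None
    return page[start:end].strip()
```

Exact string matching is the weakest part of this sketch: a page whose template was edited even slightly would come back as None and need attention by hand.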

Lots of sites have a single large table with two or three columns - a column that acts as left nav, a column that acts as the body, and possibly a column that acts as the sidebar. If CityDesk could recognize this table and pull out the columns and plop them into the right places...
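
A sketch of pulling those columns apart with Python's standard html.parser module, assuming the layout is the outermost table on the page and each of its top-level TD cells is one column. The class and function names are made up for illustration, and deciding which column is the nav, body, or sidebar is left to the person doing the import.

```python
from html.parser import HTMLParser


class LayoutColumns(HTMLParser):
    """Collect the raw HTML inside each top-level cell of the outermost table."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.table_depth = 0
        self.in_cell = False
        self.columns = []

    def _add(self, text):
        if self.in_cell:
            self.columns[-1] += text

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        if tag == "td" and self.table_depth == 1 and not self.in_cell:
            # A new layout column starts here.
            self.in_cell = True
            self.columns.append("")
            return
        self._add(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._add(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "td" and self.table_depth == 1 and self.in_cell:
            self.in_cell = False
            return
        self._add("</%s>" % tag)
        if tag == "table":
            self.table_depth -= 1

    def handle_data(self, data):
        self._add(data)

    def handle_entityref(self, name):
        self._add("&%s;" % name)

    def handle_charref(self, name):
        self._add("&#%s;" % name)


def extract_columns(html):
    """Return the contents of the layout table's columns, left to right."""
    parser = LayoutColumns()
    parser.feed(html)
    parser.close()
    return parser.columns
```

Tables nested inside a column are kept as part of that column's HTML rather than being split further.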

Importing is going to have to take a lot of massaging by the person transposing the site. Anything CityDesk can do to import a site would probably be very appreciated.

Mark W
Thursday, April 04, 2002

