ffutures wrote 2007-10-11 08:55 pm

Convert Word file to HTML

While my aim with the Tooth and Claw RPG is to publish it primarily as a PDF, I also want to do an HTML release. But I'd really prefer not to go to the hassle of coding it by hand. Currently it's a Word document, and getting the PDF is really easy: I just print to Acrobat. Getting HTML appears to be a nightmare by comparison, at least if I use Word.

Is there an alternative that'll give me nice clean HTML without the huge amount of crud that Word puts in, e.g. micromanaging the position of every letter? But without mangling the page layout too badly?

Just tried Open Office - it is definitely NOT the answer to this one, layout was awful.

I have a very vague memory of a program called Stripper that did something like this. Does anyone know if it's still around?

pengshui-master.livejournal.com 2007-10-11 08:20 pm (UTC)
I used abiword to generate HTML the last time I needed to do this, but I didn't need anything special from the layout.

There was still a bit of crud, but it was quite easy to strip out. Unfortunately I don't think abiword runs on Windows.

How many files are there? It might be possible to do it by hand or with custom Perl relatively easily.
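
As an illustration of the custom-script route, here is a rough sketch in Python rather than Perl. The function name `strip_word_crud` and the choice of patterns are assumptions for the example, not something taken from any tool mentioned in this thread, and regular expressions will miss plenty of cases that a real HTML parser would catch.

```python
import re


def _drop_if_mso(match):
    # Remove the whole style attribute when it carries any Word-only "mso-" property.
    # (This also discards ordinary CSS in the same attribute; a finer filter is left out.)
    return "" if "mso-" in match.group(0).lower() else match.group(0)


def strip_word_crud(html: str) -> str:
    """Hypothetical helper: remove some common Word-specific markup from HTML."""
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Namespaced Office tags such as <o:p> and </o:p> (any text inside is kept)
    html = re.sub(r"</?[a-z]+:[a-z0-9]+[^>]*>", "", html, flags=re.IGNORECASE)
    # class attributes naming Word's built-in styles, e.g. class=MsoNormal
    html = re.sub(r"""\s+class=(["']?)Mso\w+\1""", "", html, flags=re.IGNORECASE)
    # style attributes, dropped only if they contain mso- properties
    html = re.sub(r"""\s+style=(?:"[^"]*"|'[^']*')""", _drop_if_mso, html,
                  flags=re.IGNORECASE)
    return html


if __name__ == "__main__":
    sample = "<p class=MsoNormal style='mso-pagination:none'>Hello<o:p></o:p></p>"
    print(strip_word_crud(sample))
```

Layout features such as drop caps and sidebars would still need a hand-written stylesheet afterwards; this only clears out the noise.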

ffutures.livejournal.com 2007-10-11 08:25 pm (UTC)
I can do it by hand OK, it's just a bit of a pain. I have abiword on my iBook, but it's a big document and the layout is quite complex - drop caps, sidebars, etc. - so I suspect it won't be very simple.

vincentursus.livejournal.com 2007-10-11 11:12 pm (UTC)
I have used a utility called 'Tidy' to do that in the past. http://www.w3.org/People/Raggett/tidy/

ffutures.livejournal.com 2007-10-12 06:28 am (UTC)
Thanks, I'll take a look tonight.

jgracio.livejournal.com 2007-10-12 12:00 am (UTC)
Have you tried saving it as "Web Page, Filtered"?

Still leaves some crud, but all Office-specific tags should be removed, leaving fairly standard HTML.

ffutures.livejournal.com 2007-10-12 06:27 am (UTC)
Yes - the results were NOT good. But I may use this for the first pass.

akicif.livejournal.com 2007-10-12 12:27 pm (UTC)
The original version of 1stpage from evrsoft has a very nice implementation of Dave Raggett's Tidy that degrots Office HTML and does a nice conversion to using stylesheets....

ffutures.livejournal.com 2007-10-12 02:15 pm (UTC)
Thanks, I'll check it out.