README for Geoff's Word to XML document converter (working title)
July 6, 2003
oakhamg@sourceforge.net
PROJECT
This project aims to create a general-purpose conversion tool for Microsoft Word documents.
LICENSE
This software is Copyright (C) 2003 Geoffrey Oakham and published under the terms and conditions of the GNU General Public License (version 2). See LICENSE.txt for details.
(I'm considering adding and/or switching to the Python Public License, LGPL and the Calvinball Commercial Contract. If you send me patches, please let me know if I'm allowed to relicense the patch on your behalf.)
REQUIREMENTS
-Python 2.2+
-Python XML sax libraries
-Little-endian platform recommended
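
The little-endian recommendation can be checked before running the converter. What follows is only a sketch, not code from this package: the helper names is_little_endian and read_u16_le are made up for illustration. It also shows the usual way to read a 16-bit little-endian value with the struct module so that the result does not depend on the host's byte order.

```python
import struct
import sys


def is_little_endian():
    # sys.byteorder is the string "little" or "big" for the host platform.
    return sys.byteorder == "little"


def read_u16_le(data, offset=0):
    # "<H" forces little-endian, unsigned 16-bit decoding on any host.
    return struct.unpack("<H", data[offset:offset + 2])[0]


if not is_little_endian():
    sys.stderr.write("warning: a little-endian platform is recommended\n")
```

Decoding with an explicit "<" format is one way a parser can avoid depending on the host platform at all. Whether the converter itself does this is not stated here.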
INSTALLING
tar -xzf doc2xml-0.0.1.tar.gz
cd doc2xml-0.0.1
python ./driver /tmp/mutt.xxxxxxxx.doc
RELEASE NOTES
This is a 'proof-of-concept' release, which I guess makes it 'alpha' quality. The following features are implemented (and mostly working):
-accepts Word 97, 2000 and 2002 documents (but not 95)
-accepts both 'fast save' and 'full save' documents
-Unicode input & output
-bold & italics formatting
-footnotes (basic)
-stylesheets (basic)
For future versions, I'm going to flesh out and stabilise the existing features. The code is currently messy (because it's essentially a prototype) but I'm optimistic it's recoverable.
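
Word 97, 2000 and 2002 documents are OLE compound files, so a cheap sanity check before attempting a conversion is to look at the first eight bytes of the input. The sketch below is not part of this package, and the function names looks_like_ole and file_looks_like_ole are hypothetical. It uses a bytes literal, so it assumes a newer Python than the 2.2 minimum listed above.

```python
# Signature found at the start of every OLE compound file.
OLE_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def looks_like_ole(header):
    # True if the given bytes start with the OLE compound file signature.
    return header[:8] == OLE_SIGNATURE


def file_looks_like_ole(path):
    # Read only the first 8 bytes; the rest of the file is not needed.
    f = open(path, "rb")
    try:
        return looks_like_ole(f.read(8))
    finally:
        f.close()
```

This only rules out files that are clearly not OLE containers. Word 95 documents and other Office files carry the same signature, so passing the check does not mean the document is a supported version.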
THANKS
I owe a big thank you to Martin Schwartz and Takanori Kawai for creating Perl's OLE Storage modules. About a month ago, when I was trying to put together a Perl script that did something (anything!) with a Word document, I nearly gave up when I discovered I needed to parse OLE data structures. I was really happy to discover Martin & Takanori had already done this. Ole.py is a direct port of Takanori's OLE::Storage_Lite.pm.
Thanks also to 'Shaheed' (I'll look up your full name when I get online) for maintaining a copy of and corrections to the "Microsoft Word 97 Binary File Format" available at:
http://www.aozw65.dsl.pipex.com/
And of course thanks to everyone who listened to me ramble about this project, especially the non-geeks.
