xhtml-mathml stylesheet updated 2007/05/09
I commented on Murray's blog That it ought to be possible to get XHTML/MathML documents out of Word. Having speculated that it ought be possible, conscience dictated that I try to do it. The results are described below.
The first problem was that I'd never used Word (or any other WYSIWYG editor). It seems strange to me, but then I've been using emacs so long I'm probably corrupted.
Word 2007 has MathML input/output (via an XSL stylesheet installed with the system), and has HTML input/output (via its save as web page file menu), so the plan of action is: save the document as html, clean it up to xhtml, using the stylesheet to convert the mathematics to MathML at the same time.
- Write your document in Word 2007, save as web page file.htm .
- Use tagsoup to get some usable XML from this output.
java -jar tagsoup-1.1.jar --lexical --output-encoding=iso-8859-1 file.htm > temp.xml
- Use the supplied xhtml-mathml stylesheet to do some further cleanup and apply the Microsoft supplied omml2mml.xsl stylesheet to the math fragments.
java -jar saxon8.jar -o file.xml temp.xml xhtml-mathml.xsl
- Word document (docx)
- "HTML" generated by Word
- XHTML + MathML