at.ncvp.me

First Astro project with 'pages' and 'posts' collections

DOC to Markdown conversion, .doc version

Converted from doc(x) by doc2md

Posted on 22nd May 2026 at 02:06 by Admin


This is the source page test_docx.docx which the conversion process is supposed to convert into test-docx.md
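
The renaming in that example (test_docx.docx becoming test-docx.md) might be sketched like this in Python. The function name md_target and the exact rule (lower-case the stem, turn underscores and spaces into hyphens, swap the extension) are assumptions for illustration, not something taken from doc2md.

```python
from pathlib import Path


def md_target(source: Path, suffix: str = ".md") -> Path:
    """Return the Markdown path for a .doc/.docx source file.

    Assumed rule (not confirmed by doc2md): lower-case the stem and
    turn underscores and spaces into hyphens.
    """
    stem = source.stem.lower().replace("_", "-").replace(" ", "-")
    return source.with_name(stem + suffix)


if __name__ == "__main__":
    print(md_target(Path("test_docx.docx")))
```

Passing suffix=".mdx" would give the MDX name instead, should that ever be needed.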

If possible, MD will be preferable to MDX. MDX throws syntax errors on all sorts of innocuous-looking strings.
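
MDX reads an opening curly brace as the start of a JavaScript expression and an opening angle bracket as the start of a JSX tag, so text such as a key name in angle brackets can stop the build. Below is a rough sketch of a pre-flight check that lists the lines likely to cause trouble. The name mdx_risky_lines is hypothetical, and the check is deliberately crude: it only skips fenced code blocks and inline code spans.

```python
import re

FENCE = "`" * 3                      # opening/closing marker of a fenced block
INLINE_CODE = re.compile(r"`[^`]*`")  # inline code spans are safe in MDX
RISKY = re.compile(r"[{}<]")          # characters MDX may try to parse


def mdx_risky_lines(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that MDX may reject."""
    hits = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence   # toggle on every fence line
            continue
        if in_fence:
            continue
        # Ignore anything inside inline code before looking for trouble.
        if RISKY.search(INLINE_CODE.sub("", line)):
            hits.append((number, line))
    return hits
```

An empty result does not prove the file is valid MDX; it only means none of these three characters appear in running text.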

We need to be able to convert .doc and .docx files from my documentation store to .md(x) for Astro content creation.
We need a utility that will do the conversion automatically file by file. I don’t think we’ll ever need a block converter.
Various schemes have been tried:

See also

Contents

What has to work

Top

Tables

Markdown tables have to have a header by default:

| Feature | Status | Notes |
| --- | --- | --- |
| Astro Layouts | Working | Using @layouts alias |
| Styles | Scoped | Testing specificity |
| Indentation | 2 Spaces | Configured in VS Code |
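
A pipe table needs a header row followed by a delimiter row before any data rows. Here is a small sketch of a helper that always emits both; the name markdown_table is hypothetical and is not part of the conversion scripts.

```python
def markdown_table(header: list[str], rows: list[list[str]]) -> str:
    """Render a pipe table; the header and delimiter rows always come first."""

    def line(cells: list[str]) -> str:
        # A literal pipe inside a cell must be escaped or it splits the cell.
        return "| " + " | ".join(cell.replace("|", r"\|") for cell in cells) + " |"

    delimiter = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), delimiter, *(line(row) for row in rows)])


if __name__ == "__main__":
    print(markdown_table(
        ["Feature", "Status", "Notes"],
        [["Styles", "Scoped", "Testing specificity"]],
    ))
```

For a table with no headings, passing empty strings as the header still produces the required header row, which can then be hidden with CSS.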

Blank headers are removed with CSS in astro-test.css:

|   |   |
| --- | --- |
| 1 | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc vel massa tincidunt, aliquam elit id, venenatis tellus. |
| 2 | Nunc molestie mauris et magna placerat tempus. |
| 3 | Phasellus sodales dolor enim, vel eleifend ante facilisis semper. |
| 4 | Integer vel dictum orci. |
| 5 | Praesent cursus ligula vel nisi rutrum, sit amet mollis tortor euismod. |
| 42 | Duis sollicitudin elit sit amet quam dictum congue. |

Top

Images

IMAGE-PLACEHOLDER

Laus Veneris, by Edward Burne-Jones

This is an embedded image. All such images will require manual intervention after conversion, so fix_word_01.py replaces images with ‘IMAGE-PLACEHOLDER’ text.
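
One way the replacement could work, sketched in Python. This is not the actual fix_word_01.py: it assumes the script works on the raw XML of word/document.xml inside the .docx archive, and that images appear as w:drawing elements (older documents may use w:pict instead, which this ignores). The function name replace_images is hypothetical.

```python
import re

# A text element can sit in the same run where the drawing used to be.
PLACEHOLDER = "<w:t>IMAGE-PLACEHOLDER</w:t>"
DRAWING = re.compile(r"<w:drawing\b[^>]*>.*?</w:drawing>", re.DOTALL)


def replace_images(document_xml: str) -> tuple[str, int]:
    """Swap every drawing element for placeholder text.

    Returns the new XML and the number of images replaced, so the
    caller knows how many need manual attention afterwards.
    """
    return DRAWING.subn(PLACEHOLDER, document_xml)
```

The count is worth logging per file, since every placeholder is a manual job later.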

See astro-test Textflow for more images with Fancybox effect.

External images should be no problem: Mary Emma Jones, by Emma Sandys, 1874

Top

Paragraph with hyperlinks that could cause trouble

Try putting hyperlinks in various places to see if they cause trouble.
The first one is on its own:

ncvp home page

ncvp home page at the start of a line

At the end of a line ncvp home page

In the middle of a line ncvp home page with text following

What about other lines in the same paragraph?
These don’t have links.

fix_word_01.py doesn’t fix any <Enter>-terminated line that includes a link. This really isn’t a problem.

Top

Section with Shift/Enter newlines

Maecenas id sapien risus. Nullam varius id tellus quis volutpat. In a tempor tortor, cursus sodales enim. Integer sed felis et massa lacinia molestie volutpat quis dui.
Duis interdum libero sed velit ornare, at blandit nulla volutpat. Mauris quis mi arcu. Praesent vulputate accumsan tortor, et ullamcorper sapien tincidunt ut. Nulla a eleifend justo.
Nam facilisis efficitur risus, vitae ornare ex fringilla vitae. Phasellus neque justo, vehicula sed finibus non, posuere sed ligula.

Top

Section with Enter newlines

Maecenas id sapien risus. Nullam varius id tellus quis volutpat. In a tempor tortor, cursus sodales enim. Integer sed felis et massa lacinia molestie volutpat quis dui.
Duis interdum libero sed velit ornare, at blandit nulla volutpat. Mauris quis mi arcu. Praesent vulputate accumsan tortor, et ullamcorper sapien tincidunt ut.
Nulla a eleifend justo.
Nam facilisis efficitur risus, vitae ornare ex fringilla vitae. Phasellus neque justo, vehicula sed finibus non, posuere sed ligula.

Top

MD vs MDX vs Astro pages and posts

Gemini says:

Standard .md is often preferable for simple, text-heavy blog posts, but .mdx and .astro offer superior flexibility depending on your specific needs.

Choosing the right format depends entirely on the level of interactivity you want to achieve.

1. Markdown (.md)

Best for standard, text-heavy articles (blogs, documentation, and guides).

2. MDX (.mdx)

Best for content that requires interactive elements or custom UI components.

3. Astro Pages (.astro)

Best for structural site pages (Home, About, Contact, and Tag Archives) or highly customized template layouts.

4. Summary Recommendation

Keep your actual pages/posts in .md if they are purely textual. Switch to .mdx only if you find yourself needing to inject custom interactive elements mid-post. Use .astro strictly for your site’s templates and structural pages.

Top

Conversion issues

Top

General columns

In practice, the only time I format a selection with columns is two columns of hyperlinks.
But what happens if I have two columns of general stuff?

Thing 1
Thing 2
Thing 3
Thing 4
Thing 5
Thing 6
Thing 7
Thing 8
Thing 9

That works. I’m not sure how.

Top

Conversion process

Each stage of the conversion process generates a new -tempn.docx file. These are normally deleted at the end of the process, but may optionally be kept for debugging.
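
The temp-file handling might be sketched as follows. Everything here is an assumption for illustration: the names run_stages, copy_stage and keep_temp are hypothetical, each stage is modelled as a function that reads one .docx and writes the next, and the final copy stands in for handing the tidied file to the converter.

```python
import shutil
from pathlib import Path
from typing import Callable

Stage = Callable[[Path, Path], None]


def copy_stage(src: Path, dst: Path) -> None:
    """Stand-in stage that changes nothing; a real stage would edit the file."""
    shutil.copyfile(src, dst)


def run_stages(source: Path, stages: list[Stage], output: Path,
               keep_temp: bool = False) -> Path:
    """Run each stage on the previous stage's output.

    Stage n writes <stem>-temp<n>.docx next to the source. The temp files
    are deleted at the end unless keep_temp is set.
    """
    current = source
    temps: list[Path] = []
    try:
        for number, stage in enumerate(stages, start=1):
            target = source.with_name(f"{source.stem}-temp{number}.docx")
            stage(current, target)
            temps.append(target)
            current = target
        shutil.copyfile(current, output)
    finally:
        # Clean up even if a stage fails, unless debugging was requested.
        if not keep_temp:
            for path in temps:
                path.unlink(missing_ok=True)
    return output
```

Keeping the cleanup in a finally block means a failed stage does not leave temp files behind unless keep_temp asks for them.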

fix_word_01.py

Change <Enter> newlines within a paragraph to <Shift/Enter>.
Don’t touch ordered or unordered lists, or the lists of links within 2-column sections.
Replace any images with ‘IMAGE-PLACEHOLDER’. They’re going to need manual intervention.
Extract the document title from the header and save it in a file for fix_pandoc.py to add to the frontmatter.

fix_word_02.py

Locate 2-column sections and bracket them with special text for fix_pandoc.py to replace with HTML.

fix_pandoc.py

Replace the special text around the 2-column sections with HTML.
Add the saved document title to the frontmatter.
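
The two jobs handed over to fix_pandoc.py (swapping the special bracketing text for HTML, and adding the saved title to the frontmatter) could look something like the sketch below. The marker strings, the CSS class name and the function name finish_markdown are all hypothetical; the real scripts may use different text entirely.

```python
import json

# Hypothetical marker strings; the real scripts may use different text.
COLUMNS_START = "@@TWO-COLUMN-START@@"
COLUMNS_END = "@@TWO-COLUMN-END@@"


def finish_markdown(body: str, title: str) -> str:
    """Swap the column markers for HTML and prepend YAML frontmatter."""
    body = body.replace(COLUMNS_START, '<div class="two-columns">')
    body = body.replace(COLUMNS_END, "</div>")
    # json.dumps gives a double-quoted, escaped string, which YAML also
    # accepts, so colons or quotes in the title cannot break the frontmatter.
    quoted = json.dumps(title, ensure_ascii=False)
    return f"---\ntitle: {quoted}\n---\n\n{body}"
```

Quoting the title is the design choice that matters here: an unquoted title containing a colon would be misread as a nested YAML key.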

Top

Pre-conversion tidy

This is mainly the deletion of all the random sections that seem to appear, and the reinstatement of the 2-column sections.

Top

Category: tech Tags: edit admin

Rendered by src/pages/posts/[slug].astro