Image export #269

pizang · 2025-01-08T08:32:46Z

It would be good to have an option to export images to a media directory and insert a proper image link into the generated Markdown.

When we use RAG with DOCX manuals, we currently use pandoc with image export. In some cases it makes no sense to describe pictures in text, since they may be schemas or diagrams. In those cases we pass the image link to the LLM, and the image is then displayed in the final answer.

CarlZhang12 · 2025-10-20T14:25:28Z

This is my temporary workaround for getting image output from DOCX files; it may be useful to you:

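from pathlib import Path
import tempfile

import mammoth
from markitdown import MarkItDown, StreamInfo

docx_file = "/path/to/docx"
img_dir = Path('image')
img_dir.mkdir(exist_ok=True)

def convert_image(image):
    # Save one embedded image into img_dir and return the attributes for its <img> tag.
    with image.open() as image_bytes:
        # image_bytes.name supplies the file name; only its last path component is kept.
        img_name = Path(image_bytes.name).name
        img_path = img_dir.joinpath(img_name)
        with open(img_path, "wb") as f:
            f.write(image_bytes.read())
    return {
        "src": f"{img_path}"
    }

# DOCX -> HTML; mammoth's documented usage passes a file opened in binary mode.
with open(docx_file, "rb") as f:
    html_content = mammoth.convert_to_html(f, convert_image=mammoth.images.img_element(convert_image))
md = MarkItDown()

# HTML -> Markdown through a temporary file.
# Reading tmp.name while the file is still open works on POSIX but can fail on Windows.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(html_content.value.encode('utf-8'))
    tmp.flush()
    res = md.convert(tmp.name, stream_info=StreamInfo(mimetype='text/html'))
    # with_suffix swaps only the extension; str.replace('docx', 'md') would also rewrite 'docx' elsewhere in the path.
    Path(docx_file).with_suffix('.md').write_text(res.text_content, encoding='utf-8')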

2 replies

tqye2000 Oct 29, 2025

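Many thanks for the example. However, I think there is a minor bug. It should be:

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(html_content.value.encode('utf-8'))
    tmp.flush()

res = md.convert(tmp.name, stream_info=StreamInfo(mimetype='text/html'))
Path(docx_file).with_suffix('.md').write_text(res.text_content, encoding='utf-8')
Path(tmp.name).unlink()

Namely, the temp file needs to be closed before the md process reads it. Note that delete=False is required for this: with the default delete=True the file is removed as soon as the with block closes it, so it has to be kept and then deleted manually after the conversion.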
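
The close-then-read pattern can be sketched with the standard library alone. This is only an illustration: the helper name `write_closed_temp` is hypothetical, and `Path.read_text` stands in for the `md.convert` call so the snippet runs without markitdown installed.

```python
import tempfile
from pathlib import Path


def write_closed_temp(data: bytes, suffix: str = ".html") -> Path:
    """Write data to a named temp file, close it, and return its path.

    delete=False keeps the file on disk after the with block closes it,
    so the caller is responsible for removing it afterwards.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
    return Path(tmp.name)


html = "<p>hello <img src='image/image1.png'></p>"
tmp_path = write_closed_temp(html.encode("utf-8"))
try:
    # Stand-in for md.convert(str(tmp_path), ...): the closed file can be opened by name.
    round_trip = tmp_path.read_text(encoding="utf-8")
finally:
    tmp_path.unlink()  # manual cleanup, because delete=False
```

Closing the file before handing its name to another reader avoids the platform differences in reopening a temp file that is still open; the price is the explicit cleanup in the `finally` block.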

CarlZhang12 Oct 31, 2025

I think post-processing works better: run markitdown with --keep-data-uris to get the Markdown output first, then match the embedded images with a pattern such as base64_pattern = r'data:image/([a-zA-Z0-9]+);base64,([a-zA-Z0-9+/]+={0,2})', decode the base64 content, save it to a local file, and replace each base64 string with a reference to that file.
This keeps the original data pipeline unchanged and gives better control over how the local image filenames are generated.
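
A minimal sketch of that post-processing step, assuming the Markdown has already been produced with data URIs kept. The function name `extract_data_uris`, the `prefix` parameter, and the numbered filename scheme are illustrative assumptions; only the pattern comes from the post above.

```python
import base64
import re
from pathlib import Path

# Pattern quoted from the post; image subtypes containing "+" or "-" (e.g. svg+xml) will not match it.
BASE64_PATTERN = re.compile(r"data:image/([a-zA-Z0-9]+);base64,([a-zA-Z0-9+/]+={0,2})")


def extract_data_uris(markdown: str, img_dir: Path, prefix: str = "image") -> str:
    """Write each embedded image to img_dir and return Markdown that links to the files."""
    img_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def save(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        subtype, payload = match.group(1), match.group(2)
        # Hypothetical naming scheme: <prefix>_<running number>.<image subtype>
        img_path = img_dir / f"{prefix}_{counter:03d}.{subtype}"
        img_path.write_bytes(base64.b64decode(payload))
        return img_path.as_posix()

    # Each data URI is replaced by the path of the file it was saved to.
    return BASE64_PATTERN.sub(save, markdown)
```

Usage would look like `Path("manual.md").write_text(extract_data_uris(md_text, Path("media")), encoding="utf-8")`, where `md_text` is the markitdown output and `manual.md` and `media` are placeholder names.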
