This is my temporary workaround for getting image output from DOCX files; it may be useful to you:

```python
from pathlib import Path
import tempfile

import mammoth
from markitdown import MarkItDown, StreamInfo

docx_file = "/path/to/docx"
img_dir = Path("image")
img_dir.mkdir(exist_ok=True)


def convert_image(image):
    # Save each embedded image to img_dir and return the <img> attributes.
    with image.open() as image_bytes:
        img_name = Path(image_bytes.name).name
        img_path = img_dir.joinpath(img_name)
        with open(img_path, "wb") as f:
            f.write(image_bytes.read())
    return {"src": f"{img_path}"}


# Convert DOCX to HTML, writing images to disk instead of inlining them.
html_content = mammoth.convert_to_html(
    docx_file, convert_image=mammoth.images.img_element(convert_image)
)

md = MarkItDown()
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(html_content.value.encode("utf-8"))
    tmp.flush()
    res = md.convert(tmp.name, stream_info=StreamInfo(mimetype="text/html"))

# Write the Markdown next to the source file (swap only the extension).
Path(docx_file).with_suffix(".md").write_text(res.text_content, encoding="utf-8")
```
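One thing to watch in the `convert_image` callback above: it takes the file name from the opened stream's `name` attribute, which may not be available for every image source. A minimal sketch of an alternative, assuming only the documented `image.open()` and `image.content_type` members of mammoth's image object, is to number the files yourself. The names `make_image_saver` and `save_image` are made up for this example, and the extension is derived naively from the content type (so `image/svg+xml` becomes `.svg+xml`).

```python
import itertools
from pathlib import Path


def make_image_saver(img_dir: Path):
    """Return a callback that saves images as image1.<ext>, image2.<ext>, ..."""
    img_dir.mkdir(parents=True, exist_ok=True)
    counter = itertools.count(1)

    def save_image(image):
        # Naive extension from the MIME type, e.g. "image/png" -> "png".
        extension = image.content_type.partition("/")[2] or "bin"
        img_path = img_dir / f"image{next(counter)}.{extension}"
        with image.open() as image_bytes:
            img_path.write_bytes(image_bytes.read())
        # Attributes for the generated <img> element.
        return {"src": img_path.as_posix()}

    return save_image
```

It should drop in as `mammoth.images.img_element(make_image_saver(Path("image")))` in place of the original callback, though I have only checked it against a stand-in image object, not a real DOCX.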
It would be good to have an option to export images to a media directory and insert a proper image link into the Markdown output.
When we use RAG with DOCX manuals, we currently use pandoc with image export. In some cases it makes no sense to describe the pictures, since they may be schematics or diagrams. In those cases we pass the image link to the LLM, and the image is then displayed in the final answer.
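For anyone unfamiliar with the pandoc route mentioned here, a rough sketch of what it might look like from Python follows. It assumes the `pandoc` executable is on `PATH` and uses pandoc's `--extract-media` option; the function names, the `media` directory, and the choice of `gfm` as the output format are assumptions for illustration, not our exact setup.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path: str, media_dir: str = "media") -> list[str]:
    """Build a pandoc command that converts DOCX to Markdown and extracts images."""
    source = Path(docx_path)
    output = source.with_suffix(".md")
    return [
        "pandoc",
        str(source),
        "--to", "gfm",                   # GitHub-Flavored Markdown (assumed choice)
        f"--extract-media={media_dir}",  # write embedded images under media_dir
        "--output", str(output),
    ]


def convert_with_pandoc(docx_path: str, media_dir: str = "media") -> Path:
    """Run pandoc and return the path of the generated Markdown file."""
    subprocess.run(build_pandoc_command(docx_path, media_dir), check=True)
    return Path(docx_path).with_suffix(".md")
```

Calling `convert_with_pandoc("manual.docx")` would write `manual.md` and place the extracted images under `media/`, with the Markdown image links pointing at those files.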