feat(epub): Add EPUB support#123

Closed
0xRaduan wants to merge 8 commits into microsoft:main from 0xRaduan:add-epub-support

Conversation

@0xRaduan

Addresses #88.

Adds new converter + new test.

# Convert content
content_md = []
h = html2text.HTML2Text()
h.body_width = 0 # Don't wrap lines
Contributor

Hi, could you check whether this can use the existing HtmlConverter?

class HtmlConverter(DocumentConverter):
    """Anything with content type text/html"""

    def convert(
        self, local_path: str, **kwargs: Any
    ) -> Union[None, DocumentConverterResult]:
        # Bail if not html
        extension = kwargs.get("file_extension", "")
        if extension.lower() not in [".html", ".htm"]:
            return None

        result = None
        with open(local_path, "rt", encoding="utf-8") as fh:
            result = self._convert(fh.read())

        return result

    def _convert(self, html_content: str) -> Union[None, DocumentConverterResult]:
        """Helper function that converts and HTML string."""
        # Parse the string
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
            script.extract()

        # Print only the main content
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
            webpage_text = _CustomMarkdownify().convert_soup(body_elm)
        else:
            webpage_text = _CustomMarkdownify().convert_soup(soup)

        assert isinstance(webpage_text, str)

        return DocumentConverterResult(
            title=None if soup.title is None else soup.title.string,
            text_content=webpage_text,
        )

Contributor

+1

Author

addressed

@gagb
Contributor

gagb commented Dec 20, 2024

@0xRaduan love this PR. We already have a dependency for HTML to text (markdownify) in the HTML converter. Can you check if that would be sufficient?

@gagb added the "awaiting op response" label (The PR is awaiting response/edits from the original poster.) on Dec 20, 2024
@0xRaduan
Author

0xRaduan commented Jan 9, 2025

Hey @gagb, sorry was on a long vacation, going to take a look right now...

@samuelfernandez

Thanks for this, really looking forward to it!

@0xRaduan requested a review from gagb on January 22, 2025, 18:50
@dgiagio

dgiagio commented Feb 23, 2025

Can we have an update on this? Thanks

Author

@0xRaduan left a comment

self-review


return "![%s](%s%s)" % (alt, src, title_part)

def convert_em(self, el: Any, text: str, convert_as_inline: bool) -> str:
Author

noticed that it doesn't have an <em> tag, and that's used in EPUB as far as I know
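
For readers following the thread, here is a rough illustration of what an emphasis hook such as the `convert_em` method above is for: turning inline emphasis tags into Markdown `*...*`. This is only a sketch built on the standard library's `html.parser`, not the markdownify-based code in this PR; the names `EmphasisToMarkdown` and `emphasis_to_markdown` are hypothetical, and treating `<i>` the same as `<em>` is an assumption.

```python
from html.parser import HTMLParser


class EmphasisToMarkdown(HTMLParser):
    """Hypothetical stand-in: renders <em>/<i> as *text* and drops all other tags."""

    # Assumption: both tags are rendered as Markdown emphasis.
    EMPHASIS_TAGS = {"em", "i"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Opening emphasis tag becomes an opening asterisk.
        if tag in self.EMPHASIS_TAGS:
            self.parts.append("*")

    def handle_endtag(self, tag):
        # Closing emphasis tag becomes a closing asterisk.
        if tag in self.EMPHASIS_TAGS:
            self.parts.append("*")

    def handle_data(self, data):
        # Text content is passed through unchanged.
        self.parts.append(data)


def emphasis_to_markdown(html: str) -> str:
    """Convert a small HTML fragment, keeping only text and emphasis markers."""
    parser = EmphasisToMarkdown()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

A call such as `emphasis_to_markdown("<p>A <em>very</em> good book.</p>")` would keep the emphasis that a converter without such a hook might silently drop.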

@0xRaduan
Author

cc. @gagb - do you think we can merge this?

I resolved all the merge conflicts as far as I can see

@0xRaduan
Author

or also cc. @afourney, since I see you've been merging the latest PRs into main

@0xRaduan
Author

@gagb - does this still await my response? any timeline for getting this merged?

@afourney
Member

@0xRaduan Apologies for the delay. We're a super small team, with several large projects (e.g., AutoGen). I'll work on getting this in, and conflicts resolved, this weekend.

@afourney removed the "awaiting op response" label (The PR is awaiting response/edits from the original poster.) on Mar 15, 2025
@afourney
Member

Ok, on second glance, EbookLib is AGPL -- which is very strong copyleft. It's not clear to me that we can include it here. I can look for an alternative, or I can help you set it up as a 3rd party plugin that you can host. LMK

@afourney
Member

@0xRaduan I adapted this PR to not use ebooklib (as per above discussion). Admittedly, rather vibe-coded.

Please have a look at #1131 and let me know if it suits your need.
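
As context for the approach discussed above, an EPUB can be read without ebooklib because it is a ZIP archive whose `META-INF/container.xml` points at a package (OPF) file listing the content documents. The following is a minimal standard-library sketch under that assumption. It is not the code from #1131; the function name `read_spine_documents` is hypothetical, and it assumes UTF-8 content and hrefs that are not percent-encoded.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from typing import List

# XML namespaces used by the EPUB container and package documents.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def read_spine_documents(epub_path) -> List[str]:
    """Return the XHTML content documents of an EPUB in reading (spine) order.

    `epub_path` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_path) as zf:
        # 1. META-INF/container.xml points at the package (OPF) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml has no rootfile entry")
        opf_path = rootfile.attrib["full-path"]
        opf_dir = posixpath.dirname(opf_path)

        # 2. The OPF manifest maps item ids to hrefs (relative to the OPF file).
        opf = ET.fromstring(zf.read(opf_path))
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }

        # 3. The spine lists those ids in reading order.
        documents = []
        for itemref in opf.findall("opf:spine/opf:itemref", OPF_NS):
            href = manifest.get(itemref.attrib["idref"])
            if href is None:
                continue  # skip spine entries with no manifest item
            member = posixpath.normpath(posixpath.join(opf_dir, href))
            documents.append(zf.read(member).decode("utf-8"))
        return documents
```

Each returned string is an HTML document, so it could in principle be handed to an existing HTML-to-Markdown helper such as the `_convert` method quoted earlier in this thread.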

afourney added a commit that referenced this pull request Mar 17, 2025
* Adapted #123 to not use ebooklib.
* Updated README.md
@afourney
Member

Closed in #1131

@0xRaduan
Author

Thanks, closing this PR.

@0xRaduan closed this on Mar 23, 2025
Labels

None yet

8 participants