Add configurable Azure Document Intelligence analysis features via CLI and kwargs #1561

Open
kei-yamazaki wants to merge 1 commit into microsoft:main from kei-yamazaki:doc-intel-features

@kei-yamazaki

Summary

This PR makes Azure Document Intelligence analysis features configurable while preserving the existing default behavior.

What changed

  • Added CLI support for custom Document Intelligence (DI) analysis features:
    • --docintel-feature (repeatable and comma-separated)
  • Added Python kwargs support:
    • MarkItDown(..., docintel_features=[...])
    • per-call override via convert(..., docintel_features=[...])
  • Switched feature handling to SDK constants (DocumentAnalysisFeature) for type safety
  • Added normalization for common feature input forms (e.g. FORMULAS, ocr_high_resolution, DocumentAnalysisFeature.STYLE_FONT)
  • Added validation for invalid feature names (raises ValueError)
  • Kept default behavior when features are not specified:
    • OCR-capable formats: FORMULAS, OCR_HIGH_RESOLUTION, STYLE_FONT
    • .docx, .pptx, .xlsx, .html: no analysis features
  • Updated README:
    • documented CLI feature option
    • documented default feature behavior when not specified
    • removed Python feature-customization sample section per latest doc direction
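
The normalization, validation, and default-selection behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `normalize_feature` and `resolve_features` are hypothetical helper names, the enum is a local stand-in for the SDK's `DocumentAnalysisFeature` so the snippet runs without Azure packages installed, and treating `None` as "use the defaults" is an assumption.

```python
import argparse
from enum import Enum


class DocumentAnalysisFeature(Enum):
    """Local stand-in for the SDK enum (assumption: only these three members)."""
    FORMULAS = "formulas"
    OCR_HIGH_RESOLUTION = "ocrHighResolution"
    STYLE_FONT = "styleFont"


# Defaults as described in the PR summary.
NO_FEATURE_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".html"}
DEFAULT_OCR_FEATURES = [
    DocumentAnalysisFeature.FORMULAS,
    DocumentAnalysisFeature.OCR_HIGH_RESOLUTION,
    DocumentAnalysisFeature.STYLE_FONT,
]


def normalize_feature(value):
    """Map one user-supplied value to an enum member, or raise ValueError."""
    if isinstance(value, DocumentAnalysisFeature):
        return value
    # Accept "FORMULAS", "ocr_high_resolution", "DocumentAnalysisFeature.STYLE_FONT".
    key = str(value).strip().rsplit(".", 1)[-1].upper()
    if key in DocumentAnalysisFeature.__members__:
        return DocumentAnalysisFeature[key]
    valid = ", ".join(DocumentAnalysisFeature.__members__)
    raise ValueError(f"Unknown feature {value!r}; expected one of: {valid}")


def resolve_features(requested, extension):
    """Return the features to send for a file with the given extension."""
    if requested is None:
        # Nothing specified: fall back to the per-format defaults.
        if extension.lower() in NO_FEATURE_EXTENSIONS:
            return []
        return list(DEFAULT_OCR_FEATURES)
    features = []
    for item in requested:
        # Each CLI occurrence may itself hold a comma-separated list.
        parts = [item] if isinstance(item, DocumentAnalysisFeature) else item.split(",")
        features.extend(normalize_feature(part) for part in parts)
    return features


# A repeatable flag: argparse collects every occurrence into a list.
parser = argparse.ArgumentParser()
parser.add_argument("--docintel-feature", action="append", default=None)
args = parser.parse_args([
    "--docintel-feature", "FORMULAS,ocr_high_resolution",
    "--docintel-feature", "DocumentAnalysisFeature.STYLE_FONT",
])
cli_features = resolve_features(args.docintel_feature, ".pdf")
```

Resolving everything to enum members in one place means the rest of the converter never handles raw strings, which is the type-safety benefit the PR describes. An unknown name fails early with a `ValueError` instead of reaching the service.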

Why

  • Users need to control DI analysis features depending on accuracy/cost/performance requirements.
  • The previous implementation had fixed features and limited flexibility.
  • Using SDK constants improves safety and reduces string-based errors.
