RAG chunking strategy for HL7/FHIR clinical docs — semantic vs fixed-size splitting? #200622
Replies: 2 comments
-
|
💬 Your Product Feedback Has Been Submitted 🎉 Thank you for taking the time to share your insights with us! Your feedback is invaluable as we build a better GitHub experience for all our users. Here's what you can expect moving forward ⏩
Where to look to see what's shipping 👀
What you can do in the meantime 💻
As a member of the GitHub community, your participation is essential. While we can't promise that every suggestion will be implemented, we want to emphasize that your feedback is instrumental in guiding our decisions and priorities. Thank you once again for your contribution to making GitHub even better! We're grateful for your ongoing support and collaboration in shaping the future of our platform. ⭐ |
Beta Was this translation helpful? Give feedback.
-
|
You're hitting the exact problem that makes clinical RAG harder than general-domain RAG — medical records have both narrative and structured data that need different handling.
|
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Discussion Type
Product Feedback
Discussion Content
We're building a RAG pipeline over clinical documentation (discharge summaries, HL7 v2 messages, FHIR resources) for a telemedicine retrieval feature, and I'm stuck on chunking strategy.
The problem:
Fixed-size chunking (e.g. 512 tokens with overlap) keeps splitting FHIR resources mid-context — a MedicationRequest gets separated from its linked Condition, or a discharge summary's assessment section gets cut off from the plan section right where the clinically relevant link is. Retrieval quality tanks on anything requiring cross-reference.
Semantic chunking (splitting on embedding similarity shifts) handles narrative text better but struggles with the structured/tabular parts of HL7 messages — segments like OBX (observation) or RXE (pharmacy) don't have the semantic "flow" that similarity-based splitters expect, so boundaries end up arbitrary anyway.
What I've tried:
Fixed-size (512 tokens, 50 overlap) — fast, predictable, but breaks clinical relationships
Semantic splitting via sentence-transformer similarity thresholds — better on free-text notes, inconsistent on structured segments
Hybrid: chunk by FHIR resource boundary first, then semantic-split within large resources — best results so far, but adds real complexity in the ingestion pipeline
Beta Was this translation helpful? Give feedback.
All reactions