RAG chunking strategy for HL7/FHIR clinical docs — semantic vs fixed-size splitting? #200622

Arpitmishra1-1 · 2026-07-01T04:39:39Z


Discussion Type

Product Feedback

Discussion Content

We're building a RAG pipeline over clinical documentation (discharge summaries, HL7 v2 messages, FHIR resources) for a telemedicine retrieval feature, and I'm stuck on chunking strategy.
The problem:
Fixed-size chunking (e.g. 512 tokens with overlap) keeps splitting FHIR resources mid-context — a MedicationRequest gets separated from its linked Condition, or a discharge summary's assessment section gets cut off from the plan section right where the clinically relevant link is. Retrieval quality tanks on anything requiring cross-reference.
Semantic chunking (splitting on embedding similarity shifts) handles narrative text better but struggles with the structured/tabular parts of HL7 messages — segments like OBX (observation) or RXE (pharmacy) don't have the semantic "flow" that similarity-based splitters expect, so boundaries end up arbitrary anyway.

What I've tried:

Fixed-size (512 tokens, 50 overlap) — fast, predictable, but breaks clinical relationships
Semantic splitting via sentence-transformer similarity thresholds — better on free-text notes, inconsistent on structured segments
Hybrid: chunk by FHIR resource boundary first, then semantic-split within large resources — best results so far, but adds real complexity in the ingestion pipeline
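For reference, the fixed-size baseline from the first item can be sketched roughly as below. This is only an illustration: it assumes the text has already been tokenized into a list, and the function name `fixed_size_chunks` is made up for this example. It shows why the windows ignore resource and section boundaries: the cut points depend only on token counts.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once the current window already reaches the end of the input.
        if start + size >= len(tokens):
            break
    return chunks
```

Nothing in this loop knows where a MedicationRequest ends or where "Plan:" begins, which is the failure mode described above.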


roohan-514 · 2026-07-01T12:45:22Z


You're hitting the exact problem that makes clinical RAG harder than general-domain RAG — medical records have both narrative and structured data that need different handling.
Your hybrid approach (FHIR resource boundary first, then semantic-split within) is the right direction. A few refinements:

For FHIR resources: split at resource boundaries, never inside a resource

Each FHIR resource (MedicationRequest, Condition, Observation) should be a minimum atomic chunk
If a resource is short (<200 tokens), pad it with context from the parent encounter/bundle
If a resource is long (e.g., a narrative-heavy ClinicalImpression), semantic-split within it but keep the resource ID + patient ID as metadata on every sub-chunk
Key insight: store FHIR resource references (e.g. basedOn, and reasonReference in R4) as metadata on each chunk, so that when retrieval finds a MedicationRequest chunk you can follow the reference and also retrieve its linked Condition
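A minimal sketch of the resource-as-chunk idea, assuming the input is a FHIR Bundle already parsed into a Python dict. The helper names (`iter_references`, `chunk_bundle`) and the chunk layout (`text` plus `metadata`) are my own choices for illustration, not part of any FHIR library. Padding short resources and splitting long ones are left out.

```python
import json

def iter_references(node, path=""):
    """Yield (element_path, target) for every FHIR Reference inside a resource."""
    if isinstance(node, dict):
        ref = node.get("reference")
        if isinstance(ref, str):
            yield path, ref
        for key, value in node.items():
            yield from iter_references(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_references(item, path)

def chunk_bundle(bundle):
    """Emit one chunk per Bundle entry, with IDs and outgoing references as metadata."""
    chunks = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource") or {}
        if "resourceType" not in resource:
            continue
        # Most clinical resources point at the patient via `subject` or `patient`.
        subject = resource.get("subject") or resource.get("patient") or {}
        chunks.append({
            "text": json.dumps(resource, sort_keys=True),
            "metadata": {
                "resource_id": f'{resource["resourceType"]}/{resource.get("id", "")}',
                "patient": subject.get("reference"),
                "references": [
                    {"path": p, "target": t} for p, t in iter_references(resource)
                ],
            },
        })
    return chunks
```

Embedding the raw JSON is probably not ideal; in practice you may want to render each resource into a short text form before embedding and keep the JSON only as payload.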

For HL7 v2 messages: the segment group is the atom, not the message

HL7 messages are pipe-delimited with no natural "paragraph" boundaries, so semantic splitting fails
Split by segment groups: treat each segment group (OBR+OBX* for observations, RXE+RXC* for pharmacy) as one chunk
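Use header fields as metadata: the message type from MSH-9 and the patient identifier from PID-3 (the patient ID lives in the PID segment, not in MSH)
Prefix each segment-group chunk with the MSH and PID segments so it carries message and patient context and can be retrieved standalone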
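Roughly what the segment-group split could look like, as a sketch under some assumptions: the message uses the default `|` field separator, a new group starts at each OBR or RXE segment, and everything before the first group (MSH, PID, PV1, ...) is treated as shared context. The names `chunk_hl7`, `get_field`, and `GROUP_STARTERS` are hypothetical; a real pipeline would more likely use a proper HL7 parser.

```python
GROUP_STARTERS = {"OBR", "RXE"}  # assumption: segments that open a new group

def get_field(segments, seg_id, index):
    """Return field `index` of the first `seg_id` segment, or '' if absent."""
    for seg in segments:
        if seg.startswith(seg_id + "|"):
            fields = seg.split("|")
            return fields[index] if index < len(fields) else ""
    return ""

def chunk_hl7(message):
    """Split an HL7 v2 message into segment-group chunks prefixed with header context."""
    # HL7 v2 separates segments with carriage returns; tolerate newlines too.
    segments = [s for s in message.replace("\n", "\r").split("\r") if s.strip()]
    context, groups = [], []
    for seg in segments:
        if seg[:3] in GROUP_STARTERS:
            groups.append([seg])
        elif groups:
            groups[-1].append(seg)   # e.g. OBX after OBR, RXC after RXE
        else:
            context.append(seg)      # MSH, PID, PV1, ... before the first group
    metadata = {
        # MSH-1 is the separator itself, so MSH-9 sits at split index 8.
        "message_type": get_field(context, "MSH", 8),
        "patient_id": get_field(context, "PID", 3).split("^")[0],
    }
    return [
        {"text": "\n".join(context + group),
         "metadata": {**metadata, "group": group[0][:3]}}
        for group in groups
    ]
```

Batch files with several patients per message are not handled here; they would need the context reset at each new PID.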

For discharge summaries/narrative — section-aware splitting

Use regex/section markers (e.g., "Assessment:", "Plan:", "Discharge Medications:") as hard split points
Don't let sections merge — "Assessment" and "Plan" belong to different clinical reasoning steps
Within a long section, semantic-split with a cap of 512 tokens per sub-chunk, and keep the section header on every sub-chunk
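A small sketch of the section-aware split. The header list below only contains the three examples from this thread and would need to match your own note templates. The sub-splitting step uses a plain word-count window as a stand-in for a real semantic splitter and tokenizer, so treat `max_words` as a placeholder for the 512-token cap. Function names are illustrative.

```python
import re

SECTION_HEADERS = ["Assessment", "Plan", "Discharge Medications"]  # assumed list
HEADER_RE = re.compile(
    r"^(%s):[ \t]*" % "|".join(re.escape(h) for h in SECTION_HEADERS),
    re.MULTILINE | re.IGNORECASE,
)

def split_sections(text):
    """Return (header, body) pairs, using section headers as hard split points."""
    matches = list(HEADER_RE.finditer(text))
    if not matches:
        return [("Unlabeled", text.strip())] if text.strip() else []
    sections = []
    preamble = text[:matches[0].start()].strip()
    if preamble:
        sections.append(("Preamble", preamble))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1), text[m.end():end].strip()))
    return sections

def chunk_section(header, body, max_words=512):
    """Split one section into sub-chunks, repeating the header on each one."""
    words = body.split()
    return [
        f"{header}: " + " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Because each section is chunked on its own, "Assessment" and "Plan" text never end up in the same chunk, while the repeated header keeps every sub-chunk interpretable in isolation.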

Recommended pipeline architecture:

Document -> Detect type (FHIR JSON / HL7 pipe / Narrative text)
    -> FHIR: extract resources, make atomic chunks, inject linked refs as metadata
    -> HL7: split by segment groups, prefix with MSH and PID segments
    -> Narrative: section-aware split, then semantic within section
-> Embed + Store (with resource ID, patient ID, section, doc type as metadata filters)
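
The "Detect type" step can be a fairly crude heuristic. Here is one possible sketch; the function name and the returned labels (`"hl7v2"`, `"fhir"`, `"narrative"`) are assumptions of mine, and it only recognizes HL7 messages that use the default `|` field separator.

```python
import json

def detect_doc_type(raw):
    """Guess whether a document is HL7 v2, FHIR JSON, or narrative text."""
    text = raw.lstrip()
    # HL7 v2 messages start with the MSH segment.
    if text.startswith("MSH|"):
        return "hl7v2"
    # FHIR JSON resources are objects with a `resourceType` property.
    if text.startswith("{"):
        try:
            doc = json.loads(text)
        except json.JSONDecodeError:
            return "narrative"
        if isinstance(doc, dict) and "resourceType" in doc:
            return "fhir"
    # Everything else falls through to the narrative path.
    return "narrative"
```

The result can then key a dispatch table that routes each document to its own chunker. FHIR XML and HL7 batch files (starting with FHS/BHS) would need extra branches.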
Retrieval trick that helped me:

Always retrieve chunks from the same patient encounter together — filter by patient ID + encounter ID before similarity search
Use a two-stage retriever: first keyword search on medical codes (ICD-10, LOINC, RxNorm), then vector search on narrative
Cross-reference resolution: when a chunk mentions reasonReference: Condition/123, include that Condition chunk in the context window even if it's not in top-k
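
The pre-filter and the cross-reference step could look something like this. It is a sketch only: the chunk layout (`metadata` with `resource_id`, `patient`, `encounter`, and a `references` list of `{"path", "target"}` dicts) is an assumed schema, `index` is assumed to be a plain dict from resource ID to chunk, and the similarity search that produces `top_k` is out of scope here.

```python
def prefilter(chunks, patient_id, encounter_id):
    """Keep only chunks from one patient encounter before similarity search."""
    return [
        c for c in chunks
        if c["metadata"].get("patient") == patient_id
        and c["metadata"].get("encounter") == encounter_id
    ]

def expand_with_references(top_k, index, patient_id, max_extra=5):
    """Append chunks referenced by the top-k hits, even if they were not retrieved."""
    seen = {c["metadata"]["resource_id"] for c in top_k}
    context = list(top_k)
    extra = 0
    for chunk in top_k:
        for ref in chunk["metadata"].get("references", []):
            target = ref["target"]
            linked = index.get(target)
            if linked is None or target in seen:
                continue
            # Never pull in another patient's data through a stray reference.
            if linked["metadata"].get("patient") != patient_id:
                continue
            if extra >= max_extra:
                return context
            context.append(linked)
            seen.add(target)
            extra += 1
    return context
```

This follows references one hop only. Going deeper is possible but the context window fills up quickly, hence the `max_extra` cap.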

On your semantic vs fixed-size dilemma:

For structured data (FHIR, HL7): fixed-size or segment-based > semantic
For narrative: section-aware > semantic > fixed-size
Your hybrid approach is the one I'd recommend right now, and the complexity is worth it for clinical accuracy. A wrong chunk can lead to a wrong diagnosis suggestion, which is unacceptable.
