Add handling of empty strings by Ingvarstep · Pull Request #325 · urchade/GLiNER

Ingvarstep · 2026-01-26T18:36:53Z

Fix: Handle Empty Strings in Inference Methods

Summary

This PR fixes issue #316. Empty and whitespace-only inputs are now filtered out before processing, and the results are mapped back to the original indices so the output stays aligned with the input list.

Problem

Issue Description

When using GLiNER models (particularly knowledgator/gliner-x-base), inference fails with an IndexError when the input list contains empty strings or whitespace-only strings.

Reproduction

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-x-base")

# The presence of the empty string triggers the error
texts = ["Email CEO to approve budget", ""]
labels = ["person", "organization", "action"]

predictions = model.inference(texts, labels, batch_size=16)

Error Traceback

Traceback (most recent call last):
  File "issue_repro.py", line 10, in <module>
    predictions = model.inference(texts, labels, batch_size=16)
  File ".../gliner/model.py", line 1290, in inference
    start_text_idx = start_token_idx_to_text_idx[start_token_idx]
IndexError: list index out of range

Root Cause

Empty strings produce no tokens during preprocessing, resulting in empty token index mappings. When the decoder attempts to access these mappings, it raises an IndexError because the indices don't exist.
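
The failure mode can be reproduced without loading a model. The sketch below is a toy illustration, not GLiNER's actual tokenizer: `token_offsets` and `lookup_start` are hypothetical helpers that split on whitespace and record character offsets. For an empty or whitespace-only string the offset lists come back empty, so any lookup by token index fails the same way the traceback above does.

```python
import re
from typing import List, Tuple


def token_offsets(text: str) -> Tuple[List[int], List[int]]:
    """Hypothetical stand-in for a tokenizer: whitespace split with char offsets."""
    starts, ends = [], []
    for match in re.finditer(r"\S+", text):
        starts.append(match.start())
        ends.append(match.end())
    return starts, ends


def lookup_start(text: str, token_idx: int) -> int:
    """Translate a token index into a character offset, as a decoder might."""
    starts, _ = token_offsets(text)
    return starts[token_idx]  # raises IndexError when `starts` is empty


# The token "CEO" starts at character 6.
print(lookup_start("Email CEO to approve budget", 1))

try:
    lookup_start("", 0)  # no tokens, so there is nothing to index into
except IndexError as exc:
    print(f"IndexError: {exc}")
```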

Solution

Implementation Approach

The fix implements a filter-process-remap strategy:

Filter: Remove empty/whitespace-only strings before processing and maintain index mapping
Process: Run inference only on valid (non-empty) texts
Remap: Map predictions back to original input indices, inserting empty entity lists for filtered inputs
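
The three steps above can be sketched independently of GLiNER. This is a minimal illustration under stated assumptions: `run_skipping_blanks` and `fake_process` are hypothetical names, and `fake_process` merely stands in for model inference by returning capitalised words.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def run_skipping_blanks(
    texts: Sequence[str],
    process: Callable[[List[str]], List[List[T]]],
) -> List[List[T]]:
    """Filter blank inputs, process the rest, and realign results to `texts`."""
    # Filter: keep non-blank strings and remember their original positions.
    kept = [(i, t) for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    results: List[List[T]] = [[] for _ in texts]
    if not kept:
        return results  # nothing valid to process

    # Process: `process` only ever sees valid texts.
    processed = process([t for _, t in kept])

    # Remap: slot each result back at the index its text came from.
    for (orig_idx, _), result in zip(kept, processed):
        results[orig_idx] = result
    return results


def fake_process(batch: List[str]) -> List[List[str]]:
    """Hypothetical stand-in for model inference: returns capitalised words."""
    return [[w for w in text.split() if w[0].isupper()] for text in batch]


texts = ["Steve Jobs founded Apple", "", "  ", "no names here"]
print(run_skipping_blanks(texts, fake_process))
# [['Steve', 'Jobs', 'Apple'], [], [], []]
```

Pre-filling `results` with empty lists means filtered inputs need no special handling during the remap step: they simply keep their default.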

Key Changes

1. New Helper Methods in `BaseEncoderGLiNER`

_filter_valid_texts() (lines 1185-1204)

def _filter_valid_texts(self, texts: List[str]) -> Tuple[List[str], List[int]]:
    """Filter out empty or whitespace-only strings from input texts.

    Args:
        texts: List of input texts.

    Returns:
        Tuple containing:
            - valid_texts: List of non-empty texts
            - valid_to_orig_idx: Mapping from valid text index to original text index
    """
    valid_texts = []
    valid_to_orig_idx = []

    for i, text in enumerate(texts):
        if isinstance(text, str) and text.strip():
            valid_texts.append(text)
            valid_to_orig_idx.append(i)

    return valid_texts, valid_to_orig_idx
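
A quick way to sanity-check this helper is to copy its logic into a free function and run it on the inputs used elsewhere in this PR. `filter_valid_texts` below is a standalone copy for illustration only, not something imported from gliner. Note that the `isinstance` check also drops non-string entries such as `None`.

```python
from typing import List, Tuple


def filter_valid_texts(texts: List[str]) -> Tuple[List[str], List[int]]:
    """Standalone copy of the helper's logic, for experimentation."""
    valid_texts, valid_to_orig_idx = [], []
    for i, text in enumerate(texts):
        if isinstance(text, str) and text.strip():
            valid_texts.append(text)
            valid_to_orig_idx.append(i)
    return valid_texts, valid_to_orig_idx


texts = ["Barack Obama was born in Hawaii.", "", "   ", "Steve Jobs founded Apple Inc."]
valid_texts, valid_to_orig_idx = filter_valid_texts(texts)
print(valid_to_orig_idx)  # [0, 3]

# Tabs/newlines count as whitespace, and non-strings are skipped.
print(filter_valid_texts(["\t\n", None, "ok"]))  # (['ok'], [2])
```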

_map_entities_to_original() (lines 1206-1250)

def _map_entities_to_original(
    self,
    outputs: List[List[Any]],
    valid_to_orig_idx: List[int],
    all_start_token_idx_to_text_idx: List[List[int]],
    all_end_token_idx_to_text_idx: List[List[int]],
    valid_texts: List[str],
    num_original_texts: int,
) -> List[List[Dict[str, Any]]]:
    """Map entity predictions back to original text indices.

    Returns:
        List of entity predictions aligned with original input.
    """

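The method body is not shown in this description. As a rough sketch of what such a remapping step could look like, the hypothetical `remap_entities` function below takes the same arguments. It assumes each decoded span is a `(start_token_idx, end_token_idx, label, score)` tuple with an inclusive end token; that layout is an assumption, not something confirmed by this PR.

```python
from typing import Any, Dict, List, Tuple

# Assumed span layout: (start_token_idx, end_token_idx, label, score).
Span = Tuple[int, int, str, float]


def remap_entities(
    outputs: List[List[Span]],
    valid_to_orig_idx: List[int],
    all_start_token_idx_to_text_idx: List[List[int]],
    all_end_token_idx_to_text_idx: List[List[int]],
    valid_texts: List[str],
    num_original_texts: int,
) -> List[List[Dict[str, Any]]]:
    """Hypothetical sketch: convert token spans to char spans at original indices."""
    # Every original input starts with an empty entity list.
    all_entities: List[List[Dict[str, Any]]] = [[] for _ in range(num_original_texts)]
    for valid_idx, spans in enumerate(outputs):
        starts = all_start_token_idx_to_text_idx[valid_idx]
        ends = all_end_token_idx_to_text_idx[valid_idx]
        text = valid_texts[valid_idx]
        entities = []
        for start_tok, end_tok, label, score in spans:
            start_char, end_char = starts[start_tok], ends[end_tok]
            entities.append({
                "start": start_char,
                "end": end_char,
                "text": text[start_char:end_char],
                "label": label,
                "score": score,
            })
        # Place results at the position the text had in the caller's list.
        all_entities[valid_to_orig_idx[valid_idx]] = entities
    return all_entities


result = remap_entities(
    outputs=[[(0, 1, "person", 0.9)]],
    valid_to_orig_idx=[1],  # the only valid text was at original index 1
    all_start_token_idx_to_text_idx=[[0, 6, 11]],
    all_end_token_idx_to_text_idx=[[5, 10, 18]],
    valid_texts=["Steve Jobs founded"],
    num_original_texts=2,
)
print(result)
```

Here the filtered input at index 0 keeps its empty list, while the span covering tokens 0 and 1 resolves to characters 0 through 10 of the valid text.
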
2. Updated `inference()` Method

The inference() method in BaseEncoderGLiNER (lines 1289-1381) now:

Filters empty strings using _filter_valid_texts()
Returns early if no valid texts remain
Processes only valid texts
Maps results back using _map_entities_to_original()

@torch.no_grad()
def inference(
    self,
    texts: Union[str, List[str]],
    labels: List[str],
    flat_ner: bool = True,
    threshold: float = 0.5,
    multi_label: bool = False,
    batch_size: int = 8,
    packing_config: Optional[InferencePackingConfig] = None,
    **external_inputs,
) -> List[List[Dict[str, Any]]]:
    # ...

    # Filter out empty/whitespace-only strings
    valid_texts, valid_to_orig_idx = self._filter_valid_texts(texts)

    # Early exit: nothing valid to process
    if not valid_texts:
        return [[] for _ in texts]

    # Process only valid texts
    # ...

    # Map results back to original indices
    all_entities = self._map_entities_to_original(
        outputs,
        valid_to_orig_idx,
        all_start_token_idx_to_text_idx,
        all_end_token_idx_to_text_idx,
        valid_texts,
        len(texts),
    )

    return all_entities

3. Model Types Updated

The fix has been applied to all model types that inherit from BaseEncoderGLiNER:

✅ UniEncoderSpanGLiNER
✅ UniEncoderTokenGLiNER
✅ BiEncoderSpanGLiNER
✅ BiEncoderTokenGLiNER
✅ UniEncoderSpanDecoderGLiNER
✅ UniEncoderSpanRelexGLiNER (relation extraction)

For relation extraction models, a similar _process_relations() method was added to handle relation mapping.

Expected Behavior

After this fix:

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-x-base")

texts = [
    "Barack Obama was born in Hawaii.",  # Valid text
    "",                                   # Empty string
    "   ",                                # Whitespace only
    "Steve Jobs founded Apple Inc.",     # Valid text
]
labels = ["person", "organization", "location"]

predictions = model.inference(texts, labels)

# Expected output structure:
# [
#     [{'start': 0, 'end': 12, 'text': 'Barack Obama', 'label': 'person', ...}, ...],  # Index 0
#     [],  # Index 1 - empty for empty string
#     [],  # Index 2 - empty for whitespace
#     [{'start': 0, 'end': 10, 'text': 'Steve Jobs', 'label': 'person', ...}, ...],  # Index 3
# ]

assert len(predictions) == len(texts)  # ✅ Length matches input
assert predictions[1] == []             # ✅ Empty string gets empty list
assert predictions[2] == []             # ✅ Whitespace gets empty list

Ingvarstep added 2 commits January 26, 2026 20:22

Add handling of empty strings

c724b37

update loading of flash models

5da4835

Ingvarstep merged commit c152cfc into urchade:main Jan 27, 2026

Ingvarstep merged 2 commits into urchade:main from Ingvarstep:fix/empty_object
