Add handling of empty strings #325

Merged
Ingvarstep merged 2 commits into urchade:main from Ingvarstep:fix/empty_object
Jan 27, 2026

@Ingvarstep

Collaborator

Fix: Handle Empty Strings in Inference Methods

Summary

This PR fixes issue #316. Empty and whitespace-only inputs are now filtered out before processing, and the results are mapped back to the original input indices.

Problem

Issue Description

When using GLiNER models (particularly knowledgator/gliner-x-base), inference fails with an IndexError when the input list contains empty strings or whitespace-only strings.

Reproduction

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-x-base")

# The presence of the empty string triggers the error
texts = ["Email CEO to approve budget", ""]
labels = ["person", "organization", "action"]

predictions = model.inference(texts, labels, batch_size=16)

Error Traceback

Traceback (most recent call last):
  File "issue_repro.py", line 10, in <module>
    predictions = model.inference(texts, labels, batch_size=16)
  File ".../gliner/model.py", line 1290, in inference
    start_text_idx = start_token_idx_to_text_idx[start_token_idx]
IndexError: list index out of range

Root Cause

Empty strings produce no tokens during preprocessing, resulting in empty token index mappings. When the decoder attempts to access these mappings, it raises an IndexError because the indices don't exist.
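
To make the failure mode concrete, here is a minimal sketch. It uses a hypothetical whitespace tokenizer (`build_token_maps` is illustrative and is not GLiNER's actual preprocessing) to show that an empty string yields empty token-to-character mappings, so any lookup into them raises an IndexError.

```python
import re
from typing import List, Tuple


def build_token_maps(text: str) -> Tuple[List[int], List[int]]:
    """Hypothetical stand-in for preprocessing: token index -> char offsets."""
    starts, ends = [], []
    for match in re.finditer(r"\S+", text):
        starts.append(match.start())
        ends.append(match.end())
    return starts, ends


starts, ends = build_token_maps("Email CEO")
print(starts, ends)  # [0, 6] [5, 9]

# An empty string has no tokens, so both mappings are empty lists.
empty_starts, _ = build_token_maps("")
try:
    empty_starts[0]  # any decoded token index is out of range here
except IndexError as exc:
    print(f"IndexError: {exc}")  # IndexError: list index out of range
```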

Solution

Implementation Approach

The fix implements a filter-process-remap strategy:

  1. Filter: Remove empty/whitespace-only strings before processing and maintain index mapping
  2. Process: Run inference only on valid (non-empty) texts
  3. Remap: Map predictions back to original input indices, inserting empty entity lists for filtered inputs
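
The three steps above can be sketched independently of GLiNER. In the example below, `predict_skipping_empty` and `fake_predict` are hypothetical names introduced for illustration; the real implementation lives in the helper methods described under Key Changes.

```python
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]


def predict_skipping_empty(
    texts: List[str],
    predict: Callable[[List[str]], List[List[Entity]]],
) -> List[List[Entity]]:
    """Filter-process-remap around an arbitrary batch predictor (illustrative)."""
    # 1. Filter: keep non-empty texts and remember where each one came from.
    kept = [(i, t) for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    if not kept:
        return [[] for _ in texts]

    # 2. Process: the predictor only ever sees valid texts.
    valid_results = predict([t for _, t in kept])

    # 3. Remap: start from empty lists, then fill in the original positions.
    results: List[List[Entity]] = [[] for _ in texts]
    for (orig_idx, _), entities in zip(kept, valid_results):
        results[orig_idx] = entities
    return results


def fake_predict(batch: List[str]) -> List[List[Entity]]:
    """Hypothetical predictor: tags the first word of each text."""
    return [[{"text": t.split()[0], "label": "first_word"}] for t in batch]


print(predict_skipping_empty(["Steve Jobs", "", "  ", "Apple Inc."], fake_predict))
```

Because the result list is pre-filled with empty lists, filtered inputs need no special casing during the remap step.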

Key Changes

1. New Helper Methods in BaseEncoderGLiNER

_filter_valid_texts() (lines 1185-1204)

def _filter_valid_texts(self, texts: List[str]) -> Tuple[List[str], List[int]]:
    """Filter out empty or whitespace-only strings from input texts.

    Args:
        texts: List of input texts.

    Returns:
        Tuple containing:
            - valid_texts: List of non-empty texts
            - valid_to_orig_idx: Mapping from valid text index to original text index
    """
    valid_texts = []
    valid_to_orig_idx = []

    for i, text in enumerate(texts):
        if isinstance(text, str) and text.strip():
            valid_texts.append(text)
            valid_to_orig_idx.append(i)

    return valid_texts, valid_to_orig_idx
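
As a quick illustration of the return value, the same logic can be exercised as a free function. The standalone `filter_valid_texts` below is a copy made for demonstration only. Note that the `isinstance` check also drops non-string items such as `None`.

```python
from typing import List, Tuple


def filter_valid_texts(texts: List[str]) -> Tuple[List[str], List[int]]:
    """Standalone copy of the helper's logic, for demonstration only."""
    valid_texts, valid_to_orig_idx = [], []
    for i, text in enumerate(texts):
        if isinstance(text, str) and text.strip():
            valid_texts.append(text)
            valid_to_orig_idx.append(i)
    return valid_texts, valid_to_orig_idx


valid, mapping = filter_valid_texts(["first", "", " \t\n", None, "last"])
print(valid)    # ['first', 'last']
print(mapping)  # [0, 4]
```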

_map_entities_to_original() (lines 1206-1250)

def _map_entities_to_original(
    self,
    outputs: List[List[Any]],
    valid_to_orig_idx: List[int],
    all_start_token_idx_to_text_idx: List[List[int]],
    all_end_token_idx_to_text_idx: List[List[int]],
    valid_texts: List[str],
    num_original_texts: int,
) -> List[List[Dict[str, Any]]]:
    """Map entity predictions back to original text indices.

    Returns:
        List of entity predictions aligned with original input.
    """
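
The method body is not reproduced in this description. As a rough sketch of what such a remapping could look like, the standalone function below assumes each decoded span is a `(start_token, end_token, label, score)` tuple with an inclusive end token. That layout and the output keys are assumptions made for illustration, not the library's confirmed format.

```python
from typing import Any, Dict, List, Tuple


def map_entities_to_original(
    outputs: List[List[Tuple[int, int, str, float]]],
    valid_to_orig_idx: List[int],
    all_start_token_idx_to_text_idx: List[List[int]],
    all_end_token_idx_to_text_idx: List[List[int]],
    valid_texts: List[str],
    num_original_texts: int,
) -> List[List[Dict[str, Any]]]:
    """Illustrative remap; the span tuple layout is an assumption."""
    # Every original position starts empty; filtered inputs keep that empty list.
    all_entities: List[List[Dict[str, Any]]] = [[] for _ in range(num_original_texts)]

    for valid_idx, spans in enumerate(outputs):
        starts = all_start_token_idx_to_text_idx[valid_idx]
        ends = all_end_token_idx_to_text_idx[valid_idx]
        text = valid_texts[valid_idx]
        entities = []
        for start_tok, end_tok, label, score in spans:
            # Convert token indices to character offsets in the source text.
            start_char, end_char = starts[start_tok], ends[end_tok]
            entities.append({
                "start": start_char,
                "end": end_char,
                "text": text[start_char:end_char],
                "label": label,
                "score": score,
            })
        all_entities[valid_to_orig_idx[valid_idx]] = entities

    return all_entities


# One valid text that originally sat at index 1 of a two-item input.
result = map_entities_to_original(
    outputs=[[(0, 1, "person", 0.9)]],
    valid_to_orig_idx=[1],
    all_start_token_idx_to_text_idx=[[0, 6, 11]],
    all_end_token_idx_to_text_idx=[[5, 10, 17]],
    valid_texts=["Steve Jobs spoke."],
    num_original_texts=2,
)
print(result)
```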

2. Updated inference() Method

The inference() method in BaseEncoderGLiNER (lines 1289-1381) now:

  1. Filters empty strings using _filter_valid_texts()
  2. Returns early if no valid texts remain
  3. Processes only valid texts
  4. Maps results back using _map_entities_to_original()

@torch.no_grad()
def inference(
    self,
    texts: Union[str, List[str]],
    labels: List[str],
    flat_ner: bool = True,
    threshold: float = 0.5,
    multi_label: bool = False,
    batch_size: int = 8,
    packing_config: Optional[InferencePackingConfig] = None,
    **external_inputs,
) -> List[List[Dict[str, Any]]]:
    # ...

    # Filter out empty/whitespace-only strings
    valid_texts, valid_to_orig_idx = self._filter_valid_texts(texts)

    # Early exit: nothing valid to process
    if not valid_texts:
        return [[] for _ in texts]

    # Process only valid texts
    # ...

    # Map results back to original indices
    all_entities = self._map_entities_to_original(
        outputs,
        valid_to_orig_idx,
        all_start_token_idx_to_text_idx,
        all_end_token_idx_to_text_idx,
        valid_texts,
        len(texts),
    )

    return all_entities

3. Model Types Updated

The fix has been applied to all model types that inherit from BaseEncoderGLiNER:

  • UniEncoderSpanGLiNER
  • UniEncoderTokenGLiNER
  • BiEncoderSpanGLiNER
  • BiEncoderTokenGLiNER
  • UniEncoderSpanDecoderGLiNER
  • UniEncoderSpanRelexGLiNER (relation extraction)

For relation extraction models, a similar _process_relations() method was added to handle relation mapping.

Expected Behavior

After this fix:

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-x-base")

texts = [
    "Barack Obama was born in Hawaii.",  # Valid text
    "",                                   # Empty string
    "   ",                                # Whitespace only
    "Steve Jobs founded Apple Inc.",     # Valid text
]
labels = ["person", "organization", "location"]

predictions = model.inference(texts, labels)

# Expected output structure:
# [
#     [{'start': 0, 'end': 12, 'text': 'Barack Obama', 'label': 'person', ...}, ...],  # Index 0
#     [],  # Index 1 - empty for empty string
#     [],  # Index 2 - empty for whitespace
#     [{'start': 0, 'end': 10, 'text': 'Steve Jobs', 'label': 'person', ...}, ...],  # Index 3
# ]

assert len(predictions) == len(texts)  # ✅ Length matches input
assert predictions[1] == []             # ✅ Empty string gets empty list
assert predictions[2] == []             # ✅ Whitespace gets empty list

Ingvarstep merged commit c152cfc into urchade:main on Jan 27, 2026