
Commit e3da119

gliner: add experiment script

- add reproducible gliner2 evaluation script with as-of support
- expand README with metrics tables and reproducibility notes
- scope pytorch cpu index to linux and refresh lockfile

1 parent aa16b08 commit e3da119

4 files changed

Lines changed: 350 additions & 37 deletions

README.md

Lines changed: 46 additions & 16 deletions
````diff
@@ -4,7 +4,7 @@ A Flat Data attempt at historically documenting GitHub statuses.
 
 ## About
 
-This project builds the **missing GitHub status page**: a historical mirror that shows actual uptime
+This project builds the **"missing GitHub status page"**: a historical mirror that shows actual uptime
 percentages and incidents across the entire platform, plus per-service uptime based on the incident data.
 It reconstructs timelines from the Atom feed history and turns them into structured outputs and a static site.
 
@@ -39,7 +39,7 @@ Enrich incidents with impact level by scraping the incident pages (cached):
 uv run python scripts/extract_incidents.py --out out --enrich-impact
 ```
 
-Infer missing components with GLiNER2 (used only when the incident page lacks affected components):
+Infer missing components with GLiNER2 (used only when the incident page lacks "affected components"):
 
 ```
 uv run python scripts/extract_incidents.py --out out --infer-components gliner2
@@ -62,9 +62,9 @@ Serve the repo root with any static server to view it locally.
 
 ## GLiNER2 component inference
 
-Some incident pages do not list affected components. In those cases we use GLiNER2 as a fallback:
+Some incident pages do not list "affected components". In those cases we use GLiNER2 as a fallback:
 
-- Input text: incident title + non‑Resolved updates.
+- Input text: incident title + non-Resolved updates.
 - Labels: the 10 GitHub services with short descriptions.
 - Thresholded inference (default: 0.75 confidence).
 - Final filter: the label must also appear via explicit service aliases in the text.
@@ -73,21 +73,51 @@ This keeps HTML tags as the source of truth and uses ML only to fill gaps.
 
 ### GLiNER2 experiment (evaluation + audit)
 
-To validate the fallback approach, an experiment was run:
+To validate the fallback approach, an experiment is run that produces:
 
-- **Audit**: every GLiNER2‑tagged incident is written with text evidence snippets.
-- **Evaluation**: GLiNER2 predictions are compared against incidents that *do* have HTML affected components.
+- **Audit**: every GLiNER2-tagged incident with text evidence snippets.
+- **Evaluation**: GLiNER2 predictions compared against incidents that *do* have HTML "affected components".
 
-Latest results (threshold 0.75, alias filter on, non‑Resolved text only):
+Reproduce the experiment at a fixed time point (numbers will change as new data arrives):
 
-- Precision: **0.95**
-- Recall: **0.884**
-- Exact match rate: **0.786**
-- Evaluated incidents: **444**
+```
+uv run python scripts/run_gliner_experiment.py --as-of 2026-01-08 --output-dir out
+```
 
-Summary: the fallback is high‑precision and mostly conservative. Most errors are **false negatives**
-(missing a true component), while false positives are typically “extra” components inferred from
-multi‑service incident titles.
+Outputs are written to:
+
+- `out/gliner2_audit.jsonl` (tagged incidents + evidence snippets)
+- `out/gliner2_eval.json` (metrics, per-label breakdown, sample mismatches)
+
+Latest results (as-of 2026-01-08, threshold 0.75, alias filter on, non-Resolved text only):
+
+| Metric | Value |
+|---|---:|
+| Evaluated incidents | 446 |
+| Predicted non-empty | 418 |
+| Precision | 0.950 |
+| Recall | 0.883 |
+| Exact match rate | 0.785 |
+| Audit count (missing-tag incidents) | 51 |
+
+Per-label precision/recall (top-level service components):
+
+| Label | Precision | Recall | TP | FP | FN |
+|---|---:|---:|---:|---:|---:|
+| Git Operations | 0.968 | 0.909 | 60 | 2 | 6 |
+| Webhooks | 0.938 | 0.918 | 45 | 3 | 4 |
+| API Requests | 0.915 | 0.915 | 54 | 5 | 5 |
+| Issues | 1.000 | 0.286 | 22 | 0 | 55 |
+| Pull Requests | 0.948 | 0.979 | 92 | 5 | 2 |
+| Actions | 0.958 | 0.947 | 161 | 7 | 9 |
+| Packages | 0.917 | 0.971 | 33 | 3 | 1 |
+| Pages | 0.855 | 0.964 | 53 | 9 | 2 |
+| Codespaces | 0.982 | 0.982 | 110 | 2 | 2 |
+| Copilot | 1.000 | 0.966 | 57 | 0 | 2 |
+
+Summary: the fallback is high-precision and mostly conservative. Most errors are **false negatives**
+(missing a true component), while false positives are typically "extra" components inferred from
+multi-service incident titles.
 
 ## Outputs
 
@@ -103,6 +133,6 @@ The automation workflow writes to `parsed/`.
 Incident records include optional `impact` and `components` fields when enrichment is enabled.
 Service components are sourced as follows:
 
-- **Primary**: the incident page affected components section (if present).
+- **Primary**: the incident page "affected components" section (if present).
 - **Fallback**: GLiNER2 schema-driven extraction from the incident title + non-resolved updates, filtered
 by explicit service aliases to avoid generic matches.
````
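For reference, the table columns follow the usual definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The Git Operations row, for example, works out to 60 / (60 + 2) ≈ 0.968 precision and 60 / (60 + 6) ≈ 0.909 recall.

The final alias filter described in the README lives in `scripts/extract_incidents.py`, which this commit does not modify. Below is a minimal sketch of what that gate does, assuming a hypothetical two-entry alias table in place of the real `COMPONENT_ALIASES` mapping (not shown in this diff):

```python
import re

# Hypothetical two-entry alias table for illustration only; the real
# COMPONENT_ALIASES in extract_incidents.py covers all 10 services.
COMPONENT_ALIASES = {
    "Actions": [r"\bactions\b", r"\bworkflow runs?\b"],
    "Pull Requests": [r"\bpull requests?\b"],
}


def filter_components_by_alias(components, text):
    """Keep a predicted label only if one of its aliases occurs literally in the text."""
    kept = []
    for label in components:
        patterns = COMPONENT_ALIASES.get(label, [])
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            kept.append(label)
    return kept


# A spurious "Actions" prediction is dropped; "Pull Requests" survives.
print(filter_components_by_alias(
    ["Actions", "Pull Requests"],
    "Incident: degraded performance for pull requests",
))  # -> ['Pull Requests']
```

The gate trades recall for precision: a correct label with no literal alias match in the text is dropped, which is consistent with the false-negative-heavy error profile in the summary above.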

pyproject.toml

Lines changed: 4 additions & 1 deletion
```diff
@@ -7,6 +7,7 @@ requires-python = ">=3.11"
 dependencies = [
     "gliner2>=1.2.3",
     "torch==2.9.1+cpu ; sys_platform == 'linux'",
+    "torch==2.9.1 ; sys_platform != 'linux'",
 ]
 
 [tool.uv]
@@ -21,4 +22,6 @@ name = "pytorch-cpu"
 url = "https://download.pytorch.org/whl/cpu"
 
 [tool.uv.sources]
-torch = { index = "pytorch-cpu" }
+torch = [
+    { index = "pytorch-cpu", marker = "sys_platform == 'linux'" },
+]
```
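The commit message also mentions a refreshed lockfile; the fourth changed file (not shown on this page) is presumably `uv.lock`. With the platform markers above in place, re-resolving is one step:

```
uv lock
```

uv then pulls `torch` from the pytorch-cpu index only on Linux, while other platforms resolve `torch==2.9.1` from the default index.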

scripts/run_gliner_experiment.py

Lines changed: 248 additions & 0 deletions
New file:

```python
#!/usr/bin/env python3
"""Evaluate GLiNER2 component inference against incidents with HTML-tagged components."""
import argparse
import json
import os
import re
from collections import defaultdict
from datetime import datetime

import extract_incidents as ei


def parse_iso(value):
    """Parse an ISO timestamp; bare dates (YYYY-MM-DD) are treated as midnight UTC."""
    if not value:
        return None
    value = value.strip()
    if not value:
        return None
    if len(value) == 10 and value[4] == "-" and value[7] == "-":
        value = f"{value}T00:00:00Z"
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def incident_text(incident):
    """Build the inference input: title plus every non-Resolved update message."""
    parts = [incident.get("title") or ""]
    for update in incident.get("updates") or []:
        if update.get("status") == "Resolved":
            continue
        message = update.get("message")
        if message:
            parts.append(message)
    return " ".join(part.strip() for part in parts if part.strip())


def build_alias_patterns():
    return {
        label: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
        for label, patterns in ei.COMPONENT_ALIASES.items()
    }


def find_evidence(text, label, alias_patterns):
    """Return a short snippet around the first alias match, for the audit file."""
    for pattern in alias_patterns.get(label, []):
        match = pattern.search(text)
        if match:
            start = max(match.start() - 40, 0)
            end = min(match.end() + 40, len(text))
            return text[start:end]
    return None


def infer_components(model, text, threshold):
    """Thresholded GLiNER2 extraction followed by the alias gate."""
    if not text:
        return [], {}
    result = model.extract_entities(text, ei.COMPONENT_SCHEMA, include_confidence=True)
    entities = result.get("entities", {}) if isinstance(result, dict) else {}
    components, confidences = ei.select_components_from_entities(entities, threshold)
    components = ei.filter_components_by_alias(components, text)
    if not components:
        return [], {}
    return components, {label: confidences.get(label) for label in components}


def main():
    parser = argparse.ArgumentParser(
        description="Evaluate GLiNER2 component inference against HTML-tagged incidents."
    )
    parser.add_argument(
        "--incidents",
        default="parsed/incidents.jsonl",
        help="Path to incidents JSONL (default: parsed/incidents.jsonl)",
    )
    parser.add_argument(
        "--output-dir",
        default="out",
        help="Directory to write audit/eval files (default: out)",
    )
    parser.add_argument(
        "--as-of",
        help="Only include incidents published on or before this ISO date/time (UTC).",
    )
    parser.add_argument(
        "--model",
        default="fastino/gliner2-base-v1",
        help="GLiNER2 model name (default: fastino/gliner2-base-v1)",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.75,
        help="Minimum confidence threshold for components (default: 0.75)",
    )
    args = parser.parse_args()

    if ei.GLiNER2 is None:
        raise SystemExit("GLiNER2 is not installed. Run `uv add gliner2` first.")

    model = ei.get_gliner_model(args.model)
    alias_patterns = build_alias_patterns()
    cutoff = parse_iso(args.as_of)

    # Load incidents, applying the --as-of cutoff for reproducibility.
    incidents = []
    with open(args.incidents, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            incident = json.loads(line)
            if cutoff:
                published = incident.get("published_at")
                if published and parse_iso(published) > cutoff:
                    continue
            incidents.append(incident)

    os.makedirs(args.output_dir, exist_ok=True)

    audit_path = os.path.join(args.output_dir, "gliner2_audit.jsonl")
    eval_path = os.path.join(args.output_dir, "gliner2_eval.json")

    # Audit pass: incidents without HTML-tagged components get GLiNER2
    # predictions written out with evidence snippets.
    audit_count = 0
    with open(audit_path, "w", encoding="utf-8") as handle:
        for inc in incidents:
            if inc.get("components") and inc.get("components_source") != "gliner2":
                continue
            text = incident_text(inc)
            components, confidences = infer_components(model, text, args.threshold)
            if not components:
                continue
            evidence = {
                label: find_evidence(text, label, alias_patterns) for label in components
            }
            handle.write(
                json.dumps(
                    {
                        "id": inc.get("id"),
                        "url": inc.get("url"),
                        "title": inc.get("title"),
                        "components": components,
                        "components_confidence": confidences,
                        "evidence": evidence,
                    },
                    ensure_ascii=True,
                )
            )
            handle.write("\n")
            audit_count += 1

    # Evaluation pass: incidents whose components came from the incident page
    # HTML serve as ground truth.
    truth_pool = [
        inc
        for inc in incidents
        if inc.get("components") and inc.get("components_source") != "gliner2"
    ]

    metrics = {
        "total": len(truth_pool),
        "predicted_non_empty": 0,
        "exact_match": 0,
        "tp": 0,
        "fp": 0,
        "fn": 0,
    }
    per_label = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    examples_fp = []
    examples_fn = []

    for inc in truth_pool:
        text = incident_text(inc)
        predicted, _ = infer_components(model, text, args.threshold)
        truth = inc.get("components") or []

        pred_set = set(predicted)
        truth_set = set(truth)

        if pred_set:
            metrics["predicted_non_empty"] += 1

        if pred_set == truth_set:
            metrics["exact_match"] += 1

        tp = pred_set & truth_set
        fp = pred_set - truth_set
        fn = truth_set - pred_set

        metrics["tp"] += len(tp)
        metrics["fp"] += len(fp)
        metrics["fn"] += len(fn)

        for label in tp:
            per_label[label]["tp"] += 1
        for label in fp:
            per_label[label]["fp"] += 1
        for label in fn:
            per_label[label]["fn"] += 1

        # Keep up to five sample mismatches of each kind for the report.
        if fp and len(examples_fp) < 5:
            examples_fp.append(
                {
                    "title": inc.get("title"),
                    "url": inc.get("url"),
                    "predicted": sorted(pred_set),
                    "truth": sorted(truth_set),
                }
            )
        if fn and len(examples_fn) < 5:
            examples_fn.append(
                {
                    "title": inc.get("title"),
                    "url": inc.get("url"),
                    "predicted": sorted(pred_set),
                    "truth": sorted(truth_set),
                }
            )

    # Micro-averaged over all (incident, label) decisions.
    precision = (
        metrics["tp"] / (metrics["tp"] + metrics["fp"]) if (metrics["tp"] + metrics["fp"]) else 0
    )
    recall = (
        metrics["tp"] / (metrics["tp"] + metrics["fn"]) if (metrics["tp"] + metrics["fn"]) else 0
    )
    exact_match_rate = metrics["exact_match"] / metrics["total"] if metrics["total"] else 0

    report = {
        "model": args.model,
        "threshold": args.threshold,
        "as_of": args.as_of,
        "metrics": {
            **metrics,
            "precision": precision,
            "recall": recall,
            "exact_match_rate": exact_match_rate,
        },
        "per_label": dict(per_label),
        "examples": {"false_positive": examples_fp, "false_negative": examples_fn},
        "audit_count": audit_count,
    }

    with open(eval_path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2, ensure_ascii=True)

    print(f"Wrote audit: {audit_path}")
    print(f"Wrote eval: {eval_path}")
    print(
        f"Precision {precision:.3f} | Recall {recall:.3f} | Exact match {exact_match_rate:.3f} "
        f"| Audit {audit_count} | Evaluated {metrics['total']}"
    )


if __name__ == "__main__":
    main()
```
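A quick usage sketch: the command mirrors the README's reproduce step, and the output lines are what the script's final `print` calls emit, shown here with the as-of 2026-01-08 numbers reported in the README above:

```
$ uv run python scripts/run_gliner_experiment.py --as-of 2026-01-08 --output-dir out
Wrote audit: out/gliner2_audit.jsonl
Wrote eval: out/gliner2_eval.json
Precision 0.950 | Recall 0.883 | Exact match 0.785 | Audit 51 | Evaluated 446
```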
