Optimizing the Knowledge Graph → Prediction Interface
Applying academic hyperparameter optimization to competitive intelligence prediction in the micro-drama vertical. Two experiments, 17 companies, 30 optimization trials.
The Problem
The SBPI (Structural Brand Power Index) semantic layer tracks 17 micro-drama companies across 5 scoring dimensions, producing weekly composite scores stored as RDF triples in an Oxigraph knowledge graph. The prediction engine tries to forecast which companies will move up, down, or stay stable next week.
Experiment 1 revealed a critical failure: the KG-augmented prediction method performed identically to the naive persistence baseline (23.5% directional accuracy). Every company was predicted as "stable" with 0.5 confidence. The knowledge graph existed, but the interface between the graph and the prediction logic was misconfigured.
The problem was not the knowledge graph's content or the prediction algorithm's logic. It was the interface between them — the 12 hardcoded parameters that translate graph signals into predictions. These parameters were set by intuition, never tuned against actual outcomes.
The Solution
We adapted the methodology from Markovick et al. (2025), "Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning", which demonstrated that systematic hyperparameter optimization of the KG→reasoning interface yields significant accuracy gains across multiple benchmarks.
Their paper optimized 6 parameters (chunk size, search type, top-k, prompt templates) using the Tree-structured Parzen Estimator (TPE) across 50 trials. We mapped this approach to our domain-specific context: 12 parameters controlling how SPARQL-queried graph data becomes directional predictions.
Experiment Pipeline
Experiment Timeline
| Week | Event | Data Points |
|---|---|---|
| W10-2026 | First week of SBPI data loaded into Oxigraph | 17 companies, 5 dimensions each |
| W11-2026 | Second week loaded (first transition pair available) | Training pair 1: W10→W11 |
| W12-2026 | Third week loaded (Experiment 1 evaluated) | Training pair 2: W11→W12 |
| 2026-03-24 | Experiment 1 results published (evaluation-log.json) | 4 methods evaluated |
| 2026-03-25 | Experiment 2 optimization run (30 TPE trials) | Best config: 69.9% accuracy |
| Ongoing | Nightly auto-optimization (re-runs when new week data arrives) | Compounding improvement |
What Happens Next
The optimizer runs nightly at 6:13 AM as part of the SBPI pipeline. When new week data is loaded (W13, W14, etc.), it automatically re-optimizes with a larger training set. With only 2 transition pairs today, the training data is thin. As weeks accumulate, the optimizer gains more signal, confidence intervals tighten, and the system self-improves.
Each new week of data makes the optimizer better at predicting the next week. The configuration that works for a 3-week window may differ from the configuration that works for a 12-week window. Continuous re-optimization captures this drift automatically.
KG-LLM Interface Optimization
Adapted from Markovick et al. (2025). 12-parameter search space, TPE optimization via Optuna, multi-signal voting system with bootstrap confidence intervals.
Research Foundation
The Paper
"Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning"
Markovick, Obradović, Hajdu, Pavlović (2025)
arXiv:2505.24478v1
The authors used the Cognee framework to build KG-augmented QA systems, then systematically optimized 6 interface parameters using TPE across 50 trials. Key finding: default configurations leave 10–30% accuracy on the table.
Their 6 Parameters
| Parameter | Controls |
|---|---|
chunk_size | How text is segmented for graph construction |
search_type | Text search vs. graph traversal vs. hybrid |
top_k | Number of retrieved context chunks |
qa_system_prompt | How the LLM reasons over retrieved context |
graph_construction_prompt | How entities/relations are extracted |
task_getter_type | Whether summaries are included with chunks |
Parameter Mapping: Paper → SBPI
The paper's KG system uses LLM-based QA over unstructured text. Our system uses SPARQL queries over structured RDF data. The abstraction level is different, but the principle is identical: the interface between the knowledge representation and the reasoning logic has tunable parameters that dramatically affect output quality.
| Paper Parameter | → | SBPI Parameter | Why This Mapping |
|---|---|---|---|
| chunk_size | → | direction_threshold | Both control granularity: how much signal constitutes a meaningful unit |
| search_type | → | anomaly_contributes | Both toggle between retrieval strategies (text vs. graph; momentum-only vs. multi-signal) |
| top_k | → | divergence_weight, tier_proximity_weight | Both control how many signals participate in the final answer |
| qa_system_prompt | → | confidence_base, magnitude_bonus_*, consistency_bonus | Both define the reasoning formula that produces a confidence score |
| graph_construction_prompt | → | mean_reversion_rate | Both are structural parameters about how the graph's topology informs predictions |
| task_getter_type | → | anomaly_contributes | Both toggle extra context availability (summaries; dimension anomaly signals) |
The 12-Parameter Search Space
| # | Parameter | Range | Default | Optimized | Change |
|---|---|---|---|---|---|
| 1 | direction_threshold | 0.1 – 2.0 | 0.500 | 1.295 | +159% |
| 2 | confidence_base | 0.40 – 0.80 | 0.600 | 0.443 | -26% |
| 3 | magnitude_thresh_1 | 1.0 – 5.0 | 3.000 | 3.020 | +1% |
| 4 | magnitude_thresh_2 | 3.0 – 8.0 | 5.000 | 5.076 | +2% |
| 5 | consistency_thresh | 0.5 – 4.0 | 2.000 | 1.980 | -1% |
| 6 | magnitude_bonus_1 | 0.02 – 0.20 | 0.100 | 0.120 | +20% |
| 7 | magnitude_bonus_2 | 0.02 – 0.20 | 0.100 | 0.136 | +36% |
| 8 | consistency_bonus | 0.01 – 0.15 | 0.050 | 0.040 | -20% |
| 9 | mean_reversion_rate | 0.01 – 0.30 | 0.100 | 0.257 | +157% |
| 10 | anomaly_contributes | bool | False | True | enabled |
| 11 | divergence_weight | 0.0 – 1.0 | 0.000 | 0.180 | new signal |
| 12 | tier_proximity_weight | 0.0 – 1.0 | 0.000 | 0.096 | new signal |
Key Findings
- **Direction threshold**: The optimized threshold (1.295) is 2.6x the default (0.5). The original was classifying normal score fluctuations as directional movement. A delta of 0.8 points is noise, not signal. The optimizer learned this from the data.
- **Mean reversion rate**: Increased from 0.10 to 0.257 (+157%). In this market, companies do revert toward their tier midpoints, and faster than the default assumed. The optimizer says: trust structural gravity more.
- **New signals enabled**: Dimension divergence and tier proximity were disabled by default. The optimizer enabled both with modest weights (0.18 and 0.10). Even weak extra signals improve the voting system when they are directionally correct.
- **Small-sample caveat**: Only 2 week-over-week transitions (W10→W11, W11→W12) were available; the paper used 24. The 95% CI on mean trial score is [0.621, 0.645]. These findings will stabilize as more weeks accumulate.
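To make the direction-threshold finding concrete, here is a minimal sketch of delta-to-direction classification (the function name is illustrative, not taken from the codebase):

```python
def classify_direction(delta: float, threshold: float) -> str:
    """Map a week-over-week composite delta to a direction label."""
    if delta > threshold:
        return "up"
    if delta < -threshold:
        return "down"
    return "stable"

# A 0.8-point delta reads as movement under the default threshold (0.5)
# but as noise under the optimized threshold (1.295).
classify_direction(0.8, 0.5)    # "up"
classify_direction(0.8, 1.295)  # "stable"
```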
Multi-Signal Voting System
The optimized predictor combines four signal types through a weighted voting mechanism:
Each signal votes for a direction (up/down/stable). The direction with the highest total weight wins. The final confidence is the winning vote share, clamped to [0.30, 0.95].
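A minimal sketch of what such a weighted vote with confidence clamping could look like (the signal weights shown are illustrative, not the production values):

```python
from collections import defaultdict

def vote(signals):
    """signals: (direction, weight) pairs emitted by the signal types.
    Returns the winning direction and a clamped confidence."""
    totals = defaultdict(float)
    for direction, weight in signals:
        totals[direction] += weight
    winner = max(totals, key=totals.get)
    share = totals[winner] / sum(totals.values())
    # Final confidence is the winning vote share, clamped to [0.30, 0.95]
    confidence = min(0.95, max(0.30, share))
    return winner, confidence

# Momentum and divergence vote "up"; tier proximity votes "stable"
vote([("up", 1.0), ("up", 0.18), ("stable", 0.096)])  # ("up", ~0.92)
```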
Optimization Trials: Score Distribution
30 trials plotted by score. Green dot = best trial (0.6986). Dashed line = best score.
Metrics Alignment with Paper
The paper evaluated using Exact Match (EM), F1, and DeepEval Correctness. We mapped these to metrics appropriate for time-series directional prediction:
| Paper Metric | SBPI Equivalent | What It Measures |
|---|---|---|
| Exact Match (EM) | Directional Accuracy | Did we get the direction right? (up/down/stable) |
| F1 Score | Mean Absolute Error (MAE) | How close was the predicted delta to the actual delta? |
| DeepEval Correctness | Brier Score | Was the stated confidence calibrated? (lower = better) |
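As a reference for the calibration column, a sketch of the Brier score, assuming each prediction is scored as stated confidence versus a binary "direction correct" outcome (this binary framing is my assumption, not confirmed by the source):

```python
def brier_score(predictions):
    """predictions: (confidence, was_correct) pairs. Lower is better."""
    return sum((conf - (1.0 if correct else 0.0)) ** 2
               for conf, correct in predictions) / len(predictions)

# A flat 0.5 confidence always scores 0.25 regardless of accuracy,
# matching the 0.250 reported for the persistence baseline.
brier_score([(0.5, i < 4) for i in range(17)])  # 0.25
```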
Nightly Automation
```shell
# Scheduled via launchd at 6:13 AM daily

# Phase 1: Insights
python scheduler/nightly-insights.py --schedule nightly --output file

# Phase 2: Experiment 1 (record + evaluate predictions)
python etl/prediction_experiment.py --record --evaluate

# Phase 3: Experiment 2 (KG interface optimization)
python experiment/kg_interface_optimizer.py --nightly
#   --nightly detects new week data automatically
#   If new data: runs full TPE optimization (30+ trials)
#   If no new data: reports current best config
```
Baseline Prediction Methods
Four prediction strategies evaluated against W12-2026 actuals. 17 companies. The experiment that revealed the KG-augmented method was no better than guessing "stable."
The Four Methods
1. Persistence
Predict that nothing changes. Every company stays "stable" with delta = 0 and confidence = 0.50. This is the simplest possible baseline — the null hypothesis.
2. Naive Momentum
If a company went up last week, predict it goes up again. Uses single-week delta direction with slightly elevated confidence (0.55). No multi-week signal aggregation.
3. Mean Reversion
Predict that each company's score will move toward the midpoint of its current tier (Dominant: 92.5, Strong: 77, Emerging: 62, Niche: 47, Limited: 20). Gap closure rate: 10% per week.
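This baseline can be sketched as follows, using the midpoints and 10% closure rate stated above (the function name is illustrative; no direction threshold is applied, for simplicity):

```python
TIER_MIDPOINTS = {"Dominant": 92.5, "Strong": 77, "Emerging": 62,
                  "Niche": 47, "Limited": 20}

def mean_reversion_predict(score, tier, rate=0.10):
    """Predict next week's delta as a fraction of the gap to the tier midpoint."""
    delta = rate * (TIER_MIDPOINTS[tier] - score)
    direction = "up" if delta > 0 else "down" if delta < 0 else "stable"
    return delta, direction

# A Strong-tier company at 72.0 is predicted to close 10% of its 5-point gap
mean_reversion_predict(72.0, "Strong")  # (0.5, "up")
```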
4. KG-Augmented
Query the Oxigraph knowledge graph for momentum signals (2+ consecutive same-direction weeks), dimension anomalies, and tier proximity. Apply a hardcoded confidence formula. The most sophisticated method — and the most disappointing.
Results: W12-2026 Evaluation
| Method | Dir. Accuracy | MAE | Brier Score | Verdict |
|---|---|---|---|---|
| Persistence | 23.5% (4/17) | 1.803 | 0.250 | Only hits the 4 actually-stable companies |
| Naive Momentum | 23.5% (4/17) | 1.803 | 0.279 | Same hits as persistence (no signal in 1-week trend) |
| Mean Reversion | 47.1% (8/17) | 2.107 | 0.250 | Best Exp 1 method — upward bias happened to match a rising week |
| KG-Augmented | 23.5% (4/17) | 1.803 | 0.250 | Identical to persistence — no momentum signals found |
| Optimized KG (Exp 2) | 69.9% | — | — | Multi-signal voting with tuned parameters |
Why KG-Augmented Failed
The KG-augmented method requires detecting 2 consecutive same-direction weeks as a momentum signal. With the default direction threshold of 0.5, most week-over-week deltas fell into "stable" — which means no two consecutive non-stable weeks were detected. No momentum → no signal → default to persistence.
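The gating mechanism is easy to reproduce in a sketch (the delta values are illustrative, not actual SBPI data; the real detector queries the graph rather than a list):

```python
def detect_momentum(deltas, threshold):
    """Return a direction only if the last two weekly deltas both clear the
    threshold in the same direction; otherwise None (fall back to persistence)."""
    def label(d):
        return "up" if d > threshold else "down" if d < -threshold else "stable"
    last_two = [label(d) for d in deltas[-2:]]
    if len(last_two) == 2 and last_two[0] == last_two[1] != "stable":
        return last_two[0]
    return None

detect_momentum([0.4, 0.3], threshold=0.5)  # None: both weeks read "stable"
detect_momentum([1.8, 2.1], threshold=0.5)  # "up": two consecutive up weeks
```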
This is the exact failure mode the Markovick paper warns about: "The interface between the knowledge representation and the reasoning component is as important as either component alone."
Company-Level Predictions (W12-2026)
| Company | Actual Direction | Actual Delta | Persistence | Mean Reversion | KG-Aug |
|---|---|---|---|---|---|
| Amazon | ↓ down | -2.60 | stable ✗ | up ✗ | stable ✗ |
| Both Worlds / Freeli | — stable | 0.00 | stable ✓ | up ✗ | stable ✓ |
| CandyJar | — stable | 0.00 | stable ✓ | up ✗ | stable ✓ |
| Col Belive | ↑ up | +3.15 | stable ✗ | up ✓ | stable ✗ |
| Disney | ↑ up | +2.30 | stable ✗ | up ✓ | stable ✗ |
| DramaBox | ↑ up | +4.00 | stable ✗ | up ✓ | stable ✗ |
| GoodShort | ↑ up | +1.70 | stable ✗ | up ✓ | stable ✗ |
| iQIYI | ↑ up | +1.20 | stable ✗ | up ✓ | stable ✗ |
| JioHotstar | ↑ up | +3.95 | stable ✗ | up ✓ | stable ✗ |
| Klip | ↓ down | -2.65 | stable ✗ | up ✗ | stable ✗ |
| Lifetime / A&E | ↑ up | +1.35 | stable ✗ | up ✓ | stable ✗ |
| Mansa | ↑ up | +1.85 | stable ✗ | up ✓ | stable ✗ |
| Netflix | ↓ down | -2.00 | stable ✗ | up ✗ | stable ✗ |
| ReelShort | ↓ down | -2.05 | stable ✗ | up ✗ | stable ✗ |
| RTP | — stable | 0.00 | stable ✓ | up ✗ | stable ✓ |
| Verza TV | — stable | 0.00 | stable ✓ | up ✗ | stable ✓ |
| Viu | ↓ down | -1.85 | stable ✗ | up ✗ | stable ✗ |
W12-2026 was an upward-biased week (8 up, 5 down, 4 stable). Mean reversion's inherent upward bias (all companies predicted "up" toward tier midpoints) happened to align with this pattern, explaining its 47.1% score.
What Experiment 1 Taught Us
- The KG exists but doesn't speak. 2,588 triples in the store. The prediction engine queries them. But the hardcoded thresholds filter out all meaningful signal.
- Mean reversion is a surprisingly strong baseline in a market with clear tier structure. Companies do gravitate toward their tier midpoints.
- Single-week momentum is useless. Naive momentum and persistence are statistically identical, suggesting that 1-week deltas carry no directional information.
- The problem is configuration, not architecture. The graph, the queries, the prediction logic, and the evaluation framework all work. The parameters connecting them were wrong.
Methodology & System Architecture
How the SBPI semantic layer stores, queries, predicts, and optimizes. From RDF triples to TPE trials.
System Architecture
Knowledge Graph
- Store: Oxigraph 0.5.6 (Rust-based RDF/SPARQL engine, port 7878)
- Triples: 2,588 as of W12-2026
- Ontology: custom sbpi: namespace (https://shurai.com/ontology/sbpi#)
- Entity types: Company, Week, ScoreRecord, DimensionScore, Tier, Prediction
- 5 dimensions: Distribution, Engagement, Community, Content, Social
SPARQL Queries Used
| Query | Purpose | Returns |
|---|---|---|
| get_weeks | All week labels in store | ["W10-2026", "W11-2026", "W12-2026"] |
| get_scores_for_week | All company composites for a week | slug, name, composite, delta, tier |
| get_dimension_scores | Per-dimension breakdown for anomaly detection | slug, dimension, value, composite |
| weekly-movers.rq | Companies with biggest deltas | Nightly insights digest |
| dimension-anomalies.rq | Dimension-composite gaps > 20 | Anomaly alerts |
| predictive-signals.rq | Momentum patterns for prediction | Signal candidates |
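For illustration, a sketch of how a query like get_scores_for_week might be sent to Oxigraph's SPARQL endpoint over HTTP. The port comes from this document; the query shape and predicate names (sbpi:company, sbpi:week, etc.) are assumptions about the ontology, not confirmed:

```python
import json
import urllib.request

ENDPOINT = "http://localhost:7878/query"  # Oxigraph SPARQL endpoint (port per this doc)

def build_scores_query(week_label: str) -> str:
    """Hypothetical shape of the get_scores_for_week query."""
    return f"""
    PREFIX sbpi: <https://shurai.com/ontology/sbpi#>
    SELECT ?slug ?composite ?delta WHERE {{
        ?rec a sbpi:ScoreRecord ;
             sbpi:company ?slug ;
             sbpi:week "{week_label}" ;
             sbpi:composite ?composite ;
             sbpi:delta ?delta .
    }}"""

def run_query(query: str) -> dict:
    """POST the query and parse SPARQL JSON results."""
    req = urllib.request.Request(
        ENDPOINT, data=query.encode(),
        headers={"Content-Type": "application/sparql-query",
                 "Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```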
TPE (Tree-structured Parzen Estimator)
TPE is a Bayesian optimization algorithm that models the objective function as two density functions: one over "good" configurations (l) and one over "bad" configurations (g). It maximizes the ratio l(x)/g(x) to propose the next trial.
Why TPE Over Grid Search or Random Search
| Method | Trials Needed | Handles Interactions | Best For |
|---|---|---|---|
| Grid Search | 3^12 = 531,441 (3 values per parameter) | No | 1–3 parameters |
| Random Search | ~100–500 | Partially | Quick exploration |
| TPE (Optuna) | 30–50 | Yes | 12+ parameters with interactions |
Implementation
```python
# Optuna TPE sampler with seed for reproducibility
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
)

# Each trial samples a 12-parameter configuration
def objective(trial):
    config = TrialConfig(
        direction_threshold=trial.suggest_float("direction_threshold", 0.1, 2.0),
        confidence_base=trial.suggest_float("confidence_base", 0.40, 0.80),
        # ... 10 more parameters
    )
    # Evaluate across all training week pairs
    score = average_directional_accuracy(config, train_weeks)
    return score

study.optimize(objective, n_trials=30)
```
Bootstrap Confidence Intervals
Following the paper's methodology, we use non-parametric bootstrap resampling (1,000 resamples, 95% CI) rather than assuming normality. This is appropriate for small sample sizes and non-Gaussian score distributions.
```python
import random
from statistics import mean

def bootstrap_ci(values, n_resamples=1000, ci=0.95):
    means = []
    for _ in range(n_resamples):
        sample = random.choices(values, k=len(values))
        means.append(mean(sample))
    means.sort()
    alpha = (1 - ci) / 2
    return (means[int(alpha * n_resamples)],
            means[int((1 - alpha) * n_resamples)])
```
Evaluation Protocol
- Training: All available week transitions (currently W10→W11, W11→W12). The optimizer sees both input scores and actual outcomes.
- Metric: Directional accuracy (did the prediction get up/down/stable correct?). This is the primary metric. MAE and Brier are tracked but not optimized.
- Cross-validation: Not yet implemented (insufficient data). When 4+ weeks are available, leave-one-out cross-validation will replace training-set evaluation.
- Holdout: No holdout set yet. The first new week (W13) after optimization serves as a true out-of-sample test.
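When enough weeks exist, the planned leave-one-out protocol could look like this sketch (the optimize/evaluate callables stand in for the real TPE run and scorer; they are placeholders, not the actual code):

```python
def leave_one_out(pairs, optimize, evaluate):
    """Hold out each week transition in turn; optimize on the rest, test on it.
    Returns the mean out-of-sample score."""
    scores = []
    for i, held_out in enumerate(pairs):
        train = pairs[:i] + pairs[i + 1:]
        config = optimize(train)          # e.g. a 30-trial TPE run
        scores.append(evaluate(config, held_out))
    return sum(scores) / len(scores)
```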
File Reference
| File | Lines | Purpose |
|---|---|---|
| experiment/kg_interface_optimizer.py | 850 | Experiment 2: TPE optimizer, multi-signal predictor, report generator |
| etl/prediction_experiment.py | 675 | Experiment 1: 4-method comparison, evaluation, recording |
| etl/prediction_engine.py | 643 | Production prediction engine (hardcoded params; Exp 2 target) |
| etl/store_client.py | 101 | Shared Oxigraph HTTP/direct access |
| scheduler/nightly-insights.py | 286 | Nightly SPARQL query runner, markdown digest |
| scheduler/weekly-prediction-cycle.py | 290 | Orchestrator: ETL → Predict → Attest → Insights → Optimize |
| experiment/best-config.json | 14 | Current optimized 12-parameter config |
| experiment/optimization-log.json | ~570 | All 30 trial configs and scores |
| experiment/evaluation-log.json | ~880 | Experiment 1 per-company results |
Ontology Namespace
```sparql
PREFIX sbpi: <https://shurai.com/ontology/sbpi#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>

# Core classes
sbpi:Company         # 17 tracked companies
sbpi:Week            # Weekly measurement periods
sbpi:ScoreRecord     # Composite + delta per company-week
sbpi:DimensionScore  # Per-dimension breakdown
sbpi:Tier            # Dominant/Strong/Emerging/Niche/Limited
sbpi:Prediction      # Generated forecasts with confidence
```