SHUR IQ | SBPI Experiment Lab
Semantic Layer
SBPI Semantic Layer — Experiment Report

Optimizing the Knowledge Graph → Prediction Interface

Applying academic hyperparameter optimization to competitive intelligence prediction in the micro-drama vertical. Two experiments, 17 companies, 30 optimization trials.

69.9%
Optimized Accuracy
Experiment 2 best
23.5%
KG-Augmented Baseline
Experiment 1
+46.3%
Improvement
Over KG baseline
30
TPE Trials
Optuna optimizer

The Problem

The SBPI (Structural Brand Power Index) semantic layer tracks 17 micro-drama companies across 5 scoring dimensions, producing weekly composite scores stored as RDF triples in an Oxigraph knowledge graph. The prediction engine tries to forecast which companies will move up, down, or stay stable next week.

Experiment 1 revealed a critical failure: the KG-augmented prediction method performed identically to the naive persistence baseline (23.5% directional accuracy). Every company was predicted as "stable" with 0.5 confidence. The knowledge graph existed, but the interface between the graph and the prediction logic was misconfigured.

Core Insight

The problem was not the knowledge graph's content or the prediction algorithm's logic. It was the interface between them — the 12 hardcoded parameters that translate graph signals into predictions. These parameters were set by intuition, never tuned against actual outcomes.

The Solution

We adapted the methodology from Markovick et al. (2025), "Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning", which demonstrated that systematic hyperparameter optimization of the KG→reasoning interface yields significant accuracy gains across multiple benchmarks.

Their paper optimized 6 parameters (chunk size, search type, top-k, prompt templates) using Tree-structured Parzen Estimation (TPE) across 50 trials. We mapped this approach to our domain-specific context: 12 parameters controlling how SPARQL-queried graph data becomes directional predictions.

Experiment Pipeline

SPARQL
Parameterize
Predict
Score
Optimize (TPE)
Report

Experiment Timeline

WeekEventData Points
W10-2026First week of SBPI data loaded into Oxigraph17 companies, 5 dimensions each
W11-2026Second week loaded — first transition pair availableTraining pair 1: W10→W11
W12-2026Third week loaded — Experiment 1 evaluatedTraining pair 2: W11→W12
2026-03-24Experiment 1 results published (evaluation-log.json)4 methods evaluated
2026-03-25Experiment 2 optimization run (30 TPE trials)Best config: 69.9% accuracy
OngoingNightly auto-optimization (re-runs when new week data arrives)Compounding improvement

What Happens Next

The optimizer runs nightly at 6:13 AM as part of the SBPI pipeline. When new week data is loaded (W13, W14, etc.), it automatically re-optimizes with a larger training set. With only 2 transition pairs today, the training data is thin. As weeks accumulate, the optimizer gains more signal, confidence intervals tighten, and the system self-improves.

Compounding Advantage

Each new week of data makes the optimizer better at predicting the next week. The configuration that works for a 3-week window may differ from the configuration that works for a 12-week window. Continuous re-optimization captures this drift automatically.

Experiment 2

KG-LLM Interface Optimization

Adapted from Markovick et al. (2025). 12-parameter search space, TPE optimization via Optuna, multi-signal voting system with bootstrap confidence intervals.

Research Foundation

The Paper

"Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning"

Markovick, Obradović, Hajdu, Pavlović (2025)
arXiv:2505.24478v1

The authors used the Cognee framework to build KG-augmented QA systems, then systematically optimized 6 interface parameters using TPE across 50 trials. Key finding: default configurations leave 10–30% accuracy on the table.

Their 6 Parameters

ParameterControls
chunk_sizeHow text is segmented for graph construction
search_typeText search vs. graph traversal vs. hybrid
top_kNumber of retrieved context chunks
qa_system_promptHow the LLM reasons over retrieved context
graph_construction_promptHow entities/relations are extracted
task_getter_typeWhether summaries are included with chunks

Parameter Mapping: Paper → SBPI

The paper's KG system uses LLM-based QA over unstructured text. Our system uses SPARQL queries over structured RDF data. The abstraction level is different, but the principle is identical: the interface between the knowledge representation and the reasoning logic has tunable parameters that dramatically affect output quality.

Paper ParameterSBPI ParameterWhy This Mapping
chunk_size direction_threshold Both control granularity — how much signal constitutes a meaningful unit
search_type anomaly_contributes Both toggle between retrieval strategies (text vs. graph; momentum-only vs. multi-signal)
top_k divergence_weight, tier_proximity_weight Both control how many signals participate in the final answer
qa_system_prompt confidence_base, magnitude_bonus_*, consistency_bonus Both define the reasoning formula that produces a confidence score
graph_construction_prompt mean_reversion_rate Both are structural parameters about how the graph's topology informs predictions
task_getter_type anomaly_contributes Both toggle extra context availability (summaries; dimension anomaly signals)

The 12-Parameter Search Space

#ParameterRangeDefaultOptimizedChange
1direction_threshold0.1 – 2.00.5001.295 +159%
2confidence_base0.40 – 0.800.6000.443 -26%
3magnitude_thresh_11.0 – 5.03.0003.020 +1%
4magnitude_thresh_23.0 – 8.05.0005.076 +2%
5consistency_thresh0.5 – 4.02.0001.980 -1%
6magnitude_bonus_10.02 – 0.200.1000.120 +20%
7magnitude_bonus_20.02 – 0.200.1000.136 +36%
8consistency_bonus0.01 – 0.150.0500.040 -20%
9mean_reversion_rate0.01 – 0.300.1000.257 +157%
10anomaly_contributesboolFalseTrue enabled
11divergence_weight0.0 – 1.00.0000.180 new signal
12tier_proximity_weight0.0 – 1.00.0000.096 new signal

Key Findings

Finding 1: Direction Threshold Was Too Sensitive

The optimized threshold (1.295) is 2.6x the default (0.5). The original was classifying normal score fluctuations as directional movement. A delta of 0.8 points is noise, not signal. The optimizer learned this from the data.

Finding 2: Mean Reversion Is Underweighted

Rate increased from 0.10 to 0.257 (+157%). In this market, companies do revert toward their tier midpoints — and faster than the default assumed. The optimizer says: trust structural gravity more.

Finding 3: Anomaly Signals Matter

Dimension divergence and tier proximity were disabled by default. The optimizer enabled both with modest weights (0.18 and 0.10). Even weak extra signals improve the voting system when they're directionally correct.

Caveat: Small Training Set

Only 2 week-over-week transitions (W10→W11, W11→W12) — the paper used 24. The 95% CI on mean trial score is [0.621, 0.645]. These findings will stabilize as more weeks accumulate.

Multi-Signal Voting System

The optimized predictor combines four signal types through a weighted voting mechanism:

Momentum
0.44 conf
Mean Reversion
0.40 weight
Dim. Divergence
0.18 weight
Tier Proximity
0.10 weight

Each signal votes for a direction (up/down/stable). The direction with the highest total weight wins. The final confidence is the winning vote share, clamped to [0.30, 0.95].

Optimization Trials: Score Distribution

0.70 0.65 0.60 0.55

30 trials plotted by score. Green dot = best trial (0.6986). Dashed line = best score.

0.6986
Best Trial
Trial #28
0.5754
Worst Trial
Trial #4
0.6335
Mean Score
σ = 0.0332
[.621, .645]
95% CI
Bootstrap, n=1000

Metrics Alignment with Paper

The paper evaluated using Exact Match (EM), F1, and DeepEval Correctness. We mapped these to metrics appropriate for time-series directional prediction:

Paper MetricSBPI EquivalentWhat It Measures
Exact Match (EM) Directional Accuracy Did we get the direction right? (up/down/stable)
F1 Score Mean Absolute Error (MAE) How close was the predicted delta to the actual delta?
DeepEval Correctness Brier Score Was the stated confidence calibrated? (lower = better)

Nightly Automation

# Scheduled via launchd at 6:13 AM daily
# Phase 1: Insights
python scheduler/nightly-insights.py --schedule nightly --output file

# Phase 2: Experiment 1 (record + evaluate predictions)
python etl/prediction_experiment.py --record --evaluate

# Phase 3: Experiment 2 (KG interface optimization)
python experiment/kg_interface_optimizer.py --nightly
#   --nightly detects new week data automatically
#   If new data: runs full TPE optimization (30+ trials)
#   If no new data: reports current best config
Experiment 1

Baseline Prediction Methods

Four prediction strategies evaluated against W12-2026 actuals. 17 companies. The experiment that revealed the KG-augmented method was no better than guessing "stable."

The Four Methods

1. Persistence

Predict that nothing changes. Every company stays "stable" with delta = 0 and confidence = 0.50. This is the simplest possible baseline — the null hypothesis.

2. Naive Momentum

If a company went up last week, predict it goes up again. Uses single-week delta direction with slightly elevated confidence (0.55). No multi-week signal aggregation.

3. Mean Reversion

Predict that each company's score will move toward the midpoint of its current tier (Dominant: 92.5, Strong: 77, Emerging: 62, Niche: 47, Limited: 20). Gap closure rate: 10% per week.

4. KG-Augmented

Query the Oxigraph knowledge graph for momentum signals (2+ consecutive same-direction weeks), dimension anomalies, and tier proximity. Apply a hardcoded confidence formula. The most sophisticated method — and the most disappointing.

Results: W12-2026 Evaluation

Persistence
23.5%
Naive Momentum
23.5%
Mean Reversion
47.1%
KG-Augmented
23.5%
Optimized (Exp 2)
69.9%
MethodDir. AccuracyMAEBrier ScoreVerdict
Persistence23.5% (4/17)1.8030.250Only hits the 4 actually-stable companies
Naive Momentum23.5% (4/17)1.8030.279Same hits as persistence (no signal in 1-week trend)
Mean Reversion47.1% (8/17)2.1070.250Best Exp 1 method — upward bias happened to match a rising week
KG-Augmented23.5% (4/17)1.8030.250Identical to persistence — no momentum signals found
Optimized KG (Exp 2)69.9%Multi-signal voting with tuned parameters

Why KG-Augmented Failed

Root Cause Analysis

The KG-augmented method requires detecting 2 consecutive same-direction weeks as a momentum signal. With the default direction threshold of 0.5, most week-over-week deltas fell into "stable" — which means no two consecutive non-stable weeks were detected. No momentum → no signal → default to persistence.

This is the exact failure mode the Markovick paper warns about: "The interface between the knowledge representation and the reasoning component is as important as either component alone."

Company-Level Predictions (W12-2026)

CompanyActual DirectionActual DeltaPersistenceMean ReversionKG-Aug
Amazon↓ down-2.60stable ✗up ✗stable ✗
Both Worlds / Freeli— stable0.00stable ✓up ✗stable ✓
CandyJar— stable0.00stable ✓up ✗stable ✓
Col Belive↑ up+3.15stable ✗up ✓stable ✗
Disney↑ up+2.30stable ✗up ✓stable ✗
DramaBox↑ up+4.00stable ✗up ✓stable ✗
GoodShort↑ up+1.70stable ✗up ✓stable ✗
iQIYI↑ up+1.20stable ✗up ✓stable ✗
JioHotstar↑ up+3.95stable ✗up ✓stable ✗
Klip↓ down-2.65stable ✗up ✗stable ✗
Lifetime / A&E↑ up+1.35stable ✗up ✓stable ✗
Mansa↑ up+1.85stable ✗up ✓stable ✗
Netflix↓ down-2.00stable ✗up ✗stable ✗
ReelShort↓ down-2.05stable ✗up ✗stable ✗
RTP— stable0.00stable ✓up ✗stable ✓
Verza TV— stable0.00stable ✓up ✗stable ✓
Viu↓ down-1.85stable ✗up ✗stable ✗

W12-2026 was an upward-biased week (8 up, 5 down, 4 stable). Mean reversion's inherent upward bias (all companies predicted "up" toward tier midpoints) happened to align with this pattern, explaining its 47.1% score.

What Experiment 1 Taught Us

  1. The KG exists but doesn't speak. 2,588 triples in the store. The prediction engine queries them. But the hardcoded thresholds filter out all meaningful signal.
  2. Mean reversion is a surprisingly strong baseline in a market with clear tier structure. Companies do gravitate toward their tier midpoints.
  3. Single-week momentum is useless. Naive momentum and persistence are statistically identical, suggesting that 1-week deltas carry no directional information.
  4. The problem is configuration, not architecture. The graph, the queries, the prediction logic, and the evaluation framework all work. The parameters connecting them were wrong.
Technical Reference

Methodology & System Architecture

How the SBPI semantic layer stores, queries, predicts, and optimizes. From RDF triples to TPE trials.

System Architecture

CSV Data
sbpi_to_rdf.py
Oxigraph (RDF)
SPARQL Queries
Predictions
TPE Optimizer

Knowledge Graph

SPARQL Queries Used

QueryPurposeReturns
get_weeksAll week labels in store["W10-2026", "W11-2026", "W12-2026"]
get_scores_for_weekAll company composites for a weekslug, name, composite, delta, tier
get_dimension_scoresPer-dimension breakdown for anomaly detectionslug, dimension, value, composite
weekly-movers.rqCompanies with biggest deltasNightly insights digest
dimension-anomalies.rqDimension-composite gaps > 20Anomaly alerts
predictive-signals.rqMomentum patterns for predictionSignal candidates

TPE (Tree-structured Parzen Estimation)

TPE is a Bayesian optimization algorithm that models the objective function as two density functions: one over "good" configurations (l) and one over "bad" configurations (g). It maximizes the ratio l(x)/g(x) to propose the next trial.

Why TPE Over Grid Search or Random Search

MethodTrials NeededHandles InteractionsBest For
Grid Search312 = 531,441No1–3 parameters
Random Search~100–500PartiallyQuick exploration
TPE (Optuna)30–50Yes12+ parameters with interactions

Implementation

# Optuna TPE sampler with seed for reproducibility
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42)
)

# Each trial samples a 12-parameter configuration
def objective(trial):
    config = TrialConfig(
        direction_threshold=trial.suggest_float("direction_threshold", 0.1, 2.0),
        confidence_base=trial.suggest_float("confidence_base", 0.40, 0.80),
        # ... 10 more parameters
    )
    # Evaluate across all training week pairs
    score = average_directional_accuracy(config, train_weeks)
    return score

study.optimize(objective, n_trials=30)

Bootstrap Confidence Intervals

Following the paper's methodology, we use non-parametric bootstrap resampling (1,000 resamples, 95% CI) rather than assuming normality. This is appropriate for small sample sizes and non-Gaussian score distributions.

def bootstrap_ci(values, n_resamples=1000, ci=0.95):
    means = []
    for _ in range(n_resamples):
        sample = random.choices(values, k=len(values))
        means.append(mean(sample))
    means.sort()
    alpha = (1 - ci) / 2
    return means[int(alpha * n_resamples)], means[int((1 - alpha) * n_resamples)]

Evaluation Protocol

  1. Training: All available week transitions (currently W10→W11, W11→W12). The optimizer sees both input scores and actual outcomes.
  2. Metric: Directional accuracy (did the prediction get up/down/stable correct?). This is the primary metric. MAE and Brier are tracked but not optimized.
  3. Cross-validation: Not yet implemented (insufficient data). When 4+ weeks are available, leave-one-out cross-validation will replace training-set evaluation.
  4. Holdout: No holdout set yet. The first new week (W13) after optimization serves as a true out-of-sample test.

File Reference

FileLinesPurpose
experiment/kg_interface_optimizer.py850Experiment 2: TPE optimizer, multi-signal predictor, report generator
etl/prediction_experiment.py675Experiment 1: 4-method comparison, evaluation, recording
etl/prediction_engine.py643Production prediction engine (hardcoded params — Exp 2 target)
etl/store_client.py101Shared Oxigraph HTTP/direct access
scheduler/nightly-insights.py286Nightly SPARQL query runner, markdown digest
scheduler/weekly-prediction-cycle.py290Orchestrator: ETL → Predict → Attest → Insights → Optimize
experiment/best-config.json14Current optimized 12-parameter config
experiment/optimization-log.json~570All 30 trial configs and scores
experiment/evaluation-log.json~880Experiment 1 per-company results

Ontology Namespace

PREFIX sbpi: <https://shurai.com/ontology/sbpi#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>

# Core classes
sbpi:Company    # 17 tracked companies
sbpi:Week       # Weekly measurement periods
sbpi:ScoreRecord # Composite + delta per company-week
sbpi:DimensionScore # Per-dimension breakdown
sbpi:Tier       # Dominant/Strong/Emerging/Niche/Limited
sbpi:Prediction # Generated forecasts with confidence