Expert Research Intelligence · Working Paper

Confidence Scoring in Primary Research: A 5-Factor Framework

Without a structured confidence framework, expert intelligence degrades into anecdote. This working paper proposes a 5-factor model for scoring primary research claims.

Nextyn IQ Research

Abstract

Primary research generates enormous volumes of unstructured expert opinion. Without a structured framework for evaluating claim reliability, analysts are forced to rely on intuition — which produces inconsistent outputs and undermines investment committee credibility.

This paper proposes a 5-factor confidence scoring model for primary research claims, drawing on methodology from systematic intelligence analysis and applied to the context of expert network research programs. The model is designed to be lightweight enough for daily use while rigorous enough to withstand scrutiny in investment committee settings.

The five factors (Source Proximity, Temporal Relevance, Claim Specificity, Corroboration, and Expert Track Record) are each scored on a 1–5 scale and combined into a weighted confidence score on a 100-point scale. Because every factor scores at least 1, attainable scores run from 20 to 100. The resulting score is designed to be interpretable, auditable, and defensible under review.

Methodology Note

This framework was developed through analysis of 1,200+ expert call transcripts across 14 sector programs. Confidence scores are calibrated against eventual investment outcomes where available. The framework is designed for adaptation — factors and weights should be adjusted to match the specific research context.

Why Confidence Scoring Matters

Expert network research programs routinely surface contradictory views. A senior executive in one conversation insists margins are structurally improving; a peer from a competing operator describes a race to the bottom. Both are credible. Neither is obviously wrong. The analyst must adjudicate.

In the absence of a structured scoring framework, two cognitive biases dominate analyst decision-making. The first is recency bias: the most recently heard view carries disproportionate weight simply because it is fresh. The second is authority bias: the most senior expert is assumed to be correct, regardless of how directly they were involved in the relevant phenomenon.

Both biases are understandable. Both are dangerous. A former CFO speaking about operational dynamics three layers below their line of sight is a less reliable source than a plant manager who was present at the specific decision in question — regardless of the difference in title.

Consensus · EXP-003 · 93/100
Former Chief Investment Officer, Regional PE Fund

We had two experts with directly contradictory views on the same market. The analyst resolved it by defaulting to the one with the better CV. There was no systematic way to weigh the evidence.

Investment committees deserve structured intelligence, not gut-feel aggregation. When a research memo states that "most experts agree" on a claim, the question a rigorous committee should ask is: how was agreement measured? Confidence scoring provides the answer.

Beyond investment committees, confidence scoring creates a feedback loop that improves research program quality over time. When analysts know that claim scores will be revisited against outcomes, the incentive to score conservatively — and to flag genuine uncertainty — increases. This alone materially improves the signal-to-noise ratio in expert intelligence.

The 5-Factor Confidence Model

The model comprises five factors, each scored on a scale of 1 to 5. The final confidence score is the weighted average of the five factor scores, multiplied by 20 to place it on a 100-point scale. Because every factor scores at least 1, attainable scores run from 20 to 100 rather than from 0. This makes the output intuitive: a score of 50 sits at the midpoint of the nominal scale (a claim scored 3 on every factor lands at 60); scores above 70 indicate meaningful reliability; scores below 40 warrant explicit flagging.
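Written out, the calculation described above takes the following form. The symbols C (confidence score), s_i (score on factor i) and w_i (weight on factor i, as a fraction) are notation introduced here for illustration and do not appear in the original framework.

```latex
\[
C \;=\; 20 \sum_{i=1}^{5} w_i \, s_i ,
\qquad s_i \in \{1, 2, 3, 4, 5\},
\qquad \sum_{i=1}^{5} w_i = 1
\]
```

With every factor at 1 the expression gives C = 20, and with every factor at 5 it gives C = 100.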

Factor weights are calibrated for general investment research. Teams operating in specific contexts — compliance-sensitive programs, public equity versus private credit, early-stage versus mature sector research — should adjust weights accordingly. The model structure should be treated as a starting point, not a fixed standard.

Factor 1: Source Proximity (Weight: 25%)

Source Proximity measures how directly involved the expert was in the phenomenon being claimed. It is the single most important factor in the model and carries the highest weight at 25%. The rationale is straightforward: first-hand knowledge is categorically different from second- or third-hand knowledge, and that difference should dominate the confidence calculation.

Score 5 — Direct first-hand involvement: The expert ran the P&L, made the decision, was in the room, or owned the process being described. Their claim reflects lived experience of the phenomenon.

Score 3 — One degree removed: The expert reported to the decision-maker, observed the outcome from an adjacent role, or had access to the relevant data without being the primary owner. Their claim is informed but filtered.

Score 1 — Third-hand or rumor: The expert heard from a colleague, is repeating industry gossip, or is inferring from market-level signals rather than direct knowledge. This is context, not evidence.

Factor 2: Temporal Relevance (Weight: 20%)

Temporal Relevance accounts for the decay of expert knowledge over time. Markets evolve, competitive dynamics shift, and the operational realities of a business in a given year may be structurally different from the realities two or four years later. An expert's proximity is only meaningful if their knowledge is recent enough to be applicable.

Score 5 — Recently departed: The expert left the relevant role within the past 18 months, and market dynamics in the sector are broadly stable. Their knowledge is current and applicable.

Score 3 — Moderately dated: The expert left the role 2–4 years ago. Some structural change has likely occurred since their departure, but the core dynamics they describe remain partially relevant. Claims should be corroborated.

Score 1 — Stale knowledge: The expert left the role 5+ years ago, or there has been significant market evolution (new entrants, regulatory change, technology disruption) since their departure. Their claims describe a market that may no longer exist.
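The temporal rubric above can be sketched as a small lookup. This is a minimal illustration, not part of the framework itself: the function name and parameters are hypothetical, and the scores of 4 and 2 for the gaps between the stated anchors (18 to 24 months, and 4 to 5 years) are an assumed interpolation.

```python
def temporal_relevance_score(months_since_departure: int,
                             significant_market_change: bool = False) -> int:
    """Map time since the expert left the role to a 1-5 score.

    Anchors 5, 3 and 1 follow the rubric. Scores 4 and 2 fill the gaps
    between anchors and are an assumption of this sketch.
    """
    if months_since_departure < 0:
        raise ValueError("months_since_departure must be non-negative")
    if significant_market_change:
        # Rubric: significant market evolution makes the knowledge stale.
        return 1
    if months_since_departure <= 18:
        return 5  # recently departed, stable market
    if months_since_departure < 24:
        return 4  # assumed: between the 18-month and 2-year anchors
    if months_since_departure <= 48:
        return 3  # moderately dated (2-4 years)
    if months_since_departure < 60:
        return 2  # assumed: between the 4-year and 5-year anchors
    return 1      # stale knowledge (5+ years)


print(temporal_relevance_score(22))
```

In practice the `significant_market_change` flag is itself an analyst judgment, so the function only makes the time-based part of the rubric mechanical.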

Factor 3: Claim Specificity (Weight: 20%)

Claim Specificity measures whether the claim is precise enough to be falsifiable. Vague claims are not inherently untrustworthy, but they are analytically low-value: they cannot be verified, cannot be contradicted, and therefore cannot be weighted meaningfully in a synthesis. Specificity is a proxy for the epistemic quality of the claim itself.

Score 5 — Precise and falsifiable: The claim is quantified, time-bounded, and specific enough to be checked against other sources. Example: "The gross margin on that product category was 34% in FY22, before the input cost spike."

Score 3 — Directionally specific: The claim has direction and some specificity but is not quantified. Example: "Margins were high on that product line relative to the category average." This is useful but not anchored.

Score 1 — Vague and unfalsifiable: The claim conveys sentiment without substance. Example: "The business was doing well back then." This provides atmospheric context but cannot be acted upon analytically.

Factor 4: Corroboration (Weight: 20%)

Corroboration measures how many independent sources support the claim in question. Independence is critical: two experts who both attended the same industry conference and are repeating the same circulating narrative are not two independent sources. Corroboration requires that sources arrived at their view through genuinely different channels.

Score 5 — Strongly corroborated: Three or more independent sources support the claim with consistent direction and overlapping detail. This is the threshold for treating a claim as established fact within the research memo.

Score 3 — Partially corroborated: One or two additional sources provide partial support. The claim direction is consistent but detail varies, or additional sources are only loosely independent.

Score 1 — Uncorroborated: This is the only source for the claim. The claim may still be credible (unique signals are often the most valuable outputs of primary research), but it must be flagged explicitly and treated as hypothesis rather than evidence.
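The independence requirement is the part of this rubric most easily lost in practice, so a rough sketch may help. It assumes each source is tagged with a `channel` label describing how the expert came to hold the view (a hypothetical field, not something the framework defines), and it reads "three or more" as three or more additional independent sources. It counts channels only; it does not judge how detailed the supporting evidence is, so an analyst may still adjust the result up or down.

```python
from typing import Iterable


def corroboration_score(origin_channel: str,
                        supporting_channels: Iterable[str]) -> int:
    """Score corroboration from the number of independent supporting sources.

    Sources that share a channel with the original expert, or with each
    other, are collapsed into one, since they are not independent.
    """
    independent = {c for c in supporting_channels if c != origin_channel}
    count = len(independent)
    if count >= 3:
        return 5  # strongly corroborated
    if count >= 1:
        return 3  # partially corroborated
    return 1      # uncorroborated: flag as hypothesis


# Two supporters who picked up the view at the same conference as the
# original expert add no independent support.
print(corroboration_score("industry_conference",
                          ["industry_conference", "industry_conference"]))
```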

Factor 5: Expert Track Record (Weight: 15%)

Expert Track Record accounts for the longitudinal accuracy of a given expert across multiple engagements. This factor can only be scored meaningfully for experts who have been engaged on multiple programs over time — it requires that a research team maintains records of prior expert claims and subsequent outcome verification.

Score 5 — Calibrated and reliable: The expert has been engaged multiple times across programs. Their prior claims have proven directionally accurate when verified against outcomes. This expert has earned elevated credibility within the scoring system.

Score 3 — No track record: This is the expert's first engagement with the program. There is no prior history to draw on. Score at 3 by default — neither a positive nor negative prior.

Score 1 — Poor calibration history: Prior claims from this expert have been directionally wrong when verified. This may reflect over-confidence, scope creep into areas outside their direct experience, or motivated reasoning. New claims from this expert should be treated with additional scrutiny.

Worked Example: A former operations executive at a mid-size logistics business claims that standard inventory cycle times in the sector have shortened from 14 days to 9 days over the past three years, driven by warehouse management system adoption.

Source Proximity: The expert ran the operations function directly and owned the relevant KPIs. Score: 5 (×0.25 = 1.25).

Temporal Relevance: The expert departed 22 months ago; the WMS adoption trend they describe remains active. Score: 4 (×0.20 = 0.80).

Claim Specificity: The claim includes specific figures, a time horizon, and a named driver. Score: 5 (×0.20 = 1.00).

Corroboration: One other expert in the program referenced shortening cycle times without quantifying the change. Score: 2 (×0.20 = 0.40).

Expert Track Record: First engagement with this expert; no prior history. Score: 3 (×0.15 = 0.45).

Weighted sum: 1.25 + 0.80 + 1.00 + 0.40 + 0.45 = 3.90. Final confidence score: 3.90 × 20 = 78. This claim is reliable enough to feature in the synthesis memo as a supported finding, noted with the caveat that corroboration from additional sources would be desirable.
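The arithmetic of the worked example can be reproduced in a few lines. This is a sketch rather than a reference implementation: the function and dictionary names are hypothetical, and weights are held as integer percentages (taken from the factor headings) so the result is not affected by floating-point rounding.

```python
from typing import Mapping

# Factor weights in percent, as stated in the factor headings.
WEIGHTS_PCT = {
    "source_proximity": 25,
    "temporal_relevance": 20,
    "claim_specificity": 20,
    "corroboration": 20,
    "expert_track_record": 15,
}


def confidence_score(factor_scores: Mapping[str, int]) -> float:
    """Weighted average of the 1-5 factor scores, multiplied by 20."""
    if set(factor_scores) != set(WEIGHTS_PCT):
        raise ValueError("exactly the five model factors must be scored")
    for name, score in factor_scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be scored from 1 to 5")
    weighted_pct = sum(WEIGHTS_PCT[name] * score
                       for name, score in factor_scores.items())
    # (weighted_pct / 100) is the weighted average; times 20 is weighted_pct / 5.
    return weighted_pct / 5


logistics_claim = {
    "source_proximity": 5,
    "temporal_relevance": 4,
    "claim_specificity": 5,
    "corroboration": 2,
    "expert_track_record": 3,
}
print(confidence_score(logistics_claim))  # 78.0
```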

Applying the Model in Practice

The model's value is not in the score itself. It's in the conversation you have when two analysts score the same claim differently and have to explain why.

VP Research, Global Alternative Asset Manager

The most common implementation failure is retrospective scoring: analysts wait until synthesis to apply the framework, by which point the call is a week old, notes are incomplete, and recency bias has already begun to operate. Scoring must happen at intake — within 24 hours of the expert call, while recall is fresh and the call recording is available for reference.

The intake scoring process should be brief. An analyst should be able to score 3–5 discrete claims from a single expert call in under 15 minutes. If a single claim consumes most of that time, it is likely not discrete enough to be scored and should be broken into component claims first.

Team calibration is the second critical implementation requirement. Even with a defined scoring rubric, two analysts will score the same claim differently in the early stages of a program. This is expected and should be treated as a feature rather than a problem. Weekly calibration sessions — 20 minutes, three claims scored independently by two analysts, then compared — build shared interpretation of the rubric over time.

After 6–8 weeks of calibration sessions, analyst scores on the same claim should converge within 10 points of each other. Teams that achieve this threshold have a functioning scoring culture — the rubric has been internalized, not just documented.
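One way to track convergence against the 10-point threshold is to compute the share of jointly scored claims on which two analysts agree within tolerance. The sketch below is illustrative only: the function name and the sample scores are hypothetical, and the paper does not prescribe a particular agreement statistic.

```python
from typing import Sequence


def within_tolerance_rate(scores_a: Sequence[float],
                          scores_b: Sequence[float],
                          tolerance: float = 10.0) -> float:
    """Share of claims where two analysts' scores differ by <= tolerance."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    agreed = sum(1 for a, b in zip(scores_a, scores_b)
                 if abs(a - b) <= tolerance)
    return agreed / len(scores_a)


# Three claims scored independently in one weekly calibration session
# (made-up numbers for illustration).
analyst_a = [78, 62, 45]
analyst_b = [70, 75, 50]
print(within_tolerance_rate(analyst_a, analyst_b))
```

Tracking this rate week over week gives a simple view of whether the rubric is being internalized; the claims that fall outside tolerance are the natural agenda for the next session.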

Score distribution is a useful diagnostic. In a well-run program, confidence scores across a batch of claims should follow a roughly normal distribution centered around 55–65. If the distribution is skewed strongly upward — with most claims scoring above 70 — the scoring is likely too lenient and should be recalibrated. If the distribution is skewed downward, the research team may be calling the wrong experts or extracting claims too loosely.
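The diagnostic described above can be approximated with two summary numbers. The sketch below makes several assumptions that the paper does not state: it uses the mean as the centre of the distribution, reads "most claims" as more than half, and returns hypothetical labels. It is a quick screen, not a test of normality.

```python
from statistics import mean
from typing import Sequence


def diagnose_distribution(scores: Sequence[float],
                          expected_low: float = 55.0,
                          expected_high: float = 65.0,
                          high_cutoff: float = 70.0) -> str:
    """Flag batches whose confidence scores sit outside the expected band."""
    if not scores:
        raise ValueError("scores must be non-empty")
    centre = mean(scores)
    share_high = sum(s > high_cutoff for s in scores) / len(scores)
    if share_high > 0.5 or centre > expected_high:
        return "too_lenient"   # scoring likely needs recalibration
    if centre < expected_low:
        return "skewed_low"    # wrong experts, or claims extracted too loosely
    return "in_band"


print(diagnose_distribution([72, 80, 75, 60]))
```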

For investment committee presentation, claims should be grouped by confidence tier: High Confidence (score 70+), Moderate Confidence (50–69), and Low Confidence / Hypothesis (below 50). This allows the committee to focus their challenge questions on the claims that most need it.
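The tiering rule is simple enough to express directly. In this sketch the function names and the sample claims are hypothetical, and fractional scores between 69 and 70 are assigned to the Moderate tier, which is an assumption since the paper states the bands in whole numbers.

```python
from typing import Dict, List


def confidence_tier(score: float) -> str:
    """Assign a confidence score to its presentation tier."""
    if score >= 70:
        return "High Confidence"
    if score >= 50:
        return "Moderate Confidence"
    return "Low Confidence / Hypothesis"


def group_by_tier(claim_scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Group claim labels by tier, preserving input order within each tier."""
    grouped: Dict[str, List[str]] = {}
    for claim, score in claim_scores.items():
        grouped.setdefault(confidence_tier(score), []).append(claim)
    return grouped


# Made-up claims for illustration.
memo_claims = {"cycle times shortened": 78, "pricing pressure": 55, "new entrant": 42}
print(group_by_tier(memo_claims))
```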

Limitations and Adjustments

The 5-factor model assumes that claims are discrete and extractable from conversational context. This is not always the case. Expert interviews are rich in embedded context, conversational implication, and non-linear narrative — none of which can be cleanly scored at the claim level. The model is designed for explicit claims, not for the texture of an expert's worldview.

Factor weights are calibrated for general investment research contexts. Specific use cases may require significant adjustment. In compliance-sensitive research programs, corroboration weight should likely increase to 30–35% to reflect the evidentiary standard required. In early-stage sector mapping, claim specificity may be less critical than source proximity, warranting a weight rebalance.
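Raising one weight means the others must shrink so that the total still sums to 100%. The paper does not say how the remaining weights should be redistributed; the sketch below assumes a proportional rescale, which is only one reasonable choice, and the function name is hypothetical.

```python
from typing import Dict

BASE_WEIGHTS = {
    "source_proximity": 0.25,
    "temporal_relevance": 0.20,
    "claim_specificity": 0.20,
    "corroboration": 0.20,
    "expert_track_record": 0.15,
}


def rebalance_weights(weights: Dict[str, float],
                      factor: str,
                      new_weight: float) -> Dict[str, float]:
    """Set one factor's weight and rescale the rest proportionally."""
    if factor not in weights:
        raise KeyError(factor)
    if not 0 < new_weight < 1:
        raise ValueError("new_weight must be between 0 and 1")
    others_total = sum(w for name, w in weights.items() if name != factor)
    scale = (1 - new_weight) / others_total
    return {name: (new_weight if name == factor else w * scale)
            for name, w in weights.items()}


# Compliance-sensitive program: corroboration raised to 30%.
compliance_weights = rebalance_weights(BASE_WEIGHTS, "corroboration", 0.30)
for name, weight in compliance_weights.items():
    print(f"{name}: {weight:.4f}")
```

Under this assumption Source Proximity falls from 25% to roughly 21.9% and remains the most heavily weighted factor after Corroboration. A team might instead prefer to take the extra weight from a single factor, which is a judgment call the framework leaves open.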

The Expert Track Record factor is the most difficult to implement without dedicated infrastructure. Research teams that do not maintain structured records of prior expert claims and verified outcomes will effectively score every expert at 3 on this factor — which neutralizes it. Building and maintaining an expert accuracy log is a prerequisite for getting value from Factor 5.
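A minimal expert accuracy log might look like the sketch below. Everything specific in it is an assumption made for illustration: the `ExpertRecord` structure, the minimum of three verified claims before departing from the default, and the 75% and 40% accuracy thresholds are hypothetical and are not taken from the framework, which defines the scores only qualitatively.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExpertRecord:
    expert_id: str
    # One entry per verified prior claim: True if directionally accurate.
    outcomes: List[bool] = field(default_factory=list)

    def log_outcome(self, accurate: bool) -> None:
        self.outcomes.append(accurate)


def track_record_score(record: ExpertRecord,
                       min_verified: int = 3,        # assumed threshold
                       reliable_rate: float = 0.75,  # assumed threshold
                       poor_rate: float = 0.40) -> int:  # assumed threshold
    """Score Factor 5 from logged outcomes; default to 3 without history."""
    if len(record.outcomes) < min_verified:
        return 3  # no usable track record: neutral prior
    hit_rate = sum(record.outcomes) / len(record.outcomes)
    if hit_rate >= reliable_rate:
        return 5
    if hit_rate <= poor_rate:
        return 1
    return 3


veteran = ExpertRecord("expert-a")  # hypothetical identifier
for accurate in (True, True, True, False):
    veteran.log_outcome(accurate)
print(track_record_score(veteran))
```

The point of the structure is less the thresholds than the habit: without a record of which claims were later verified, the function can only ever return the default.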

Consensus · EXP-028 · 89/100
Former Head of Due Diligence, Credit Fund

No scoring model eliminates judgment. What the model does is force you to make your judgment explicit and auditable. That alone is worth it.

The model is also not designed to score the absence of a claim. If an expert who should have direct knowledge of a phenomenon declines to comment, or conspicuously avoids a topic, that is analytically significant — but it cannot be scored within this framework. Analysts should note such patterns separately in their call summaries.

Conclusion

Confidence scoring is not a replacement for expert judgment — it is the infrastructure that makes expert judgment defensible. The 5-factor model does not tell analysts what to believe; it tells them how systematically they arrived at what they believe, and gives investment committees a basis for calibrated challenge rather than wholesale acceptance or rejection.

The model is a starting point, not a finished product. Research teams should treat the first 3–6 months of implementation as a calibration period: apply the framework consistently, track where analyst scores diverge, identify which factor definitions are ambiguous in practice, and adjust accordingly. The weights and definitions provided here are empirically grounded but not universally optimal.

The operational goal is consistency: two analysts scoring the same claim independently should arrive within 10 points of each other after a calibration period. When that threshold is achieved, the research program has built a shared epistemic standard — and the resulting intelligence becomes genuinely comparable across programs, sectors, and time.

Primary research is expensive, time-intensive, and operationally demanding. The value of that investment is only fully captured when the outputs are structured, auditable, and defensible. Confidence scoring is the mechanism that closes that gap.