What does the score actually mean?

The score is a 0-100 normalisation of a 7-criterion rubric. Each criterion scores 0, 1, or 2. The KR-level criteria (Outcome Form and Measurability) apply per Key Result, so a set with three KRs has more points in play than a set with one. The raw points are divided by the maximum available for that set and expressed as a 0-100 percentage.
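A minimal sketch of that arithmetic, assuming the five set-level criteria plus the two per-KR criteria described above; the exact rounding is an assumption rather than the tool's published code:

```typescript
// Sketch of the normalisation: 5 set-level criteria plus 2 KR-level
// criteria (Outcome Form, Measurability) scored once per Key Result,
// each worth 0-2 points. Rounding behaviour is an assumption.
function normaliseScore(rawPoints: number, krCount: number): number {
  const maxPoints = (5 + 2 * krCount) * 2;
  return Math.round((rawPoints / maxPoints) * 100);
}

// A three-KR set has (5 + 2 * 3) * 2 = 22 points in play,
// so 17 raw points normalises to 77.
normaliseScore(17, 3); // => 77
```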

Score ranges map to four tiers: 0-33 is Critical issues (core structural failures, the OKR cannot be tracked as written), 34-55 is Weak (gaps that will cause problems mid-quarter), 56-77 is Strong (solid foundation, minor sharpening needed), 78-100 is Excellent (committable as written). See the full methodology for how each criterion is scored and what the tier names mean in practice.
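For reference, the tier lookup in code form, using the boundaries listed above:

```typescript
// Maps a normalised 0-100 score to the tier names used on this page.
type Tier = "Critical issues" | "Weak" | "Strong" | "Excellent";

function tierFor(score: number): Tier {
  if (score <= 33) return "Critical issues";
  if (score <= 55) return "Weak";
  if (score <= 77) return "Strong";
  return "Excellent";
}

tierFor(55); // => "Weak"
tierFor(56); // => "Strong"
```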

What if the LLM gives a bad rewrite?

Rewrites are starting points, not final answers. The rubric is the source of truth: if a suggested rewrite still contains an output verb or lacks a baseline, it is still broken regardless of how polished it sounds. You can re-run the diagnosis on any rewrite by copying it into the input field. The rule engine scores it locally in under a second with no key required.
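If you want to approximate the two checks mentioned above yourself, here is a rough sketch; the verb list and the baseline pattern are illustrative guesses, not the engine's actual rules:

```typescript
// Rough approximation of two rubric checks: does a KR open with an
// output verb, and does it contain a baseline-to-target pattern?
// The verb list and the regex are illustrative, not the tool's rule set.
const OUTPUT_VERBS = ["launch", "ship", "complete", "migrate", "deliver", "implement"];

function startsWithOutputVerb(kr: string): boolean {
  const firstWord = kr.trim().toLowerCase().split(/\s+/)[0] ?? "";
  return OUTPUT_VERBS.includes(firstWord);
}

function hasBaselineAndTarget(kr: string): boolean {
  // Looks for "from <number> to <number>", e.g. "from 12% to 45%".
  return /from\s+[\d.,]+%?\s+to\s+[\d.,]+%?/i.test(kr);
}

startsWithOutputVerb("Launch the onboarding redesign"); // => true
hasBaselineAndTarget("Customers who change their own plan without contacting support, from 12% to 45%"); // => true
```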

LLMs occasionally produce fluent but structurally weak OKRs. The discipline is to apply the rubric to the rewrite, not to accept it because it reads well. A rewrite that scores below 56 needs another pass.

How much does Coach mode cost per session?

Approximately $0.02-$0.04 for a 6-10 turn session using GPT-4o-mini. Claude Sonnet costs roughly three to four times as much per session, around $0.08-$0.15 for the same conversation length. The rule-engine pre-score in Diagnose mode is free regardless of key. Only the LLM analysis step consumes API credit, at roughly $0.001-$0.015 per analysis depending on provider and model.

Coach mode cost scales with conversation length: it sends your full conversation history on each turn, so longer sessions cost more. A session that goes past 15 turns will cost noticeably more than one that reaches a draft in 6. The cost note in the input panel shows an indicative estimate before you start.
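A back-of-envelope illustration of why turn count matters: because each turn resends everything before it, input tokens grow roughly with the square of the turn count. The token count and price below are illustrative placeholders, not measurements of this tool or current provider rates.

```typescript
// Back-of-envelope cost estimate for a Coach session where each turn
// resends the full conversation history. Token sizes and prices are
// illustrative assumptions.
function estimateSessionCost(
  turns: number,
  tokensPerTurn: number,        // prompt + reply added each turn (assumed)
  inputPricePerMTokens: number  // dollars per million input tokens (assumed)
): number {
  let inputTokens = 0;
  for (let turn = 1; turn <= turns; turn++) {
    // On turn n the model re-reads everything from turns 1..n.
    inputTokens += turn * tokensPerTurn;
  }
  return (inputTokens / 1_000_000) * inputPricePerMTokens;
}

// With ~600 tokens added per turn, a 6-turn session re-reads ~12,600
// input tokens; a 15-turn session re-reads ~72,000, nearly six times as many.
estimateSessionCost(6, 600, 0.15);
estimateSessionCost(15, 600, 0.15);
```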

What makes an OKR bad?

The most reliable signal is Key Results that describe work instead of results. "Launch the onboarding redesign," "Migrate to the new platform," "Complete user research": those are tasks. They belong in a sprint backlog. A Key Result should describe what changes for a real person after the work is done, not the work itself.

The second most common failure is vagueness that passes for ambition. "Improve customer satisfaction" sounds like a goal. It is not. Without a baseline, a target, and a data source, it is a field name. You could hit it or miss it and never know which. A rubric check catches both problems inside 60 seconds, which is faster than most team OKR review conversations.

Can an Objective be measurable?

Yes, and that is actually a sign of a well-written one. The OKR framework reserves numbers for Key Results, but there is no rule against an Objective that includes a specific, observable condition. "Cut median PR cycle time so teams can ship daily by end of Q3" is directional and measurable.

The distinction that matters is the level of abstraction. The Objective names the desired state; the KRs prove it was reached. If your Objective is specific enough to function as its own KR, consider whether it is actually describing the outcome rather than the aspiration. That is not wrong; it is just a different structure from the classic format.

What's wrong with "launch X" as a Key Result?

"Launch X" is Output-as-KR. It describes an action your team takes, not a change that happens in the world because you took it. The test is simple: if you complete the KR and nothing changes for any real user, it is not a KR.

Every launch is a bet on an outcome. Write the outcome instead. If you are launching a self-service billing portal, the outcome might be "customers who change their own plan without contacting support, from 12% to 45%." Now you can tell at week 8 whether the launch actually worked. A "launch by date" KR gives you a binary green on delivery day and no information on whether the bet paid off.

The fix is one question: what will be different for users after this ships? Write that.
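One way to see the gap is as a data shape: a hypothetical outcome KR carries fields that a launch-by-date string simply does not have. The field names and the source value below are illustrative, not the tool's schema.

```typescript
// Hypothetical shape of an outcome KR, versus a launch-by-date string.
// Field names and the example source are illustrative assumptions.
interface OutcomeKR {
  actor: string;      // who changes behaviour
  behaviour: string;  // what they do differently
  baseline: number;   // where the metric starts
  target: number;     // where it should end up
  source: string;     // where the number is read from
}

const outcomeKR: OutcomeKR = {
  actor: "customers",
  behaviour: "change their own plan without contacting support",
  baseline: 0.12,
  target: 0.45,
  source: "billing portal events",
};

// Compare: "Launch the self-service billing portal by 30 September"
// carries none of these fields, so week-8 progress cannot be read off it.
```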

How do you avoid vanity metrics?

Name the actor and the specific action. "Increase engagement by 25%" fails because engagement is undefined. Engagement of what, by whom, on which surface, compared to when? A vanity metric is any number that can grow while the thing you actually care about stays flat or gets worse.

The replacement test: can you imagine a plausible scenario where this metric goes up and the business gets worse? If yes, it is a vanity metric. Pageviews go up when you publish low-quality content that attracts the wrong audience. Email open rates go up when you send panic-inducing subject lines that erode trust. Swap the vanity metric for the behaviour change it was supposed to proxy: "blog readers who start a free trial, from 1.4% to 3.2%" cannot be gamed the same way.

I have seen teams celebrate a 40% engagement increase in a quarter where net revenue retention dropped. They were measuring the wrong thing. The rubric flags this pattern as Vanity Metric and asks you to name the actor.

Why does ambition matter for an OKR?

OKRs that are guaranteed to succeed are planning theatre. If the target is set at a level the team would reach anyway without changing how they work, the OKR is not driving anything. It is just a progress report in a different format.

Ambition matters because it forces a conversation about what would have to be true for this to happen. "Triple the activation rate" requires the team to rethink the onboarding funnel. "Grow activation by 5%" permits incremental tweaking. The first conversation is more interesting and more likely to produce meaningful change.

The calibration question is: if we hit 70% of this without changing how we work, would we be satisfied? If yes, the target is probably too low. If hitting 70% would feel like a significant achievement, the target is probably in the right range.

Can a team have too many OKRs?

Yes, and most teams do. Three to five Key Results per Objective is the practical ceiling. Beyond that, the set stops being a prioritisation mechanism and starts being a commitment catalogue. If everything is a Key Result, nothing is.

The more useful question is whether each KR passes the so-what test: if this KR turns red, does the team drop everything to investigate? If the answer is "we would notice but carry on," the KR is not important enough to be in the set. Teams that write seven or eight KRs are usually hedging, not planning. They are including everything that might be relevant rather than committing to what definitely is.

A set of three sharp KRs that genuinely cover the Objective is worth more than seven KRs where two do the heavy lifting and five are there for coverage.

How often should we score OKRs?

Weekly check-ins on KR progress, formal scoring at the mid-point and end of the cycle. The weekly check is not a scoring exercise; it is a signal check. Are the numbers moving? If not, why not? The mid-point score is where you decide whether to adjust scope, targets, or approach. The end-of-cycle score is the retrospective input.

The failure mode is teams that skip the mid-point review and discover at the end of the quarter that two KRs were never measurable because the instrumentation was never built. A mid-cycle check forces that conversation at week 6 rather than week 13.

Scoring format matters less than cadence. A 0-1 confidence rating in a weekly standup is more useful than a quarterly scoring ceremony that no one reads.

Is "score 7 out of 10" a good Key Result?

No. "Score 7 out of 10" is almost always a Vanity Metric or a Placeholder in disguise. The first question is: 7 out of 10 on what? If it is an NPS survey or CSAT form, that might be a real metric with a real baseline. If it is a manager's subjective rating of a process, it is not a metric at all.

The deeper problem is that point-scale scores on non-standardised instruments are gameable and not actionable. If you score 6.8 instead of 7, what do you do differently? If you hit 7, what changed? The metric needs to be tied to a specific actor, a specific behaviour, and a specific source before it functions as a KR.

The test I apply: could two different team members independently verify this score using the same data source? If yes, it might work. If one person would give it a 6.5 and another a 7.2 using the same evidence, the metric is not specific enough to be a KR.