Story Sizing
Status: Complete
Category: Delivery
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: delivery, planning
- Methodology: Agile, Scrum
- Skillset: any
- Technology: generic
- Stage: refinement
Summary
Story sizing is the practice of estimating the relative effort or complexity of work items as a team, to support sprint planning and flow management. The number produced is useful, but it is secondary to the conversation that produces it: sizing discussions surface hidden complexity, expose untested assumptions, align understanding of scope, and reveal disagreements before they become mid-sprint surprises.
This practice is Advisory because the specific technique (story points, T-shirts, Fibonacci) matters less than the underlying discipline of collaborative estimation with well-defined acceptance criteria.
Rationale
Estimation as a communication tool
The primary value of sizing a story is not the number. It is the conversation that happens when the team disagrees. When one engineer estimates 2 points and another estimates 13, something important is happening: they have a different mental model of what the story involves. That disagreement, surfaced and resolved before work begins, saves hours of course-correction mid-sprint.
Teams that skip sizing because stories seem “obvious” systematically deny themselves this signal. The stories that seem most obvious are often the ones with the most hidden dependencies and unspoken assumptions.
Relative estimation
Good sizing is relative, not absolute. Story points (or any relative unit) describe complexity and uncertainty compared to other stories the team has sized before — not predicted calendar time. A 3-point story isn’t “3 hours”; it’s “about as complex as the last story we agreed was a 3”. This distinction matters because:
- Humans are poor at estimating absolute duration but reasonably good at relative comparison
- Relative estimates are honest about uncertainty; hour-based estimates create a false precision that always proves wrong
- Teams that size in hours spend sprints arguing about estimation accuracy rather than improving their processes
The Fibonacci sequence (1, 2, 3, 5, 8, 13, 21) is common because the increasing gaps encode increasing uncertainty: we can predict a 2 more accurately than an 8, and the widening scale reflects that. The numbers encourage teams to bucket stories rather than pretend to precision they don’t have.
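The bucketing idea can be sketched in a few lines of Python; the helper name and the raw “gut feel” input are purely illustrative, not part of any prescribed technique:

```python
# Illustrative helper: snap a raw gut-feel number onto the
# gapped planning scale instead of keeping false precision.
BUCKETS = (1, 2, 3, 5, 8, 13, 21)

def nearest_bucket(raw: float) -> int:
    """Round an estimate to the closest bucket on the scale."""
    return min(BUCKETS, key=lambda b: abs(b - raw))

nearest_bucket(6)  # -> 5: in-between values get a discussion, not a decimal
```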
Velocity as a planning tool, not a performance metric
Once a team has sized consistently for several sprints, velocity — the average number of points completed per sprint — becomes a reliable planning tool. The team can forecast how much work it can take on per sprint with reasonable confidence. This is the compounding return on consistent sizing: better plans, more predictable delivery, fewer overloaded sprints.
Velocity is not a performance metric to be optimised or compared across teams. A team with velocity 40 and another with velocity 20 cannot be meaningfully compared without knowing how each team sizes. Using velocity as a productivity measure destroys its utility as a planning tool.
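As an illustration, the forecasting arithmetic is just an average and a ceiling division. The sprint totals below are made up; substitute your own history:

```python
import math

# Hypothetical completed-point totals for the last five sprints.
completed_points = [21, 18, 24, 20, 22]

# Velocity: average points completed per sprint.
velocity = sum(completed_points) / len(completed_points)  # 21.0

# Forecast: sprints needed for a 60-point slice of backlog.
backlog_points = 60
sprints_needed = math.ceil(backlog_points / velocity)  # 3

print(f"velocity ~ {velocity:.1f} points/sprint, "
      f"~{sprints_needed} sprints for {backlog_points} points")
```

The forecast is only as good as the consistency of the sizing behind it, which is why velocity only works as a team-internal planning tool.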
The DoR dependency
Sizing a story you don’t understand is not estimation; it is guessing. Meaningful sizing requires the story to meet the Definition of Ready first: clear context, stated acceptance criteria, and identified dependencies. Teams that try to size under-specified stories end up with either false consensus (everyone picks 3 because they don’t want to admit they don’t know) or paralysis.
Guidance
Common sizing techniques
Planning Poker
The most widely used technique. Each team member independently selects a card with their estimate, then all reveal simultaneously. This prevents anchoring — if one person says “5” first, everyone else is biased toward that number. Simultaneous reveal forces independent thinking.
Steps:
- Product owner or facilitator reads the story aloud
- Team asks clarifying questions
- Everyone selects an estimate privately (cards face-down)
- All reveal simultaneously
- Outliers (highest and lowest) explain their reasoning
- Discuss, then re-estimate until convergence
No story should be sized without first passing the DoR. If clarifying questions reveal the story isn’t well enough defined, pause and refine it before sizing.
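The reveal-and-discuss loop above can be sketched as a small Python simulation; the names, estimates, and helper are hypothetical, shown only to make the procedure concrete:

```python
# Hypothetical round: estimates are collected privately, then
# revealed at once so nobody anchors on the first number spoken.
FIBONACCI_DECK = (1, 2, 3, 5, 8, 13, 21)

def reveal(estimates: dict[str, int]) -> dict:
    """Summarise a simultaneous reveal: did we converge, and
    which outliers (lowest and highest) should explain first?"""
    low = min(estimates, key=estimates.get)
    high = max(estimates, key=estimates.get)
    converged = estimates[low] == estimates[high]
    return {"converged": converged,
            "outliers": None if converged else (low, high)}

result = reveal({"Alex": 3, "Priya": 5, "Jordan": 8})
# Not converged: Alex and Jordan explain their reasoning,
# the team discusses, then everyone re-estimates.
```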
T-shirt sizes
Simpler than Fibonacci — XS, S, M, L, XL. Useful for early backlog grooming when stories are rough and the team isn’t ready to debate the difference between 5 and 8. T-shirt sizes can be mapped to a numeric scale later when stories are refined.
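A later mapping might look like the following sketch; the point values are an assumption and should be calibrated to the team’s own scale:

```python
# Assumed T-shirt-to-points mapping; calibrate to your own scale.
TSHIRT_TO_POINTS = {"XS": 1, "S": 2, "M": 3, "L": 5, "XL": 8}

def to_points(size: str) -> int:
    """Convert a rough T-shirt size to a numeric estimate."""
    return TSHIRT_TO_POINTS[size.upper()]
```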
Bucket system
Useful for large backlogs. Establish buckets (e.g. 1, 2, 3, 5, 8, 13) and rapidly bin stories into buckets. Less discussion per story, faster throughput. Best for initial prioritisation rather than sprint-level precision.
What to estimate
Size the story as a whole, including all the work needed to meet the Definition of Done:
- Implementation
- Tests
- Code review
- Documentation updates
- Deployment and smoke-test on staging
Teams that size only the “coding” part and then wonder why stories take longer than estimated are forgetting that every story includes non-coding work.
When to re-estimate
Resize a story when:
- Its scope changes materially during refinement
- A dependency is discovered that wasn’t known during the original sizing
- A technical spike reveals the original estimate was wrong
Do not resize during a sprint to adjust for mid-sprint discoveries. If a story turns out larger than estimated, flag it, split it if possible, or carry it forward — don’t retroactively adjust your velocity.
Handling disagreement
Persistent disagreement (outliers who don’t converge after discussion) is usually a signal that:
- The story needs more refinement (most common)
- The team has genuinely different technical approaches in mind (worth discussing)
- The story is too large and should be split
The resolution is almost never “pick the average”. Either discuss until genuine consensus is reached, or defer the story for further refinement.
Examples
Example — Planning poker round
Story: As a user, I want to receive an email confirmation after signing up so that I know my account was created.
AC: Email sent within 30 seconds of successful registration; contains user’s email address, a link to the dashboard, and a note about verifying their email.
| Developer | Estimate | Reasoning |
|---|---|---|
| Alex | 3 | Straightforward — email service is already wired up |
| Priya | 5 | We need to template the email, test delivery, handle retries if the queue fails |
| Jordan | 8 | Doesn’t HTML email rendering need separate testing across clients? |
Result: Discussion reveals Jordan’s concern about email client compatibility — a legitimate requirement that wasn’t in the AC. Team agrees to scope the story to plain-text with a single template, no HTML. Re-estimates converge on 3.
The sizing conversation prevented scope creep and surfaced a valid design decision before coding started.
Example — T-shirt to Fibonacci evolution
| Sprint | Approach | Notes |
|---|---|---|
| Sprint 1–3 | T-shirt: S/M/L | New team, stories still rough, building shared vocabulary |
| Sprint 4+ | Fibonacci: 1–13 | Team has shared calibration; switching to points for velocity tracking |
Example — Stories that fail sizing
| Problem | Symptom | What to do |
|---|---|---|
| Story too large | Team can’t agree; everyone estimates 13+ | Split the story |
| Story too vague | Estimates vary from 1 to 13 with no clear reasoning | Return to backlog for refinement |
| Technical complexity unknown | Team wants a spike first | Create a time-boxed spike story (size the spike, not the unknown work) |
Anti-patterns
1. Treating story points as hours
“A story point equals one hour” (or half a day, or any fixed duration) destroys the value of relative estimation. The moment points are mapped to time, teams start sandbagging estimates and managers start using velocity as a time sheet. Story points are a planning unit, not a time unit.
2. Sizing without understanding acceptance criteria
Estimating a story with no AC is guessing, not sizing. The estimate will be wrong and the discussion will be short. The DoR exists precisely to prevent this. If a story doesn’t have acceptance criteria, don’t size it — refine it first.
3. Not revealing simultaneously (anchoring)
One person says “I’d say about a 5” and suddenly everyone else picks 5 too. This is anchoring, and it eliminates the value of independent estimation. Always reveal simultaneously, every time.
4. Skipping “obvious” stories
“This is clearly a 1, let’s not waste a card on it.” This is how 1-point stories turn into 5-point stories mid-sprint. Size everything. If it really is a 1 after a 30-second conversation, that’s fine — but have the conversation.
5. Using velocity as a performance benchmark
Comparing team A’s velocity to team B’s, or this sprint’s velocity to last sprint’s, as a measure of productivity is a misuse of the metric. Velocity is only meaningful in context: a team’s own historical baseline, with consistent sizing practices and consistent team composition. Velocity inflation (artificially high estimates) is a direct response to velocity-as-performance-pressure.
6. Sizing tasks, not stories
Sizing individual tasks (“write the unit tests” = 1 point) misses the point. Size user stories — units of user value — not implementation tasks. Tasks are the internal decomposition; stories are the planning unit.
7. No calibration story
Without a reference point, sizing is arbitrary. Most teams benefit from keeping a “reference story” — a past story of known complexity that they can compare new stories against. “How does this compare to the login page we built in sprint 2?” grounds estimates in shared team experience.
Related practices
Part of the PushBackLog Best Practices Library.