Story Sizing
Status: Complete
Category: Delivery
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: delivery, planning
- Methodology: Agile, Scrum
- Skillset: any
- Technology: generic
- Stage: refinement
Summary
Story sizing is the practice of estimating the relative effort or complexity of work items as a team, to support sprint planning and flow management. The number produced is useful, but it is secondary to the conversation that produces it: sizing discussions surface hidden complexity, expose untested assumptions, align understanding of scope, and reveal disagreements before they become mid-sprint surprises.
This practice is Advisory because the specific technique (story points, T-shirts, Fibonacci) matters less than the underlying discipline of collaborative estimation with well-defined acceptance criteria.
Rationale
Estimation as a communication tool
The primary value of sizing a story is not the number. It is the conversation that happens when the team disagrees. When one engineer estimates 2 points and another estimates 13, something important is happening: they have a different mental model of what the story involves. That disagreement, surfaced and resolved before work begins, saves hours of course-correction mid-sprint.
Teams that skip sizing because stories seem “obvious” systematically deny themselves this signal. The stories that seem most obvious are often the ones with the most hidden dependencies and unspoken assumptions.
Relative estimation
Good sizing is relative, not absolute. Story points (or any relative unit) describe complexity and uncertainty compared to other stories the team has sized before — not predicted calendar time. A 3-point story isn’t “3 hours”; it’s “about as complex as the last story we agreed was a 3”. This distinction matters because:
- Humans are poor at estimating absolute duration but reasonably good at relative comparison
- Relative estimates are honest about uncertainty; hour-based estimates create a false precision that always proves wrong
- Teams that size in hours spend sprints arguing about estimation accuracy rather than improving their processes
The Fibonacci sequence (1, 2, 3, 5, 8, 13, 21) is common because the increasing gaps encode increasing uncertainty: we can predict a 2 more accurately than an 8, and the widening scale reflects that. The numbers encourage teams to bucket stories rather than pretend to precision they don’t have.
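The bucketing idea can be sketched in a few lines of Python; the helper name and the raw “gut feel” input are purely illustrative, not part of any prescribed technique:

```python
# Illustrative helper: snap a raw gut-feel number onto the
# gapped planning scale instead of keeping false precision.
BUCKETS = (1, 2, 3, 5, 8, 13, 21)

def nearest_bucket(raw: float) -> int:
    """Round an estimate to the closest bucket on the scale."""
    return min(BUCKETS, key=lambda b: abs(b - raw))

nearest_bucket(6)  # -> 5: in-between values get a discussion, not a decimal
```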
Velocity as a planning tool, not a performance metric
Once a team has sized consistently for several sprints, velocity — the average number of points completed per sprint — becomes a reliable planning tool. The team can forecast how much work it can take on per sprint with reasonable confidence. This is the compounding return on consistent sizing: better plans, more predictable delivery, fewer overloaded sprints.
Velocity is not a performance metric to be optimised or compared across teams. A team with velocity 40 and another with velocity 20 cannot be meaningfully compared without knowing how each team sizes. Using velocity as a productivity measure destroys its utility as a planning tool.
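As an illustration, the forecasting arithmetic is just an average and a ceiling division. The sprint totals below are made up; substitute your own history:

```python
import math

# Hypothetical completed-point totals for the last five sprints.
completed_points = [21, 18, 24, 20, 22]

# Velocity: average points completed per sprint.
velocity = sum(completed_points) / len(completed_points)  # 21.0

# Forecast: sprints needed for a 60-point slice of backlog.
backlog_points = 60
sprints_needed = math.ceil(backlog_points / velocity)  # 3

print(f"velocity ~ {velocity:.1f} points/sprint, "
      f"~{sprints_needed} sprints for {backlog_points} points")
```

The forecast is only as good as the consistency of the sizing behind it, which is why velocity only works as a team-internal planning tool.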
The DoR dependency
Sizing a story you don’t understand is not estimation; it is guessing. Meaningful sizing requires the story to meet the Definition of Ready first: clear context, stated acceptance criteria, and identified dependencies. Teams that try to size under-specified stories end up with either false consensus (everyone picks 3 because they don’t want to admit they don’t know) or paralysis.
Guidance
Common sizing techniques
Planning Poker
The most widely used technique. Each team member independently selects a card with their estimate, then all reveal simultaneously. This prevents anchoring — if one person says “5” first, everyone else is biased toward that number. Simultaneous reveal forces independent thinking.
Steps:
- Product owner or facilitator reads the story aloud
- Team asks clarifying questions
- Everyone selects an estimate privately (cards face-down)
- All reveal simultaneously
- Outliers (highest and lowest) explain their reasoning
- Discuss, then re-estimate until convergence
No story should be sized without first passing the DoR. If clarifying questions reveal the story isn’t well enough defined, pause and refine it before sizing.
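The reveal-and-discuss loop above can be sketched as a small Python simulation; the names, estimates, and helper are hypothetical, shown only to make the procedure concrete:

```python
# Hypothetical round: estimates are collected privately, then
# revealed at once so nobody anchors on the first number spoken.
FIBONACCI_DECK = (1, 2, 3, 5, 8, 13, 21)

def reveal(estimates: dict[str, int]) -> dict:
    """Summarise a simultaneous reveal: did we converge, and
    which outliers (lowest and highest) should explain first?"""
    low = min(estimates, key=estimates.get)
    high = max(estimates, key=estimates.get)
    converged = estimates[low] == estimates[high]
    return {"converged": converged,
            "outliers": None if converged else (low, high)}

result = reveal({"Alex": 3, "Priya": 5, "Jordan": 8})
# Not converged: Alex and Jordan explain their reasoning,
# the team discusses, then everyone re-estimates.
```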
T-shirt sizes
Simpler than Fibonacci — XS, S, M, L, XL. Useful for early backlog grooming when stories are rough and the team isn’t ready to debate the difference between 5 and 8. T-shirt sizes can be mapped to a numeric scale later when stories are refined.
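A later mapping might look like the following sketch; the point values are an assumption and should be calibrated to the team’s own scale:

```python
# Assumed T-shirt-to-points mapping; calibrate to your own scale.
TSHIRT_TO_POINTS = {"XS": 1, "S": 2, "M": 3, "L": 5, "XL": 8}

def to_points(size: str) -> int:
    """Convert a rough T-shirt size to a numeric estimate."""
    return TSHIRT_TO_POINTS[size.upper()]
```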
Bucket system
Useful for large backlogs. Establish buckets (e.g. 1, 2, 3, 5, 8, 13) and rapidly bin stories into buckets. Less discussion per story, faster throughput. Best for initial prioritisation rather than sprint-level precision.
What to estimate
Size the story as a whole, including all the work needed to meet the Definition of Done:
- Implementation
- Tests
- Code review
- Documentation updates
- Deployment and smoke-test on staging
Teams that size only the “coding” part and then wonder why stories take longer than estimated are forgetting that every story includes non-coding work.
When to re-estimate
Resize a story when:
- Its scope changes materially during refinement
- A dependency is discovered that wasn’t known during the original sizing
- A technical spike reveals the original estimate was wrong
Do not resize during a sprint to adjust for mid-sprint discoveries. If a story turns out larger than estimated, flag it, split it if possible, or carry it forward — don’t retroactively adjust your velocity.
Handling disagreement
Persistent disagreement (outliers who don’t converge after discussion) is usually a signal that:
- The story needs more refinement (most common)
- The team has genuinely different technical approaches in mind (worth discussing)
- The story is too large and should be split
The resolution is almost never “pick the average”. Either discuss until genuine consensus is reached, or defer the story for further refinement.
Examples
Example — Planning poker round
Story: As a user, I want to receive an email confirmation after signing up so that I know my account was created.
AC: Email sent within 30 seconds of successful registration; contains user’s email address, a link to the dashboard, and a note about verifying their email.
| Developer | Estimate | Reasoning |
|---|---|---|
| Alex | 3 | Straightforward — email service is already wired up |
| Priya | 5 | We need to template the email, test delivery, handle retries if the queue fails |
| Jordan | 8 | Doesn’t HTML email rendering need separate testing across clients? |
Result: Discussion reveals Jordan’s concern about email client compatibility — a legitimate requirement that wasn’t in the AC. Team agrees to scope the story to plain-text with a single template, no HTML. Re-estimates converge on 3.
The sizing conversation prevented scope creep and surfaced a valid design decision before coding started.
Example — T-shirt to Fibonacci evolution
| Sprint | Approach | Notes |
|---|---|---|
| Sprint 1–3 | T-shirt: S/M/L | New team, stories still rough, building shared vocabulary |
| Sprint 4+ | Fibonacci: 1–13 | Team has shared calibration; switching to points for velocity tracking |
Example — Stories that fail sizing
| Problem | Symptom | What to do |
|---|---|---|
| Story too large | Team can’t agree; everyone estimates 13+ | Split the story |
| Story too vague | Estimates vary from 1 to 13 with no clear reasoning | Return to backlog for refinement |
| Technical complexity unknown | Team wants a spike first | Create a time-boxed spike story (size the spike, not the unknown work) |
Anti-patterns
1. Treating story points as hours
“A story point equals one hour” (or half a day, or any fixed duration) destroys the value of relative estimation. The moment points are mapped to time, teams start sandbagging estimates and managers start using velocity as a time sheet. Story points are a planning unit, not a time unit.
2. Sizing without understanding acceptance criteria
Estimating a story with no AC is guessing, not sizing. The estimate will be wrong and the discussion will be short. The DoR exists precisely to prevent this. If a story doesn’t have acceptance criteria, don’t size it — refine it first.
3. Not revealing simultaneously (anchoring)
One person says “I’d say about a 5” and suddenly everyone else picks 5 too. This is anchoring, and it eliminates the value of independent estimation. Always reveal simultaneously, every time.
4. Skipping “obvious” stories
“This is clearly a 1, let’s not waste a card on it.” This is how 1-point stories turn into 5-point stories mid-sprint. Size everything. If it really is a 1 after a 30-second conversation, that’s fine — but have the conversation.
5. Using velocity as a performance benchmark
Comparing team A’s velocity to team B’s, or this sprint’s velocity to last sprint’s, as a measure of productivity is a misuse of the metric. Velocity is only meaningful in context: a team’s own historical baseline, with consistent sizing practices and consistent team composition. Velocity inflation (artificially high estimates) is a direct response to velocity-as-performance-pressure.
6. Sizing tasks, not stories
Sizing individual tasks (“write the unit tests” = 1 point) misses the point. Size user stories — units of user value — not implementation tasks. Tasks are the internal decomposition; stories are the planning unit.
7. No calibration story
Without a reference point, sizing is arbitrary. Most teams benefit from keeping a “reference story” — a past story of known complexity that they can compare new stories against. “How does this compare to the login page we built in sprint 2?” grounds estimates in shared team experience.
Related practices
Part of the PushBackLog Best Practices Library.