Knowledge Management
Status: Complete
Category: Management
Default enforcement: Advisory
Author: PushBackLog team
Tags
- Topic: management, documentation, culture
- Skillset: engineering-management, engineering
- Technology: generic
- Stage: planning, operations
Summary
Knowledge management is the practice of systematically capturing, organising, sharing, and maintaining the information that engineers and teams need to do their work effectively. The failure mode is not ignorance — it is knowledge that exists but is inaccessible: buried in a Slack thread from two years ago, stored in one senior engineer’s head, or written in a stale wiki that nobody trusts. Effective knowledge management prevents silos, accelerates onboarding, reduces key-person risk, and makes the team resilient to turnover.
Rationale
Knowledge silos are a reliability risk
When a critical system is understood only by one engineer, that engineer is a single point of failure. Their absence — holiday, illness, resignation — creates an operational gap. Teams with good knowledge management have collective understanding: any team member can diagnose a production issue, deploy the service, or explain its data model. This is not achieved through documentation alone; it requires deliberate practices including pair programming, code review, written decisions, and shared runbooks.
Undiscoverable knowledge is the same as missing knowledge
The common failure is not that knowledge was never written down; it is that it was written somewhere no one looks: a Jira comment from 2021, a Google Doc shared with one person, a Slack thread buried in channel history. Effective knowledge management requires both content creation (writing it down) and discoverability (making it findable). A single entry point, search, and consistent structure matter as much as the volume of content.
Guidance
The knowledge management stack
| Layer | What lives here | Tooling | Maintenance |
|---|---|---|---|
| Code | Current system behaviour | Git repository | Automated |
| ADRs | Why decisions were made | Git repository (/docs/adr/) | Written with decisions |
| Runbooks | How to operate and recover | Git repository (/docs/runbooks/) | Updated with incidents |
| API docs | How to use interfaces | Co-located with code | Updated with code |
| Team wiki | How the team works, processes, meeting notes | Notion / Confluence | Maintained by team |
| Onboarding guide | How to get started | Git or wiki | Reviewed quarterly |
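The table places ADRs and runbooks in the repository under `/docs/adr/` and `/docs/runbooks/`. A check like the one below could run in CI to catch a service repository that is missing either. This is a minimal sketch, assuming one repository per service and that any Markdown file in those folders counts as content. `check_doc_layout` and `EXPECTED_DIRS` are hypothetical names, not part of any existing tool.

```python
import tempfile
from pathlib import Path

# Locations from the stack table, relative to the repository root (assumed layout).
EXPECTED_DIRS = ("docs/adr", "docs/runbooks")


def check_doc_layout(repo_root: Path) -> list[str]:
    """Return a list of problems; an empty list means both folders exist and hold Markdown files."""
    problems = []
    for rel in EXPECTED_DIRS:
        directory = repo_root / rel
        if not directory.is_dir():
            problems.append(f"missing directory: {rel}/")
        elif not any(directory.glob("*.md")):
            # The folder exists but holds no Markdown documents yet.
            problems.append(f"no Markdown files in: {rel}/")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "docs" / "adr").mkdir(parents=True)
        (root / "docs" / "adr" / "0001-record-architecture-decisions.md").write_text(
            "# 1. Record architecture decisions\n"
        )
        # Only the ADR folder exists, so the missing runbooks folder is reported.
        print(check_doc_layout(root))
```

Failing the build on a non-empty result keeps the check cheap; a softer variant could post the findings as a PR comment instead.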
Preventing knowledge silos
Pair programming and code review are the primary mechanisms for spreading knowledge across the team — not documentation. Documentation captures state; pairing transfers understanding and reasoning. See Pair & Mob Programming and Code Review Best Practices.
Bus factor is the minimum number of people who must be hit by a bus for a project to become incapable of proceeding. Track it:
Bus Factor Audit (quarterly)
| Component | Primary owner | Secondary (knows it well) | Bus factor |
|---|---|---|---|
| Auth service | Christy | Marlene | 2 |
| Billing module | Todd | - | ⚠️ 1 |
| Infra/Terraform | Marcus | - | ⚠️ 1 |
Bus factor 1 is a risk. Plan pairing sessions or documentation sprints to address it.
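The audit table can be generated from a simple ownership map instead of being maintained by hand. This is a minimal sketch, assuming a component's bus factor is approximated as the number of distinct people who know it well (primary owner plus secondaries), which is how the example table counts it. The names come from the example audit; `Component` and `bus_factor_report` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    primary: str
    secondaries: list[str] = field(default_factory=list)

    @property
    def bus_factor(self) -> int:
        # Distinct people who know the component well; losing all of them
        # leaves nobody able to work on it. A set avoids double-counting.
        return len({self.primary, *self.secondaries})


def bus_factor_report(components: list[Component], threshold: int = 1) -> list[str]:
    """Return names of components at or below the risk threshold."""
    return [c.name for c in components if c.bus_factor <= threshold]


if __name__ == "__main__":
    audit = [
        Component("Auth service", "Christy", ["Marlene"]),
        Component("Billing module", "Todd"),
        Component("Infra/Terraform", "Marcus"),
    ]
    for c in audit:
        flag = "RISK " if c.bus_factor <= 1 else ""
        print(f"{c.name:<16} {flag}{c.bus_factor}")
    print("Needs a mitigation plan:", bus_factor_report(audit))
```

Keeping the ownership map in the repository means the quarterly audit becomes a diff review rather than a fresh survey.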
Writing for discoverability
The most important choice is where to put information such that it will be found. Rules of thumb:
- Operational procedures belong in the repository alongside the system they describe
- Decision rationale belongs in ADRs linked from the relevant code or PR
- Team processes (sprint ceremonies, escalation paths, on-call setup) belong in the team wiki
- Temporary context (where are we on this project, what’s the current plan) belongs in Jira/Linear tickets and sprint docs — not in permanent knowledge stores
Avoid “general” wikis with unclear organisation. Every page needs a clear home; if it’s hard to decide where something goes, the structure is wrong.
Documentation maintenance cadence
Documentation that is never reviewed becomes incorrect over time. Define a maintenance cadence:
| Document type | Review cadence | Trigger for out-of-cycle review |
|---|---|---|
| Runbooks | After every incident that uses them | Service change affecting the procedure |
| Onboarding guide | Quarterly | New-hire feedback during onboarding |
| Architecture docs | When architecture changes | New ADR |
| Security procedures | Quarterly | Security incident or audit |
| Team wiki pages | Annually | “Document health check” sprint |
Assign page ownership. A page with no named owner will not be maintained.
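The calendar-based rows of the cadence table, together with the ownership rule, can be checked automatically. This is a minimal sketch, assuming each page records an owner and a last-reviewed date (for example in front matter) and approximating "quarterly" as 91 days and "annually" as 365 days. Event-triggered reviews (runbooks after incidents, architecture docs after a new ADR) are deliberately not modelled. `DocPage`, `REVIEW_INTERVAL`, and `audit_pages` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Calendar-based cadences from the table; doc types not listed here
# (e.g. runbooks, architecture docs) are reviewed on events, not dates.
REVIEW_INTERVAL = {
    "onboarding": timedelta(days=91),  # quarterly
    "security": timedelta(days=91),    # quarterly
    "wiki": timedelta(days=365),       # annually
}


@dataclass
class DocPage:
    title: str
    doc_type: str
    owner: Optional[str]
    last_reviewed: date


def audit_pages(pages: list[DocPage], today: date) -> list[str]:
    """Return findings for pages with no owner or an overdue calendar review."""
    findings = []
    for page in pages:
        if not page.owner:
            findings.append(f"{page.title}: no named owner")
        interval = REVIEW_INTERVAL.get(page.doc_type)
        age = today - page.last_reviewed
        if interval is not None and age > interval:
            findings.append(f"{page.title}: review overdue by {(age - interval).days} days")
    return findings


if __name__ == "__main__":
    pages = [
        DocPage("Onboarding guide", "onboarding", "Christy", date(2024, 1, 10)),
        DocPage("On-call setup", "wiki", None, date(2024, 3, 1)),
    ]
    for finding in audit_pages(pages, today=date(2024, 6, 1)):
        print(finding)
```

Each finding maps naturally onto a ticket assigned to the page owner, or to the team lead when the page has no owner.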
Onboarding as a knowledge management signal
A new engineer’s onboarding experience is the best test of knowledge management quality. If they can read the documentation and be productive in one week, the knowledge is accessible and current. If they need hours of guided help from a senior engineer, knowledge is concentrated and undocumented.
After every new hire onboarding, run a retrospective:
- What was missing from the documentation?
- What was confusing?
- What took longer than it should have?
Treat onboarding friction as actionable findings. File tickets. Fix the docs.
Lightweight knowledge capture habits
Sustainable knowledge management is not about writing long documents. It is about making small captures habitual:
- After debugging a non-obvious issue: write one paragraph in the runbook describing the symptom, root cause, and fix
- After every architecture decision: write a one-page ADR
- When explaining something on Slack for the second time: convert the explanation to a wiki page and link to it
- When a new engineer asks a question you didn’t have an answer for: document it after you find the answer
Small, consistent captures compound over time into a well-documented team knowledge base.
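Writing a one-page ADR is easier to make habitual when creating the file takes one command. This is a minimal sketch, assuming ADRs live in `docs/adr/` as `NNNN-title.md` files and use a Status / Context / Decision / Consequences template (a common ADR layout, not one this guide mandates). `new_adr` and `slugify` are hypothetical helpers.

```python
import re
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional

TEMPLATE = """# {number}. {title}

Date: {today}

## Status

Proposed

## Context

## Decision

## Consequences
"""


def slugify(title: str) -> str:
    # Lower-case, collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def new_adr(adr_dir: Path, title: str, today: Optional[date] = None) -> Path:
    """Create the next numbered ADR file from the template and return its path."""
    adr_dir.mkdir(parents=True, exist_ok=True)
    # Existing ADRs are assumed to be named like 0007-some-title.md.
    numbers = [
        int(m.group(1))
        for p in adr_dir.glob("*.md")
        if (m := re.match(r"(\d{4})-", p.name))
    ]
    number = max(numbers, default=0) + 1
    path = adr_dir / f"{number:04d}-{slugify(title)}.md"
    path.write_text(
        TEMPLATE.format(number=number, title=title, today=(today or date.today()).isoformat())
    )
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        adr_dir = Path(tmp) / "docs" / "adr"
        print(new_adr(adr_dir, "Record architecture decisions").name)
        print(new_adr(adr_dir, "Use PostgreSQL for billing").name)
```

Numbering from the highest existing file, rather than counting files, keeps numbers stable if an old ADR is ever deleted.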
Review checklist
- Bus factor audit conducted quarterly — components with bus factor 1 have a mitigation plan
- Runbooks exist for all production services and are updated after incidents
- New engineer onboarding guide is reviewed and updated quarterly
- Knowledge is stored in a consistent, discoverable location — not in personal notes, email, or Slack DMs
- Every piece of documentation has a named owner responsible for its accuracy
- Onboarding retrospective is conducted after every new hire