PushBackLog

Knowledge Management

Status: Complete
Category: Management
Default enforcement: Advisory
Author: PushBackLog team


Tags

  • Topic: management, documentation, culture
  • Skillset: engineering-management, engineering
  • Technology: generic
  • Stage: planning, operations

Summary

Knowledge management is the practice of systematically capturing, organising, sharing, and maintaining the information that engineers and teams need to do their work effectively. The failure mode is not ignorance — it is knowledge that exists but is inaccessible: buried in a Slack thread from two years ago, stored in one senior engineer’s head, or written in a stale wiki that nobody trusts. Effective knowledge management prevents silos, accelerates onboarding, reduces key-person risk, and makes the team resilient to turnover.


Rationale

Knowledge silos are a reliability risk

When a critical system is understood only by one engineer, that engineer is a single point of failure. Their absence — holiday, illness, resignation — creates an operational gap. Teams with good knowledge management have collective understanding: any team member can diagnose a production issue, deploy the service, or explain its data model. This is not achieved through documentation alone; it requires deliberate practices including pair programming, code review, written decisions, and shared runbooks.

Undiscoverable knowledge is the same as missing knowledge

The common failure is not that knowledge was never written down. It is that it was written somewhere no one ever looks: a Jira comment from 2021, a Google Doc shared with one person, a Slack thread buried deep in channel history. Effective knowledge management requires both content creation (writing it down) and discoverability (making it findable). A single entry point, search, and consistent structure matter as much as volume of content.


Guidance

The knowledge management stack

| Layer            | What lives here                              | Tooling                           | Maintenance            |
|------------------|----------------------------------------------|-----------------------------------|------------------------|
| Code             | Current system behaviour                     | Git repository                    | Automated              |
| ADRs             | Why decisions were made                      | Git repository (/docs/adr/)       | Written with decisions |
| Runbooks         | How to operate and recover                   | Git repository (/docs/runbooks/)  | Updated with incidents |
| API docs         | How to use interfaces                        | Co-located with code              | Updated with code      |
| Team wiki        | How the team works, processes, meeting notes | Notion / Confluence               | Maintained by team     |
| Onboarding guide | How to get started                           | Git or wiki                       | Reviewed quarterly     |
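
The repository locations named in the table can be checked mechanically, for example as a CI step. The following is a minimal sketch, assuming the /docs/adr/ and /docs/runbooks/ paths are treated as required in every service repository; the function name `missing_doc_dirs` is hypothetical.

```python
from pathlib import Path

# Paths come from the stack table; treating them as mandatory is an assumption.
EXPECTED_DIRS = ("docs/adr", "docs/runbooks")


def missing_doc_dirs(repo_root: Path) -> list[str]:
    """Return the expected documentation directories that are absent or empty."""
    missing = []
    for rel in EXPECTED_DIRS:
        directory = repo_root / rel
        # An empty directory is as unhelpful as a missing one, so flag both.
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Check the repository in the current working directory.
    for rel in missing_doc_dirs(Path(".")):
        print(f"missing or empty: {rel}")
```

A check like this only confirms that the documents have a home. It says nothing about whether their content is current, which is what the maintenance cadence later in this article addresses.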

Preventing knowledge silos

Pair programming and code review are the primary mechanisms for spreading knowledge across the team — not documentation. Documentation captures state; pairing transfers understanding and reasoning. See Pair & Mob Programming and Code Review Best Practices.

Bus factor is the minimum number of people who would have to leave unexpectedly (be “hit by a bus”) before a project can no longer proceed. The lower the number, the greater the risk. Track it per component:

# Bus Factor Audit (quarterly)

| Component       | Primary owner | Secondary (knows it well) | Bus factor |
|-----------------|---------------|---------------------------|------------|
| Auth service    | Christy       | Marlene                   | 2          |
| Billing module  | Todd          | -                         | ⚠️ 1       |
| Infra/Terraform | Marcus        | -                         | ⚠️ 1       |

A bus factor of 1 is a risk. Plan pairing sessions or documentation sprints to raise it.
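
If the audit is kept as structured data rather than a hand-edited table, the at-risk components can be flagged automatically. This is a minimal sketch that reuses the example rows above. The `Component` class and `at_risk` function are hypothetical names, and counting the primary owner plus each secondary as one person is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One row of the audit table (field names are illustrative)."""
    name: str
    primary_owner: str
    secondaries: list[str] = field(default_factory=list)

    @property
    def bus_factor(self) -> int:
        # Count distinct people who know the component well, primary included.
        return len({self.primary_owner, *self.secondaries})


def at_risk(components: list[Component], threshold: int = 1) -> list[Component]:
    """Return components whose bus factor is at or below the threshold."""
    return [c for c in components if c.bus_factor <= threshold]


# Rows copied from the example audit table.
audit = [
    Component("Auth service", "Christy", ["Marlene"]),
    Component("Billing module", "Todd"),
    Component("Infra/Terraform", "Marcus"),
]

for component in at_risk(audit):
    print(f"WARNING: {component.name} has bus factor {component.bus_factor} "
          f"(only {component.primary_owner})")
```

With the rows above, this flags the billing module and Infra/Terraform. Who counts as a secondary remains a judgment call for the team; the script only does the bookkeeping.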

Writing for discoverability

The most important choice is where to put information such that it will be found. Rules of thumb:

  • Operational procedures belong in the repository alongside the system they describe
  • Decision rationale belongs in ADRs linked from the relevant code or PR
  • Team processes (sprint ceremonies, escalation paths, on-call setup) belong in the team wiki
  • Temporary context (where are we on this project, what’s the current plan) belongs in Jira/Linear tickets and sprint docs — not in permanent knowledge stores

Avoid “general” wikis with unclear organisation. Every page needs a clear home; if it’s hard to decide where something goes, the structure is wrong.

Documentation maintenance cadence

Documentation that is never reviewed becomes incorrect over time. Define a maintenance cadence:

| Document type       | Review cadence                      | Trigger for out-of-cycle review        |
|---------------------|-------------------------------------|----------------------------------------|
| Runbooks            | After every incident that uses them | Service change affecting the procedure |
| Onboarding guide    | Quarterly                           | New-hire feedback during onboarding    |
| Architecture docs   | When architecture changes           | New ADR                                |
| Security procedures | Quarterly                           | Security incident or audit             |
| Team wiki pages     | Annually                            | “Document health check” sprint         |

Assign page ownership. A page with no named owner will not be maintained.
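
The time-based cadences and the ownership rule lend themselves to a simple automated check. The sketch below assumes each page records an owner and a last-reviewed date. The `Page` fields, the day counts chosen for "quarterly" and "annually", and the sample dates are all illustrative assumptions. Runbooks and architecture docs are left out because their reviews are triggered by events, not by the calendar.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed day counts for the "Quarterly" and "Annually" cadences.
REVIEW_INTERVAL_DAYS = {
    "onboarding": 90,
    "security": 90,
    "wiki": 365,
}


@dataclass
class Page:
    title: str
    doc_type: str
    owner: Optional[str]
    last_reviewed: date


def review_problems(page: Page, today: date) -> list[str]:
    """List the maintenance problems for one page: missing owner, overdue review."""
    problems = []
    if not page.owner:
        problems.append("no named owner")
    interval = REVIEW_INTERVAL_DAYS.get(page.doc_type)
    # Document types without a time-based cadence are skipped here.
    if interval is not None and today - page.last_reviewed > timedelta(days=interval):
        problems.append(f"review overdue (last reviewed {page.last_reviewed.isoformat()})")
    return problems


# Made-up sample data for illustration only.
pages = [
    Page("Onboarding guide", "onboarding", "Christy", date(2024, 1, 10)),
    Page("Escalation paths", "wiki", None, date(2024, 3, 1)),
]
today = date(2024, 6, 1)

for page in pages:
    for problem in review_problems(page, today):
        print(f"{page.title}: {problem}")
```

In this sample the onboarding guide is reported as overdue and the escalation page as unowned. Filing each finding as a ticket keeps the check from becoming one more report that nobody reads.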

Onboarding as a knowledge management signal

A new engineer’s onboarding experience is the best test of knowledge management quality. If they can read the documentation and be productive in one week, the knowledge is accessible and current. If they need hours of guided help from a senior engineer, knowledge is concentrated and undocumented.

After every new hire onboarding, run a retrospective:

  • What was missing from the documentation?
  • What was confusing?
  • What took longer than it should have?

Treat onboarding friction as actionable findings. File tickets. Fix the docs.

Lightweight knowledge capture habits

Sustainable knowledge management is not about writing long documents. It is about making small captures habitual:

  • After debugging a non-obvious issue: write one paragraph in the runbook describing the symptom, root cause, and fix
  • After every architecture decision: write a one-page ADR
  • When explaining something on Slack for the second time: convert the explanation to a wiki page and link to it
  • When a new engineer asks a question you didn’t have an answer for: document it after you find the answer

The compounding effect of small, consistent captures is a well-documented team knowledge base.

Review checklist

  • Bus factor audit conducted quarterly — components with bus factor 1 have a mitigation plan
  • Runbooks exist for all production services and are updated after incidents
  • New engineer onboarding guide is reviewed and updated quarterly
  • Knowledge is stored in a consistent, discoverable location — not in personal notes, email, or Slack DMs
  • Every piece of documentation has a named owner responsible for its accuracy
  • Onboarding retrospective is conducted after every new hire