Site Reliability Engineering (SRE) is not about hiring people with “SRE” in their title. It is about applying engineering discipline to operations problems — measuring reliability objectively, managing risk with error budgets, automating toil, and building systems that recover gracefully from failures. This checklist translates SRE theory into actionable implementation steps.

Phase 1: Define SLIs and SLOs

Without SLIs and SLOs, reliability discussions are subjective. “The system feels slow” is not actionable. “p99 latency exceeded the 500ms SLO for 3 hours, consuming 40% of our monthly error budget” is.

Choosing SLIs

Select SLIs that reflect the user’s experience, not internal system metrics.

| Service type | Primary SLIs |
| --- | --- |
| API/web service | Availability (% successful requests), latency (p50, p95, p99), error rate |
| Data pipeline | Freshness (time since last successful run), correctness (% records processed accurately), coverage (% expected data present) |
| Storage system | Durability (% data retained), availability (% successful read/write ops), latency |
| Batch processing | Throughput (jobs completed per hour), success rate (% jobs completed without error), completion time |

SLI implementation checklist

  • Identify the 3-5 most critical user journeys for each service
  • For each journey, select 1-2 SLIs that directly measure user experience
  • Instrument SLI measurement at the point closest to the user (load balancer or API gateway, not the application server); a measurement sketch follows this list
  • Exclude planned maintenance windows from SLI calculations
  • Validate SLI data accuracy — compare measured SLIs with actual user reports
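
As a minimal sketch of how these SLIs might be computed (the request record fields, thresholds, and sample values here are illustrative assumptions, not a prescribed schema), availability and tail latency can be derived from a window of edge-level request logs:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int   # HTTP status observed at the load balancer
    latency_ms: float  # end-to-end latency in milliseconds

def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded (anything below HTTP 500)."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)

def latency_percentile(requests: list[Request], pct: float) -> float:
    """Latency below which `pct` percent of requests completed."""
    latencies = sorted(r.latency_ms for r in requests)
    index = min(len(latencies) - 1, int(len(latencies) * pct / 100))
    return latencies[index]

# A tiny window of requests measured at the edge (illustrative values).
window = [Request(200, 120.0), Request(200, 480.0), Request(503, 2300.0)]
print(f"availability: {availability_sli(window):.3%}")
print(f"p99 latency:  {latency_percentile(window, 99):.0f} ms")
```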

Setting SLOs

  • Start with historical data — what has the service actually achieved in the last 90 days?
  • Set the SLO slightly below historical performance — leave room for normal variance
  • Use rolling windows (e.g., a 30-day rolling window) rather than calendar months; calendar boundaries create perverse incentives, such as relaxing discipline the moment the budget resets on the first of the month
  • Define SLOs for each critical SLI:

| SLI | Example SLO | Error budget (30 days) |
| --- | --- | --- |
| Availability | 99.9% of requests succeed | 43 minutes of downtime, or 43,200 failed requests per 43.2M |
| Latency | 99% of requests < 200 ms | 1% of requests can exceed 200 ms |
| Latency (tail) | 99.9% of requests < 1,000 ms | 0.1% of requests can exceed 1 s |
| Data freshness | 99.5% of pipeline runs complete within 1 hour | 3.6 hours of staleness per month |

  • Document SLOs in a central, accessible location — every team member should know the targets
  • Review and adjust SLOs quarterly — tighten if consistently exceeded, relax if consistently missed without user complaints
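
To make the error budget column above concrete, here is a small sketch that converts an availability SLO into its 30-day budget in minutes and in failed requests (the 43.2M request volume is the same illustrative figure used in the table):

```python
WINDOW_DAYS = 30

def budget_minutes(slo: float) -> float:
    """Full-outage minutes the window allows for an availability SLO."""
    return (1 - slo) * WINDOW_DAYS * 24 * 60

def budget_requests(slo: float, expected_requests: int) -> int:
    """Failed requests the window allows at the expected volume."""
    return round((1 - slo) * expected_requests)

print(budget_minutes(0.999))               # ~43.2 minutes of downtime
print(budget_requests(0.999, 43_200_000))  # ~43,200 failed requests
```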

Phase 2: Error Budget Policy

Error budgets transform reliability from a vague priority into a quantitative mechanism for balancing features and stability.

Error budget policy checklist

  • Define what happens at each error budget threshold:

| Budget remaining | Policy |
| --- | --- |
| > 50% | Normal feature velocity. Deploy freely. |
| 25-50% | Increased caution. All deploys require additional review. Prioritize reliability fixes in the next sprint. |
| 10-25% | Reliability freeze. Only deploy bug fixes and reliability improvements. No new features until the budget recovers. |
| < 10% | Full freeze. Roll back recent changes if they contributed to budget consumption. Post-incident review for all budget-consuming events. |

  • Get organizational buy-in — product managers and engineering leadership must agree to the policy before the first SLO is set
  • Automate error budget tracking: dashboards showing current budget, burn rate, and projected depletion (a calculation sketch follows this list)
  • Review error budget status in weekly engineering meetings
  • Conduct error budget reviews monthly — what consumed the budget? What will we do differently?
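
A hedged sketch of what automated budget tracking might compute; the thresholds mirror the policy table above, while the function shape, parameters, and sample numbers are assumptions for illustration:

```python
def error_budget_status(slo: float, bad_events: int,
                        expected_window_events: int,
                        elapsed_fraction: float) -> dict:
    """Summarise error budget consumption for a request-based SLO."""
    budget = (1 - slo) * expected_window_events   # failures the window allows
    consumed = bad_events / budget if budget else 0.0
    remaining = max(0.0, 1.0 - consumed)
    # Burn rate > 1 means the budget runs out before the window ends
    # (assuming roughly uniform traffic).
    burn_rate = consumed / elapsed_fraction if elapsed_fraction else 0.0

    if remaining > 0.50:
        policy = "normal feature velocity"
    elif remaining > 0.25:
        policy = "increased caution: extra deploy review"
    elif remaining > 0.10:
        policy = "reliability freeze: fixes and reliability work only"
    else:
        policy = "full freeze: roll back recent changes"
    return {"remaining": round(remaining, 3),
            "burn_rate": round(burn_rate, 2),
            "policy": policy}

# 10 days into a 30-day window: 35,000 failed requests against a 99.9% SLO
# and an expected 60M requests for the full window.
print(error_budget_status(0.999, 35_000, 60_000_000, elapsed_fraction=10 / 30))
```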

Phase 3: Incident Management

When things break — and they will — a structured incident response minimizes impact and maximizes learning.

Incident response process

  • Define severity levels with clear criteria:

| Severity | Criteria | Response time | Example |
| --- | --- | --- | --- |
| SEV1 | Complete service outage or data loss | 5 minutes | Production database down, payment processing failed |
| SEV2 | Significant degradation affecting many users | 15 minutes | API latency 10x normal, 50% error rate |
| SEV3 | Partial degradation affecting some users | 1 hour | One region affected, non-critical feature broken |
| SEV4 | Minor issue, no user impact | Next business day | Monitoring gap, non-critical alert firing |
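
One way to keep these targets actionable is to encode them next to the paging logic. The response-time values below come from the table; the structure itself is an illustrative assumption:

```python
from datetime import timedelta

# Response-time targets per severity, mirroring the table above.
SEVERITY_RESPONSE = {
    "SEV1": timedelta(minutes=5),    # complete outage or data loss
    "SEV2": timedelta(minutes=15),   # significant degradation, many users
    "SEV3": timedelta(hours=1),      # partial degradation, some users
    "SEV4": timedelta(days=1),       # minor issue, no user impact
}

def needs_escalation(severity: str, minutes_unacknowledged: float) -> bool:
    """Escalate when an incident has gone unacknowledged past its target."""
    return timedelta(minutes=minutes_unacknowledged) > SEVERITY_RESPONSE[severity]

print(needs_escalation("SEV2", minutes_unacknowledged=20))  # True
```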

Incident roles

  • Incident Commander (IC) — coordinates response, makes decisions, communicates status
  • Operations Lead — executes technical investigation and remediation
  • Communications Lead — updates stakeholders, status page, and affected users
  • Rotate roles across the team — do not let the same person be IC every time

During the incident

  • Create a dedicated communication channel (Slack channel, video call)
  • IC declares incident severity and assigns roles within 5 minutes
  • Operations Lead focuses on mitigation first (restore service), root cause second
  • Communications Lead posts status updates every 15-30 minutes
  • Document all actions taken in real time; this record becomes the incident timeline
  • Escalate if the incident is not mitigated within the expected timeframe for its severity

After the incident

  • Conduct a blameless post-incident review (PIR) within 48 hours
  • Document: timeline, root cause, impact (duration, affected users, error budget consumed), what went well, what could be improved
  • Identify action items with owners and deadlines
  • Track action item completion; unfinished PIR actions are a leading cause of repeat incidents
  • Share PIR summaries across the organization — incidents are learning opportunities

Phase 4: Toil Reduction

Toil is the enemy of engineering productivity. Every hour spent on manual, repetitive operations is an hour not spent on automation, reliability, and innovation.

Toil identification

  • Track how operations team members spend their time for 2 weeks
  • Categorize activities as:
    • Engineering — writing code, designing systems, improving automation
    • Toil — manual, repetitive, automatable tasks
    • Overhead — meetings, planning, documentation
  • Identify the top 5 toil categories by time spent

Common toil categories and automation strategies

| Toil | Automation approach |
| --- | --- |
| Manual scaling | Auto-scaling policies based on metrics |
| Manual deployments | CI/CD pipeline with automated rollback |
| Certificate rotation | cert-manager or ACME automation |
| Configuration changes | Configuration-as-code with a PR-based workflow |
| User access provisioning | SCIM/SSO integration, self-service IAM |
| Routine capacity planning | Automated forecasting based on growth trends |
| Repetitive alert triage | Auto-remediation runbooks, self-healing scripts |

Toil reduction checklist

  • Set a target: SREs spend no more than 50% of time on toil (Google’s benchmark)
  • Prioritize automation by ROI: (time saved per week × weeks per year) / implementation effort (see the sketch after this list)
  • Automate the most frequent manual task first
  • Build self-service tools for operations that development teams request repeatedly
  • Measure toil reduction monthly — track the trend, not just the absolute number
  • Celebrate toil elimination — make it a visible metric in team reviews
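
Applied mechanically, the ROI formula above turns a toil inventory into a ranked automation backlog. The tasks and numbers in this sketch are invented for illustration:

```python
def automation_roi(hours_per_week: float, weeks_per_year: float,
                   implementation_hours: float) -> float:
    """Annual hours saved per hour of implementation effort."""
    return (hours_per_week * weeks_per_year) / implementation_hours

# Hypothetical toil candidates: (name, hours/week, weeks/year, effort in hours).
candidates = [
    ("manual certificate rotation", 2.0, 52, 16),
    ("manual scaling before traffic peaks", 4.0, 52, 80),
    ("user access provisioning", 3.0, 52, 120),
]

ranked = sorted(candidates,
                key=lambda c: automation_roi(c[1], c[2], c[3]),
                reverse=True)
for name, hours, weeks, effort in ranked:
    print(f"{name}: ROI {automation_roi(hours, weeks, effort):.1f}")
```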

Phase 5: Capacity Planning

Reliable systems do not run at 100% capacity. Capacity planning ensures headroom for traffic spikes and growth.

Capacity planning checklist

  • Monitor resource utilization trends: CPU, memory, storage, network, database connections
  • Set capacity thresholds: alert at 70%, plan expansion at 80%, emergency at 90%
  • Forecast growth based on historical trends and business projections over a 6-12 month horizon (a forecasting sketch follows this list)
  • Conduct load testing quarterly — validate that the system handles 2x expected peak traffic
  • Document capacity limits per service — what breaks first and at what load?
  • Plan for failure — if one availability zone goes down, can the remaining zones handle the full load?
  • Review infrastructure costs monthly — over-provisioning is expensive, under-provisioning is risky
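
As a hedged sketch of the forecasting step (assuming a simple linear trend over monthly utilization samples; a real forecast should also fold in business projections), this estimates how many months remain before a resource crosses the 80% expansion threshold:

```python
from typing import Optional

def months_until_threshold(samples: list[float],
                           threshold: float = 0.80) -> Optional[float]:
    """Fit a linear trend to monthly utilization samples (0.0-1.0) and
    estimate months until the threshold is crossed.
    Returns None if the trend is flat or declining."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    # Ordinary least-squares slope: utilization growth per month.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None
    return max(0.0, (threshold - samples[-1]) / slope)

# Six months of CPU utilization for a hypothetical service.
history = [0.52, 0.55, 0.59, 0.61, 0.66, 0.70]
print(f"~{months_until_threshold(history):.1f} months until the 80% threshold")
```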

Phase 6: On-Call and Sustainability

On-call is not a punishment — it is a responsibility that must be sustainable and fair.

On-call structure

  • Minimum 2 people in every on-call rotation — primary and secondary
  • Rotation length: 1 week (longer rotations cause burnout)
  • Maximum 2 incidents per on-call shift that require waking up; if this is regularly exceeded, the system needs reliability investment
  • Compensate on-call fairly — time off in lieu or financial compensation
  • Provide clear escalation paths — the on-call engineer should never feel alone

On-call sustainability

  • Track on-call load: pages per shift, time to acknowledge, time to resolve (a tracking sketch follows this list)
  • Review on-call load monthly — high page frequency indicates systemic issues
  • New team members shadow on-call for 2 rotations before going primary
  • Maintain up-to-date runbooks for every common alert
  • Conduct quarterly on-call retrospectives — is the load sustainable? Are runbooks helpful?
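
A minimal sketch of the load tracking above, assuming a log of pages with acknowledge and resolve timestamps (the record fields and sample data are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Page:
    fired_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime

def on_call_report(pages: list[Page], shifts: int) -> dict:
    """Pages per shift plus mean time to acknowledge/resolve, in minutes."""
    tta = [(p.acknowledged_at - p.fired_at).total_seconds() / 60 for p in pages]
    ttr = [(p.resolved_at - p.fired_at).total_seconds() / 60 for p in pages]
    return {
        "pages_per_shift": len(pages) / shifts,
        "mean_minutes_to_acknowledge": round(mean(tta), 1) if tta else None,
        "mean_minutes_to_resolve": round(mean(ttr), 1) if ttr else None,
    }

pages = [
    Page(datetime(2024, 5, 6, 2, 10), datetime(2024, 5, 6, 2, 14), datetime(2024, 5, 6, 3, 0)),
    Page(datetime(2024, 5, 8, 14, 0), datetime(2024, 5, 8, 14, 3), datetime(2024, 5, 8, 14, 40)),
]
print(on_call_report(pages, shifts=4))
```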

How ARDURA Consulting Supports SRE Implementation

Building an SRE practice requires experienced reliability engineers who have operated large-scale systems and can establish processes, tooling, and culture from the ground up. ARDURA Consulting provides the expertise:

  • 500+ senior specialists including SREs, platform engineers, and DevOps architects experienced in SLO-driven reliability, incident management, and infrastructure automation — available within 2 weeks
  • 40% cost savings compared to permanent hiring, allowing you to bring in SRE expertise for practice establishment and knowledge transfer
  • 99% client retention — engineers who stay through implementation, stabilization, and organizational adoption of SRE practices
  • 211+ completed projects including SRE practice establishment, incident response framework design, and platform reliability engineering

Whether you need an SRE architect to design your reliability framework or a team to implement and operate it, ARDURA Consulting provides the talent to make your systems reliably serve your users.