Site Reliability Engineering (SRE) is not about hiring people with “SRE” in their title. It is about applying engineering discipline to operations problems — measuring reliability objectively, managing risk with error budgets, automating toil, and building systems that recover gracefully from failures. This checklist translates SRE theory into actionable implementation steps.
Phase 1: Define SLIs and SLOs
Without SLIs and SLOs, reliability discussions are subjective. “The system feels slow” is not actionable. “p99 latency exceeded the 500ms SLO for 3 hours, consuming 40% of our monthly error budget” is.
Choosing SLIs
Select SLIs that reflect the user’s experience, not internal system metrics.
| Service type | Primary SLIs |
|---|---|
| API/web service | Availability (% successful requests), Latency (p50, p95, p99), Error rate |
| Data pipeline | Freshness (time since last successful run), Correctness (% records processed accurately), Coverage (% expected data present) |
| Storage system | Durability (% data retained), Availability (% successful read/write ops), Latency |
| Batch processing | Throughput (jobs completed per hour), Success rate (% jobs completed without error), Completion time |
SLI implementation checklist
- Identify the 3-5 most critical user journeys for each service
- For each journey, select 1-2 SLIs that directly measure user experience
- Instrument SLI measurement at the point closest to the user (load balancer, API gateway — not the application server); see the measurement sketch after this checklist
- Exclude planned maintenance windows from SLI calculations
- Validate SLI data accuracy — compare measured SLIs with actual user reports
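As a concrete illustration, here is a minimal Python sketch of how availability and latency SLIs might be computed from edge request records. The record fields, the 5xx-as-failure rule, and the synthetic sample data are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code observed at the load balancer

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for an illustration."""
    ordered = sorted(values)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

def compute_slis(requests: list[Request]) -> dict[str, float]:
    """Availability and latency SLIs over a batch of edge request records."""
    total = len(requests)
    successful = sum(1 for r in requests if r.status < 500)  # count 5xx as failures
    latencies = [r.latency_ms for r in requests]
    return {
        "availability_pct": 100 * successful / total,
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
    }

# Example: 1,000 synthetic requests, 3 of which fail
sample = [Request(latency_ms=50 + i % 400, status=500 if i % 333 == 0 else 200)
          for i in range(1, 1001)]
print(compute_slis(sample))
```

In production the same calculation would typically run in the monitoring system (e.g., over load-balancer logs or metrics), not in application code; the point is that the SLI definition is explicit and reproducible.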
Setting SLOs
- Start with historical data — what has the service actually achieved in the last 90 days?
- Set the SLO slightly below historical performance — leave room for normal variance
- Use rolling windows (e.g., 30-day rolling) rather than calendar months — calendar boundaries create perverse incentives
- Define SLOs for each critical SLI:
| SLI | Example SLO | Error budget (30 days) |
|---|---|---|
| Availability | 99.9% of requests succeed | ~43 minutes of downtime, or 43,200 failed requests out of 43.2M total |
| Latency | 99% of requests < 200ms | 1% of requests can exceed 200ms |
| Latency (tail) | 99.9% of requests < 1,000ms | 0.1% of requests can exceed 1s |
| Data freshness | 99.5% of pipeline runs complete within 1 hour | 0.5% of runs may exceed 1 hour (~3.6 hours of potential staleness per month) |
- Document SLOs in a central, accessible location — every team member should know the targets
- Review and adjust SLOs quarterly — tighten if consistently exceeded, relax if consistently missed without user complaints
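The error-budget figures in the availability row above follow directly from the SLO target and window. A quick sketch of the arithmetic (the request count is illustrative):

```python
from typing import Optional

def error_budget(slo_target: float, window_days: int = 30,
                 requests_in_window: Optional[int] = None) -> dict:
    """Convert an SLO target into an error budget for the rolling window."""
    budget_fraction = 1 - slo_target                    # e.g. 0.001 for a 99.9% SLO
    budget_minutes = budget_fraction * window_days * 24 * 60
    result = {"budget_fraction": budget_fraction, "budget_minutes": budget_minutes}
    if requests_in_window is not None:
        result["budget_requests"] = budget_fraction * requests_in_window
    return result

# 99.9% availability over 30 days and 43.2M requests:
# ~43.2 minutes of downtime, or ~43,200 failed requests
print(error_budget(0.999, 30, 43_200_000))
```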
Phase 2: Error Budget Policy
Error budgets transform reliability from a vague priority into a quantitative mechanism for balancing features and stability.
Error budget policy checklist
- Define what happens at each error budget threshold:
| Budget remaining | Policy |
|---|---|
| > 50% | Normal feature velocity. Deploy freely. |
| 25-50% | Increased caution. All deploys require additional review. Prioritize reliability fixes in the next sprint. |
| 10-25% | Reliability freeze. Only deploy bug fixes and reliability improvements. No new features until budget recovers. |
| < 10% | Full freeze. Roll back recent changes if they contributed to budget consumption. Post-incident review for all budget-consuming events. |
- Get organizational buy-in — product managers and engineering leadership must agree to the policy before the first SLO is set
- Automate error budget tracking — dashboards showing current budget, burn rate, and projected depletion (a minimal burn-rate sketch follows this checklist)
- Review error budget status in weekly engineering meetings
- Conduct error budget reviews monthly — what consumed the budget? What will we do differently?
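A minimal sketch of the kind of burn-rate calculation such a dashboard would automate, assuming the total budget and consumed budget are already available as inputs (the example numbers are made up):

```python
def budget_status(budget_total: float, budget_consumed: float,
                  elapsed_days: float, window_days: float = 30.0) -> dict[str, float]:
    """Report remaining error budget, burn rate, and projected depletion."""
    remaining = budget_total - budget_consumed
    expected_consumed = budget_total * elapsed_days / window_days
    # Burn rate of 1.0 means the budget is being spent at exactly the sustainable pace.
    burn_rate = (budget_consumed / expected_consumed) if expected_consumed else 0.0
    daily_burn = budget_consumed / elapsed_days if elapsed_days else 0.0
    days_to_depletion = remaining / daily_burn if daily_burn else float("inf")
    return {
        "remaining_pct": 100 * remaining / budget_total,
        "burn_rate": burn_rate,
        "projected_days_to_depletion": days_to_depletion,
    }

# 43.2 minutes of budget, 20 minutes already consumed, 10 days into the window
print(budget_status(budget_total=43.2, budget_consumed=20.0, elapsed_days=10))
```

Mapping `remaining_pct` onto the policy table above gives the team an unambiguous signal about whether to keep shipping features or shift to reliability work.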
Phase 3: Incident Management
When things break — and they will — a structured incident response minimizes impact and maximizes learning.
Incident response process
- Define severity levels with clear criteria:
| Severity | Criteria | Response time | Example |
|---|---|---|---|
| SEV1 | Complete service outage or data loss | 5 minutes | Production database down, payment processing failed |
| SEV2 | Significant degradation affecting many users | 15 minutes | API latency 10x normal, 50% error rate |
| SEV3 | Partial degradation affecting some users | 1 hour | One region affected, non-critical feature broken |
| SEV4 | Minor issue, no user impact | Next business day | Monitoring gap, non-critical alert firing |
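Severity criteria are most useful when the alerting system can act on them directly. A hypothetical sketch of the table above expressed as configuration-as-code (the structure, field names, and paging rules are illustrative assumptions):

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    criteria: str
    response_time: timedelta
    pages_oncall: bool  # whether the alert should page the on-call engineer immediately

SEVERITIES = {
    "SEV1": SeverityPolicy("SEV1", "Complete service outage or data loss",
                           timedelta(minutes=5), pages_oncall=True),
    "SEV2": SeverityPolicy("SEV2", "Significant degradation affecting many users",
                           timedelta(minutes=15), pages_oncall=True),
    "SEV3": SeverityPolicy("SEV3", "Partial degradation affecting some users",
                           timedelta(hours=1), pages_oncall=False),
    "SEV4": SeverityPolicy("SEV4", "Minor issue, no user impact",
                           timedelta(days=1), pages_oncall=False),
}

def should_page(severity: str) -> bool:
    return SEVERITIES[severity].pages_oncall

print(should_page("SEV2"))  # True: SEV2 requires a response within 15 minutes
```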
Incident roles
- Incident Commander (IC) — coordinates response, makes decisions, communicates status
- Operations Lead — executes technical investigation and remediation
- Communications Lead — updates stakeholders, status page, and affected users
- Rotate roles across the team — do not let the same person be IC every time
During the incident
- Create a dedicated communication channel (Slack channel, video call)
- IC declares incident severity and assigns roles within 5 minutes
- Operations Lead focuses on mitigation first (restore service), root cause second
- Communications Lead posts status updates every 15-30 minutes
- Document all actions taken in real-time — this becomes the incident timeline
- Escalate if the incident is not mitigated within the expected timeframe for its severity
After the incident
- Conduct a blameless post-incident review (PIR) within 48 hours
- Document: timeline, root cause, impact (duration, affected users, error budget consumed), what went well, what could be improved
- Identify action items with owners and deadlines
- Track action item completion — unfinished PIR actions are a common cause of repeat incidents
- Share PIR summaries across the organization — incidents are learning opportunities
Phase 4: Toil Reduction
Toil is the enemy of engineering productivity. Every hour spent on manual, repetitive operations is an hour not spent on automation, reliability, and innovation.
Toil identification
- Track how operations team members spend their time for 2 weeks
- Categorize activities as:
- Engineering — writing code, designing systems, improving automation
- Toil — manual, repetitive, automatable tasks
- Overhead — meetings, planning, documentation
- Identify the top 5 toil categories by time spent
Common toil categories and automation strategies
| Toil | Automation approach |
|---|---|
| Manual scaling | Auto-scaling policies based on metrics |
| Manual deployments | CI/CD pipeline with automated rollback |
| Certificate rotation | cert-manager or ACME automation |
| Configuration changes | Configuration-as-code with PR-based workflow |
| User access provisioning | SCIM/SSO integration, self-service IAM |
| Routine capacity planning | Automated forecasting based on growth trends |
| Repetitive alert triage | Auto-remediation runbooks, self-healing scripts |
Toil reduction checklist
- Set a target: SREs spend no more than 50% of their time on toil and other operational work (Google’s benchmark)
- Prioritize automation by ROI: (time saved per week × weeks per year) / implementation effort (see the sketch after this list)
- Automate the most frequent manual task first
- Build self-service tools for operations that development teams request repeatedly
- Measure toil reduction monthly — track the trend, not just the absolute number
- Celebrate toil elimination — make it a visible metric in team reviews
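A minimal sketch of the ROI prioritization described above; the toil items and effort estimates are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class ToilItem:
    name: str
    hours_saved_per_week: float
    implementation_effort_hours: float

    @property
    def roi(self) -> float:
        # (time saved per week × weeks per year) / implementation effort
        return self.hours_saved_per_week * 52 / self.implementation_effort_hours

backlog = [
    ToilItem("Manual deployments", hours_saved_per_week=4, implementation_effort_hours=80),
    ToilItem("Certificate rotation", hours_saved_per_week=0.5, implementation_effort_hours=16),
    ToilItem("Manual scaling", hours_saved_per_week=2, implementation_effort_hours=40),
]

for item in sorted(backlog, key=lambda t: t.roi, reverse=True):
    print(f"{item.name}: {item.roi:.1f} hours saved per year per hour invested")
```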
Phase 5: Capacity Planning
Reliable systems do not run at 100% capacity. Capacity planning ensures headroom for traffic spikes and growth.
Capacity planning checklist
- Monitor resource utilization trends: CPU, memory, storage, network, database connections
- Set capacity thresholds: alert at 70%, plan expansion at 80%, emergency at 90%
- Forecast growth based on historical trends and business projections (6-12 month horizon; a simple forecasting sketch follows this list)
- Conduct load testing quarterly — validate that the system handles 2x expected peak traffic
- Document capacity limits per service — what breaks first and at what load?
- Plan for failure — if one availability zone goes down, can the remaining zones handle the full load?
- Review infrastructure costs monthly — over-provisioning is expensive, under-provisioning is risky
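A minimal sketch of the threshold forecasting described above, assuming a simple linear trend fitted to recent daily utilization samples (in practice the data source and model would be more sophisticated):

```python
def forecast_threshold_crossing(samples: list[float], threshold_pct: float) -> float:
    """Estimate days until utilization crosses a threshold, given daily samples.

    Fits a least-squares line to daily utilization percentages and extrapolates
    forward. Returns float('inf') if utilization is flat or shrinking.
    """
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / \
            sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return float("inf")
    current = samples[-1]
    return max(0.0, (threshold_pct - current) / slope)

# 30 days of CPU utilization climbing from ~55% toward the 70% alert threshold
history = [55 + 0.3 * day for day in range(30)]
print(f"Days until 70% alert threshold: {forecast_threshold_crossing(history, 70):.0f}")
print(f"Days until 80% expansion threshold: {forecast_threshold_crossing(history, 80):.0f}")
```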
Phase 6: On-Call and Sustainability
On-call is not a punishment — it is a responsibility that must be sustainable and fair.
On-call structure
- Minimum 2 people in every on-call rotation — primary and secondary
- Rotation length: 1 week (longer rotations cause burnout)
- Maximum 2 incidents per on-call shift that require waking someone up; if this is regularly exceeded, the system needs reliability investment
- Compensate on-call fairly — time off in lieu or financial compensation
- Provide clear escalation paths — the on-call engineer should never feel alone
On-call sustainability
- Track on-call load: pages per shift, time to acknowledge, time to resolve (a small tracking sketch follows this list)
- Review on-call load monthly — high page frequency indicates systemic issues
- New team members shadow on-call for 2 rotations before going primary
- Maintain up-to-date runbooks for every common alert
- Conduct quarterly on-call retrospectives — is the load sustainable? Are runbooks helpful?
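A minimal sketch of the on-call load tracking mentioned above, computed from a hypothetical list of page records (the record fields and timestamps are illustrative; real data would come from the paging tool's API):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Page:
    fired_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime

def oncall_load(pages: list[Page], shifts_in_period: int) -> dict[str, float]:
    """Summarize pages per shift and mean time to acknowledge/resolve (minutes)."""
    if not pages:
        return {"pages_per_shift": 0.0, "mean_ack_min": 0.0, "mean_resolve_min": 0.0}
    ack = [(p.acknowledged_at - p.fired_at).total_seconds() / 60 for p in pages]
    res = [(p.resolved_at - p.fired_at).total_seconds() / 60 for p in pages]
    return {
        "pages_per_shift": len(pages) / shifts_in_period,
        "mean_ack_min": sum(ack) / len(ack),
        "mean_resolve_min": sum(res) / len(res),
    }

pages = [
    Page(datetime(2024, 5, 1, 2, 0), datetime(2024, 5, 1, 2, 4), datetime(2024, 5, 1, 3, 0)),
    Page(datetime(2024, 5, 9, 14, 0), datetime(2024, 5, 9, 14, 2), datetime(2024, 5, 9, 14, 40)),
]
print(oncall_load(pages, shifts_in_period=4))
```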
How ARDURA Consulting Supports SRE Implementation
Building an SRE practice requires experienced reliability engineers who have operated large-scale systems and can establish processes, tooling, and culture from the ground up. ARDURA Consulting provides the expertise:
- 500+ senior specialists including SREs, platform engineers, and DevOps architects experienced in SLO-driven reliability, incident management, and infrastructure automation — available within 2 weeks
- 40% cost savings compared to permanent hiring, allowing you to bring in SRE expertise for practice establishment and knowledge transfer
- 99% client retention — engineers who stay through implementation, stabilization, and organizational adoption of SRE practices
- 211+ completed projects including SRE practice establishment, incident response framework design, and platform reliability engineering
Whether you need an SRE architect to design your reliability framework or a team to implement and operate it, ARDURA Consulting provides the talent to make your systems reliably serve your users.