04 · Capability

Support & On-call.

A named engineer on call, not a ticket number. Sev1 median first response of 4 min (P95 7 min over the last 12 months), blameless postmortem within 5 business days, runbook in the client's wiki. Annual contract with a 30-day exit: renew only if it's better than the alternative.

9 retainer accounts · 24×7 Support, active

4 min · Sev1 response, 12-month median (P95: 7 min)

94% · Postmortems within 5 business days, last 12 months

No tier-1 turnstile. Senior from the first alert.

01. 24×7 on-call

02. Incident management

03. Postmortem & learning

04. AMS & technical routine

05. Runbook & knowledge

06. Continuous hardening

07. Contract & SLA

What we are NOT · turnstile · NOC · help desk · vendor bridge · tier-1 with upsell · ticket for missed SLA

01 · 24×7 on-call

Senior engineer from the first alert. Awake, with context, with autonomy.

Traditional on-call uses a turnstile: tier-1 opens the ticket, tier-2 reads the ticket, tier-3 — where expertise lives — only bridges in when the client is already angry. It doesn't buy time, it buys friction. Here the senior is the first response, runbook in hand, with autonomy to fix — not only escalate.

Model · Redgator on-call

sev1 · sev2 · sev3 · sev4

SEV · 01 · CRITICAL · Production down
Response: 4 min median (P95 7 min) · Mitigation: 30 min · War room: immediate · Escalation: CTO in 15 min if needed

SEV · 02 · HIGH · Degraded function
Response: 15 min · Mitigation: 2 h · Resolution: 8 h · Postmortem: mandatory

SEV · 03 · MEDIUM · Non-blocking defect
Response: next business day · Resolution: current sprint · Workaround: documented

SEV · 04 · LOW · Cosmetic issue or question
Response: 2 business days · Resolution: prioritized backlog · No contractual penalty

A missed Sev1/Sev2 SLA becomes a contractual penalty, not an excuse in the status report. Calculated monthly, with a public ledger shared with the client.
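
The severity matrix and the penalty rule can be read as a small lookup table. What follows is a minimal Python sketch, not Redgator's actual tooling: the names `SlaTarget`, `SLA_TARGETS`, `breaches`, and `goes_to_penalty_ledger` are hypothetical, and the 7-minute Sev1 response limit is an assumption (the page reports 7 min as a measured P95, not as a contractual ceiling).

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SlaTarget:
    response: Optional[timedelta]    # None: measured in business days, not modelled here
    mitigation: Optional[timedelta]  # None: no mitigation clock in the matrix
    penalty_applies: bool            # only Sev1/Sev2 misses carry a penalty


# Values copied from the severity matrix. The Sev1 response limit is an
# assumption for illustration (the text quotes 7 min as an observed P95).
SLA_TARGETS = {
    "sev1": SlaTarget(timedelta(minutes=7), timedelta(minutes=30), True),
    "sev2": SlaTarget(timedelta(minutes=15), timedelta(hours=2), True),
    "sev3": SlaTarget(None, None, False),
    "sev4": SlaTarget(None, None, False),
}


def breaches(severity: str, response: timedelta, mitigation: timedelta) -> list[str]:
    """Return the names of the SLA clocks this incident missed."""
    target = SLA_TARGETS[severity]
    missed = []
    if target.response is not None and response > target.response:
        missed.append("response")
    if target.mitigation is not None and mitigation > target.mitigation:
        missed.append("mitigation")
    return missed


def goes_to_penalty_ledger(severity: str, response: timedelta, mitigation: timedelta) -> bool:
    """True when a missed clock should be recorded in the monthly penalty ledger."""
    return SLA_TARGETS[severity].penalty_applies and bool(breaches(severity, response, mitigation))
```

A Sev2 answered in 20 minutes would show up as a `response` breach and land in the ledger, while a slow Sev4 would not.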

P·01
On-call rotation with a name
Shared calendar: you know who is on call this week, with what experience, in which timezone. If you prefer someone specific for a critical window (Black Friday, fiscal closing), we arrange it and write it into the contract.
P·02
Real follow-the-sun
São Paulo, Lisbon, Mexico. Night on-call isn't someone woken at 3 a.m. and making sleepy decisions; it is the engineer in another timezone, with coffee, reading the same runbook.
P·03
Monitored pager fatigue
An engineer woken up 3 times in the same week is a symptom of a bad alert, and fixing that alert goes straight to the backlog. Goal: <1.5 pages per shift. Burnout doesn't scale.
P·04
Authority to fix
The on-call engineer has credentials, runbook, and a pre-approved emergency change window. They don't escalate to wake up an architect to apply a 4-line fix. Action recorded in the audit trail, reviewed the next morning.
P·05
Communication during the incident
Internal status page updated every 15 min, no jargon. The client receives an executive version (1 paragraph, next update in X) on the agreed channel, so they never have to chase us for news.
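
The pager-fatigue rule in P·03 is simple enough to check mechanically. A rough Python sketch under stated assumptions: the page log is a list of `(engineer, timestamp)` pairs, "same week" is interpreted as the same ISO calendar week, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

PAGES_PER_SHIFT_GOAL = 1.5  # "<1.5 pages per shift" from the text
WEEKLY_PAGE_LIMIT = 3       # "woken up 3 times in the same week"


def pages_per_shift(page_count: int, shift_count: int) -> float:
    """Average number of pages per on-call shift."""
    if shift_count <= 0:
        raise ValueError("shift_count must be positive")
    return page_count / shift_count


def fatigued_engineers(pages: list[tuple[str, datetime]]) -> set[str]:
    """Engineers paged WEEKLY_PAGE_LIMIT or more times within one ISO week."""
    per_week = Counter()
    for engineer, fired_at in pages:
        iso_year, iso_week, _ = fired_at.isocalendar()
        per_week[(engineer, iso_year, iso_week)] += 1
    return {eng for (eng, _, _), count in per_week.items() if count >= WEEKLY_PAGE_LIMIT}
```

Anyone returned by `fatigued_engineers` points at an alert worth reviewing, not at an engineer to blame.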

02 · Incident management

Incident timeline. Sev1 to postmortem.

A well-managed incident follows a ritual. Who commands (incident commander), who communicates (comms lead), who investigates (subject matter expert): roles are agreed before the alert fires. The flow below is the Redgator default, adjusted per client when tooling or regulation demands it.

T+0

Detection

Alert fires against SLO. Pager rings. Incident channel created automatically (Slack/Teams).

T+≤7min

Triage

On-call takes IC, classifies severity, opens war room. Bot pulls dashboard, last deploy, correlated alerts.

T+15min

Mitigation

Rollback, feature flag, scale up, circuit breaker. Cause can wait — users cannot. Comms lead notifies the client.

T+30min

Stabilization

Symptom resolved. SME investigates root cause calmly. Status page updated every 15 min until resolution.

T+24h

Resolution

Root cause mitigated (not just symptom). Permanent fix in PR or prioritized backlog. War room closed.

T+5d

Postmortem

Document published. Engineering actions with owner and deadline. Presented in biweekly meeting with the client.
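
The minute- and hour-level checkpoints of this timeline can be turned into concrete deadlines from the detection time. This is an illustrative Python sketch only: `CHECKPOINTS`, `deadlines`, and `overdue` are hypothetical names, and the postmortem step is left out because it is counted in business days.

```python
from datetime import datetime, timedelta, timezone

# Offsets taken from the timeline above (T+≤7min, T+15min, T+30min, T+24h).
CHECKPOINTS = [
    ("triage", timedelta(minutes=7)),
    ("mitigation", timedelta(minutes=15)),
    ("stabilization", timedelta(minutes=30)),
    ("resolution", timedelta(hours=24)),
]


def deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Map each checkpoint to its deadline, counted from detection (T+0)."""
    return {name: detected_at + offset for name, offset in CHECKPOINTS}


def overdue(detected_at: datetime, completed: dict[str, datetime], now: datetime) -> list[str]:
    """Checkpoints finished late, or still open after their deadline has passed."""
    late = []
    for name, due in deadlines(detected_at).items():
        done_at = completed.get(name)
        if (done_at or now) > due:
            late.append(name)
    return late
```

A bot in the incident channel could call `overdue` on every status update to remind the IC which step is slipping.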

G·01
Designated incident commander
The on-call engineer takes IC. Not the most senior person available, but the one on the rotation. Senior engineers join as SMEs when called, without contesting command. Incident hierarchy isn't job hierarchy.
G·02
Comms lead separate from IC
IC investigates and decides. Comms writes updates and talks to stakeholders. In a Sev1 with an affected client, that decoupling is the difference between focused IC and overloaded IC.
G·03
Integrated tooling
PagerDuty, Opsgenie, FireHydrant, Rootly, incident.io. A bot opens the channel, pulls the dashboard, records the timeline, and generates the postmortem draft. We are not dogmatic about tools: we work with whatever the client already has.
G·04
Mitigation first, cause later
Restoring service is the absolute priority. Root-cause investigation starts after stabilization. A rollback is an honest move, not a failure. Rolled back? Document it and move on.
G·05
Severity reassessed in real time
A Sev2 that turns into a Sev1 goes up on the spot. A Sev1 that proves to be Sev2 goes down, with criteria. Severity isn't pride — it is an operational decision.
G·06
External status page when it makes sense
B2B client: direct communication is enough. B2C client with a broad base: public status page (Statuspage, Instatus, Better Stack) updated at the same cadence as the internal one.

03 · Postmortem & learning

Blameless. In 5 business days. Becomes backlog, not PDF.

A postmortem that blames a person doesn't prevent incidents — it produces concealment. A PDF postmortem nobody reads is ceremony. A useful postmortem has three parts: factual timeline, cause analysis (system, not human), actions with owner and deadline. And it is read — because it becomes prioritized backlog in the next sprint.

STRUCTURE · REDGATOR POSTMORTEM

01
Executive summary
One page. Who was affected, for how long, the cause in one sentence.
02
Factual timeline
UTC and local time. No value judgement. Only what happened, when, by whom.
03
Cause analysis
5 Whys or causal tree. Focus on the system, not on "operator error". Every human cause hides a process failure.
04
What worked well
Fast detection, useful runbook, clear communication. Reinforces what is worth repeating.
05
What was missing
Honest. No cover-ups. Identifies gaps in telemetry, runbooks, alerts, and process.
06
Actions with owner and deadline
3 to 8 concrete actions. Each with an owner and a date. Each becomes a sprint issue, not a line in the meeting minutes.

HOUSE RULES

Non-negotiable principles.

▸
Blameless. The person who pressed the wrong button had incomplete context, hostile tooling or fragile process. Attacking the person preserves the broken system.
▸
5 business days. Longer than that, the memory is gone. Sev1 and Sev2 mandatory. Sev3 if a pattern repeats.
▸
Written by whoever lived it. Not by a manager who heard about it. The on-call engineer who commanded the incident is the author.
▸
Tracked action. Each item becomes an issue with a postmortem label. Quarterly burndown shows what was delivered.
▸
Shared with the client. In a biweekly meeting, with our actions and client actions clearly separated. No obfuscation.
▸
Searchable later. Indexed wiki with tags by system, by cause, by client. The next similar incident starts by reading the postmortem of the previous similar one.

INDICATORS · LAST 12 MONTHS

94%

postmortems in ≤5 business days

73%

actions completed in 30d

−41%

recurrence of the same cause

37min

median production MTTR
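
Two of these indicators are easy to recompute from raw incident data. The Python sketch below is an assumption-laden illustration: it counts the 5 business days from the incident date, ignores public holidays, and uses hypothetical function names.

```python
from datetime import date, timedelta
from statistics import median


def business_days_between(start: date, end: date) -> int:
    """Count Mon-Fri days after `start` up to and including `end` (holidays ignored)."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def postmortem_on_time_rate(incidents: list[tuple[date, date]], limit: int = 5) -> float:
    """Share of (incident_day, published_day) pairs published within `limit` business days."""
    on_time = sum(
        1 for incident_day, published_day in incidents
        if business_days_between(incident_day, published_day) <= limit
    )
    return on_time / len(incidents)


def median_mttr_minutes(durations_min: list[float]) -> float:
    """Median time to restore, in minutes."""
    return median(durations_min)
```

An incident on a Monday with a postmortem published the following Monday counts as exactly 5 business days, so it is on time.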

04 · AMS & technical routine

Maintenance that doesn't become debt. Hardening that doesn't become hype.

Application support isn't just putting out fires — it is preventing the next one. Time bank for evolution, hardening cycles, dependency updates, surgical refactoring of what breaks every month. All recorded, all prioritized with the client — no surprises on the invoice, no technical debt piling up in silence.

A·01
Transparent time bank
Monthly package agreed in the contract (16h, 40h, 80h, 160h). Weekly burndown shared: what was consumed, on which request, by whom. A positive balance carries over until the end of the quarter. Unused hours? Refunded.
A·02
Technical backlog prioritized together
Biweekly meeting: postmortem actions, critical dependencies expiring, refactoring that reduces MTTR. Client decides order with impact and cost criteria — not in isolation.
A·03
Scheduled patch & upgrade
Monthly window agreed for OS, runtime, and critical dependencies. A high-severity CVE skips the cycle and becomes a hotfix. Empty window? We give the window back.
A·04
Surgical refactoring
A component that shows up in 4 postmortems in 6 months enters a refactoring cycle. Not a full rewrite: a partial rewrite with tests, behind a feature flag, with a rehearsed rollback.
A·05
Dependency hygiene
Renovate or Dependabot configured with a policy: patch updates automatic, minor updates reviewed, major updates in a hardening cycle. A version from 4 years ago doesn't age well in production.
A·06
Living documentation
C4 diagram updated on every architectural change. ADR for non-trivial decisions. Onboarding a new member of our team becomes a real test of the documentation: if they can't get up to speed by reading, the docs need work.
A·07
Capacity & performance review
Quarterly. Where it is tight: CPU, database, queue, external dependency. Technical and cost recommendation. Anticipating saturation costs less than handling it on-call.
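
The dependency policy in A·05 (patch automatic, minor reviewed, major in a hardening cycle) boils down to classifying a version bump. The sketch below is plain Python written for illustration, not Renovate or Dependabot configuration; it assumes simple `MAJOR.MINOR.PATCH` version strings, and the names `update_kind`, `POLICY`, and `action_for` are hypothetical.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def update_kind(current: str, candidate: str) -> str:
    """Classify the bump from `current` to `candidate`."""
    cur, new = parse(current), parse(candidate)
    if new <= cur:
        return "none"
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"


# Policy wording follows the text; the action strings are illustrative.
POLICY = {
    "patch": "auto-merge",
    "minor": "merge after review",
    "major": "schedule in hardening cycle",
    "none": "ignore",
}


def action_for(current: str, candidate: str) -> str:
    """Look up what the policy says to do with this update."""
    return POLICY[update_kind(current, candidate)]
```

In practice the same rule would normally live in the bot's own configuration; the function form just makes the decision explicit.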

05 · Runbook & knowledge

The client's wiki. Not ours.

Knowledge that lives only with the vendor is held hostage. Here the runbook is born in Confluence, Notion, or GitHub Wiki, whatever the client already uses. It is updated after every incident, read by whoever joins on-call for the first time, and it stays when the contract ends. No dependency, no disguised blackmail.

FORMAT 01

Runbook per alert

Every alert in Grafana/Datadog/PagerDuty links to its corresponding runbook page. Decision-tree diagnosis, copyable commands, criteria for when to escalate.

FORMAT 02

Runbook per system

Overview, dependencies, deploy flow, contacts. Onboarding-ready: a new on-call engineer reads it and can take over.

FORMAT 03

Incident playbook

Known scenarios: RDS down, queue overflow, inter-AZ latency. Each one with symptom, hypothesis, immediate action, verification.

FORMAT 04 · FEATURED

ADR & architectural decision

Architecture Decision Record for every non-trivial choice. Context, options considered, decision, consequences. In 3 years' time, someone needs to understand why it is like this.

FORMAT 05

C4 diagram & dependency map

Context, container, component. Updated when the system changes — not at the end of the project. Mermaid in markdown, versioned in Git.

FORMAT 06

Indexed postmortem

Tagged by system, by cause, by client. The next similar incident starts by reading the previous one. Compound knowledge.

R·01
Lives where the client already is
Confluence, Notion, GitHub Wiki, GitBook, Outline. We don't create a parallel Redgator portal — we use what the client team already opens.
R·02
Updated after every incident
Postmortem with an action "update runbook X" is standard practice. Pages show last-review date — if it crosses 90 days without an update, it shows up in the hygiene alert.
R·03
Read in routine, not only in crisis
New on-call onboarding is a guided session reading the runbook and simulating a past incident. If the docs don't work in a calm room, they won't work at 3 a.m.
R·04
Exit without dependency
End of contract: formal handover, reviewed runbook, Q&A session with the client team. 30 days provided for in the contract. Renew only if it's better than the alternative.
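
The 90-day hygiene rule in R·02 can be sketched as a simple date comparison. This Python example assumes the wiki can export a mapping of page name to last-review date; `STALE_AFTER` and `stale_runbooks` are hypothetical names, and "crosses 90 days" is read as strictly more than 90 days.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)  # threshold from the text


def stale_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return, in alphabetical order, the pages not reviewed for more than 90 days."""
    return sorted(
        page for page, reviewed in last_reviewed.items()
        if today - reviewed > STALE_AFTER
    )
```

The result can feed the hygiene alert directly, one item per page that needs a review.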

06 · Continuous hardening

The next incident is cheaper than the last.

Hardening isn't a separate project; it is a habit. Every postmortem, every game day, every quarterly review produces a list of gaps. Those gaps enter the time bank in priority order: what reduces MTTR first, what improves SLO attainment second, what cuts cloud cost (FinOps) in parallel. In 6 months the system is more resilient without anyone having run a "hardening project".

H·01
Quarterly game day
Failover rehearsed in staging. AZ taken down, RDS promoted, external dependency simulated as unavailable. Hypothesis before the experiment, result becomes backlog. DR that nobody tested doesn't exist.
H·02
Chaos engineering when it makes sense
LitmusChaos, AWS FIS, Gremlin. Random pod kill during business hours only when the system already tolerates it. Not vandalism — a controlled experiment with a hypothesis and metric.
H·03
Load test against SLO
k6, Gatling, Locust. Simulated load before Black Friday, fiscal closing, or a launch. Knowing where it breaks beforehand is cheaper than finding out on the day.
H·04
Alert review
Quarterly: an alert that fires often without action is noise. An alert that never fires is dead weight. Continuous refinement: the pager rings less, and when it rings, it matters.
H·05
Tabletop exercise
Scenario in a room: "Primary RDS goes down at 2 p.m. on a Tuesday". The client team + Redgator walk through the decision tree. Identifies gaps before the real incident.
H·06
Iterative threat modeling
For systems with sensitive data or regulation. STRIDE light, with semi-annual updates. Output: a security hardening backlog that doesn't compete with features.
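
The quarterly alert review in H·04 can start from two counters per alert: how often it fired and how often someone had to act. A minimal Python sketch, where `AlertStats`, `review`, and the 0.5 action-ratio cutoff are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    fired: int     # times the alert fired in the quarter
    actioned: int  # times a human actually had to do something


def review(alert: AlertStats, min_action_ratio: float = 0.5) -> str:
    """Suggest a review outcome; the 0.5 cutoff is an arbitrary starting point."""
    if alert.fired == 0:
        return "never fired: verify it still works or delete it"
    if alert.actioned / alert.fired < min_action_ratio:
        return "noise: tune the threshold or demote to a dashboard"
    return "keep"
```

The output is a prompt for the review meeting, not an automatic decision.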

07 · Contract & SLA

Contractual penalty when we miss. 30-day exit.

A traditional support contract protects the vendor: loose SLA, symbolic penalty, 36-month lock-in. Here it is different. The SLA has teeth, the penalty is calculated monthly, and the exit follows a 30-day process. If we are still here, it is because we are better than the alternative, not because the contract holds you.

C·01
SLA per severity, with penalty
Sev1 and Sev2 with a financial penalty for non-compliance. Calculated monthly, deducted from the next invoice. Not decoration — enters the contract with a clear clause.
C·02
30-day exit
30-day notice, formal handover, runbook reviewed, Q&A session. No termination penalty. Whoever stays, stays because it is worth it.
C·03
Time bank with public burndown
What was consumed, on which request, by whom. Balance visible weekly. Unused hours? Refunded. The balance accumulates within the quarter and never turns into undelivered revenue.
C·04
Predictable adjustment
IPCA + 3% on the anniversary. No off-cycle adjustments. No disguised "scope adjustment". You know the on-call budget for the next 12 months, to the letter.
C·05
Quarterly internal NPS
The client team evaluates the Redgator team. Low scores become honest conversation, no retaliation. If we are being annoying, we want to know before renewal, not after.
C·06
Renewal optional, not automatic
In month 11 we talk: worth renewing? What scope adjustment, what change in the time bank? Inertia is not strategy — for either of us.
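
The time-bank rule in C·03 (the balance accumulates within the quarter, unused hours are refunded) can be worked through with a small calculation. This Python sketch is one possible reading: the function name is hypothetical, and the handling of months that exceed the available balance (tracked as `overage_hours`) is an assumption, since the text does not say how overage is treated.

```python
def quarter_time_bank(package_hours: float, used_per_month: list[float]) -> dict[str, float]:
    """Roll a monthly hour package through one quarter (three months)."""
    if len(used_per_month) != 3:
        raise ValueError("expected exactly three months of usage")
    balance = 0.0
    overage = 0.0
    for used in used_per_month:
        balance += package_hours  # the monthly package is credited
        if used > balance:
            overage += used - balance  # assumption: overage is tracked separately
            balance = 0.0
        else:
            balance -= used  # the unused remainder carries to the next month
    return {"refundable_hours": balance, "overage_hours": overage}
```

With a 40-hour package and usage of 30, 45, and 20 hours, the second month draws on the 10 hours carried over from the first, and 25 hours remain refundable at quarter end.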

Receive an on-call proposal → See DevOps & SRE

What we DON'T do in support

On-call isn't a ticket turnstile.

NO
Application help desk
Answering end users with functional questions isn't our mandate. Product support stays with the client team — we enter when the problem is technical, not behavioral.
NO
NOC with severity upsell
That model where tier-1 always thinks it's a Sev2, tier-2 turns it into a Sev1, and the invoice grows with the queue. Here severity is technical — not commercial.
NO
Hot fix without postmortem
Putting out a fire without understanding the cause guarantees the next fire starts on the same corner. A Sev1 without a postmortem in 5 business days is a failure of our process, and we account for it in the retrospective.
NO
Lock-in via hidden knowledge
Runbook in a system only we use, postmortem in a proprietary format, dependency on a specific person. 30-day exit is a contract clause — and it works because knowledge lives in the client from day 1.

Recent cases in Support

2025·041 · Brazilian telecommunications company (under NDA) · industry · 2026
~40% reduction in Oracle RDS cost without compromising performance.
Oracle RDS cost assessment combining AWS metrics, Oracle workload behavior, and criticality classification. Identified a ~40% cost reduction without degrading production.
−40% Oracle RDS cost

2025·099 · Oracle customer (under NDA) · industry · 2025
Migration Readiness Assessment: know before committing.
Structured assessment before committing to migration. Evaluates architecture, volume, downtime, dependencies, performance, security, licensing, and operational readiness.
Go/No-Go objective criteria

2025·091 · Leading US healthcare and insurance enterprise (under NDA) · healthcare · 2025
50 TB Oracle ERP migration with sub-1-hour rollback.
Mission-critical 50 TB Oracle ERP migration after a previous vendor attempt failed. Compressed 3-month window, rollback designed for under 1 hour.
50 TB ERP migrated

2025·072 · Financial group (under NDA) · finance · 2025
22 Oracle databases, 100 TB+: phased modernization.
22 Oracle databases, 100 TB+, phased migration planning between Exadata and AIX. Wave sequencing, risk mapping, per-workload validation.
22 databases · 100 TB+

Support & On-call

Nameless on-call, postmortems without an owner, debt piling up? Tell us the problem.

45 min with an SRE lead. No pitch.

Schedule a chat →
