Outage Procedures
Audience: A&E staff, service owners, and on-call engineers
Purpose: Guide for posting IT Alerts and coordinating outage communications
Quick Reference
| Contact Method | Details |
|---|---|
| Email | outage@tamu.edu |
| Teams | IT Operations Center team |
| Phone | Contact IOC dispatcher |
Always coordinate with the IT Operations Center (IOC) for service outages. Do not post IT Alerts independently—use the official process below.
IT Alert Process
Overview
IT Alerts notify the campus community about service disruptions, scheduled maintenance, and restored services. The IT Operations Center (IOC) manages IT Alert postings on behalf of service teams.
Submitting an IT Alert Request
Step 1: Gather Required Information
Before contacting the IOC, collect the following details:
| Field | Description | Example |
|---|---|---|
| Service Name | Official service or system name | "AggieCloud", "Exchange Online" |
| Impact Description | What users are experiencing | "Users unable to access email via Outlook" |
| Affected Population | Who is impacted | "All faculty and staff", "Engineering college" |
| Start Time | When the issue began | "2024-01-15 14:30 CST" |
| Estimated Resolution | Expected fix time (if known) | "Within 2 hours" or "Unknown" |
| Workaround | Alternative access method (if available) | "Use Outlook Web Access at outlook.office.com" |
| Contact Person | Engineer handling the issue | "John Smith, x5-1234" |
Step 2: Contact the IOC
Choose the appropriate contact method based on urgency:
Email (Standard Priority)
Send an email to outage@tamu.edu with:
Subject: [IT ALERT REQUEST] Service Name - Brief Description
Body:
Service: [Service Name]
Impact: [Description of user impact]
Affected: [Population affected]
Start Time: [When issue began]
Est. Resolution: [Expected fix time]
Workaround: [Alternative if available]
Contact: [Your name and phone]
Additional Details:
[Any other relevant information]
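The template above can also be filled in programmatically. The following is a minimal sketch, not an official IOC tool: the `AlertRequest` dataclass, the `build_alert_email` helper, and the `brief` field are hypothetical names introduced for illustration. It only builds the message using Python's standard `email` library and does not send anything.

```python
from dataclasses import dataclass
from email.message import EmailMessage


@dataclass
class AlertRequest:
    """Hypothetical container for the Step 1 fields (not an official IOC schema)."""
    service: str
    brief: str                      # short description used in the subject line
    impact: str
    affected: str
    start_time: str
    contact: str
    est_resolution: str = "Unknown"
    workaround: str = "None"
    details: str = ""


def build_alert_email(req: AlertRequest, sender: str) -> EmailMessage:
    """Fill in the IT Alert request template as an email message (not sent)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "outage@tamu.edu"
    msg["Subject"] = f"[IT ALERT REQUEST] {req.service} - {req.brief}"
    body = "\n".join([
        f"Service: {req.service}",
        f"Impact: {req.impact}",
        f"Affected: {req.affected}",
        f"Start Time: {req.start_time}",
        f"Est. Resolution: {req.est_resolution}",
        f"Workaround: {req.workaround}",
        f"Contact: {req.contact}",
        "",
        "Additional Details:",
        req.details,
    ])
    msg.set_content(body)
    return msg
```

Leaving `est_resolution` and `workaround` at their defaults mirrors the "if known" and "if available" wording in the required-information table.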
Microsoft Teams (Urgent)
For faster response on urgent issues:
- Navigate to the IT Operations Center team in Microsoft Teams
- Post in the appropriate channel
- @mention the IOC dispatcher if critical
- Include all required information
- Indicate urgency level
Phone (Critical/P1)
For critical outages affecting large populations:
- Call the IOC dispatcher directly
- Provide a verbal summary of the outage
- Follow up with email documentation
For after-hours emergencies, use the on-call escalation process through the IOC.
Step 3: Monitor and Update
During the outage:
- Provide Status Updates — Send progress updates to the IOC every 30-60 minutes
- Notify of Changes — Alert IOC if scope, impact, or timeline changes
- Report Resolution — Immediately notify IOC when service is restored
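The 30-60 minute update cadence above can be tracked with a small helper. This is an illustrative sketch only: the function name `update_status` and the three status labels are assumptions, not part of any IOC system.

```python
from datetime import datetime, timedelta

# Cadence from the guidance above: send an update every 30-60 minutes.
MIN_INTERVAL = timedelta(minutes=30)
MAX_INTERVAL = timedelta(minutes=60)


def update_status(last_update: datetime, now: datetime) -> str:
    """Classify how urgently the next status update to the IOC is needed.

    Returns one of three hypothetical labels:
      "ok"      - less than 30 minutes since the last update
      "due"     - between 30 and 60 minutes (inclusive) since the last update
      "overdue" - more than 60 minutes since the last update
    """
    elapsed = now - last_update
    if elapsed < MIN_INTERVAL:
        return "ok"
    if elapsed <= MAX_INTERVAL:
        return "due"
    return "overdue"
```

A scope, impact, or timeline change should still be reported right away, regardless of what this timer says.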
IT Alert Types
| Alert Type | Description | When to Use |
|---|---|---|
| Outage | Unplanned service disruption | Service is down or degraded |
| Maintenance | Scheduled service window | Planned maintenance with expected impact |
| Security | Security-related notification | Phishing, compromise, or security incident |
| Update | Status update on existing alert | Progress report or scope change |
| Resolved | Service restoration notice | Issue has been fixed |
Escalation Procedures
Severity Levels
| Level | Definition | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Major service down, large population | Immediate | Director/CIO notification |
| P2 - High | Significant degradation, medium impact | 15 minutes | Team lead notification |
| P3 - Medium | Partial impact, workaround available | 30 minutes | Standard process |
| P4 - Low | Minor issue, limited impact | 1 hour | Standard process |
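For tooling that needs the severity table in machine-readable form, one possible encoding is sketched below. The names `Severity`, `SEVERITY_LEVELS`, and `respond_by` are hypothetical, and modeling "Immediate" as a zero-length interval is an assumption made for this example.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Severity(NamedTuple):
    definition: str
    response_time: timedelta
    escalation: str


# Values copied from the severity table; "Immediate" is modeled as timedelta(0).
SEVERITY_LEVELS = {
    "P1": Severity("Major service down, large population",
                   timedelta(0), "Director/CIO notification"),
    "P2": Severity("Significant degradation, medium impact",
                   timedelta(minutes=15), "Team lead notification"),
    "P3": Severity("Partial impact, workaround available",
                   timedelta(minutes=30), "Standard process"),
    "P4": Severity("Minor issue, limited impact",
                   timedelta(hours=1), "Standard process"),
}


def respond_by(level: str, reported_at: datetime) -> datetime:
    """Return the latest time a response is expected for the given level."""
    return reported_at + SEVERITY_LEVELS[level].response_time
```

An unknown level raises `KeyError`, which keeps a typo such as "P5" from silently receiving a default response time.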
Escalation Path
Post-Incident Activities
After resolution:
- Confirm Resolution — Verify service is fully restored
- Request Resolution Alert — Ask IOC to post "Resolved" update
- Document Timeline — Record incident details for post-mortem
- Root Cause Analysis — Complete RCA for P1/P2 incidents
- Lessons Learned — Share findings with team
Post-Mortem Template
For significant outages, complete a post-mortem including:
- Incident timeline
- Root cause analysis
- Impact assessment (duration, users affected)
- What went well
- What could be improved
- Action items with owners and due dates
Related Resources
- Teams Overview — Contact information for A&E teams
- FinOps Program — Cost impact considerations
- ServiceNow Catalog — Incident ticket creation