Building a High-Performance SOC: From Alert Triage to Incident Closure

Introduction

A Security Operations Center (SOC) is the nerve center of an organization’s cybersecurity posture. Whether you are building a SOC from scratch inside a fast-growing fintech company, or maturing an existing team at an enterprise, the fundamentals remain the same: the right people, effective processes, and well-tuned technology must work in concert to detect, investigate, and respond to threats before they become breaches.

In this article, I draw on my experience deploying and operating SOC capabilities at PostEx — a fintech and logistics company — to walk through the practical realities of building a high-performance SOC. We will cover SOC tiers and team structure, SIEM deployment and tuning, alert triage workflows, escalation procedures, and the KPIs that actually matter to security managers and CISOs.

The goal of a SOC is not to generate alerts. The goal is to reduce the time between a threat entering your environment and the moment you contain it.

SOC Tier Structure and Responsibilities

Most mature SOCs operate on a tiered model that routes alerts to the appropriate level of analyst based on complexity and required expertise. This prevents your senior engineers from drowning in low-fidelity noise while ensuring high-fidelity alerts receive immediate expert attention.

Tier 1 — Alert Monitoring and Initial Triage

Tier 1 analysts are the first eyes on every alert generated by the SIEM. Their primary responsibilities include:

  • Monitoring SIEM dashboards for active alerts in real time
  • Performing initial alert enrichment — querying threat intel, OSINT, and internal asset databases
  • Determining whether an alert is a true positive, false positive, or requires escalation
  • Documenting initial findings and opening tickets for escalation
  • Executing pre-defined response playbooks for known alert types

At this tier, speed is paramount. Mean time to acknowledge should be under 15 minutes for high- and critical-severity alerts. The tools Tier 1 analysts use daily include the SIEM (Splunk, ELK, or QRadar), EDR console, threat intelligence platforms, and ticketing systems like JIRA or ServiceNow.
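
The 15-minute acknowledgment target can be checked mechanically. The following is a minimal sketch, assuming each alert carries a severity label plus creation and acknowledgment timestamps. The `ACK_SLA` table and the function name are illustrative, and the medium and low targets are placeholder values, not figures from our environment.

```python
from datetime import datetime, timedelta
from typing import Optional

# Acknowledgment targets per severity. The 15-minute target for high and
# critical alerts comes from the article; the other values are assumed.
ACK_SLA = {
    "critical": timedelta(minutes=15),
    "high": timedelta(minutes=15),
    "medium": timedelta(hours=1),   # assumed value
    "low": timedelta(hours=4),      # assumed value
}

def ack_sla_breached(severity: str, created: datetime,
                     acknowledged: Optional[datetime], now: datetime) -> bool:
    """Return True if the alert was, or still is, unacknowledged past its target."""
    deadline = created + ACK_SLA[severity.lower()]
    # An alert nobody has picked up yet is judged against the current time.
    effective = acknowledged if acknowledged is not None else now
    return effective > deadline
```

Running a check like this on a schedule makes it possible to page a shift lead before a high-severity alert ages out, instead of discovering the breach in a monthly report.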

Tier 2 — Incident Investigation and Analysis

Tier 2 analysts handle escalated incidents and perform deeper forensic analysis. They are expected to:

  • Conduct host and network forensics on impacted systems
  • Correlate events across multiple log sources and timeframes
  • Determine attack scope, root cause, and affected assets
  • Recommend or execute containment and remediation actions
  • Produce detailed incident reports for stakeholders

Tier 2 analysts should be proficient in log analysis, memory forensics, network packet analysis (Wireshark, Zeek), and malware behavior analysis. Their work feeds directly into the detection engineering pipeline by identifying detection gaps and false positive root causes.

Tier 3 — Threat Hunting and Detection Engineering

Tier 3 is proactive. Rather than waiting for alerts, Tier 3 analysts actively hunt for threats that have bypassed automated detection. They also own the detection engineering function — writing, testing, and tuning detection rules that feed the SIEM.

The cycle from Tier 3 to Tier 1 is critical: threat hunters discover novel attack techniques, detection engineers encode them as rules, and Tier 1 monitors the alerts those rules generate at scale. This continuous improvement loop is what separates a mature SOC from a reactive one.

SIEM as the SOC Foundation

The SIEM is the central nervous system of any SOC. At PostEx, we deployed Splunk Enterprise to centralize log aggregation from:

  • Local network infrastructure — switches, access points, internal servers
  • Global network components — WAN links, remote sites, cloud-connected infrastructure
  • Core routers — MikroTik and Cisco devices feeding NetFlow and syslog
  • Infrastructure assets — Windows endpoints via Universal Forwarders, Linux servers via syslog-ng, VMware ESXi hosts

Prior to Splunk, we operated an ELK Stack + Wazuh SIEM that covered 500+ endpoints. The Splunk migration expanded visibility significantly, adding network device telemetry, router logs, and cloud workload data that was previously out of scope.

[Screenshot: Splunk Enterprise main dashboard showing event volume by source, alert timeline, and top talkers]

Critical Log Sources

Not all log sources are created equal. In our environment, the highest signal-to-noise ratio came from the following sources:

# High-priority log sources ranked by detection value
1. Windows Security Event Logs (Event IDs 4624, 4625, 4672, 4688, 4698, 4702)
2. DNS Query Logs (C2 detection, DNS tunneling)
3. Firewall/Proxy Logs (lateral movement, exfiltration detection)
4. EDR Telemetry (process execution, file writes, network connections)
5. Active Directory Logs (privilege escalation, Kerberoasting)
6. NetFlow/IPFIX (internal network anomalies)
7. VPN Authentication Logs (credential stuffing, impossible travel)
8. Web Application Firewall Logs (SQLi, XSS, brute force)
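
To make the first entry concrete, here is a minimal sketch of a sliding-window detection over failed logons (Event ID 4625). The event dictionary keys (`event_id`, `src_ip`, `time`), the threshold, and the window size are assumptions for illustration and would need tuning against real traffic. It is not the rule we ran in production.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Meanings of the Windows Security Event IDs listed above.
EVENT_IDS = {
    4624: "Successful logon",
    4625: "Failed logon",
    4672: "Special privileges assigned to new logon",
    4688: "New process created",
    4698: "Scheduled task created",
    4702: "Scheduled task updated",
}

def failed_logon_bursts(events, threshold=5, window=timedelta(minutes=5)):
    """Return source IPs with at least `threshold` 4625 events inside `window`.

    `events` is an iterable of dicts with hypothetical keys "event_id",
    "src_ip", and "time" (a datetime), sorted by time.
    """
    recent = defaultdict(deque)   # src_ip -> timestamps still inside the window
    flagged = set()
    for ev in events:
        if ev["event_id"] != 4625:
            continue
        times = recent[ev["src_ip"]]
        times.append(ev["time"])
        # Drop timestamps that have fallen out of the sliding window.
        while ev["time"] - times[0] > window:
            times.popleft()
        if len(times) >= threshold:
            flagged.add(ev["src_ip"])
    return flagged
```

In a SIEM the same logic would normally live in a scheduled search or correlation rule; the Python version only shows the windowing idea.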

Alert Triage Workflow

A well-designed alert triage workflow is the difference between an overwhelmed SOC drowning in noise and an efficient team that closes incidents quickly. The following is the workflow we implemented at PostEx:

ALERT TRIAGE WORKFLOW

[SIEM Alert Generated]
        |
        v
[Tier 1 Acknowledges < 15 min for HIGH/CRITICAL]
        |
        v
[Initial Enrichment]
  - Threat Intel lookup (IP, domain, hash)
  - Asset lookup (owner, criticality, exposure)
  - User context (role, recent activity, anomaly score)
        |
        v
[Decision]
  False Positive?  ---> [Document FP reason, tune rule, close]
  True Positive?   ---> [Escalate to Tier 2, open incident ticket]
  Uncertain?       ---> [Escalate with context, set SLA timer]
        |
        v
[Tier 2 Investigates]
  - Forensic deep-dive
  - Scope determination
  - Containment decision
        |
        v
[Containment & Remediation]
  - Isolate host / block IP / revoke credentials
  - Patch, clean, or rebuild affected systems
  - Verify containment effectiveness
        |
        v
[Post-Incident Report]
  - Timeline reconstruction
  - Root cause analysis
  - Detection gap identification
  - Lessons learned -> Detection Engineering
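
The decision step in the workflow above can be sketched as a small function. This is an illustration under stated assumptions: the `Enrichment` fields, the anomaly threshold, and the decision rules themselves are hypothetical stand-ins for the judgment a Tier 1 analyst applies. Only the three outcome branches are taken from the workflow.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    # The three branches of the [Decision] step and their follow-up actions.
    FALSE_POSITIVE = "document FP reason, tune rule, close"
    TRUE_POSITIVE = "escalate to Tier 2, open incident ticket"
    UNCERTAIN = "escalate with context, set SLA timer"

@dataclass
class Enrichment:
    intel_hit: bool        # IP, domain, or hash matched threat intel
    asset_critical: bool   # asset lookup marked the host as critical
    anomaly_score: float   # user-context anomaly score in [0, 1]
    known_benign: bool     # matches a documented false-positive pattern

def triage(e: Enrichment, anomaly_threshold: float = 0.8) -> Verdict:
    """Map enrichment results to one of the three workflow branches."""
    if e.known_benign and not e.intel_hit:
        return Verdict.FALSE_POSITIVE
    if e.intel_hit or (e.asset_critical and e.anomaly_score >= anomaly_threshold):
        return Verdict.TRUE_POSITIVE
    # Anything that is neither clearly benign nor clearly malicious
    # goes up with context rather than being closed.
    return Verdict.UNCERTAIN
```

Defaulting to `UNCERTAIN` is a deliberate choice in this sketch: when enrichment is inconclusive, escalating with context is cheaper than closing a real incident as noise.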

MITRE ATT&CK Integration

Mapping SOC operations to the MITRE ATT&CK framework provides a structured vocabulary for describing adversary behavior and measuring your detection coverage. Every detection rule in our environment is tagged with a MITRE ATT&CK technique ID.

ATT&CK Tactic        | Technique                                | Detection Source         | Coverage
Initial Access       | T1190 Exploit Public-Facing App          | WAF, Web Logs            | 🟢 High
Execution            | T1059 Command & Scripting Interpreter    | EDR, Windows Events      | 🟢 High
Persistence          | T1053 Scheduled Tasks                    | Windows Events 4698/4702 | 🟢 High
Privilege Escalation | T1078 Valid Accounts                     | AD Logs, SIEM            | 🟡 Medium
Defense Evasion      | T1070 Indicator Removal                  | EDR, File Monitoring     | 🟡 Medium
Lateral Movement     | T1021 Remote Services                    | Firewall, NetFlow        | 🟢 High
Exfiltration         | T1041 Exfil Over C2 Channel              | Proxy, DNS Logs          | 🟡 Medium
C2                   | T1071 Application Layer Protocol         | DNS, HTTP Logs           | 🟡 Medium
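
Once rules are tagged with technique IDs, coverage can be summarized programmatically. The sketch below copies the rows from the table above into a list and counts techniques per coverage level. The data structure and function names are illustrative, and a real implementation would read the tags from the SIEM rule metadata instead of a hard-coded list.

```python
from collections import Counter

# (tactic, technique ID, coverage) rows copied from the table above.
COVERAGE = [
    ("Initial Access", "T1190", "High"),
    ("Execution", "T1059", "High"),
    ("Persistence", "T1053", "High"),
    ("Privilege Escalation", "T1078", "Medium"),
    ("Defense Evasion", "T1070", "Medium"),
    ("Lateral Movement", "T1021", "High"),
    ("Exfiltration", "T1041", "Medium"),
    ("C2", "T1071", "Medium"),
]

def coverage_summary(rows):
    """Count techniques per coverage level."""
    return Counter(level for _, _, level in rows)

def needs_work(rows, acceptable=("High",)):
    """Return technique IDs whose coverage is below the acceptable levels."""
    return [tid for _, tid, level in rows if level not in acceptable]
```

The output of `needs_work(COVERAGE)` is a natural starting backlog for detection engineering.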

SOC KPIs That Matter

Measuring SOC effectiveness is critical for justifying investment and identifying improvement areas. The metrics that carry the most weight with CISOs and security managers are:

  • Mean Time to Detect (MTTD) — Average time from threat entry to alert generation. Target: <60 minutes.
  • Mean Time to Respond (MTTR) — Average time from detection to containment. Target: <4 hours for critical incidents.
  • Mean Time to Notify (MTTN) — Time from alert to analyst notification. Our automation achieved <2 minutes.
  • False Positive Rate — Percentage of alerts that are not genuine threats. We reduced this by 40% through rule tuning.
  • Dwell Time — How long an attacker was inside the environment before detection, measured per incident (MTTD is the average of the same interval across incidents). Target: <24 hours.
  • Escalation Rate — Percentage of Tier 1 alerts escalated to Tier 2. Persistently high rates can indicate rule quality issues or gaps in Tier 1 playbooks.
  • Incidents Closed Per Analyst Per Week — Team productivity metric.
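
Several of these metrics reduce to simple arithmetic over incident timestamps and alert counts. The following is a minimal sketch, assuming incidents are exported from the ticketing system as records with `entered`, `detected`, and `contained` timestamps. Those key names and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

def mean_delta(pairs):
    """Mean of (end - start) over an iterable of datetime pairs."""
    return timedelta(seconds=mean((end - start).total_seconds() for start, end in pairs))

def soc_kpis(incidents, alerts_total, alerts_false_positive, alerts_escalated):
    """Compute MTTD, MTTR, false positive rate, and escalation rate.

    `incidents` is a list of dicts with hypothetical keys "entered",
    "detected", and "contained" (all datetimes).
    """
    return {
        # Threat entry to alert generation, averaged across incidents.
        "mttd": mean_delta((i["entered"], i["detected"]) for i in incidents),
        # Detection to containment, averaged across incidents.
        "mttr": mean_delta((i["detected"], i["contained"]) for i in incidents),
        "false_positive_rate": alerts_false_positive / alerts_total,
        "escalation_rate": alerts_escalated / alerts_total,
    }
```

A mean hides outliers, so it is worth reporting the worst single incident alongside these averages when presenting to management.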

SOC Automation Opportunities

Manual triage does not scale. As your log volume grows, the only way to maintain response quality is to automate repetitive tasks. At PostEx, we implemented n8n-based playbook automation for the following use cases:

  • Alert notification — Wazuh alerts automatically routed to Telegram with enrichment data, achieving <2 minutes MTTN
  • IP reputation enrichment — Automatic VirusTotal and AbuseIPDB lookups on every external IP in an alert
  • JIRA ticket creation — High-severity alerts automatically create JIRA tickets with pre-populated fields
  • Account lockout response — Brute force detection triggers automatic AD account review workflow
  • Hash reputation lookup — Suspicious file hashes automatically checked against MalwareBazaar and VirusTotal

# Example: Wazuh alert webhook to Telegram via n8n (each value is the path of that field in the Wazuh alert JSON)
{
  "webhook": "https://your-n8n-instance/webhook/wazuh-alerts",
  "payload": {
    "alert_level": "rule.level",
    "rule_id": "rule.id",
    "rule_description": "rule.description",
    "agent_name": "agent.name",
    "source_ip": "data.srcip",
    "timestamp": "timestamp"
  }
}
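
The mapping above pairs each output field with a dotted path into the alert. The sketch below shows, in Python, the resolution step a workflow has to perform to turn a nested alert into a flat notification payload. It is an illustration of the logic only, not the n8n configuration we used, and the helper names are hypothetical.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as "rule.level" through nested dicts."""
    for key in dotted.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None   # missing fields resolve to None instead of raising
        doc = doc[key]
    return doc

# Field mapping taken from the example above.
FIELD_MAP = {
    "alert_level": "rule.level",
    "rule_id": "rule.id",
    "rule_description": "rule.description",
    "agent_name": "agent.name",
    "source_ip": "data.srcip",
    "timestamp": "timestamp",
}

def build_payload(alert):
    """Flatten a Wazuh-style alert dict into the notification payload."""
    return {name: get_path(alert, path) for name, path in FIELD_MAP.items()}

def format_message(payload):
    """Render the payload as a plain-text chat message, one field per line."""
    return "\n".join(f"{key}: {value}" for key, value in payload.items())
```

Returning `None` for missing fields matters in practice: not every alert carries a source IP, and a notification with a blank field is more useful than a workflow that fails silently.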

Lessons Learned

After 3+ years running SOC operations in a fintech environment, the most important lessons I have learned are:

  • Tune before you scale. Adding more log sources without tuning existing rules will bury your analysts in noise.
  • Document everything. Incident reports, playbooks, runbooks — institutional knowledge is fragile without documentation.
  • Measure your blind spots. Use ATT&CK Navigator to visualize detection coverage gaps and prioritize detection engineering work.
  • Automate the boring parts. Alert enrichment, ticketing, and notification should all be automated so analysts can focus on analysis.
  • Threat hunt regularly. Automated detection will never catch everything. Schedule structured threat hunting exercises monthly.

Conclusion

Building a high-performance SOC is a continuous journey, not a one-time project. The organizations that succeed are those that treat their SOC as a living system — constantly measuring, tuning, and improving. Whether you are standing up your first SIEM or scaling a mature operation, the principles remain constant: centralize visibility, reduce noise, automate response, and hunt for what your rules cannot see.

References: NIST SP 800-61r2 — Computer Security Incident Handling Guide | MITRE ATT&CK Framework (attack.mitre.org) | Splunk Security Essentials Documentation | Wazuh Documentation (documentation.wazuh.com)