Quick take
Your image scan is green. Congratulations. You’ve answered “what’s in this container.” You haven’t answered “what is this container doing right now.” Runtime security is how you answer that second question – through least privilege, kernel-level controls, network policy, and behavioral detection. This post walks through exactly how we do it at Decloud, with real configs.
I’ve had this conversation too many times. Someone shows me a green Trivy scan and says “we’re secure.” No. You know what’s in the image. You have no idea what it’s doing.
At Decloud we run a lot of containers. Hundreds of services. And after spending time in NATO cyber defense, I came away with one conviction that shapes everything: assume the perimeter is already breached. The question isn’t whether something gets in. It’s how much damage it can do once it’s there.
That’s runtime security. Not a product. Not a checkbox. A set of controls that limit what containers can do, watch what they’re doing, and react when behavior goes sideways.
What image scanning doesn’t catch
Static scanning is useful. I’m not dismissing it. But it’s blind to:
- Zero-days. By definition your scanner doesn’t know about them yet.
- Stolen credentials being used inside a running container.
- Lateral movement. A compromised container talking to your database.
- Data exfiltration over DNS or unexpected outbound connections.
- Runtime drift. Someone `kubectl exec`'d in and installed curl. Your image is clean. Your container isn't.
The mental model shift is important. Scanning is about contents. Runtime security is about behavior.
What “suspicious” actually looks like in production
Forget movie-hacker nonsense. In real production clusters, compromise looks boring:
- A shell spawns in a container that should only run a Go binary.
- A container starts making outbound connections to IPs it’s never talked to.
- Something writes to `/etc/passwd` or drops a new binary in `/tmp`.
- CPU spikes because someone's mining Monero in your API pod.
- A container that’s been running for 40 days suddenly starts probing the Kubernetes API.
None of these trip an image scanner. All of them should wake someone up.
Start with the Dockerfile. Seriously.
Before you install Falco or write a single policy, harden the image itself. This is where most teams skip steps and then wonder why their runtime security is noisy.
Here’s what a production Dockerfile should look like:
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /server .
FROM scratch
COPY --from=builder /server /server
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
USER 65534:65534
ENTRYPOINT ["/server"]
Key things here: multi-stage build so no compiler or tools in the final image. scratch base so there’s literally no shell, no package manager, no curl, no nothing. Non-root user. If an attacker gets code execution in this container, they land in an empty filesystem with no tools and no root. Good luck.
At Decloud we moved almost all Go services to scratch or distroless. It eliminates entire categories of post-exploitation. Can’t spawn a shell if there’s no shell.
Pod security context: the non-negotiable baseline
Every pod in your cluster should have this. No exceptions.
securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
readOnlyRootFilesystem is the one people push back on. “But my app writes temp files!” Fine. Mount a single emptyDir for /tmp and nothing else. Don’t give up the whole filesystem because one library wants to write a cache file.
volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir:
      sizeLimit: 64Mi
At Decloud, we enforce this via OPA Gatekeeper. If your pod spec requests privileged: true or doesn’t drop all capabilities, the admission controller rejects it. Period.
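The Gatekeeper side is a ConstraintTemplate whose Rego rejects the offending pod specs at admission. Here's a trimmed sketch, not our exact production template (the kind name and messages are illustrative):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenyprivileged
spec:
  crd:
    spec:
      names:
        kind: K8sDenyPrivileged
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenyprivileged

        # Reject any container that asks for privileged mode.
        violation[{"msg": msg}] {
          c := input.review.object.spec.containers[_]
          c.securityContext.privileged
          msg := sprintf("privileged container not allowed: %v", [c.name])
        }

        # Reject containers that don't drop capabilities at all.
        violation[{"msg": msg}] {
          c := input.review.object.spec.containers[_]
          not c.securityContext.capabilities.drop
          msg := sprintf("container must drop capabilities: %v", [c.name])
        }
```

A matching Constraint object then scopes this to the namespaces you care about; a production version would also check for `drop: ["ALL"]` specifically rather than just presence.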
Seccomp: the kernel-level lockdown most people ignore
Here’s the thing about Linux containers – they share the host kernel. Every syscall your container makes goes through the same kernel as every other container on that node. Seccomp lets you whitelist which syscalls a container is allowed to make.
The default Docker seccomp profile blocks about 44 syscalls. That’s a start. But for a well-behaved Go HTTP server, you need maybe 40-50 syscalls total out of the 300+ available. Everything else should be denied.
Here’s a stripped-down seccomp profile we use for stateless API services:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "accept4", "bind", "brk", "clone", "close",
        "connect", "epoll_create1", "epoll_ctl", "epoll_pwait",
        "exit_group", "fcntl", "fstat", "futex", "getpeername",
        "getpid", "getsockname", "getsockopt", "listen",
        "madvise", "mmap", "mprotect", "munmap", "nanosleep",
        "openat", "read", "recvfrom", "rt_sigaction",
        "rt_sigprocmask", "rt_sigreturn", "sched_yield",
        "sendto", "setsockopt", "sigaltstack", "socket",
        "tgkill", "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
Apply it in your pod spec:
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/api-server.json
This is defense in depth in the truest sense. Even if an attacker gets arbitrary code execution, they can’t call ptrace, can’t load kernel modules, can’t mount filesystems. The kernel says no. I picked this habit up from NATO work – you don’t rely on one layer. You assume each layer will fail and build the next one.
Building these profiles by hand is tedious. We generate them by running the app under strace in a staging environment, collecting the syscalls it actually uses, and then adding a small buffer. Not glamorous. Works.
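The extraction step is a few lines of scripting. Here's a rough sketch of the idea, assuming raw `strace -f` output; the regex, function names, and buffer list are ours, not a real tool, and a real pipeline would also handle multiple architectures:

```python
import json
import re

# Matches a syscall name at the start of an strace line, e.g.
# 'openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3', optionally
# preceded by the pid prefix strace emits with -f.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z0-9_]+)\(")

def syscalls_from_strace(log_text):
    """Return the set of syscall names observed in raw strace output."""
    names = set()
    for line in log_text.splitlines():
        m = SYSCALL_RE.match(line.strip())
        if m:
            names.add(m.group(1))
    return names

def build_seccomp_profile(names, extra=()):
    """Default-deny seccomp profile allowing only the observed syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(set(names) | set(extra)), "action": "SCMP_ACT_ALLOW"}
        ],
    }

if __name__ == "__main__":
    sample = 'openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3\n1234  read(3, "", 4096) = 0\n'
    # The "small buffer": syscalls the runtime needs at shutdown or under
    # signal delivery that a short trace can easily miss.
    profile = build_seccomp_profile(
        syscalls_from_strace(sample),
        extra={"exit_group", "rt_sigreturn", "tgkill"},
    )
    print(json.dumps(profile, indent=2))
```

Run the traced app through its full test suite first, or you'll whitelist the happy path and deny the error paths.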
Network policy: because default Kubernetes networking is terrifying
Out of the box, every pod in a Kubernetes cluster can talk to every other pod. Let that sink in. Your frontend can talk to your payment service. Your batch job can reach your secrets manager. It’s a flat network with no segmentation.
Network policies fix this. Here’s what we apply to every namespace at Decloud:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
This denies everything by default. Then we whitelist specific flows:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
That last rule is DNS. People forget DNS and then spend an hour wondering why their app can’t resolve anything. Ask me how I know.
The egress rules are the important ones for runtime security. If a compromised container can’t make outbound connections, the attacker can’t exfiltrate data or download tools. It’s boring. It works.
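When a service genuinely needs to reach something outside the cluster, scope the egress to the narrowest destination you can instead of opening all outbound traffic. A sketch, using a documentation-range placeholder address for the upstream API:

```yaml
egress:
  - to:
      - ipBlock:
          # 203.0.113.0/24 is a reserved documentation range;
          # substitute the real upstream API's address.
          cidr: 203.0.113.10/32
    ports:
      - protocol: TCP
        port: 443
```

A `/32` per known endpoint is tedious to maintain but it means an attacker who lands in the pod can reach exactly one external host, on one port.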
Falco: watching what your containers actually do
Falco hooks into the kernel via eBPF (or a kernel module) and watches syscalls in real time. It’s the closest thing to “what is this container actually doing right now.”
Here are the rules we run in production at Decloud:
- rule: Shell in non-shell container
  desc: Detect shell spawned in a container that shouldn't have one
  condition: >
    spawned_process and container
    and proc.name in (sh, bash, zsh, dash, csh)
    and not container.image.repository in (debug-tools, ops-shell)
  output: >
    Shell started in production container
    (user=%user.name container=%container.name
    image=%container.image.repository command=%proc.cmdline)
  priority: CRITICAL

- rule: Write below /bin or /usr
  desc: Detect new binary written to system paths
  condition: >
    open_write and container
    and (fd.name startswith /bin/ or fd.name startswith /usr/bin/
    or fd.name startswith /usr/sbin/)
  output: >
    Binary written in container
    (file=%fd.name container=%container.name image=%container.image.repository)
  priority: CRITICAL

- rule: Outbound connection to unexpected port
  desc: Detect container connecting to non-standard ports
  condition: >
    outbound and container
    and not fd.sport in (80, 443, 5432, 6379, 8080, 8443, 9090)
    and not container.image.repository in (monitoring-agent)
  output: >
    Unexpected outbound connection
    (container=%container.name port=%fd.sport dest=%fd.sip)
  priority: WARNING
Two things I’ve learned the hard way about Falco rules:
Be specific. Generic rules like “alert on any process spawn” will bury you in noise within hours. You’ll disable the alerts and then you’re back to flying blind.
Tune per workload. Your Nginx sidecar legitimately spawns worker processes. Your Go API server doesn’t. Same rule, different context. We maintain per-service rule overrides and review them quarterly.
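Those per-service overrides live in an append file using Falco's rule exceptions. A sketch (the image name and exception name are illustrative, not our exact override file):

```yaml
# Appended to the stock rule for a sidecar whose shell-based
# worker spawns are expected behavior.
- rule: Shell in non-shell container
  exceptions:
    - name: expected_sidecar_workers
      fields: [container.image.repository, proc.name]
      values:
        - [nginx-sidecar, sh]
  append: true
```

Exceptions beat editing the base condition: the override documents exactly which workload is exempt and why, instead of quietly widening the rule for everyone.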
Incident response: what happens when something fires
Detection without response is a hobby, not a security program. Here’s our playbook at Decloud:
1. CRITICAL alert fires. PagerDuty wakes someone up. Shell in a production container or binary write to system paths – someone looks at it within minutes.
2. Capture state before killing. `kubectl logs`, `kubectl describe pod`, network connection dumps. If you kill the pod first, you've destroyed your forensics.
3. Network isolate. Apply a deny-all network policy to the specific pod. The container is still running but can't talk to anything. This buys time.
4. Investigate. Was it a legitimate deploy? A developer debugging? Or actual compromise? Most of our critical alerts turn out to be config mistakes. That's fine. I'd rather investigate 10 false positives than miss one real incident.
5. Kill and rotate. If it's real, kill the pod, rotate every credential it had access to, and check the blast radius.
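The network-isolate step is fastest if the deny-all policy already exists and is keyed off a label. The `quarantine` label key here is our convention for this sketch, not a Kubernetes built-in:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: production
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes: ["Ingress", "Egress"]
```

Then `kubectl label pod <pod-name> quarantine=true` cuts the pod off without restarting it, so process state and open file descriptors stay intact for forensics.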
One rule we follow strictly: no automated kills on low-confidence alerts. I’ve seen teams auto-restart pods on anomaly detection. They ended up killing healthy services during a deploy because the new version had slightly different syscall patterns. Humans in the loop for destructive actions.
The compounding effect
None of these controls is a silver bullet on its own. A seccomp profile doesn’t prevent credential theft. Network policy doesn’t stop a shell spawn. Falco doesn’t harden your kernel surface.
But stack them together and the attacker’s job gets exponentially harder:
- Scratch image: no tools to use post-exploitation.
- Non-root, dropped capabilities: can’t escalate privileges.
- Read-only filesystem: can’t persist or install anything.
- Seccomp: can’t make dangerous syscalls.
- Network policy: can’t phone home or move laterally.
- Falco: any anomalous behavior triggers an alert.
Each layer assumes the previous one failed. That’s not paranoia. That’s engineering.
From my NATO days: you don’t build a castle with one wall. You build concentric rings where each ring assumes the outer one is already breached. Container runtime security is the same idea, just with YAML instead of stone.