Quick take
Docker won’t fix your ops. It will make bad ops louder and faster. But if you treat images like release artifacts, lock down your Dockerfiles, and build real observability around your containers, it’s the best deployment tool available right now.
We Broke Production Three Times in the First Month
At Dropbyke, we moved our backend services into Docker containers because we were tired of deploy scripts that worked on one machine and failed on another. The promise was simple: build once, run anywhere. That part was true. What nobody told us was how many new ways we would find to break things.
The first week, a developer pushed a container built on their laptop. The binary inside linked against the wrong libc: built against glibc on the laptop, run on a musl-based Alpine image. Segfault in production. The second week, we ran out of disk because nobody was cleaning up old images. The third week, a container ran as root and wrote a config file to a bind mount that the host process couldn’t read back. Three incidents, three categories of failure, all in 30 days.
Every one of those was our fault. Docker did exactly what we told it to do. We just hadn’t learned to tell it the right things yet.
Dockerfiles That Actually Work
Most Dockerfile tutorials produce images that are 800MB and run as root. That’s fine for a demo. It isn’t fine when you’re pushing 15 deploys a day across 4 services.
Here is the pattern we settled on for our Go services at Dropbyke:
FROM golang:1.13-alpine AS builder
RUN apk add --no-cache git ca-certificates
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app ./cmd/server
FROM alpine:3.3
RUN apk add --no-cache ca-certificates tzdata \
&& adduser -D -u 1000 appuser
COPY --from=builder /app /usr/local/bin/app
USER appuser
EXPOSE 8080
ENTRYPOINT ["app"]
A few things to notice. The multi-stage build keeps the final image small – under 20MB for most of our services. We copy the dependency manifests first and download dependencies before copying the rest of the source. That means Docker caches the dependency layer, and rebuilds only take seconds when only application code changes. We create a non-root user. We strip debug symbols with -ldflags="-s -w". We use the exec form of ENTRYPOINT so the app runs as PID 1 and receives signals directly, instead of being wrapped in a shell.
For our Python services (we had one analytics pipeline that wasn’t worth rewriting), the pattern looked different:
FROM python:3.5-slim
RUN groupadd -r app && useradd -r -g app -d /app -s /sbin/nologin app
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN chown -R app:app /app
USER app
EXPOSE 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "--workers", "2", "wsgi:app"]
Same principles. Dependencies first. Non-root user. Minimal base image. Explicit port. No latest tag anywhere.
Image Tagging: latest Isn’t a Version
We tag every image with the git commit SHA. Not the branch name, not latest, not a date string. The commit SHA.
docker build -t dropbyke/location-svc:$(git rev-parse --short HEAD) .
docker push dropbyke/location-svc:$(git rev-parse --short HEAD)
When something breaks at 2 AM, the first question is: what is running right now? With SHA tags, you can answer that in seconds. You can diff the exact code. You can roll back to the previous SHA. You can see who merged what. latest tells you nothing.
We store image metadata as labels too. The values arrive as build args, which CI passes with --build-arg:
ARG GIT_SHA
ARG BUILD_DATE
LABEL build.sha="${GIT_SHA}" \
      build.date="${BUILD_DATE}" \
      build.ci="jenkins"
It sounds like overkill. It isn’t. During an incident last month, we traced a memory leak to a specific commit because the container label told us exactly what was deployed.
docker-compose for Local Development
Our services talk to PostgreSQL, Redis, and each other. Getting all of that running locally used to mean a two-page setup guide that was always out of date. Now it’s a single file:
version: '2'
services:
  location-svc:
    build: ./services/location
    ports:
      - "8080:8080"
    environment:
      - DATABASE_URL=postgres://dropbyke:dev@db:5432/dropbyke?sslmode=disable
      - REDIS_URL=redis://cache:6379
    depends_on:
      - db
      - cache
  trip-svc:
    build: ./services/trip
    ports:
      - "8081:8080"
    environment:
      - DATABASE_URL=postgres://dropbyke:dev@db:5432/dropbyke?sslmode=disable
      - LOCATION_SVC_URL=http://location-svc:8080
    depends_on:
      - db
      - location-svc
  db:
    image: postgres:9.5
    environment:
      - POSTGRES_USER=dropbyke
      - POSTGRES_PASSWORD=dev
      - POSTGRES_DB=dropbyke
    volumes:
      - pgdata:/var/lib/postgresql/data
  cache:
    image: redis:3.0-alpine
volumes:
  pgdata:
New engineers run docker-compose up and have a working environment in under two minutes. No installing Go. No configuring Postgres. No hunting for the right Redis version. The compose file is checked into the repo, so it stays current.
One thing we learned the hard way: depends_on doesn’t mean “wait until the service is ready.” It means “start this container after that one.” Postgres takes a few seconds to accept connections. Your application will crash if it tries to connect immediately. We added retry logic to every service’s database connection code. Not elegant, but reliable.
Networking: Stop Hardcoding IPs
Docker’s default bridge network assigns IPs dynamically. If you hardcode an IP in a config file, it will work until it doesn’t, and then you will spend an hour figuring out why service A can’t find service B.
On a single host, Docker Compose gives you DNS resolution for free. Service names resolve to container IPs. http://location-svc:8080 just works. That’s good enough for development and for small production setups.
For multi-host, the story is more complicated. We use overlay networks, which let containers on different hosts communicate as if they were on the same network. It works, but it adds latency. We measured an extra 1-2ms per hop on our setup. For most services that’s noise. For the real-time bike location tracking, it was noticeable. We kept that service on a single host with host networking.
Logs: stdout or Nothing
Every one of our services logs to stdout. No log files inside the container. No custom logging directories. stdout.
log.SetOutput(os.Stdout)
log.SetFlags(log.Ldate | log.Ltime | log.Lmicroseconds | log.LUTC)
Docker captures stdout and makes it available via docker logs. We run a log forwarder on each host that ships container logs to a central ELK stack. Each log line includes the container ID, image name, and service name as structured fields. When something goes wrong, we can filter by service, time range, and container in Kibana within seconds.
The mistake we made early was letting one service log at DEBUG level in production. It wrote 2GB of logs in an hour, filled up the Docker log driver’s buffer, and slowed down every other container on the host. We now enforce log level via environment variables and default to WARN in production.
Resource Limits Aren’t Optional
A container without resource limits will happily consume all available memory on the host and take everything else down with it. We learned this when our analytics service hit a data spike and OOM-killed the Postgres container running on the same host.
Every container gets explicit limits:
location-svc:
  mem_limit: 256m
  memswap_limit: 256m
  cpu_shares: 512
The memswap_limit matching mem_limit prevents the container from using swap. If it runs out of memory, it dies. That sounds harsh, but a container thrashing on swap is worse than a container that restarts cleanly. Our health checks pick up the restart within seconds.
Security: Run as Non-Root or Don’t Ship
I mentioned this in the Dockerfile section but it bears repeating. Every container runs as a non-root user. No exceptions. If a library or tool requires root, we find a different library or tool.
Beyond the user, we also:
- Use a --read-only filesystem where possible, mounting only specific writable directories for temp files
- Drop all capabilities and add back only what is needed: --cap-drop=ALL --cap-add=NET_BIND_SERVICE
- Never mount the Docker socket into a container (I’ve seen this in tutorials and it terrifies me)
- Scan images for known vulnerabilities before pushing to our private registry
Our private registry runs behind TLS with basic auth. Every push and pull is authenticated. The registry itself runs in a container, which is a fun bit of recursion, but it works.
Health Checks and Restarts
Every service exposes a /healthz endpoint that returns 200 if the service can reach its dependencies:
func healthHandler(w http.ResponseWriter, r *http.Request) {
if err := db.Ping(); err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
fmt.Fprintf(w, "db: %v", err)
return
}
w.WriteHeader(http.StatusOK)
fmt.Fprint(w, "ok")
}
Our load balancer polls this endpoint every 5 seconds. If a container fails three consecutive checks, it gets pulled from the rotation. Docker’s restart policy (--restart=on-failure:5) handles restarting the process. If it fails 5 times in a row, it stays down and pages us. At that point, something is fundamentally wrong and automatic restarts won’t fix it.
What I Would Tell Myself Six Months Ago
Start with one service. Not the most important one and not the least important one. Pick something that deploys frequently and has good test coverage. Containerize it, run it in production for two weeks, and fix every sharp edge you find. Then do the next one.
Don’t try to containerize your database early. We keep Postgres on a dedicated host with proper backups, WAL archiving, and monitoring. The application containers are stateless and disposable. The database is neither of those things.
Write your Dockerfiles like they will be read by someone debugging a production incident at 3 AM. Because they will be.
Treat image builds as CI artifacts. Build them in CI, tag them with the commit, push them to a registry, deploy from the registry. Never build on a developer laptop and push to production. We burned ourselves on that one.
Docker changed how we ship software at Dropbyke. Deploys went from 20 minutes of SSH-and-pray to 45 seconds of pull-and-start. But the speed only came after we did the unglamorous work of writing good Dockerfiles, setting up log aggregation, adding health checks, and enforcing resource limits. The container is the easy part. The ops around it are the job.