## Quick take
Stop treating gRPC like REST with a different wire format. Design protos for evolution from day one, always set deadlines, use interceptors for everything cross-cutting, and test with real servers. I share the exact patterns we use at Decloud with Go code you can steal.
We moved Decloud’s internal APIs to gRPC about eight months ago. Before that, everything was REST with hand-rolled JSON serialization, version mismatches everywhere, and a weekly ritual of “why is the client sending a string where we expect an int.” Moving to gRPC fixed the type safety problem. It also introduced a whole new category of problems I didn’t anticipate.
This is what I wish someone had written down before we started. Real patterns, real Go code, real mistakes.
## Why we picked gRPC (and where we didn’t)
Short version: internal service-to-service calls where latency matters and both sides are services we control. That’s it. That’s the use case.
We kept REST for:
- Anything browser-facing. gRPC-Web exists but it’s a subset and adds a proxy layer. Not worth it for us.
- Third-party integrations. Nobody wants to learn your proto schema.
- Simple CRUD admin tools. `curl` is a better debugger than `grpcurl`, and I don’t care what anyone says.
The real win was code generation. We have Go, Python, and a small Rust service. Generating typed clients from one .proto file eliminated an entire class of integration bugs. Before gRPC, every language had its own hand-written client that drifted independently. Now drift is a compilation error.
## Proto design: get this wrong and you’ll pay for years
Field numbers are forever. I mean that literally. Once you ship a proto, those field numbers are carved into every binary that’s ever been compiled against it. Get the schema design wrong early and you’re living with it or doing a painful migration.
Here’s what our service definitions actually look like at Decloud:
```proto
syntax = "proto3";

package decloud.nodes.v1;

import "google/protobuf/timestamp.proto";

service NodeService {
  rpc GetNode(GetNodeRequest) returns (Node);
  rpc ListNodes(ListNodesRequest) returns (ListNodesResponse);
  rpc WatchNodeStatus(WatchNodeStatusRequest) returns (stream NodeStatusEvent);
}

message GetNodeRequest {
  string node_id = 1;
}

message ListNodesRequest {
  int32 page_size = 1;
  string page_token = 2;
  NodeFilter filter = 3;
}

message ListNodesResponse {
  repeated Node nodes = 1;
  string next_page_token = 2;
}

message NodeFilter {
  repeated string regions = 1;
  NodeStatus status = 2;
}

message Node {
  string id = 1;
  string hostname = 2;
  NodeStatus status = 3;
  google.protobuf.Timestamp created_at = 4;
  google.protobuf.Timestamp last_heartbeat = 5;
  NodeResources resources = 6;

  reserved 7, 8;
  reserved "legacy_provider";
}

message NodeResources {
  int64 cpu_millicores = 1;
  int64 memory_bytes = 2;
  int64 disk_bytes = 3;
}

enum NodeStatus {
  NODE_STATUS_UNSPECIFIED = 0;
  NODE_STATUS_PROVISIONING = 1;
  NODE_STATUS_READY = 2;
  NODE_STATUS_DRAINING = 3;
  NODE_STATUS_OFFLINE = 4;
}

message WatchNodeStatusRequest {
  string node_id = 1;
}

message NodeStatusEvent {
  string node_id = 1;
  NodeStatus previous = 2;
  NodeStatus current = 3;
  google.protobuf.Timestamp occurred_at = 4;
}
```
A few things worth noting:
Every RPC gets its own request and response messages, even if GetNodeRequest only has one field today. You will add fields later. If you share a generic message across RPCs, or lean on a wrapper like google.protobuf.StringValue as your request type, you’ve locked yourself out of adding filters, field masks, or anything else without a breaking change.
Enums start with _UNSPECIFIED = 0. Proto3 defaults to zero. If you put a meaningful value at zero, you can’t distinguish “the client explicitly set this” from “the client didn’t set it.” We learned this the hard way with a status enum that defaulted to ACTIVE at zero. Debugging phantom active nodes wasn’t fun.
Reserve removed fields. See that reserved 7, 8 and reserved "legacy_provider"? Those are fields we removed during a redesign. Without the reservation, someone could reuse field number 7 for a completely different type and corrupt data in old clients that still have the old schema cached.
Use Timestamp, not int64. I’ve seen people use epoch millis as int64 fields. It looks fine until you’re debugging across time zones and nobody remembers whether the value is seconds or milliseconds, or which epoch it counts from.
## Go server implementation
Here’s a stripped-down but realistic server. This is close to what we actually run:
```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"

	pb "github.com/decloud/api/nodes/v1"
)

type nodeServer struct {
	pb.UnimplementedNodeServiceServer
	store NodeStore
}

func (s *nodeServer) GetNode(ctx context.Context, req *pb.GetNodeRequest) (*pb.Node, error) {
	if req.GetNodeId() == "" {
		return nil, status.Error(codes.InvalidArgument, "node_id is required")
	}
	node, err := s.store.Get(ctx, req.GetNodeId())
	if err != nil {
		return nil, status.Errorf(codes.Internal, "store lookup failed: %v", err)
	}
	if node == nil {
		return nil, status.Errorf(codes.NotFound, "node %q not found", req.GetNodeId())
	}
	return node, nil
}

func (s *nodeServer) WatchNodeStatus(req *pb.WatchNodeStatusRequest, stream pb.NodeService_WatchNodeStatusServer) error {
	if req.GetNodeId() == "" {
		return status.Error(codes.InvalidArgument, "node_id is required")
	}
	events := s.store.Subscribe(req.GetNodeId())
	defer s.store.Unsubscribe(req.GetNodeId(), events)

	for {
		select {
		case <-stream.Context().Done():
			return nil
		case event, ok := <-events:
			if !ok {
				return nil
			}
			if err := stream.Send(event); err != nil {
				return err
			}
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	srv := grpc.NewServer(
		grpc.ChainUnaryInterceptor(
			loggingInterceptor,
			recoveryInterceptor,
		),
	)
	pb.RegisterNodeServiceServer(srv, &nodeServer{store: NewNodeStore()})

	hsrv := health.NewServer()
	healthpb.RegisterHealthServer(srv, hsrv)
	hsrv.SetServingStatus("decloud.nodes.v1.NodeService", healthpb.HealthCheckResponse_SERVING)

	log.Printf("listening on :9090")
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}
```
A few patterns here that took us a while to settle on:
Embed UnimplementedNodeServiceServer. This is the forward-compatible pattern. When you add a new RPC to the proto, the server still compiles. It returns Unimplemented for the new method until you add the handler. Without this, adding a method to the proto breaks every server binary.
Register the health service. This isn’t optional. Kubernetes liveness and readiness probes need it. Our deploy pipeline rejects services that don’t register the gRPC health check. No exceptions.
Chain interceptors. Logging, recovery, metrics – all cross-cutting concerns go in interceptors. Not in every handler. The ChainUnaryInterceptor API landed relatively recently and it’s much cleaner than the old single-interceptor pattern where you’d nest them manually.
## Client patterns that don’t break at 3am
The client side is where most people get sloppy. Here’s what we enforce:
```go
func newNodeClient(addr string) (pb.NodeServiceClient, func(), error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, addr,
		grpc.WithBlock(), // without this, DialContext returns immediately and the 5s timeout does nothing
		grpc.WithTransportCredentials(loadTLSCredentials()),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(4*1024*1024),
		),
		grpc.WithChainUnaryInterceptor(
			retryInterceptor(3, 100*time.Millisecond),
		),
	)
	if err != nil {
		return nil, nil, fmt.Errorf("dial %s: %w", addr, err)
	}
	cleanup := func() { conn.Close() }
	return pb.NewNodeServiceClient(conn), cleanup, nil
}

func getNodeWithDeadline(client pb.NodeServiceClient, nodeID string) (*pb.Node, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	node, err := client.GetNode(ctx, &pb.GetNodeRequest{NodeId: nodeID})
	if err != nil {
		st := status.Convert(err)
		switch st.Code() {
		case codes.NotFound:
			return nil, nil // not an error, just missing
		case codes.DeadlineExceeded:
			return nil, fmt.Errorf("node lookup timed out after 3s")
		default:
			return nil, fmt.Errorf("node lookup failed: %s", st.Message())
		}
	}
	return node, nil
}
```
Always set deadlines. Every single call. No exceptions. A call without a deadline is a goroutine leak waiting to happen. We had an incident where a downstream service hung and our caller accumulated ~40,000 goroutines because nobody set a timeout. The fix was one line. The outage was three hours.
Handle status codes explicitly. Don’t just check err != nil. A NotFound is fundamentally different from Internal. Your retry policy, your alerting, your user-facing message – all different depending on the code. status.Convert(err) gives you the code. Use it.
Retry with backoff, but only on the right codes. We retry on Unavailable and DeadlineExceeded. We do not retry on InvalidArgument or NotFound. Retrying a bad request is just a faster way to burn your quota.
## Interceptors: the gRPC middleware pattern
This is where gRPC really shines versus REST. Interceptors are typed, composable, and they work identically for unary and streaming RPCs. Here’s our logging interceptor:
```go
func loggingInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	duration := time.Since(start)

	code := codes.OK
	if err != nil {
		code = status.Code(err)
	}
	log.Printf("method=%s code=%s duration=%s",
		info.FullMethod, code, duration)
	return resp, err
}

func recoveryInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (resp interface{}, err error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("panic in %s: %v", info.FullMethod, r)
			err = status.Errorf(codes.Internal, "internal error")
		}
	}()
	return handler(ctx, req)
}
```
The recovery interceptor has saved us more than once. A nil pointer dereference in a handler used to crash the entire server. Now it returns Internal and logs the panic. The server stays up. Other RPCs keep working. We still fix the bug, but we don’t page the on-call team at 3am for a panic in a non-critical endpoint.
## Error handling: the part everyone gets wrong
I’ve seen codebases where every gRPC error is codes.Internal. That’s like every HTTP response being a 500. Useless for clients. Useless for monitoring.
Our rule is simple: pick the status code that tells the client what to do.
| Situation | Code | Client action |
|---|---|---|
| Bad input | InvalidArgument | Fix the request, don’t retry |
| Missing resource | NotFound | Don’t retry, maybe create it |
| Duplicate creation | AlreadyExists | Probably idempotent, check state |
| Auth missing/expired | Unauthenticated | Re-authenticate |
| Auth valid but insufficient | PermissionDenied | Don’t retry, escalate |
| Server overloaded | Unavailable | Retry with backoff |
| Bug or unknown | Internal | Alert, escalate |
The key insight: Internal means “the server has a bug.” If you’re returning Internal for a missing resource or bad input, you’re lying to your monitoring system. Your alerts will fire for things that aren’t server bugs. Alert fatigue follows.
## Proto evolution: the non-obvious rules
Adding fields is safe. Removing them isn’t. That much is obvious. The non-obvious part:
Changing a field from string to bytes is wire-compatible but semantically different. We did this once with a field that held a UUID. Wire format was identical. But the generated Go code changed from string to []byte and broke every caller at compile time. “Wire compatible” and “API compatible” are different things.
Removing a field without reserved is a time bomb. Six months later, someone reuses that field number for a different type. Old clients that haven’t recompiled send the old type. The new server interprets it as the new type. Data corruption that only manifests in production with old client versions. Good luck debugging that.
Version at the package level. When you need a breaking change:
```proto
// Old clients still work
package decloud.nodes.v1;

// New clients use v2
package decloud.nodes.v2;
```
Run both versions side by side. Migrate clients one at a time. Kill v1 when the last client is gone. We track this with a Grafana dashboard that shows request counts per proto package version.
## Testing: skip the mocks
We don’t mock gRPC clients. We spin up a real grpc.Server in tests with bufconn:
```go
func setupTest(t *testing.T) pb.NodeServiceClient {
	t.Helper()

	lis := bufconn.Listen(1024 * 1024)
	srv := grpc.NewServer()
	pb.RegisterNodeServiceServer(srv, &nodeServer{store: NewMemoryStore()})
	go func() { _ = srv.Serve(lis) }()
	t.Cleanup(func() { srv.GracefulStop() })

	conn, err := grpc.DialContext(
		context.Background(), "bufnet", // target is ignored; the dialer below wins
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return lis.DialContext(ctx)
		}),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		t.Fatalf("dial bufconn: %v", err)
	}
	t.Cleanup(func() { conn.Close() })
	return pb.NewNodeServiceClient(conn)
}

func TestGetNode(t *testing.T) {
	client := setupTest(t)

	// Happy path
	node, err := client.GetNode(context.Background(), &pb.GetNodeRequest{NodeId: "node-1"})
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if node.GetHostname() != "worker-01.decloud.dev" {
		t.Errorf("hostname = %q, want %q", node.GetHostname(), "worker-01.decloud.dev")
	}

	// Not found
	_, err = client.GetNode(context.Background(), &pb.GetNodeRequest{NodeId: "nonexistent"})
	if status.Code(err) != codes.NotFound {
		t.Errorf("code = %v, want NotFound", status.Code(err))
	}

	// Validation
	_, err = client.GetNode(context.Background(), &pb.GetNodeRequest{})
	if status.Code(err) != codes.InvalidArgument {
		t.Errorf("code = %v, want InvalidArgument", status.Code(err))
	}
}
```
bufconn gives you an in-memory listener. No ports, no flaky tests from port conflicts, no network overhead. But you still test the full gRPC stack: serialization, interceptors, status codes, deadlines. Mocking the client interface skips all of that. We found real bugs with bufconn that mocks would have hidden – serialization of oneof fields, deadline propagation through interceptors, and metadata handling.
## Load balancing: the HTTP/2 gotcha
gRPC runs on HTTP/2, which multiplexes requests over a single long-lived TCP connection. A connection-level (L4) load balancer balances connections, not requests, so every request rides that one connection to whichever backend accepted it. One hot server, N-1 idle servers.
We use client-side balancing with a service registry. The gRPC resolver API lets you plug in your own discovery:
```go
conn, err := grpc.Dial(
	"dns:///nodes.decloud.internal:9090",
	grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
	// plus transport credentials, interceptors, etc. as shown earlier
)
```
The dns:/// scheme tells the resolver to use DNS. round_robin distributes across all A records. For us this works because we run headless Kubernetes services that return pod IPs.
If you’re behind Envoy or Istio, make sure they’re configured for HTTP/2, not HTTP/1.1 upgrade. I’ve seen service meshes silently downgrade gRPC to HTTP/1.1 and nobody notices until streaming breaks.
## What I’d do differently
Eight months in, here’s what I’d change if we started over:
Invest in proto linting from day one. We use buf now but adopted it late. Early protos have inconsistent naming, missing field reservations, and enum values that don’t follow the TYPE_NAME_VALUE pattern. Fixing these in a live system is painful.
Start with buf instead of raw protoc. The protoc plugin ecosystem is a maze. buf handles code generation, linting, and breaking change detection in one tool. Should have started there.
Don’t over-decompose services. Our first instinct was one proto per entity. Node service. Deployment service. Billing service. Network service. That’s fine in theory. In practice, most operations touch three or four services, so every user-facing action became a cascade of RPCs. We’ve since consolidated the ones that always move together.
gRPC is a great tool for internal APIs. But it’s a tool, not a religion. Use it where it helps. Use REST where that’s simpler. The goal is shipping working software, not architectural purity.