First presented June 13th, 2025
Avid reader, writer and technologist. 10+ years of development, Cloud and consulting experience taught me how to get customers from A to Z.
Cloud Arch., SRE & DevOps Lead
Kubernetes is a platform for containerized apps
Kubernetes is highly customizable
AWS
Amazon EKS
Observability, Auto-Deploy, Security
<Your app goes here ❤️>
Compared with ECS, it can be superior
Amazon EKS:
Stewarded by the community
Kinda portable?
Scalingggg
One-size-fits-all platform
Can be hosted in remote oil rigs
Amazon ECS:
Stewarded by AWS
Bound to AWS
Kinda simple to configure
Works while within limitations
True story!
But I already use ECS!!
Amazon EKS
Compared with ECS, it can be inferior
Amazon ECS
Difficult to operate
Dev - Ops collaboration
Container commitment
Forget Windows
Anyway, there is plenty of stuff about it already
✨ AI Generated
This is going to be a reference and an insight into how Kubernetes might treat your apps
Design like there's no tomorrow
12-Factor apps
You want this pod to scale… out and in
If this app retains state locally:
Instead, store your state here: stateful backing service, a.k.a. database
Unless:
Many birds with one stone (be kind to birds):
VI. Processes
Execute the app as one or more stateless processes.
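A minimal sketch of what "store your state in a backing service" can look like in code. `InMemoryStore` is a hypothetical stand-in for a real database or cache client, and `VisitCounter` is an invented example service; the point is only that the replicas themselves hold no state, so any of them can be scaled in or out.

```python
class InMemoryStore:
    """Hypothetical stand-in for a stateful backing service (database, cache)."""

    def __init__(self):
        self._data = {}

    def incr(self, key):
        # A real backing service would do this atomically on its side.
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


class VisitCounter:
    """Example stateless service: all state lives in the injected store."""

    def __init__(self, store):
        self.store = store

    def handle(self, user):
        return self.store.incr(f"visits:{user}")


# Two "pods" sharing the same backing service see the same state.
store = InMemoryStore()
replica_a = VisitCounter(store)
replica_b = VisitCounter(store)
first = replica_a.handle("ana")
second = replica_b.handle("ana")
```

Because neither replica keeps the counter in its own memory, killing `replica_a` loses nothing.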
Shutdown, but with grace
Let's kill it!
SIGTERM, then SIGKILL
I mean… your app!
What just happened?
SIGTERM, then SIGKILL after terminationGracePeriodSeconds (default: 30s)
The app received no traffic but idled until K8s killed it.
How to improve this?
Handle SIGTERM, then exit 0
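One possible way to react to SIGTERM, sketched with the Python standard library only. The function names (`install_handlers`, `run_worker`) are made up for this example; the idea is to stop picking up new work once the signal arrives, finish the task in progress, and then exit with code 0 well before the grace period runs out.

```python
import signal
import threading

# Set by the signal handler, polled by the worker loop.
shutdown_requested = threading.Event()


def _handle_sigterm(signum, frame):
    # Keep the handler tiny: only record that a shutdown was requested.
    shutdown_requested.set()


def install_handlers():
    """Register the SIGTERM handler (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, _handle_sigterm)


def run_worker(tasks):
    """Run tasks in order; stop taking new ones once SIGTERM was received."""
    results = []
    for task in tasks:
        if shutdown_requested.is_set():
            break
        results.append(task())  # the task in progress is allowed to finish
    return results
```

In a real entrypoint, the process would call `install_handlers()`, run its loop, and finish with `sys.exit(0)` once `run_worker` returns.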
Sure, but my app runs long tasks!
Work splitting
Asynchronicity
Parallelization
Configuration
Algorithmic optimizations
More resources
Specialized hardware
Does this help against sudden death?
No. Well… it's debatable.
IX. Disposability
Maximize robustness with fast startup and graceful shutdown
Probes won't be alien anymore
This is your app
Some people know what's inside
DEV
OPS
To others, it's a complete black box
How do you know the app is alive and well?
By observing it from the outside!
Requests
Responses
Liveness
path: /health/live
"Doing well, thanks!"
Readiness
path: /health/ready
"Now open for service!"
Startup
path: /health/initialized
"Come on, there's no rush!"
Probes: What Do They Do? Do They Do Things?? Let's Find Out!
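The three probe paths listed above (/health/live, /health/ready, /health/initialized) could be served roughly like this. It is a sketch using only the standard library; `AppState`, `probe_response` and `make_handler` are hypothetical names, and a real app would flip the flags from its own startup and dependency checks. Kubernetes HTTP probes count any status from 200 to 399 as success, so 503 marks a failed check.

```python
from http.server import BaseHTTPRequestHandler


class AppState:
    """Flags the application flips as it moves through its lifecycle."""

    def __init__(self):
        self.alive = True          # liveness: the process is not wedged
        self.initialized = False   # startup: one-time initialization finished
        self.ready = False         # readiness: can take traffic right now


CHECKS = {
    "/health/live": lambda s: s.alive,
    "/health/ready": lambda s: s.initialized and s.ready,
    "/health/initialized": lambda s: s.initialized,
}


def probe_response(path, state):
    """Return (status_code, body) for a probe request."""
    check = CHECKS.get(path)
    if check is None:
        return 404, "not found"
    return (200, "ok") if check(state) else (503, "unavailable")


def make_handler(state):
    """Build a request handler class bound to the given AppState."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            code, body = probe_response(self.path, state)
            self.send_response(code)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body.encode())

        def log_message(self, *args):
            pass  # keep probe traffic out of the logs

    return HealthHandler
```

To try it locally, something like `HTTPServer(("", 8080), make_handler(AppState())).serve_forever()` (with `HTTPServer` imported from `http.server`) is enough.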
There is no situation in which you should not properly configure the probes!
Unless you don't have a choice (remember, blink three times)
Front API service
init containers:
app startup:
45"
CrashLoopBackOff
Default settings:
(liveness|readiness|startup)Probe:
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 3
terminationGracePeriodSeconds: 30
[Timeline: 0", 10", 20", 30"; app: init]
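The arithmetic behind the CrashLoopBackOff can be sketched as follows, assuming the 45" on the slide is the app's startup time and that only a liveness probe with the default settings is configured. The timing is approximate (the kubelet does not start probing at an exact instant), and the function names are invented for this example.

```python
import math


def restart_window(initial_delay=0, period=10, failure_threshold=3):
    """Approximate (earliest, latest) second at which a container that never
    answers its liveness probe gets restarted."""
    earliest = initial_delay + period * (failure_threshold - 1)
    latest = earliest + period  # allow up to one period of slack
    return earliest, latest


def min_failure_threshold(startup_seconds, period=10):
    """Smallest startupProbe failureThreshold whose budget
    (failureThreshold * periodSeconds) covers the startup time."""
    return math.ceil(startup_seconds / period)


earliest, latest = restart_window()
app_startup = 45  # assumption: seconds the app needs before it can answer
killed_before_ready = app_startup > latest
needed = min_failure_threshold(app_startup)
```

With the defaults, the restart lands somewhere around the 20 to 30 second mark, so an app that needs 45 seconds never gets to answer. A startup probe with `failureThreshold: 5` and `periodSeconds: 10` would give it a 50 second budget.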
[Figure: svc, load %, ready: false]
Be a good citizen
cpu
memory
others
These metrics define the count and size of the infrastructure running the services.
Setting sensible values allows more services to share the same infrastructure and is generally more efficient.
Which values, you ask?
From left to right:
Some situations:
spec:
  containers:
    - name: frontal-api
      image: 123456789012.etc.com/xyz-frontal-api@sha256:2590..d8e3
      resources:
        requests:
          cpu: "25m"
          memory: "100Mi"
        limits:
          cpu: "1"
          memory: "100Mi"
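To get a feel for how requests translate into infrastructure, here is a rough calculation of how many copies of the container above fit on one node. The scheduler places pods based on requests, not limits. The node size (2 CPUs, 4Gi) is a made-up assumption, and the sketch ignores system reservations and per-node pod limits; the parsers only handle the unit suffixes shown.

```python
_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def cpu_millis(quantity):
    """Convert a CPU quantity such as "25m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as "100Mi" to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain bytes


def pods_that_fit(node_cpu, node_memory, request_cpu, request_memory):
    """How many identical pods fit on a node, judged by requests alone."""
    by_cpu = cpu_millis(node_cpu) // cpu_millis(request_cpu)
    by_memory = memory_bytes(node_memory) // memory_bytes(request_memory)
    return min(by_cpu, by_memory)


# Hypothetical node: 2 CPUs and 4Gi of memory.
count = pods_that_fit("2", "4Gi", "25m", "100Mi")
```

Here memory is the binding constraint: the CPU request alone would allow 80 pods, while the memory request caps it at 40.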
What if the memory is maxed?
A game of chicken:
What about the other bars?
pod count
node storage (DiskPressure)
IP addresses, network interfaces, etc.
Tips & tricks
A grab-bag of advice
Follow Platform Operator instructions
Shocker. In all seriousness, here's a real story from a past life:
[Figure: seven "Web Application Resource 8" icons]
Azure VM
[Figure: a single "Web Application Resource 8" icon] 🤨
1.24, 1.25, 1.26, 1.27
1.27: OutOfMemory
Production
12 Regions, 12 K8s clusters
-Xmx lots of RAM
-Xms lots of RAM
requests.memory: lots of RAM
limits.memory: lots of RAM
[Figure: memory usage over time]
Wait, you were an ops on that team. Didn't you know about it?
At this point, accept your fate
¯\_(ツ)_/¯
Log to streams, not to files
Kubernetes relays those logs automatically, and they can easily be scraped for processing. Also, use a common, agreed-upon log format so that the same metadata is present everywhere and the producer of each log record can be pinpointed.
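A small sketch of stream logging with a shared format, using only the standard `logging` module. The JSON field names (`ts`, `level`, `service`, `logger`, `message`) are an assumption for illustration; what matters is that every service agrees on the same set and writes to stdout rather than to a file.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,  # identifies the producer of the record
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(service, stream=sys.stdout):
    """Create a logger that writes JSON lines to a stream, not to a file."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would be along the lines of `log = build_logger("frontal-api")` followed by `log.info("listening on port %d", 8080)`.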
Expose metrics
Expose both technical and business metrics. This data can then be used to measure the effectiveness of your services from the business perspective!
Dig deeper: Prometheus
Example:
A deployment introduces a bug in your ComputationOptimizer service. The computations are scheduled but fail to run. No logs are being produced, and you only see fairly normal CPU and memory usage. With business metrics, you would see that computation requests are being made on one end but not handled on the other.
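The ComputationOptimizer example can be made concrete with two business counters. This is a hand-rolled stand-in written for illustration, not the Prometheus client library; the metric names are invented, and `render` only imitates the general shape of a text exposition.

```python
from collections import Counter


class Metrics:
    """Tiny in-process counter registry (illustrative stand-in only)."""

    def __init__(self):
        self._counters = Counter()

    def inc(self, name, amount=1):
        self._counters[name] += amount

    def value(self, name):
        return self._counters[name]

    def render(self):
        """Render counters as plain text, one sample per metric."""
        lines = []
        for name in sorted(self._counters):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {self._counters[name]}")
        return "\n".join(lines) + "\n"


def backlog(metrics):
    """Requests that were scheduled but never handled."""
    return (metrics.value("computations_requested_total")
            - metrics.value("computations_handled_total"))


metrics = Metrics()
for _ in range(5):
    metrics.inc("computations_requested_total")  # requests keep arriving
metrics.inc("computations_handled_total")        # but only one was handled
```

A steadily growing `backlog(metrics)` is the signal that CPU and memory graphs alone would not show.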
External configuration, retrieved dynamically
You don't want to restart your service unless its code has changed. Updating the database, cache locations or credentials should not require a restart. A good enough strategy:
In other words: never hardcode anything.
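One way to retrieve configuration dynamically, offered as a sketch rather than as the strategy from the talk: re-read an external file once the cached copy is older than a chosen age. `DynamicConfig`, the JSON file format and the `max_age_seconds` parameter are all assumptions made for this example.

```python
import json
import time


class DynamicConfig:
    """Read settings from an external JSON file and refresh them over time."""

    def __init__(self, path, max_age_seconds=30.0):
        self.path = path
        self.max_age_seconds = max_age_seconds
        self._values = {}
        self._loaded_at = None

    def _reload_if_stale(self):
        now = time.monotonic()
        if self._loaded_at is None or now - self._loaded_at >= self.max_age_seconds:
            with open(self.path, encoding="utf-8") as handle:
                self._values = json.load(handle)
            self._loaded_at = now

    def get(self, key, default=None):
        """Return the current value, reloading the file when it is stale."""
        self._reload_if_stale()
        return self._values.get(key, default)
```

Callers ask for `config.get("db_host")` each time they need the value instead of copying it into a variable at startup, so an updated file takes effect without a restart.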
Implement graceful degradation
How are partial outages counted against your SLAs?
If some external dependency takes too long to respond and is not absolutely critical to your code path, it may be wise to return an incomplete response and adapt your UI. At times, stale data can be okay.
Graceful degradation can save you from cascading failures, where app A depends on app B which depends on database D which is in maintenance.
As an example, some websites regularly save your work in your browser. This enables you to work offline and to resume exactly where you left off.
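A minimal sketch of the "return an incomplete response" idea: call the dependency with a deadline and fall back to stale or default data when it is too slow or fails. `with_fallback` and its parameters are hypothetical names; the second return value tells the caller that the response is degraded so the UI can adapt.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared pool so a slow dependency call does not block the request thread.
_pool = ThreadPoolExecutor(max_workers=4)


def with_fallback(fetch, fallback, timeout_seconds):
    """Call fetch() with a deadline.

    Returns (value, degraded): the fresh value and False on success,
    or the fallback and True when the call times out or raises.
    """
    future = _pool.submit(fetch)
    try:
        return future.result(timeout=timeout_seconds), False
    except Exception:
        # Covers both the timeout and any error raised by the dependency.
        return fallback, True
```

For instance, `with_fallback(load_recommendations, cached_recommendations, 0.2)` (both names hypothetical) would serve the cached list whenever the recommendation service takes longer than 200 ms.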
Thank you