Streamline repetitive ops
100x faster with AI
Thousands of automated operations tasks and an AI platform connecting them to your alerts, tickets and chats.
Thousands of Tasks In Minutes
Our agent syncs Kubernetes and cloud accounts with automation libraries covering cloud infra, K8s apps, popular OSS and programming frameworks.
Connect With AI
Send an alert, add to a channel or assign a ticket. Our AI Platform connects text describing a problem with automation matching the solution.
Measure What Matters
Executive reports show time saved during incidents (MTTR impact), automation coverage of alerts and priority tasks that are not yet automated.
In The Library
From interruption to self-service
With traditional automation, the person who writes it is the person who runs it. Modern AI changes that.
Now when anyone on the team describes a problem, the platform suggests (and runs) automation to take the next step.
This kind of operational self-service dramatically increases the velocity of any team operationalizing new AI infrastructure, adopting Kubernetes or migrating clouds.
Bring on-call in-house
Replacing low-cost on-call services with automation frees up enough budget to add high-end SREs to your team.
The vast amount of automation required is hard to justify if you need to write it all yourself. It takes too long and costs too much. Leveraging our shared libraries makes the ROI simple (and fast).
Automation can cut observability cost
Some of our most popular automation libraries copy logs directly from pods and VMs into Jira/GitHub tickets. Connecting this to VS Code, alerts and CI/CD webhooks removes the need to ingest and store 99%+ of non-prod logs.
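The pattern is simple: instead of shipping every log line to an observability backend, keep only the error lines and their stack traces and attach that snippet to the ticket. A minimal sketch of the idea (this is an illustration, not RunWhen's implementation; the filter rule and context size are assumptions):

```python
import re

def extract_error_context(log_lines, context=3):
    """Keep only ERROR lines plus up to `context` following lines
    (typically the stack trace), dropping everything else."""
    keep = []
    remaining = 0
    for line in log_lines:
        if re.search(r"\bERROR\b", line):
            remaining = context + 1  # the ERROR line itself plus context
        if remaining > 0:
            keep.append(line)
            remaining -= 1
    return keep

# Sample pod log stream; in practice this would come from
# `kubectl logs` or a VM's log file.
logs = [
    "INFO starting worker",
    "DEBUG heartbeat ok",
    "ERROR NullPointerException in OrderService",
    "  at OrderService.process(OrderService.java:42)",
    "  at Main.run(Main.java:10)",
    "INFO worker idle",
]

# `snippet` holds just the error and its stack trace, ready to paste
# into a Jira/GitHub ticket comment instead of ingesting all six lines.
snippet = extract_error_context(logs, context=2)
```

For non-prod environments, where logs are mostly consulted only when something breaks, pasting this filtered snippet into the ticket is usually all a developer needs.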
Thousands of ops tasks automated by our experts
We add 20+ new automated tasks per month covering the infrastructure, services and tools that your team uses and maintains every day.
Check logs for errors -> if errors then collect stack traces and env vars -> paste to ticket -> if deployment then do a rolling restart.
If a CI/CD job has an error then find the tests that it referenced -> find the deployments referenced by the tests -> collect env vars, manifest and stack traces from each deployment -> if an existing ticket then paste to it -> if no ticket then create one.
Untangling the test environment again.
Collect logs -> grep for stack traces -> file ticket -> restart VM (AWS).
Check for high error rate nginx paths -> find deployment -> check deployment resource health -> copy logs to a ticket -> restart deployment.
Health check /login for HTTP 200s -> check auth microservice logs for errors.
Check Postgres write-ahead log storage utilization -> add emergency capacity and escalate immediately.
Helping developers with repetitive troubleshooting.
Check Kubernetes Error events for a Deployment -> check logs for application errors -> check CPU/mem/IO metrics -> check node for noisy neighbors -> paste all info into a ticket.
Check Databricks for failing job references -> if a Databricks job failed then check node health under the Deployment -> if node health is OK then check the dependent deployment for Error events.
If a developer says a service is down -> help the developer run a liveness probe check, collect recent logs and notify the service owner.
Collect env status and pod logs -> paste into a new ticket -> rolling restart the deployment.
Triage noisy alerts.
Run test env pre-flight check -> check all Deployments are in ready state -> check the transaction table has at least 1 row.
Check the transaction queue is <100 items deep -> if not, collect env info and deployment logs and file a ticket.
Collect StatefulSet manifest -> paste to ticket.
Manual health checks.
Check certificate is valid -> if not, rotate the certificate.
Check Ingress for Warning events.
Check Ingress logs for error messages.
Increase CPU/memory capacity for an Azure Web App.
Check Ingress for paths with high rates of 500 errors.
Read/write a test key to Redis.
Read a test row from Postgres -> restart the VM if the query returns no rows.
Check Kafka client latency.
Restart a Kafka client to rebalance.
Search logs and paste results to a ticket.
Add env vars to a ServiceNow ticket.
Confirm no root account logins in the last 30 days.
Check volume utilization.
Add emergency 10Gi of storage capacity to a volume.
Add emergency 500 millicores of CPU capacity.
Compare deployment manifest to Vertical Pod Autoscaler CPU/mem recommendations -> if misaligned then prepare a manifest aligning them in a PR -> email the service owner.
Check manifest for readiness probe configuration -> if missing then notify the service owner -> if incorrect then prepare a PR with a fix and file a ticket.
Check manifest for non-standard open ports -> if found then check the exception list -> if not on the exception list then file a ServiceNow ticket.
Check OAuth login latency -> if latency is slow then restart the VM -> email the service owner.
Check the queue is less than 60% of capacity -> if beyond basic capacity then check CPU/memory -> if CPU/memory is high then copy recent logs to a ticket and emergency-restart the process.
If the test body mentions a Vault error then do a test read/write on the Vault test path with pod credentials -> if that fails then try with default read-only credentials -> if that also fails then notify the service owner.
Check the test DB is running, the volume is not full, the login string matches, no key tables are locked and the test user is in the user table -> if any fail, stop running tests and notify the test owner.
Check CPU is not >80% for the last 5 minutes.
Check memory is not >80% for the last 5 minutes.
If resource utilization is over limits, open a PR for a capacity increase.
Check Azure metrics for HTTP 500 rate overnight.
Check logs for errors after a deployment scale-up.
Check for high CPU/mem after a deployment scale-to-one.
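Each task above is a short chain of checks and actions. To make the shape concrete, here is a sketch of one of them ("check the transaction queue is <100 items deep -> if not, collect env info and deployment logs and file a ticket"). The probe and ticketing functions are hypothetical stand-ins; a real task would call the queue's API, kubectl, and a ticketing system instead:

```python
# Hypothetical stand-ins for real probes (queue API, kubectl, Jira/ServiceNow).
def queue_depth():
    return 142  # stubbed reading; would come from the queue's management API

def collect_env_info():
    return {"deployment": "checkout", "namespace": "prod"}

def collect_deployment_logs():
    return ["ERROR consumer lag rising", "WARN retry backlog growing"]

def file_ticket(summary, details):
    # Would POST to a ticketing system; here it just returns the payload.
    return {"summary": summary, "details": details}

def run_task(limit=100):
    """Check queue depth -> if over limit, collect env info and
    deployment logs -> file a ticket with everything attached."""
    depth = queue_depth()
    if depth < limit:
        return None  # healthy, nothing to do
    return file_ticket(
        summary=f"Transaction queue at {depth} items (limit {limit})",
        details={"env": collect_env_info(), "logs": collect_deployment_logs()},
    )

ticket = run_task()
```

The value of a shared library is that these small check-then-act chains already exist for thousands of components, so teams compose them rather than rewriting the same glue.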
Did you say this afternoon?
Install the RunWhen Local agent in your cluster to scan Kubernetes, AWS, GCP and Azure accounts.
By default it will sync the RunWhen read-only libraries. You can add more public or private libraries over time.
A scan of a typical small/medium size cluster will import several thousand tasks in a few minutes.
Collaboration increases coverage
Our platform is designed for you to import AI-ready tasks from our community, but also for anyone across your teams to add their own. A CLI command? A SQL query? A REST call, or a shell script? Engineering Assistants (with appropriate access) recommend them and use them in real time, extending their capabilities without ever changing configuration.
Where to next?
The default Assistants that come out of the box are designed for Platform/SRE teams to give to developers for Kubernetes troubleshooting. However, it doesn't stop there...
A (paid) community?
Expert authors in our community receive royalties and bounties when RunWhen customers use their automation. The community's efforts span infrastructure, cloud services and platform components alongside popular OSS components, programming languages and frameworks.
Running a lean team means you need the best engineers you can find...
Do you really want your top engineers spending time on work that someone in industry already automated?