Capacity Planning
Designing and building scalable infrastructure for tomorrow’s requirements.
Monitoring
Keeping a pulse on the environment, projects, prospects and their variables.
Capacity Planning
Nothing is static in a given infrastructure; it must be dynamic and able to evolve rapidly. How does one follow the trends and accurately build solutions for what could be tomorrow’s workload? A large part of this is knowing the business’s initiatives and strategy: where does the business want to be in 6 months? In 2 years? In 5 years? Knowing these things is what makes it possible to build infrastructure that supports growth or, in some cases, scales back.
Boosting Efficiency while Reducing Costs
Quite simply, capacity planning is mathematics and projections. Where were we a year ago? Where are we today? Where do we plan on being tomorrow?
All of this can be modeled fairly accurately, providing a real-world dollar amount for infrastructure services (whether on-prem, hybrid cloud or full cloud).
It’s not always as simple as that, though: each application and environment may have different requirements from many different perspectives.
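As a minimal sketch of the "a year ago, today, tomorrow" arithmetic in its simplest form, the Python below extrapolates a straight line through two measurements and prices the result. The function names, the storage figures and the per-terabyte cost are illustrative assumptions, not numbers from a real environment.

```python
def project_linear(year_ago: float, today: float, years_ahead: float = 1.0) -> float:
    """Extrapolate usage, assuming the same absolute growth per year continues."""
    yearly_growth = today - year_ago
    return today + yearly_growth * years_ahead


def projected_cost(usage: float, unit_cost: float) -> float:
    """Turn a projected usage figure into a dollar amount."""
    return usage * unit_cost


# Hypothetical example: 80 TB a year ago, 100 TB today, 250 dollars per TB.
next_year_tb = project_linear(80.0, 100.0)
print(next_year_tb)                          # 120.0
print(projected_cost(next_year_tb, 250.0))   # 30000.0
```

A real model would apply this per application and per resource type, since each may grow at a different rate.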
Not Always Linear
A perfectly linear graph of growth for any given operation is quite unlikely, so the team and the environment itself need to remain agile and dynamic. Because of that, gathering and preserving metrics from any given day is important to future infrastructure planning, whether that means expansion or reduction.
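One way to put preserved metrics to work is to fit a trend to them and estimate when a limit will be reached. The sketch below uses an ordinary least-squares line over one sample per day; the helper names and the sample values are hypothetical. Because growth is rarely linear, refitting on only the most recent window is a cheap way to notice that the trend has changed.

```python
def fit_line(samples: list[float]) -> tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ... (one sample per day)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(samples: list[float], limit: float):
    """Days after the last sample until the fitted trend reaches `limit`.

    Returns None when the trend is flat or shrinking.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_day = len(samples) - 1
    return max(0.0, (limit - intercept) / slope - last_day)


# Hypothetical daily disk usage in percent; growth speeds up after day 3.
usage = [50, 51, 52, 53, 56, 59, 62]
print(days_until(usage, 80))        # estimate from the whole history
print(days_until(usage[-4:], 80))   # estimate from the recent window only
```

Comparing the two estimates gives a rough signal of whether growth is accelerating or slowing relative to the long-term trend.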
Capacity Planning / Monitoring Overview
I’ve used a variety of tools to monitor and plan for both growth and reduction. Some of this boils down to simple mathematics, but some is considerably more complex.
- Nagios
- Prometheus
- ELK Stack
- Opsview
- LogicMonitor
- VMware Suite (vROps, vRLI, vRNI)
- Grafana
- Telegraf / InfluxDB
- Splunk
- Turbonomic
- EC2 / Autoscaling
- Datadog
- SolarWinds (Orion, Server & Application Monitor)
- Observium / LibreNMS
- PRTG
- Zabbix
- Icinga
- OpenNMS
- SNMP
- API Querying / Bespoke
- Bash / PowerShell
Real World Experience – Example
In my experience, numerous capacity constraints and other issues have been remediated before they ever became a problem, thanks to active capacity planning and monitoring. Operational monitoring (e.g., monitoring issues) has not been my primary role for some time, but I tend to remain on call as much as I can to head off issues before they become customer-impacting.
Working for companies in rapid and fairly unpredictable growth modes is a challenge: it is hard to forecast where capacity will be needed and how much. Typically, the best I could do was add a margin of roughly 10% to any given capacity project; depending on the project, however, that margin may be unreasonable or unwarranted.
The challenges I’ve seen have mostly been around compute, network and storage constraints, specifically data capacity and data access performance (e.g., I/O). There are many capacity-saving methods (deduplication, compression, workload optimization, etc.), and I think each of them has its place.
For two years in particular, the company I worked for laid out an aggressive plan to increase headcount and forecasted revenue by 50%: 25% per year over two years. This brought numerous challenges that needed to be addressed, especially once we factored in the normal yearly refresh / reiteration of various equipment. We therefore needed to assume growth of at least 50% over two years, and so this became a two-year project.
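A short worked example shows why 50% is best treated as a floor. Assuming the plan means 25% growth in each of two years, the total is exactly 50% only if both increases are measured against the starting figure; if the second year's 25% applies to the already grown base, the total is 56.25%. The starting headcount of 400 is a hypothetical number chosen for illustration.

```python
def simple_growth(base: float, rate: float, years: int) -> float:
    """Each year's increase is measured against the original base."""
    return base * (1 + rate * years)


def compound_growth(base: float, rate: float, years: int) -> float:
    """Each year's increase is measured against the previous year's total."""
    return base * (1 + rate) ** years


base_headcount = 400  # hypothetical starting point
print(simple_growth(base_headcount, 0.25, 2))    # 600.0 (+50%)
print(compound_growth(base_headcount, 0.25, 2))  # 625.0 (+56.25%)
```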
About 25% of the anticipated new hires were developers, who have specific workload requirements. Developers tend to use their workstations differently than most standard users: compiling code, debugging, and running database connections, application stacks and containers, to name a few.
To plan all this, we had several applications that could forecast the capacity we needed with regard to network, storage, server and workstation requirements. From there, we determined mathematically what we needed to purchase.
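The kind of arithmetic involved can be sketched as follows: split the new hires into user profiles, multiply by a per-person requirement, and add a headroom margin. Only the 25% developer share comes from the plan described here; the profile class, the per-person storage and vCPU figures, the 200 new hires and the 10% default headroom are assumptions for illustration, not the forecasting tools' actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    share: float        # fraction of new hires with this profile
    storage_gb: float   # assumed storage needed per person
    vcpus: float        # assumed compute needed per person


def size_purchase(new_hires: int, profiles: list[Profile], headroom_pct: int = 10) -> dict[str, float]:
    """Sum per-profile requirements, then add a headroom margin in percent."""
    totals = {"storage_gb": 0.0, "vcpus": 0.0}
    for p in profiles:
        people = new_hires * p.share
        totals["storage_gb"] += people * p.storage_gb
        totals["vcpus"] += people * p.vcpus
    return {k: v * (100 + headroom_pct) / 100 for k, v in totals.items()}


# Hypothetical per-person figures; developers are weighted more heavily.
profiles = [
    Profile("developer", share=0.25, storage_gb=500, vcpus=8),
    Profile("standard", share=0.75, storage_gb=100, vcpus=2),
]
print(size_purchase(200, profiles))
```

Keeping the profiles as data makes it easy to rerun the numbers when the hiring mix or the per-person assumptions change.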
We reached out to our standard vendors (VARs) and received expected lead times for various equipment; some equipment had a lead time of six months or more. We determined that the items with the longest lead times needed to be ordered first, as we were unsure when they would arrive and the implementation project really couldn’t kick off until we had them.
My role was building the timelines (at least a ballpark; we did the best we could, but the vendors were not forthcoming with exact lead times), managing the project and the vendor relationships, and producing the overall design of what we’d need. I also completed rack elevations and cabling diagrams for any net-new infrastructure and any modifications to existing infrastructure.
These plans were distributed among the engineering staff and ultimately executed.
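The "longest lead time first" rule can be sketched by working backwards from a target kickoff date. The equipment names, lead times and kickoff date below are hypothetical, and since vendor lead times are often only estimates, the resulting dates are best read as latest acceptable order dates rather than a firm schedule.

```python
from datetime import date, timedelta


def order_schedule(kickoff: date, lead_times_days: dict[str, int]) -> list[tuple[str, date]]:
    """Return (item, latest order date) pairs, longest lead time first."""
    ordered = sorted(lead_times_days.items(), key=lambda kv: kv[1], reverse=True)
    return [(item, kickoff - timedelta(days=days)) for item, days in ordered]


# Hypothetical lead times in days, as quoted by vendors.
lead_times = {"workstations": 30, "core switches": 190, "storage array": 120}

for item, order_by in order_schedule(date(2024, 9, 1), lead_times):
    print(f"{item}: order by {order_by.isoformat()}")
```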