Cloud Computing

Cloud Computing can be extremely useful in particular situations, but like many new technologies, it often gets over-adopted, and deployed in situations where it's either not appropriate or downright detrimental.

The classic case for Cloud Computing (which is still a good application of Cloud today) is the case of a start-up company. In the simplest case, a new company needs a web site to advertise their business. Purchasing a server for a simple web site will cost several thousand dollars, and there's additional cost for network bandwidth, and support staff to maintain the server. In contrast, a virtual machine on Amazon Web Services (AWS) can be brought up with no capital investment and extremely low operating costs (in fact, at Amazon's lowest level offering, your virtual machine can be free for the first year). Although you still need staff to support the software on the machine, all hardware and network maintenance is provided by Amazon. Furthermore, this is a pay-as-you-go service, so you can shut your machine down and incur no additional charges.

A similar case can be made for established businesses that need single-purpose servers that are outside their business domain, or that may have different reliability or bandwidth requirements. For example, if you run a web server company you probably want to host your own web site. If you run a manufacturing business, however, your experience is probably more with the computers and related hardware that run your factory floor than with web servers. You can let a web company worry about the hardware maintenance of your web server, and have your IT staff focus on the manufacturing core of your business. The cloud companies typically have better bandwidth than most individual businesses, so they're a good option for customer-facing services where response time is critical. Hosting those services in the cloud also keeps their traffic off the company network, leaving that bandwidth for normal company operations.

Beyond small business use cases, cloud is sometimes used by larger organizations for burst capacity. If you run an HPC environment that only occasionally requires many more nodes than you have on hand, it might be more cost-effective to extend your cluster with nodes from a cloud provider than to buy more capacity that will sit idle most of the time.
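To make the burst-capacity trade-off concrete, here is a minimal sketch of the arithmetic. The function name, its parameters, and every number in it are hypothetical placeholders chosen for illustration, not real prices; substitute quotes from your own vendors and provider.

```python
def annual_burst_costs(extra_nodes, burst_hours_per_year,
                       cloud_rate_per_node_hour,
                       node_purchase_price, node_lifetime_years,
                       node_yearly_upkeep):
    """Return (cloud_cost, owned_cost) per year for covering a peak."""
    # Cloud: pay only for the hours the extra nodes actually run.
    cloud_cost = extra_nodes * burst_hours_per_year * cloud_rate_per_node_hour
    # Owned: purchase price spread over the node's lifetime, plus upkeep,
    # paid whether the nodes are busy or idle.
    owned_cost = extra_nodes * (node_purchase_price / node_lifetime_years
                                + node_yearly_upkeep)
    return cloud_cost, owned_cost


# Hypothetical numbers, for illustration only.
cloud, owned = annual_burst_costs(
    extra_nodes=50, burst_hours_per_year=200,
    cloud_rate_per_node_hour=3.0,
    node_purchase_price=10_000.0, node_lifetime_years=5,
    node_yearly_upkeep=1_000.0)
print(f"cloud burst: ${cloud:,.0f}/yr, owned nodes: ${owned:,.0f}/yr")
```

With these made-up inputs the rented nodes come out far cheaper, but the answer flips as burst_hours_per_year grows, which is exactly the point: the more often the "burst" happens, the less it looks like a burst.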

You may also be able to use cloud solutions to test new hardware. Currently, Amazon offers instances with GPUs, and while GPUs are nearly ubiquitous in HPC clusters these days, you may be one of the "late adopters" and need to justify the purchases to your management. The cloud is a reasonable place to get some base-level benchmarking numbers. Beyond that, hardware that's considered a bit more exotic, like FPGAs, is also now available on Amazon. While Amazon remains the clear leader in the cloud space, the smaller cloud providers have started to differentiate themselves by offering more specialized and customized solutions.

Another advantage to cloud computing is "data safety," in quotes because it's ultimately up to you to ensure your data is secure. Nevertheless, the cloud providers are very good at making sure the data they store is available, accessible, and reasonably secure, and they've probably invested in much better hardware than you did. The cloud providers typically have people dedicated to data storage, and they most likely are better at doing their jobs than your poor overworked system administrator who is also responsible for your cluster, infrastructure, training, applications, desktops, etc.

The dark side of Cloud

With all that said, cloud computing isn't really the magical unicorn that it's typically depicted as in the marketing glitz of these large cloud providers. The expenses associated with computing in the cloud can sometimes be surprising. You're also likely sharing a resource with someone else. Particularly with HPC, it's often not clear how to get a dedicated machine with dedicated network bandwidth (this seems to be a moving target on Amazon in particular, so read the documentation).

In terms of costs, the per-hour charges for cloud computing seem to be surprisingly low. Again, if you only need the resources periodically, cloud is still a good option. However, if you regularly have 40,000 cores running at 100% utilization, cloud quickly becomes overly expensive. Additionally, you're paying not just for the compute cores, but also ingress (sometimes) and egress (always) charges to move your data to and from the cloud. You'll also be charged for storage to hold that data, for as long as you have the storage allocated to your account (whether your data is there or not). Are you going to offer your data on the Internet? Or even to your local users on a secure connection so it can be browsed online? Expect to pay for network bandwidth charges. All of these charges begin to add up quickly.
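A rough way to see how these charges stack up is to itemize a month's bill. The sketch below is only an illustration: the function, the RATES values, and the storage and egress volumes are all hypothetical placeholders rather than any provider's actual pricing, and real bills have many more line items.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cloud_bill(core_hours, rate_per_core_hour,
                       storage_gb, storage_rate_per_gb_month,
                       egress_gb, egress_rate_per_gb):
    """Itemize one month of charges: compute, allocated storage, egress."""
    bill = {
        "compute": core_hours * rate_per_core_hour,
        # Billed on allocated capacity, whether or not data is stored there.
        "storage": storage_gb * storage_rate_per_gb_month,
        "egress": egress_gb * egress_rate_per_gb,
    }
    bill["total"] = sum(bill.values())
    return bill


# Placeholder rates (not real prices).
RATES = dict(rate_per_core_hour=0.05,
             storage_rate_per_gb_month=0.02,
             egress_rate_per_gb=0.10)

# Occasional use: 1,000 cores for 40 hours in the month.
occasional = monthly_cloud_bill(core_hours=1_000 * 40,
                                storage_gb=100_000, egress_gb=10_000, **RATES)
# Steady use: 40,000 cores busy all month.
steady = monthly_cloud_bill(core_hours=40_000 * HOURS_PER_MONTH,
                            storage_gb=100_000, egress_gb=10_000, **RATES)

for name, bill in (("occasional", occasional), ("steady", steady)):
    print(name, {item: round(cost, 2) for item, cost in bill.items()})
```

In the occasional case, storage and egress make up more than half of the total under these assumed rates; in the steady case, compute dwarfs everything else. Either way, the per-hour core price alone tells you very little.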

When cloud computing first came into vogue several years ago, a lot of companies jumped onto the bandwagon without a lot of thought about the costs, some even proudly declaring themselves a "cloud first" company, meaning all IT projects would be targeted to the cloud instead of local resources. While cloud computing continues to grow as a business, many companies have started to engage in "cloud repatriation" over the past few years.

The reasons for cloud repatriation are varied, but they mostly seem to center on the problem that companies are not getting the full benefits that they were promised with the cloud. Expenses are cheaper in some instances, but not all. You still have to have an IT staff; cloud providers don't do the work for you, they just offer the resources to do it remotely. While the connectivity (uptime) of the cloud companies is excellent, accidents do happen occasionally, and huge tracts of resources sometimes go offline. They probably have better uptime than your local resources, but if you have local resources and your Internet connection goes down, local users can still work. Not so with remote resources.

One of the largest reasons for cloud repatriation is probably still poorly planned and managed cloud migrations. Operating in a cloud environment is not the same as operating in your local data center. There's a lot of training for users and operators, a lot of application re-engineering, and a lot of planning to do before you can smoothly and successfully migrate any major or critical application to a cloud platform. There's still an issue with security. While the cloud platforms themselves are remarkably secure, your application and data security is still your responsibility, and security in a cloud environment is a different beast than security on-site. Even if you're successful in making the transition, you may find that the actual costs are still prohibitive.

Should you do cloud? Maybe. But start small, start slowly, and get an idea of what makes sense to move to a cloud platform and what doesn't, and get an idea of the real costs involved. In the end, as the not-so-old saying goes, "there is no cloud, it's just someone else's computer."

The bright side of cloud

While it's true that the basis of the Cloud is "just someone else's computer," there are certainly some advantages to this kind of arrangement, as I discussed above. Beyond that, however, Cloud companies also realize they need to add more value to their offerings if they really want to convince people to move away from on-premises systems. In pursuit of that goal, Cloud companies have expanded their offerings with push-button solutions and fully managed applications and infrastructure. As an example, Amazon offers push-button deployments of popular databases. You simply select the size of the instance you want to use, and Amazon automatically deploys it and maintains it for you. An even better example may be Amazon Elastic Kubernetes Service (EKS) or (even simpler) Fargate. Managing an on-premises Kubernetes instance is notoriously difficult, a huge amount of work, and a famously troublesome time sink. With Fargate, you don't need to worry about the underlying Kubernetes or server maintenance; you can just run your containers. Going even further, Amazon provides "serverless" offerings like Lambda, which allows you to deploy a function that can be called when needed without an EC2 instance running to host the code. You get billed by the number of calls, memory used, and the length (in milliseconds) that the function executes, and Amazon will automatically scale it to fit your needs. There are similar "serverless" offerings available, including message queues, databases, and notification services.
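The per-call billing described above can be sketched as a short calculation. This assumes the common request-plus-GB-second pricing shape (a charge per invocation, plus a charge for memory multiplied by run time); the function name and both rates are hypothetical placeholders, so check the provider's current price list before relying on any number.

```python
def serverless_monthly_cost(invocations, avg_duration_ms, memory_mb,
                            price_per_million_requests,
                            price_per_gb_second):
    """Estimate a month's charge for one function: requests plus compute."""
    request_cost = invocations / 1_000_000 * price_per_million_requests
    # Compute is metered as memory (in GB) multiplied by run time (in seconds).
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    return request_cost + compute_cost


# Placeholder rates and workload, for illustration only.
estimate = serverless_monthly_cost(
    invocations=2_000_000, avg_duration_ms=500, memory_mb=512,
    price_per_million_requests=0.20,
    price_per_gb_second=0.00002)
print(f"estimated monthly cost: ${estimate:.2f}")
```

Note what is missing from the formula: there is no term for idle time. A function that is never called costs nothing under this model, which is the main contrast with an always-on instance.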

Probably one of the best consequences of this push to provide value-added services to the cloud is a move towards loosely-coupled software architectures. I remember learning about loose coupling back in the 1980s, but rarely saw it used in production projects. Developers were aware of the advantages of loosely-coupled software, but we seemed to be stuck in the traditional architectures and design models for software. With Cloud, this is no longer just a design decision; it translates into cash savings (as well as all the normal benefits, such as fault-tolerance and scalability).
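For readers who haven't seen loose coupling in practice, here is a toy sketch. Python's in-process queue.Queue stands in for a managed message queue, and the function names and event format are invented for this example; a real cloud deployment would use the provider's queue service and run the two sides as separate services.

```python
import json
import queue


def place_order(outbox, order_id, amount):
    """Producer: publishes an event and knows nothing about consumers."""
    event = {"type": "order_placed", "order_id": order_id, "amount": amount}
    outbox.put(json.dumps(event))


def bill_orders(inbox):
    """Consumer: drains the queue and knows nothing about producers."""
    invoices = []
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            break
        event = json.loads(message)
        if event["type"] == "order_placed":
            invoices.append((event["order_id"], event["amount"]))
    return invoices


events = queue.Queue()  # stand-in for a managed message queue
place_order(events, "A-1", 25.0)
place_order(events, "A-2", 40.0)
invoices = bill_orders(events)
print(invoices)
```

The two functions share only the message format. Either side can be rewritten, scaled out, or shut off when idle without touching the other, and in a pay-per-use environment that last property is where the cash savings come from.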

A lot of the innovation in Cloud offerings is being driven by Amazon, with the other major cloud providers mostly playing catch-up. Part of this is the influence of Netflix on AWS. Netflix was formed as a "cloud first" company and (as the story goes) tormented Amazon into building AWS into the most stable and feature-rich platform around. It's beyond the scope of this article to go into the details, but it's worth searching for and reading through the story. Netflix has also been a major driver of the DevOps model (search for Chaos Monkey for a remarkable read). And, for another interesting piece of the puzzle, search for the legendary (or mythical?) Bezos API Mandate, which finally pushed loosely coupled software design into the mainstream.

So, to summarize, even though cloud is "just someone else's computer," the Cloud companies have gone a long way toward making your life easier by automating all the things that your IT staff has been going to do "real soon now." AWS has already done it, cheaper than you can do it, and with a much more solid and flexible implementation. Of course all these great features will cost you money, and once you start using all the cool toys, you'll want to use more and, naturally, pay more.