Scheduling Algorithms

Scheduling jobs on a cluster is a complicated topic and is still an area of active research. If your cluster only runs a single type of job, or runs jobs on a regular schedule (like a 'daily roll-up'), then scheduling can be quite simple. With a mixed user base, though, the scheduling becomes more difficult. In general, all scheduling algorithms work the same way: they take a pile of submitted jobs, and run the job with the highest priority next. The trick is in how we determine a jobs priority.

There are three high-level scheduling goals that you may be concerned with:

  1. Optimize cluster use so it's as busy as possible. This one is easy to understand. A modern cluster is an expensive investment, and every minute that a portion of it is idle costs your institution money. If you invest $1 million in a 64-node cluster with a 3-year lifetime, a single day of idle time costs you $913 (not counting power, space, and cooling). That works out to about $0.60 per hour for every idle node. Technically, it's no additional cost, but "lost opportunity" that you've already paid for. The busier your cluster is, the better you're spending your money.
  2. Ensure the cluster is shared proportionately between sponsors. This one seems simple at first glance, but it quickly becomes very complicated. Let's say you have a cluster that was 60% funded by the Physics Department, and 40% funded by the Biology Department. It would seem to be an easy task to split the cluster in 60/40 portions and allocate the nodes proportionately, but what if the Biology Department only runs jobs on weekends? That means that 40% of your cluster is guaranteed to be idle all week long. What if you could let the Physics Department use 100% during the week, and give the Biology Department 100% on weekends? That would be better utilization, even if not exactly a 60/40 split. This is where Fair-Share scheduling comes in. The example is fairly simple. Now imagine you have seven departments, each with three to nine major projects, each of which is an individual contributor to your cluster, and you have to balance job run time depending on the individual contributions of each project. No matter how you tackle this problem, your users are going to complain that they're not getting their fair share, and you'll spend hours working out the algorithm by hand to show them that it really is working as intended.

  3. Something much more complicated. This may be simply a combination of the first two items above. As if Fair Share scheduling wasn't bad enough, combining it with a general access mechanism for all the other users just makes everything more complicated. More often than not, if you can't make your users happy with a simple scheduling algorithm, you end up with some combination of the schemes above plus a number of other factors to skew the job priority. This is what we typically refer to as multi-factor priority scheduling. In the Fair-Share example above, we don't want to prevent the Biology Department from running jobs during the week, we just want to give them a much lower priority. Likewise, on the weekends, we don't want to completely prevent the Physics Department from running jobs, we just want to give a much higher priority to Biology. In terms that might be much easier to understand, if the payroll department runs on your cluster, their jobs take priority over all other jobs right before payday.

Before we look at the specific scheduling algorithms, it helps to take a look at how jobs can be scheduled and then look at the specific algorithms that address problems with these approaches.

A "typical" job mix

For the sake of these discussions, let's use the following job mix for reference.

Job # User Account Time Nodes Job
123 markov phys-part 2h 2 neutro3
124 markov phys-part 2h 2 neutro3
125 julie bio-env 1h 1 biodiv
126 mendle bio-epi 4h 4 denovo

First-in, first-out

This is rarely a viable scheduling strategy, but in certain very obscure cases, it might actually be used (if every job needs 100% of network bandwidth, storage bandwidth, or some other obscure resource). In a practical sense, this is simply never used. This is the algorithm that a programmer would come up with as a "first pass" if you gave her the task of writing a scheduler in 15 minutes.

If we run our sample job mix through a "first-come, first-served" algorithm, this is what we end up with:

node
node4 126 126 126 126
node3 126 126 126 126
node2 123 123 124 124 126 126 126 126
node1 123 123 124 124 125 126 126 126 126
Time 1h 2h 3h 4h 5h 6h 7h 8h 9h

That works out to 56.25% utilization for our cluster, and it takes a total of nine hours to run all the jobs in our queue. There's obvious extra capacity available in our cluster that is being wasted by only running a single job at a time. That's where backfill comes in.

First-in, first-out with backfill

This is a fairly common scheduling algorithm for institutional clusters where the goal is to share resources among several users in a mostly fair manner. Jobs are run in the order that they're submitted, but when there's extra capacity, the job queue is scanned for any job that will fit into the "holes" created by unused nodes. This algorithm has the advantage of better utilization, but also encourages users to do a better job at specifying the requirements of their job. If a job only requires one hour to run, it may get run earlier if that time limit is specified as a requirement instead of using the default run time for the queue, which may be eight hours, 24 hours, or longer. Here's what our job queue looks like when we use this algorithm:

node
node4 124 124 126 126 126 126
node3 124 124 126 126 126 126
node2 123 123 126 126 126 126
node1 123 123 125 126 126 126 126
Time 1h 2h 3h 4h 5h 6h 7h

In this case, our utilization has increased to 89%, and we were able to run all available jobs in seven hours. It's an obvious improvement, and naturally is more pronounced with a larger job queue and more resources to schedule.

Fair-share/fair-tree

This is typically the scheduling algorithm used when a computing resource has multiple sponsors and the intent is to share the resources between the stakeholders depending on their contribution. This is also the most difficult algorithm to debug. Users will frequently complain that they're not getting their fair share. After spending a few hours working the algorithm out by hand, you can show them that things are working as designed. Then they'll say "Oh." and walk away. Be prepared to spend a lot of time explaining how things work if you implement fair-share as a scheduling algorithm.

There are multiple algorithms for fair-share scheduling, including a new-and-improved algorithm called fair-tree that is now the default for the Slurm fair-share method. This was developed by a student at BYU, and published in a paper a few years ago. The paper is a good introduction to the complexity of fair-share scheduling, and is worth reading through.

In general, a fair-share algorithm works by calculating a job's priority based on the user's "share" of the cluster, along with that user's recent usage. The "recent usage" factor is a multiplier used to adjust priority based on how much compute time the user has accrued recently, with more recent usage counting more than usage days or weeks ago. We adjust this based on a "decay factor" that makes usage count less as jobs age, so once a prior job goes beyond a particular window of time (two weeks is common) it is no longer figured into the calculation.

This is easiest to explain using a few examples. The following example uses a fairly simple fair-share algorithm that computes a job priority between 0 and 1. Let's say we have two users jack and jill, and they're running on our Raspberry Pi cluster with four compute nodes. User jack purchased for one of the nodes and user jill has purchased three of the nodes, giving us a 25%/75% split of the cluster, for a Share (S) of 0.25 and 0.75, respectively. Let's further assume that we're going to consider jobs over the past seven days, and we'll use a decay (D) factor of 0.7. For the sake of simplicity, we'll consider usage calculated on a daily basis. Then the priority (P) of any job can be calculated as:

P = S - U

The Usage factor will be calculated as how much of the cluster was used by a particular user today, yesterday (today -1), the day before yesterday (today - 2), etc. for the past 7 days, biased by the Decay factor. We will calculate usage of the cluster in one hour increments, so usage will equal (used hours)/(available hours). In other words, the cluster provides 96 usage increments each day (24 hours times 4 nodes), so a job that runs all day will give a Usage of 1, a job that runs all day on two nodes will give a Usage of 0.5 (48/96), etc. Our Usage factor can then be calculated as:

U = S * (U(today) + ( D * U(today - 1)) + (D * D * U(today - 2)) + (D * D * D * U(today - 3)) + (D * D * D * D * U(today - 4)) + (D * D * D * D * D * U(today - 5)) + (D * D * D * D * D * D * U(today - 6)) ) *

Looking at the equation, you can see how the decay factor prioritizes recent usage over usage from older jobs. If a job runs today that uses 100% of the cluster, the Usage factor is equal to the Share(0.75), so priority goes to 0. After one day, the decay factor reduces this penalty to 0.525, so the priority becomes 0.225. After two days, the penalty is 0.3675, so the Priority becomes 0.3825.

Now, let's assume that both jack and jill submit six jobs at the same time asking for all four nodes for six hours. Assume this is a new cluster with no historical data. Then we can calculate priority as:

P(jack) = 0.25 - U(0) == 0.25
P(jill) = 0.75 -U(0) == 0.75

Jill's job will run first because it has a higher priority. In this case, there's no recent usage by either user, so the Usage factor is 0 in both cases.

After Jill's job completes its six hour run, our priority looks like this:

P(jack) = 0.25 - U(0) == 0.25
P(jill) = 0.75 - U(24/96) = 0.75 - 0.25 == 0.5

Jill's second job will run because it still has higher priority. When Jill's second job is complete, our priority queue looks like this:

P(jack) = 0.25 - U(0) == 0.25
P(jill) = 0.75 - U(48/96) = 0.75 - 0.5 == 0.25

Now both both users have the same priority, and we need a tie-breaker. For simplicity, let's just break ties by giving the top priority to the user with the highest share, so Jill's third job will run next. Afterward, we have:

P(jack) = 0.25 - U(0) == 0.25
P(jill) = 0.75 - U(72/96) = 0.75 - 0.75 == 0

Finally Jack's first job can run. Afterward, our priority looks like:

P(jack) = 0.25 - U(24/96) = 0.25 - 0.25 == 0
P(jill) = 0.75 - U(72/96) = 0.75 - 0.75 == 0

Now we're back to another tie-breaker, but at least this gives you an idea of how a typical fair-share algorithm works. We may be in a tie-breaker in this case, but note that we've scheduled the full 24 hour period. Also note that our algorithm has seemingly worked perfectly, in that Jill owns 75% of the cluster and was allowed to run 75% of the available time. Likewise, Jack owns 25% of the cluster and was allowed to run 25% of the time.

If only fair-share were this simple in the real world. In reality, even if we're using some sort of fair-share algorithm to prioritize jobs, we usually must take other factors into consideration. Typically, we add a little priority to every job as it gets older. We may have to account for scheduling across limited resources like software licenses, or GPUs. That's where the next section comes in.

Multi-factor prioritization

Even if we're fortunate enough to have a very simple scheduling algorithm, we frequently end up in some sort of multi-factor prioritization scheme. Frequently when we have "sponsored" clusters where the resource is paid for by individual users or groups, we still reserve some portion of the cluster for special cases, or for new users to test their codes before committing to sponsoring the cluster.

Some of the additional factors we include in calculating job priority include:

  • Licenses for commercial software
    Don't run a job if it requires a license that isn't available
  • Access to specialized hardware like GPUs or FPGAs
    Don't run a job if it requires an unavailable GPU, or give preference to jobs that can utilize GPUs
  • Job age
    Ensure that every job will eventually run by increasing its priority over time
  • Core affinity
    Schedule multiple cores on the same CPU if possible
  • NUMA affinity
    Schedule processor cores along with the memory directly attached to them
  • Network affinity
    Schedule jobs across nodes on the same physical switch if possible
  • Emergency job runs
    Give emergency jobs the highest possible priority (for example, tornado forecasting)

These are fairly common constraints on job scheduling, but you can imagine how complicated things get in a large facility with many varied users.. On the largest supercomputing clusters in the world, there's usually an effort to favor very large jobs. Otherwise, there's really no need to go through the engineering and logistical nightmare to build one of these machines. With that said, we don't want to completely block the smaller jobs. Consider a pipeline job that begins with a data reduction step that only requires a few nodes, then uses 40000 nodes to do its computation, then finishes with another data reduction step that, again, only uses a few nodes. We can either let the smaller jobs run on the same machine, or move data between clusters. We also want to maximize our utilization on these extra large resources, so the smaller jobs can be used to backfill any unused compute resources.