Scheduling jobs on a cluster is a complicated topic and is still an area of active research. If your cluster only runs a single type of job, or runs jobs on a regular schedule (like a 'daily roll-up'), then scheduling can be quite simple. With a mixed user base, though, scheduling becomes more difficult. In general, all scheduling algorithms work the same way: they take a pile of submitted jobs and run the job with the highest priority next. The trick is in how we determine a job's priority.
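In Python-flavored pseudocode, the core loop of almost any scheduler looks the same; the `priority()` function is where all the interesting policy lives (the function and job representation here are illustrative, not taken from any particular scheduler):

```python
import heapq

def schedule(jobs, priority):
    """Run submitted jobs highest-priority-first.

    The priority() function is the whole policy: FCFS, backfill,
    fair-share, and friends differ only in how this number is computed.
    """
    # heapq is a min-heap, so negate the priority to pop the
    # highest-priority job first; the submit index breaks ties.
    heap = [(-priority(job), i, job) for i, job in enumerate(jobs)]
    heapq.heapify(heap)
    while heap:
        _, _, job = heapq.heappop(heap)
        yield job

# With a constant priority, ties fall back to submit order: pure FCFS.
run_order = list(schedule(["job123", "job124", "job125"], lambda job: 0))
```

Every algorithm discussed below can be viewed as a different choice of that one function.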
There are several high-level scheduling goals that you may be concerned with. Chief among them is ensuring the cluster is shared proportionately between its sponsors. This seems simple at first glance, but it quickly becomes very complicated. Let's say you have a cluster that was 60% funded by the Physics Department and 40% funded by the Biology Department. It would seem to be an easy task to split the cluster in 60/40 portions and allocate the nodes proportionately, but what if the Biology Department only runs jobs on weekends? That means that 40% of your cluster is guaranteed to be idle all week long. What if you could let the Physics Department use 100% during the week, and give the Biology Department 100% on weekends? That would be better utilization, even if not exactly a 60/40 split. This is where fair-share scheduling comes in. That example is fairly simple. Now imagine you have seven departments, each with three to nine major projects, each of which is an individual contributor to your cluster, and you have to balance job run time depending on the individual contributions of each project. No matter how you tackle this problem, your users are going to complain that they're not getting their fair share, and you'll spend hours working out the algorithm by hand to show them that it really is working as intended.
Before we look at the specific scheduling algorithms, it helps to take a look at how jobs can be scheduled and then look at the specific algorithms that address problems with these approaches.
For the sake of these discussions, let's use the following job mix for reference.
| Job # | User | Account | Time | Nodes | Job |
|---|---|---|---|---|---|
| 123 | markov | phys-part | 2h | 2 | neutro3 |
| 124 | markov | phys-part | 2h | 2 | neutro3 |
| 125 | julie | bio-env | 1h | 1 | biodiv |
| 126 | mendle | bio-epi | 4h | 4 | denovo |
First-come, first-served (running each job to completion, in the order submitted, before starting the next) is rarely a viable scheduling strategy, but in certain very obscure cases it might actually be used (if every job needs 100% of network bandwidth, storage bandwidth, or some other scarce resource). In practice, it's essentially never used. This is the algorithm that a programmer would come up with as a "first pass" if you gave her the task of writing a scheduler in 15 minutes.
If we run our sample job mix through a "first-come, first-served" algorithm, this is what we end up with:
| Node | 1h | 2h | 3h | 4h | 5h | 6h | 7h | 8h | 9h |
|---|---|---|---|---|---|---|---|---|---|
| node4 | | | | | | 126 | 126 | 126 | 126 |
| node3 | | | | | | 126 | 126 | 126 | 126 |
| node2 | 123 | 123 | 124 | 124 | | 126 | 126 | 126 | 126 |
| node1 | 123 | 123 | 124 | 124 | 125 | 126 | 126 | 126 | 126 |
That works out to roughly 69% utilization for our cluster (25 of the 36 available node-hours), and it takes a total of nine hours to run all the jobs in our queue. There's obvious extra capacity available in our cluster that is being wasted by only running a single job at a time. That's where backfill comes in.
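The serial schedule above is easy to model. This sketch (job run times taken from the sample job mix) just starts each job when the previous one finishes:

```python
# Run time in hours for jobs 123-126, taken from the sample job mix.
job_hours = {123: 2, 124: 2, 125: 1, 126: 4}

def serial_fcfs(job_hours):
    """One job at a time, in submission order: each job's start is
    simply the finish time of the previous one."""
    clock = 0
    schedule = {}
    for job_id, hours in job_hours.items():
        schedule[job_id] = (clock, clock + hours)   # (start, end)
        clock += hours
    return clock, schedule

makespan, sched = serial_fcfs(job_hours)
# Everything finishes after 2 + 2 + 1 + 4 = 9 hours, matching the table.
```

The makespan is just the sum of the run times, which is exactly why this approach wastes so much capacity.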
Backfill is a fairly common scheduling algorithm for institutional clusters where the goal is to share resources among several users in a mostly fair manner. Jobs are run in the order that they're submitted, but when there's extra capacity, the job queue is scanned for any job that will fit into the "holes" created by unused nodes. This algorithm not only improves utilization, but also encourages users to do a better job of specifying the requirements of their jobs. If a job only requires one hour to run, it may get run earlier if that time limit is specified as a requirement instead of using the default run time for the queue, which may be eight hours, 24 hours, or longer. Here's what our job queue looks like when we use this algorithm:
| Node | 1h | 2h | 3h | 4h | 5h | 6h | 7h |
|---|---|---|---|---|---|---|---|
| node4 | 124 | 124 | | 126 | 126 | 126 | 126 |
| node3 | 124 | 124 | | 126 | 126 | 126 | 126 |
| node2 | 123 | 123 | | 126 | 126 | 126 | 126 |
| node1 | 123 | 123 | 125 | 126 | 126 | 126 | 126 |
In this case, our utilization has increased to 89%, and we were able to run all available jobs in seven hours. That's an obvious improvement, and the effect naturally becomes more pronounced with a larger job queue and more resources to schedule.
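A minimal hour-by-hour sketch of this behavior: each hour, scan the queue in submission order and start any job whose node request fits on the currently free nodes. (Real backfill schedulers also reserve a future start time for the job at the head of the queue so it can't be starved; that refinement is omitted here.)

```python
def backfill(jobs, total_nodes=4):
    """jobs: list of (job_id, hours, nodes) in submission order.
    Each hour, start every queued job (in order) that fits on the
    free nodes. Returns (makespan, utilization)."""
    pending = list(jobs)
    running = []            # list of (end_hour, nodes)
    ends = []
    hour = 0
    while pending:
        running = [(e, n) for e, n in running if e > hour]
        free = total_nodes - sum(n for _, n in running)
        for job in list(pending):
            _, hours, nodes = job
            if nodes <= free:
                free -= nodes
                running.append((hour + hours, nodes))
                ends.append(hour + hours)
                pending.remove(job)
        hour += 1
    makespan = max(ends)
    used = sum(h * n for _, h, n in jobs)
    return makespan, used / (makespan * total_nodes)

# The sample job mix: (job id, hours, nodes).
jobs = [(123, 2, 2), (124, 2, 2), (125, 1, 1), (126, 4, 4)]
makespan, util = backfill(jobs)
# Reproduces the table above: done in 7 hours at 25/28 = 89% utilization.
```

Note how job 125 slips into the hole at hour 3 while the four-node job 126 waits for the whole cluster to free up.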
Fair-share is typically the scheduling algorithm used when a computing resource has multiple sponsors and the intent is to share the resources between the stakeholders depending on their contribution. This is also the most difficult algorithm to debug. Users will frequently complain that they're not getting their fair share. After spending a few hours working the algorithm out by hand, you can show them that things are working as designed. Then they'll say "Oh." and walk away. Be prepared to spend a lot of time explaining how things work if you implement fair-share as a scheduling algorithm.
There are multiple algorithms for fair-share scheduling, including a newer algorithm called fair-tree that is now the default fair-share method in Slurm. It was developed by a student at BYU and published in a paper a few years ago. The paper is a good introduction to the complexity of fair-share scheduling and is worth reading through.
In general, a fair-share algorithm works by calculating a job's priority based on the user's "share" of the cluster, along with that user's recent usage. The "recent usage" factor is a multiplier used to adjust priority based on how much compute time the user has accrued recently, with more recent usage counting more than usage days or weeks ago. We adjust this based on a "decay factor" that makes usage count less as jobs age, so once a prior job goes beyond a particular window of time (two weeks is common) it is no longer figured into the calculation.
This is easiest to explain using a few examples. The following example uses a fairly simple fair-share algorithm that computes a job priority between 0 and 1. Let's say we have two users, jack and jill, running on our Raspberry Pi cluster with four compute nodes. User jack paid for one of the nodes and user jill paid for three of the nodes, giving us a 25%/75% split of the cluster, for a Share (S) of 0.25 and 0.75, respectively. Let's further assume that we're going to consider jobs over the past seven days, and we'll use a decay factor (D) of 0.7. For the sake of simplicity, we'll consider usage calculated on a daily basis. Then the priority (P) of any job can be calculated as:
P = S - U
The Usage factor will be calculated as how much of the cluster was used by a particular user today, yesterday (today - 1), the day before yesterday (today - 2), and so on for the past seven days, biased by the decay factor. We will calculate usage of the cluster in one-hour increments, so usage will equal (used node-hours)/(available node-hours). In other words, the cluster provides 96 usage increments each day (24 hours times 4 nodes), so a job that runs all day on all four nodes will give a Usage of 1, a job that runs all day on two nodes will give a Usage of 0.5 (48/96), etc. Our Usage factor can then be calculated as:
U = U(today) + D * U(today - 1) + D^2 * U(today - 2) + D^3 * U(today - 3) + D^4 * U(today - 4) + D^5 * U(today - 5) + D^6 * U(today - 6)

Looking at the equation, you can see how the decay factor prioritizes recent usage over usage from older jobs. Suppose jill runs jobs that use half of the cluster for a whole day: her Usage factor that day is 0.5, so her priority drops to 0.75 - 0.5 = 0.25. One day later, the decay factor reduces that usage term to 0.35, so her priority recovers to 0.4. After two days, the term is 0.245, and her priority is back up to 0.505.
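This calculation fits in a few lines. In this sketch (one simple formulation, matching the jack-and-jill walkthrough below), `daily_usage[0]` is today's usage fraction, `daily_usage[1]` is yesterday's, and so on:

```python
DECAY = 0.7   # usage from older days counts for less

def usage_factor(daily_usage, decay=DECAY):
    """daily_usage[0] is today's fraction of the cluster used,
    daily_usage[1] is yesterday's, etc. (up to 7 entries for
    a 7-day window)."""
    return sum(decay ** age * used for age, used in enumerate(daily_usage))

def priority(share, daily_usage):
    """Priority is the user's share minus their decayed recent usage."""
    return share - usage_factor(daily_usage)

# jill (share 0.75) used 24 of the 96 node-hours available today:
p_today = priority(0.75, [24 / 96])      # 0.75 - 0.25 = 0.5
# The same usage one day old is decayed: 0.75 - 0.7 * 0.25 = 0.575.
p_aged = priority(0.75, [0, 24 / 96])
```

Beyond the seven-day window, the usage simply drops out of the list and stops affecting priority at all.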
Now, let's assume that both jack and jill submit six jobs at the same time asking for all four nodes for six hours. Assume this is a new cluster with no historical data. Then we can calculate priority as:
P(jack) = 0.25 - U(0) == 0.25
P(jill) = 0.75 - U(0) == 0.75
Jill's job will run first because it has a higher priority. In this case, there's no recent usage by either user, so the Usage factor is 0 in both cases.
After Jill's job completes its six hour run, our priority looks like this:
P(jack) = 0.25 - U(0) == 0.25
P(jill) = 0.75 - U(24/96) = 0.75 - 0.25 == 0.5
Jill's second job will run because it still has higher priority. When Jill's second job is complete, our priority queue looks like this:
P(jack) = 0.25 - U(0) == 0.25
P(jill) = 0.75 - U(48/96) = 0.75 - 0.5 == 0.25
Now both users have the same priority, and we need a tie-breaker. For simplicity, let's just break ties by giving the top priority to the user with the highest share, so Jill's third job will run next. Afterward, we have:
P(jack) = 0.25 - U(0) == 0.25
P(jill) = 0.75 - U(72/96) = 0.75 - 0.75 == 0
Finally Jack's first job can run. Afterward, our priority looks like:
P(jack) = 0.25 - U(24/96) = 0.25 - 0.25 == 0
P(jill) = 0.75 - U(72/96) = 0.75 - 0.75 == 0
Now we're back to another tie-breaker, but at least this gives you an idea of how a typical fair-share algorithm works. Note that we've now scheduled the full 24-hour period, and our algorithm has seemingly worked perfectly: Jill owns 75% of the cluster and was allowed to run 75% of the available time, while Jack owns 25% of the cluster and was allowed to run 25% of the time.
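The whole 24-hour walkthrough can be replayed in a few lines. This sketch assumes whole-cluster, six-hour jobs and usage tracked within a single day (so no decay is needed), with ties broken in favor of the larger share as above:

```python
shares = {"jill": 0.75, "jack": 0.25}
usage = {"jill": 0.0, "jack": 0.0}    # fraction of today's 96 node-hours
ran = []

for slot in range(4):                  # four 6-hour, whole-cluster slots
    # Priority is share minus usage; ties go to the larger share.
    winner = max(shares, key=lambda u: (shares[u] - usage[u], shares[u]))
    ran.append(winner)
    usage[winner] += 24 / 96           # 6 hours x 4 nodes = 24 node-hours

# jill gets three slots (75% of the day), jack gets one (25%).
```

Each completed slot charges the winner 24/96 = 0.25 of usage, which is exactly how the hand-calculated priorities stepped down from 0.75 to 0.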
If only fair-share were this simple in the real world. In reality, even if we're using some sort of fair-share algorithm to prioritize jobs, we usually must take other factors into consideration. Typically, we add a little priority to every job as it gets older. We may have to account for scheduling across limited resources like software licenses, or GPUs. That's where the next section comes in.
Even if we're fortunate enough to have a very simple scheduling algorithm, we frequently end up with some sort of multi-factor prioritization scheme. Even on "sponsored" clusters, where the resource is paid for by individual users or groups, we often still reserve some portion of the cluster for special cases, or for new users to test their codes before committing to sponsoring the cluster.
Some of the additional factors that go into calculating job priority include:

- the age of the job, since jobs typically accrue a little priority as they wait in the queue
- limited resources such as software licenses or GPUs
- capacity reserved for special cases, or for new users testing their codes
- job size, since the largest machines often deliberately favor very large jobs
These are fairly common constraints on job scheduling, but you can imagine how complicated things get in a large facility with many varied users. On the largest supercomputing clusters in the world, there's usually an effort to favor very large jobs; otherwise, there's really no reason to go through the engineering and logistical nightmare of building one of these machines. With that said, we don't want to completely block the smaller jobs. Consider a pipeline job that begins with a data-reduction step that only requires a few nodes, then uses 40,000 nodes to do its computation, then finishes with another data-reduction step that, again, only uses a few nodes. We can either let the smaller jobs run on the same machine or move data between clusters. We also want to maximize our utilization on these extra-large resources, so the smaller jobs can be used to backfill any unused compute resources.
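One common way to combine such factors is a weighted sum, loosely in the spirit of multi-factor priority plugins like Slurm's. This is a hypothetical sketch, not any scheduler's actual formula, and the weights are made up:

```python
# Hypothetical weights -- in practice these are tuned per site policy.
W_FAIRSHARE = 1000     # scales the [0, 1] fair-share factor
W_AGE       = 10       # points per hour spent waiting in the queue
W_SIZE      = 2        # points per node, to favor very large jobs

def multifactor_priority(fairshare, hours_waiting, nodes):
    """fairshare is in [0, 1]; the age term keeps small jobs from
    starving while the size term still favors the big ones."""
    return (W_FAIRSHARE * fairshare
            + W_AGE * min(hours_waiting, 48)   # cap the age bonus
            + W_SIZE * nodes)

# A huge job outranks a small one with equal fair-share...
big   = multifactor_priority(0.5, 0, 4000)
small = multifactor_priority(0.5, 0, 2)
# ...but aging lets a waiting small job overtake an identical fresh one.
aged  = multifactor_priority(0.5, 48, 2)
```

The cap on the age bonus is one of the knobs that keeps an ancient two-node job from ever outranking the flagship runs the machine was built for, while still guaranteeing it rises enough to be backfilled eventually.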