Note: The text here comes from Luke Maurer's PR
OCaml 5 unleashed parallel programming on the OCaml ecosystem. As is often the case when something is unleashed, this has quite a bit of potential to cause chaos. Race conditions lead to bugs that are easy to write, hard to replicate, and tricky to eliminate. But it's hard to argue with the potential performance gains. What might the language do to help us get those speedups and spend less time debugging nondeterministic behavior?
OxCaml provides an infrastructure for multicore OCaml programming that exploits the full power of parallelism while guaranteeing freedom from an especially pernicious threat, that of the data race. That infrastructure includes many parts, from common primitives like atomics and locks to more exotic features like modes and capsules. Fortunately, many parallel programs only need some of this machinery, and following this tutorial, you'll see how far you can get with just two modes and the Parallel.fork_join2
function.
OCaml 5 provides parallelism for otherwise perfectly ordinary OCaml code. The program is divided into domains, each of which executes in a separate OS thread. Domains are allowed to pass values around freely to each other, so a function may at any time be accessing values allocated by other domains—and thus other domains may be accessing those values in parallel.
The great thing about multicore OCaml is that there is total flexibility to structure the program without concern for who may be touching what at any time. The problem of course is that there is total flexibilty to structure the program without concern for who may be touching what at any time. Race conditions may abound, of which data races are especially concerning. OxCaml's data-race freedom extensions retain the ability to run code in parallel and intermix data between parallel domains, yet they add enough guard rails to eliminate data races entirely.
A data race is a race condition in which two accesses to the same memory location conflict. Specifically, there are four elements of the crime:
The last condition is a carve-out for specially synchronized memory operations provided by Atomic.t
, which we'll cover later on. Any “normal” mutable OCaml data can lead to a data race: references, mutable
fields, and array
s are all possible culprits.
So, for example:
type record = { mutable field : int }
let reference : int ref = ref 1
let record : record = { field = 2 }
let arr : int array = [| 3; 4 |]
All three of these values are possible sources of data races, say if the following two functions were run in two different domains:
let fun1 () =
reference := 2;
record.field <- 3;
arr.(0) <- 4;
print_int arr.(1)
let fun2 () =
reference := 3; (* Data race: write vs. write *)
print_int record.field; (* Data race: read vs. write *)
arr.(0) <- arr.(0) + 1; (* Two data races! *)
print_int arr.(1) (* This one's okay: read vs. read *)
Every race condition is a bug, but data races are especially insidious. Obviously, the outcome is unpredictable, since we don't know which side will “win the race,” but the problems don't end there. Both the compiler and the CPU routinely reorder instructions for performance, and they do so assuming there are no data races. As a consequence, in a program with data races, it's difficult if not impossible to reason about the order in which things happen.
As an example, imagine we have two global variables:
let price_of_gold = ref 0.0
let initialised = ref false
Now consider two pieces of code running in parallel. One is some startup code we'll call domain A:
(* ... *)
price_of_gold := calculate_price_of_gold ();
initialised := true
(* ... *)
The other, domain B, accesses the variables:
(* ... *)
if !initialised then
if !price_of_gold < really_good_price_for_gold then
buy_lots_of_gold ()
(* ... *)
What happens? Well, since they're running in parallel, we don't know exactly. There is of course an intuitive way to reason it out: at any given point in time, each domain will be partway done running, we just never know exactly which domain gets to go next. So maybe domain A sets price_of_gold
, and then domain B reads initialised
and sees that it's false
, and then domain A sets initialised
to true
. Or maybe domain A sets both variables before domain B checks initialised
, and then if gold is cheap, domain B heads to the market.
The one thing we want not to happen is for domain B to see that !initialised
is true
and yet find that price_of_gold
is still stuck at its initial value of 0.0
, thus causing us to buy_lots_of_gold ()
no matter the price. Our intuitive method reassures us that this cannot happen: domain A doesn't set initialised
to true
until it's already done setting price_of_gold
to something sensible.
Unfortunately, our intuitive reasoning is wrong: this code very much can end up calling buy_lots_of_gold ()
despite price_of_gold
being high. Since the assignments to price_of_gold
and initialised
happen in the same domain and neither is a fancy atomic operation, they can be reordered so that initialised
is set first, while price_of_gold
is still zero, leading domain B to make an expensive mistake.
It turns out our intuition assumes what the literature calls sequential consistency: essentially, the principle that things happen in order they're written, and parallelism just means we're not sure which domain will do its next thing. This principle is so essential that it's hard to imagine reasoning about a multicore program without it. Alas, OCaml 5 doesn't guarantee sequential consistency for this code—because of the data races.
There are two data races, to be specific: each of price_of_gold
and initialised
is being modified by domain A and read by domain B in parallel. When there is a data race, the OCaml memory model throws up its hands and says “sorry, no sequential consistency.”^bounded-in-space-and-time
The good news is that the inverse is also true: any program without data races is sequentially consistent. Data-race freedom gives us back the power to reason intuitively about our code, no matter how buggy it might get. In our example, that means that once we fix the data races (easy to do with Atomic.t
), we know our bug is gone: sequential consistency means that it's impossible for domain B to observe a state where initialised
is true
and price_of_gold
is 0.0
.^price-zero
The even better news is that, in code using OxCaml's extensions, the data races in price_of_gold
and initialised
would never have compiled to begin with. There's a tradeoff, of course: just as OCaml's type safety means having to understand the type system, the system for data-race freedom has a learning curve. This tutorial aims to get you moving with a practical framework that hopes to achieve significant parallel speedups without much fuss. You'll still need to convince the compiler that your code has no data races, but we'll walk through the rules and even some shortcuts for common special cases.
Any problem that can be neatly subdivided is a good candidate for fork/join parallelism, as provided by the parallel
library’s fork_join*
functions (XX link to .mli). For example, a very expensive way to add four integers (we promise the other examples are more substantial) would be this:
let add4 (par : Parallel.t) a b c d =
let a_plus_b, c_plus_d =
Parallel.fork_join2 par (fun _par -> a + b) (fun _par -> c + d)
in
a_plus_b + c_plus_d
The call to Parallel.fork_join2
will schedule the calculations of a + b
and c + d
as independent tasks, returning both once they're both done, and then add4
finishes by adding the results from the tasks. The par
argument parameterizes fork_join2
(and, in turn, add4
) by a particular implementation of parallelism. It is also passed to the tasks so that they can spawn sub-tasks.
To run add4
, we need to get our hands on a scheduler. Each scheduler is provided by a library and determines the policy by which tasks get doled out and run in domains. For this tutorial, we'll use parallel_scheduler_work_stealing
, which implements the popular work-stealing strategy.
# let test_add4 par = add4 par 1 10 100 1000
# let () =
let module Scheduler = Parallel_scheduler_work_stealing in
let scheduler = Scheduler.create () in
let monitor = Parallel.Monitor.create_root () in
let result = Scheduler.schedule scheduler ~monitor ~f:test_add4 in
Scheduler.stop scheduler;
Printf.printf "result: %d\n" result;;
result: 1111
- : unit = ()
This creates a work-stealing scheduler, along with a monitor to manage exceptions. Then it tells the scheduler to run the test_add4
function before shutting down the scheduler. (Naturally, a real program will want to keep the monitor and scheduler around longer!)
You can also use the parallel
library's own Parallel.Scheduler.Sequential
, which simply runs everything on the primary domain. This is handy for testing or debugging when you want to eliminate nondeterminism. To do so, simply replace Parallel_scheduler_work_stealing
with Parallel.Scheduler.Sequential
.
Now for something more substantial. Suppose we're working with binary trees, and we want to take an average over all the values in the tree. Here's a basic implementation (note that we've made use of the new labeled tuples (XX: link) feature):
module Tree = struct
type 'a t =
| Leaf of 'a
| Node of 'a t * 'a t
end
let average (tree : float Tree.t) =
let rec total tree : (total:float * count:int) =
match tree with
| Tree.Leaf x -> ~total:x, ~count:1
| Tree.Node (l, r) ->
let ~total:total_l, ~count:count_l = total l in
let ~total:total_r, ~count:count_r = total r in
~total:(total_l +. total_r), ~count:(count_l + count_r)
in
let ~total, ~count = total tree in
total /. (count |> Float.of_int)
;;
let test_tree = Tree.Node (Leaf 3.0, Leaf 4.0)
We can use fork_join2
to parallelize average
:
let average_par (par : Parallel.t) tree =
let rec total par tree : total:float * count:int =
match tree with
| Tree.Leaf x -> ~total:x, ~count:1
| Tree.Node (l, r) ->
let (~total:total_l, ~count:count_l), (~total:total_r, ~count:count_r) =
Parallel.fork_join2
par
(fun par -> total par l)
(fun par -> total par r)
in
~total:(total_l +. total_r), ~count:(count_l + count_r)
in
let ~total, ~count = total par tree in
total /. (count |> Float.of_int)
;;
Note that we don't have to worry about unbalanced trees: the work-stealing algorithm dynamically adapts whenever tasks are unevenly distributed among cores.
We're not limited to working with simple float
s, of course. Suppose we add a module Thing
earlier in the file:
module Thing = struct
module Mood = struct
type t =
| Happy
| Neutral
| Sad
end
type t =
{ price : float
; mutable mood : Mood.t
}
let create ~price ~mood = { price; mood }
let price { price; _ } = price
let mood { mood; _ } = mood
let cheer_up t = t.mood <- Happy
end
All we have to do to sum over the prices in a Thing.t Tree.t
is change the Tree.Leaf
case:
let rec total par tree : total:float * count:int =
match tree with
- | Tree.Leaf x -> ~total:x, ~count:1
+ | Tree.Leaf x -> ~total:(Thing.price x), ~count:1
| Tree.Node (l, r) ->
let (~total:total_l, ~count:count_l), (~total:total_r, ~count:count_r) =
And of course you'll need a new test tree:
let test_tree =
Tree.Node
( Leaf (Thing.create ~price:3.0 ~mood:Happy)
, Leaf (Thing.create ~price:4.0 ~mood:Sad) )
So far, so good. But something annoying happens if we introduce an abstraction barrier. Let's introduce a signature for [Thing].
module type THING = sig
type t
module Mood : sig
type t =
| Happy
| Neutral
| Sad
end
val create : price:float -> mood:Mood.t -> t
val price : t -> float
val mood : t -> Mood.t
val cheer_up : t -> unit
end
module Thing : THING = struct
module Mood = struct
type t =
| Happy
| Neutral
| Sad
end
type t =
{ price : float
; mutable mood : Mood.t
}
let create ~price ~mood = { price; mood }
let price { price; _ } = price
let mood { mood; _ } = mood
let cheer_up t = t.mood <- Happy
end
But now we get an error from the compiler:
The value total is nonportable, so cannot be used inside a function that is
portable.
Why are things breaking just because we moved some code? We can get closer to the answer by insisting that total
is portable
:
let average_par (par : Parallel.t) (tree : Thing.t Tree.t) =
- let rec total par tree =
+ let rec (total @ portable) par tree =
match tree with
The value Thing.price is nonportable, so cannot be used inside a function that
is portable.
What's going on is that our declaration for price
left out crucial information that the compiler was previously able to infer. We can take the compiler's suggestion easily enough by explaining that price
is portable
:
val price : t -> float @@ portable
But the compiler isn't satisfied. Now it complains about the l
in total par l
(which is still as it was before):
This value is contended but expected to be uncontended.
Now the solution is much less obvious, but as we'll see, it again comes down to missing information on price
:
val price : t @ contended -> float @@ portable
(The precedences here are such that the t
argument is contended
and the whole function, namely price
, is portable
.)
Once you're familiar with the portable
and contended
modes, you'll see that this can be read as saying:
Why should we believe this is true? Well, remember, price
is simply
let price { price; _ } = price
It does nothing but project an immutable field out of a record. And as we covered in defining a data race, rule 3 is that in order for two accesses to produce a data race, one of them must be a write—and an immutable field can't be modified[^write-on-initialization]. Hence price
can't produce a data race no matter what is happening in parallel. The situation changes if we make price
mutable:
type t =
- { price : float
+ { mutable price : float
; mutable mood : Mood.t
}
Values do not match:
val price : t -> float @@ portable
is not included in
val price : t @ contended -> float @@ portable
The type that the compiler now wants to give price
says essentially “this is only safe to call if no other domain can access the argument.” But Parallel.fork_join2
can't make that promise, since it's scheduling tasks in parallel. (If this all seems vague at the moment, we'll make it systematic when we detail how portable
and contended
work.)
Giving price
the right type is enough to let average_par
go through: from there, the compiler infers that our calls to Parallel.fork_join2
won't produce data races, and in turn that average_par
is again data-race free.
If you modify our test code to run average_par
on the new test_tree
, you'll find yourself needing to make one more change to thing.mli
:
val create : price:float -> mood:Mood.t -> t @ portable @@ portable
This says somewhat tediously that create
both returns something portable
and is itself portable
. Fortunately, we'll see in the section on niceties that both of these can be made implicit.
Adding portable
and contended
to your .mli files can take some work, but over the next few sections we'll cover what exactly the modes mean, what the most important rules for using them are, and some workarounds and shortcuts. In return, these two modes suffice to run even large amounts of code in fork/join style without fear of data races.
portable
and contended
workFirstly, portable
and contended
are modes. Like a type, a mode describes something about a name in an OCaml program. But whereas a type describes the value associated with a name, a mode instead describes the value's circumstances. This could be where it is in memory, who has access to it, or what can be done with it. If you've seen @ local
(link to doc), you've already encountered the local
mode.
The portable
mode is the one you'll see most often, but it will be easier to understand once we've covered contended
and uncontended
, so we begin there.
contended
and uncontended
modesWe said before that a data race needs four things.
The contended
mode and its opposite, uncontended
, prevent data races by ensuring that this is impossible. They institute two key rules:
Taken together, these two rules guarantee data-race freedom: it's not allowed for two domains to access the same location in memory if either of them could possibly modify it.
Let's see what these rules mean for our thing.ml
from before:
type t = {
price : float;
mutable mood : Mood.t
}
We can force an argument to have a particular mode using the @
syntax.
let price (t @ contended) = t.price (* ok: [price] is immutable *)
let cheer_up (t @ contended) = t.mood <- Happy (* error! *)
Error: This value is contended but expected to be uncontended.
This is of course rule 2. To see why this has to be an error, consider that since t
is contended
, there may be another domain trying to mutate it in parallel:
let bum_out t = t.mood <- Sad (* cue ominous music: a data race lurks! *)
So we can't allow the access in cheer_up
unless we flag the argument as uncontended
:
let cheer_up (t @ uncontended) = t.mood <- Happy (* ok: [t] is [uncontended] *)
Note that you can also let the compiler infer that t
is uncontended
(as we did for bum_out
). Much like with types, adding explicit modes is often useful either to make your intentions clear to other humans or nail down the ultimate cause of a compiler error.
Note that rule 2 forbids even reading the mutable state:
let mood (t @ contended) = t.mood (* error! *)
Error: This value is contended but expected to be shared or uncontended.
This is dangerous for the same reason cheer_up
was: someone else could be running bum_out
in parallel, producing a data race.^shared
Adding uncontended
signals to the compiler that a data race is possible: if cheer_up
and bum_out
can be called with the same argument in parallel, we have a data race. Enter rule 1, which simply says it is not permissible to do so. Unlike rule 2, there's no one place where rule 1 is enforced—really, it's an invariant of the whole system (language, runtime, and core libraries)—but let's see one example of trying to break it. First, make sure cheer_up
and bum_out
have the right signatures in thing.mli
:
val cheer_up : t -> unit @@ portable
val bum_out : t -> unit @@ portable
They both take their argument uncontended
, so it would certainly be a shame if they could be called in parallel:
let beat_the_system par =
let t = { price = 42.0; mood = Neutral } in
let (), () =
Parallel.fork_join2 par (fun _par -> cheer_up t) (fun _par -> bum_out t)
in
()
Fortunately:
Error: This value is contended but expected to be uncontended.
When we cover the portable
mode next, we'll be able to explain precisely what's going on here, but for the moment suffice it to say that the type of fork_join2
forces the two tasks to obey rule 1: since the tasks may be running in separate domains, they are forbidden to access t
as uncontended
.
You may worry that the different modes on price
and mood
prevent writing sensible code like
let price_and_mood (t @ uncontended) = price t, mood t
Can we really treat t
as both contended
(to call price
) and uncontended
(to call mood
)? Actually, we can:
This is entirely consistent with rules 1 and 2: having an uncontended
value gives strictly more power than having a contended
one, so it's always safe to forget we have that power. The upshot is that if a function takes a contended
parameter then the caller is allowed to pass a contended
value but it is not required to. On the other hand, uncontended
is a constraint that must be met.^local-means-allowed
Finally, recall that our running example wants to fork/join over an entire tree of Thing.t
s, so we should consider what happens when the Thing.t
is in a bigger data structure.
In particular, every field of a contended
record, every element of a contended
tuple or array
, and every argument of every constructor of a contended
variant is contended
. And of course this applies recursively, so the components of those components are also contended
. Accordingly, we say that the contended
mode is deep. It's easy to see what goes wrong if we let anyone treat, say, a field as uncontended
when its record is contended
:
type t_in_a_trenchcoat = {
inner_t : t;
}
let cheer_up_sneakily (t_in_a_trenchcoat @ contended) =
let t @ uncontended = t_in_a_trenchcoat.inner_t (* error: rule 4 *) in
cheer_up t (* cue the ominous music again *)
If not for rule 4, we could also write bum_out_sneakily
and then call cheer_up_sneakily
and bum_out_sneakily
on the same argument in parallel, causing the same data race we've been trying to avoid all along.
In fact, rule 4 is pretty well required to make rule 1 work at all. The interested reader is invited to work out why the text of rule 1 demands that, for instance, we say that t_in_a_trenchcoat.inner_t
is contended
when t_in_a_trenchcoat
is contended
.
Now, you may be surprised to see that projecting a field from a contended
record gives a contended
value. After all, we have the function:
let price t = t.price
and we stated that its type is:
val price : t @ contended -> float @@ portable
Shouldn't the returned value be contended
? In other circumstances it would: if you try changing the float
in the definition of t
to a float ref
then you'll find you indeed have to add contended
to the return type of price
. We'll cover the details in the section on mode crossing, but the short answer is that since float
s have no mutable parts, no data race can arise from them, so the compiler simply ignores contended
and uncontended
on float
s (and you can make use of this for your own deeply immutable data as well).
portable
and nonportable
modesAs we said before, rule 1 and rule 2 of contended
give us data-race freedom—assuming we can enforce them, that is. The compiler's type checker can just raise an error when rule 2 is violated, but rule 1 is a bit squishier. To reiterate:
We've seen that Parallel.fork_join2
provides parallelism while enforcing this rule. How does it manage that? Let's look again at our attempt to sneak a data race by it, adding a few things for illustration:
let beat_the_system par =
let t @ uncontended = { price = 42.0; mood = Neutral }
cheer_up t; (* line A *)
let (), () =
Parallel.fork_join2 par
(fun _par -> cheer_up t) (* line B *)
(fun _par -> bum_out t) (* line C *)
in
()
Firstly, the new annotation on t
should be uncontroversial: we just created t
, so clearly there aren't any parallel accesses at all, much less uncontended
accesses. Accordingly, the access on line A is fine: it sees t
as uncontended
and it is. On the other hand, the access from line B is clearly bad: that code might^or-same-domain be running in parallel (in particular, in parallel with line C), and it's still assuming t
is uncontended
(as is line C).
In summary, the arguments to fork_join2
t
to be uncontended
.As you might have guessed, this is exactly what the portable
mode is for. Both arguments to fork_join2
are required to be portable
, and we have two rules:
(So far it looks like portable
is only relevant for functions, but when we get to rule 4, we'll see that records and arrays can also fail to be portable
.)
Why does rule 2 say “outside of its own definition”? Well, remember what we said about the t
defined in beat_the_system
above: it was just now created, so we knew that it was uncontended
. Similarly, the following function is portable
:
let (factorial @ portable) i =
let a @ uncontended = ref 1 in
let rec (loop @ nonportable) i =
if i > 1 then (a := !a * i; loop (i - 1))
in
loop i;
!a
As before, a
is uncontended
because it was just created. We can have factorial
access a
because it's allowed to treat things inside its definition as uncontended
. (Note that a ref
is just a record whose only field is mutable
, so !a
requires uncontended
as always.) On the other hand, if we try and mark loop
as portable
, the compiler sees that a
is defined outside of loop
, so a := !a * i
gets hit with the familiar
This value is contended but expected to be uncontended.
because rule 2b insists that it treat a
as contended
. Fortunately, loop
doesn't need to be portable
: since it's defined inside functorial
, it's safe for functorial
to call loop
even though loop
isn't portable
. (Remember, rule 1 says we can't call loop
from outside the domain that created it. Since factorial
is portable
, that could be any domain, but nonetheless its whole body executes in one consistent domain.)
Interestingly, we actually can make loop
portable:
let (factorial' @ portable) i =
let a @ uncontended = ref 1 in
let rec (loop' @ portable) (a @ uncontended) i =
if i > 1 then (a := !a * i; loop' a (i - 1))
in
loop' a i;
!a
A function's parameters aren't defined “outside the function” for the purposes of rule 2, so loop'
is in fact allowed to use a
as uncontended
. This may seem a bit arbitrary: in both cases, the inner function says “I need a value called a
that's an int ref
at mode uncontended
,” but somehow rule 2 only cares when the a
is something from an outer scope rather than a parameter. The difference is in the types:
val loop : int -> unit
val loop' : int ref @ uncontended -> int -> unit @@ portable
By expressing the requirement on a
in its type, loop'
makes its caller take on the responsibility of providing an uncontended
value. In contrast, loop
advertises nothing—or rather, its being nonportable
advertises that it requires some unknown number of uncontended
values. The caller can't hope to provide that, so in general you can't call a nonportable
function unless you know you can access all the same uncontended
values that it can (which is to say, unless you know you're in the same domain).
Of course, making loop
portable
raises a question: Can we now use fork_join2
to parallelize it? Obviously the answer had better be “no,” since it would be a whole rat's nest of data races, but the reason it falls down is a bit subtle. It is instructive, though, so the interested reader is encouraged to try it as an exercise: if loop'
tries to call itself via Parallel.fork_join2
, what error does the compiler raise and why?
Just as it's safe to forget a value's privileged uncontended
status and downgrade it to contended
, there's no danger in treating something portable
as if it's nonportable
:
And also for similar reasons as before, we need portable
to be deep the way contended
is:
You may have been wondering why rules 1 and 2a restrict all values, when really it's only functions that we care about being portable
or not.^other-code-types This is the reason: many values can contain functions, and if it weren't for rule 4, we could use, say, a record to smuggle the function around (in other words, the record could be a function in a trenchcoat). So we have to forbid all accesses of nonportable
values from other domains, not just function calls. The good news is that many types (in fact, most types) obviously can't cause a problem—you'll never make an (int * string option) list
that contains any function, much less one that will cause a data race—and if the compiler is aware that a type t
can't have a function in it, it can ignore portability requirements altogether for values of t
. We'll cover the specifics when we get to mode crossing.
We can summarize the rules of portable
and contended
like so:
|
| |
---|---|---|
Rule 1 | No parallel | Only |
Rule 2 | No accessing | A |
Rule 3 | Can treat | Can treat |
Rule 4 | Everything in a | Everything in a |
As with most systems, data-race freedom involves a tradeoff between safety and convenience. Fortunately there are powerful ways to keep things convenient in common cases.
portable
If you need to add @@ portable
to one function in an .mli, you'll usually soon find yourself adding @@ portable
to most or all of them. In such a case, you can “portabilize” the whole .mli at once simply by adding @@ portable
at the top. Similarly you can have @@ portable
as the first thing in a sig
to make each function portable
. In either case, you can always say @@ nonportable
to opt out an individual function.
So for example, we can take our earstwhile thing.mli
with the changes we've made to date (and say one more function for illustrative purposes):
type t
module Mood : sig
type t =
| Happy
| Neutral
| Sad
end
val create : price:float -> mood:Mood.t -> t @ portable @@ portable
val price : t @ contended -> float @@ portable
val mood : t -> Mood.t @@ portable
val cheer_up : t -> unit @@ portable
val bum_out : t -> unit @@ portable
val do_something_involving_shared_state : unit -> unit
And change the default to portable
:
@@ portable
type t
module Mood : sig
type t =
| Happy
| Neutral
| Sad
end
val create : price:float -> mood:Mood.t -> t @ portable
val price : t @ contended -> float
val mood : t -> Mood.t
val cheer_up : t -> unit
val bum_out : t -> unit
val do_something_involving_shared_state : unit -> unit @@ nonportable
In discussing why portable
is deep, we mentioned that very few types care about portable
, since the values of most types aren't functions and don't contain functions. Similarly, if a type t
doesn't have any mutable
fields or mutable arrays, having a contended
or uncontended
t
makes no difference.
Fortunately, the compiler can be made aware that a type “doesn't care” about portable
or contended
:
type float_and_int : immutable_data = float * int
The immutable_data
here is called a kind. Just as the value (1.2, 3)
has the type float_and_int
, in turn the type float_and_int
has the kind immutable_data
. Also like types, kinds are inferred where possible, so in fact the : immutable_data
in this case is redundant. However, in an .mli they are often useful.
The three most important kinds for data-race freedom are immutable_data
, mutable_data
, and value
. Some kinds constrain what types they describe, and in return types with those kinds get to cross certain modes, in essense ignoring those modes. In summary (this table isn't nearly exhaustive—see the documentation on kinds (XX link) for many more modes and what kinds cross them):
Kind | Requirements | Crosses |
---|---|---|
| no functions or mutable fields, deeply | portability, contention |
| no functions, deeply | portability |
| none | none |
(Note that unboxed types like int64#
can't have any of these kinds, since they have different kinds that express how they're represented in memory and in registers. Nonetheless, they are all immutable data and thus cross both portability and contention.)
A type crossing portability means that portable
and nonportable
are irrelevant for that type: a nonportable
value can be used as though it were portable
. The same goes for contention with contended
and uncontended
.
let always_portable : int_or_string @ nonportable -> int_or_string @ portable =
fun a ->
let a' @ portable = a in (* ok because [int_or_string] crosses portability *)
a'
(We have to write this slightly awkwardly to specify the returned mode of a function.)
In fact, we've already seen mode crossing in action. We mentioned before when talking about rule 4 of contended
that when Thing.price
projects the price
field out of a contended
Thing.t
, the only reason it gets to return an uncontended
float
is that float
crosses contention.
We can make things more convenient with Thing.t
as well. Recall that its definition is simply
type t =
{ price : float
; mutable mood : Mood.t
}
This clearly falls under mutable_data
, which crosses portability. So we can go into thing.mli
and make a few changes. First, we can change the type:
-type t
+type t : mutable_data
Now we can simplify the definition of create
:
-val create : price:float -> mood:Mood.t -> t @ portable
+val create : price:float -> mood:Mood.t -> t
(At one point it also had a @@ portable
but we took it out by defaulting to portable
.)
Since Thing.t
is now mutable_data
, every Thing.t
is implicitly assumed to be portable
.
Any type variable can be given a kind, so we can write a version of always_portable
that works for any mutable_data
type:
let always_portable' : ('a : mutable_data) @ nonportable -> 'a @ portable =
fun a ->
let a' @ portable = a in
a'
The type of always_portable'
can be read “Given any type 'a
of kind mutable_data
, this function takes a nonportable
'a
and returns a portable
'a
.”
We've seen that mutable
fields can cause quite some trouble: they require uncontended
access, which can make functions not portable
, and generally they very much get in the way. In some cases, however, you may (with significant caveats) be able to swap out mutable
for Atomic.t
. Let's do that with our thing.ml
:
module Thing = struct
module Mood = struct
type t =
| Happy
| Neutral
| Sad
end
type t : immutable_data =
{ price : float
; mood : Mood.t Atomic.t
}
let create ~price ~mood = { price; mood = Atomic.make mood }
let price { price; _ } = price
let mood { mood; _ } = Atomic.get mood
let cheer_up { mood; _ } = Atomic.set mood Happy
let bum_out { mood; _ } = Atomic.set mood Sad
end
As you can see, there's a bit of syntactic overhead, but in return we get to access mood
even if a Thing.t
is contended
, and in fact we can mark t
as immutable_data
so that it ignores contended
and uncontended
altogether (that is, it crosses contention). Do go over the documentation for the Atomic
module (XX link to documentation), as it has many useful operations from compare_exchange
to atomic logical bitwise XOR. Also, note that we're using Core's Atomic
here rather than the Atomic
from OxCaml's standard library, which hews closer to the upstream OCaml standard library and doesn't support mode crossing.
An Atomic.t
can be handy outside of a record as well, in cases where you would otherwise use a ref
to hold mutable state. For example, rather than return the total and count from our fork/join tasks, we can keep the running total and count in atomics:
let average_par_running (par : Parallel.t) tree =
let total = Atomic.make 0.0 in
let count = Atomic.make 0 in
let rec go par tree =
match tree with
| Tree.Leaf x ->
Atomic.update total ~pure_f:(fun total -> total +. Thing.price x);
Atomic.incr count
| Tree.Node (l, r) ->
let (), () =
Parallel.fork_join2 par (fun par -> go par l) (fun par -> go par r)
in
()
in
go par tree;
Atomic.get total /. (Atomic.get count |> Float.of_int)
If we kept total
and count
in ref
s, then go
would not be able to access them. (Exercise: What rules combine to stop us? Remember, a ref
is just a record whose only field is mutable
. Don't rely on rule 1 of contended
or rule 1 of portable
for your answer—those only tell us that we must be stopped, not what actually stops us.)
Now for the caveats: Firstly, there are performance penalties, since an Atomic.t
is a pointer and atomic operations are more expensive. Secondly, Atomic.t
frees us from concerns about data races, but it absolutely does not prevent race conditions in general. There are few guarantees about the order in which the updates to total
and count
occur: we know only that they all happen before the outermost fork_join2
returns.
Of course, in this case what saves us is that it doesn't matter what order the updates happen in: addition is commutative. However, now that we've made mood
atomic as well, suppose we wanted to take the average of all the nodes whose mood
is Happy
? Nothing in the system stops us now: if someone is calling bum_out
on the entire tree in parallel, then our average will reflect an unpredictable number of those changes. The only way to stop that is to use something more sophisticated like a lock over the whole tree, which grants a function uncontended
access while the lock is held (which is safe because of course only one domain can hold the lock). That's what the capsule API is for, but it's beyond the scope of this tutorial.
The good news is that data-race freedom guarantees that even buggy programs can be reasoned about intuitively. See Why are data races bad?.
Another possible way to avoid getting tangled in contended
is to change your array
s into iarray
s. Rule 2 of contended
forbids accessing an element of a contended
array
—that's specifically a contended
value of the OCaml type array
. Since an iarray
's elements are immutable, they're exempt from the rule (just like immutable fields of a record). Obviously, your code may not be able to make this change easily. However, since iarray
s are a new feature, it's worth checking whether your array
s are ever actually mutated. If not, changing them to iarray
s means accesses no longer have to be uncontended
—and possibly even that your record can be immutable_data
.
Our very first example added exactly four integers, and generally we've been assuming that our data is in a shape that makes it obvious how to parallelize (or at least makes one strategy obvious). But of course real data isn't so convenient:
let add_many (arr : int iarray) =
Iarray.fold arr ~init:0 ~f:(fun a b -> a + b)
(We'll be using immutable arrays a lot, since mutable arrays are miserable for parallelism. A mutable array is essentially nothing but a series of mutable
fields, requiring uncontended
access to do almost anything.)
Parallelizing this directly is possible but fussy even in this simple case. The Parallel.Sequence
module makes it nearly trivial:
let add_many_par par arr =
let seq = Parallel.Sequence.of_iarray arr in
Parallel.Sequence.reduce par seq ~f:(fun a b -> a + b)
|> Option.value ~default:0
We first need to convert from iarray
to Parallel.Sequence.t
(XX link to documentation), a general sequence type supporting a rich selection of parallel operations. Among them is a parallel reduce
, which operates much like an unordered fold
.
Just like Parallel.fork_join2
, we can use parallel sequences in a nested manner. Suppose that rather than a binary tree we have an n-ary tree:
module Tree = struct
type 'a t =
| Leaf of 'a
| Nodes of 'a t iarray
end
The sequential code isn't much different from before:
let average tree =
let rec total tree =
match tree with
| Tree.Leaf x -> ~total:(Thing.price x), ~count:1
| Tree.Nodes arr ->
let totals_and_counts =
Iarray.map arr ~f:(fun subtree -> total subtree)
in
Iarray.fold
totals_and_counts
~init:(~total:0.0, ~count:0)
~f:(fun (~total:total_acc, ~count:count_acc) (~total, ~count) ->
~total:(total +. total_acc), ~count:(count + count_acc))
in
let ~total, ~count = total tree in
total /. (count |> Float.of_int)
;;
Naturally, we now use Iarray.map
to recurse on subnodes and Iarray.fold
to combine the results. To parallelize, we still need the Parallel.Sequence.of_iarray
, but now Parallel.Sequence.fold'
combines the map
and reduce
operations:
let average_par (par : Parallel.t) tree =
let rec (total @ portable) par tree =
match tree with
| Tree.Leaf x -> ~total:(Thing.price x), ~count:1
| Tree.Node arr ->
let seq = Parallel.Sequence.of_iarray arr in
Parallel.Sequence.fold'
par
seq
~f:(fun par subtree -> total par subtree)
~init:(~total:0.0, ~count:0)
~combine:(fun _par (~total, ~count) (~total:total2, ~count:count2) ->
~total:(total +. total2), ~count:(count + count2))
[@nontail]
in
let ~total, ~count = total par tree in
total /. (count |> Float.of_int)
(We use fold'
rather than fold
to get the version that passes par
down into our f
and combine
functions.)
Another approach would be to convert the entire tree into a Parallel.Sequence.t
at once rather than rely on nested parallelism. This is possible using Parallel.Sequence.unfold
, though it does take more work, requiring a definition of an unfolding state with an operation that splits it in half.