### Module identity in modular packages

Update at the bottom!

In the last post I described Haskell’s current, limited notion of module identity. (It was none too exciting.) Now I’ll expand on that notion in our new package language. With a richer, more careful semantics of identity, we grant packages the power of generic reusability!

This post builds on the examples of the last, so I suggest a quick refresher. For reference, the package layout from before is repeated below.

#### Embedding dependency in identity

As I hinted last time, it is rather unfortunate that server-1.0 can only exist as built against a single version of http at any given time. What if we really do need the http-4.1 and http-4.2 instances to exist in the same program? (Perhaps for compatibility reasons.) Then we’d really need two separate identities, and thus two distinct modules, for those server instances. We thus embed the dependency on the Http module into the identity, resulting in the syntactically distinct identities server-1.0:Server(http-4.1:Http) and server-1.0:Server(http-4.2:Http).

This seems unnecessary if we can only instantiate at different version numbers of the same package — here versions 4.1 and 4.2 of the http dependency. Suppose instead that server-1.0 is built against (or more correctly, *checked against*) an abstract interface of HTTP connections. The identity of the server module doesn’t fully exist yet (since it has an abstract dependency), so let’s call it server-1.0:Server(α), where α is the variable standing in for the identity of the final Http module that Server imports.

Now suppose that package A uses a real server implementation (http-4.2) but package B uses a server with only a fake, test implementation of HTTP (mockhttp-3.3). These two packages would thus instantiate the α in the server identity as server-1.0:Server(http-4.2:Http) and server-1.0:Server(mockhttp-3.3:Http) respectively. Both server modules (and both connection types) would coexist in a program that links A and B together because there’s no conflict of identities.

Embedding dependency in module identity allows GHC to make sense of the various instantiations of packages. This is important because now a program might contain multiple instantiations of any given package. And the applicative manner in which module identities are ascribed leads to the natural sharing that one would expect: If A and B both instantiate server-1.0 with http-4.2, then in both packages (and thus the linked result) the identity of the server module is server-1.0:Server(http-4.2:Http).
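To make the applicative flavor concrete, here is a minimal sketch in plain Haskell that models identities as terms and instantiation as substitution. The `Identity` type, the variable name `"alpha"` (standing in for α), and the helper names are my own illustrative assumptions, not part of the package language or of GHC.

```haskell
import Data.List (intercalate)
import qualified Data.Map as Map

-- A module identity is either a variable (an abstract hole such as alpha)
-- or a module of some package applied to the identities it depends on.
data Identity
  = Var String
  | Mod String String [Identity]  -- package, module name, dependencies
  deriving (Eq, Ord, Show)

-- Replace identity variables according to the given environment.
subst :: Map.Map String Identity -> Identity -> Identity
subst env (Var v)          = Map.findWithDefault (Var v) v env
subst env (Mod pkg m deps) = Mod pkg m (map (subst env) deps)

-- Render an identity in the notation used in this post.
render :: Identity -> String
render (Var v)          = v
render (Mod pkg m [])   = pkg ++ ":" ++ m
render (Mod pkg m deps) =
  pkg ++ ":" ++ m ++ "(" ++ intercalate ", " (map render deps) ++ ")"

-- The server module, checked against an abstract Http.
serverAbstract :: Identity
serverAbstract = Mod "server-1.0" "Server" [Var "alpha"]

http42, mockHttp :: Identity
http42   = Mod "http-4.2" "Http" []
mockHttp = Mod "mockhttp-3.3" "Http" []

-- Instantiate the server's hole with a concrete Http identity.
instantiate :: Identity -> Identity
instantiate impl = subst (Map.singleton "alpha" impl) serverAbstract

serverInA, serverInB, serverInC :: Identity
serverInA = instantiate http42    -- package A
serverInB = instantiate mockHttp  -- package B
serverInC = instantiate http42    -- a hypothetical third package

main :: IO ()
main = do
  putStrLn (render serverAbstract)
  putStrLn (render serverInA)
  putStrLn (render serverInB)
  print (serverInA == serverInC)  -- same instantiation, same identity
  print (serverInA == serverInB)  -- different Http, different identity
```

Because identities here are compared structurally, two packages that instantiate the server with the same Http arrive at equal identities without any coordination, which is the sharing described above.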

Last time I mentioned the “dependency hash” that GHC uses to keep track of the instantiations of packages. One might think to incorporate this into module identity as part of the package identity, like server-1.0-abc123:Server. In that case, if we allowed both copies of server-1.0 in the package database at the same time, we might have two modules with identities server-1.0-abc123:Server and server-1.0-321cba:Server. This approach introduces two problems. First, it only works when all modules are concrete: how would the abstract modules (earlier denoted α) be summarized in a hash? Second, the hashing is too “coarse-grained” because it does not expose the dependencies on a per-module basis (or even per-package); all dependencies, no matter how many different packages factor into this “instantiation,” are smushed together into a single, unintelligible hash.

#### All hail the mixin module

One way to grok what I’ve demonstrated is that Hackage/Cabal currently isn’t expressive enough to allow generic package reuse. If you want to reuse a package, you’re stuck with two unfortunate constraints: (1) you can only instantiate/install it with other versions of the packages it depends on (those that satisfy its constraints), and (2) only one particular instantiation/installation may (reliably) exist at any given time.

Think of a current package as a big ML structure and its dependencies as free (module) variables. Once created, you can use its components but you’re stuck with its inflexible form.

Instead, we’d like to treat packages as big mixin modules, where the dependencies are abstract module components that must eventually be instantiated. You can write your package code with respect to some abstract interface for a dependency, and then a client can mix in whatever implementation they desire in place of its abstract holes. And this can be done as many times and with as many implementations as the client desires. (Technically, in the parlance of MixML, this means packages are really units, not just modules. They’re also highly related to normal ML functors.)
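As a loose analogy only, the functor-like reading can be pictured in today's Haskell with a record standing in for the abstract interface and an ordinary function standing in for the parameterized package. This is not how the package language works (there the parameterization happens at the level of modules and imports); the `Http` record, `mkServer`, and both implementations below are hypothetical names of mine.

```haskell
-- The abstract interface the server is written against.
data Http = Http
  { httpName :: String            -- which implementation this is
  , fetch    :: String -> String  -- a stand-in for a real HTTP request
  }

-- The "server package": code written only against the interface.
newtype Server = Server { handle :: String -> String }

-- Filling the hole: the client chooses the implementation.
mkServer :: Http -> Server
mkServer http =
  Server (\path -> "[" ++ httpName http ++ "] " ++ fetch http path)

-- Two implementations of the same interface (both hypothetical).
realHttp, fakeHttp :: Http
realHttp = Http "http-4.2" (\url -> "GET " ++ url)
fakeHttp = Http "mockhttp-3.3" (\_ -> "canned response")

main :: IO ()
main = do
  -- The same server code, instantiated twice, coexisting in one program.
  putStrLn (handle (mkServer realHttp) "/index")
  putStrLn (handle (mkServer fakeHttp) "/index")
```

The point of the analogy is only that `mkServer` can be applied as many times, and to as many implementations, as a client likes; the package language aims to give whole packages that same freedom.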

#### That’s it for today

Imbuing modules with identity in the presence of packages and module abstraction is loaded with subtleties. I’ve presented here the main idea behind our approach with the new Haskell package language. The moral of the story is to re-interpret module-level imports as parameterization instead of definite reference and then to recover a consistent notion of module identity. This allows for generic reuse of package code and emphasizes the package as a programming abstraction rather than merely a chunk of files.

One tricky part of this approach concerns recursive modules — realized in Haskell as cyclical imports. I’ll leave that to a future blog post, partly because I haven’t fully worked it out yet. (Now is a good time to remind the reader that these semantics might change as the project moves forward!)

Update: gasche has asked some good questions in the comments. Instead of posting my response down there, I’m updating this post:

gasche: Are you advising to embed module dependencies inside module identity, or to make modules parametric over what was formerly their dependencies, so that they don’t “depend” on them anymore?

My overloading of the term “dependencies” is getting in the way here, I think. You’ve made a crucial distinction that I’ve (perhaps unnecessarily) concealed so far: among the imports, some resolve to abstract modules (i.e. Haskell signatures) and some resolve to concrete modules (i.e. Haskell modules). The dependencies I’m talking about are entirely the abstract kind, which you’re calling “parameters” (and rightly so); these are the imports that don’t refer to any specific implementation. We’re using the same import syntax to mean both things because (a) that’s already Haskell’s syntax and (b) there’s no reason that the modules should care.

For example, when checking the server package, the import Http resolves to a signature, and I call that a dependency. However, there might be other imports, like import Prelude, that *do* resolve to concrete modules. We can discard those imports from our reasoning about identity since they’re fixed. There’s no way to instantiate Server with a different Prelude, so we just consider that a definite reference that’s baked in. (Note that this means we don’t support any notion of overriding module definitions.) But if the import Prelude instead resolves to a signature, then that too will be one of the dependencies.
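A small sketch of that rule, under my own simplifying assumptions: each import is tagged with how it resolves, and only signature-resolved imports contribute parameters to the identity. The `Resolution` type, the variable names `alpha` and `beta`, and the concrete identity `base:Prelude` are all hypothetical and chosen just for illustration.

```haskell
import Data.List (intercalate)

-- How an import resolves while checking a package.
data Resolution
  = ToSignature String  -- abstract: the String names its identity variable
  | ToModule String     -- concrete: the String is its fixed identity
  deriving (Eq, Show)

-- An import: the imported module name and what it resolves to.
type Import = (String, Resolution)

-- Only imports that resolve to signatures become parameters.
parameters :: [Import] -> [String]
parameters imps = [v | (_, ToSignature v) <- imps]

-- The (possibly still abstract) identity of a module, given its imports.
identity :: String -> String -> [Import] -> String
identity pkg modName imps =
  case parameters imps of
    [] -> pkg ++ ":" ++ modName
    ps -> pkg ++ ":" ++ modName ++ "(" ++ intercalate ", " ps ++ ")"

-- Http is abstract; Prelude is a definite, baked-in reference.
usualImports :: [Import]
usualImports =
  [ ("Http",    ToSignature "alpha")
  , ("Prelude", ToModule "base:Prelude")  -- hypothetical concrete identity
  ]

-- The same imports, but with Prelude resolving to a signature too.
abstractPreludeImports :: [Import]
abstractPreludeImports =
  [ ("Http",    ToSignature "alpha")
  , ("Prelude", ToSignature "beta")
  ]

main :: IO ()
main = do
  putStrLn (identity "server-1.0" "Server" usualImports)
  putStrLn (identity "server-1.0" "Server" abstractPreludeImports)
```

In the first case the concrete Prelude import is simply dropped from the identity; in the second it shows up as an extra parameter.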

gasche: I was also unconvinced by your arguments against the current GHC hash.

Hashing open terms in the manner you’ve suggested didn’t occur to me. It sounds like that might address my first argument.

My second argument isn’t so much about hashes’ general unintelligibility as it is about their concealment of inputs. When hashing a set of module identities, we get a single output; but here we’re concerned with a subset of those module identities and we want to compare the part of the hashed output that concerns only that subset.

Here’s an extension of the above example that illustrates the ineffectiveness of this hashing:

    package http-4.1:
      Http = [Http-4.1.hs]

    package http-4.2:
      Http = [Http-4.2.hs]

    package http-4.1-plus:
      include http-4.1 -- grabs its definition of Http
      AnotherMod = [AnotherMod.hs]

Now imagine instantiating the server-1.0 package with each of these three packages providing the Http module; this generates three completely different dependency hashes, three completely different package IDs, and thus three completely different Server IDs: server-1.0-abc123:Server, server-1.0-321cba:Server, and server-1.0-def456:Server. But why shouldn’t the instantiated Server modules for http-4.1 and http-4.1-plus be the same? After all, the two instantiations will use precisely the same code.

This is what I mean by GHC’s hashing being too coarse-grained. But with our approach the identity of Http within those two packages is the same, so the resulting Server modules will be the same too. Our notion of identity achieves this sharing of code/modules.
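The contrast can be sketched as follows, under loud assumptions: a package is modeled as a map from module names to identities, and `coarseKey` is only a stand-in for a package-level dependency hash (it concatenates everything the provider contains rather than hashing it). Neither function reflects GHC's actual implementation.

```haskell
import Data.List (intercalate)
import qualified Data.Map as Map

type Identity = String

-- A package, modeled as the module identities it makes available.
type Package = Map.Map String Identity

http41, http42, http41plus :: Package
http41 = Map.fromList [("Http", "http-4.1:Http")]
http42 = Map.fromList [("Http", "http-4.2:Http")]
-- "include http-4.1" reuses that package's Http, identity and all.
http41plus =
  Map.union http41
            (Map.fromList [("AnotherMod", "http-4.1-plus:AnotherMod")])

-- Per-module identity: only the Http that Server imports matters.
serverIdentity :: Package -> Maybe Identity
serverIdentity provider = do
  http <- Map.lookup "Http" provider
  pure ("server-1.0:Server(" ++ http ++ ")")

-- Stand-in for a coarse, package-level hash: it summarizes every module
-- of the provider, whether or not Server depends on it.
coarseKey :: Package -> String
coarseKey provider =
  "server-1.0-" ++ intercalate "+" (Map.elems provider) ++ ":Server"

main :: IO ()
main = do
  -- Fine-grained identities coincide for http-4.1 and http-4.1-plus...
  print (serverIdentity http41 == serverIdentity http41plus)
  -- ...while the coarse keys differ, because AnotherMod leaks in.
  print (coarseKey http41 == coarseKey http41plus)
  -- http-4.2 is different under either scheme.
  print (serverIdentity http41 == serverIdentity http42)
```

The extra module in http-4.1-plus changes the coarse key even though Server never imports it, whereas the per-module identity is unaffected.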

gasche: Finally, this seems strongly related to the now-classic Gilad Bracha’s “ban on imports” [part 1] [part 2]