Monday, September 13, 2010

Source code architecture, part II

The first part told the story of the requirements surrounding a packaging mechanism for code chunks. Alex left some helpful comments (on Buzz only, not here). Here, we expand upon what was said in the first part. This should establish a sufficient understanding upon which to build a solution.

Assets


Why talk about assets and code chunks instead of functions (procedures, class members, what have you)? The influence here is completely from literate programming, where one can name fairly arbitrary chunks of code and re-use them. But surely in a language that has been influenced by Haskell (amongst many others), functions would be sufficient? Maybe. If the polymorphism supported by the language is sufficiently rich. And that's the source of my hedging: at this point, I don't want to mix these two issues (organization and polymorphism). But it is an issue to keep in mind: in a strongly typed language, insufficient polymorphism and/or persistent abstractions lead to macros. [A 'persistent' abstraction is an abstraction mechanism over which the programmer has no control, i.e. one the programmer cannot reliably tell the compiler to inline. I'll probably make a post on that in the future.] Macros are not evil per se, but they are a way of "giving up" on a language's intrinsic features as "not powerful enough". I would like to avoid that.
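
To make the macro point concrete, here is a minimal sketch in Haskell (Haskell rather than MathScheme, and the name maxOf is mine): with rich enough polymorphism, a reusable chunk stays an ordinary typed function, and the INLINE pragma shows the kind of programmer control that keeps the abstraction from being 'persistent'.

    -- In C, a reusable 'max' over arbitrary ordered types forces a macro:
    --   #define MAX(a, b) ((a) > (b) ? (a) : (b))
    -- With parametric polymorphism and type classes, the same chunk
    -- stays a first-class, typed function:
    maxOf :: Ord a => a -> a -> a
    maxOf a b = if a > b then a else b

    -- And the abstraction need not be 'persistent': inlining can be
    -- requested explicitly.
    {-# INLINE maxOf #-}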

All this to say that, in theory, all assets should be 'like functions' where external dependencies, both input and output, should be visible and controllable. In the meantime, let's just go for "code chunks".

Naming


There was an implicit assumption that all our chunks are (somehow) named; this can be taken as an additional axiom. Names are not the only 'solution', but they sure are very convenient. For theoretical purposes, as well as for ease of implementation, there are other solutions. The explicit design choice here is to favour human understandability above other measures. So names it is.

Another assumption is that names are expected to be locally meaningful, but not necessarily globally stable. In other words, it is implicitly assumed that we have a renaming mechanism. That is an easy assumption because, at the lower levels, we already have a renaming mechanism; it is used a lot, and it is very handy. So we just extend it to the higher levels too.
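
As a lower-level analogue, Haskell's qualified imports already provide renaming at module granularity; a small sketch (the names ages and lookupAge are purely illustrative):

    import qualified Data.Map as M   -- 'Data.Map' locally renamed to 'M'

    ages :: M.Map String Int
    ages = M.fromList [("Alex", 30)]

    lookupAge :: String -> Maybe Int
    lookupAge name = M.lookup name ages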

Assembly


Because we privilege human understanding (of source code), machine-ready code might require a fair bit of processing to 'build' it. Clearly such processing can be an important part of 'understanding' too. Let us see it another way: whether it is the denotational semantics (of the source code) or the operational semantics (of the assembly code) which is 'complicated', both will present a barrier to understandability. Ideally, the operational semantics of the assembly code would be a straightforward implementation of some kind of hierarchical denotational semantics for the source code.

But we already assume that we 'understand' the MathScheme language. And the language has built-in features dealing with syntax, as well as black boxes. So it is a rather simple leap to regard 'assembly' as an internal rather than an external entity. In other words, as a program, it should be written in the same language.

Context


Why do we name chunks at all? To be able to conveniently refer to them. Why would we want to do that? To be able to (re)use them. The simplest situation is when we name a function and just want to call it. In the situation at hand, what we really want to do is to create a context, aka a set of definitions, within which the code we are currently reading/writing 'makes sense'.
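
For concreteness, here is one hypothetical way to model that, sketched in Haskell (all names here are mine, not MathScheme's): a context is just a finite map from names to definitions, and code 'makes sense' when every free name resolves.

    import qualified Data.Map as M

    type Name = String
    newtype Def = Def String          -- stand-in for a real definition body

    -- A context is simply a finite map from names to definitions.
    type Context = M.Map Name Def

    -- Code 'makes sense' in a context when every free name resolves.
    makesSense :: [Name] -> Context -> Bool
    makesSense freeNames ctx = all (`M.member` ctx) freeNames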

Ideally, everything would be done by reference, so as to minimize actual duplication. In practice, we sometimes do want duplication. But that will be considered a separate problem - though it needs to be 'part of' the design.

Thursday, September 02, 2010

Source code architecture, part I

As MathScheme grows, various scalability issues need to be solved. Various solutions leap to mind, coupled with words such as packages, libraries, and modules. But are these actually solutions to the problems we are encountering? Which leads to this post: what are the problems that need a solution? This is a requirements issue first and foremost; we shouldn't jump into designs and solutions until that's been properly figured out. But let's walk through the discovery process rather than jump ahead too far. [Be prepared for a long tortuous story, but hopefully there will be a happy ending.]

The issue first popped up when we were doing some data-structure development (well, theories of types of data-structure, to be precise), and we wanted to separate that out from the base theories that cover algebraic structures. But data-structures very quickly leverage algebraic structure, so we needed to build our second library 'on top of' the first. The obvious thing to do is to implement some kind of import mechanism. But something naive really doesn't scale: C's #include mechanism, for example, is a complete nightmare for scalability and maintenance.

A saner method is something like Haskell's import (or Java's package for that matter). This provides namespace support, and so allows you to segregate your code into discrete chunks and still be able to use it seamlessly. But this isn't perfect either: with a large enough set of imports, one no longer knows where any particular function came from. This is made considerably worse if the import mechanism silently does shadowing. There are two refinements of this idea: explicit imports and qualified imports. Explicit imports require you to name all the functions you want, while qualified imports let you give a "short name" (or alias) to the whole package. But both of these have problems as well: explicit import lists can grow to be very large, and no matter how "short", these aliases clutter the resulting code in sometimes rather unaesthetic ways. Furthermore, a lot of these solutions are file-based, in that the name of the 'chunk' of code must somehow correspond to the name of the file it is in, which comes with all sorts of problems of its own (as file names are really not portable from OS to OS, with directory names being even worse).
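
For concreteness, here is what the two refinements look like in Haskell (a small, self-contained sketch; the definitions themselves are just for illustration):

    -- Explicit import list: every entity is named at the top, so
    -- provenance is obvious, but the list can grow very large.
    import Data.Map (Map, empty, insert)

    -- Qualified import: one short alias covers the whole module, at
    -- the cost of the 'S.' clutter at every use site.
    import qualified Data.Set as S

    squares :: Map Int Int
    squares = insert 3 9 (insert 2 4 empty)

    evens :: S.Set Int
    evens = S.fromList [0, 2, 4]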

At this point, two further solution ideas pop into mind. [I don't really know where my education has failed me, but so much emphasis on problem-solving has led to an instinct of jumping from potential solution to potential solution. A better process would have been to focus much earlier on refining the exact problem statement to be solved, but this step is something which I must consciously remind myself to do, rather than the unconscious coming-up-with-solutions.] In any case, these are literate programming and URIs. But since these are again in solution-space, let's not explore them now. It is at this point that it became rather clear that the thinking process was solution-driven, and what was needed was some problem-space thinking.

So what is the real problem? We have an asset (code), which we want to use in many different ways. The code base will grow to be large enough that it needs to be organized. But some methods of organization are too obtrusive, and are a larger inconvenience than the problem they solve (imagine if all your functions had ~100 character names and you had to type them out in full everywhere). The instinct is then to come up with a reasonable compromise between order and convenience.

But that's not necessarily a good solution either! The point is, we want to use our code in many different ways. Crucial point: different tasks require different levels of order and convenience. Which means that we really need to understand what the different tasks are. And what the background axioms are. 

(My) Axioms:
  • There will be many people wanting to develop the code base simultaneously.
  • There is no one organization which will fit all uses.
  • When we name and use an entity, it should be (relatively) easy to pinpoint where its definition is. [Yes, ad hoc overloading and AOP, I'm looking at you.]
  • Complex resolution rules are not 'easy'.
  • Tying the organization mechanism to concepts from the underlying operating system is a bad idea.
  • Local code should be clutter-free.
  • Global and glue code should be precise.
  • Source code is primarily intended for humans to read and understand.
  • Source code may require significant processing before it is machine-ready.
  • Names of entities should not be artificially restricted (like global uniqueness).
  • There are different scales for a code base.
Some explanations are necessary: local code is code which, while defined within a context, can largely be understood on its own; low-level libraries are obvious examples. Global (and glue) code is code which uses a large context for its work (it pulls in concepts from a lot of different places). And what is a context? It is the set of all definitions necessary for the current code to 'make sense'; it is not necessary to recurse through these definitions and expand them as well. And scale is really a synonym for the size of a context, which is purely the number of definitions (rather than their complexity).

There are also some clear tensions: if we use the same name in two different parts of the source code and then try to use these two pieces simultaneously, we have a problem. Our axioms tell us that the system should not artificially force us to restrict names (so as to avoid this problem). Of course, if these two pieces are meant to be used together, then it is good design to avoid needless clashes - but that is a human design decision. If this naming overlap makes sense, then there needs to be a mechanism to allow for precision; any kind of implicit resolution mechanism which relies on subtle ordering properties of either the source code or (worse) the assembly process hampers understanding too much to be viable. Some restrictions on names of entities are natural: when coding 'in the small', clearly one wants names to be unique -- but there should be no restriction that forces global uniqueness. There are some names which are just very natural in certain contexts, and it ought to be possible to use such a name, at least locally. Shadowing, for example, seems like a bad solution - it goes against ease of resolution.
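
For illustration, the Haskell version of such a 'mechanism for precision' is qualification; a sketch using two standard modules that happen to share a name:

    import qualified Data.List as List
    import qualified Data.Map  as Map

    -- Both modules naturally export an 'insert'. Qualification keeps
    -- both available, and resolution stays trivially traceable; no
    -- shadowing or ordering rules are involved.
    both :: ([Int], Map.Map Int String)
    both = (List.insert 3 [1, 2, 4], Map.insert 1 "one" Map.empty)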

A number of tasks were embedded (or at least inferable) in those axioms:
  • Write clutter-free local code.
  • Establish a context.
  • Write global code.
  • Trace where definitions come from.
  • Assemble 'code' from 'source code' aimed at humans.
  • Organize 'source code' so that it can be understood by humans.
  • Organize 'source code' so that it can be assembled by programs.
What I see as important is not so much how things are organized, but rather how one establishes a context. In other words, how do you name chunks of code so that later one can refer to these chunks as the context in which to work? And it seems to all come down to a matter of scale: small contexts are easily understood, large contexts are confusing.

Of course, one can use organization to deal with this problem: assemble chunks of source code into modules and packages. If each of these exposes only a small number of 'functions', then a context made up of a (small) number of modules/packages should be reasonably easy to understand. Wishful thinking, you say? Agreed.
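
In Haskell terms, that compromise looks like a module with a small export list; a sketch (the Stack module is mine, purely for illustration):

    module Stack (Stack, empty, push, pop) where

    -- Only the abstract type and three operations are exported; the
    -- representation stays hidden, so the context a client must
    -- understand is exactly these four names.
    newtype Stack a = Stack [a]

    empty :: Stack a
    empty = Stack []

    push :: a -> Stack a -> Stack a
    push x (Stack xs) = Stack (x : xs)

    pop :: Stack a -> Maybe (a, Stack a)
    pop (Stack [])       = Nothing
    pop (Stack (x : xs)) = Just (x, Stack xs)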

In the next part, analysis of the above will be continued.  A solution may even emerge.  More likely, the tasks will need to be refined, and new ones will arise, requiring further analysis.