Thursday, September 02, 2010

Source code architecture, part I

As MathScheme grows, various scalability issues need to be solved. Various solutions leap to mind, coupled with words such as packages, libraries, and modules. But are these actually solutions to the problems we are encountering? Which leads to this post: what are the problems that need a solution? This is a requirements issue first and foremost; we shouldn't jump into designs and solutions until that's been properly figured out. But let's walk through the discovery process rather than jump ahead too far. [Be prepared for a long tortuous story, but hopefully there will be a happy ending.]

The issue first popped up when we were doing some data-structure development (well, theories of types of data-structure, to be precise), and we wanted to separate that out from the base theories that cover algebraic structures. But data-structures very quickly leverage algebraic structure, so that we needed to build our second library 'on top of' the first. The obvious thing to do is to implement some kind of import mechanism. But something naive really doesn't scale: C's #include mechanism, for example, is a complete nightmare for scalability and maintenance.

A saner method is something like Haskell's import (or Java's package for that matter). This provides namespace support, and so allows you to segregate your code into discrete chunks and still be able to use it seamlessly. But this isn't perfect either: with a large enough set of imports, one no longer knows where any particular function came from. This is made considerably worse if the import mechanism silently does shadowing. There are two refinements of this idea: explicit imports and qualified imports. Explicit imports require you to name all the functions you want, while qualified imports let you give a "short name" (or alias) to the whole package. But both of these have problems as well: explicit import lists can grow to be very large, and now matter how "short", these alises clutter the resulting code in sometimes rather unaesthetic ways. Furthermore, a lot of these solutions are file-based, in that the name of the 'chunk' of code must somehow correspond to the name of the file it is in, which also comes with all sorts of problems of its own (as file names are really not portable from OS to OS, with directory names being even worse).

At this point, two further solutions ideas pop into mind. [I don't really know where my education has failed me, but so much emphasis on problem-solving has lead to an instinct of jumping from potential solution to potential solution. A better process would have been to focus much earlier on refining the exact problem statement to be solved, but this step is something which I must consciously remind myself to do, rather than the unconscious coming-up-with-solutions.] In any case, these are literate programming and URIs. But since these are again in solution-space, let's not explore these now. It is at this point that it became rather clear that the thinking process was solution-driven, and what was needed was some problem-space thinking.

So what is the real problem? We have an asset (code), which we want to use in many different ways. The code base will grow to be large enough that it needs to be organized. But some methods of organization are too obtrusive, and are a larger inconvenience than the problem they solve (imagine if all your functions had ~100 character names and you had to type them out in full everywhere). The instinct is then to come up with a reasonable compromise between order and convenience.

But that's not necessarily a good solution either! The point is, we want to use our code in many different ways. Crucial point: different tasks require different levels of order and convenience. Which means that we really need to understand what the different tasks are. And what the background axioms are. 

(My) Axioms:
  • There will be many people wanting to develop the code base simultaneously.
  • There is no one organization which will fit all uses.
  • When we name and use an entity, it should be (relatively) easy to pinpoint where its definition is. [Yes, ad hoc overloading and AOP, I'm looking at you.]
  • Complex resolution rules are not 'easy'.
  • Tying the organization mechanism to concepts from the underlying operating system is a bad idea.
  • local code should be clutter-free.
  • global and glue code should be precise.
  • source code is primarily intended for humans to read and understand.
  • source code may require significant processing before it is machine-ready.
  • Names of entities should not be artificially restricted (like global uniqueness)
  • There are different scales for a code base.
Some explanations are necessary: local code is code which, while defined within a context, understanding that code is largely a local affair; low-level libraries are obvious examples.  Global (and glue)  code is code which uses a large context for its work (pulls in concepts from a lot of different places).  And what is a context?  It is the set of all definitions necessary for the current code to 'make sense'; it is not necessary to recurse through these definitions and expand them as well.  And scale is really a synonym for the size of a context, which is purely the number of definitions (rather than their complexity).

There are also some clear tensions: if we use the same name in two different parts of the source code and then try to use these two pieces simultaneously, we have a problem.  Our axioms tell us that the system should not artificially force us to restrict names (so as to avoid this problem).  Of course, if these two pieces are meant to be used together, then it is good design to avoid needless clashes - but that is a human design decision.  If this naming overlap makes sense, then there needs to be a mechanism to allow for precision; any kind of implicit resolution mechanism which relies on subtle ordering properties of either the source code or (worse) the assembly process hamper understanding too much to be viable.  Some restrictions on names of entities are natural: when coding 'in the small', clearly one wants names to be unique -- but there should be no restriction that forces global uniqueness.  There are some names which are just very natural in certain contexts, and it ought to be possible to use that name, at least locally.  For example shadowing seems like a bad solution - it goes against ease of resolution.

A number of tasks were embedded (or at least inferable) in those axioms:
  • Write clutter-free local code.
  • Establish a context.
  • Write global code.
  • Trace where definitions come from.
  • Assemble 'code' from 'source code' aimed at humans.
  • Organize 'source code' so that it can be understood by humans.
  • Organize 'source code' so that it can be assembled by programs.
What I see as important is not so much how things are organized, but rather how does one establish a context.  In other words, how do you name chunks of code so that later one can refer to these chunks as the context in which to work?  And it seems to all come down to a matter of scale: small contexts are easily understood, large contexts are confusing.

Of course, one can use organization to deal with this problem: assemble chunks of source code into modules and packages.  If each of these expose only a small number of 'functions', then a context made up of a (small) number of modules/packages should be reasonably easy to understand.  Wishful thinking you say?  Agreed.

In the next part, analysis of the above will be continued.  A solution may even emerge.  More likely, the tasks will need to be refined, and new ones will arise, requiring further analysis.

1 comment:

... said...

The idea of ``naming chunks of code'' is easily realised by Emacs' org-mode. Therein one programs in a literate style and one gives an actual name to a code block which is then referenced for its value, or output, in other code blocks. Perhaps such names could be used to extract the desired contexts?