A Unit of Analogy

The Urbit Has Landed

At long last, Urbit has landed. The Internet is bewildered, flummoxed, and intrigued.

I encountered Urbit (and Moldbug) when the original Moron Lab posts dropped on HN. Unless it was through LoperOS, who incidentally might want to get his tail moving. Either way, I wasn't programming computers then, and I am now, and that's not a coincidence.

See this one time, I had a vision. I'm sure my dimethyltryptamine levels were elevated, since you can't have visions (or dream) without that being the case. It would be impossible to explain, and has proven very difficult to draw. But what it told me is that the Kabbalah, specifically the Tree of Life, holds the key to a Kabbalistic Computer.

So reading the Nock spec was a revelation, because you can map it to the Tree of Life. I haven't been able to suss out whether this was cgy's intention, he's cagey as a mockingbird. He's also a fellow Yid whose passion for the classics exceeds mine by a full order: it's improbable in the extreme that this is coincidence.

This whole post will be steeped in Kabbalah. If you think Hoon semantics are weird, watch out. At least I'm not making this stuff up.

A Grove of Pomegranates

Kabbalah, and Hermeticism in general, is not religion. Nor is it science. To call it philosophy is merely to acknowledge that it is too old and crufty to be more specifically typed.

It has a non-trivial relationship to actual mathematics. By far the best modern work of Hermetics is Christopher Alexander's "Nature of Order", though I haven't a clue if Mr. Alexander would consider it to be such a thing.

I mean non-trivial exactly: Hermetics cannot be reduced to mathematics, nor vice versa. Where there is no mapping, one or several await discovery.

Hermetics means "that which is proper to Hermes", which should not be misunderstood as a religious sentiment. We might call engineering Vulcanics by the same trick (I consider interchangeable use of Latin and Greek to be strength, not sin); the Rod of Asklepios means medicine and implies no belief whatsoever in Olympian deity.

You will find no Wikipedia article on Hermetics, though Hermeticism is covered. We pick our isms carefully over at Unit of Analogy, and aren't signing up for this one. There is considerable sympathy.

The Pardes ha'Rimonim, the grove of pomegranates, is a common image in Kabbalah, which itself means "the received". One will find a QBLH desk at every hotel in Israel.

The Tree of Life (in orthodoxy) is an arrangement of the numbers 1 through 10. Each of these Sephiroth (the singular is Sephirah, the name for a number treated in this fashion) is fractal, containing an entire Tree of Life within it. Furthermore, 1 is 10, and there are four levels which repeat between the Ultimate and the Real: Atziluth, Briah, Yetzirah, and Assiah. The manifest is not even Assiah; it is merely our picture of Assiah.

Gibberish? No, Jargon. You will find the identical scheme in Plato and in Tantric thought. That's likely to be cribbing, not parallel invention. From whom? Good question.

I will be using jargon, and worse, translating it on the fly into a mapping I completely made up. Can't be helped.

As a statement about ontology, let's set it aside for today. According to the elves, it's a diagram of the network layers. How do we go from the conceptual (we have an urge to calculate) to the realized (here is a machine which does so)?

Any which way we want to, of course! What Would Ari Do?


Look, you already have a spaceship written in the language of horse-headed beasts which eat Yahoos.

I'm taking it for granted that you can handle a little colorful metaphor.

Layer Zero

The zero layer is of course physics. Let's try to stick with things that work on that substrate.

Layer One

This is Atziluth. It is our calculation represented literally, as a calculation. In a word, Nock.

Nock is a work of praeternatural brilliance. I will save all critique for a later post. Let's pretend it's perfect, as indeed it is.

Let's note, though, that to get Layer One running on Layer Zero (our chip) requires a bit of cheating from Layer Three. Can this be avoided? Perhaps.

Layer Two

Layer Two is missing from the Urbit stack, or rather, it is conflated with Layer Three. I see how that happened. It's tempting. It may even be fixable from within the existing structure.

Let me break down how, in ol' QBLH, something might travel down this ladder. There's some primal Thirst that is the same thirst no matter what experiences it, anywhere in Universe. So goes the premise. That's Atziluth. Briah is where an I forms an urge that is a personal thirst. Yetzirah is where this coalesces into an action; in Assiah the action is actually taken, and then the individual experiences a theatrical performance of the act of drinking water, assembled by his or her cerebral cortex.

Is that last point at all unsettling to you, dear Reader? I should hope not. It is on the firmest of ground, however shaky the rest may be.

That's the metaphor for our network stack. The Briatic layer is specification of form. In perfection it would be purely declarative, and the form it should take is ancient and not open to debate: it is, should be, ultimately must be, a grammar.

Not a powerful set of gonads. A grammar. GLL is totally a thing, and for the first time we can make performant grammars that are as expressive as Pāṇini's.

Here's some Hoon:

        =+  yug=(yell q.p.lot)
        =>  ^+(. .(rex ?~(f.yug rex ['.' (s-co f.yug)])))
        :-  '~'
        ?:  &(=(0 d.yug) =(0 m.yug) =(0 h.yug) =(0 s.yug))
          ['.' 's' '0' rex]
        =>  ^+(. ?:(=(0 s.yug) . .(rex ['.' 's' (a-co s.yug)])))
        =>  ^+(. ?:(=(0 m.yug) . .(rex ['.' 'm' (a-co m.yug)])))
        =>  ^+(. ?:(=(0 h.yug) . .(rex ['.' 'h' (a-co h.yug)])))
        =>  ^+(. ?:(=(0 d.yug) . .(rex ['.' 'd' (a-co d.yug)])))

This, I am told, specifies the syntax of a number; judging by the d, h, m, and s fields of yug, what it actually renders is a span of days, hours, minutes, and seconds. Or some of it does.

This is from the JSON spec:

    number
        int
        int frac
        int exp
        int frac exp
    int
        digit
        digit1-9 digits
        - digit
        - digit1-9 digits
    frac
        . digits
    exp
        e digits
    digits
        digit
        digit digits

The former is executable, the latter is admirably clear. These advantages can be combined profitably.
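Here is one way they might be combined, sketched in Clojure with Instaparse (my own construction, taken from neither the Urbit nor the JSON projects): the json.org number productions as a grammar that is both readable and runnable.

```clojure
(require '[instaparse.core :as insta])

;; The JSON number productions, transcribed into an Instaparse grammar.
;; About as clear as the spec's own listing, and it executes.
(def json-number
  (insta/parser
   "number = '-'? int frac? exp?
    int    = '0' | #'[1-9][0-9]*'
    frac   = '.' #'[0-9]+'
    exp    = ('e' | 'E') ('+' | '-')? #'[0-9]+'"))

;; (json-number "-12.5e3") returns a labeled parse tree,
;; with :int, :frac and :exp nodes.
```

The grammar string is the spec; the parser falls out of it for free. That is the whole argument of this section in nine lines.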

This is not even a critique of Hoon the language, because we haven't gotten to Layer Three at all. This is an assertion that Hoon is poorly suited to specifying any data format which may be expected to be used by anything but the Urbiverse. I consider that a deficit.

Thing is, I'm pretty sure those gonads can be whipped into a nice powerful GLL for parsing binary data. Hoon is not leading the pack as a choice for the first implementation; that's a simple matter of documentation, namely the lack of it.

Layer Three

Layer Three is the Executive layer, wherein we get to specify what we want our machine to do. Generally that's a programming language of some sort.

Here the difference in approach becomes clear. From cgy's perspective, Urbit's Layer Three is written in Hoon. From my perspective, Urbit's L3 is written in Hoon, C, Nock and Markdown.

That is because humans execute, not machines. Machines don't rush to the wall and plug in when they're low on juice, at least not yet. For a calculation to happen, a person (I can only introduce you to human persons, but let's not be prejudiced) must know that they want to calculate, and must know how to do it.

I defy you to do anything at all with only Hoon. Without any reference to Urbit's Markdown files, or rather, their conveniently compiled HTML derivatives. Hell, I can't do anything with Hoon yet, even with existing documentation. That can only improve and is no criticism at all at this stage of the project.


I could like Hoon. I want to like Hoon. I cannot seriously credit the idea of One Language to Rule Them All. If I could, it would not look even vaguely like either Perl or APL. No offense to the Admiral.

If I could credit the idea of the UrTongue, it would clearly need to be a format capable of cleanly and usefully embedding any existing or contemplated programming language.

I would strongly recommend to anyone considering designing a new language at the present time: The sequence \n``` is utterly reserved and I will come down on you like a ton of tiny Internet bricks if you tamper with that convention. I suspect the present dominance of github is sufficient motive to keep it real.

This leads to a couple important questions: Will Hoon, presuming decent tutorials and documentation, prove a pleasant systems language? More specifically and urgently, will it be pleasurable to write parsers and compilers in?

I am sold on one aspect of Hoon: as the Urbit core and bootstrap sequence. Why? Because it's there, brah.

I think cgy gets this. The deal with Unix is clear: you can have any language you want, as long as it's C. I hope the same bargain between Urbit and Hoon will prove to be the upgrade we all want.

Again, having a nice tight Layer Two spec would make this all the more likely.

Layer Four

You can't execute without an environment, which is fundamentally about data in aggregate. That's Urbit, which is fantastic. The surface area is a set of rules on strings that produces a "directory" and you should just read the docs because they're pretty good. It's URL safe, which is nifty.

I am cheerfully unclear on how any of that operates under the hood. I have notions of how it should work, but no way to contrast that to how it does work. It appears to work, in that pre-alpha-software way. I'd wager the problems we're seeing right now aren't design-level.

This is the Assiah layer, which is the world you actually wander around in when you go get your drink of water. If anyone is still keeping track.

Okay. That was arcane. Your point?

Here's a helpful table:

Kabbalah   Urbit    Arc
--------   ------   ------------
Atziluth   Nock     AX
Briah      Gonads   GGF
Yetzirah   Hoon     Marmalade
Assiah     Urbit    ArcOS/Arcive


I'm sure that made everything much clearer!

The names on the right are referents without value in the present, unless you count a bunch of Markdown and a partially specified grammar-description language. From my modest perspective, that's code, since I can compile it; but it does nothing but inform, and even in that capacity it is not ready for public consumption.

The Arc doesn't exist and needn't be written if Urbit will serve. By the very nature of the name, it's a huge, ballsy target. It's utterly vaporous, though I hope to release the first tool in the chain before the end of the year. That would be Marmalade, the literate Markdown dialect. The first metacircular compiler of Marmalade is in Clojure, but the lovely thing about GitHub Flavored Markdown is that one may embed any number of languages in it. Indeed, that is rather the point.

In following posts I'll go over the cake, layer by layer, with less attention to Urbit and more to a hand-rolled, idiosyncratic take on the same domain. I'll remind the Reader that there's no substitute for working code, which cgy haz and I haz not.

In the meantime, Urbit is here, utterly fascinating, and on the verge of working. Come check out #urbit on freenode, and join in the madness.

On Decimation

I commonly see the word "decimate" used quantitatively, to mean "to reduce by 90%".

So commonly, in fact, that I consider it correct.

Yes, yes. In Latin, it means "to reduce by 10%", which doesn't sound very scary unless you're a Centurion awaiting the drawing of lots.

That's Latin. In English, having a word for "knocked down by an order of magnitude" is useful. We have, fortunately, not retained decimation as a form of punishment. "to reduce by 90%" is also closer in feeling to the qualitative meaning, "to desolate emotionally".

I was prompted to write this when someone described traffic as "literally decimated" once occupancy of a highway falls to a certain number. It occurred to me that he or she was right to do so. Your usage may vary.

A Parenthetical Observation

Everyone, when they encounter Lisp for the first time, has trouble with the parentheses. It's a rite of passage.

Everyone who goes on to learn Lisp develops a very different attitude. I'd like to share an observation that helps understand why.

I write a fair amount of Lisp, in a few dialects. I could get that work done without a ")" key on my keyboard.

Lisp means never having to close your forms. The delimiters on the left are for you to read, the ones on the right are just a period.

I find it interesting that anathema accumulates around code barriers, rather than internal syntax. Semicolons, significant whitespace, brace placement, and parentheses have all been the subject of holy war, and one may easily find developers who simply refuse to work with certain separation styles unless pressed. At the moment I can't account for this observation, but I'd like to note that I've never seen a holy war over, say, := vs = vs <-. Perhaps I'm not on the right mailing lists.

On the other hand, capitalization and underscore vs. slash vs. nothing are certainly capable of eating up arbitrary amounts of bad blood. Never underestimate the geek's capacity to bikeshed.

Why I Am Lisp 2

First things first: Lisp-1 and Lisp-2 are the worst sort of jargon, because they mean exactly nothing without explanation. Renaming them would be even worse.

I've done this kind of bad abstracting before, in public even. Be careful when naming things. Nouns are sticky.

For those who don't speak Boston Yiddish, Lisp-1 means that variables and functions share a single namespace. In Lisp-2, they do not. Simple as that.

Of the Big Four, Clojure and Scheme are Lisp-1. Common Lisp and Emacs Lisp are Lisp-2. Yes, elisp is at least as important as the other three.

Having worked at this point with three of the four (Scheme excepted), I prefer Lisp-2. One simple reason: a function is linguistically a verb. If we don't give "variable" uses of a symbol their own namespace, we can't use the same symbol as both noun and verb. This is somewhat uncomfortable in practice.

Ultimately, I feel that this problem arises from a peculiarity of English: the standard way to verb a noun involves no change whatsoever to the noun, and the imperative, which is how we command computers, uses exactly that bare form. If we have a list, we list it. There are hundreds of valid verbs which are letter-identical to a corresponding noun, and it is increasingly easy to create more. Thanks, Joss Whedon! No, seriously, thanks; it fits English like a glove.

Contrast this with a language like Spanish, where verbs that cannot be distinguished by spelling from a corresponding noun are quite rare. It's a shame computer language design is so heavily influenced by English. A Spanish programming language would be a pleasure to work with, and I can only begin to imagine the sophistication possible in an Arabic- or Hebrew-based programming language that makes clever use of the Semitic roots and mutations.

Meanwhile, back on this planet, we benefit from separate noun and verb slots in a symbol. funcall is a small price to pay.
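The collision fits in two lines of Clojure, a Lisp-1, since that's the language at hand:

```clojure
;; In a Lisp-1, binding the noun shadows the verb:
(let [list [10 20 30]]   ; `list` the noun, a local binding
  (list 1))              ; => 20: this indexes the vector; it does not
                         ;    call clojure.core/list

;; In a Lisp-2 such as Common Lisp, (let ((list '(10 20 30))) (list 1))
;; still calls the LIST function, because variables and functions live
;; in separate namespaces; funcall selects the variable cell explicitly.
```

Inside the `let`, the verb sense of `list` is simply gone; you'd have to reach for `clojure.core/list` by its full name.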

In Which We Build Zeus

Athena is our weaver. This is her source file.

As we are writing a weaver, it happens that we do not have one. This file must perforce be hand woven until Athena may take over. Asking a Goddess to take over any process should not, and shall not, be done casually.

Athena is written in GitHub Flavored Markdown, a format designed around code sharing. The executable parts are written in Clojure, using the Instaparse parsing library.


Hail, fleet footed Hermes, beloved of Athena!

Hail, Pallas Athene! Hear the ancient words:

I begin to sing of Pallas Athena, the glorious Goddess, bright-eyed,  
inventive, unbending of heart,  
pure virgin, saviour of cities,  
courageous, Tritogeneia. Wise Zeus himself bare her  
from his awful head, arrayed in warlike arms  
of flashing gold, and awe seized all the gods as they gazed.  
But Athena sprang quickly from the immortal head 
and stood before Zeus who holds the aegis,  
shaking a sharp spear: great Olympus began to reel horribly 
at the might of the bright-eyed Goddess, 
and earth round about cried fearfully,  
and the sea was moved and tossed with dark waves,  
while foam burst forth suddenly:  
the bright Son of Hyperion stopped his swift-footed horses a long while, 
      until the maiden Pallas Athena had stripped the heavenly armour 
      from her immortal shoulders.  
And wise Zeus was glad. 

And so hail to you, daughter of Zeus who holds the aegis!

Now I will remember you and another song as well.


To bootstrap Athena, we write a restricted program. It does not weave, so much as extract and concatenate code.

We then write more Markdown that specifies a macro format, also in this restricted format. We use our first weaver to weave both generations of the project into Athena, which will then be more broadly useful.

This first weaver will be known as zeus. zeus is, of course, that from which Athena will spring full-born.

Clojure projects are typically generated with Leiningen, and Athena is no exception. Leiningen projects are specified in a root directory file called project.clj.

This is project.clj:

(defproject athena "0.1.0-SNAPSHOT"
  :description "Athena: a Weaver of Code"
  :url ""
  :license {:name "BSD 2-Clause License"
            :url "http://"}
  :dependencies [[org.clojure/clojure "1.4.0"]])

In order to weave code, in general, we need a macro format. This may be made as flexible as necessary. The minimal requirement is the ability to specify a macro name, and expand those macros into files.

This is weaving in its essence.

The above code contains no macro, yet. Writing macros in a Lisp is of course pleasurable and powerful, and Clojure is no exception, though it pointedly omits user-definable reader macros. Ours will be textual in any case, expanded before the Clojure reader ever sees the code. Soon, we will define one.

First, however, we need a parser that can extract our code. For that, we need to add Instaparse.

Time to fire up Catnip real quick. Be back soon!

What we're doing next is adding Instaparse to our project. To do this, we have to tell Leiningen to grab Instaparse, which we do from project.clj: we add [instaparse "1.2.2"] to the :dependencies vector. (The per-user config at ~/.lein/profiles.clj is for plugins, not project dependencies.)

That done, start or restart your project in Catnip, Emacs, or however you like to do it. You must launch with lein, which is totally conventional.

This being a bootstrap, we will need to resort to some custom syntax in our Markdown. As we extract the source, we will encounter various @magic words@, which the parser will do various things with. The ones in this paragraph, for example, it will ignore. The recognition sequence is `@ to begin a magic word, and @` to end one.

These aren't macros. As you can see, they remain in the source code, and don't modify it.

Adding [instaparse "1.2.2"] to our project.clj gives us this:

@/marmion/athena/project.clj@ -> where we find it, natch

(defproject athena "0.1.0-SNAPSHOT"
  :description "Athena: a Weaver of Code"
  :url ""
  :license {:name "BSD 2-Clause License"
            :url "http://"}
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [instaparse "1.2.2"]])

Which should compile.

Lein provides us with the following template in @/marmion/athena/src/athena/core.clj@:

(ns athena.core
    (:require [instaparse.core :as insta]))

(defn foo
  "I don't do a whole lot."
  [x]
  (println x "Hello, World!"))

We added instaparse.core there. This should compile too.

We now have a powerful and general GLL parser at our disposal. Yippie!

Let's do something with it!

How about we open up our source file, and see if we can produce a quine of our existing file and directory structure?

Leiningen provides us with a test, @/marmion/athena/src/athena/core_test.clj@. It begins life looking like this:

(ns athena.core-test
  (:use clojure.test
        athena.core))

(deftest a-test
  (testing "FIXME, I fail."
    (is (= 0 1))))

We will leave it alone for now. Eventually, we will want to test our quine against the code as it was written.

For the same reason, we will leave the function foo in the namespace. Nothing will be deleted or modified, and the order in which code is introduced is the order into which it will be woven. This is a bootstrap, after all.

Instaparse has its own format, which could be specified as a string within the .clj file. We prefer to put the grammar in its own file, @/marmion/athena/zeus.grammar@, which we start like this:

(* A Grammar for Zeus, Father of Athena, A Literate Weaver *)

Our first rule is top level. The markdown may be separated into that which is kept, that which is ignored, and that which is magic.

In Instaparse, that looks something like this:

zeus-program = (magic | code | <markdown>) *

What this says is that a zeus program is any combination of magic, code, and markdown. Since Zeus does nothing with the markdown, we use angle brackets to tell Instaparse that we don't care to see the output.

We'll define code next:

code =  <"`" "`" "`"> code-type code-block+ <"`" "`" "`">
   | <"`" "`" "`"> "\n" code-block+ <"`" "`" "`">

code-type = "clojure" | "text" ;

<code-block> = #'[^`]+' | in-line-code ;

<in-line-code> = !("`" "`" "`") ("`"|"``");

Which will suffice to capture our quine.

Please note: we could use a more direct way to capture three `, if we weren't writing a peculiar quine. Zeus uses the simplest possible grammar to extract a minimalist weaver from this very source file.

A couple notes: code-block is mostly a regular expression that doesn't consume backticks. in-line-code uses negative lookahead !, which is great stuff: it says, if there aren't three backticks ahead of us, you may match one or two of those backticks.

Between them, they match everything except three backticks. Real Markdown uses newlines and triple backticks together. This is harder to write and understand, so we'll do it in the second pass.
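As a quick sanity check, one can build a parser from just these rules and aim it at a fenced block. This is a sketch of my own, not part of the bootstrap: `:start` is an Instaparse option for choosing the start rule, and I'm assuming the rules so far live in zeus.grammar.

```clojure
(def code-parser
  (insta/parser (slurp "zeus.grammar") :start :code))

;; A fenced Clojure block should come back as a :code node
;; carrying its :code-type.
(code-parser (str "```" "clojure\n(+ 1 2)\n" "```"))
```

If the grammar is right, the backtick fences vanish (they're angle-bracketed away) and only the code-type and the block's text survive in the tree.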

We also need magic:

<magic> = <"`@"> magic-word <"@`"> ;

magic-word = #'[^@]+' ;

Which is defined fairly carefully to consume our magic words. We don't use the at-sign elsewhere in the outer Markdown to enable easy magic.

That leaves markdown, which is perhaps loosely named, since the code blocks are Markdown also. For Zeus, we may as well call it junk; we have to match it, but we don't look at it. It looks like this:

markdown = #'[^`@]+' | in-line-code;

We're done! We now have a grammar that we can make into a parser, so let's do it: we need to add more code to @/marmion/athena/src/athena/core.clj@.

(def zeus-parser (insta/parser (slurp "zeus.grammar")))

That was simple enough. It disguises the toil of repeatedly writing bad, useless and exponentially explosive grammars.

But then, literature generally hides the messiness behind its production. If you have read the unedited Stranger in a Strange Land, which Heinlein never wanted published, you can see why. Presuming you've read the edited version.

Now, we use zeus-parser to parse this document:

(def parsed-athena (zeus-parser (slurp "")))

When we run core.clj in a REPL, we see that parsed-athena is a tree-structure containing our magic words and code. We've designed this puzzle so that we can use this sorted information in the order we found it, so we don't need the tree structure.

To simply get rid of the tree structure we would flatten it. But this would leave us in a bad way, because some of our code blocks aren't globbed into a single string, thanks to the separate detection of single and double backticks within them.

Fortunately, this is so common that Instaparse ships with a function for fixing it. insta/transform to the rescue!

First we need a helper function for insta/transform to call:

(defn cat-code "a bit of help for code blocks"
  [tag & body] (vec [tag (apply str body)]))

Then we call it and do some stuff to the results:

(def flat-athena (drop 10 (flatten (insta/transform {:code cat-code} parsed-athena))))

Now, how you feel about this line depends on how you feel about Lisp, generally. This was written progressively from the middle out, on a REPL. It's easy to read if you know that, and would be easier still if formatted more naturally.

A more idiomatic Clojure way to do all this would be to use a threading macro like ->> to thread the data structure through the transformations, instead of making all these global defs. Everything so far could be a single function, though it's sensible to put the parser in its own ref.
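To make that concrete, here is one threaded shape it could take — a sketch, where flatten-weave is a name I just made up, and the zeus-parser and cat-code defs above are assumed:

```clojure
(defn flatten-weave
  "Slurp a markdown source file and return the flat seq zeus weaves from."
  [path]
  (->> (slurp path)
       zeus-parser
       (insta/transform {:code cat-code})
       flatten
       (drop 10)))
```

insta/transform and drop both take their data last, which is what makes ->> fit so cleanly here.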

drop 10 just gets us past the front matter. We introduce our idioms before we use them, for a reason.

We now have a flat vector, containing all the information we need. We need to transform it into a data structure which may then be massaged and spat out as our original project and core files.

The quine could be completed with a trivial act, which we put in the margins: (spit … (remove-tags-flatten-and-concatenate (zeus-parser (slurp …) :unhide :all))), which calls a function we needn't bother to write. All this does is parse, remove the tags, and flatten and concatenate the remaining literal values, which, because we used :unhide :all, amount to everything from our original source file. Cute, but not interesting enough to belong in the quine. Opening your source file, doing nothing interesting to it, and saving or printing it is generally a trivial quine, though if the convolutions you put the text through are hard enough to follow, you will amuse someone at least.

Instead, let's write a little helper function, key-maker:

(defn key-maker
  "makes a keyword name from our file string"
  [file-name]
  (keyword (last (clojure.string/split file-name #"/"))))

This takes our fully-qualified filename, pulled from a magic word, and keywordizes it. The magic words are arranged so there's one each time zeus needs to change files.

Now for the meat of the matter. weave-zeus produces the source file to zeus from the markdown version of this very file.

(defn weave-zeus
  "a weaver to generate our next iteration"
  [state code]
  (if (keyword? (first code))
    (if (= :magic-word (first code))
      (weave-zeus (assoc state
                         :current-file (first (rest code)))
                  (drop 2 code))
      (let [file-key (key-maker (:current-file state))]
        (weave-zeus (assoc state
                           file-key
                           (apply str (state file-key) (first (rest (rest code)))))
                    (drop 3 code))))
    state))

Now, that's a hack. It's a bootstrap; I fiddled with it until it worked. It is perhaps more readable than a more elegant version, if you have, like most of us, a background or ongoing investment in imperative style. The principle is ruthless pruning and minimal intelligence. We aren't touching it further, though we use it as a springboard for migraine, the next step in the process.

Migraine because it actually gives birth to Athena. Named in honor of whichever poor sufferer dreamed that mythos up.

So we move the latest into the project directory, load up the REPL and sure enough, it's all in there. Now we just have to spit it out. To do it right, we'd have kept some exact record of our file name so we could put it into a new directory of the same form. We're going to cheat instead; we did the hard part, and want to keep it readable since we don't have macros with which to bury the boring parts.

So here's our last trick:

(def zeus-map (weave-zeus {} flat-athena))

(do (spit "migraine/zeus.grammar"  (:zeus.grammar zeus-map))
    (spit "migraine/core-test.clj" (:core_test.clj zeus-map))
    (spit "migraine/core.clj"      (:core.clj zeus-map)))

That's it! The structure of the migraine directory is flat, not the structure Leiningen requires, and there are some extra newlines in the source, but I don't care and neither should you. It's officially close enough for government work. In our next chapter, we will undergo the formality of writing a test and demonstrating that Migraine's markdown contains Athena alpha, which will be a part of Athena herself.

Migraine, the next chapter in this adventure, will add some actual capabilities. Migraine is just Zeus with an extra headache: instead of producing himself, he has to produce Athena, which is a more challenging software to write.

Progressive Refinement in GLL

The GLL Algorithm is one of those core concepts that can change how we do computation. It's phenomenally powerful. I believe we're just starting to see what it's capable of.

I plan to implement GLL, as soon as some of the support work is done. It's going to be a somewhat leisurely task. I'm taking a lot of notes. Currently, I'm using Instaparse to play around with the algorithm.

It's fantastic stuff, much more flexible as an idiom than, say, ANTLR. It is also much easier to write yourself into an exponential corner, and I find myself abusing regular expressions somewhat rather than torturing the grammar in other ways.

I think this is a consequence of GLL's strengths, which let it do well parsing data that is already in a nested structure. Using it for a shallower kind of pattern matching kills performance even in parallel execution, because at any point in matching a long string of related structures, it could encounter a context that would kill the whole chain of inquiry. If there is any possibility of ambiguous matching, this explodes even faster.

An example of this kind of use is parsing a bunch of sentences and paragraphs into text, which is words and whitespace. Except if you encounter a special character you have to switch context completely; this last requirement makes most natural ways of writing the grammar fail.

An alternative would be progressive refinement. In Instaparse, the latest definition of a rule is used, and all earlier rules are ignored. My proposal, which I intend to use in my own work, is that multiple definitions of the same rule are tried sequentially against the data, after a successful parse.

This would damage the arbitrarily-ordered nature of Instaparse grammars, in a sense. Instaparse still has to decide what to do with multiple definitions, and currently uses the last one provided.

This would just formalize something one can already do with Instaparse, which is to parse a string, flatten the area of interest back into a string, and re-parse that string into a new rule. Substitute your new parse for the old one (this is very easy), and you've done it.

Automating that into the grammar would allow one to grab large globs of identical data, switching context when necessary, then parse that data into shape in smaller pieces that don't have to worry about context boundaries.

It's a straightforward adaptation, being properly speaking a composition of the same function, insta/parse grammar, over a subset of the data returned by the first call.
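The manual version of the trick fits in a few lines. This is my own sketch, not Instaparse API: the :blob tag, and both parsers, are hypothetical stand-ins.

```clojure
(defn refine
  "Parse coarsely, then re-parse the text under each :blob node with the
   finer grammar, substituting the new parse for the old node."
  [coarse-parser fine-parser text]
  (insta/transform
   {:blob (fn [& parts] (fine-parser (apply str parts)))}
   (coarse-parser text)))
```

The coarse pass grabs big context-free globs; the transform re-parses each glob's string with the fine grammar and splices the result back into the tree, which is exactly the substitution described above.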

Automated Profiling

One neat thing about grammars is that you can run them backwards to generate compliant gibberish. As you'd expect, given the vast difference in magnitude between the set of valid inputs of a given length and the set of all inputs of that length, it is quick to use Markov-type rules to generate babble even for an intricate, ambiguous grammar that would blow up trying to validate the same output.

In fact, that's the point. One may envision, and I may even write, a tool that uses pseudorandom walks to generate longer and longer streams of valid gibberish and tries them against the grammar, looking for suspicious jumps in running time or in the number of possible parses. Even a naive tool, running on a remote server, would generate useful information while a parser is being developed. One may envision the tool dialing in on where parses go ambiguous, generating input accordingly, and alerting the user, or doing the same for quadratic and cubic factors that show up.

If the validity of the babble is questionable, then one has identified permissiveness within the grammar that one may wish to eliminate. The idea has potential.
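A toy version of the babbler, to make the idea concrete. This is entirely my own sketch: the rule map is hand-written rather than derived from a grammar file, and my-parser stands in for whatever parser is under test.

```clojure
;; Rules as a map from nonterminal to alternatives; each alternative
;; is a vector of nonterminals and literal strings.
(def rules {:s [[:a :s] ["x"]]   ; s = a s | 'x'
            :a [["a"]]})         ; a = 'a'

(defn babble
  "Pseudorandom walk through the rules, producing one valid string."
  [rules sym]
  (if (string? sym)
    sym
    (apply str (map #(babble rules %) (rand-nth (rules sym))))))

;; Feed the parser longer and longer streams of valid gibberish,
;; watching the timings for superlinear jumps:
(doseq [n [1 10 100]]
  (time (my-parser (apply str (repeatedly n #(babble rules :s))))))
```

A real tool would derive the rule map from the grammar itself and bisect toward the inputs where time or ambiguity jumps, but the loop above is the whole skeleton.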

A Tangled Web We Weave

Literate Programming is one of those paradigms whose fate is continual reinvention. I've been noticing that my software projects start as Markdown. It stands to reason that they should end up as Markdown as well.

GitHub Flavored Markdown, in particular, is crying out for a literate, multi-model programming system. The mechanism of named fenced code blocks lets one put multiple languages in a single file, and they will already be syntax highlighted according to the named language.

As literate programming is for the ages, we shall call our system Marmion. The weaver shall be known as Athena; the tangler, Arachne.

If at all possible, we don't want to touch GFM itself. Therefore, here are some principles:

  • Code in fenced code blocks is extracted, macro-expanded, and executed in whatever ways are appropriate.

  • Macros must employ patterns not used in a given language; therefore, we must be able to define those patterns.

  • All configuration happens in special code blocks, called ```config:

```clojure
{ :name "A config file",
  :format :edn
  :magic-number 42 } ; this is actually tagged ```clojure
```

  • Code in regular unfenced code blocks is not included in the weave. Nor are fenced code blocks that aren't reached from the top macro. The code above, for example, will not be in the finished weave, because it is exemplary.

  • All text ends up in the tangle, which is an HTML file. No other tangle format is contemplated.

  • If standardized, the tangle format will not be specified, only the markup format and the requirements for the subsequent weave. HTML is a moving target, as is visual display in general.

  • The Markdown may be extended, but only in the same way as any other code: by specifying a macro template and expanding it from provided code. It is the macro-expanded Markdown which is tangled and woven.

  • Corollary: the Markdown is macro expanded before anything in a code block.

  • Corollary: the Markdown macro will be standard. There should be no reason to include it. Because Clojure is the implementation language, and has a defined reader macro syntax, this is already true of Clojure(Script).

  • The weaver should visit all internal links in search of code. Some tag in HTML form should be provided so that fully-marked-up links, so tagged, will also be followed in search of exterior code.

  • If exterior code is requested, it is added to the source as a fenced code block. The tangle will preserve the link directly above the code block. Some sensible effort will be made to infer the code format from the file extension. This is to be done before macro expansion, so that if there are macros in the exterior code, they will be expanded.

  • We should maintain a set of canonical macro patterns for languages, to encourage mutual compatibility in source and tangled code.

  • No mechanism for transclusion on the file level will be provided. The file structure of the Markdown is the file structure of the tangle. Working around this using the tagged-link method will leave a broken link in your tangle.
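The weaver's very first chore, collecting named fenced code blocks, fits in a few lines. This Python sketch is illustrative only, not Marmion's API; the function name and the sample blocks are invented:

```python
import re

TICKS = "`" * 3   # three backticks, spelled out to keep this block tidy

# A fenced block: a fence with an info string, a body, a closing fence.
FENCE = re.compile(rf"^{TICKS}(\w+)\n(.*?)^{TICKS}$", re.M | re.S)

def collect_blocks(markdown):
    """Gather fenced code blocks, keyed by their named language."""
    found = {}
    for lang, body in FENCE.findall(markdown):
        found.setdefault(lang, []).append(body)
    return found

doc = "\n".join([
    "Some prose.",
    "",
    TICKS + "config",
    '{:name "demo" :format :edn}',
    TICKS,
    "",
    TICKS + "clojure",
    "(def answer 42)",
    TICKS,
])
print(sorted(collect_blocks(doc)))   # → ['clojure', 'config']
```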

This is the sort of project that we can tackle in stages. The most important part is the weaver, because we have a fine tangler in the form of Jekyll.

This is a job for Clojure. The weaver and perhaps the tangler will be ClojureScript compatible in the narrow sense, but useless unless Instaparse is ported, which seems unlikely, though you never know.

Clojure is chosen for a few reasons. EDN, for one, which will be the format of any ```config code block. Also because of Instaparse, whose capabilities are a strict superset of the usual regular-expression-based markup approach. It has the best story I'm aware of for setting regular expressions declaratively in a data format, which is exactly how we will provide macros.

To be clear, this will let us syntax highlight a provided macro in a distinctive way, and put things like the colors to use right in the Markdown. This is only useful with a completed weaver; Pygments will get the macros wrong, but this is a minor stylistic matter which can be corrected by retangling with a better highlighter.

Instaparse is my go-to choice for writing flexible parsers that are meant to be shared, so Clojure it is. I hope Instaparse catches on to the point where it becomes core, and hence worth maintaining separate .clj and .cljs versions.

The first, and most important step, is writing Athena, the weaver. The weaver does the following: finds all the ```config code, parses it to configure itself, then goes after the code blocks, and uses the macros and config information to construct the weave. Finally, it calls the trigger file, which must contain everything needed to build the weave into an executable, or whatever the final product is.

The tangler, Arachne, should be a fork of Jekyll, with a low surface area of interaction. What I mean by this is that merges between the two code bases should avoid touching one another's files wherever possible. The only change I contemplate personally is to plug-replace the syntax highlighter, for several reasons.

Pygments requires one to write actual code to mark up a new format. This is distasteful. Also, we need to mark up the macros, which we won't know until we weave the code. Furthermore, a static syntax highlighter should be based on a powerful parser, not a regular-expression engine janked up with extra Python.

If Marmion becomes popular, someone might want to write advanced capabilities: putting compatible code in a REPL, for example, or linking to one from the code, or linking to the line number in a public GitHub repository generated by the weaver. The last is particularly powerful. All of this will assuredly be easier with a parser-backed tangler.

This is the only way I have to tackle large problems: recursing through the Big Project until I hit something atomic and critical to further progress. Arc leads to GGG, which will benefit greatly from a literate style, which leads to Marmion. Marmion built, writing GGG in an understandable way becomes possible.

I think I've painted myself into a corner, as I can't think of anything offhand which I need to write in order to write Marmion.

Time to generate more Markdown!

Syntax for Literal Strings

I find it somewhat astonishing that the languages with which I'm familiar still start and end strings with the same character. It is as though we used | for { and } in code blocks and relied on context to figure out if it was begin or end.

Incidentally, it's quite possible to write a language this way, and it's an interesting exercise. `for | i = 0 ; i < 2 ; i++ || codeBlock |` should parse just fine. Heaven help you if you misplace anything.
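A naive toggling reader makes the point. A hypothetical Python sketch, with the caveat that simple toggling is exactly why misplacing anything is fatal: there is no nesting to recover from.

```python
# Treat | as both { and }, disambiguated purely by context:
# each | alternates between "open" and "close", exactly as " does
# in string literals today.
def brace(src):
    out, open_next = [], True
    for ch in src:
        if ch == "|":
            out.append("{" if open_next else "}")
            open_next = not open_next
        else:
            out.append(ch)
    return "".join(out)

print(brace("for | i = 0 ; i < 2 ; i++ || codeBlock |"))
# → for { i = 0 ; i < 2 ; i++ }{ codeBlock }
```

Drop a single `|` anywhere and every subsequent delimiter silently flips its meaning, which is the "heaven help you" part.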

Check out bracket delimiters on the Wiki. Two of these things are not like the others. Those two are used preponderantly for strings.

It's clear enough how it happened. A string has an obvious mapping to literary quotation: "That's what she said!". ASCII gives us exactly three options: ', `, and ". It turns out that C was defined using 91 characters, and ` was not among them.

Meta enough, I'm writing this in Markdown, and to type `, I must type `` ` ``. I will leave how I typed `` ` `` as an exercise for the reader.

So C chose " for string syntax and ' for characters, and these decisions made sense, somewhere in the mists of time. C also initiated the proud tradition of string escaping, which wasn't invented to get around the delimiter problem, but which can be used for that purpose in a hacky way. String escaping is so you can say \n and get a newline; the incidental benefit is that you can say \" and get a ", hence one may include any character in such a string. Two backslashes is of course \\\\. One gets used to it.

Oh hey, just for fun, why not write a regex that will match such strings? Won't take you long, I promise. I'll be right here!
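For the record, there is a classic answer; here it is exercised from Python:

```python
import re

# A string literal: a quote, then any run of characters that are
# neither quote nor backslash, or backslash-escaped characters,
# then the closing quote.
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

sample = r'printf("she said \"hi\"\n"); x = "ok";'
print(STRING.findall(sample))   # two matches
```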

To the point. In typography, we don't do this. We start quotations with “ and end them with ”. On the Continent, « and » are used, and this would be my preference, as they are much easier to tell apart and don't have two choices for the opening delimiter. If you follow the link, it turns out they are used both «this way» and »this way« and even »this way» by Finns (of course). We favor the first, because all other brackets in computer programming are inward facing <{[(like so)]}>.

What's the point? They aren't on standard keyboards in the US; while any worthwhile editor can get around this, there's a pain point there. Some people will argue a virtue in using ASCII for source code, and while those people have a point, the ship sailed a long time ago. We use Unicode, and it isn't going anywhere.

The point is that, without proper left-right matched strings, you cannot literally quote your own source code within your source code. This is damaged, for any language that lets you evaluate at runtime (the interesting ones, IOW). If we use « and », we can use bog-standard bracket counting to ensure that any properly-balanced literal string in the source code gets quoted. Since in this imaginary syntax a bare » not balanced on the left with a « is a syntax error, any correct program can be embedded.
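The counting in question is a few lines. A Python sketch, with an invented reader function:

```python
def read_literal(src, start):
    """Scan a «...» literal beginning at src[start], tracking nesting
    depth so embedded balanced literals are quoted verbatim."""
    assert src[start] == "«"
    depth = 0
    for i in range(start, len(src)):
        if src[i] == "«":
            depth += 1
        elif src[i] == "»":
            depth -= 1
            if depth == 0:
                return src[start + 1:i]   # contents, delimiters dropped
    raise SyntaxError("unbalanced « literal")

code = 'x = «a string with «nested» quotes»'
print(read_literal(code, code.index("«")))
# → a string with «nested» quotes
```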

If, for any reason, you need a bare », why not use the ISO standard SHA-1 hash of the Unicode value of »? Why not indeed. It then becomes impossible to literally quote that one hash, which is officially the point where it is perverse to pursue the matter further. Concatenate for that one.

To be clear, " for escaped strings is concise and well understood, and with enough convolutions one may write as one pleases. It's syntax such as ''' for literal strings that grates against my sensibilities.

Clojure has no syntax for literal, multi-line, unescaped strings. That's too bad; no one does syntax like Rich Hickey, and I suspect that the inadequacy of existing options plays a role here. He may not be willing to go off-keyboard, but I feel that the « and » syntax has a lot to offer. Certainly Europeans would be pleased.

Homoiconicity and Data Forms

Representation of Data in Structured Programs.

Today we're going to discover a programming language. We're going to start by contemplating the idea of code as data.

LISP, and by the all-caps I mean the original flavours, had two fundamental forms: atoms, and lists. As Lisp grew up, the lists became able to represent any sort of data type, but at the expense of a certain homoiconicity.

That's a controversial assertion, but hear me out. A list in a Lisp is a bunch of cons cells: a specific data structure used by default to do pretty much anything. Since the first position of a form (first second third) holds a function or a macro, you can fake, say, a hash by saying something like (hash a a-prime b b-prime). But here's the problem: that's not homoiconic to your data anymore. Not in a way that accords with modern expectations.

Let's talk about JSON. Now, JSON is homoiconic to your data. {}? Object. []? List. ""? String. An optional minus sign and digits? Number. And so on.
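The contrast is easy to demonstrate. A small Python sketch (the plist convention shown is the generic Lisp idiom, not any particular dialect's reader):

```python
import json

# The Lisp-style fake: a flat list where the meaning lives in
# position and in the convention announced by the head symbol.
fake_hash = ["hash", "a", "a-prime", "b", "b-prime"]

# Reading it as a map requires knowing that convention:
op, *pairs = fake_hash
as_map = dict(zip(pairs[::2], pairs[1::2]))

# The JSON-style literal *is* the map; the shape is the meaning.
literal = json.loads('{"a": "a-prime", "b": "b-prime"}')

print(as_map == literal)   # True
```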

An Introduction to Ent

ent is a new approach to code creation. It is (will be) an editor and library that works on parse trees, rather than on files, and registers all changes as operational transformations. It does so through the medium of familiar code files, but these may be thought of as an interface to the code, and as a product of it, similar to the executable binaries produced by a compiler.

Parse-Aware Editing of Structured Files

ent's major rationale is parse-awareness. It will, in general, not allow you to type invalid code, though this can always be overridden. It will parse your code as you create it, storing the resulting file as a series of operational transformations on the parse tree. As a language is more thoroughly defined within ent, this enables REPL-like instant feedback and sophisticated refactoring.
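Operational transformation, at its smallest, looks like this. A Python sketch over a flat text buffer, with invented names; ent would apply the same idea to parse-tree edits rather than character offsets:

```python
# Two concurrent insert operations, each a (position, text) pair.

def transform(op_a, op_b):
    """Rewrite insert op_a so it applies correctly after op_b."""
    pos_a, text_a = op_a
    pos_b, text_b = op_b
    if pos_a >= pos_b:                     # op_b shifted our target right
        return (pos_a + len(text_b), text_a)
    return op_a

def apply(buf, op):
    pos, text = op
    return buf[:pos] + text + buf[pos:]

base = "(def x 1)"
a = (5, "y")        # one editor inserts at position 5
b = (0, ";; ")      # another concurrently prepends a comment

# Either order converges once the later op is transformed:
print(apply(apply(base, b), transform(a, b)))   # → ;; (def yx 1)
```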