
Code block metadata

March 17, 2025 #odoc #parsing

Back in 2021 @julow introduced some new syntax for odoc’s code blocks to allow us to attach arbitrary metadata to them. We imposed no structure on this; it was simply a block of text in between the language tag and the start of the code block. Now that odoc needs to use it itself, we need to be a bit more precise about how it’s defined.

The original concept looked like this:

{@ocaml metadata goes here in an unstructured way[
  ... code ...
]}

where everything in between the language (“ocaml” in this case) and the opening square bracket would be captured and put into the AST verbatim. Odoc itself has had no particular use for this, but it has been used in mdx to control how it handles the code blocks, for example to skip processing of the block, to synchronise the block with another file, to disable testing the block on particular OSs and so on.

As part of the Odoc 3 release we decided to address one of our oldest open issues, that of extracting code blocks from mli/mld files for inclusion into other files. This is similar to the file-sync facility in mdx but it works in the other direction: the canonical source is in the mld/mli file. In order to do this, we now need to use the metadata so we can select which code blocks to extract, and so we needed a more concrete specification of how the metadata should be parsed.

We looked at what mdx does, but the way it works is rather ad-hoc, using very simple String.splits to chop up the metadata. This is OK for mdx as it’s fully in charge of what things the user might want to put into the metadata, but for a general parsing library like odoc.parser we need to be a bit more careful. Daniel Bünzli suggested a simple strategy of atoms and bindings inspired by s-expressions. The idea is that we can have something like this:

{@ocaml atom1 "atom two" key1=value1 "key 2"="value with spaces"[
    ... code content ...
]}
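To make the atoms-and-bindings idea a bit more concrete, here is a minimal sketch in OCaml of how such a metadata string might be split up. It is not odoc’s actual implementation: the `tag` type and the `read_word` and `parse_tags` functions are hypothetical names chosen for illustration, and quoted strings are read without any escape handling.

```ocaml
(* Hypothetical representation: a tag is a bare atom or a key=value binding. *)
type tag =
  | Atom of string
  | Binding of string * string

let is_space c = c = ' ' || c = '\t' || c = '\n'

(* Read one word starting at index [i]: either a double-quoted string
   (escapes are NOT handled in this sketch) or a run of characters up to
   the next whitespace or '='. Returns the word and the index just past it. *)
let read_word s i =
  let n = String.length s in
  if i < n && s.[i] = '"' then begin
    match String.index_from_opt s (i + 1) '"' with
    | Some j -> (String.sub s (i + 1) (j - i - 1), j + 1)
    | None -> failwith "unterminated string"
  end else begin
    let j = ref i in
    while !j < n && not (is_space s.[!j]) && s.[!j] <> '=' do
      incr j
    done;
    (String.sub s i (!j - i), !j)
  end

(* Split the metadata into tags: a word immediately followed by '='
   starts a binding, anything else is an atom. *)
let parse_tags s =
  let n = String.length s in
  let rec skip i = if i < n && is_space s.[i] then skip (i + 1) else i in
  let rec go i acc =
    let i = skip i in
    if i >= n then List.rev acc
    else begin
      let key, i = read_word s i in
      if i < n && s.[i] = '=' then begin
        let value, i = read_word s (i + 1) in
        go i (Binding (key, value) :: acc)
      end else
        go i (Atom key :: acc)
    end
  in
  go 0 []
```

Applied to the metadata from the example above, `parse_tags` yields two atoms followed by two bindings. Splitting naively on spaces would instead cut "atom two" into two pieces, which is why the quoting has to be handled by the parser itself.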

Daniel suggested a very minimal escaping rule, whereby a string could contain a literal " by prefixing it with a backslash, for example "value with a \" and spaces". We discussed it during the odoc developer meeting, though, and felt that we might want something a little more familiar. So we took a look at the lexer in sexplib, found that it follows the lexical conventions of OCaml’s strings, and decided that would be a reasonable approach for us to follow too.
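As a rough illustration of what “OCaml’s lexical conventions” means for the body of a quoted string, the sketch below decodes escape sequences using `Scanf.unescaped` from the standard library. This is only a stand-in for the behaviour, not the lexer that odoc uses, and `decode_string_body` is a hypothetical helper name.

```ocaml
(* Assumption: Stdlib's Scanf.unescaped stands in for a real lexer here.
   [body] is the text between the double quotes, escapes still encoded. *)
let decode_string_body (body : string) : string option =
  match Scanf.unescaped body with
  | decoded -> Some decoded
  (* Badly escaped input is reported as None rather than as an exception. *)
  | exception (Scanf.Scan_failure _ | Failure _ | End_of_file) -> None

let () =
  (* The backslash-quote form from the text, plus a familiar OCaml escape. *)
  assert (decode_string_body {|value with a \" and spaces|}
          = Some {|value with a " and spaces|});
  assert (decode_string_body {|tab\there|} = Some "tab\there")
```

The attraction of this approach is that anyone who already writes OCaml string literals knows the rules, so there is nothing new to learn for the metadata.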

The resulting code, including the extraction logic, was implemented in PR 1326 mainly by @panglesd with a little help from me on the lexer.