Week 33

More work this week on the OCaml MCP server. Sadiq and I met before I went away on holiday and discussed the next steps to 'park' the work on the MCP server. The final steps are:

Not much, right? As always though, writing things up led to a whole load more work.

The first problem occurred when writing up how it parsed the input docs. It turned out that when converting the repo so that it took markdown formatted files (using a slightly tweaked version of davesnx's PR), Claude had decided that the way to do this was to first convert the markdown into HTML, and then use the HTML parser it had already built. Whilst tidying this up, Claude was remarkably keen to just use regexps to parse the markdown rather than using a pre-existing markdown library, so it took a little persuasion to get it into a state I was happy with.

The second issue was that the scripts that form the bulk of the repo had been written at different times, and Claude hadn't really taken into account any of the decisions it had made in one script when building the next. So most of the command-line arguments were slightly different, which made writing up a mini 'howto' in the README quite a jarring experience.
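One common way to keep options consistent across a family of scripts is to define them once and have each script inherit them. Here is a minimal sketch in Python, for illustration only: the post doesn't show the repo's real scripts or flags, so `--docs-dir`, `--output-dir` and `--package` are hypothetical names.

```python
import argparse

def common_parser() -> argparse.ArgumentParser:
    """Options shared by every script. The flag names are hypothetical."""
    parent = argparse.ArgumentParser(add_help=False)
    parent.add_argument("--docs-dir", required=True, help="directory of input docs")
    parent.add_argument("--output-dir", default="out", help="where results are written")
    parent.add_argument("--package", action="append", default=[],
                        help="restrict to this package (repeatable)")
    return parent

def build_parser(description: str) -> argparse.ArgumentParser:
    """Each script builds its own parser on top of the shared options."""
    return argparse.ArgumentParser(description=description, parents=[common_parser()])

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser("Summarise modules").parse_args(
    ["--docs-dir", "docs", "--package", "yojson"]
)
```

With this shape, a script-specific flag is added to the parser returned by `build_parser`, while the shared flags stay spelled the same everywhere.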

Thirdly, and most importantly, we had decided that we needed a few example searches to show how the system worked. We'd already had a useful experience with this when Anil had tried to search for a 'time and date parsing and formatting' library, so it shouldn't really have been a surprise that trying a few more examples showed some more interesting behaviour. Specifically, the searches I wanted to do were for an "HTTP client", "JSON parser", "Cryptographic Hash" and Anil's time-and-date query, and in actually trying these searches and critically examining the results, I had to go back and figure out why they weren't giving me the results I had expected.

The first of these searches I had anticipated would be quite interesting, as this is a query that should show the OCaml ecosystem missing an obvious HTTP client. However, even with this in mind one of the top results was one of Cohttp's module types, Cohttp.Generic.Client.S. This, of course, isn't much use if you're looking for an HTTP client, as module-types aren't going to give you an implementation to actually use. So I decided that we'd exclude module-types from the results. This turned out to be slightly more tricky than I anticipated as we'd lost the distinction between modules and module types further back in the pipeline, so Claude had to do some plumbing to ensure we had this information at the point we were doing the search.
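The filtering step itself is simple once the kind of each entry is available at search time. A minimal sketch in Python, assuming each indexed entry carries a `kind` field: the `Entry` type, the scores and every module path other than `Cohttp.Generic.Client.S` are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    path: str     # e.g. "Cohttp.Generic.Client.S"
    kind: str     # "module" or "module-type", carried through from the doc parser
    score: float  # similarity to the query (illustrative values below)

def search_results(entries, limit=5):
    """Drop module types, then return the highest-scoring modules."""
    modules = [e for e in entries if e.kind == "module"]
    return sorted(modules, key=lambda e: e.score, reverse=True)[:limit]

entries = [
    Entry("Cohttp.Generic.Client.S", "module-type", 0.91),
    Entry("Other_http", "module", 0.80),        # hypothetical module
    Entry("Some_http.Client", "module", 0.88),  # hypothetical module
]
results = search_results(entries)
```

The hard part described above is not this filter but making sure `kind` survives every earlier stage of the pipeline.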

The cryptographic hash search gave some plausible looking results, so I moved on to the JSON search. I was expecting to see Yojson somewhere near the top of the list as that's a very popular library. I was also expecting to see Jsonm somewhere near the top - or at least I'd like to be able to find it by searching for a 'streaming parser' as that's one of its key strengths. However, searching for "JSON parser" yielded some less than brilliant answers. The top 5 results were for modules in the packages yojson-five, decoders-yojson, decoders-jsonaf, ocplib-json-typed-browser and ppx_protocol_conv_jsonm. While all of these are clearly in the same realm as I was after, having jsonm show up literally 99th in the list, and yojson itself not in the top 100 wasn't a great result.

Some investigation showed that yojson had a particularly bad showing because the description of the module Yojson.Basic was the empty string! This turned out to be because of some bad error-handling logic in the summariser script, which ended up turning some errors into a blank description. Since running the summariser costs actual money, I didn't want to just rerun the whole thing, so I asked Claude for a script to find these problems and rerun them. The problem is not totally trivial as the summaries of child modules are used when generating the summary for parents, so when one is regenerated we should regenerate the summaries of all ancestors too. Given my recent experiences with Claude I'd like to look this over quite carefully before letting it loose on my data, so I've run it on yojson, which seemed to do the right thing, but not yet on the rest of the packages.
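To make the ancestor problem concrete, here is a minimal sketch in Python of working out what needs regenerating. It assumes summaries are stored as a mapping from dotted module path to text and that a module's ancestors can be read off its path. Both are assumptions for illustration, not a description of the actual script.

```python
def ancestors(path: str) -> list[str]:
    """'Yojson.Basic.Util' -> ['Yojson.Basic', 'Yojson'] (nearest first)."""
    parts = path.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

def to_regenerate(summaries: dict[str, str]) -> list[str]:
    """Modules with blank summaries plus all their known ancestors.

    Deepest paths come first, so children are regenerated before the
    parents whose summaries depend on them.
    """
    stale = set()
    for path, text in summaries.items():
        if not text.strip():
            stale.add(path)
            stale.update(a for a in ancestors(path) if a in summaries)
    return sorted(stale, key=lambda p: (-p.count("."), p))

# Illustrative data: only Yojson.Basic has a blank summary.
summaries = {
    "Yojson": "JSON library",
    "Yojson.Basic": "",
    "Yojson.Basic.Util": "helper functions",
    "Yojson.Safe": "another JSON variant",
}
plan = to_regenerate(summaries)
```

Here the plan contains `Yojson.Basic` followed by `Yojson`, while the sibling `Yojson.Safe` and the child `Yojson.Basic.Util` are left alone, which keeps the paid summariser calls to a minimum.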

Having fixed this, I still found that jsonm was making a very poor showing. This turned out to be because the description it gives itself is a "Non-blocking streaming JSON codec for OCaml" which had a fairly low similarity with "JSON parser". I was using a fairly small embedding model for the queries - Qwen/Qwen3-Embedding-0.6B, so I thought I might address this by using a larger one, and opted for Qwen/Qwen3-Embedding-8B. The machine I had been using for the MCP server has no GPU and had taken a while to do the embeddings using the 0.6B model, so I switched to generating them on my M4 MacBook. This went much faster, though since I have about 70MB of module summaries it still took quite a while. The larger model improved the situation somewhat, but jsonm was still not high in the list.
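For readers unfamiliar with embedding search, ranking usually boils down to comparing a query vector against one vector per summary. The sketch below is in Python and assumes cosine similarity, which is a common choice but not one the text above confirms. The two-dimensional vectors are toy values, not real model output.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, summary_vecs):
    """Return summary names ordered from most to least similar to the query."""
    return sorted(summary_vecs,
                  key=lambda name: cosine(query_vec, summary_vecs[name]),
                  reverse=True)

# Toy vectors standing in for embeddings of a query and two summaries.
query = [1.0, 0.0]
summary_vecs = {
    "close_match": [1.0, 0.1],
    "distant_match": [0.2, 1.0],
}
ordering = rank(query, summary_vecs)
```

A description that uses different vocabulary from the query, such as "codec" against "parser", ends up pointing in a somewhat different direction and so sorts lower, whichever model produced the vectors.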

So I took a step back and had a think about the problem some more. Searching for a JSON parser is really quite a high-level search, and when evaluating the results I realised I was really thinking in terms of packages rather than modules. So I thought we could split the search in two - a package search and a module search. The package search would be used for the broad queries where you're interested in pulling in whole chunks of functionality, and the module search is for more low-level queries. In fact, the 'time and date formatting' query is somewhere in between, so I might need to have some more example queries for the module search functions. In addition, the module search could be restricted to the set of packages you're using, which might make it even more useful.

Part of the split meant that I needed a different source of 'popularity' for the packages than the occurrences data that came out of docs CI, as that was per-module and I needed something per-package. The obvious thing is to look at reverse dependencies in opam. I have this kind of working, but it's currently not particularly smart, so this will need a little more attention. For example, it currently thinks that melange has over 3000 reverse dependencies.
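As a sketch of the basic idea, here is a direct reverse-dependency count in Python. It assumes the dependency data has already been extracted into a mapping from package name to the set of package names it depends on. That input format and the package names `a` and `b` are hypothetical, and this is not the code currently producing the melange figure.

```python
from collections import defaultdict

def reverse_dependency_counts(depends: dict[str, set[str]]) -> dict[str, int]:
    """Count, for each package, how many distinct packages depend on it directly."""
    rdeps = defaultdict(set)
    for pkg, deps in depends.items():
        for dep in deps:
            if dep != pkg:  # ignore self-references
                rdeps[dep].add(pkg)
    return {pkg: len(users) for pkg, users in rdeps.items()}

# Illustrative data: two packages depend on yojson, one depends on "a".
depends = {
    "a": {"yojson"},
    "b": {"yojson", "a"},
    "yojson": set(),
}
counts = reverse_dependency_counts(depends)
```

Because dependents are collected into sets keyed by package name, a dependent is counted once no matter how many times it appears in the input. Whether something like repeated counting explains the inflated number above is only a guess.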

With these changes in place, a package search for 'JSON parser' now returns yojson as number one, followed by ppx_deriving_yojson, ezjsonm, ocplib-json-typed and jsonaf. Unfortunately jsonm is still languishing in 27th place, so there's still some tweaking to do.