Memories of writing a parser for man pages

March 24, 2018

I generally enjoy being bored, but sometimes enough is enough. That was the case one Sunday afternoon in 2015, when I decided to start an open source project to overcome my boredom.

In my quest for ideas, I stumbled upon a request by Mathias Bynens to build a "Man page viewer built with web standards". Without thinking too much, I started coding a man page parser in JavaScript, which, after a lot of back and forth, ended up being Jroff.

Back then, I was familiar with manual pages as a concept and had used them a fair number of times, but that was all I knew. I had no idea how they were generated or whether there was a standard in place. Two years later, here are some thoughts on the matter.

How man pages are written

The first thing that surprised me at the time was that man pages are, at their core, just plain text files stored somewhere in the system (you can check where using the manpath command).

These files contain not only the documentation, but also formatting information written for a typesetting system from the 1970s called troff.

troff, and its GNU implementation groff, are programs that process a textual description of a document to produce typeset versions suitable for printing. It's more 'What you describe is what you get' rather than WYSIWYG.

extracted from troff.org

If you are totally unfamiliar with typesetting formats, you can think of them as Markdown on steroids, but in exchange for the flexibility you have a more complex syntax:

[Figure: groff-compressor]
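To give a rough idea of what that syntax looks like, here is a minimal JavaScript sketch. The sample source is made up for illustration (it is not taken from a real man page), and splitLines is a hypothetical helper, not part of Jroff. It only separates control lines, which start with a dot or an apostrophe, from plain text lines.

```javascript
// Hypothetical sample: a few lines of man-macro source, written for this example.
// String.raw keeps roff escapes such as \fI and \- intact.
const source = String.raw`.TH HELLO 1 "March 2018" "hello 1.0"
.SH NAME
hello \- print a friendly greeting
.SH SYNOPSIS
.B hello
[\fIname\fR]`;

// Split roff source into control lines (requests or macro calls) and text lines.
function splitLines(input) {
  return input.split('\n').map((line) => {
    // A control line starts with "." or "'", followed by a name and its arguments.
    const match = /^[.'][ \t]*(\S+)[ \t]*(.*)$/.exec(line);
    if (match) {
      return { kind: 'control', name: match[1], args: match[2] };
    }
    return { kind: 'text', value: line };
  });
}

const lines = splitLines(source);
console.log(lines);
```

Even in this tiny sample you can see both layers of the format: control lines such as .TH and .SH, and inline escapes such as \fI inside text lines, which this sketch leaves untouched.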

The groff file can be written manually, or generated from other formats such as Markdown, LaTeX, HTML, and so on with many different tools.

Why groff and man pages are tied together has to do with history. The format has mutated over time, and its lineage is composed of a chain of similarly named programs: RUNOFF > roff > nroff > troff > groff.

But this doesn't necessarily mean that groff is strictly tied to man pages. It's a general-purpose format that has been used to write books and even for phototypesetting.

Moreover, it's worth noting that groff can also call a postprocessor to convert its intermediate output to a final format, which is not necessarily ASCII for terminal display. Some of the supported formats are TeX DVI, HTML, Canon, HP LaserJet4 compatible, PostScript, utf8, and many more.
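As a small illustration, the output format is selected with groff's -T option, and the macro package with -m. The sketch below only builds the argument list. The helper name groffArgs and the DEVICES table are assumptions made for this example, and the table lists just a handful of device names.

```javascript
// A few output devices, keyed by the name passed to groff's -T option.
const DEVICES = {
  ascii: 'plain ASCII for terminals',
  utf8: 'UTF-8 for terminals',
  html: 'HTML',
  ps: 'PostScript',
  dvi: 'TeX DVI',
  lj4: 'HP LaserJet 4 compatible printers',
  lbp: 'Canon laser printers',
};

// Build the arguments for a groff call (hypothetical helper, not part of Jroff).
function groffArgs(device, file, macroPackage = 'man') {
  if (!Object.prototype.hasOwnProperty.call(DEVICES, device)) {
    throw new Error(`Unknown output device: ${device}`);
  }
  return ['-m', macroPackage, '-T', device, file];
}

console.log(groffArgs('utf8', 'ls.1'));
console.log(groffArgs('ps', 'ls.1', 'mdoc'));
```

The resulting array could be handed to something like Node's child_process.execFileSync('groff', args), assuming groff is installed on the machine.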

Macros

Another cool feature of the format is its extensibility: you can write macros that enhance the basic functionality.

Given the long history of *nix systems, there are several macro packages that group useful macros together for specific purposes, according to the output that you want to generate. Examples of macro packages are man, mdoc, mom, ms, and mm, and the list goes on.

Manual pages are conventionally written using man and mdoc.

You can easily distinguish native groff commands from macros by the way standard groff packages capitalize their macro names. For man, each macro's name is uppercased, like .PP, .TH, .SH, etc. For mdoc, only the first letter is uppercased: .Pp, .Dt, .Sh.
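That naming convention can be turned into a rough heuristic. The following is a minimal sketch, assuming names follow the convention described above; classifyName is a hypothetical helper, and user-defined macros can use any capitalization, so treat the result as a guess.

```javascript
// Guess where a control-line name comes from, based only on its capitalization.
function classifyName(name) {
  if (/^[A-Z][a-z]+$/.test(name)) return 'mdoc macro';  // .Pp, .Dt, .Sh
  if (/^[A-Z]+$/.test(name)) return 'man macro';        // .PP, .TH, .SH
  if (/^[a-z]/.test(name)) return 'native request';     // .br, .sp, .ft
  return 'unknown';
}

for (const name of ['PP', 'Pp', 'br']) {
  console.log(name, '->', classifyName(name));
}
```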

[Figure: groff-example]

Challenges

Whether you are considering writing your own groff parser or are just curious, these are some of the problems that I found most challenging.

Context-sensitive grammar

Formally, groff has a context-free grammar. Unfortunately, since macros describe opaque bodies of tokens, the set of macros in a package may not itself implement a context-free grammar.

This kept me away (for good or bad) from the parser generators that were available at the time.

Nested macros

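Most of the macros in mdoc are callable, which roughly means that macros can be used as arguments of other macros. For example, in the line .Op Fl flag Ar value, the Fl (flag) and Ar (argument) macros are called from within Op (optional), and the result is rendered as [-flag value].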
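A minimal sketch of how a parser might handle a line such as .Op Fl flag Ar value is shown below. The CALLABLE set is a hypothetical three-macro subset, and the two-level model (the first macro encloses the line, each callable macro collects the words that follow it) is a simplifying assumption for this example, not how Jroff or any real mdoc implementation works.

```javascript
// Hypothetical subset of mdoc macros treated as callable in this sketch.
const CALLABLE = new Set(['Op', 'Fl', 'Ar']);

// Parse one mdoc line into a small tree. The first macro encloses the line;
// each callable macro among its arguments collects the words that follow it.
function parseMdocLine(line) {
  const [name, ...tokens] = line.replace(/^\./, '').trim().split(/\s+/);
  const root = { macro: name, children: [] };
  let current = root;
  for (const token of tokens) {
    if (CALLABLE.has(token)) {
      current = { macro: token, children: [] };
      root.children.push(current);
    } else {
      current.children.push(token);
    }
  }
  return root;
}

// Render the tree as plain text: Op adds brackets, Fl adds a leading dash.
function render(node) {
  if (typeof node === 'string') return node;
  const inner = node.children.map(render).join(' ');
  switch (node.macro) {
    case 'Op': return `[${inner}]`;
    case 'Fl': return `-${inner}`;
    default: return inner;
  }
}

const tree = parseMdocLine('.Op Fl flag Ar value');
console.log(render(tree));
```

The point of the sketch is that a token's meaning depends on its position: the same two letters are a macro call in one place and could be plain text in another, which is part of what makes the format awkward for off-the-shelf parser generators.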

Lack of beginner-friendly resources

Something that really confused me was the lack of a canonical, well-defined, and clear source to look at. There's a lot of information on the web, but it assumes so much about the reader that it takes time to grasp.

Interesting macros

To wrap up, I will offer you a very short list of macros that I found interesting while developing Jroff:

man

mdoc

Sources