fedops blog

Privacy in Computing

Mon 28 March 2022

Installing and Using pandoc on Fedora/CentOS/Rocky

Posted by fedops in Howto   

Documentation-As-Code is a much better approach than your WYSIWYG editor of old.

The general consensus regarding how to create and maintain documentation has definitely changed over the past few years. People had come to regard WYSIWYG text processors as the gold standard: effortlessly arrange text and images, see what you are getting in real time, and everything is just a click away. Besides the big commercial packages there is also a sprawling landscape of free software, such as LibreOffice, which many people happily use.

However, the disadvantages have become ever more obvious:

  • most word processors use binary formats to store the documents, usually in one big file or a "camouflaged" zip archive containing multiple files.
  • said files are usually in proprietary formats that are not only hard to convert for other software packages, but also change quickly within one software suite.
  • meaningful version control is nigh-on impossible.
  • comparing changes between versions is usually only possible with software-specific tools, or not at all.
  • converting or processing files for different output formats is a mostly manual process.
  • writers tend to spend way too much time obsessing over formatting minutiae or fighting the software instead of concentrating on the content of their document.
  • most formatting is haphazard and "for show" only, usually using formatting styles which may or may not carry across documents and software versions.

Alternatives

Better methods have existed for decades, such as the roff/nroff/groff typesetting systems for Unix, SGML, or (La)TeX for every system under the sun. Usually there is a fairly steep learning curve involved, which means their use was limited to special situations or was only picked up by people writing a lot of intricate documents, such as researchers or technical writers (the profession).

reStructuredText/Markdown

Lately several low-barrier-to-entry documentation formats, such as reStructuredText and Markdown, have come onto the scene and have been adopted especially by developers. While nowhere near as flexible as, for example, TeX, they follow the 80/20 principle: 20% of the effort nets 80% of the gain, there are workarounds for advanced use cases, and a lot of the burden of writing documentation is removed. That is a crucial first step in actually having documentation written!

What's more, being plain-text based these documents:

  • can be easily version-controlled
  • can be collaboratively edited with standard conflict-resolving merge strategies
  • are light-weight, completely open, and require no special editing tools
  • can be read as-is in source form without losing any content
  • can be automatically converted into HTML web pages and other document formats

The sparseness of the markup also means that a special WYSIWYG editor isn't required for many use cases: what is in your editor is fine to read as-is. Markdown in particular has become the darling of Git forges and as such can be readily rendered and edited in web interfaces.

The pandoc system

One of the best tools to work with many of these formats is pandoc. It can process various input formats, among them Markdown in various flavors and reStructuredText. On the output side it again supports a slew of formats, among them HTML and PDF output, the latter by way of LaTeX typesetting, which produces very high-quality results. Many extensions to pandoc itself exist, and of course one can use any of the zillions of add-on packages for LaTeX or develop new ones, making this a fairly easy step up from, or indeed a side-by-side to, hand-editing LaTeX.

One of the downsides is that the complete pandoc system tends to be somewhat involved to set up. There isn't a one-stop shop solution with up to date packages available for RPM-based systems. So here's what I did with the hope of it being useful to you as well.

This is current as of 28-Mar-2022 and was done on a Rocky Linux 8.5 system.

pandoc

This is the actual pandoc software itself. I found it easiest to download the prebuilt pandoc binary for your architecture from the project's releases page. If you prefer to build it yourself, the pandoc documentation covers that as well.

Once you have it downloaded, unpack the binary into /usr/local/pandoc/<version> and create a symlink (using ln -s) from the binary into /usr/local/bin so the shell will find it in your $PATH. Using this approach enables you to maintain multiple pandoc releases on the same system. If something goes wrong with a new version you just need to adjust the symlink to point to the older one and you're back in business until you get the problem fixed.
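
The versioned layout can be sketched as two small shell functions. This is only an illustration under a few assumptions: the function names and the PREFIX variable are mine, the symlink is placed in $PREFIX/bin (which is on the default $PATH for /usr/local), and the release tarball is assumed to contain a single top-level pandoc-<version> directory with bin/pandoc inside.

```shell
# Sketch: keep each pandoc release in its own directory and switch via a symlink.
# PREFIX, install_pandoc and use_pandoc are illustrative names, not part of pandoc.
PREFIX="${PREFIX:-/usr/local}"

# Unpack a downloaded release tarball into $PREFIX/pandoc/<version>.
install_pandoc() {
    local tarball="$1" version="$2"
    mkdir -p "$PREFIX/pandoc/$version" || return
    # --strip-components=1 drops the assumed top-level pandoc-<version>/ directory
    tar -xzf "$tarball" --strip-components=1 -C "$PREFIX/pandoc/$version"
}

# Point $PREFIX/bin/pandoc at the chosen release.
use_pandoc() {
    local version="$1"
    local target="$PREFIX/pandoc/$version/bin/pandoc"
    if [ ! -x "$target" ]; then
        echo "no pandoc binary at $target" >&2
        return 1
    fi
    mkdir -p "$PREFIX/bin" || return
    # -n replaces an existing symlink instead of following it
    ln -sfn "$target" "$PREFIX/bin/pandoc"
}
```

Run as root, something like install_pandoc pandoc-2.17.1.1-linux-amd64.tar.gz 2.17.1.1 followed by use_pandoc 2.17.1.1 would do the job (the version number is just an example). Rolling back is a single use_pandoc call with the older version.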

pandoc-crossref

This is the cross-referencer for pandoc, a very useful extension. Download the suitable tar archive from the releases page on GitHub and unpack it somewhere.

Move the executable into /usr/local/bin so it can be found in the $PATH by pandoc and the man page into /usr/local/share/man/man1 so it's in your MANPATH.
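
As a rough sketch, the two copy steps could look like this. I am assuming the unpacked archive contains the files pandoc-crossref and pandoc-crossref.1. The function name and the PREFIX variable are my own.

```shell
# Sketch: install the unpacked pandoc-crossref executable and man page.
# PREFIX defaults to /usr/local; run as root for a system-wide install.
install_crossref() {
    local srcdir="$1" prefix="${PREFIX:-/usr/local}"
    mkdir -p "$prefix/bin" "$prefix/share/man/man1" || return
    # Assumed file names inside the unpacked archive
    install -m 0755 "$srcdir/pandoc-crossref"   "$prefix/bin/" || return
    install -m 0644 "$srcdir/pandoc-crossref.1" "$prefix/share/man/man1/"
}
```

Once installed, the filter is enabled per run, for example with pandoc report.md --filter pandoc-crossref -o report.pdf (report.md being a placeholder). Keep in mind that each pandoc-crossref release is built against a specific pandoc version, so it is worth picking a matching pair.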

rsvg-convert

This is the SVG converter which will enable you to use vector images in your documents. Install using dnf install librsvg2-tools.

texlive

The Rocky/CentOS packaged texlive version is very old and lacks templates. TeX Live must therefore be installed from scratch, following the instructions given on the TeX Users Group's web site.

Start by downloading the install-tl-unx.tar.gz installer archive and unpacking it somewhere (e.g. your ~/Downloads directory; you can delete it again after installation is complete).

Then, as root, create the directory /usr/local/texlive and chown it to the non-root user who will run the installation.

Finally, cd into the unpacked installer and run perl ./install-tl. This will download thousands of packages and run for quite a while depending on your machine and Internet connection speeds. When the process is complete you will find the installation in /usr/local/texlive, owned by you. If you wish you can run a sudo chown -R bin:bin /usr/local/texlive on the entire directory tree to make it accessible to all users on your system.
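
Put together, the steps above might look like the sketch below. It is deliberately cautious: by default it only prints the commands (DRY_RUN=1), and you set DRY_RUN=0 to actually execute them. The run and install_texlive helpers, the DRY_RUN switch, and the install-tl working directory are my own choices. The download URL is the generic CTAN mirror address for the installer archive, which is named install-tl-unx.tar.gz.

```shell
# Sketch of the TeX Live network install. Prints the commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# Either show a command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

install_texlive() {
    local workdir="${1:-$HOME/Downloads}"
    run cd "$workdir" || return
    run curl -fLO https://mirror.ctan.org/systems/texlive/tlnet/install-tl-unx.tar.gz || return
    run mkdir -p install-tl || return
    # Strip the dated top-level directory so the unpacked path is predictable
    run tar -xzf install-tl-unx.tar.gz --strip-components=1 -C install-tl || return
    # Create the target as root, then hand it to the user running the installer
    run sudo mkdir -p /usr/local/texlive || return
    run sudo chown "$(id -un)" /usr/local/texlive || return
    run cd install-tl || return
    run perl ./install-tl
}
```

Calling install_texlive shows what would happen. Running ( DRY_RUN=0; install_texlive ) in a subshell performs the installation without changing the working directory of your interactive shell. The installer itself is interactive and will ask you to confirm the settings.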

Eisvogel template

A template I have really grown to like is Eisvogel. It can be used to create professional looking documents in corporate environments.

If you want to install it, download the tar.gz file from the releases page on GitHub. Unpack it into ~/.local/share/pandoc/templates, running mkdir -p ~/.local/share/pandoc/templates first if that directory doesn't exist yet. If you later find additional templates you want to use, they all install into that directory.
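
A minimal sketch of that step, with a sanity check added. The function name and the TEMPLATES variable are mine. The check assumes the archive places eisvogel.latex at its top level, which is the file pandoc looks for when you pass --template eisvogel for LaTeX/PDF output.

```shell
# Sketch: unpack a downloaded template archive into pandoc's per-user template directory.
TEMPLATES="${TEMPLATES:-$HOME/.local/share/pandoc/templates}"

install_template() {
    local tarball="$1"
    mkdir -p "$TEMPLATES" || return
    tar -xzf "$tarball" -C "$TEMPLATES" || return
    # Assumption: the template file sits at the top level of the archive
    if [ -f "$TEMPLATES/eisvogel.latex" ]; then
        echo "eisvogel.latex installed in $TEMPLATES"
    else
        echo "eisvogel.latex not found directly in $TEMPLATES; check the archive layout" >&2
        return 1
    fi
}
```

If the check fails, the archive probably unpacked into a subdirectory, and moving the template file up one level should fix it.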

Setup environment

For every user that should work with pandoc, add the following to ~/.bashrc or ~/.bash_profile (or equivalent startup files if you use another shell):

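# User specific environment
if ! [[ "$PATH" =~ "$HOME/.local/bin:$HOME/bin:" ]]
then
    PATH="$HOME/.local/bin:$HOME/bin:$PATH"
fi

# TeX Live binaries (pandoc needs them for PDF output); add them only once,
# independently of the check above
TEXLIVE=/usr/local/texlive/2021
if ! [[ ":$PATH:" == *":$TEXLIVE/bin/x86_64-linux:"* ]]
then
    PATH="$PATH:$TEXLIVE/bin/x86_64-linux"
fi
export PATH

# TeX Live man and info pages; a leading colon (empty MANPATH) keeps the system defaults
MANPATH="${MANPATH}:$TEXLIVE/texmf-dist/doc/man"
INFOPATH="${INFOPATH}:$TEXLIVE/texmf-dist/doc/info"
export MANPATH INFOPATH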
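
After opening a new shell it is worth checking that everything is actually found. The helper below is just a convenience sketch of my own, not something that ships with pandoc or TeX Live.

```shell
# Sketch: report which of the given tools the current shell can find in $PATH.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'ok       %s -> %s\n' "$tool" "$(command -v "$tool")"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    # Non-zero exit status if at least one tool was not found
    return "$missing"
}
```

For the setup described here, check_tools pandoc pandoc-crossref rsvg-convert pdflatex should list four paths. Anything marked MISSING points to a step above that needs another look.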

Test an example

Change into ~/.local/share/pandoc/templates/examples/basic-example and run: pandoc basic-example.md -o basic-example.pdf --from markdown --template eisvogel --listings

This should create beautifully formatted output in basic-example.pdf which is styled with the Eisvogel template.

"Documents as Code"

With this out of the way you are now in a position to start writing your documents in reStructuredText or Markdown. I'd suggest starting a new (sub-)directory with a useful hierarchy to store them, and putting it under Git control so you always have backups and version management.

A fantastic software development use case is to keep your documentation as reStructuredText or Markdown files inside of your source code repo. You would treat them as regular source files, including tagging for releases. Then add a pandoc run to your CI flows to ensure that the current version of the documentation is built along with your software artifacts. The exact same documentation would then be available as an HTML document tree, and as a shipped PDF file. And of course you have full access to Git tags or commit hashes which you can use in places such as document headers or front matter.
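
A CI step along those lines could be sketched as follows. The docs/ and build/ directories, the function names, and the metadata field called version are assumptions for illustration. A template has to reference the field (as $version$) for it to show up in the output, and the HTML here is a single standalone page rather than a full tree.

```shell
# Sketch: derive a version string from Git; fall back to "unknown" outside a repository.
doc_version() {
    git describe --tags --always --dirty 2>/dev/null || echo unknown
}

# Build PDF and HTML from the same Markdown sources.
# docs/ and build/ are assumed paths, "version" is an assumed metadata field.
build_docs() {
    local version
    version="$(doc_version)"
    mkdir -p build || return
    pandoc docs/*.md --from markdown --metadata "version=$version" \
        --output "build/manual-$version.pdf" || return
    pandoc docs/*.md --from markdown --standalone --metadata "version=$version" \
        --output build/manual.html
}
```

On a tagged commit doc_version prints the tag itself. In between tags it prints the tag plus the number of commits since and a short hash, which makes every built PDF traceable to its exact source state.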

Note that pandoc can be run on lists of input files, processing them into a single output file. So I tend to write my documents as single chapter files; e.g. 00-cover.md, 01-scope.md, 02-intro.md, ..., xx-glossary.md. Running pandoc *.md -o document.pdf automatically assembles them in the correct order. No more futzing around with splitting and merging documents. Want to add a chapter? Simply create a new file.
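
The ordering comes from the shell, not from pandoc: a glob expands in sorted order, and pandoc processes the files in the order it receives them. A tiny helper (the name is mine) previews that order before you build:

```shell
# Sketch: print the chapter files of a directory in the order a *.md glob passes them to pandoc.
list_chapters() {
    local dir="${1:-.}"
    # The shell sorts glob matches, so zero-padded prefixes give the chapter order
    printf '%s\n' "$dir"/*.md
}
```

Zero-padding the prefixes matters: a file named 2-intro.md would sort after 10-appendix.md, whereas 02-intro.md lands where you expect it.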

Similarly, it's also a piece of cake to auto-generate some content. Let's say you want to include a list of error messages as an appendix chapter. A shell script in your build pipeline can extract that information from the source code, beautify it with some markup, and write it to a specific file.
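
Here is a sketch of such a script. It rests on a purely hypothetical convention: every error is documented in the source code with a comment of the form ERROR E1001: Disk full. Your code base will have its own convention, so the grep and sed patterns need adapting. The function name is also mine.

```shell
# Sketch: collect error messages from source comments into a Markdown appendix.
# Hypothetical convention: sources contain comments such as
#   // ERROR E1001: Disk full
gen_error_appendix() {
    local srcdir="$1" outfile="$2"
    {
        printf '# Appendix: Error Messages\n\n'
        printf '| Code | Message |\n|------|---------|\n'
        # -r recurse, -h no file names, -o matched part only, -E extended regex
        grep -rhoE 'ERROR E[0-9]+: .*' "$srcdir" \
            | sort -u \
            | sed -E 's/^ERROR (E[0-9]+): (.*)$/| `\1` | \2 |/'
    } > "$outfile"
}
```

Called as gen_error_appendix src 90-errors.md before the pandoc run (both names are placeholders), the generated file is picked up like any other chapter. Messages containing a pipe character would break the table and need escaping first.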

The possibilities are endless once you move beyond the point & click way of creating documentation.