Is highlight.js Harmful for Your Site?

Syntax highlighting on the client side is bad for users & the web

Abstract

It’s strangely common to see client-side JavaScript libraries used to do syntax highlighting on what is largely static content, be that a blog, documentation, a code forge, or something else. Given the contents don’t change, this is almost entirely a waste of bandwidth & power, & some practices can lead to a site that leaks more of its users’ privacy. The solution should be apparent from the premise (build-time or server-side rendering), but why is it a big deal?

Introduction

If not already obvious, the title of this weblog post is clickbait. There are very good reasons to have highlight.js (‘picked on’ for being probably the most popular), Prism, et al., in a client-side package. These syntax highlighters are meant for highlighting your dynamic content in web applications where a server round trip would be slow & expensive. In fact, the aforementioned options can be used as a part of a build-time or server-side highlighting solution.

Important

The focus of this post, however, will be on the static side of the equation, where syntax highlighting on the client is a problem, not on disparaging a particular syntax highlighting library.

A probably not uncommon scenario

Imagine: you’re a developer & you just built a cool project or have a cool idea & now the time has come to publish it and/or content about it. Since you’re a dev & your audience is largely other devs, you’ll be needing code samples & demonstrations on how to use your project/idea. Your text editor has ‘fancy’ colors for the different syntactic elements of your code & your docs/posts deserve that same treatment. The other devs reading are likely used to this colorful setup too, & prefer it, since the colors help most coders quickly parse the code in their heads. But you’ve run into a problem: HTML’s built-in <pre> tag offers no such highlighting. You are stuck with black & white (or whatever colors you’ve designed with, or whatever colors the consumer’s user agent selected), but this is solvable because you’ve seen it in other places on the web! You fire up your favorite privacy-friendly search engine to blast off a query: “how to add syntax highlighting to a website”. With SEO gamed as ever by the so-called ‘gurus’, the results that bubble up at the time of this writing point specifically to highlight.js or Prism, with copypastable <script> + <style> tags for the head served from a third party. You, not wanting to think about this more than necessary, paste these resources into your project page & as if by magic, your new flagship laptop & flagship smart phone on an unlimited 5G data plan render the rainbow with but a small flicker.

Case closed. Or was it?

—The narrator of this post

What’s happening with these scripts?

A couple of scripts were copied into our sources. First, if these scripts weren’t vendored onto our domain, we’re probably connecting to a third-party CDN, & with none of the examples I saw checking integrity or talking about vendoring, we can probably assume this to be true in many cases. But these CDNs are useless & dangerous: they offer little to no performance improvement, can be hacked (tho integrity checks can mitigate this somewhat), do go down, expose user IP addresses, & require adding exceptions to our content security policy (CSP). If we assume the dev did the right thing & vendored their sources, the files will be downloaded once the user agent encounters them in the <head> (or elsewhere 😞), & then a second script must execute the syntax highlighter on the appropriate elements after both the first script is downloaded+parsed & the DOMContentLoaded event is fired. The elements we selected will need to be queried from the DOM, & their text contents will be run thru a lexer to gather tokens, followed by a parser to make sense of those tokens, which generates a syntax tree of the contents. That tree is then handed to a printer function that needs to wrap all the relevant parts in <span>s class’d in a way that matches a corresponding style sheet. The DOM element’s contents are wiped & then replaced with our printer’s output of <span> elements (and maybe line numbers & other things if enabled).

To the outsider this is magical. The fact that it can happen fairly quickly is a bit of a marvel in & of itself. However, a lot is happening & it’s happening on every request & for every user.
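To make ‘a lot is happening’ a bit more concrete, here is a minimal sketch of the lex & print steps described above. It is not how highlight.js or Prism are implemented: the token classes, the tiny grammar, & the function names are all made up for illustration, & the parser/syntax-tree stage is skipped entirely.

```python
import html
import re

# Hypothetical token classes for a tiny made-up language; real highlighters
# ship far larger grammars for each language they support.
TOKEN_SPEC = [
    ("comment", r'#[^\n]*'),
    ("string", r'"[^"\n]*"'),
    ("number", r'\b\d+\b'),
    ("keyword", r'\b(?:def|return|if|else)\b'),
    ("text", r'\s+|\w+|.'),  # catch-all: whitespace, identifiers, punctuation
]
LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(source: str) -> list[tuple[str, str]]:
    """Split source into (token_class, lexeme) pairs."""
    return [(match.lastgroup, match.group()) for match in LEXER.finditer(source)]


def render(tokens: list[tuple[str, str]]) -> str:
    """Wrap every non-plain token in a <span> that a style sheet can target."""
    parts = []
    for kind, lexeme in tokens:
        escaped = html.escape(lexeme)
        parts.append(escaped if kind == "text" else f'<span class="{kind}">{escaped}</span>')
    return "".join(parts)


def highlight(source: str) -> str:
    """Lex, then print: the work a client-side setup repeats on every page view."""
    return render(lex(source))


print(highlight('def answer(): return 42  # "magic"'))
```

Even this toy version walks every character of every code block. A client-side setup does the full-grammar equivalent in each visitor’s browser, after the scripts have been fetched & parsed.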

Why is it bad to do syntax highlighting like this?

Our biggest offender: idempotency

Idempotency, in mathematics & computer science, is when a certain action repeated multiple times produces the same result. The makers of our syntax highlighter options are smart & have test suites to more or less guarantee that, given a particular version of the parser setup & a certain blob of text to parse, the printer will give us the same output. Under our client-side setup, we are doing this load+query+lex+parse+print+insert loop on every page refresh, & on each page we navigate to. But it’s not just a single user’s time/CPU cycles that are wasted: every user’s machine consuming this content is doing the exact same load+query+lex+parse+print+insert task to get the exact same resulting HTML. This task isn’t cheap either, especially on low-end hardware (an e-reader, for instance, should be an optimal device for reading weblog posts/documentation, & e-readers are not known for being CPU powerhouses). The larger your project & the more users viewing it, the more resources are wasted.

If we were, for example, going to calculate the Fibonacci sequence for large numbers, we might employ a technique like memoization to cache the results of previous iterations so we can look in that cache to find values we’ve calculated already. If we applied such a technique to syntax highlighting, it would look like this: our build tool, in CI or otherwise, would run this highlighting once & we would serve those results to our users, so only the build tool needed to calculate it. Similarly, we may have a dynamic-ish page that can cache these views at the choose-your-layer of the stack. On this principle alone, there are very few situations where the content doesn’t change often in which an end user should ever be doing this parsing.
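The memoization analogy can be sketched in a few lines. Everything here is illustrative: `HighlightCache` & `fake_highlighter` are hypothetical names, & the stand-in highlighter only escapes its input, where a real setup would call an actual highlighting library.

```python
import hashlib
import html
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Memoized Fibonacci: each value is computed once, then read from the cache."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


class HighlightCache:
    """Run an expensive highlighter once per distinct (language, source) pair."""

    def __init__(self, highlighter):
        self._highlighter = highlighter
        self._store: dict[str, str] = {}
        self.misses = 0  # how many times the real work actually ran

    def get(self, language: str, source: str) -> str:
        key = hashlib.sha256("\0".join((language, source)).encode()).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._highlighter(language, source)
        return self._store[key]


def fake_highlighter(language: str, source: str) -> str:
    # Stand-in for a real highlighter call (an assumption for this sketch).
    return f'<code class="language-{language}">{html.escape(source)}</code>'


cache = HighlightCache(fake_highlighter)
for _ in range(1000):  # a thousand "page views" of the same snippet
    cache.get("python", "print('hi')")
print(fib(50), cache.misses)
```

Under these assumptions the highlighter runs once no matter how many views follow. A static site build takes this to its conclusion by writing the cached HTML to disk so nothing is computed at request time.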

Delaying our experience & flashing content

Due to awaiting the page & the script’s loading + full execution, we will always cause repaints & flashes for users. There is no way around this with client-side rendering. This can lead to anything from mild annoyance, to users dropping off while waiting for loading, to unnecessarily chewing thru a user’s battery. A slow initial paint can lead to worse performance metrics & a lower search ranking in some cases as well.

Network implies latency & it’s not free

While it’s nice that syntax highlighters usually break up their scripts per language to save on size, even optimized, these requests are delaying your page load times. In many parts of the world (or just for folks who don’t like to be wasteful), downloading these scripts takes considerable time. While this is true of all scripts, not all scripts are as useless as syntax highlighters on static content.

No experience for the wise JavaScript allowlisters

Without invoking imagery of the likely-to-age-poorly “Chad”, we really should recognize that even with the contemporary (over)reliance on JavaScript, disabling JavaScript by default is a best practice for internet hygiene (JS is for web applications …and progressively enhancing web pages). As such, making experiences for these folks, likely power users, is ideal (within reason; we don’t need to resort to checkbox hacks & such). In the scenario of build-time or server-side rendering, you give these users the same, optimal experience. As a bonus, this can help the low-end, broken-X11, or saving-every-watt-of-battery folks, since even elinks, the TUI browser, supports CSS & could get the nice highlighted experience.

Tip
To become a JS allowlister using uBlock Origin:
  • Navigate to the add-on’s settings (triple cogwheel ⚙️)
  • Settings → Default behavior → ✓ Disable JavaScript

When a site that requires JS is encountered:

  • Open the fly-out menu
  • Click the </> button labeled “Click to no longer disable JavaScript on this site”, then refresh the page
  • Optionally click the 🔒 lock icon to save the settings & permanently allow the site to execute JavaScript (at least the scripts that get thru the filtering process)

Solving by highlighting syntax just once

There is a wealth of highlighting options like Tree-sitter, highlight, Pygments, Rouge, Chroma, just to name a few. We also shouldn’t forget the JS options leading this post, highlight.js & Prism, which function just as well. All of these have, or could be adapted to, a CLI or come in library form, meaning they can be used at build time for a static site (like documentation) or run quickly on the server side & sent to the user. Doing highlighting at this phase fixes all of the drawbacks in the previous section of this post on the pitfalls of client-side highlighting.
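As a rough idea of what such a build step looks like, the sketch below rewrites code blocks in already-rendered HTML. To stay self-contained it uses a stand-in highlighter that only marks Python keywords (& will happily mark them inside strings or comments). In a real build, one of the libraries above would take its place. The `language-python` class convention, the `k` class name, & the function names are assumptions for illustration.

```python
import html
import keyword
import re

# Matches <pre><code class="language-python"> blocks in already-rendered HTML.
CODE_BLOCK = re.compile(
    r'<pre><code class="language-python">(?P<body>.*?)</code></pre>',
    re.DOTALL,
)
KEYWORD = re.compile(r"\b(?:" + "|".join(keyword.kwlist) + r")\b")


def highlight_python(source: str) -> str:
    """Stand-in highlighter: escapes the text & marks Python keywords only."""
    escaped = html.escape(source, quote=False)
    return KEYWORD.sub(lambda m: f'<span class="k">{m.group()}</span>', escaped)


def bake(page: str) -> str:
    """Replace every matching code block with highlighted markup, once, at build time."""

    def replace(match: re.Match) -> str:
        source = html.unescape(match.group("body"))
        return f'<pre><code class="language-python">{highlight_python(source)}</code></pre>'

    return CODE_BLOCK.sub(replace, page)


page = '<p>Example:</p><pre><code class="language-python">if a &lt; b:\n    return a</code></pre>'
print(bake(page))
```

The output is plain HTML plus a style sheet, so it renders the same with JavaScript disabled. A build tool would run something like `bake` over each generated page before writing it to disk.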

Can we think of (or create) some pitfalls?

These solutions, unless offered as a configuration option in your framework/build tool, will require a bit more effort to set up versus copypasta JavaScript libraries. The rest of the pitfalls become a stretch. Like a tree falling in a forest, if no one ever visits your content, you have wasted your own CPU cycles to render the text, but with just 2 page visits (one might even be yourself testing that everything looks okay), moving the syntax highlighting away from the client is already a win. Caching all of the highlighting could be expensive for the developer, tho maybe pushing it to the client as a ‘not my problem now’ solution isn’t the right thing to do. This really only applies to server-side rendering, as a static site by its nature generates all the files ready to be cached, with the whole point being to eliminate the need for the moving parts of a server by building once.

Transgressors

Bad but excusable is the solo developer or small team, but there are some big projects that are perpetuating this bad practice, & some of those projects are used for the documentation of lots of downstream projects. I want to call them out openly since I would like to see the landscape change. All of these projects could & should be baking syntax highlighting into their systems, & they have enough people to look at the problem. I hope in the future to strike thru these tools after they fix it.

MkDocs
It doesn’t work very well without JavaScript (menus that require dropdowns simply do nothing, not offering a click-thru to the menu despite being marked up with an <a> anchor tag), but I’ll spare you the other gripes. Not only does it do syntax highlighting client-side via highlight.js, but by default, it’s using a public CDN without integrity checks. The scripts themselves are blocking in the <head> without defer.
mdBook
Popular in the Rust community & I guess will be used for Nixpkgs, mdBook’s syntax highlighting section states it’s shipping with highlight.js (no CDN), which can be extended by the user. The rendering situation is also made worse by <script> tags at the end of the body instead of in the <head> with defer; this means the scripts aren’t blocking, which is good, but when a script is in the <head> it clues the user agent into starting to download these resources so they are ready for when the document is loaded. Alternatively a prefetch link header would have a similar effect, but it’s missing (the extra bytes might just favor defer anyhow in some situations). There are even merge requests that have been open for years about build-time rendering, but nothing has happened with any of them.
docsify
Beware

These docs require JavaScript to read, ironically for a static site/documentation generator labeling itself as “simple and lightweight”

A JavaScript-based static site generator that, according to its language highlighting docs, bakes in Prism, & not only uses a public CDN but encourages it.

The Pijul Nest
Recently this forge did upgrade to no longer requiring Cloudflare’s public CDN for its highlight.js scripts, choosing instead to vendor such scripts. However, a) the entire architecture is now in Cloudflare’s ‘edge’ offerings & b) if you’re going to use Cloudflare’s architecture, at least cache these views on the edge instead of hurting the user experience.
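For projects that keep a highlighter script in the browser, the integrity checks & defer attribute mentioned in this list are cheap to generate at build time. The sketch below computes a Subresource Integrity value (an algorithm name, a hyphen, & the base64-encoded digest) with only the standard library. The file path & script contents are hypothetical.

```python
import base64
import hashlib


def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Return a Subresource Integrity value such as 'sha384-<base64 digest>'."""
    if algorithm not in {"sha256", "sha384", "sha512"}:
        raise ValueError(f"unsupported SRI algorithm: {algorithm}")
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def script_tag(src: str, content: bytes) -> str:
    """Build a deferred <script> tag pinned to the exact bytes in content."""
    return (
        f'<script defer src="{src}" '
        f'integrity="{sri_hash(content)}" crossorigin="anonymous"></script>'
    )


vendored = b"console.log('highlight');"  # hypothetical file contents
print(script_tag("/js/highlighter.js", vendored))
```

This only limits the damage of a tampered or swapped file. It does nothing about the repeated lex+parse+print work, which is still best moved to the build.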

Takeaway

Client-side syntax highlighting has costs that go unseen by many developers. We can solve a lot of these costs by moving the syntax highlighting to the server and/or build tool. Developers should be considerate of a user’s time, their data usage, their power usage, & the impact frivolous computations have on the planet. These ideas can & should expand to other parts of the document with heavy parsing/rendering requirements such as LaTeX, MathJax, diagrams (Mermaid, Graphviz, etc.), & more.