How We Ported HighlightJS to Dart

09 May
HighlightJS is one of the most popular packages for code highlighting and is used on many websites that display code snippets. At Akvelon, we needed the same for apps written in Dart.
There is a Dart highlight package that was made by porting HighlightJS; however, it was abandoned and had not been maintained for 2 years. We needed a fresh version with bugs fixed, so we decided to fork and revive that package and then catch up on HighlightJS changes.
This resulted in the highlighting and flutter_highlighting packages, which you can now use too. In this article, we share the details of how we made them.
The original JavaScript package consists of its core and 197 language definitions.
Each language is declared as an object that conforms to the Language interface. It does not contain the whole syntax tree definition, but only rules to parse comments, string literals, and some other things sufficient for highlighting. This is more efficient than a whole syntax tree parser.
Each such rule is called a Mode. There are modes for C-style comments, for string literals, for keywords, etc. A Mode may contain other Modes. For instance, a doc comment may contain references to variables, and a string literal may contain variable interpolation. A Mode may even contain a piece in another language: think of a doc comment with Markdown code. To accommodate this, the Language interface extends the Mode interface.
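To make the nesting concrete, here is a minimal sketch written as plain JavaScript objects. The property names scope, begin, end, contains, and subLanguage are real HighlightJS mode properties; the three modes themselves are hypothetical and are not taken from any shipped language definition.

```javascript
// Hypothetical modes, written only to illustrate nesting.
const VARIABLE_REFERENCE = {
  scope: 'variable',
  begin: /\[/,
  end: /\]/
};

// A doc comment that may contain references to variables.
const DOC_COMMENT = {
  scope: 'comment',
  begin: /\/\/\//,
  end: /$/,
  contains: [VARIABLE_REFERENCE]
};

// A block comment whose body is handed to another language.
const MARKDOWN_COMMENT = {
  scope: 'comment',
  begin: /\/\*\*/,
  end: /\*\//,
  subLanguage: 'markdown'
};

// A Language is itself a Mode: it has `contains` like any other.
const language = {
  name: 'Example',
  contains: [DOC_COMMENT, MARKDOWN_COMMENT]
};
```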
The core of the package contains the parser and some other processors of the Language objects. The output of the parser is a Result object, which can be turned into an HTML string.
The architecture of HighlightJS is mirrored in Dart for ease of maintenance. We have the core, the language definitions, and the Result class with the same fields.
Additionally, we have a Flutter package that wraps the whole thing in a widget. It accepts the code string and builds a RichText widget with a tree of colored TextSpan objects.
The original Dart package was ported from HighlightJS 9.18, while the current version is 11.7, so a lot has changed.
Because the language definitions change the most, the package comes with a porting script for them. However, the tool no longer works out of the box. With minor changes, we could port many languages, but others required extensive work.
There is no porting tool for the core because it does not follow a simple format like the language definitions, and there is no general transpiler from JavaScript to Dart.
The core of HighlightJS had diverged to the point that it was easier to discard the old Dart core and write a new one. This is why we chose to transpile the core from JavaScript to Dart manually. We just replicated all classes and functions, mostly line by line.
We rely on tests to see if anything has changed in the JavaScript core that needs to be reflected in the Dart core.
In HighlightJS, each language is defined in a separate file as a factory function that returns an object conforming to the Language interface. For instance, here is the definition of Dockerfile syntax:
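The following sketch approximates that definition from the HighlightJS source; the keyword lists and regular expressions may differ from the current release. The hljs object here is a small stand-in we define ourselves so the snippet runs on its own. In the real library the factory is a module's default export and receives the real hljs.

```javascript
// Stand-in for the real `hljs` object: only the four shared modes used below,
// in simplified form.
const hljs = {
  HASH_COMMENT_MODE: { scope: 'comment', begin: '#', end: '$' },
  APOS_STRING_MODE: { scope: 'string', begin: "'", end: "'", illegal: '\\n' },
  QUOTE_STRING_MODE: { scope: 'string', begin: '"', end: '"', illegal: '\\n' },
  NUMBER_MODE: { scope: 'number', begin: '\\b\\d+(\\.\\d+)?', relevance: 0 }
};

// Factory function: takes the library object, returns a Language object.
function dockerfile(hljs) {
  return {
    name: 'Dockerfile',
    aliases: ['docker'],
    case_insensitive: true,
    keywords: ['from', 'maintainer', 'expose', 'env', 'arg', 'user', 'onbuild', 'stopsignal'],
    contains: [
      hljs.HASH_COMMENT_MODE,
      hljs.APOS_STRING_MODE,
      hljs.QUOTE_STRING_MODE,
      hljs.NUMBER_MODE,
      {
        // The only mode specific to Dockerfile: the rest of the line is shell code.
        beginKeywords: 'run cmd entrypoint volume add copy workdir label healthcheck shell',
        starts: { end: /[^\\]$/, subLanguage: 'bash' }
      }
    ],
    illegal: '</'
  };
}

const language = dockerfile(hljs);
```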
The key property here is contains, which is a list of the Mode objects that can be parsed from text. In this example, only the last element is specific to the language, while the other four are reusable constants because they are common to many languages:
For instance, this is the definition for double-quoted string literals:
Here:
After all the contains entries in the Language are processed, keywords come into play and consume whatever is not claimed by literals and comments.
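For reference, the double-quoted string mode can be sketched as follows. This is modeled on the constants exported by HighlightJS 11; older releases used className where this sketch uses scope, so treat the exact property set as an assumption.

```javascript
// A backslash followed by any character, so that \" does not end the string.
const BACKSLASH_ESCAPE = { begin: '\\\\[\\s\\S]', relevance: 0 };

const QUOTE_STRING_MODE = {
  scope: 'string',              // name attached to the highlighted span
  begin: '"',                   // pattern that opens the mode
  end: '"',                     // pattern that closes the mode
  illegal: '\\n',               // a raw line break means this is not a string
  contains: [BACKSLASH_ESCAPE]  // nested modes allowed inside the string
};
```

The patterns are stored as strings and are turned into regular expressions when the language is compiled.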
The easiest way to parse and transpile such language definitions is to write a tool in TypeScript. This tool is a client of the HighlightJS library. It loops through language definitions and calls each one’s factory function to get a JavaScript object. This object is longer than its source because every constant for a common mode is expanded in place.
For the above Dockerfile syntax definition, we get this Language object at runtime:
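The expansion is easy to reproduce on a smaller scale. In this sketch, both the hljs stand-in and the tiny factory are hypothetical; the point is only that the constant's name is gone once the factory has run.

```javascript
// Stand-in for the real library object.
const hljs = {
  HASH_COMMENT_MODE: { scope: 'comment', begin: '#', end: '$' }
};

// Hypothetical, trimmed-down language factory.
function tiny(hljs) {
  return { name: 'Tiny', contains: [hljs.HASH_COMMENT_MODE] };
}

// The serialized object holds the mode's fields, not the constant's name.
const expanded = JSON.stringify(tiny(hljs), null, 2);
console.log(expanded);
```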
Then, the idea is to walk through this JavaScript object and to write the definition of the Language object in Dart based on that.
If we simply generate the Dart equivalent of this object definition, it will be just as long. We can simplify this if we detect those common modes that were just expanded by JavaScript.
In the case of Dockerfile, we should identify HASH_COMMENT_MODE, APOS_STRING_MODE, QUOTE_STRING_MODE, and NUMBER_MODE.
To be able to use these building blocks in the Dart language definitions, we must port them to Dart.
We do this by inspecting the global hljs object because each common mode is an exported constant that ends up being a property of hljs.
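One way to do this inspection is to index the exported objects by identity, as sketched here. The filter (upper-case property name, plain object value) is our assumption about how to tell modes apart from other exports, and the hljs object is again a stand-in.

```javascript
// Stand-in for the real `hljs` object; only three exports are modeled.
const hljs = {
  IDENT_RE: '[a-zA-Z]\\w*', // a string, not a mode
  HASH_COMMENT_MODE: { scope: 'comment', begin: '#', end: '$' },
  NUMBER_MODE: { scope: 'number', begin: '\\b\\d+(\\.\\d+)?', relevance: 0 }
};

// Map each exported mode object to the name of the constant that holds it.
function indexCommonModes(lib) {
  const names = new Map();
  for (const [key, value] of Object.entries(lib)) {
    const isPlainObject =
      value !== null && typeof value === 'object' && !(value instanceof RegExp);
    if (isPlainObject && /^[A-Z0-9_]+$/.test(key)) names.set(value, key);
  }
  return names;
}

// Replace every entry of `contains` that is a known common mode with its name.
function nameCommonModes(language, names) {
  return language.contains.map(mode => names.get(mode) ?? mode);
}

const names = indexCommonModes(hljs);
const language = {
  name: 'Tiny',
  contains: [hljs.HASH_COMMENT_MODE, hljs.NUMBER_MODE, { beginKeywords: 'run' }]
};
const named = nameCommonModes(language, names);
console.log(named);
```

Because the lookup is by object identity, it only works while the language definition still holds the very same objects that hljs exports.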
Then we write a common_modes.dart file with definitions like these:
Or with a pass to detect nested common modes:
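A generator for such definitions might look like the sketch below. The emitted Dart shape, a Mode constructor with named arguments, is an assumption for illustration and may not match the package's generated file exactly. Note that the dollar sign has to be escaped, because it starts interpolation in Dart string literals.

```javascript
// Emit a Dart string literal; `$` would otherwise start interpolation in Dart.
function dartString(s) {
  return JSON.stringify(s).replace(/\$/g, () => '\\$');
}

// Turn a JavaScript value into Dart source text. Known common modes are
// emitted by name, except for `self`, the mode currently being defined.
function dartValue(value, names, self) {
  if (value === null) return 'null';
  if (typeof value === 'string') return dartString(value);
  if (typeof value === 'number' || typeof value === 'boolean') return String(value);
  if (value instanceof RegExp) return dartString(value.source);
  if (Array.isArray(value)) {
    return '[' + value.map(v => dartValue(v, names, self)).join(', ') + ']';
  }
  if (value !== self && names.has(value)) return names.get(value);
  const args = Object.entries(value).map(([k, v]) => `${k}: ${dartValue(v, names, self)}`);
  return `Mode(${args.join(', ')})`;
}

function dartConstant(name, mode, names) {
  return `final ${name} = ${dartValue(mode, names, mode)};`;
}

const BACKSLASH_ESCAPE = { begin: '\\\\[\\s\\S]', relevance: 0 };
const QUOTE_STRING_MODE = {
  scope: 'string', begin: '"', end: '"', illegal: '\\n', contains: [BACKSLASH_ESCAPE]
};
const names = new Map([
  [BACKSLASH_ESCAPE, 'BACKSLASH_ESCAPE'],
  [QUOTE_STRING_MODE, 'QUOTE_STRING_MODE']
]);

const quoteSource = dartConstant('QUOTE_STRING_MODE', QUOTE_STRING_MODE, names);
console.log(quoteSource);
```

The nested BACKSLASH_ESCAPE is written as a reference to the other constant instead of being expanded again.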
Now that we have ported the common modes, which are the building blocks of a language, we can do the same with the language definitions.
However, a huge difference is that language definitions may have circular references.
A good example is the Dart language, which allows interpolating strings with arbitrary expressions, which may in turn contain strings, and so on.
This definition creates a circular reference (comments are mine):
This means that we cannot just iterate this object to write the Dart definition because the recursion will be infinite.
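The cycle and its consequence can be reproduced in a few lines. The two modes are simplified stand-ins for the string and interpolation modes of the Dart definition, not copies of them.

```javascript
// Interpolation inside a string: ${ ... }
const SUBST = { scope: 'subst', begin: '\\$\\{', end: '\\}', contains: [] };

// A string literal that may contain interpolation.
const STRING = { scope: 'string', begin: "'", end: "'", contains: [SUBST] };

// An interpolated expression may contain strings again: this closes the cycle.
SUBST.contains.push(STRING);

// A naive serializer never terminates on such a structure;
// JSON.stringify detects the cycle and throws a TypeError instead.
let error = null;
try {
  JSON.stringify({ name: 'Dart', contains: [STRING] });
} catch (e) {
  error = e;
}
console.log(error instanceof TypeError);
```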
To break the circular references, we will use the circular-json package. It is an object serializer that detects if some object is repeated in the structure.
This simple snippet shows what the package does:
In this example:
In general, it replaces all repeating objects with the path of their first occurrence in the structure.
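The idea can be sketched by hand as follows. This is our own illustration of path replacement, not the circular-json implementation; in particular, it does not escape ordinary strings that happen to start with a tilde, which a real serializer has to handle.

```javascript
// Serialize `root`, replacing every repeated object with the path
// of its first occurrence. The root itself has the path "~".
function stringifyWithPaths(root) {
  const seen = new Map(); // object -> path of first occurrence

  function visit(value, path) {
    if (value === null || typeof value !== 'object') return value;
    if (seen.has(value)) return seen.get(value);
    seen.set(value, path);

    const childPath = key => (path === '~' ? '~' + key : path + '~' + key);
    if (Array.isArray(value)) {
      return value.map((item, index) => visit(item, childPath(index)));
    }
    const copy = {};
    for (const [key, item] of Object.entries(value)) {
      copy[key] = visit(item, childPath(key));
    }
    return copy;
  }

  return JSON.stringify(visit(root, '~'));
}

const shared = { scope: 'string' };
const root = { contains: [shared, { variants: [shared] }] };
root.contains[1].self = root; // circular reference back to the root

const json = stringifyWithPaths(root);
console.log(json);
```

Here the second use of shared becomes "~contains~0", and the reference back to the root becomes "~".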
When we serialize the Dart language definition this way, we get this JSON below (comments are mine):
In this JSON, we get a lot of such tokens: ~contains~0~variants~4~contains~2
Sometimes they save us from circular references. Other times, they just shorten the language definition by avoiding repetitions of long JSON fragments.
Also while serializing, we replace occurrences of common modes with their names to make the definition even shorter:
This whole language definition can now be parsed back with the ordinary JSON.parse() into a non-circular object. We can walk it and write the Dart equivalent:
In this definition, Modes come in different ways:
At runtime, a language must be ‘compiled’ before it can be used for highlighting.
This means that all the ‘reference’ modes should be replaced with their dictionary entries: each Mode(ref: "~contains~...") is replaced with the corresponding Mode object from the Language.refs map.
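The package does this in Dart; to keep one language across the examples, the sketch below shows the same linking step in JavaScript. The layout of the refs dictionary and the set of properties that are followed (contains, variants, starts) are assumptions based on the description above.

```javascript
// Replace every { ref: "..." } placeholder with the object from language.refs.
function compile(language) {
  const resolve = mode => (mode.ref ? language.refs[mode.ref] : mode);
  const done = new Set(); // guards against infinite recursion on cycles

  function link(mode) {
    if (done.has(mode)) return;
    done.add(mode);
    for (const key of ['contains', 'variants']) {
      if (!mode[key]) continue;
      mode[key] = mode[key].map(resolve);
      mode[key].forEach(link);
    }
    if (mode.starts) {
      mode.starts = resolve(mode.starts);
      link(mode.starts);
    }
  }

  Object.values(language.refs).forEach(link);
  link(language);
  return language;
}

// Hypothetical language with a string <-> interpolation cycle.
const language = {
  name: 'Example',
  refs: {
    '~contains~0': {
      scope: 'string', begin: "'", end: "'",
      contains: [{ ref: '~contains~0~contains~0' }]
    },
    '~contains~0~contains~0': {
      scope: 'subst', begin: '\\$\\{', end: '\\}',
      contains: [{ ref: '~contains~0' }]
    }
  },
  contains: [{ ref: '~contains~0' }]
};

compile(language);
```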
With circular references back in place, we can do recursive highlighting of string literals that contain interpolation that contains string literals, etc.
At this point, we are ready to highlight most of the languages that HighlightJS supports. The actual parsing of code and matching against modes is done in the package core that we transpiled to Dart manually.
HighlightJS has golden tests. For each supported language, it has snippets of input code and the reference HTML to be produced when highlighting them.
We need to take those snippets and feed them to the Dart package to see if it highlights them to the same HTML as the original HighlightJS.
It is as simple as:
This setup is a one-off: no language-specific work is required.
The original Dart package did that. We additionally made the tool write the actual output when it does not match the golden one.
For each broken language, we get a directory with all of its failing tests. For each test, there are three files: the original code snippet, the actual highlighted HTML, and the expected highlighted HTML, so we can easily compare:
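The comparison step can be sketched as a pure function that returns the files to write. The directory layout and file names are hypothetical, and the highlighter is a fake that stands in for the package under test.

```javascript
// Run every golden case; for each mismatch, collect three files:
// the input snippet, the actual HTML, and the expected HTML.
function collectMismatches(cases, highlight) {
  const files = {};
  for (const { language, name, code, expected } of cases) {
    const actual = highlight(code, language);
    if (actual === expected) continue;
    const dir = `mismatches/${language}/${name}`;
    files[`${dir}/code.txt`] = code;
    files[`${dir}/actual.html`] = actual;
    files[`${dir}/expected.html`] = expected;
  }
  return files;
}

// Fake highlighter: wraps everything in a keyword span.
const fakeHighlight = code => `<span class="hljs-keyword">${code}</span>`;

const cases = [
  { language: 'dart', name: 'ok', code: 'var',
    expected: '<span class="hljs-keyword">var</span>' },
  { language: 'php', name: 'heredoc', code: '<<<EOT',
    expected: '<span class="hljs-string">&lt;&lt;&lt;EOT</span>' }
];

const files = collectMismatches(cases, fakeHighlight);
console.log(Object.keys(files));
```

Only the failing PHP case produces files, so the output directory doubles as a to-do list.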
At this point most of the languages were working, and the tests showed us what exactly was still broken. We inspected each broken language to identify and port rarely used mechanisms in HighlightJS.
Unfortunately, not all language definitions are declarative. Some modes use callbacks when matching against code.
For instance, in PHP a multi-line string starts with a token that will also end it (called the “Heredoc” syntax):
It’s hard to come up with a declarative solution to match such patterns. So this is how HighlightJS defines a Mode for such syntax:
The END_SAME_AS_BEGIN function adds two callbacks to the passed Mode:
This results in the following effective definition of this Mode:
Note the entries on:begin and on:end.
The first callback is called when the beginning of the Mode is matched. It stores the token that matched the regular expression.
The second callback is called when the end of the Mode is matched. It makes the core ignore the match if the ending token differs from the one that started the match.
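The two callbacks can be sketched and exercised as follows. The helper mirrors END_SAME_AS_BEGIN as we understand it from the HighlightJS source. The heredoc patterns are simplified, and makeResponse is a hypothetical stand-in for the object the core passes to callbacks.

```javascript
// Adds the two callbacks to the given mode.
function END_SAME_AS_BEGIN(mode) {
  return Object.assign(mode, {
    // Remember the token captured by the first group of `begin`.
    'on:begin': (m, resp) => { resp.data._beginMatch = m[1]; },
    // Reject the end match if its token differs from the remembered one.
    'on:end': (m, resp) => { if (resp.data._beginMatch !== m[1]) resp.ignoreMatch(); }
  });
}

// Simplified heredoc mode; the real PHP definition is more involved.
const heredoc = END_SAME_AS_BEGIN({
  begin: /<<<[ \t]*(\w+)\n/,
  end: /[ \t]*(\w+)\b/
});

// Hypothetical stand-in for the core's response object.
function makeResponse(data) {
  return { data, ignored: false, ignoreMatch() { this.ignored = true; } };
}

const data = {}; // state shared between the begin and end of one match
heredoc['on:begin'](heredoc.begin.exec('<<<EOT\n'), makeResponse(data));

const wrongEnd = makeResponse(data);
heredoc['on:end'](heredoc.end.exec('END'), wrongEnd);

const rightEnd = makeResponse(data);
heredoc['on:end'](heredoc.end.exec('EOT'), rightEnd);

console.log(wrongEnd.ignored, rightEnd.ignored);
```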
Since these callbacks contain arbitrary code, we cannot port them automatically. Our solution was to port them manually and to have a dictionary to map the JavaScript function bodies to Dart function names.
The most common callbacks were produced by functions like END_SAME_AS_BEGIN. For them we made the following map:
In JavaScript, calling toString() on a function returns its source code. We use this fact to populate the map.
We added a new check when serializing a language definition to JSON. We inspect all properties and find callbacks. If the callback’s body is found in this map, we replace it with the Dart function name.
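A sketch of that lookup is given here. The Dart function names in the map and the placeholder written for unknown callbacks are hypothetical. The useful property is that two closures created by separate calls to the same helper are different objects but have identical source text.

```javascript
function END_SAME_AS_BEGIN(mode) {
  return Object.assign(mode, {
    'on:begin': (m, resp) => { resp.data._beginMatch = m[1]; },
    'on:end': (m, resp) => { if (resp.data._beginMatch !== m[1]) resp.ignoreMatch(); }
  });
}

// Source text of a known callback -> name of its hand-written Dart port.
const knownCallbacks = new Map();
const sample = END_SAME_AS_BEGIN({});
knownCallbacks.set(sample['on:begin'].toString(), 'endSameAsBeginOnBegin');
knownCallbacks.set(sample['on:end'].toString(), 'endSameAsBeginOnEnd');

// Replace function-valued properties with the Dart name, if known.
function replaceCallbacks(mode) {
  const out = {};
  for (const [key, value] of Object.entries(mode)) {
    if (typeof value !== 'function') {
      out[key] = value;
    } else {
      out[key] = { callback: knownCallbacks.get(value.toString()) ?? null };
    }
  }
  return out;
}

// A different mode, built by a separate call: new closures, same source text.
const heredoc = END_SAME_AS_BEGIN({ begin: '<<<(\\w+)\\n', end: '(\\w+)\\b' });
const serializable = replaceCallbacks(heredoc);
console.log(JSON.stringify(serializable));
```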
Other callbacks are specific to the languages they are used in. For example, in the definition of the JavaScript language there is a huge callback that checks if something is a JSX tag.
As with any callback, we manually transpiled it to Dart. For language-specific callbacks, we just generate the Dart function names from the path where this callback was defined:
language_javascript_contains_0_contains_0_variants_0_onBegin
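A name like this can be derived mechanically from the path. The helper below is a sketch; its name and signature are ours.

```javascript
// Build a Dart-friendly identifier from a language id and a property path.
function callbackName(languageId, path) {
  const parts = path.map(segment =>
    // "on:begin" -> "onBegin"; other segments are left unchanged.
    String(segment).replace(/:([a-z])/g, (_, letter) => letter.toUpperCase())
  );
  return ['language', languageId, ...parts].join('_');
}

const name = callbackName(
  'javascript',
  ['contains', 0, 'contains', 0, 'variants', 0, 'on:begin']
);
console.log(name);
```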
HighlightJS can auto-detect the language. Initially, we wanted to skip this to complete the explicit highlighting faster. But it turned out that auto-detection is necessary even for that.
For example, the XML language can treat the style tag in two ways:
This is how these rules are defined:
Therefore, even to highlight a snippet as explicit XML, we must be able to detect different languages in its content.
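The sketch below shows the shape of such a rule and a toy version of the choice that has to be made. The mode is modeled on the style rule in the HighlightJS XML definition and may differ in detail. The scoring functions are hypothetical and far cruder than the relevance counting in the real core, which highlights the text in every candidate language and compares the results.

```javascript
// Modeled on the <style> rule of the XML definition; details may differ.
const STYLE_TAG = {
  className: 'tag',
  begin: /<style(?=\s|>)/,
  end: />/,
  starts: {
    end: /<\/style>/,
    returnEnd: true,
    subLanguage: ['css', 'xml'] // more than one candidate: the core must auto-detect
  }
};

// Toy stand-in for auto-detection: score each candidate, keep the best one.
function detect(text, candidates, scorers) {
  let best = { language: candidates[0], score: -Infinity };
  for (const language of candidates) {
    const score = scorers[language](text);
    if (score > best.score) best = { language, score };
  }
  return best.language;
}

// Hypothetical scorers that count characters typical for each language.
const scorers = {
  css: text => (text.match(/[{};:]/g) || []).length,
  xml: text => (text.match(/[<>]/g) || []).length
};

const candidates = STYLE_TAG.starts.subLanguage;
console.log(detect('body { color: red; }', candidates, scorers));
console.log(detect('<b>x</b>', candidates, scorers));
```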
We ported the auto-detection part of the core as well, and that fixed a few more languages for us.
As a result, this is how we improved over the package that was 2 years old:
Improved highlighting details in languages with new callbacks. Without the check on the ending token, older HighlightJS produced false positives on string endings in PHP and other languages. That led to errors that were extremely hard to find:
A lot of apps can make use of code syntax highlighting, including but not limited to:
This article was written by:
Lead Flutter Developer at Akvelon