EBNF and RegEx

So with the 1.13 update to Go, a few big things have changed with numeric literals. You can read about it on their release docs. Here are the relative changes:

* Binary integer literals: The prefix 0b or 0B indicates a binary integer literal such as 0b1011.

* Octal integer literals: The prefix 0o or 0O indicates an octal integer literal such as 0o660. The existing octal notation indicated by a leading 0 followed by octal digits remains valid.

* Digit separators: The digits of any number literal may now be separated (grouped) using underscores, such as in 1_000_000, 0b_1010_0110, or 3.1415_9265. An underscore may appear between any two digits or the literal prefix and the first digit.

That is quite a few changes, but we can summarize this to:

  • Binary literals are a thing. They start with 0b or 0B
  • Octal literals now have a label too! They start with 0o or 0O or a zero followed by only octal digits.
  • Underscores are cool (but only if they follow the rules)

So what does this mean for existing Go workspaces? Well not much actually. If you didn't know about these changes, nothing would be different. Everything added is backwards compatible with previous versions of Go, meaning these additions do not modify existing code. If you wanted to take advantage of these changes for new code, all you would need to do is update to the 1.13 toolset.

However I said 'workspace' and last time I checked, my workspace includes an editor. VSCode to be exact. The Go syntax parser in VSCode is based off Atom's language-go package. At time of writing, the package has not been updated to include these new language changes. While thats not a major problem (the changes are backwards compatible after all), you won't get highlighting for literals with these changes.

What if I want them though? Well, I would have to modify the VSCode Go package with the correct pattern matching expressions or regular expressions (RegEx). I might cover how to do this in another post, but for now let's look at the expressions.

Disclaimer Go is a regular language, defined by a regular grammar. Just because you can write its grammar rules in RegEx, does not me you should. Regular expressions are not the same as regular grammar. A grammar is a set of rules for building a string. An expression is a set of rules for matching a string. In addition, there are multiple "correct" ways to write regex translations from grammar, just as there are multiple ways to write grammar from regex.

The current expression used for integer literals is as follows:

\b((0x[0-9a-fA-F]+)|(0[0-7]+i?)|(\d+([Ee]\d+)?i?)|(\d+[Ee][-+]\d+i?))\b
VSCode Go highlighter Integer Literal RegEx

In English, this means any word boundary; followed by a hex signature and characters, or an octal signature and digits, or a decimal with an exponent; followed by a word boundary. Here is an example of this in action.

Thats cool and all, but how are we going to match the changes to match the new specification. In fact, what are the specifications? Well, let's look at the language spec.

decimal_digit = "0" … "9" .
binary_digit  = "0" | "1" .
octal_digit   = "0" … "7" .
hex_digit     = "0" … "9" | "A" … "F" | "a" … "f" .

int_lit        = decimal_lit | binary_lit | octal_lit | hex_lit .
decimal_lit    = "0" | ( "1" … "9" ) [ [ "_" ] decimal_digits ] .
binary_lit     = "0" ( "b" | "B" ) [ "_" ] binary_digits .
octal_lit      = "0" [ "o" | "O" ] [ "_" ] octal_digits .
hex_lit        = "0" ( "x" | "X" ) [ "_" ] hex_digits .

decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
binary_digits  = binary_digit { [ "_" ] binary_digit } .
octal_digits   = octal_digit { [ "_" ] octal_digit } .
hex_digits     = hex_digit { [ "_" ] hex_digit } .

The specification is written in Extended Backus–Naur form or EBNF. Each term definition is terminated with a . and terms can be composed of other terms. What we care about here is the int_lit term. Meaning we just need to build expressions that match the combined terms (decimal_lit, binary_lit, octal_lit, and hex_lit).

Note that the language spec does not address exponents here, but instead includes them with floating-point literals. Because of this, I will not be including exponents in this post. I will cover them in another one.

Let's start with decimal literals. A translation of the decimal_lit term into RegEx might look like this:

\b(0|[1-9](_?\d)*)\b
Decimal Literal RegEx

This means that we match either a 0 or we match any number in the set ( [ ] ), which is a range of 1 to 9, followed by zero or more repetitions ( * ) of an optional ( ? ) underscore and a required digit ( \d ). The \b means word boundary, or match a change between word and non-word characters. The actual number match is in a group because there is an alternation ( | ) in there. Without the group ( ( ) ), we would be matching either \b0 or 1-9*\b. Here is a working example of the decimal literal regex.

Awesome. How about binary? To do that, we have to limit the accepted digits to 0 and 1, as well as make sure we have either a 0b or 0B prefixing the whole thing. Here is a translation of the binary_lit term:

\b0[bB](_?[01])+\b
Binary Literal RegEx

Same as before, we start with a 0, but this time we don't need a group/alternation as there is just one expected format. Next we check for a b or a B, followed by at least one match of an optional underscore followed by either a 0 or a 1. Here is a working example.

Next we need to work out that octal literal. This is actually more similar to our binary literal regex than our decimal regex. The difference from our binary pattern is that the base identifier is optional, and we accept digits in the range from 0 to 7. Here is a translation of the octal_lit term:

\b0[oO]?(_?[0-7])+\b
Octal Literal RegEx

Same start: word boundary, base character but this time followed by a ?. This indicates that the token preceding it is optional. And since that token is a set, it can optionally match o, or O. Next is the repeated group of optional underscore and mandatory digit between 0 and 7. Here is the example.

Finally, let's look at the hexadecimal literal. Again, our regex will be similar to the binary literal regex. The changes this time will be the base identifier and the accepted characters. Some (not all) regex parsers have a \h token meaning any valid hexadecimal character. But to make things clear, let's use a set of acceptable characters. Here is a translation of the hex_lit term:

\b0[xX](_?[0-9a-fA-F])+\b
Hexadecimal Literal RegEx

In this expression, the set of acceptable characters is actually 3 difference ranges. We accept anything in the range of 0 to 9, any lowercase letter a to f and any uppercase letter A to F. Here is the example.

So now that we have our separate literal expressions, we can match them all individually. But what if we want an expression that matches all of them? Thats pretty simple to do. Just add them all inside a group, separating each of the smaller expressions with | between them.

\b((0|[1-9](_?\d)*)|0[bB](_?[01])+|0[oO]?(_?[0-7])+|0[xX](_?[0-9a-fA-F])+)\b
Integer Literal RegEx

Check it out here. Now we have an expression matching any integer literal. Here is a diagram showing the flow of this expression.

Full Integer Literal RegEx flow diagram

Some might notice that this isn't the simplest expression. That's not necessarily a bad thing. Short regular expressions are often more susceptible to lower accuracy and bad matches. Here is a page explaining that. However we can make this expression a bit more compact and not risk making it any less accurate.

\b([1-9](_?\d)*|0([bB](_?[01])+|[oO]?(_?[0-7])+|[xX](_?[0-9a-fA-F])+)?)\b
Shortened Integer Literal RegEx flow diagram

What I did here was move the three non-decimal matching patterns and put them in a group of their own, preceded by a 0. This is possible because all three of the non-decimal patterns match a required 0 at the start of their strings. In addition, this new group is optional, allowing us to match a lone 0. And since we now have a way to match 0 on its own, we don't need to match it with decimals. Here is the example.

Shortened Integer Literal RegEx flow diagram