Regular Expressions & Brzozowski Algorithm

Regular expressions are an algebraic notation to describe regular languages. They can be converted into minimal deterministic finite automata with the elegant Brzozowski algorithm.

Imagine you are given a regular language, how do you construct the DFA that accepts that language? But not any DFA, one that is minimal that is the one that has the least amount of states.

Regular expressions (RE) were introduced by Kleene in 1956 as an algebraic notation to describe regular languages.

The set \(\mathcal{R}(\Sigma)\) of RE over an alphabet \(\Sigma\) is defined by the syntax

\[R ::= 0 | 1 | a \in \Sigma | S^\star | ST | S + T | S \& T | \lnot S\]

the language of a RE is given by the recursive function \(\mathcal{L}: R \to \mathcal{P}(\Sigma^\star)\)¹

\begin{array}{rll} \mathcal{L}(0) =& \emptyset & \text{the empty language} \\ \mathcal{L}(1) =& \{ \epsilon\ \} & \text{just the empty string} \\ \mathcal{L}(a) =& \{ a \} & \text{just the letter a} \\ \mathcal{L}(S^\star) =& \cup_{n=0}^\infty \mathcal{L}(S)^n & \text{zero or more repetitions of words in S} \\ \mathcal{L}(S T) =& \{ v w | v \in \mathcal{L}(S), w \in \mathcal{L}(T) \} & \text{the concatenation of words from S and T} \\ \mathcal{L}(S + T) =& \mathcal{L}(S) \cup \mathcal{L}(T) & \text{the union of both languages} \\ \mathcal{L}(S \& T) =& \mathcal{L}(S) \cap \mathcal{L}(T) & \text{the intersection of both languages} \\ \mathcal{L}(\lnot S) =& \Sigma^\star / \mathcal{L}(S) & \text{the complement language} \end{array}

For example consider \(R = a (bb + c)^\star\) using the definition we see that \(\mathcal{L}(R)\) must be the concatenation of an "a" with zero or more repetitions of either "bb" or "c" that is \(\mathcal{L}(R) = \{a, abb, ac, abbc, abbbb, acc, abbcbb, \dots\}\).

Using languages we can define equality of REs: two REs are equal if and only if their languages are equal.

With this notion of equality it is easy to see that REs obey the algebraic laws

\begin{array}{ccc} R + R = R & R \& R = R \\ R + S = S + R & R \& S = S \& R \\ R + (S + T) = (R + S) + T & R \& (S \& T) = R \& (S \& T) \\ 0 + R = R + 0 = R & 0 \& R = R \& 0 = 0 \\ \\ (R S) T = R (S T) & (R^\star)^\star = R^\star \\ 0 R = R 0 = 0 & 1^\star = 1 \\ 1 R = R 1 = R & 0^\star = 0 \\ \\ R (S + T) = RS + RT & (R + S) T = RT + ST \end{array}

for example to show that \(R 1 = R\) we need to prove that

\[\mathcal{L}(R 1) = \mathcal{L}(R)\]

applying the definition we see that

\[\mathcal{L}(R 1) = \{ u v | u \in \mathcal{L}(R), v \in \mathcal{L}(1) = \{ \epsilon \} \} = \{ u | u \in \mathcal{L}(R) \} = \mathcal{L}(R)\]

qed.

In [1] Brzozowski introduced the derivative \(\partial_a\) of a language with respect to a letter \(a \in \Sigma\), suppose you have a language \(\mathcal{L}\) its derivative with respect to a letter is

\[\partial_a \mathcal{L} = \{ w | aw \in \mathcal{L}\}\]

that is we chop off an "a" from the beginning of each word of the language (and we exclude words that do not begin with an "a").

Now consider a RE \(R\) along with its language \(\mathcal{L}(R)\), what's the RE whose language is \(\partial_a \mathcal{L}(R)\)? Can we define a RE \(\partial_a R\) such that \(\mathcal{L}(\partial_a R) = \partial_a \mathcal{L}(R)\)?

Yes, the definition is²

\begin{align*} \partial_a 0 &= 0 \\ \partial_a 1 &= 0 \\ \partial_a a &= 1 \\ \partial_a b &= 0 (b \neq a) \\ \partial_a (R^\star) &= (\partial_a R) R^\star \\ \partial_a (R S) &= (\partial_a R) S + \nu(R) \partial_a S \\ \partial_a (R + S) &= \partial_a R + \partial_a S \\ \partial_a (R \& S) &= \partial_a R \& \partial_a S \\ \partial_a (\lnot R) &= \lnot(\partial_a R) \end{align*}

where \(\nu\) tells if a regular expression is nullable that is it generate a language that contains the empty string and is defined recursively as

\begin{align*} \nu(0) &= 0 \\ \nu(1) &= 1 \\ \nu(a) &= 0 \\ \nu(R^\star) &= 1 \\ \nu(R S) &= \nu(R) \& \nu(S) \\ \nu(R + S) &= \nu(R) + \nu(S) \\ \nu(R \& S) &= \nu(R) \& \nu(S) \\ \nu(\lnot R) &= \begin{cases} 1 & \nu(R) = 0 \\ 0 & \nu(R) = 1 \\ \end{cases} \end{align*}

What does this definition mean? \(\partial_a\) tries to remove the letter "a" at the beginning of every word of the generated language and returns a RE that generates what is left (it returns \(0\) if it does not succeed).

This is easy to see this in the case of \(0\), \(1\) and the constant RE³.

For sums and intersections \(\partial_a\) tries to remove the letter a from either member, so for example

\[\partial_a (a + b) = \partial_a a + \partial_a b = 1 + 0 = 1\]

The interesting case is concatenation: the rule contains two terms

in the first it tries to remove the letter from the first member of the concatenation
but we are not done, we must consider the case when the first member is nullable (that is it generates a language containing the empty string), in this case the derivative must try to remove the letter from the second member

For example take \(R=abc\) what is \(\partial_a abc\)?

\[\partial_a abc = (\partial_a a) bc + \nu(a) (\partial_a bc) = 1 bc + 0 = bc\]

in this case the first member \(a\) is not nullable so \(\partial_a\) just removes the first "a" and the second term does nothing.

On the other hand consider \(R=a^\star (abc+def)\) now we have

\begin{align*} \partial_a \left[a^\star (abc+def)\right] &= \partial_a (a^\star) (abc+def) + \nu(a^\star)\partial_a (abc+def) \\ &= a^\star (abc+def) + 1 \partial_a (abc+def) \\ &= a^\star (abc+def) + \partial_a (abc) + \partial_a (def) \\ &= a^\star (abc+def) + bc \end{align*}

the first member now is nullable this means that we can find the words "abc" and "def" without the preceding "a"s from \(a^\star\) and we need the second term to take its derivative.

Using derivatives it is easy to check if a RE matches a word.

Consider the language \(\mathcal{L}(R)\) and a word \(w = a_1a_2 \dots a_n\), how can we check that \(w \in \mathcal{L}(R)\)?

If we take the derivatives with respect to the letters of \(w\) one after the other \(\partial_{a_n} \cdots \partial_{a_2} \partial_{a_1} \mathcal{L}(R)\) we are progressively

discarding from the language the words that do not begin with \(w\)
erasing \(w\) from the beginning of the remaining words

thus if we are left with a language containing the empty string we can be sure that \(w\) was in the language in the first place.

We can do the same with the RE: in order to check if a RE \(R\) matches a word \(w = a_1a_2 \dots a_n\) we can take the derivative

\[\partial_{a_n} \cdots \partial_{a_2} \partial_{a_1} R\]

if it is nullable then \(R\) matches \(w\).

Taking the previous example \(R = a (bb + c)^\star\), how do you check in practice that \(R\) matches the word "abbc" (\(\text{abbc} \in \mathcal{L}(R)\))?

Start with \(R\) and take the derivatives with respect to "a", "b", "b", "c" one after the other

\(\partial_a a (bb + c)^\star = (bb + c)^\star\)
\(\partial_b (bb + c)^\star = b (bb + c)^\star\)
\(\partial_b b (bb + c)^\star = (bb + c)^\star\)
\(\partial_c (bb + c)^\star = (bb + c)^\star\)

we are left with \((bb + c)^\star\) which is nullable so we conclude that we have a match.

Compare it with the case of the word "acac":

\(\partial_a a (bb + c)^\star = (bb + c)^\star\)
\(\partial_c (bb + c)^\star = (bb + c)^\star\)
\(\partial_a (bb + c)^\star = 0\) (and we could just stop here)
\(\partial_c 0 = 0\)

we are left with \(0\) which is not nullable and we conclude that we don't have a match.

Or with "acb"

\(\partial_a a (bb + c)^\star = (bb + c)^\star\)
\(\partial_c (bb + c)^\star = (bb + c)^\star\)
\(\partial_b (bb + c)^\star = b (bb + c)^\star\)

we are left with \(b (bb + c)^\star\) which again is not nullable and we don't have a match.

Looking at the algorithm to match a RE we realize that it has a lot in common with how we match a word with a DFA:

it starts with a RE
it processes the input word one letter at a time
at each step it updates the RE using the derivative
if the final RE is nullable the word is accepted

if we substitute "RE" with "state" it's almost the same as a DFA.

So now we have a suggestion for an algorithm that build a DFA from a RE:

we identify DFA states with RE
there is a transition \(R \xrightarrow{a} S\) if \(\partial_a R = S\)

It works like this, suppose you are given

\[a (bb + c)^\star\]

let us build a map from RE to states, in the beginning it contains only

RE	state	transitions
\(a (bb + c)^\star\)	\(q_0\)

let us take the derivatives

\(\partial_a a (bb + c)^\star = (bb + c)^\star\)
\(\partial_b a (bb + c)^\star = \partial_c a (bb + c)^\star = 0\)

we see that the only non-zero derivative is \((bb +c)^\star\), let's add it to the map as new state \(q_1\) remembering that we got there from \(q_0\)

RE	state	transitions
\(a (bb + c)^\star\)	\(q_0\)	\(q_0 \xrightarrow{a} q_1\)
\((bb + c)^\star\)	\(q_1\)

again we take the derivatives of the new RE (ignoring 0s)

\(\partial_b (bb + c)^\star = b (bb + c)^\star\)
\(\partial_c (bb + c)^\star = (bb + c)^\star\)

we see that the first derivative gives a new RE: we add it to the map along with its transition

RE	state	transitions
\(a (bb + c)^\star\)	\(q_0\)	\(q_0 \xrightarrow{a} q_1\)
\((bb + c)^\star\)	\(q_1\)	\(q_1 \xrightarrow{b} q_2\)
\(b (bb +c)^\star\)	\(q_2\)

but the second derivative gives a RE that we have already seen: \(q_1\) so we do not need to add it anew, we only record the new transition \(q_1 \xrightarrow{c} q_1\).

RE	state	transitions
\(a (bb + c)^\star\)	\(q_0\)	\(q_0 \xrightarrow{a} q_1\)
\((bb + c)^\star\)	\(q_1\)	\(q_1 \xrightarrow{b} q_2\)
		\(q_1 \xrightarrow{c} q_1\)
\(b (bb +c)^\star\)	\(q_2\)

taking the derivatives again

\(\partial_b b (bb +c)^\star = (bb +c)^\star\)

we add the new transition \(q_2 \xrightarrow{b} q_1\)

RE	state	transitions
\(a (bb + c)^\star\)	\(q_0\)	\(q_0 \xrightarrow{a} q_1\)
\((bb + c)^\star\)	\(q_1\)	\(q_1 \xrightarrow{c} q_1\)
		\(q_1 \xrightarrow{b} q_2\)
\(b (bb +c)^\star\)	\(q_2\)	\(q_2 \xrightarrow{b} q_1\)

and since there are no new RE to add to the map we stop.

We are almost finished, we only need to specify the start state and the accepting states

as start state we take \(q_0\) (which corresponds to the original RE)
as accepting states we take all the states that are nullable: just \(q_1\)

the final result is

You can find the detailed algorithm in [2].

For an implementation in C++ you can look at pywfplan/fsm.h.

More examples of DFA construction are here pywfplan/brzozowski demo.ipynb.

Why is the resulting automaton minimal?

If we think of states as representing features of a subword then in a minimal automaton we do not expect the same feature to be associated to more than one state.

Brzozowski demonstrates that taking RE derivatives as states ensures just that (see [1] Th.5.1).

[1]

J. A. Brzozowski, Derivatives of Regular Expressions, Journal of the ACM 11, 481 (1964).

[2]

S. OWENS, J. REPPY, and A. TURON, Regular-Expression Derivatives Re-Examined, Journal of Functional Programming 19, 173 (2009).

Footnotes:

\(\mathcal{P}(\Sigma^\star)\) is just the set of subsets of \(\Sigma^\star\)

note how the rules for sum and product resemble those of the usual derivative: \((f(x)+g(x))' = f'(x)+g'(x)\) and \((f(x)g(x))' = f'(x)g(x)+f(x)g'(x)\)

in this case \(\partial_a a\) removes the letter "a" and leaves us with the rest which is just the empty RE \(1\), if the letters do not match \(\partial_a b\) fail and return \(0\)