Brzozowski derivative

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search
Brzozowski derivative (on red background) of a dictionary string set with respect to the string "con"

In theoretical computer science, in particular in formal language theory, the Brzozowski derivative u1S of a set S of strings and a string u is the set of all strings obtainable from a string in S by cutting off the prefix u. Formally:

u1S={vΣ*uvS}.

For example,

c1{cat,cow,dog}={at,ow}.

The Brzozowski derivative was introduced under various different names since the late 1950s.[1][2][3] Today it is named after the computer scientist Janusz Brzozowski who investigated its properties and gave an algorithm to compute the derivative of a generalized regular expression.[4]

Definition

[edit | edit source]

Even though originally studied for regular expressions, the definition applies to arbitrary formal languages. Given any formal language S over an alphabet Σ and any string uΣ*, the derivative of S with respect to u is defined as:[5]

u1S={vΣ*uvS}

The Brzozowski derivative is a special case of left quotient by a singleton set containing only u:  u1S={u}S.

Equivalently, for all u,vΣ*:

vu1SuvS.

From the definition, for all u,vΣ*:

(uv)1S=v1(u1S)

since for all wΣ*, we have w(uv)1SuvwSvwu1Swv1(u1S).

The derivative with respect to an arbitrary string reduces to successive derivatives over the symbols of that string, since for all aΣ,uΣ*: (ua)1S=a1(u1S)ε1S=S

A language SΣ* is called nullable if and only if it contains the empty string ε. Each language S is uniquely determined by nullability of its derivatives:

wS  εw1S

A language can be viewed as a (potentially infinite) boolean-labelled tree (see also tree (set theory) and infinite-tree automaton). Each possible string wΣ* denotes a node in the tree, with label true when wS and false otherwise. In this interpretation, the derivative with respect to a symbol a corresponds to the subtree obtained by following the edge a from the root. Decomposing a tree into the root and the subtrees a1S corresponds to the following equality, which holds for every language SΣ*:

S=({ε}S)aΣa(a1S).

Derivatives of generalized regular expressions

[edit | edit source]

When a language is given by a regular expression, the concept of derivatives leads to an algorithm for deciding whether a given word belongs to the regular expression.

Given a finite alphabet A of symbols,[6] a generalized regular expression R denotes a possibly infinite set of finite-length strings over the alphabet A, called the language of R, denoted L(R).

A generalized regular expression can be one of the following (where a is a symbol of the alphabet A, and R and S are generalized regular expressions):

  • "∅" denotes the empty set: L(∅) = {},
  • "ε" denotes the singleton set containing the empty string: L(ε) = {ε},
  • "a" denotes the singleton set containing the single-symbol string a: L(a) = {a},
  • "RS" denotes the union of R and S: L(RS) = L(R) ∪ L(S),
  • "RS" denotes the intersection of R and S: L(RS) = L(R) ∩ L(S),
  • R" denotes the complement of R (with respect to A*, the set of all strings over A): LR) = A* \ L(R),
  • "RS" denotes the concatenation of R and S: L(RS) = L(R) · L(S),
  • "R*" denotes the Kleene closure of R: L(R*) = L(R)*.

In an ordinary regular expression, neither ∧ nor ¬ is allowed.

Computation

[edit | edit source]

For any given generalized regular expression R and any string u, the derivative u−1R is again a generalized regular expression (denoting the language u−1L(R)).[7] It may be computed recursively as follows.[8]

(ua)−1R = a−1(u−1R)       for a symbol a and a string u
ε−1R = R

Using the previous two rules, the derivative with respect to an arbitrary string is explained by the derivative with respect to a single-symbol string a. The latter can be computed as follows:[9]

a−1a = ε
a−1b = ∅ for each symbol ba
a−1ε = ∅
a−1 = ∅
a−1(R*) = (a−1R)R*
a−1(RS) = (a−1R)S ∨ ν(R)a−1S
a−1(RS) = (a−1R) ∧ (a−1S)
a−1(RS) = (a−1R) ∨ (a−1S)
a−1R) = ¬(a−1R)

Here, ν(R) is an auxiliary function yielding a generalized regular expression that evaluates to the empty string ε if R's language contains ε, and otherwise evaluates to ∅. This function can be computed by the following rules:[10]

ν(a) = ∅ for any symbol a
ν(ε) = ε
ν(∅) = ∅
ν(R*) = ε
ν(RS) = ν(R) ∧ ν(S)
ν(RS) = ν(R) ∧ ν(S)
ν(RS) = ν(R) ∨ ν(S)
ν(¬R) = ε if ν(R) = ∅
ν(¬R) = ∅ if ν(R) = ε

Properties

[edit | edit source]

A string u is a member of the string set denoted by a generalized regular expression R if and only if ε is a member of the string set denoted by the derivative u−1R.[11]

Considering all the derivatives of a fixed generalized regular expression R results in only finitely many different languages. If their number is denoted by dR, all these languages can be obtained as derivatives of R with respect to strings of length less than dR.[12] Furthermore, there is a complete deterministic finite automaton with dR states that recognises the regular language given by R, as stated by the Myhill–Nerode theorem.

Derivatives of context-free languages

[edit | edit source]

Derivatives are also effectively computable for recursively defined equations with regular expression operators, which are equivalent to context-free grammars. This insight was used to derive parsing algorithms for context-free languages.[13] Implementation of such algorithms have shown to have cubic time complexity,[14] corresponding to the complexity of the Earley parser on general context-free grammars.

See also

[edit | edit source]

References

[edit | edit source]
  1. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  2. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  3. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  4. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  5. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  6. ^ Brzozowski (1964), p.481, required A to consist of the 2n combinations of n bits, for some n.
  7. ^ Brzozowski (1964), p.483, Theorem 4.1
  8. ^ Brzozowski (1964), p.483, Theorem 3.2
  9. ^ Brzozowski (1964), p.483, Theorem 3.1
  10. ^ Brzozowski (1964), p.482, Definition 3.2
  11. ^ Brzozowski (1964), p.483, Theorem 4.2
  12. ^ Brzozowski (1964), p.484, Theorem 4.3
  13. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  14. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).