Regular Expression Notes

Explains the regular expressions that this parser uses for tokenization of N-Triples / N-Quads.

These regular expressions have been tested mainly with V8’s regular expression engine, but should generally work with other engines as well. Credits go to regex101.com, which is a great tool for testing regular expressions.

IRI match

That’s simple: (<[^>]+>) captures the IRI including its angle brackets, <([^>]+)> captures it without them.
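As a minimal sketch of the difference between the two capture variants (the variable names and the example IRI are made up for illustration):

```javascript
// Capture the IRI including its angle brackets.
const iriWithBrackets = /(<[^>]+>)/;
// Capture only the IRI itself, without the angle brackets.
const iriWithoutBrackets = /<([^>]+)>/;

// Example input; the IRI is purely illustrative.
const input = '<http://example.org/subject>';

const withBrackets = input.match(iriWithBrackets)[1];
const withoutBrackets = input.match(iriWithoutBrackets)[1];

console.log(withBrackets);    // <http://example.org/subject>
console.log(withoutBrackets); // http://example.org/subject
```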

Tokenization regex

Matches IRIs (including <, >), literals (including the suffix, if present), blank node labels (including the _: prefix) and end-of-statement markers (.):

((?:{{literal}}(?:@\w+(?:-\w+)?|\^\^<[^>]+>)?)|<[^>]+>|\_\:\w+|\.)

where {{literal}} is used as a placeholder for a literal match expression, explained below.

A token's type can then be easily determined by looking at the first character of a match.
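The following sketch shows one way this could look in JavaScript. It substitutes the unrolled-loop literal expression from the Literals section for {{literal}}, tokenizes one statement, and classifies each token by its first character. The function name tokenType, the type labels and the example statement are assumptions made for illustration, not part of the parser.

```javascript
// Literal expression (the unrolled-loop variant from the Literals section).
const literal = String.raw`"[^"\\]*(?:\\.[^"\\]*)*"`;

// Tokenization regex with the {{literal}} placeholder still in place.
const template = String.raw`((?:{{literal}}(?:@\w+(?:-\w+)?|\^\^<[^>]+>)?)|<[^>]+>|\_\:\w+|\.)`;

// Substitute the placeholder; the global flag lets match() return every token.
const tokenRegex = new RegExp(template.replace('{{literal}}', () => literal), 'g');

// Hypothetical helper: derive the token type from the first character.
function tokenType(token) {
  switch (token[0]) {
    case '<': return 'iri';
    case '"': return 'literal';
    case '_': return 'blankNode';
    case '.': return 'endOfStatement';
    default: return 'unknown';
  }
}

// Illustrative statement with an escaped quote inside the literal.
const line = String.raw`_:b0 <http://example.org/p> "He said \"hi\""@en-US <http://example.org/g> .`;

const tokens = line.match(tokenRegex) || [];
for (const token of tokens) {
  console.log(tokenType(token), token);
}
```

Dots inside IRIs are not reported as separate tokens, because the IRI alternative consumes everything up to the closing angle bracket before the scan continues.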

Literal tokens include the suffix (@… language tag or ^^… type), if present. This makes more sense here, as it makes reading the matches easier (since they have no type attached). The suffix could be split off by a second regex if needed.
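A second regex for splitting off the suffix might look like the sketch below. The pattern literalParts and the helper splitLiteral are hypothetical and not taken from the parser; the returned value is the still-escaped text between the quotes.

```javascript
// Hypothetical second-pass regex: group 1 is the text between the quotes,
// group 2 the language tag (without "@"), group 3 the datatype IRI (without "^^<" and ">").
const literalParts = /^"([^"\\]*(?:\\.[^"\\]*)*)"(?:@(\w+(?:-\w+)?)|\^\^<([^>]+)>)?$/;

// Split a literal token into its parts; returns null for non-literal tokens.
function splitLiteral(token) {
  const match = literalParts.exec(token);
  if (match === null) return null;
  const [, value, language, datatype] = match;
  return { value, language: language ?? null, datatype: datatype ?? null };
}

console.log(splitLiteral('"chat"@fr'));
console.log(splitLiteral('"42"^^<http://www.w3.org/2001/XMLSchema#integer>'));
console.log(splitLiteral('"plain"'));
```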

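(Language tag EBNF: LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*. The regex above is looser about characters, since \w also admits digits and underscores, and it accepts at most one subtag.)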

Literals

Literal matching is complicated a bit because the quote character (") that delimits literals may occur in the literal itself, if preceded by an escape character (\).

Possibilities (taken from a Stack Overflow answer; thanks, ridgerunner):

"([^"\\]|\\.)*", less efficient
"[^"\\]*(?:\\.[^"\\]*)*": “Implements Friedl’s: ‘unrolling-the-loop’ technique. Does not require possessive or atomic groups (i.e. this can be used in Javascript and other less-featured regex engines.)”

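To illustrate why the escape handling matters, the sketch below runs both expressions against a string containing escaped quotes, next to a naive pattern that ignores escapes. The naive pattern and the sample text are made up for comparison and are not used anywhere in the parser.

```javascript
// Hypothetical naive pattern: stops at the first quote, escaped or not.
const naive = /"[^"]*"/;
// First possibility: alternation inside a repeated group.
const simple = /"([^"\\]|\\.)*"/;
// Second possibility: the unrolled-loop variant.
const unrolled = /"[^"\\]*(?:\\.[^"\\]*)*"/;

// Sample text containing a literal with escaped quotes.
const text = String.raw`prefix "a \"quoted\" word" suffix`;

const naiveMatch = text.match(naive)[0];
const simpleMatch = text.match(simple)[0];
const unrolledMatch = text.match(unrolled)[0];

console.log(naiveMatch);    // "a \"
console.log(simpleMatch);   // "a \"quoted\" word"
console.log(unrolledMatch); // "a \"quoted\" word"
```

Both possibilities match the same text here; they differ in how much backtracking bookkeeping the engine has to do, not in what they accept for this input.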
“Bonus”: Matching quads

The following describes how to match a full N-Quad with these regexes. This is, however, not used by this parser, which only uses the tokenization regex and then further processes the resulting token strings.

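Regex to match a token (the tokenization regex with the unrolled-loop literal expression substituted; the language tag is simplified to @\w+ here):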

((?:"[^"\\]*(?:\\.[^"\\]*)*"(?:@\w+|\^\^<[^>]+>)?)|<[^>]+>|\_\:\w+|\.)

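Quad match, restricting which tokens are allowed at each quad position, written more readably here with # comments (this free-spacing form needs an engine with an extended/verbose mode; the single-line form follows below):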

\s*
(<[^>]+>|\_\:\w+)    # IRI: <…>, or blank node: _:…
\s*
(<[^>]+>|\_\:\w+)    # IRI or blank node
\s*
# Literal with optional language tag _or_ type IRI
((?:"[^"\\]*(?:\\.[^"\\]*)*"(?:@\w+|\^\^<[^>]+>)?)|<[^>]+>)
\s*
(<[^>]+>)            # Graph label IRI
\s*
\.                   # End of statement, literal dot
\s*

yields:

\s*(<[^>]+>|\_\:\w+)\s*(<[^>]+>|\_\:\w+)\s*((?:"[^"\\]*(?:\\.[^"\\]*)*"(?:@\w+|\^\^<[^>]+>)?)|<[^>]+>)\s*(<[^>]+>)\s*\.\s*

Yay.
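A minimal sketch of using the single-line quad regex in JavaScript follows. The helper parseQuad, its property names and the example statements are assumptions made for illustration. Note that, as written, the pattern requires a graph label and does not accept a blank node in the object position, so such statements yield no match.

```javascript
// The single-line quad regex from above, as a JavaScript regex literal.
const quadRegex = /\s*(<[^>]+>|\_\:\w+)\s*(<[^>]+>|\_\:\w+)\s*((?:"[^"\\]*(?:\\.[^"\\]*)*"(?:@\w+|\^\^<[^>]+>)?)|<[^>]+>)\s*(<[^>]+>)\s*\.\s*/;

// Hypothetical helper: returns the four captured positions, or null if the line does not match.
function parseQuad(line) {
  const match = quadRegex.exec(line);
  if (match === null) return null;
  const [, subject, predicate, object, graph] = match;
  return { subject, predicate, object, graph };
}

// Illustrative quad with escaped quotes and a language tag.
const quadLine = String.raw`_:alice <http://xmlns.com/foaf/0.1/name> "Alice \"Al\" A."@en <http://example.org/graph> .`;
console.log(parseQuad(quadLine));

// A triple without a graph label is not matched by this pattern.
console.log(parseQuad('<http://example.org/s> <http://example.org/p> <http://example.org/o> .')); // null
```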

And here’s a use case: Splitting N-Quads into N-Triples and graph labels:

This only changes the parentheses for matches / makes groups non-capturing:

(\s*(?:<[^>]+>|\_\:\w+)\s*(?:<[^>]+>|\_\:\w+)\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"(?:@\w+|\^\^<[^>]+>)?|<[^>]+>))\s*(<[^>]+>)\s*\.\s*

Don't forget to set the regex engine's global (g) modifier where needed.
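The splitting use case could be sketched as follows, using the global flag together with String.prototype.matchAll. The variable names, the output shape and the example document are assumptions for illustration only.

```javascript
// The splitting regex from above, with the global flag set so that every quad is found.
const splitRegex = /(\s*(?:<[^>]+>|\_\:\w+)\s*(?:<[^>]+>|\_\:\w+)\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"(?:@\w+|\^\^<[^>]+>)?|<[^>]+>))\s*(<[^>]+>)\s*\.\s*/g;

// Illustrative N-Quads document with two statements in different graphs.
const nquads = [
  '<http://example.org/s1> <http://example.org/p> "one" <http://example.org/g1> .',
  '<http://example.org/s2> <http://example.org/p> <http://example.org/o> <http://example.org/g2> .',
].join('\n');

// Group 1 holds subject, predicate and object; group 2 holds the graph label.
// Group 1 may carry leading whitespace, so it is trimmed before the dot is re-added.
const split = [];
for (const match of nquads.matchAll(splitRegex)) {
  split.push({ graph: match[2], triple: `${match[1].trim()} .` });
}

console.log(split);
```

Without the g flag, matchAll throws a TypeError, and exec or match would only ever return the first quad.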
