lang: es
regexp3 (C-lang, Go-lang) and regexp4 (C-lang, Go-lang)
raptor-book (draft (spanish)) : here
benchmarks ==> here
- Easy to use.
- No error checking.
- Only regexp.
- The most compact and clear code in a human regexp library.
- Zero dependencies. Not even the standard C library is used: PURE C.
- No explicit dynamic memory management: no malloc, calloc, free, …
- Count matches
- Catches
- Catch replacement
- Placement of specific catches within an array
- Backreferences
- UTF8 support
Recursive Regexp Raptor is a regular expression library for search, capture and replacement, written in C from scratch, with the following goals:
- Having most of the features present in any other regexp library.
- Elegant Code: simple, clear and endowed with grace.
- Avoid explicit requests for dynamic memory.
- Avoid using any external libraries, including the standard library.
- Be a useful learning material.
There are two parallel developments of this library: the first (regexp3) focuses on simplicity and clear code; the second (this one), still in beta, seeks the maximum possible speed by implementing a “table of instructions.” In both cases the algorithm is written from scratch and uses only C. Enjoy!
C does not have a standard regular expression library. There are several implementations, such as pcre, the regexp.h library of the GNU project, regexp from the Plan 9 operating system, and a few others, but the author of this work (who is a bit slow) found in all of them convoluted, mystical code divided into several files full of macros, underscores and cryptic variables. Unable to understand anything, and after a retreat to the island of onanistic meditation, the author decided to make their own library, with casinos and Japanese schoolgirls.
Developed with GNU Emacs (the only true operating system), gcc 6.3.1 and clang (LLVM) 3.8.1, konsole and fish, running on Freidora 25.
There are two test batteries for the library. The first, the ASCII battery, is in the file ascii_test.c.
To test the ASCII library:
gcc ascii_test.c regexp4_ascii.c
To test the UTF8 version:
gcc ascii_test.c regexp4_utf8.c
The second battery of tests is exclusive to regexp4_utf8.c:
gcc utf8_test.c regexp4_utf8.c
In either case, run with
./a.out
To include Recursive Regexp Raptor in your code, place the files regexp4.h, charUtils.h and either regexp4_ascii.c or regexp4_utf8.c inside your project folder. You must include the header
#include "regexp4.h"
and finally compile with
gcc myProyect.c regexp4_ascii.c
or
gcc myProyect.c regexp4_utf8.c
Compiling with optimization noticeably reduces the run time; try -O3.
This is the only search function; its prototype is:
int regexp4( const char *txt, const char *re );
- txt
- pointer to the string on which to perform the search; it must end with the termination sign ‘\0’.
- re
- pointer to the string containing the regular expression to search for; it must end with the termination sign ‘\0’.
The function returns the number of matches: 0 (none) or n matches.
The standard syntax for regular expressions uses the character ‘\’. Unfortunately this sign conflicts with the syntax of C. For this reason, and to keep the code simple, an alternative syntax has been chosen, detailed below.
- Text search in any location:
regexp4( "Raptor Test", "Raptor" );
- Multiple search options “exp1|exp2”
regexp4( "Raptor Test", "Dinosaur|T Rex|Raptor|Triceratops" );
- Matches any character ‘.’
regexp4( "Raptor Test", "R.ptor" );
- Zero or one match ‘?’
regexp4( "Raptor Test", "Ra?ptor" );
- One or more matches ‘+’
regexp4( "Raaaptor Test", "Ra+ptor" );
- Zero or more matches ‘*’
regexp4( "Raaaptor Test", "Ra*ptor" );
- Range of matches “{n1,n2}”
regexp4( "Raaaptor Test", "Ra{0,100}ptor" );
- Specific number of matches ‘{n1}’
regexp4( "Raptor Test", "Ra{1}ptor" );
- Minimum number of matches ‘{n1,}’
regexp4( "Raaaptor Test", "Ra{2,}ptor" );
- Sets.
- Character set “[abc]”
regexp4( "Raptor Test", "R[uoiea]ptor" );
- Range within a set of characters “[a-b]”
regexp4( "Raptor Test", "R[a-z]ptor" );
- Metacharacter within a set of characters “[:meta]”
regexp4( "Raptor Test", "R[:w]ptor" );
- Negated character set “[^abc]”
regexp4( "Raptor Test", "R[^uoie]ptor" );
- UTF8 characters
regexp4( "R△ptor Test", "R△ptor" );
also
regexp4( "R△ptor Test", "R[△]ptor" );
- Match a character that is a letter “:a”
regexp4( "RAptor Test", "R:aptor" );
- Match a character that is not a letter “:A”
regexp4( "R△ptor Test", "R:Aptor" );
- Match a character that is a digit “:d”
regexp4( "R4ptor Test", "R:dptor" );
- Match a character that is not a digit “:D”
regexp4( "Raptor Test", "R:Dptor" );
- Match an alphanumeric character “:w”
regexp4( "Raptor Test", "R:wptor" );
- Match a non-alphanumeric character “:W”
regexp4( "R△ptor Test", "R:Wptor" );
- Match a character that is a space “:s”
regexp4( "R ptor Test", "R:sptor" );
- Match a character that is not a space “:S”
regexp4( "Raptor Test", "R:Sptor" );
- Match a UTF8 character “:&”
regexp4( "R△ptor Test", "R:&ptor" );
- Escape a character with special meaning “:character”
The characters ‘|’, ‘(’, ‘)’, ‘<’, ‘>’, ‘[’, ‘]’, ‘?’, ‘+’, ‘*’, ‘{’, ‘}’, ‘-’, ‘#’ and ‘@’ are special characters. Placing one of them as is, without regard to correct syntax within the expression, can cause infinite loops and other errors.
regexp4( ":#()|<>", ":::#:(:):|:<:>" );
The special characters (except the metacharacter) lose their meaning within a set
regexp4( "()<>[]|{}*#@?+", "[()<>:[:]|{}*?+#@]" );
- Grouping “(exp)”
regexp4( "Raptor Test", "(Raptor)" );
- Grouping with capture “<exp>”
regexp4( "Raptor Test", "<Raptor>" );
- Backreferences “@id”
Backreferences need a previously captured expression “<exp>”; the capture number is then written, preceded by ‘@’
regexp4( "ae_ea", "<a><e>_@2@1" );
- Behavior modifiers
There are two types of modifiers. The first affects the behavior of the expression globally; the second affects specific sections. In either case the syntax is the same: the sign ‘#’ followed by the modifiers.
Global modifiers are placed at the beginning of the whole expression and are as follows:
- Search only the beginning ‘#^exp’
regexp4( "Raptor Test", "#^Raptor" );
- Search only at the end ‘#$exp’
regexp4( "Raptor Test", "#$Test" );
- Search the beginning and end “#^$exp”
regexp4( "Raptor Test", "#^$Raptor Test" );
- Stop with the first match “#?exp”
regexp4( "Raptor Test", "#?Raptor Test" );
- Search for the string, character by character “#~”
By default, when an expression matches a region of the search text, the search continues from the end of that match. This switch overrides that behavior, so that the search always advances character by character.
regexp4( "aaaaa", "#~a*" );
In this example, without the modifier the result would be one match; with this switch, however, the search resumes at the character immediately after the start of each match, returning five matches.
- Ignore case “#*exp”
regexp4( "Raptor Test", "#*RaPtOr TeSt" );
All of the above switches are compatible with each other, i.e. you could search
regexp4( "Raptor Test", "#^$*?~RaPtOr TeSt" );
however, the modifiers ‘~’ and ‘?’ lose their meaning in the presence of ‘^’ and/or ‘$’.
An expression such as:
regexp4( "Raptor Test", "#$RaPtOr|#$TeSt" );
is erroneous: the modifier after the ‘|’ would apply to the section between ‘|’ and ‘#’, i.e. to nothing, and the result would be wrong.
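The difference between the default scan and the ‘#~’ scan can be sketched without the library. This is only an illustration, not the library's code: `match_run_of_a` and `count_matches` are hypothetical names, and the matcher is a stand-in for "a+" (rather than "a*") so that zero-length matches, whose handling the text does not describe, never come up.

```c
#include <stddef.h>

/* Hypothetical stand-in for the regexp "a+": returns the length of the
   run of 'a' starting at s, or 0 when there is no match at s. */
static size_t match_run_of_a( const char *s ){
  size_t len = 0;
  while( s[ len ] == 'a' ) len++;
  return len;
}

/* Count matches in txt. With step_by_char == 0 the scan resumes at the
   end of each match (the default behavior described above); otherwise it
   resumes one character after the start of the match, as described
   for the '#~' modifier. */
int count_matches( const char *txt, int step_by_char ){
  int count = 0;

  while( *txt ){
    size_t len = match_run_of_a( txt );
    if( len > 0 ){
      count++;
      txt += step_by_char ? 1 : len;
    } else {
      txt++;
    }
  }

  return count;
}
```

Under these assumptions, `count_matches( "aaaaa", 0 )` gives 1 and `count_matches( "aaaaa", 1 )` gives 5, the same counts the text reports for the search with and without ‘#~’.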
Local modifiers are placed after the repetition indicator (if there is one) and affect the same region that repetition indicators affect, i.e. characters, sets or groups.
- Ignore case “exp#*”
regexp4( "Raptor Test", "(RaPtOr)#* TeS#*t" );
- Do not ignore case “exp#/”
regexp4( "RaPtOr TeSt", "#*(RaPtOr)#/ TES#/T" );
Catches are indexed according to their order of appearance in the expression, for example:
<   <   >  |  <   <   >   >   >
= 1 ==========================
    = 2==     = 2 =========
                  = 3 =
If the expression matches more than once in the search text, the index increases according to the order of appearance, that is:
<   <   >  |  <   >   >   <   <   >  |  <   >   >   <   <   >  |  <   >   >
= 1 ==================    = 3 ==================    = 5 ==================
    = 2==     = 2==           = 4==     = 4==           = 6==     = 6==
match one                 match two                 match three
The cpyCatch function copies a catch into a character array; here is its prototype:
char * cpyCatch( char * str, const int index );
- str
- pointer to an array capable of holding the largest capture.
- index
- index of the grouping (1 to n).
The function returns a pointer to the capture, terminated with ‘\0’. An incorrect index returns a pointer to a string that begins with ‘\0’.
To get the number of catches in a search, use totCatch:
int totCatch();
which returns a value from 0 to n.
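You could use this and the previous function to print all catches with a function like this:
#include <stdio.h>
#include "regexp4.h"

/* print every catch of the last search; str must be able to hold the largest catch */
void printCatch(){
  char str[128];
  int i = 0, max = totCatch();

  while( ++i <= max )
    printf( "[%d] >%s<\n", i, cpyCatch( str, i ) );
}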
The functions gpsCatch() and lenCatch() do the same work as cpyCatch without using an array: the first returns a pointer to the initial position of the capture within the search text, and the second returns the length of the capture.
int lenCatch( const int index );
const char * gpsCatch( const int index );
The above example, with these functions, would be:
/* print every catch straight from the search text, without copying */
void printCatch(){
  int i = 0, max = totCatch();

  while( ++i <= max )
    printf( "[%d] >%.*s<\n", i, lenCatch( i ), gpsCatch( i ) );
}
The putCatch function places catches within a new string; its prototype is:
char * putCatch( char * newStr, const char * putStr );
The putStr argument contains the text with which to form the new string, as well as indicators of which catches to place. To indicate the insertion of a capture, use the ‘#’ sign followed by the capture index. For example, the putStr argument could be
char *putStr = "catch 1 >>#1<< catch 2 >>#2<< catch 747 >>#747<<";
newStr is a character array large enough to contain the string plus the catches. The function returns a pointer to the starting position of this array, which ends with the termination sign ‘\0’.
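The placeholder format of putStr can be illustrated with a small standalone sketch. It is not the library's putCatch: `expandCatches` is a hypothetical helper that takes the catches as an explicit array, and two behaviors are assumptions made for the example, namely that an index outside 1..ncaps inserts nothing and that a doubled ‘##’ is emitted as a single ‘#’.

```c
#include <string.h>

/* Hypothetical helper, not part of the library: copies tmpl into out,
   replacing "#n" with caps[ n - 1 ] and "##" with a literal '#'.
   Assumption: an index outside 1..ncaps inserts nothing.
   out must be large enough for the expanded text. */
char * expandCatches( char *out, const char *tmpl, const char **caps, int ncaps ){
  char *o = out;

  while( *tmpl ){
    if( *tmpl != '#' ){ *o++ = *tmpl++; continue; }

    tmpl++;                                          /* skip the '#'   */
    if( *tmpl == '#' ){ *o++ = *tmpl++; continue; }  /* "##" gives "#" */

    int n = 0;                                       /* read the index */
    while( *tmpl >= '0' && *tmpl <= '9' ){
      if( n <= ncaps ) n = n * 10 + ( *tmpl - '0' ); /* stop growing once out of range */
      tmpl++;
    }

    if( n >= 1 && n <= ncaps ){
      size_t len = strlen( caps[ n - 1 ] );
      memcpy( o, caps[ n - 1 ], len );
      o += len;
    }
  }

  *o = '\0';
  return out;
}
```

With the two catches "Raptor" and "Test", the template "catch 1 >>#1<< catch 747 >>#747<<" expands, under these assumptions, to "catch 1 >>Raptor<< catch 747 >><<".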
To place the character ‘#’ within the string, escape ‘#’ with a further ‘#’, i.e.:
"## Comment" -> "# Comment"
Replacement operates on a character array in which the search text is placed, with a specified catch replaced by a text string. The function in charge of this work is rplCatch; its prototype is:
char * rplCatch( char * newStr, const char * rplStr, const int id );
- newStr
- character array large enough to hold the original text on which the search was performed plus the replacement text of the catches.
- rplStr
- replacement text for the capture.
- id
- capture identifier, following the order of appearance within the regular expression. Passing a wrong index places an unaltered copy of the search string in the array newStr.
In this case the id argument, unlike in the function getCatch, does not refer to one specific “catch”. That is, no matter how many times an expression has captured, the identifier indicates the position within the expression itself, i.e.:
   <   <   >  |  <   <   >   >   >
id = 1 ==========================
id     = 2==     = 2 =========
id                   = 3 =
capturing position within the expression
The replacement affects the catches as follows:
<   <   >  |  <   >   >   <   <   >  |  <   >   >   <   <   >  |  <   >   >
= 1 ==================    = 1 ==================    = 1 ==================
    = 2==     = 2==           = 2==     = 2==           = 2==     = 2==
capture one               "..." two                 "..." three
:d
- digit from 0 to 9.
:D
- any character other than a digit from 0 to 9.
:a
- any character that is a letter (a-z, A-Z)
:A
- any character other than a letter
:w
- any alphanumeric character.
:W
- any non-alphanumeric character.
:s
- any whitespace character.
:S
- any character other than whitespace.
:&
- Non-ASCII character (in UTF8 version only).
:|
- Vertical bar
:^
- Caret
:$
- Dollar sign
:(
- Left parenthesis
:)
- Right parenthesis
:<
- Less than
:>
- Greater than
:[
- Left bracket
:]
- Right bracket
:.
- Period
:?
- Question mark
:+
- Plus sign
:-
- Minus sign
:*
- Asterisk
:{
- Left brace
:}
- Right brace
:#
- Number sign (modifier)
::
- Colon
Additionally, use the usual C syntax to place characters such as new line, tab, etc. Similarly, you can use the C syntax for “placing” characters in octal, hexadecimal or Unicode.
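As a small illustration of that point, the C compiler resolves these escapes inside the string literal before any regexp code sees the text. The function name `escapes_agree` is made up for this sketch, and the comparisons assume an ASCII-compatible execution character set.

```c
#include <string.h>

/* Returns 1 when each escaped literal spells the same bytes as its plain
   counterpart (assumes an ASCII-compatible execution character set). */
int escapes_agree( void ){
  return strcmp( "R\x61" "ptor", "Raptor" ) == 0   /* hexadecimal: 0x61 is 'a' */
      && strcmp( "R\141ptor",    "Raptor" ) == 0   /* octal: 141 is 'a'        */
      && strcmp( "\t",           "\x09"   ) == 0   /* tab                      */
      && strcmp( "\n",           "\x0a"   ) == 0;  /* new line                 */
}
```

The hexadecimal literal is split in two because a `\x` escape consumes every hexadecimal digit that follows it; closing the literal right after the escape keeps it from absorbing the next character by accident.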
The ascii_test.c file contains a wide variety of tests that are useful as examples of use; these include the following:
regexp4( "07-07-1777", "<0?[1-9]|[12][0-9]|3[01]><[/:-\\]><0?[1-9]|1[012]>@2<[12][0-9]{3}>" );
captures a date-format string: day, separator, month and year, each separately. The separator has to match on both occasions where it appears
regexp4( "https://en.wikipedia.org/wiki/Regular_expression", "(https?|ftp):://<[^:s/:<:>]+></[^:s:.:<:>,/]+>*<.>*" );
captures something like a web link
regexp4( "<mail>[email protected]</mail>", "<[_A-Za-z0-9:-]+(:.[_A-Za-z0-9:-]+)*>:@<[A-Za-z0-9]+>:.<[A-Za-z0-9]+><:.[A-Za-z0-9]{2}>*" );
captures the sections (user, site, domain) of something like an email.
      ┌────┐
      │init│
      └────┘
         │◀───────────────────────────────────┐
         ▼                                    │
  ┌──────────────┐                            │
  │loop in string│                            │
  └──────────────┘                            │
         │                                    │
         ▼                                    │
  ┌─────────────┐  no   ┌─────────────┐       │
 <│end of string│>────▶<│search regexp│>──────┘
  └─────────────┘       └─────────────┘ no match
         │ yes                 │ match
         ▼                     ▼
 ┌────────────────┐     ┌─────────────┐
 │report: no match│     │report: match│
 └────────────────┘     └─────────────┘
         │                     │
         │◀────────────────────┘
         ▼
       ┌───┐
       │end│
       └───┘
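The outer loop of this flowchart can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's code: `match_here` is a hypothetical stand-in for the "search regexp" step (here a plain literal comparison), and `find_first` is a made-up name for the loop itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "search regexp" step: a plain literal
   comparison at the current position. Returns 1 on match, 0 otherwise. */
static int match_here( const char *txt, const char *lit ){
  return strncmp( txt, lit, strlen( lit ) ) == 0;
}

/* Outer loop of the flowchart: try each position in turn and stop at the
   first match. Returns the match position, or NULL when the end of the
   string is reached without a match. */
const char * find_first( const char *txt, const char *lit ){
  for( ; *txt; txt++ )             /* "end of string"    */
    if( match_here( txt, lit ) )   /* "search regexp"    */
      return txt;                  /* "report: match"    */

  return NULL;                     /* "report: no match" */
}
```

For instance, `find_first( "Raptor Test", "Test" )` returns a pointer seven characters into the text, and `find_first( "Raptor Test", "Rex" )` returns NULL.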
search regexp
version one
[Flowchart: search regexp, version one. Steps shown:]
- get builder
- we have builder? no: finish, the path matches. yes: continue according to the builder type (alternation, set, point, metacharacter, character, grouping)
- set, point, metacharacter, character: match builder. match: advance in string, then get builder again. no match: finish, the path does not match
- alternation: save position, then loop in paths
- we have path? yes: search regexp (no match: restore position and loop in paths again; match: finish, the path matches). no: finish, without matches
search regexp
version two
[Flowchart: search regexp, version two. Steps shown:]
- save position, then loop in paths
- we have path? no: finish, without matches. yes: get builder
- we have builder? no: finish, the path matches. yes: continue according to the builder type (set, point, metacharacter, character, grouping)
- set, point, metacharacter, character: match builder. match: advance in string, then get builder again. no match: finish, the path does not match
- grouping: search regexp. match: get builder again
- finish, the path does not match: restore position, then loop in paths again
This project is not “open source”; it is free software and, accordingly, uses the GNU GPL Version 3. Any work that includes, uses or derives from the code of this library must comply with the terms of this license.