Module Re

Module Re: code for creating and using regular expressions, independently of regular expression syntax.

type t

Regular expression

type re

Compiled regular expression

module Group : sig ... end

Manipulate matching groups.

type groups = Re.Group.t
  • deprecated Use Group.t

Compilation and execution of a regular expression

val compile : Re.t -> Re.re @@ portable

Compile a regular expression into an executable version that can be used to match strings, e.g. with exec.

val group_count : Re.re -> int @@ portable

Return the number of capture groups (including the one corresponding to the entire regexp).

val group_names : Re.re -> (string * int) list @@ portable

Return named capture groups with their index.
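
As a small sketch of how group_count and group_names relate, assuming the re library is linked (the pattern and the group name "minor" are made up for illustration):

```ocaml
(* Illustrative pattern: two explicit groups, the second one named. *)
let re =
  Re.compile
    Re.(seq [ group (rep1 digit); char '.'; group ~name:"minor" (rep1 digit) ])

(* Two explicit groups plus group 0, which covers the entire match. *)
let () = assert (Re.group_count re = 3)

(* Only the named group is listed, paired with its index. *)
let () = assert (Re.group_names re = [ ("minor", 2) ])
```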

val exec : ?pos:int -> ?len:int -> Re.re -> string -> Re.Group.t @@ portable

exec re str searches str for a match of the compiled expression re, and returns the matched groups if any.

More specifically, when a match exists, exec returns a match that starts at the earliest position possible. If multiple such matches are possible, the one specified by the match semantics described below is returned.

Examples:

  # let regex = Re.compile Re.(seq [str "//"; rep print ]);;
  val regex : re = <abstr>

  # Re.exec regex "// a C comment";;
  - : Re.Group.t = <abstr>

  # Re.exec regex "# a C comment?";;
  Exception: Not_found

  # Re.exec ~pos:1 regex "// a C comment";;
  Exception: Not_found
  • parameter pos

    optional beginning of the string (default 0)

  • parameter len

    length of the substring of str that can be matched (default -1, meaning to the end of the string)

  • raises Not_found

    if the regular expression can't be found in str
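
The toplevel examples above only show an abstract Group.t. A minimal sketch of reading the result, using Group.get and Group.offset from the Group module (the pattern and input are illustrative assumptions):

```ocaml
(* Illustrative pattern: two runs of digits separated by a dash. *)
let re = Re.compile Re.(seq [ group (rep1 digit); char '-'; group (rep1 digit) ])

let () =
  let g = Re.exec re "range 10-20" in
  (* Group 0 is the whole match; groups 1 and 2 are the explicit groups. *)
  assert (Re.Group.get g 0 = "10-20");
  assert (Re.Group.get g 1 = "10");
  assert (Re.Group.get g 2 = "20");
  (* offset returns the start (inclusive) and stop (exclusive) positions. *)
  assert (Re.Group.offset g 0 = (6, 11))

(* ~pos and ~len restrict the searched window to "10", which has no dash. *)
let window_matches =
  match Re.exec ~pos:6 ~len:2 re "range 10-20" with
  | _ -> true
  | exception Not_found -> false

let () = assert (not window_matches)
```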

val exec_opt : ?pos:int -> ?len:int -> Re.re -> string -> Re.Group.t option @@ portable

Similar to exec, but returns an option instead of using an exception.

Examples:

  # let regex = Re.compile Re.(seq [str "//"; rep print ]);;
  val regex : re = <abstr>

  # Re.exec_opt regex "// a C comment";;
  - : Re.Group.t option = Some <abstr>

  # Re.exec_opt regex "# a C comment?";;
  - : Re.Group.t option = None

  # Re.exec_opt ~pos:1 regex "// a C comment";;
  - : Re.Group.t option = None
val execp : ?pos:int -> ?len:int -> Re.re -> string -> bool @@ portable

Similar to exec, but returns true if the expression matches, and false if it doesn't. This function is more efficient than calling exec or exec_opt and ignoring the returned group.

Examples:

  # let regex = Re.compile Re.(seq [str "//"; rep print ]);;
  val regex : re = <abstr>

  # Re.execp regex "// a C comment";;
  - : bool = true

  # Re.execp ~pos:1 regex "// a C comment";;
  - : bool = false
val exec_partial : ?pos:int -> ?len:int -> Re.re -> string -> [ `Full | `Partial | `Mismatch ] @@ portable

More detailed version of execp. `Full is equivalent to true, while `Mismatch and `Partial are equivalent to false, but `Partial indicates the input string could be extended to create a match.

Examples:

  # let regex = Re.compile Re.(seq [bos; str "// a C comment"]);;
  val regex : re = <abstr>

  # Re.exec_partial regex "// a C comment here.";;
  - : [ `Full | `Mismatch | `Partial ] = `Full

  # Re.exec_partial regex "// a C comment";;
  - : [ `Full | `Mismatch | `Partial ] = `Partial

  # Re.exec_partial regex "//";;
  - : [ `Full | `Mismatch | `Partial ] = `Partial

  # Re.exec_partial regex "# a C comment?";;
  - : [ `Full | `Mismatch | `Partial ] = `Mismatch
val exec_partial_detailed : ?pos:int -> ?len:int -> Re.re -> string -> [ `Full of Re.Group.t | `Partial of int | `Mismatch ] @@ portable

More detailed version of exec_opt. `Full group is equivalent to Some group, while `Mismatch and `Partial _ are equivalent to None, but `Partial position indicates that the input string could be extended to create a match, and that no match could start in the input string before the given position. When more input becomes available, the given position can be passed as the ?pos argument so that the search does not have to cover the entire input again.
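
A minimal sketch of dispatching on the three cases. The helper name describe and the inputs are hypothetical, and the position reported for `Partial is printed rather than assumed:

```ocaml
(* Illustrative pattern, anchored at the beginning of the string. *)
let re = Re.compile Re.(seq [ bos; str "abc" ])

(* Hypothetical helper: summarise the result as a string. *)
let describe input =
  match Re.exec_partial_detailed re input with
  | `Full g -> "full match: " ^ Re.Group.get g 0
  | `Partial pos -> Printf.sprintf "partial, could resume from position %d" pos
  | `Mismatch -> "mismatch"

let () = List.iter (fun s -> print_endline (describe s)) [ "abc!"; "ab"; "xyz" ]
```

The `Partial branch is where a caller reading input in chunks would wait for more data before retrying.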

module Mark : sig ... end

Marks

High Level Operations

type split_token = [
  | `Text of string (* Text between delimiters *)
  | `Delim of Re.Group.t (* Delimiter *)
]
val all : ?pos:int -> ?len:int -> Re.re -> string -> Re.Group.t list @@ portable

Repeatedly calls exec on the given string, starting at the given position and limited to the given length.

Examples:

  # let regex = Re.compile Re.(seq [str "my"; blank; word(rep alpha)]);;
  val regex : re = <abstr>

  # Re.all regex "my head, my shoulders, my knees, my toes ...";;
  - : Re.Group.t list = [<abstr>; <abstr>; <abstr>; <abstr>]

  # Re.all regex "My head, My shoulders, My knees, My toes ...";;
  - : Re.Group.t list = []
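
Since all returns groups rather than strings, each element can be inspected with the Group functions. A short sketch reusing the regex above (the name found is only for illustration):

```ocaml
let re = Re.compile Re.(seq [ str "my"; blank; word (rep alpha) ])
let input = "my head, my shoulders, my knees, my toes ..."

(* For every match, keep the matched text and its offsets in the input. *)
let found =
  List.map (fun g -> (Re.Group.get g 0, Re.Group.offset g 0)) (Re.all re input)

let () =
  (* The matched texts are exactly what Re.matches would return. *)
  assert (List.map fst found = Re.matches re input);
  assert (List.hd found = ("my head", (0, 7)))
```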
type 'a gen = unit -> 'a option
val all_gen : ?pos:int -> ?len:int -> Re.re -> string -> Re.Group.t Re.gen @@ portable
val all_seq : ?pos:int -> ?len:int -> Re.re -> string -> Re.Group.t Stdlib.Seq.t @@ portable
val matches : ?pos:int -> ?len:int -> Re.re -> string -> string list @@ portable

Same as all, but extracts the matched substring rather than returning the whole group. This basically iterates over matched strings.

Examples:

  # let regex = Re.compile Re.(seq [str "my"; blank; word(rep alpha)]);;
  val regex : re = <abstr>

  # Re.matches regex "my head, my shoulders, my knees, my toes ...";;
  - : string list = ["my head"; "my shoulders"; "my knees"; "my toes"]

  # Re.matches regex "My head, My shoulders, My knees, My toes ...";;
  - : string list = []

  # Re.matches regex "my my my my head my 1 toe my ...";;
  - : string list = ["my my"; "my my"]

  # Re.matches ~pos:2 regex "my my my my head my +1 toe my ...";;
  - : string list = ["my my"; "my head"]
val matches_gen : ?pos:int -> ?len:int -> Re.re -> string -> string Re.gen @@ portable
val matches_seq : ?pos:int -> ?len:int -> Re.re -> string -> string Stdlib.Seq.t @@ portable
val split : ?pos:int -> ?len:int -> Re.re -> string -> string list @@ portable

split re s splits s into chunks separated by re. It yields the chunks themselves, not the separator. An occurrence of the separator at the beginning or the end of the string is ignored.

Examples:

  # let regex = Re.compile (Re.char ',');;
  val regex : re = <abstr>

  # Re.split regex "Re,Ocaml,Jerome Vouillon";;
  - : string list = ["Re"; "Ocaml"; "Jerome Vouillon"]

  # Re.split regex "No commas in this sentence.";;
  - : string list = ["No commas in this sentence."]

  # Re.split regex ",1,2,";;
  - : string list = ["1"; "2"]

  # Re.split ~pos:3 regex "1,2,3,4. Commas go brrr.";;
  - : string list = ["3"; "4. Commas go brrr."]

Zero-length patterns:

Be careful when using split with zero-length patterns like eol, bow, and eow. Because these patterns have no width, they consume no characters, so the characters next to them remain in the result. (Note the position of the \n and space characters in the output.)

  # Re.split (Re.compile Re.eol) "a\nb";;
  - : string list = ["a"; "\nb"]

  # Re.split (Re.compile Re.bow) "a b";;
  - : string list = ["a "; "b"]

  # Re.split (Re.compile Re.eow) "a b";;
  - : string list = ["a"; " b"]

Compare this to the behavior of splitting on the char itself. (Note that the delimiters are not present in the output.)

  # Re.split (Re.compile (Re.char '\n')) "a\nb";;
  - : string list = ["a"; "b"]

  # Re.split (Re.compile (Re.char ' ')) "a b";;
  - : string list = ["a"; "b"]
val split_delim : ?pos:int -> ?len:int -> Re.re -> string -> string list @@ portable

split_delim re s splits s into chunks separated by re. It yields the chunks themselves, not the separator. Occurrences of the separator at the beginning or the end of the string will produce empty chunks.

Examples:

  # let regex = Re.compile (Re.char ',');;
  val regex : re = <abstr>

  # Re.split_delim regex "Re,Ocaml,Jerome Vouillon";;
  - : string list = ["Re"; "Ocaml"; "Jerome Vouillon"]

  # Re.split_delim regex "No commas in this sentence.";;
  - : string list = ["No commas in this sentence."]

  # Re.split_delim regex ",1,2,";;
  - : string list = [""; "1"; "2"; ""]

  # Re.split_delim ~pos:3 regex "1,2,3,4. Commas go brrr.";;
  - : string list = [""; "3"; "4. Commas go brrr."]

Zero-length patterns:

Be careful when using split_delim with zero-length patterns like eol, bow, and eow. Because these patterns have no width, they consume no characters, so the characters next to them remain in the result. (Note the position of the \n and space characters in the output.)

  # Re.split_delim (Re.compile Re.eol) "a\nb";;
  - : string list = ["a"; "\nb"; ""]

  # Re.split_delim (Re.compile Re.bow) "a b";;
  - : string list = [""; "a "; "b"]

  # Re.split_delim (Re.compile Re.eow) "a b";;
  - : string list = ["a"; " b"; ""]

Compare this to the behavior of splitting on the char itself. (Note that the delimiters are not present in the output.)

  # Re.split_delim (Re.compile (Re.char '\n')) "a\nb";;
  - : string list = ["a"; "b"]

  # Re.split_delim (Re.compile (Re.char ' ')) "a b";;
  - : string list = ["a"; "b"]
val split_gen : ?pos:int -> ?len:int -> Re.re -> string -> string Re.gen @@ portable
val split_seq : ?pos:int -> ?len:int -> Re.re -> string -> string Stdlib.Seq.t @@ portable
val split_full : ?pos:int -> ?len:int -> Re.re -> string -> Re.split_token list @@ portable

split_full re s splits s into chunks separated by re. It yields the chunks along with the separators. For instance this can be used with a whitespace-matching re such as "[\t ]+".

Examples:

  # let regex = Re.compile (Re.char ',');;
  val regex : re = <abstr>

  # Re.split_full regex "Re,Ocaml,Jerome Vouillon";;
  - : Re.split_token list =
    [`Text "Re"; `Delim <abstr>; `Text "Ocaml"; `Delim <abstr>;
    `Text "Jerome Vouillon"]

  # Re.split_full regex "No commas in this sentence.";;
  - : Re.split_token list = [`Text "No commas in this sentence."]

  # Re.split_full ~pos:3 regex "1,2,3,4. Commas go brrr.";;
  - : Re.split_token list =
    [`Delim <abstr>; `Text "3"; `Delim <abstr>; `Text "4. Commas go brrr."]
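
One way to consume the tokens is to map each one back to the text it covers. A minimal sketch (token_text and rebuild are hypothetical helper names, not part of the library):

```ocaml
let re = Re.compile (Re.char ',')

(* Text of a token: the chunk itself, or the matched delimiter (group 0). *)
let token_text = function
  | `Text s -> s
  | `Delim g -> Re.Group.get g 0

(* Concatenating all tokens gives back the input string. *)
let rebuild s = String.concat "" (List.map token_text (Re.split_full re s))

let () =
  assert (rebuild "Re,Ocaml,Jerome Vouillon" = "Re,Ocaml,Jerome Vouillon");
  assert (rebuild ",1,2," = ",1,2,")
```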
val split_full_gen : ?pos:int -> ?len:int -> Re.re -> string -> Re.split_token Re.gen @@ portable
val split_full_seq : ?pos:int -> ?len:int -> Re.re -> string -> Re.split_token Stdlib.Seq.t @@ portable
module Seq : sig ... end

String expressions (literal match)

val str : string -> Re.t @@ portable
val char : char -> Re.t @@ portable

Basic operations on regular expressions

val alt : Re.t list -> Re.t @@ portable

Alternative.

alt [] is equivalent to empty.

By default, the leftmost match is preferred (see match semantics below).

val seq : Re.t list -> Re.t @@ portable

Sequence

val empty : Re.t @@ portable

Match nothing

val epsilon : Re.t @@ portable

Empty word

val rep : Re.t -> Re.t @@ portable

0 or more matches

val rep1 : Re.t -> Re.t @@ portable

1 or more matches

val repn : Re.t -> int -> int option -> Re.t @@ portable

repn re i j matches re at least i times and at most j times, bounds included. j = None means no upper bound.

val opt : Re.t -> Re.t @@ portable

0 or 1 matches
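
A short sketch combining the basic operations above; the pattern (an optionally signed number of 1 to 3 digits, or the word "none") is made up for illustration, and set and whole_string are documented further down this page:

```ocaml
(* Optional sign followed by between 1 and 3 digits. *)
let number = Re.(seq [ opt (set "+-"); repn digit 1 (Some 3) ])
let re = Re.(compile (whole_string (alt [ number; str "none" ])))

let () =
  assert (Re.execp re "42");
  assert (Re.execp re "-007");
  assert (Re.execp re "none");
  assert (not (Re.execp re "1234")); (* more than 3 digits *)
  assert (not (Re.execp re "+")) (* at least one digit is required *)

let () =
  (* empty never matches, while epsilon matches the empty word. *)
  assert (not (Re.execp (Re.compile Re.empty) "abc"));
  assert (Re.execp (Re.compile Re.epsilon) "abc");
  (* alt [] behaves like empty. *)
  assert (not (Re.execp (Re.compile (Re.alt [])) "abc"))
```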

String, line, word

We define a word as a sequence of latin1 letters, digits and underscore.

val bol : Re.t @@ portable

Beginning of line

val eol : Re.t @@ portable

End of line

val bow : Re.t @@ portable

Beginning of word

val eow : Re.t @@ portable

End of word

val bos : Re.t @@ portable

Beginning of string. This differs from start because it matches the beginning of the input string even when using ~pos arguments:

  let b = execp (compile (seq [ bos; str "a" ])) "aa" ~pos:1 in
  assert (not b)
val eos : Re.t @@ portable

End of string. This is different from stop in the way described in bos.

val leol : Re.t @@ portable

Last end of line or end of string

val start : Re.t @@ portable

Initial position. This differs from bos because it takes into account the ~pos arguments:

  let b = execp (compile (seq [ start; str "a" ])) "aa" ~pos:1 in
  assert b
val stop : Re.t @@ portable

Final position. This is different from eos in the way described in start.
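
The same distinction can be sketched for eos and stop using the ~len argument (the regex names are illustrative):

```ocaml
let ends_with_eos = Re.compile Re.(seq [ str "a"; eos ])
let ends_with_stop = Re.compile Re.(seq [ str "a"; stop ])

let () =
  (* Only the first character of "ab" is searched: the search stops after
     "a", but the string itself does not end there. *)
  assert (not (Re.execp ~len:1 ends_with_eos "ab"));
  assert (Re.execp ~len:1 ends_with_stop "ab");
  (* Without ~len, the final position is the end of the string. *)
  assert (Re.execp ends_with_eos "ba");
  assert (Re.execp ends_with_stop "ba")
```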

val word : Re.t -> Re.t @@ portable

Word

val not_boundary : Re.t @@ portable

Not at a word boundary

val whole_string : Re.t -> Re.t @@ portable

Only matches the whole string, i.e. fun t -> seq [ bos; t; eos ].
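
A brief sketch of word, not_boundary and whole_string on made-up inputs:

```ocaml
let cat_word = Re.compile Re.(word (str "cat"))
let inner_cat = Re.compile Re.(seq [ not_boundary; str "cat"; not_boundary ])
let all_digits = Re.compile Re.(whole_string (rep1 digit))

let () =
  (* word requires "cat" to stand on its own. *)
  assert (Re.execp cat_word "a cat sat");
  assert (not (Re.execp cat_word "concatenate"));
  (* not_boundary requires word characters on both sides instead. *)
  assert (Re.execp inner_cat "concatenate");
  assert (not (Re.execp inner_cat "a cat sat"));
  (* whole_string rejects input with anything before or after the match. *)
  assert (Re.execp all_digits "2024");
  assert (not (Re.execp all_digits "2024-01"))
```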

Match semantics

A regular expression frequently matches a string in multiple ways. For instance exec (compile (opt (str "a"))) "ab" can match "" or "a". Match semantics can be modified with the functions below, allowing one to choose which of these is preferable.

By default, the leftmost branch of alternations is preferred, and repetitions are greedy.

Note that the existence of matches cannot be changed by specifying match semantics. seq [ bos; str "a"; non_greedy (opt (str "b")); eos ] will match when applied to "ab". However if seq [ bos; str "a"; non_greedy (opt (str "b")) ] is applied to "ab", it will match "a" rather than "ab".

Also note that multiple match semantics can conflict. In this case, the one executed earlier takes precedence. For instance, any match of shortest (seq [ bos; group (rep (str "a")); group (rep (str "a")); eos ]) will always have an empty first group. Conversely, if we use longest instead of shortest, the second group will always be empty.
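
The statements above can be checked with a small sketch (matched is a hypothetical helper that returns the text of group 0):

```ocaml
(* Hypothetical helper: text matched by the whole expression. *)
let matched re s = Re.Group.get (Re.exec (Re.compile re) s) 0

let () =
  (* Repetitions are greedy by default. *)
  assert (matched Re.(seq [ bos; str "a"; opt (str "b") ]) "ab" = "ab");
  (* non_greedy prefers the shorter alternative when both are possible... *)
  assert (matched Re.(seq [ bos; str "a"; non_greedy (opt (str "b")) ]) "ab" = "a");
  (* ...but cannot remove a match: with eos, "b" has to be consumed. *)
  assert (
    matched Re.(seq [ bos; str "a"; non_greedy (opt (str "b")); eos ]) "ab" = "ab")

(* Conflicting preferences: the repetition executed first decides. *)
let two_groups =
  Re.(seq [ bos; group (rep (str "a")); group (rep (str "a")); eos ])

let () =
  let g = Re.exec (Re.compile (Re.shortest two_groups)) "aaa" in
  assert (Re.Group.get g 1 = "" && Re.Group.get g 2 = "aaa");
  let g = Re.exec (Re.compile (Re.longest two_groups)) "aaa" in
  assert (Re.Group.get g 1 = "aaa" && Re.Group.get g 2 = "")
```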

val longest : Re.t -> Re.t @@ portable

Longest match semantics. That is, matches will match as many bytes as possible. If multiple choices match the maximum amount of bytes, the one respecting the inner match semantics is preferred.

val shortest : Re.t -> Re.t @@ portable

Same as longest, but matching the least number of bytes.

val first : Re.t -> Re.t @@ portable

First match semantics for alternations (not repetitions). That is, matches will prefer the leftmost branch of the alternation that matches the text.

val greedy : Re.t -> Re.t @@ portable

Greedy matches for repetitions (opt, rep, rep1, repn): they will match as many times as possible.

val non_greedy : Re.t -> Re.t @@ portable

Non-greedy matches for repetitions (opt, rep, rep1, repn): they will match as few times as possible.
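
A minimal sketch comparing these modifiers on an alternation and on a repetition (matched is a hypothetical helper, and the inputs are illustrative):

```ocaml
(* Hypothetical helper: text matched by the whole expression. *)
let matched re s = Re.Group.get (Re.exec (Re.compile re) s) 0

let choice = Re.(alt [ str "a"; str "ab" ])

let () =
  (* By default, and with first, the leftmost matching branch wins. *)
  assert (matched choice "ab" = "a");
  assert (matched (Re.first choice) "ab" = "a");
  (* longest picks the branch that consumes more bytes. *)
  assert (matched (Re.longest choice) "ab" = "ab");
  (* shortest picks the branch that consumes fewer bytes. *)
  assert (matched (Re.shortest Re.(alt [ str "ab"; str "a" ])) "ab" = "a");
  (* greedy and non_greedy apply to repetitions. *)
  assert (matched Re.(greedy (rep1 digit)) "2024" = "2024");
  assert (matched Re.(non_greedy (rep1 digit)) "2024" = "2")
```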

Groups (or submatches)

val group : ?name:string -> Re.t -> Re.t @@ portable

Delimit a group. The group is considered as matching if it is used at least once (it may be used multiple times if it is nested inside rep, for instance). If it is used multiple times, the last match is what gets captured.

val no_group : Re.t -> Re.t @@ portable

Remove all groups
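
A short sketch of group and no_group; the patterns, inputs and the group name "key" are illustrative assumptions:

```ocaml
let () =
  (* A group inside a repetition keeps only its last match. *)
  let g = Re.exec (Re.compile Re.(rep1 (group any))) "abc" in
  assert (Re.Group.get g 0 = "abc");
  assert (Re.Group.get g 1 = "c")

let () =
  (* A named group can be located through group_names. *)
  let re =
    Re.compile
      Re.(seq [ group ~name:"key" (rep1 alpha); char '='; group (rep1 digit) ])
  in
  let g = Re.exec re "width=80" in
  let idx = List.assoc "key" (Re.group_names re) in
  assert (Re.Group.get g idx = "width")

let () =
  (* no_group discards the groups of its argument; only group 0 remains. *)
  let re = Re.compile Re.(no_group (seq [ group (str "a"); group (str "b") ])) in
  assert (Re.group_count re = 1)
```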

val nest : Re.t -> Re.t @@ portable

When matching against nest e, only the group matching in the last match of e will be considered as matching.

For instance:

  let re = compile (rep1 (nest (alt [ group (str "a"); str "b" ]))) in
  let group = Re.exec re "ab" in
  assert (Group.get_opt group 1 = None);
  (* same thing but without [nest] *)
  let re = compile (rep1 (alt [ group (str "a"); str "b" ])) in
  let group = Re.exec re "ab" in
  assert (Group.get_opt group 1 = Some "a")
val mark : Re.t -> Re.Mark.t * Re.t @@ portable

Mark a regexp. The returned mark can then be used to know if this regexp was used.
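
A minimal sketch of marking one branch of an alternation and testing it with Mark.test (the names mark_a and a are illustrative):

```ocaml
(* mark returns the mark together with the marked expression. *)
let mark_a, a = Re.mark (Re.str "a")
let re = Re.compile (Re.alt [ a; Re.str "b" ])

let () =
  (* The mark is set only when the marked branch took part in the match. *)
  assert (Re.Mark.test (Re.exec re "a") mark_a);
  assert (not (Re.Mark.test (Re.exec re "b") mark_a))
```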

Character sets

val set : string -> Re.t @@ portable

Any character of the string

val rg : char -> char -> Re.t @@ portable

Character ranges

val inter : Re.t list -> Re.t @@ portable

Intersection of character sets

val diff : Re.t -> Re.t -> Re.t @@ portable

Difference of character sets

val compl : Re.t list -> Re.t @@ portable

Complement of union
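
A sketch of building character sets from the operations above; the set names and the helper only are hypothetical, and alpha, digit and xdigit are the predefined sets listed below:

```ocaml
let lowercase = Re.rg 'a' 'z'
let vowel = Re.set "aeiou"
let consonant = Re.diff lowercase vowel (* lowercase letters except vowels *)
let hex_letter = Re.inter [ Re.alpha; Re.xdigit ] (* a-f and A-F *)
let non_digit = Re.compl [ Re.digit ] (* anything but 0-9 *)

(* Hypothetical helper: the whole string consists of characters from cs. *)
let only cs = Re.compile (Re.whole_string (Re.rep1 cs))

let () =
  assert (Re.execp (only consonant) "rhythm");
  assert (not (Re.execp (only consonant) "rhyme"));
  assert (Re.execp (only hex_letter) "cafe");
  assert (not (Re.execp (only hex_letter) "cafe1"));
  assert (Re.execp (only non_digit) "n/a");
  assert (not (Re.execp (only non_digit) "n/a 1"))
```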

Predefined character sets

val any : Re.t @@ portable

Any character

val notnl : Re.t @@ portable

Any character but a newline

val alnum : Re.t @@ portable
val wordc : Re.t @@ portable
val alpha : Re.t @@ portable
val ascii : Re.t @@ portable
val blank : Re.t @@ portable
val cntrl : Re.t @@ portable
val digit : Re.t @@ portable
val graph : Re.t @@ portable
val lower : Re.t @@ portable
val print : Re.t @@ portable
val punct : Re.t @@ portable
val space : Re.t @@ portable
val upper : Re.t @@ portable
val xdigit : Re.t @@ portable

Case modifiers

val case : Re.t -> Re.t @@ portable

Case sensitive matching. Note that this works on latin1, not ascii and not utf8.

val no_case : Re.t -> Re.t @@ portable

Case insensitive matching. Note that this works on latin1, not ascii and not utf8.
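
A small sketch showing that case can restore case sensitivity inside a no_case expression (the pattern is illustrative):

```ocaml
(* "id" is matched case-insensitively, the trailing "X" case-sensitively. *)
let re = Re.compile Re.(whole_string (no_case (seq [ str "id"; case (str "X") ])))

let () =
  assert (Re.execp re "idX");
  assert (Re.execp re "IDX");
  assert (not (Re.execp re "idx"))
```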

Internal debugging

val pp : Stdlib.Format.formatter -> Re.t -> unit @@ portable
val pp_re : Stdlib.Format.formatter -> Re.re -> unit @@ portable
val print_re : Stdlib.Format.formatter -> Re.re -> unit @@ portable

Alias for pp_re. Deprecated

Experimental functions

val witness : Re.t -> string @@ portable

witness r generates a string s such that execp (compile r) s is true.

Be warned that this function is buggy because it ignores zero-width assertions like beginning of words. As a result it can generate incorrect results.
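
A minimal sketch of the property stated above, on a pattern without zero-width assertions (the pattern is illustrative, and the exact witness string is deliberately not assumed):

```ocaml
let r = Re.(seq [ str "id-"; rep1 digit ])

(* Some string accepted by r. *)
let s = Re.witness r

let () = assert (Re.execp (Re.compile r) s)
```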

Deprecated functions

type substrings = Re.Group.t

Alias for Group.t. Deprecated

  • deprecated Use Group.t
val get : Re.Group.t -> int -> string @@ portable

Same as Group.get. Deprecated

  • deprecated Use Group.get
val get_ofs : Re.Group.t -> int -> int * int @@ portable

Same as Group.offset. Deprecated

  • deprecated Use Group.offset
val get_all : Re.Group.t -> string array @@ portable

Same as Group.all. Deprecated

  • deprecated Use Group.all
val get_all_ofs : Re.Group.t -> (int * int) array @@ portable

Same as Group.all_offset. Deprecated

  • deprecated Use Group.all_offset
val test : Re.Group.t -> int -> bool @@ portable

Same as Group.test. Deprecated

  • deprecated Use Group.test
type markid = Re.Mark.t

Alias for Mark.t. Deprecated

  • deprecated Use Mark.
val marked : Re.Group.t -> Re.Mark.t -> bool @@ portable

Same as Mark.test. Deprecated

  • deprecated Use Mark.test
val mark_set : Re.Group.t -> Re.Mark.Set.t @@ portable

Same as Mark.all. Deprecated

  • deprecated Use Mark.all
module Stream : sig ... end

An experimental module for matching a regular expression by feeding individual string chunks.

val replace : ?pos:int -> ?len:int -> ?all:bool -> Re__.Compile.re -> f:(Group.t -> string) -> string -> string @@ portable

replace ~all re ~f s iterates on s, and replaces every occurrence of re with f substring where substring is the current match. If all = false, then only the first occurrence of re is replaced.
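
A short sketch of replace with a function that rewrites each match (bracket is a hypothetical helper):

```ocaml
let re = Re.compile Re.(rep1 digit)

(* Hypothetical rewriting function: wrap the matched text in brackets. *)
let bracket g = "<" ^ Re.Group.get g 0 ^ ">"

let () =
  (* Every occurrence is rewritten unless ~all:false is given. *)
  assert (Re.replace re ~f:bracket "a1b22" = "a<1>b<22>");
  assert (Re.replace ~all:false re ~f:bracket "a1b22" = "a<1>b22")
```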

val replace_string : ?pos:int -> ?len:int -> ?all:bool -> Re__.Compile.re -> by:string -> string -> string @@ portable

replace_string ~all re ~by s iterates on s, and replaces every occurrence of re with by. If all = false, then only the first occurrence of re is replaced.

Examples:

  # let regex = Re.compile (Re.char ',');;
  val regex : re = <abstr>

  # Re.replace_string regex ~by:";" "[1,2,3,4,5,6,7]";;
  - : string = "[1;2;3;4;5;6;7]"

  # Re.replace_string regex ~all:false ~by:";" "[1,2,3,4,5,6,7]";;
  - : string = "[1;2,3,4,5,6,7]"
module View : sig ... end

A view of the top-level of a regex. This type is unstable and may change.

module Emacs : sig ... end

Emacs-style regular expressions

module Glob : sig ... end

Shell-style regular expressions

module Perl : sig ... end

Perl-style regular expressions

module Pcre : sig ... end

NOTE: Only a subset of the PCRE spec is supported

module Posix : sig ... end

References:

module Str : sig ... end

Module Str: regular expressions and high-level string processing
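
The syntax modules listed above produce ordinary Re.t values, so they combine with the functions on this page. A sketch under the assumption that Re.Perl.re, Re.Posix.re and Re.Glob.glob parse a string into a Re.t (their signatures are not shown on this page):

```ocaml
(* The same language written in Perl syntax, POSIX syntax, and combinators. *)
let perl = Re.compile (Re.whole_string (Re.Perl.re "\\d+-\\d+"))
let posix = Re.compile (Re.whole_string (Re.Posix.re "[0-9]+-[0-9]+"))
let direct =
  Re.compile Re.(whole_string (seq [ rep1 digit; char '-'; rep1 digit ]))

let () =
  List.iter
    (fun re ->
      assert (Re.execp re "10-20");
      assert (not (Re.execp re "10-")))
    [ perl; posix; direct ]

(* A shell-style pattern, anchored to the whole string. *)
let ml_file = Re.compile (Re.whole_string (Re.Glob.glob "*.ml"))

let () =
  assert (Re.execp ml_file "main.ml");
  assert (not (Re.execp ml_file "main.mli"))
```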