Diffstat (limited to 'tde-i18n-nl/docs/kdebase/kate/regular-expressions.docbook')
-rw-r--r-- | tde-i18n-nl/docs/kdebase/kate/regular-expressions.docbook | 658 |
1 files changed, 0 insertions, 658 deletions
diff --git a/tde-i18n-nl/docs/kdebase/kate/regular-expressions.docbook b/tde-i18n-nl/docs/kdebase/kate/regular-expressions.docbook deleted file mode 100644 index 860f2e36c8f..00000000000 --- a/tde-i18n-nl/docs/kdebase/kate/regular-expressions.docbook +++ /dev/null @@ -1,658 +0,0 @@ -<appendix id="regular-expressions"> - -<title>Regular Expressions</title> - -<synopsis> This Appendix contains a brief but hopefully sufficient -introduction to the world of <emphasis>regular -expressions</emphasis>. It documents regular expressions in the form -available within &kate;, which is not compatible with the regular -expressions of Perl, nor with those of, for example, -<command>grep</command>.</synopsis> - -<sect1> - -<title>Introduction</title> - -<para><emphasis>Regular Expressions</emphasis> provide us with a way -to describe the possible contents of a text string in a form -understood by a small piece of software, so that it can check whether -a text matches and, in the case of advanced applications, -save pieces of the matching text.</para> - -<para>An example: Say you want to search a text for paragraphs that -start with either of the names <quote>Henrik</quote> or -<quote>Pernille</quote> followed by some form of the verb -<quote>say</quote>.</para> - -<para>With a normal search, you would start out searching for the -first name, <quote>Henrik</quote>, maybe followed by <quote>sa</quote>, -like this: <userinput>Henrik sa</userinput>, and while looking for -matches, you would have to discard those not at the beginning of a -paragraph, as well as those in which the word starting with the -letters <quote>sa</quote> was not <quote>says</quote>, -<quote>said</quote> or similar. 
And then of course repeat all of that with -the next name...</para> - -<para>With Regular Expressions, that task can be accomplished with a -single search, and with greater precision.</para> - -<para>To achieve this, Regular Expressions define rules for -expressing in detail a generalization of a string to match. Our -example, which we might literally express like this: <quote>A line -starting with either <quote>Henrik</quote> or <quote>Pernille</quote> -(possibly following up to 4 blanks or tab characters) followed by a -space followed by <quote>sa</quote> and then either -<quote>ys</quote> or <quote>id</quote></quote> could be expressed with -the following regular expression:</para> <para><userinput>^[ -\t]{0,4}(Henrik|Pernille) sa(ys|id)</userinput></para> - -<para>The above example demonstrates all four major concepts of modern -Regular Expressions, namely:</para> - -<itemizedlist> -<listitem><para>Patterns</para></listitem> -<listitem><para>Assertions</para></listitem> -<listitem><para>Quantifiers</para></listitem> -<listitem><para>Back references</para></listitem> -</itemizedlist> - -<para>The caret (<literal>^</literal>) starting the expression is an -assertion, being true only if the following matching string is at the -start of a line.</para> - -<para>The strings <literal>[ \t]</literal> and -<literal>(Henrik|Pernille) sa(ys|id)</literal> are patterns. 
The first -one is a <emphasis>character class</emphasis> that matches either a -blank or a (horizontal) tab character; the other pattern contains -first a subpattern matching either <literal>Henrik</literal> -<emphasis>or</emphasis> <literal>Pernille</literal>, then a piece -matching the exact string <literal> sa</literal> and finally a -subpattern matching either <literal>ys</literal> -<emphasis>or</emphasis> <literal>id</literal>.</para> - -<para>The string <literal>{0,4}</literal> is a quantifier saying -<quote>anywhere from 0 up to 4 of the previous</quote>.</para> - -<para>Because regular expression software supporting the concept of -<emphasis>back references</emphasis> saves the entire matching part of -the string as well as sub-patterns enclosed in parentheses, given some -means of access to those references, we could get our hands on -the whole match (when searching a text document in an editor with a -regular expression, that is often marked as selected), the -name found, or the last part of the verb.</para> - -<para>All together, the expression will match where we wanted it to, -and only there.</para> - -<para>The following sections will describe in detail how to construct -and use patterns, character classes, assertions, quantifiers and -back references, and the final section will give a few useful -examples.</para> - -</sect1> - -<sect1 id="regex-patterns"> - -<title>Patterns</title> - -<para>Patterns consist of literal strings and character -classes. Patterns may contain sub-patterns, which are patterns enclosed -in parentheses.</para> - -<sect2> -<title>Escaping characters</title> - -<para>In patterns as well as in character classes, some characters -have a special meaning. 
To literally match any of those characters, -they must be marked or <emphasis>escaped</emphasis> to let the regular -expression software know that it should interpret such characters in -their literal meaning.</para> - -<para>This is done by prepending a backslash -(<literal>\</literal>) to the character.</para> - - -<para>The regular expression software will silently ignore escaping a -character that does not have any special meaning in the context, so -escaping for example a <quote>j</quote> (<userinput>\j</userinput>) is -safe. If you are in doubt whether a character could have a special -meaning, you can therefore escape it safely.</para> - -<para>Escaping of course includes the backslash character itself: to -literally match a backslash, you would write -<userinput>\\</userinput>.</para> - -</sect2> - -<sect2> -<title>Character Classes and abbreviations</title> - -<para>A <emphasis>character class</emphasis> is an expression that -matches one of a defined set of characters. In Regular Expressions, -character classes are defined by putting the legal characters for the -class in square brackets, <literal>[]</literal>, or by using one of -the abbreviated classes described below.</para> - -<para>Simple character classes just contain one or more literal -characters, for example <userinput>[abc]</userinput> (matching either -of the letters <quote>a</quote>, <quote>b</quote> or <quote>c</quote>) -or <userinput>[0123456789]</userinput> (matching any digit).</para> - -<para>Because letters and digits have a logical order, you can -abbreviate those by specifying ranges of them: -<userinput>[a-c]</userinput> is equal to <userinput>[abc]</userinput> -and <userinput>[0-9]</userinput> is equal to -<userinput>[0123456789]</userinput>. 
Combining these constructs, for -example <userinput>[a-fynot1-38]</userinput>, is completely legal (this -class would match, of course, any of -<quote>a</quote>,<quote>b</quote>,<quote>c</quote>,<quote>d</quote>, -<quote>e</quote>,<quote>f</quote>,<quote>y</quote>,<quote>n</quote>,<quote>o</quote>,<quote>t</quote>, -<quote>1</quote>,<quote>2</quote>,<quote>3</quote> or -<quote>8</quote>).</para> - -<para>As capital letters are different characters from their -non-capital equivalents, to create a caseless character class matching -<quote>a</quote> or <quote>b</quote>, in any case, you need to write it as -<userinput>[aAbB]</userinput>.</para> - -<para>It is of course possible to create a <quote>negative</quote> -class matching <quote>anything but</quote>. To do so, put a caret -(<literal>^</literal>) at the beginning of the class: </para> - -<para><userinput>[^abc]</userinput> will match any character -<emphasis>but</emphasis> <quote>a</quote>, <quote>b</quote> or -<quote>c</quote>.</para> - -<para>In addition to literal characters, some abbreviations are -defined, making life still a bit easier: - -<variablelist> - -<varlistentry> -<term><userinput>\a</userinput></term> -<listitem><para> This matches the <acronym>ASCII</acronym> bell character (BEL, 0x07).</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\f</userinput></term> -<listitem><para> This matches the <acronym>ASCII</acronym> form feed character (FF, 0x0C).</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\n</userinput></term> -<listitem><para> This matches the <acronym>ASCII</acronym> line feed character (LF, 0x0A, Unix newline).</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\r</userinput></term> -<listitem><para> This matches the <acronym>ASCII</acronym> carriage return character (CR, 0x0D).</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\t</userinput></term> -<listitem><para> This matches the 
<acronym>ASCII</acronym> horizontal tab character (HT, 0x09).</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\v</userinput></term> -<listitem><para> This matches the <acronym>ASCII</acronym> vertical tab character (VT, 0x0B).</para></listitem> -</varlistentry> -<varlistentry> -<term><userinput>\xhhhh</userinput></term> - -<listitem><para> This matches the Unicode character corresponding to -the hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo (i.e., -\zero ooo) matches the <acronym>ASCII</acronym>/Latin-1 character -corresponding to the octal number ooo (between 0 and -0377).</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>.</userinput> (dot)</term> -<listitem><para> This matches any character (including newline).</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\d</userinput></term> -<listitem><para> This matches a digit. Equal to <literal>[0-9]</literal></para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\D</userinput></term> -<listitem><para> This matches a non-digit. Equal to <literal>[^0-9]</literal> or <literal>[^\d]</literal></para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\s</userinput></term> -<listitem><para> This matches a whitespace character. Practically equal to <literal>[ \t\n\r]</literal></para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\S</userinput></term> -<listitem><para> This matches a non-whitespace character. Practically equal to <literal>[^ \t\r\n]</literal>, and equal to <literal>[^\s]</literal></para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\w</userinput></term> -<listitem><para>Matches any <quote>word character</quote> - in this case any letter or digit. Note that the -underscore (<literal>_</literal>) is not matched, unlike in Perl regular expressions, where it is. 
-Equal to <literal>[a-zA-Z0-9]</literal></para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\W</userinput></term> -<listitem><para>Matches any non-word character - anything but letters or digits. -Equal to <literal>[^a-zA-Z0-9]</literal> or <literal>[^\w]</literal></para></listitem> -</varlistentry> - - -</variablelist> - -</para> - -<para>The abbreviated classes can be put inside a custom class. For -example, to match a word character, a blank or a dot, you could write -<userinput>[\w \.]</userinput>.</para> - -<note> <para>The POSIX notation of classes, <userinput>[:<class -name>:]</userinput>, is currently not supported.</para> </note> - -<sect3> -<title>Characters with special meanings inside character classes</title> - -<para>The following characters have a special meaning inside the -<quote>[]</quote> character class construct, and must be escaped to be -literally included in a class:</para> - -<variablelist> -<varlistentry> -<term><userinput>]</userinput></term> -<listitem><para>Ends the character class. Must be escaped unless it is the very first character in the -class (it may follow an unescaped caret).</para></listitem> -</varlistentry> -<varlistentry> -<term><userinput>^</userinput> (caret)</term> -<listitem><para>Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.</para></listitem> -</varlistentry> -<varlistentry> -<term><userinput>-</userinput> (dash)</term> -<listitem><para>Denotes a logical range. Must always be escaped within a character class.</para></listitem> -</varlistentry> -<varlistentry> -<term><userinput>\</userinput> (backslash)</term> -<listitem><para>The escape character. 
Must always be escaped.</para></listitem> -</varlistentry> - -</variablelist> - -</sect3> - -</sect2> - -<sect2> - -<title>Alternatives: matching <quote>one of</quote></title> - -<para>If you want to match one of a set of alternative patterns, you -can separate those with <literal>|</literal> (vertical bar character).</para> - -<para>For example, to find either <quote>John</quote> or <quote>Harry</quote> you would use the expression <userinput>John|Harry</userinput>.</para> - -</sect2> - -<sect2> - -<title>Sub Patterns</title> - -<para><emphasis>Sub patterns</emphasis> are patterns enclosed in -parentheses, and they have several uses in the world of regular -expressions.</para> - -<sect3> - -<title>Specifying alternatives</title> - -<para>You may use a sub pattern to group a set of alternatives within -a larger pattern. The alternatives are separated by the character -<quote>|</quote> (vertical bar).</para> - -<para>For example, to match either of the words <quote>int</quote>, -<quote>float</quote> or <quote>double</quote>, you could use the -pattern <userinput>int|float|double</userinput>. If you only want to -find one if it is followed by some whitespace and then some letters, -put the alternatives inside a subpattern: -<userinput>(int|float|double)\s+\w+</userinput>.</para> - -</sect3> - -<sect3> - -<title>Capturing matching text (back references)</title> - -<para>If you want to use a back reference, use a sub pattern to have -the desired part of the pattern remembered.</para> - -<para>For example, if you want to find two occurrences of the same -word separated by a comma and possibly some whitespace, you could -write <userinput>(\w+),\s*\1</userinput>. The sub pattern -<literal>\w+</literal> would find a chunk of word characters, and the -entire expression would match if those were followed by a comma, 0 or -more whitespace characters and then an equal chunk of word characters. 
(The -string <literal>\1</literal> references <emphasis>the first sub pattern -enclosed in parentheses</emphasis>)</para> - -<para>See also <link linkend="backreferences">Back references</link>.</para> - -</sect3> - -<sect3 id="lookahead-assertions"> -<title>Lookahead Assertions</title> - -<para>A lookahead assertion is a sub pattern, starting with either -<literal>?=</literal> or <literal>?!</literal>.</para> - -<para>For example to match the literal string <quote>Bill</quote> but -only if not followed by <quote> Gates</quote>, you could use this -expression: <userinput>Bill(?! Gates)</userinput>. (This would find -<quote>Bill Clinton</quote> as well as <quote>Billy the kid</quote>, -but silently ignore the other matches.)</para> - -<para>Sub patterns used for assertions are not captured.</para> - -<para>See also <link linkend="assertions">Assertions</link></para> - -</sect3> - -</sect2> - -<sect2 id="special-characters-in-patterns"> -<title>Characters with a special meaning inside patterns</title> - -<para>The following characters have meaning inside a pattern, and -must be escaped if you want to literally match them: - -<variablelist> - -<varlistentry> -<term><userinput>\</userinput> (backslash)</term> -<listitem><para>The escape character.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>^</userinput> (caret)</term> -<listitem><para>Asserts the beginning of the string.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>$</userinput></term> -<listitem><para>Asserts the end of string.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>()</userinput> (left and right parentheses)</term> -<listitem><para>Denotes sub patterns.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>{}</userinput> (left and right curly braces)</term> -<listitem><para>Denotes numeric quantifiers.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>[]</userinput> (left and right square 
brackets)</term> -<listitem><para>Denotes character classes.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>|</userinput> (vertical bar)</term> -<listitem><para>Logical OR. Separates alternatives.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>+</userinput> (plus sign)</term> -<listitem><para>Quantifier, 1 or more.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>*</userinput> (asterisk)</term> -<listitem><para>Quantifier, 0 or more.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>?</userinput> (question mark)</term> -<listitem><para>An optional character. Can be interpreted as a quantifier, 0 or 1.</para></listitem> -</varlistentry> - -</variablelist> - -</para> - -</sect2> - -</sect1> - -<sect1 id="quantifiers"> -<title>Quantifiers</title> - -<para><emphasis>Quantifiers</emphasis> allow a regular expression to -match a specified number or range of numbers of either a character, -character class or sub pattern.</para> - -<para>Quantifiers are enclosed in curly brackets (<literal>{</literal> -and <literal>}</literal>) and have the general form -<literal>{[minimum-occurrences][,[maximum-occurrences]]}</literal> -</para> - -<para>The usage is best explained by example: - -<variablelist> - -<varlistentry> -<term><userinput>{1}</userinput></term> -<listitem><para>Exactly 1 occurrence</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>{0,1}</userinput></term> -<listitem><para>Zero or 1 occurrence</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>{,1}</userinput></term> -<listitem><para>The same, with less work;)</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>{5,10}</userinput></term> -<listitem><para>At least 5 and at most 10 occurrences.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>{5,}</userinput></term> -<listitem><para>At least 5 occurrences, no maximum.</para></listitem> 
-</varlistentry> - -</variablelist> - -</para> - -<para>Additionally, there are some abbreviations: - -<variablelist> - -<varlistentry> -<term><userinput>*</userinput> (asterisk)</term> -<listitem><para>Similar to <literal>{0,}</literal>, find any number of occurrences.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>+</userinput> (plus sign)</term> -<listitem><para>Similar to <literal>{1,}</literal>, at least 1 occurrence.</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>?</userinput> (question mark)</term> -<listitem><para>Similar to <literal>{0,1}</literal>, zero or 1 occurrence.</para></listitem> -</varlistentry> - -</variablelist> - -</para> - -<sect2> - -<title>Greed</title> - -<para>When using quantifiers with no maximum, regular expressions -default to matching as much of the searched string as possible, commonly -known as <emphasis>greedy</emphasis> behavior.</para> - -<para>Modern regular expression software provides the means of -<quote>turning off greediness</quote>, though in a graphical -environment it is up to the interface to provide you with access to -this feature. 
For example, a search dialog providing a regular -expression search could have a check box labeled <quote>Minimal -matching</quote>, and it ought to indicate whether greediness is the -default behavior.</para> - -</sect2> - -<sect2> -<title>In context examples</title> - -<para>Here are a few examples of using quantifiers:</para> - -<variablelist> - -<varlistentry> -<term><userinput>^\d{4,5}\s</userinput></term> -<listitem><para>Matches the digits in <quote>1234 go</quote> and <quote>12345 now</quote>, but neither in <quote>567 eleven</quote> -nor in <quote>223459 somewhere</quote></para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\s+</userinput></term> -<listitem><para>Matches one or more whitespace characters</para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>(bla){1,}</userinput></term> -<listitem><para>Matches all of <quote>blablabla</quote> and the <quote>bla</quote> in <quote>blackbird</quote> or <quote>tabla</quote></para></listitem> -</varlistentry> - -<varlistentry> -<term><userinput>/?></userinput></term> -<listitem><para>Matches <quote>/></quote> in <quote><closeditem/></quote> as well as -<quote>></quote> in <quote><openitem></quote>.</para></listitem> -</varlistentry> - -</variablelist> - -</sect2> - -</sect1> - -<sect1 id="assertions"> -<title>Assertions</title> - -<para><emphasis>Assertions</emphasis> allow a regular expression to -match only under certain controlled conditions.</para> - -<para>An assertion does not need a character to match; rather, it -investigates the surroundings of a possible match before acknowledging -it. For example, the <emphasis>word boundary</emphasis> assertion does -not try to find a non word character opposite a word one at its -position; instead it makes sure that there is not a word -character. This means that the assertion can match where there is no -character, i.e. 
at the ends of a searched string.</para> - -<para>Some assertions actually do have a pattern to match, but the -part of the string matching that will not be a part of the result of -the match of the full expression.</para> - -<para>Regular Expressions as documented here support the following -assertions: - -<variablelist> - -<varlistentry> -<term><userinput>^</userinput> (caret: beginning of -string)</term> -<listitem><para>Matches the beginning of the searched -string.</para> <para>The expression <userinput>^Peter</userinput> will -match at <quote>Peter</quote> in the string <quote>Peter, hey!</quote> -but not in <quote>Hey, Peter!</quote> </para> </listitem> -</varlistentry> - -<varlistentry> -<term><userinput>$</userinput> (end of string)</term> -<listitem><para>Matches the end of the searched string.</para> - -<para>The expression <userinput>you\?$</userinput> will match at the -last <quote>you</quote> in the string <quote>You didn't do that, did you?</quote> but -nowhere in <quote>You didn't do that, right?</quote></para> - -</listitem> -</varlistentry> - -<varlistentry> -<term><userinput>\b</userinput> (word boundary)</term> -<listitem><para>Matches if there is a word character at one side and not a word character at the -other.</para> -<para>This is useful for finding word ends, for example both ends to find -a whole word. 
The expression <userinput>\bin\b</userinput> will match -at the separate <quote>in</quote> in the string <quote>He came in -through the window</quote>, but not at the <quote>in</quote> in -<quote>window</quote>.</para></listitem> - -</varlistentry> - -<varlistentry> -<term><userinput>\B</userinput> (non word boundary)</term> -<listitem><para>Matches wherever <quote>\b</quote> does not.</para> -<para>That means that it will match, for example, within words: The expression -<userinput>\Bin\B</userinput> will match at the <quote>in</quote> in <quote>window</quote> but not in <quote>integer</quote> or <quote>I'm in love</quote>.</para> -</listitem> -</varlistentry> - -<varlistentry> -<term><userinput>(?=PATTERN)</userinput> (Positive lookahead)</term> -<listitem><para>A lookahead assertion looks at the part of the string following a possible match. -The positive lookahead will prevent the string from matching if the text following the possible match -does not match the <emphasis>PATTERN</emphasis> of the assertion, but the text matched by that will -not be included in the result.</para> -<para>The expression <userinput>handy(?=\w)</userinput> will match at <quote>handy</quote> in -<quote>handyman</quote> but not in <quote>That came in handy!</quote></para> -</listitem> -</varlistentry> - -<varlistentry> -<term><userinput>(?!PATTERN)</userinput> (Negative lookahead)</term> - -<listitem><para>The negative lookahead prevents a possible match from being -acknowledged if the following part of the searched string does match -its <emphasis>PATTERN</emphasis>.</para> -<para>The expression <userinput>const \w+\b(?!\s*&)</userinput> -will match at <quote>const char</quote> in the string <quote>const -char* foo</quote> while it cannot match <quote>const QString</quote> -in <quote>const QString& bar</quote> because the -<quote>&</quote> matches the negative lookahead assertion -pattern.</para> -</listitem> -</varlistentry> - -</variablelist> - -</para> - -</sect1> - -<sect1 id="backreferences"> - 
-<title>Back References</title> - -<para></para> - -</sect1> - -</appendix> |
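The patterns walked through in the deleted appendix can be tried outside the editor. The following is a minimal sketch that uses Python's `re` module, which is an assumption of this example and not the engine Kate uses; the two differ in details (for instance, the appendix states that Kate's `\w` does not match the underscore, while Python's does). The helper name `find_speakers` and the sample strings are hypothetical.

```python
import re

# The introductory pattern from the appendix, copied unchanged.
# re.MULTILINE makes ^ match at the start of every line, mirroring the
# "start of a line" reading used in the text (Python-specific assumption).
INTRO = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)", re.MULTILINE)


def find_speakers(text):
    """Return (name, verb ending) pairs for each line the pattern matches."""
    return [(m.group(1), m.group(2)) for m in INTRO.finditer(text)]


# Back reference: the same word twice, separated by a comma and
# optional whitespace.
REPEATED = re.compile(r"(\w+),\s*\1")

# Negative lookahead: "Bill" unless it is followed by " Gates".
NOT_GATES = re.compile(r"Bill(?! Gates)")

sample = "Henrik says hello\n  Pernille said no\nThen Henrik said more"
print(find_speakers(sample))
print(bool(REPEATED.search("again, again")))
print([m.group(0) for m in NOT_GATES.finditer("Bill Gates, Bill Clinton, Billy the kid")])
```

The captured groups returned by `find_speakers` correspond to what the appendix calls back references: the name found and the last part of the verb. The third line of `sample` is not reported because the name does not appear at the start of the line.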