The TDEStringMatcher class provides string matching against a list of one
or more match patterns along with associated options. A single pattern with
its associated options will be referred to herein as a "match specification".

Current match specification options include:

  * Type of match pattern:

      REGEX: Pattern is a regular expression.
      WILDCARD: Pattern is a wildcard expression like that used
        in POSIX shell file globbing.
      SUBSTRING: Pattern is a simple substring that matches any
        string in which it occurs. Substring characters do not
        have any other meaning that controls matching.

  * Alphanumeric character handling in a pattern:

      NONE: each unescaped alphanumeric character in a pattern
        is distinct and will match only itself.
      CASE INSENSITIVE: each unescaped letter in a pattern
        will match its lower and upper case variants.
      EQUIVALENCE: Each unescaped variant of an alphanumeric
        character will match all stylistic and accented
        variations of that character.

  * Desired outcome of matching:

      TRUE: match succeeds if a string matches the match pattern.
      FALSE: match succeeds if a string does NOT match the match pattern.

Applications may set and get match specification lists either directly or
indirectly (using an encoded match specifications string). The matching
functions provided are:

  matchAny(): strings match if they match any pattern in the list.
  matchAll(): strings match only if they match all patterns in the list.
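
The interaction between the TRUE/FALSE outcome option and the two matching
functions can be sketched as follows. This is a minimal Python model and not
the TDEStringMatcher API: the names spec_succeeds, match_any and match_all
are hypothetical, and only substring patterns are modelled.

```python
from typing import List, Tuple

# Hypothetical model of a match specification: (substring pattern, wanted outcome).
# Only SUBSTRING patterns are modelled to keep the sketch short.
Spec = Tuple[str, bool]


def spec_succeeds(spec: Spec, text: str) -> bool:
    """A specification succeeds when the hit/miss result equals the wanted outcome."""
    pattern, want_match = spec
    return (pattern in text) == want_match


def match_any(specs: List[Spec], text: str) -> bool:
    """True if at least one specification succeeds."""
    return any(spec_succeeds(s, text) for s in specs)


def match_all(specs: List[Spec], text: str) -> bool:
    """True only if every specification succeeds."""
    return all(spec_succeeds(s, text) for s in specs)


# "must contain foo" and "must NOT contain bar"
specs = [("foo", True), ("bar", False)]
print(match_any(specs, "foobar"), match_all(specs, "foobar"))
```

For "foobar" the first specification succeeds and the second (inverted) one
fails, so the any-variant reports a match while the all-variant does not.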

MATCH SPECIFICATIONS STRING

The TDEStringMatcher class provides applications an encoded match
specifications string solely intended to be used for storing and retrieving
match specifications. These strings are formatted as follows:

  OptionString <Tab> PatternString [ <Tab> OptionString <Tab> PatternString ...]

Option strings may contain only the following characters:

  'r' - Match pattern is a regular expression [default]
  'w' - Match pattern is a wildcard expression
  's' - Match pattern is a simple substring
  'c' - Letter case variants are distinct (i.e. case-sensitive) [default]
  'i' - Letter case variants are equivalent (i.e. case-insensitive)
  'e' - All letter & number character variants are equivalent
  '=' - Match succeeds if pattern matches [default]
  '!' - Match succeeds if pattern does NOT match (inverted match)

Option strings should ideally contain exactly 3 characters indicating match
pattern type, alphanumeric character handling, and desired outcome of matching.
Specifying fewer option characters is possible but may result in unexpected
inferred values. Specifying additional and possibly contradictory option
characters is also possible, with later characters overriding earlier ones.
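
One way to picture the "later characters override earlier ones" rule is the
Python sketch below. It is an illustration only: parse_options is a
hypothetical name, missing characters simply fall back to the [default]
values listed above (the class itself may infer something different), and
raising an error on an unknown character is an assumption.

```python
def parse_options(option_string: str) -> dict:
    """Decode an option string into a dict of three settings."""
    # Start from the documented defaults: regex, case-sensitive, match wanted.
    opts = {"type": "regex", "handling": "none", "want_match": True}
    table = {
        "r": ("type", "regex"),
        "w": ("type", "wildcard"),
        "s": ("type", "substring"),
        "c": ("handling", "none"),
        "i": ("handling", "case_insensitive"),
        "e": ("handling", "equivalence"),
        "=": ("want_match", True),
        "!": ("want_match", False),
    }
    for ch in option_string:
        if ch not in table:
            # Assumption: unknown option characters are treated as an error.
            raise ValueError(f"unknown option character: {ch!r}")
        key, value = table[ch]
        opts[key] = value  # a later character overrides an earlier one
    return opts


print(parse_options("wc="))
print(parse_options("ri!w"))  # the trailing 'w' overrides the leading 'r'
```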

Pattern strings may not be empty. Invalid pattern strings will cause the
entire match specifications string to be rejected.
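
The all-or-nothing rejection described above can be sketched in Python as
follows. The function name split_specs is hypothetical, option strings are
passed through without interpretation, and only the "pattern may not be
empty" rule is checked; whatever further validation the class performs
(for example, compiling regular expressions) is not modelled.

```python
from typing import List, Tuple


def split_specs(spec_string: str) -> List[Tuple[str, str]]:
    """Split a match specifications string into (options, pattern) pairs.

    Raises ValueError, rejecting the whole string, if any pair is malformed.
    """
    fields = spec_string.split("\t")
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating option and pattern fields")
    specs = []
    for i in range(0, len(fields), 2):
        options, pattern = fields[i], fields[i + 1]
        if pattern == "":
            raise ValueError("pattern strings may not be empty")
        specs.append((options, pattern))
    return specs


print(split_specs("wc=\t.*\trc=\t~$"))
```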

Match specifications strings that are stored in TDE configuration files will
be modified as follows:

  '\' characters in original pattern are encoded as '\\'
  The <Tab> separator is encoded as '\t'
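
The two substitutions can be modelled with the short Python sketch below.
It only illustrates the encoding rules stated above; the function names are
hypothetical, and keeping unrecognised escapes unchanged when decoding is an
assumption.

```python
def encode_for_config(spec_string: str) -> str:
    """Apply the two substitutions: '\\' -> '\\\\' and <Tab> -> '\\t'."""
    # Backslashes are doubled first so that the backslash introduced for
    # each tab is not doubled again.
    return spec_string.replace("\\", "\\\\").replace("\t", "\\t")


def decode_from_config(encoded: str) -> str:
    """Reverse encode_for_config by scanning left to right."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "\\" and i + 1 < len(encoded):
            nxt = encoded[i + 1]
            if nxt == "t":
                out.append("\t")
            elif nxt == "\\":
                out.append("\\")
            else:
                out.append(ch + nxt)  # assumption: unknown escapes are kept
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


plain = "ri=\t^a.+\\.[0-9]+$"
print(encode_for_config(plain))
```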

Using file name matching as an example, the match specifications string:
  wc=	.*	rc=	~$	se!	e	ri=	^a.+\.[0-9]+$
encoded in a TDE configuration file as:
  wc=\t.*\trc=\t~$\tse!\te\tri=\t^a.+\\.[0-9]+$
will match file names as follows:

  * All "dotfiles" would be matched with wildcard matching.
  * All file names ending with '~' (e.g. kwrite backup names) would be
    matched with case-sensitive regex matching.
  * All file names that do NOT contain an equivalent variant of the letter
    'e' (e.g. 'e','ê','Ě','E') would be matched with substring matching.
  * All file names starting with letter 'a' or 'A' and ending with '.'
    followed by one or more numeric digits would be matched with case-
    insensitive regex matching.
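
The four specifications above can be approximated in Python to see which
file names each one accepts. This is a rough model under several
assumptions and not the class's implementation: Python's re and fnmatch
modules stand in for TQRegExp and fnmatch(3), regular expressions are
applied with search (not whole-string) semantics, and Unicode decomposition
stands in for the equivalence table. All function names are hypothetical.

```python
import fnmatch
import re
import unicodedata

SPECS = [("wc=", ".*"), ("rc=", "~$"), ("se!", "e"), ("ri=", r"^a.+\.[0-9]+$")]


def fold(text: str) -> str:
    """Rough stand-in for equivalence: strip accents, then fold case."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def spec_succeeds(options: str, pattern: str, name: str) -> bool:
    kind = "s" if "s" in options else "w" if "w" in options else "r"
    if "e" in options:
        # Simplification: folding the raw pattern is only safe when it
        # contains no escapes, as in the substring example used here.
        pattern, name = fold(pattern), fold(name)
    elif "i" in options and kind != "r":
        pattern, name = pattern.lower(), name.lower()
    if kind == "s":
        hit = pattern in name
    elif kind == "w":
        hit = fnmatch.fnmatchcase(name, pattern)
    else:
        flags = re.IGNORECASE if "i" in options else 0
        hit = re.search(pattern, name, flags) is not None
    return hit == ("!" not in options)  # '!' inverts the wanted outcome


def succeeding_specs(name: str) -> list:
    """Option strings of every specification that succeeds for name."""
    return [opts for opts, pat in SPECS if spec_succeeds(opts, pat, name)]


for name in (".bashrc", "notes.txt~", "Archive.001", "Ěxtra"):
    print(name, succeeding_specs(name))
```

In this model ".bashrc" satisfies both the wildcard specification and the
inverted substring specification (it contains no variant of 'e'), while
"Ěxtra" satisfies none of the four.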

IMPLEMENTATION NOTES

  * Regular expressions are currently supported by TQRegExp and are
    thereby subject to its limitations and bugs. This may be changed
    in the future (e.g. direct access to PCRE2, porting of Qt 5.x
    QRegularExpression).

  * Wildcard pattern matching on GLIBC systems is done using the fnmatch
    function with GNU extended patterns supported. Consult the fnmatch(3)
    and glob(7) manual pages for more information. On non-GLIBC systems,
    basic (not extended) wildcard patterns are converted to basic regular
    expressions and processed by the underlying regular expression engine.

  * Simple substrings are also supported as match patterns. These are
    currently processed by the TQString.find() function. In the future,
    these may be converted and processed by the underlying regex engine,
    depending on the tradeoff between code simplification and efficiency.

  * Alphanumeric equivalence is conceptually similar to [=x=] POSIX
    equivalence class bracket expressions (which are not supported)
    but is intended to apply globally in patterns. The following
    are caveats when this option is utilized:

    - There is potentially significant overhead due to the fact that
      match patterns and match strings must be converted prior to
      matching. Conversion requires character-by-character lookup
      and replacement using a pre-built table.

    - The table contains equivalents for [0-9A-Z] which should work
      well for Latin-derived languages. It also contains support for
      other numeric and non-Latin letter characters, the efficacy of
      which is not as certain.

    - Due to the 16-bit size limitation of TQChar, the table does not
      contain mappings for codepoints greater than U+FFFF.
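
The character-by-character conversion described in the equivalence caveats
can be pictured with the Python sketch below. The table shown is a tiny,
made-up excerpt and not the real pre-built table, and the names EQUIV,
to_base and equivalent_substring are hypothetical. Python strings index by
code point, so the U+FFFF limit is modelled with an explicit check, whereas
a 16-bit character type would see such characters as surrogate pairs.

```python
# Tiny hypothetical excerpt of an equivalence table: variant -> base character.
EQUIV = {
    "e": "E", "é": "E", "ê": "E", "Ě": "E", "\uFF25": "E",  # U+FF25: fullwidth E
    "\u2460": "1", "\uFF11": "1",  # circled digit one, fullwidth digit one
}


def to_base(text: str) -> str:
    """Replace each character by its base form via per-character lookup."""
    out = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            out.append(ch)  # beyond U+FFFF: no mapping, passed through
        else:
            out.append(EQUIV.get(ch, ch))
    return "".join(out)


def equivalent_substring(pattern: str, text: str) -> bool:
    """Substring match after converting both the pattern and the string."""
    return to_base(pattern) in to_base(text)


print(equivalent_substring("e", "Ěxtra"))
print(equivalent_substring("e", "\U0001D41E"))  # MATHEMATICAL BOLD SMALL E
```

Both the pattern and the candidate string pass through to_base on every
comparison, which is where the overhead mentioned in the caveats comes from.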