Diffstat (limited to 'tdecore/README.tdestringmatcher')
-rw-r--r-- | tdecore/README.tdestringmatcher | 121 |
1 files changed, 121 insertions, 0 deletions
diff --git a/tdecore/README.tdestringmatcher b/tdecore/README.tdestringmatcher
new file mode 100644
index 000000000..dc9e6bca5
--- /dev/null
+++ b/tdecore/README.tdestringmatcher
@@ -0,0 +1,121 @@
+The TDEStringMatcher class provides string matching against a list of one
+or more match patterns along with associated options. A single pattern with
+its associated options will be referred to herein as a "match specification".
+
+Current match specification options include:
+
+  * Type of match pattern:
+
+      REGEX: Pattern is a regular expression.
+      WILDCARD: Pattern is a wildcard expression like that used
+        in POSIX shell file globbing.
+      SUBSTRING: Pattern is a simple substring that matches any
+        string in which it occurs. Substring characters do not
+        have any other meaning that controls matching.
+
+  * Alphanumeric character handling in a pattern:
+
+      NONE: Each unescaped alphanumeric character in a pattern
+        is distinct and will match only itself.
+      CASE INSENSITIVE: Each unescaped letter in a pattern
+        will match its lower and upper case variants.
+      EQUIVALENCE: Each unescaped variant of an alphanumeric
+        character will match all stylistic and accented
+        variations of that character.
+
+  * Desired outcome of matching:
+
+      TRUE: match succeeds if a string matches the match pattern.
+      FALSE: match succeeds if a string does NOT match the match pattern.
+
+Applications may set and get match specification lists either directly or
+indirectly (using an encoded match specifications string). The matching
+functions provided are:
+
+  matchAny(): strings match if they match any pattern in list.
+  matchAll(): strings match only if they match all patterns in list.
+
+MATCH SPECIFICATIONS STRING
+
+The TDEStringMatcher class provides applications an encoded match
+specifications string solely intended to be used for storing and retrieving
+match specifications. These strings are formatted as follows:
+
+  OptionString <Tab> PatternString [ <Tab> OptionString <Tab> PatternString ...]
+
+Option strings may contain only the following characters:
+
+  'r' - Match pattern is a regular expression [default]
+  'w' - Match pattern is a wildcard expression
+  's' - Match pattern is a simple substring
+  'c' - Letter case variants are distinct (i.e. case-sensitive) [default]
+  'i' - Letter case variants are equivalent (i.e. case-insensitive)
+  'e' - All letter & number character variants are equivalent
+  '=' - Match succeeds if pattern matches [default]
+  '!' - Match succeeds if pattern does NOT match (inverted match)
+
+Option strings should ideally contain exactly 3 characters indicating match
+pattern type, alphanumeric character handling, and desired outcome of matching.
+Specifying fewer option characters is possible but may result in unexpected
+inferred values. Specifying additional and possibly contradictory option
+characters is also possible, with later characters overriding earlier ones.
+
+Pattern strings may not be empty. Invalid pattern strings will cause the
+entire match specifications string to be rejected.
+
+Match specifications strings that are stored in TDE configuration files will
+be modified as follows:
+
+  '\' characters in original pattern are encoded as '\\'
+  The <Tab> separator is encoded as '\t'
+
+Using file name matching as an example, the match specifications string:
+  wc=	.*	rc=	~$	se!	e	ri=	^a.+\.[0-9]+$
+encoded in a TDE configuration file as:
+  wc=\t.*\trc=\t~$\tse!\te\tri=\t^a.+\\.[0-9]+$
+will match file names as follows:
+
+  * All "dotfiles" would be matched with wildcard matching.
+  * All file names ending with '~' (e.g. kwrite backup names) would be
+    matched with case-sensitive regex matching.
+  * All file names that do NOT contain an equivalent variant of the letter
+    'e' (e.g. 'e','ê','Ě','E') would be matched with substring matching.
+  * All file names starting with letter 'a' or 'A' and ending with '.'
+    followed by one or more numeric digits would be matched with
+    case-insensitive regex matching.
+
+IMPLEMENTATION NOTES:
+
+  * Regular expressions are currently supported by TQRegExp and are
+    thereby subject to its limitations and bugs. This may be changed
+    in the future (e.g. direct access to PCRE2, porting of Qt 5.x
+    QRegularExpression).
+
+  * Wildcard pattern matching on GLIBC systems is done using the fnmatch
+    function with GNU extended patterns supported. Consult the fnmatch(3)
+    and glob(7) manual pages for more information. On non-GLIBC systems,
+    basic (not extended) wildcard patterns are converted to basic regular
+    expressions and processed by the underlying regular expression engine.
+
+  * Simple substrings are also supported as match patterns. These are
+    currently processed by the TQString.find() function. In the future,
+    these may be converted and processed by the underlying regex engine,
+    depending on the tradeoff between code simplification and efficiency.
+
+  * Alphanumeric equivalence is conceptually similar to [=x=] POSIX
+    equivalence class bracket expressions (which are not supported)
+    but is intended to apply globally in patterns. The following
+    are caveats when this option is utilized:
+
+    - There is potentially significant overhead due to the fact that
+      match patterns and match strings must be converted prior to
+      matching. Conversion requires character-by-character lookup
+      and replacement using a pre-built table.
+
+    - The table contains equivalents for [0-9A-Z] which should work
+      well for Latin-derived languages. It also contains support for
+      other numeric and non-Latin letter characters, the efficacy of
+      which is not as certain.
+
+    - Due to the 16-bit size limitation of TQChar, the table does not
+      contain mappings for codepoints greater than U+FFFF.
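
Note on the match specifications string format: the rules in the README can be sketched in Python as shown here. This is only an illustration of the documented format, not the TDEStringMatcher implementation. The names MatchSpec, parse_match_specs, to_config and from_config are hypothetical. Two behaviours are assumptions rather than documented facts: option characters that are left out fall back to the values marked [default], and an unknown option character rejects the whole string.

```python
from dataclasses import dataclass

# Option characters as listed in the README, grouped by what they control.
PATTERN_TYPES = {"r": "regex", "w": "wildcard", "s": "substring"}
CHAR_HANDLING = {"c": "none", "i": "case-insensitive", "e": "equivalence"}
OUTCOMES = {"=": True, "!": False}


@dataclass
class MatchSpec:
    """One pattern plus its options (hypothetical container type)."""
    pattern: str
    pattern_type: str = "regex"   # 'r' is marked [default]
    char_handling: str = "none"   # 'c' is marked [default]
    want_match: bool = True       # '=' is marked [default]


def parse_match_specs(encoded: str) -> list[MatchSpec]:
    """Parse 'Options<Tab>Pattern[<Tab>Options<Tab>Pattern ...]'."""
    fields = encoded.split("\t")
    if len(fields) % 2 != 0:
        raise ValueError("expected OptionString<Tab>PatternString pairs")
    specs = []
    for options, pattern in zip(fields[0::2], fields[1::2]):
        if not pattern:
            # An empty pattern rejects the entire string, as documented.
            raise ValueError("pattern strings may not be empty")
        spec = MatchSpec(pattern)
        for ch in options:
            # Later characters override earlier ones, as documented.
            if ch in PATTERN_TYPES:
                spec.pattern_type = PATTERN_TYPES[ch]
            elif ch in CHAR_HANDLING:
                spec.char_handling = CHAR_HANDLING[ch]
            elif ch in OUTCOMES:
                spec.want_match = OUTCOMES[ch]
            else:
                # Assumption: unknown option characters are an error.
                raise ValueError(f"unknown option character {ch!r}")
        specs.append(spec)
    return specs


def to_config(encoded: str) -> str:
    """Escape for a configuration file: '\\' first, then <Tab>."""
    return encoded.replace("\\", "\\\\").replace("\t", "\\t")


def from_config(stored: str) -> str:
    """Reverse to_config() by scanning escape pairs left to right."""
    out = []
    i = 0
    while i < len(stored):
        ch = stored[i]
        if ch == "\\" and i + 1 < len(stored):
            nxt = stored[i + 1]
            # '\t' becomes a tab; any other escaped character is kept
            # as itself (an assumption for escapes other than '\\').
            out.append("\t" if nxt == "t" else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Backslashes are escaped before tabs in to_config() so that the backslash introduced by '\t' is not doubled. Decoding scans pairs rather than chaining two replace calls, so that '\\t' in the stored form is read as a backslash followed by 't' and not as a tab.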