Diffstat (limited to 'tdecore/README.tdestringmatcher')
-rw-r--r-- tdecore/README.tdestringmatcher 121
1 files changed, 121 insertions, 0 deletions
diff --git a/tdecore/README.tdestringmatcher b/tdecore/README.tdestringmatcher
new file mode 100644
index 000000000..dc9e6bca5
--- /dev/null
+++ b/tdecore/README.tdestringmatcher
@@ -0,0 +1,121 @@
+The TDEStringMatcher class provides string matching against a list of one
+or more match patterns along with associated options. A single pattern with
+its associated options will be referred to herein as a "match specification".
+
+Current match specification options include:
+
+ * Type of match pattern:
+
+ REGEX: Pattern is a regular expression.
+ WILDCARD: Pattern is a wildcard expression like that used
+ in POSIX shell file globbing.
+ SUBSTRING: Pattern is a simple substring that matches any
+ string in which it occurs. Characters in the substring are
+ taken literally and have no special meaning.
+
+ * Alphanumeric character handling in a pattern:
+
+ NONE: Each unescaped alphanumeric character in a pattern
+ is distinct and will match only itself.
+ CASE INSENSITIVE: Each unescaped letter in a pattern
+ will match its lower and upper case variants.
+ EQUIVALENCE: Each unescaped variant of an alphanumeric
+ character will match all stylistic and accented
+ variations of that character.
+
+ * Desired outcome of matching:
+
+ TRUE: match succeeds if a string matches the match pattern.
+ FALSE: match succeeds if a string does NOT match the match pattern.
+
+Applications may set and get match specification lists either directly or
+indirectly (using an encoded match specifications string). The matching
+functions provided are:
+
+ matchAny(): strings match if they match any pattern in the list.
+ matchAll(): strings match only if they match all patterns in the list.
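The any/all semantics, combined with the desired-outcome option, can be illustrated with a minimal sketch. Everything below is hypothetical: `MatchSpec`, `matchAnySketch()` and `matchAllSketch()` are illustration names, not the TDEStringMatcher API, and every pattern is treated as a plain substring to keep the example short.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for one match specification (not the TDE type).
struct MatchSpec {
    std::string pattern;  // treated as a plain substring in this sketch
    bool wantMatch;       // true: succeed on a hit; false: succeed on a miss
};

// A specification succeeds when hit/miss agrees with the desired outcome.
static bool specSucceeds(const MatchSpec &spec, const std::string &s) {
    const bool found = s.find(spec.pattern) != std::string::npos;
    return found == spec.wantMatch;
}

// Succeeds if at least one specification in the list succeeds.
bool matchAnySketch(const std::vector<MatchSpec> &specs, const std::string &s) {
    return std::any_of(specs.begin(), specs.end(),
                       [&](const MatchSpec &m) { return specSucceeds(m, s); });
}

// Succeeds only if every specification in the list succeeds.
bool matchAllSketch(const std::vector<MatchSpec> &specs, const std::string &s) {
    return std::all_of(specs.begin(), specs.end(),
                       [&](const MatchSpec &m) { return specSucceeds(m, s); });
}
```

In this sketch an empty list makes matchAllSketch() return true and matchAnySketch() return false. This README does not say how the real class treats an empty list, so that behaviour is an assumption.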
+
+MATCH SPECIFICATIONS STRING
+
+The TDEStringMatcher class provides applications with an encoded match
+specifications string that is intended solely for storing and retrieving
+match specifications. These strings are formatted as follows:
+
+ OptionString <Tab> PatternString [ <Tab> OptionString <Tab> PatternString ...]
+
+Option strings may contain only the following characters:
+
+ 'r' - Match pattern is a regular expression [default]
+ 'w' - Match pattern is a wildcard expression
+ 's' - Match pattern is a simple substring
+ 'c' - Letter case variants are distinct (i.e. case-sensitive) [default]
+ 'i' - Letter case variants are equivalent (i.e. case-insensitive)
+ 'e' - All letter & number character variants are equivalent
+ '=' - Match succeeds if pattern matches [default]
+ '!' - Match succeeds if pattern does NOT match (inverted match)
+
+Option strings should ideally contain exactly 3 characters indicating match
+pattern type, alphanumeric character handling, and desired outcome of matching.
+Specifying fewer option characters is possible but may result in unexpected
+inferred values. Specifying additional and possibly contradictory option
+characters is also possible, with later characters overriding earlier ones.
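The "later characters override earlier ones" rule can be sketched as a single pass over the option string. The type and function names here are hypothetical and not taken from the TDE sources. The sketch also assumes that omitted options fall back to the documented [default] values and that any character outside the documented set is an error; the real class may infer missing values or handle unknown characters differently.

```cpp
#include <string>

enum class PatternType { Regex, Wildcard, Substring };
enum class AlnumHandling { None, CaseInsensitive, Equivalence };

// Hypothetical container for the three options of one specification.
struct ParsedOptions {
    PatternType type = PatternType::Regex;      // 'r' is the documented default
    AlnumHandling alnum = AlnumHandling::None;  // 'c' is the documented default
    bool wantMatch = true;                      // '=' is the documented default
};

// Returns false if the string contains a character outside the documented set.
// Each character simply overwrites the field it controls, so a later
// character of the same kind overrides an earlier one.
bool parseOptionString(const std::string &opts, ParsedOptions &out) {
    ParsedOptions result;
    for (char ch : opts) {
        switch (ch) {
        case 'r': result.type = PatternType::Regex; break;
        case 'w': result.type = PatternType::Wildcard; break;
        case 's': result.type = PatternType::Substring; break;
        case 'c': result.alnum = AlnumHandling::None; break;
        case 'i': result.alnum = AlnumHandling::CaseInsensitive; break;
        case 'e': result.alnum = AlnumHandling::Equivalence; break;
        case '=': result.wantMatch = true; break;
        case '!': result.wantMatch = false; break;
        default: return false;  // assumption: unknown characters are rejected
        }
    }
    out = result;
    return true;
}
```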
+
+Pattern strings may not be empty. Invalid pattern strings will cause the
+entire match specifications string to be rejected.
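Splitting a match specifications string into option/pattern pairs might look like the sketch below. `splitMatchSpecString()` and `SpecPair` are hypothetical names. The sketch rejects the whole string when a pattern is empty or a pattern field is missing; it does not check whether a pattern is a valid regular or wildcard expression, which the real class presumably also does.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (option string, pattern string) for one match specification.
using SpecPair = std::pair<std::string, std::string>;

// Splits "Opt<Tab>Pat[<Tab>Opt<Tab>Pat...]" into pairs.
// Returns false, leaving 'out' untouched, if any pattern is empty or missing.
bool splitMatchSpecString(const std::string &encoded, std::vector<SpecPair> &out) {
    // Split on every tab character, keeping empty fields.
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type tab = encoded.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(encoded.substr(start));
            break;
        }
        fields.push_back(encoded.substr(start, tab - start));
        start = tab + 1;
    }

    // An odd field count means an option string without a pattern.
    if (fields.size() % 2 != 0)
        return false;

    std::vector<SpecPair> result;
    for (std::size_t i = 0; i < fields.size(); i += 2) {
        if (fields[i + 1].empty())
            return false;  // pattern strings may not be empty
        result.emplace_back(fields[i], fields[i + 1]);
    }
    out = std::move(result);
    return true;
}
```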
+
+Match specifications strings that are stored in TDE configuration files will
+be modified as follows:
+
+ '\' characters in the original pattern are encoded as '\\'
+ The <Tab> separator is encoded as '\t'
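The two substitutions above can be sketched as a pair of helper functions. This only illustrates the transformation described here; `encodeForConfig()` and `decodeFromConfig()` are hypothetical names and not the TDE configuration code, which may escape further characters.

```cpp
#include <cstddef>
#include <string>

// Applies the two documented substitutions: '\' -> "\\" and <Tab> -> "\t".
std::string encodeForConfig(const std::string &raw) {
    std::string out;
    for (char ch : raw) {
        if (ch == '\\')
            out += "\\\\";  // two backslash characters
        else if (ch == '\t')
            out += "\\t";   // backslash followed by 't'
        else
            out += ch;
    }
    return out;
}

// Reverses encodeForConfig(). Unknown escapes are kept verbatim.
std::string decodeFromConfig(const std::string &stored) {
    std::string out;
    for (std::size_t i = 0; i < stored.size(); ++i) {
        if (stored[i] == '\\' && i + 1 < stored.size()) {
            const char next = stored[++i];
            if (next == 't') {
                out += '\t';
            } else if (next == '\\') {
                out += '\\';
            } else {
                out += '\\';
                out += next;
            }
        } else {
            out += stored[i];
        }
    }
    return out;
}
```

Backslashes have to be escaped before tabs become '\t'. Otherwise a pattern that already contains the two characters '\' and 't' could not be told apart from an encoded separator when the string is read back.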
+
+Using file name matching as an example, the match specifications string:
+ wc= .* rc= ~$ se! e ri= ^a.+\.[0-9]+$
+encoded in a TDE configuration file as:
+ wc=\t.*\trc=\t~$\tse!\te\tri=\t^a.+\\.[0-9]+$
+will match file names as follows:
+
+ * All "dotfiles" would be matched with wildcard matching.
+ * All file names ending with '~' (e.g. kwrite backup names) would be
+ matched with case-sensitive regex matching.
+ * All file names that do NOT contain an equivalent variant of the letter
+ 'e' (e.g. 'e','ê','Ě','E') would be matched with substring matching.
+ * All file names starting with letter 'a' or 'A' and ending with '.'
+ followed by one or more numeric digits would be matched with case-
+ insensitive regex matching.
+
+IMPLEMENTATION NOTES:
+
+ * Regular expressions are currently supported by TQRegExp and are
+ thereby subject to its limitations and bugs. This may be changed
+ in the future (e.g. direct access to PCRE2, porting of Qt 5.x
+ QRegularExpression).
+
+ * Wildcard pattern matching on GLIBC systems is done using the fnmatch
+ function with GNU extended patterns supported. Consult the fnmatch(3)
+ and glob(7) manual pages for more information. On non-GLIBC systems,
+ basic (not extended) wildcard patterns are converted to basic regular
+ expressions and processed by the underlying regular expression engine.
+
+ * Simple substrings are also supported as match patterns. These are
+ currently processed by the TQString::find() function. In the future,
+ these may be converted and processed by the underlying regex engine,
+ depending on the tradeoff between code simplification and efficiency.
+
+ * Alphanumeric equivalence is conceptually similar to [=x=] POSIX
+ equivalence class bracket expressions (which are not supported)
+ but is intended to apply globally in patterns. The following
+ are caveats when this option is utilized:
+
+ - There is potentially significant overhead due to the fact that
+ match patterns and match strings must be converted prior to
+ matching. Conversion requires character-by-character lookup
+ and replacement using a pre-built table.
+
+ - The table contains equivalents for [0-9A-Z] which should work
+ well for Latin-derived languages. It also contains support for
+ other numeric and non-Latin letter characters, the efficacy of
+ which is not as certain.
+
+ - Due to the 16-bit size limitation of TQChar, the table does not
+ contain mappings for codepoints greater than U+FFFF.
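The equivalence conversion described in the caveats above can be pictured as folding both the pattern and the string to a canonical form before comparing them. The sketch below is an assumption-laden illustration, not the TDE implementation: the table holds only a handful of hand-picked entries, `foldEquivalents()` and `containsEquivalent()` are hypothetical names, and `char16_t` stands in for the 16-bit TQChar.

```cpp
#include <string>
#include <unordered_map>

// Tiny illustrative table mapping variants to a canonical [0-9A-Z] character.
// The real pre-built table is far larger; these entries are examples only.
static const std::unordered_map<char16_t, char16_t> kEquivalents = {
    {u'\u00C9', u'E'},  // E with acute
    {u'\u00E8', u'E'},  // e with grave
    {u'\u00E9', u'E'},  // e with acute
    {u'\u00EA', u'E'},  // e with circumflex
    {u'\u011A', u'E'},  // E with caron
    {u'\uFF11', u'1'},  // fullwidth digit one
};

// Character-by-character lookup and replacement. Code points above U+FFFF
// cannot appear as a single char16_t, mirroring the limitation noted above.
std::u16string foldEquivalents(const std::u16string &in) {
    std::u16string out;
    out.reserve(in.size());
    for (char16_t ch : in) {
        const auto it = kEquivalents.find(ch);
        if (it != kEquivalents.end())
            out += it->second;
        else if (ch >= u'a' && ch <= u'z')
            out += static_cast<char16_t>(ch - u'a' + u'A');  // ASCII upper-casing
        else
            out += ch;
    }
    return out;
}

// Substring match under equivalence: fold both sides, then compare.
bool containsEquivalent(const std::u16string &text, const std::u16string &pattern) {
    return foldEquivalents(text).find(foldEquivalents(pattern)) != std::u16string::npos;
}
```

Both the pattern and every candidate string go through the lookup loop, which is the conversion overhead the first caveat refers to.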