debian/uncrustify-trinity/uncrustify-trinity-0.78.0/documentation/theory.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129


-------------------------------------------------------------------------------
This document is too incomplete to be of much use.
Patches are welcome!


Theory of operation
-------------------

Uncrustify goes through several steps to reformat code.
The first step, parsing, is the most complex and important.


Step 1 - Tokenize
-----------------
C code must be understood to some degree to be able to be properly indented.
The parsing step reads in a text buffer and breaks it into chunks and puts
those chunks in a list.

When a chunk is parsed, the original column and line are recorded.

These are the chunks that are parsed:
 - punctuators
 - numbers
 - words (keywords, variables, etc)
 - comments
 - strings
 - whitespace
 - preprocessors

See token_enum.h for a complete list.
See punctuators.cpp and keywords.cpp for examples of how they are used.

In the code, chunk types are prefixed with 'CT_'.
The CT_WORD token is changed into a more specific token using the lookup table
in keywords.cpp


Step 2 - Tokenize Cleanup
-------------------------

The second step is to change the token type for certain constructs that need
to be adjusted early on.
For example, the '<' token can be either a CT_COMPARE or CT_ANGLE_OPEN.
Both are handled very differently.
If a CT_WORD follows CT_ENUM/CT_STRUCT/CT_UNION, then it is marked as a CT_TYPE.
Basically, anything that doesn't depend on the nesting level can be done at this
stage.


Step 3 - Brace Cleanup
-------------------------

This is possibly the most difficult step.
do/if/else/for/switch/while bodies are examined and virtual braces are added.
Brace parent types are set.
Statement start and expression starts are labeled.
And #ifdef constructs are handled.

This step determines the levels (m_braceLevel, level and m_ppLevel).

REVISIT:
  The code in brace_cleanup.cpp needs to be reworked to take advantage of being
  able to scan forward and backward.  The original code was going to be merged
  into tokenize.cpp, but that was WAY too complex.


Step 4 - Fix Symbols (combine.cpp)
----------------------------------

This step is no longer properly named.
In the original design, neighboring chunks were to be combined into longer
chunks.  This proved to be a silly idea.  But the name of the file stuck.

This is where most of the interesting identification stuff goes on.
Colons type are detected, variables are marked, functions are labeled, etc.
Also, all the punctuators are classified. Ie, CT_MINUS become CT_NEG or CT_ARITH.

 - Types are marked.
 - Functions are marked.
 - Parenthesis and braces are marked where appropriate.
 - finds and marks casts
 - finds and marks variable definitions (for aligning)
 - finds and marks assignments that may be aligned
 - changes CT_INCDEC_AFTER to CT_INCDEC_BEFORE
 - changes CT_STAR to either CT_PTR_TYPE, CT_DEREF or CT_ARITH
 - changes CT_MINUS to either CT_NEG or CT_ARITH
 - changes CT_PLUS and CT_ADDR to CT_ARITH, if needed
 - other stuff?


Casts
-----
Casts are detected as follows:
 - paren pair not part of if/for/etc nor part of a function
 - contains only CT_QUALIFIER, CT_TYPE, '*', and no more than one CT_WORD
 - is not followed by CT_ARITH

Tough cases:
(foo) * bar;

If uncertain about a cast like this: (foo_t), some simple rules are applied.
If the word ends in '_t', it is a cast, unless followed by '+'.
If the word is all caps (FOO), it is a cast.
If you use custom types (very likely) that aren't detected properly (unlikely),
the add them to the config file like so: (example Using C-Sharp types)
type UInt32 UInt16 UInt8 Byte
type Int32 Int16 Int8


Step 6+ Everything else
-------------------------

From this point on, many filters are run on the chunk list to change the
token columns.

indent.cpp sets the left-most column.
align.cpp set the column for individual chunks.
space.cpp sets the spacing between chunks.
Others insert newlines, change token position, etc.


Last Step - Output
-------------------------

At the final step the list is printed to the output.
Everything except comments are printed as-is.
Comments are reformatted in the output stage.