Maintainer: Ariya Hidayat (ariya@kde.org)
Some portions by Tomas Mecir (mecirt@gmail.com)
Revision: September 2004.
This document contains information about internal structure of KSpread as well as some notes of upcoming redesign. The sources for this document are mainly the discussions which take place in koffice-devel mailing-list and the source code itself.
Status: IN PROGRESS.
MVC (Model/View/Controller) means that the application consists of three big parts, the Model which holds the data structure and objects, the View which shows the model to the user and the Controller which handles user inputs and changes the model accordingly. Like other office applications, KSpread uses the Document/View architecture, a slightly different variant of MVC where the View and Controller are put together as one part.
In order of its complexity scale, KSpread code has to be well separated, i.e. the Document and the View. We may also call them as back-end and front-end respectively. Right now part of which should belong to the Document sometimes has access to the View. For example, a cell stores information about its metrics in pixels (which is zoom dependent), knows whether it is visible to the user or not (which is view dependent), etc. This needs to be changed.
One easy way to decide whether some stuff or relationship must really really alienated in the Document is to imagine that somebody wants to create another View (front-end) to the Document object model (back-end) that is being worked on. Say, one decent guy would like to copy the look-and-feel of classic Lotus 1-2-3 (for whatever reason we are not really interested in here); so basically to some extent he can take most part of the KSpread back-end and glue a new user interface around the code.
Status: IN PROGRESS.
When a cell holds a formula, then it is likely that it depends on other cell(s) for calculating the result. For example, if cell A11 has the formula "=SUM(A1:A10)", this means that values in cells A1, A2, A3, until A10 must be correctly calculated first before the sum can be obtained for cell A11. This is called dependency.
As for now, KSpread tries to manage dependency by storing the dependent cells or ranges in the cell itself. This is not too efficient. If a cell is very simple, i.e. stored only value, not formula, such scheme will just waste a couple of bytes of pointers for the dependency data structure. It is much more wiser to simply create one dependency manager for each worksheet; it should be responsible for maintaining and handling cell dependencies for that sheet. Also KSpread always stores ranges which depend on one particular cell and ranges whose one of its dependent is that cell (and these are all in the cell structure itself). This is not necessary as that information is redundant. The dependency manager should be able to handle both cases.
Let us have a look at this simple example:
A | B | C | D | |
1 | 14 | 36 | ||
2 | 3 | |||
3 | 77 | |||
4 | =SUM(A1:A3) | =A4+SUM(B1:B3) | =100*B4 |
Such sheet should produce dependencies like:
Reference | Dependent(s) |
A4 | A1:A3 |
B4 | A4 and B1:B3 |
C4 | B4 |
When we want to recalculate cell B4, from the dependencies shown above we may know that first we need to know values of cell A4 and range B1:B3. Further on, cell A4 needs to know values of cells in range A1:A3. Therefore, given one reference cell (e.g. B4), the dependency manager must be able to return all dependents, cells and/or ranges (e.g. A4, B1:B3). Do we need to go recursively when searching for dependencies? That really depends on the implementation, but it is not a big problem, though.
In another case, say the user has changed cell A3 so we need to update the calculation. We should not recalculate the whole sheet because it wastes time. We just need to recalculate cells that depend on A3, in this case A4, B4 and C4. So the dependency manager has another responsibility: given a cell it should find all cells and/or ranges which depend on that particular cell. It is a matter of iterating over all dependencies and checking whether the cell is within the dependent(s) and returning the reference cell. In this example, cell A3 is in the range A1:A3, a dependent range of cell A4. Hence, we just return A4. Recursive or not, we can either continue finding dependents of A4 or just stop here.
Note also that dependency manager should not store cell pointers, but rather only the location of the cell (i.e. the sheet that owns the cell, row number and column number). This is because on some cases the dependent cell may not exist yet. As illustrated in the example, dependents of cell B4 are A4, B1, B2 and B3 but here cells B2 and B3 are still empty. Of course, when we just want to know which cells we need to recalculate for one reference cell, the dependency manager is allowed to return only non-empty cells (e.g. A4 and B1 in our case) as empty cells have no effect and will not be recalculated anyway.
By the same manner, dependency manager can also held responsible when chart comes into play. Any charts placed in the sheet (that are actually KChart parts) depend on some values of the cells. An action by the user to changing those cells, directly or indirectly, should trigger the update of the respective charts.
Inter-sheet dependencies can be well handled if we store the owner of each dependent. This is not shown yet in the explanation above to avoid unnecessary complication. But let have one example now: if Sheet2!A1 is "=SUM(Sheet1!A1:A10)" then changing Sheet1!A1 (the dependent) means updating Sheet2!A1 (the reference). Of course during recalculation we must take care that all sheets in the document must be processed, even though only one single cell in one sheet has been changed.
Implementation-wise, there will be one instance of the dependency manager for each sheet. This class will fully manage all dependencies and trigger cell recalculation. An important part of this concept is this: The cell itself knows nothing about dependencies, and it doesn't care about them either. The cell will just inform about the fact, that its value has been changed, and the dependency manager will do the rest. In addition, this gives us recursive dependency calculation at almost no cost.
Status: PLANNED.
Currently, every operation on a cell or on a range of cells is quite complex. You need to ensure correct repainting, recalculation, iterate on a range and so on.
To address this issue, manipulators shall be implemented. A manipulator will implement one operation (formatting change, sequence fill, ..., ...).
Basically, usage of a manipulator should look like this:
Manipulator *manip = manipulatorManager::self()->getManip ("seqfill"); manip->setArgument ("type", 1); ... (more setArgument's) manip->exec (selection);
That's all...
What concerns manipulator implementation, you'll derive from the base manipulator and reimplement constructor and methods initialize() (called just before the operation starts), processCell(), and maybe done(). The constructor or initialize() would set some properties for the cell-walking algorithm, and then it won't care about it anymore. The base class will walk the range and call processCell() for each cell, possibly creating it if it doesn't exist (if the manipulator wants so).There will also be some methods that can be used to process the whole range or row/column at once, if the manipulator wants to do so (useful for, say, formatting manipulators that will be able to set attributes of a whole range or row/col, in accordance with thoughts about format storage below.
In addition, the manipulator can implement the undo/redo functionality - the base manipulator will provide some common stuff needed to accomplish this.
Status: PLANNED
The selection shall be an instance of some RangeList class, or however we want to call it - this will contain a list of cells/ranges/rows/whatever - like current selection, but will contain more entries. This will allow easy implementation of CTRL-selections and so, because thanks to manipulators, each operation will automatically support these.
Status: PLANNED
As mentioned above, the interface between the core and the GUI needs to be kept at minimum. Also, the number of repaints needs to be as low as possible, and repaints should be groupped whenever possible. To achieve all this, the following approach can be used:
When a cell is changed, it calls some method in KSpread::Sheet - valueChanged() or formattingChanged(). These methods then trigger everything necessary, like a call to the painting routine or dependency calculation.
This simple system would work on itself, but it would be slow. If you do a sequence fill on A1:A1000 and you have a SUM(A1:A1000) somewhere, why would you want to compute that SUM 1000 times, when you can simply compute it after the sequence fill has been finished? Hence, the sheet will offer some more methods - disableUpdates(), enableUpdates(), rangeListChanged() and rangeListFormattingChanged(). All these will be used (solely?) by manipulators, preferably by the base manipulator class, so that we don't have to call these functions in each operation. After a call to disableUpdates(), there will be no repainting and no dependency calculation. Note that a call to enableUpdates() won't cause any repaints either, as the sheet cannot remember all the calls (due to loss of range information). Hence, the base manipulator class needs to call the correct rangeList*Changed method to trigger an update in an effective way. The base manipulator needs to be configurable by the manipulators that derive from it, so that it knows whether it changed cell's content or formatting.
Status: FINISHED.
This formula engine is just an expression evaluator. To offer better performance, the expression is first compiled into byte codes which will be executed later by a virtual machine.
Before compilation, the expression is separated into pieces, called tokens. This step, which is also known as lexical analysis, takes places at once and will produce sequence of tokens. They are however not stored and used only for the purpose of the subsequent step. Tokens are supplied to the parser, also known as syntax analyzer. In this design, the parser is also a code generator. It involve the generation of byte codes which represents the expression. Evaluating the formula expression is now basically running the virtual machine to execute compiled byte codes. No more scanning or parsing are performed during evaluation, this saves time a lot.
The virtual machine itself (and of course the byte codes) are designed to be as simple as possible. This is supposed to be stack-based, i.e. the virtual machine has an execution stack of values which would be manipulated as each byte code is processed. Beside the stack, there will be a list of constant (sometimes also called as "constants pool") to hold Boolean, integer, floating-point or string values. When a certain byte code needs a constant for the operand, an index is specified which should be used to look up the constant in the constants pool.
There are only few byte code, sufficient enough to perform calculation. Yes, this is really minimalist but yet does the job fairly well. The following provides brief description for each type of bytecode.
Nop means no operation.
Load means loads a constant and push it to the stack. The constant can be found at constant pools, at position by 'index', it could be a Boolean, integer, floating-point or string value.
Ref means gets a value from a reference. Member variable 'index' will refers to a string value in the constant pools, i.e. the name of the reference. Typically the reference is either a cell (e.g. A1), range of cells (A1:B10) or possibly function name. Example: expression A2+B2 will be compiled as:
Constants:
#0: "A2"
#1: "B2"
Codes:
Ref #0
Ref #1
AddFunction. Example: expression "sin(x)" will be compiled as:
Constants:
#0: "sin"
#1: "x"
Codes:
Ref #0
Ref #1
Function 1Neg is a unary operator, a value is popped from stack and negated and then pushed back to the stack. If it is not number (Boolean or string), it will be converted first.
Add, Sub, Mul, Div and Pow are binary operators, two values are popped from stack and processed (added, subtracted, multiplied, divided, or power) and the result is pushed to the stack.
Concat is string operation, two values are popped from stack (and converted to string if they are not string values), concatenated, and the result is pushed to the stack.
Not is a logical operation, a value is popped from stack and its Boolean not is pushed into the stack. When it is not Boolean value, there will be a cast.
Equal, Less, and Greater are comparison operators, two values are popped from stack and compared appropriately. The result, which is a Boolean value, is pushed into the stack. To simplify, there no "not equal" comparison because it can be regarded as "equal" followed by "not" byte codes. Same goes for "less than or equal to" and "greater than or equal to".
The expression scanner is based on finite state acceptor. The state denotes the position of cursor, e.g. inside a cell token, inside an identifier, etc. State transition is following by emitting the associated token to the result buffer. Rather than showing the state diagrams here, it is much more convenience and less complicated to browse the scanner source code and try to follow its algorithm from there.
The parser is designed using one of bottom-up parsing technique, namely based on Polish notation. Instead of ordering the tokens in suffix Polish form, the parser (which is also the code generator) simply outputs byte codes. In its operation, the parser requires the knowledge of operator precedence to correctly translate unparenthesized infix expression and thus requires the use of a syntax stack.
The parser algorithm is given as follows:
Repeat the following steps:
Step 1: Get next token
Step 2: If it is an identifier
- push it to syntax stack
- generated "Ref"
Step 3: If it is a Boolean, integer, float or string value
- push it to syntax stack
- generated "Load"
Step 4: If it is an operator
- check for reduce rules
- when no more rules applies, push token to the syntax stack
The reduce rules are:
Rule A: function argument: if token is semicolon or right parenthesis, if syntax stack looks as:
Rule B: last function argument
if syntax stack looks as:
Rule C: function without argument
if syntax stack looks as:
Rule D: parenthesis removal
if syntax stack looks as:
Rule E: binary operator
if syntax stack looks as:
Rule F: unary operator
if syntax stack looks as:
Percent operator is a special case and not handled the above mentioned rule. When the parser finds the percent operator, it checks whether there's a non-operator token right before the percent. If yes, then the following code is generated: load 0.01 followed by multiply.
Status: FINISHED.
to be written.
Status: IN PROGRESS.
Until lately, to implement undo and redo, KSpread creates corresponding KSpreadUndo classes for each action and runs them when the user undoes those actions. KSpreadUndo also has redo function whose job is to redo again the action after being undone.
All this needs to be converted to manipulators - these will be KCommand, hence we should be able to undo/redo every operation (provided that the corresponding manipulator provides methods to store/recall the undo information).
Status: PLANNED.
Cells are grouped together, and then hashed.
Status: PLANNED.
Formatting specifies how a cell should look like. It involves font attributes like bold or italics, vertical and horizontal alignment, rotation angle, shading, background color and so on. Each cell can have its own format, but bear also in mind that a whole row or column format should also apply.
Current way of storing formatting information is rather inefficient: pack it together inside the cell. The reason is because most of cells are either very plain (no formatting) and/or only have partial attribute (e.g. only marked as bold, no font family or color is specified). Therefore the approximately 20 bytes used to hold formatting information are quite a waste of memory. Even worse, this requires that the cell must exist even if it is not in use. As illustration, imagine a worksheet where within range A1:B20 only 5 cells are not empty. When the user selects this range and changes the background color to yellow, then those 5 cells must store this information in their data structure but how about the other 35 cells? Since the formatting is attached to the cell, there is no choice but to create them. Doing this, just for the sake of storing format, is actually not really good.
A new way to store formatting information is proposed below.
For each type of format a user can use, we have the corresponding formatting piece, for example "bold on", "bold off", "font Arial", "left-border 1 px", etc. Whenever the user applies formatting to a range (could also be a whole column, row, or worksheet), we save appropriate respective formatting piece in a stack. Say the user has marked column B as bold, row 2 as italics, and set range A1:C5 with yellow background color. Our formatting stack would look like:
Range | Formatting Piece |
Column B | Bold on |
Row 2 | Italics on |
A1:C5 | Yellow background |
Now let try to figure out the overall format of cell B2. From the first we know it should be bold, from the second it should be italics, and last it should have yellow background. This complete format is the one which we used to render cell B2 on screen. In similar fashion, we can know that cell A1 on the other only specifies the yellow background, because the first and second pieces do not apply there.
Another possible way to see the format storage is by using 3-D perspective. For each formatting piece, imagine there is a surface which covers the formatted range (the xy-plane). The formatting information is simply attached to the surface (say, as surface attribute). Every surface is stacked together, its depth (the z-axis) denotes the sequence, i.e. the first surface is the deepest. For the example above, we can view the pieces as one surface specifies "bold on" which is a vertical of column B, one surface specifies "italics on" which is a horizontal band of row 2 and one last surface which specifies "yellow background" stretched in the range A1:C5. How to find complete format for A2? This is now a matter of surface determination. Traversing in the direction of z-axis from B2 reveals that we hit the last and second surfaces only; thereby we can know the complete format is "italics, yellow background".
It is clear that a format storage corresponds to one sheet only. For each sheet, there should be one format storage. Cells can still have accessors to its formatting information, these simply become wrapper for proper calls to the format storage. Since each formatting piece holds information about the applied range, we must take care that the formatting storage is correctly updated whenever cells, rows or columns are inserted and deleted.
In order to avoid low performance, we must use a smart way to iterate over all formatting pieces whenever we want to find out complete format for given cell(s). When the sheet gets very complex, it is likely that we will have many many formatting pieces that are not even overlap. This means, when we need formatting of cell A1, it is no use to check formatting pieces of range Z1:Z10 or A100:B100 which do not include cell A1 and are even very far from A1. One possible solution is to arrange the formatting pieces in a quad-tree. Because one piece can cover a very large area, it is possible that it will be in more than one leaf in the quad-tree. Details on the possible use of quad-tree or other methods should be explored further more.
Status: IN PROGRESS.
Relevant mailing-list threads:
Toolbars are utilized to place most frequently used actions. It is important to present the user with default toolbars which make sense, i.e. they do not contain unnecessary buttons. In-depth usability analysis and/or further discussions are needed to make decision which buttons need to be in the toolbar and which don't.
For reference, here is a list of default shown toolbars in some spreadsheet applications:
Microsoft Excel 2002:
OpenOffice 2.0:
Gnumeric 1.4:
Status: IN PROGRESS.
Relevant mailing-list threads:
It is well known that writing clean and easily understandable module will lead to better maintenance. However testing that particular module everytime there is a significant change requires considerable amount of time and effort. Since KSpread and other applications of its scale consist of hundreds of modules, in this case automatic testing of each module will help a lot, not to mention that it might catch bug as early as possible.
KSpread has a simple test framework to facilitate such kind of test. This can be activated using the shortcut Ctrl+Shift+T. This test is however not accessible via menu, because it is intended to be used only by the developers. Ideally, there should be tests for all modules contained in KSpread. It is the responsibility of the developer to create the corresponding tester for the code that he or she is working on. All tests should be kept in koffice/kspread/tests/.
Making a new tester is not difficult. The easiest way is to copy an already existing tester and modify it. Basically, it must be a subclass of class Tester (see koffice/kspread/tests/tester.h). Just reimplement the virtual function run() and it is ready. In order to make it possible to run the new tester, add an instance of the class in TestRunner (for details, see koffice/kspread/tests/testrunner.cc).
A tester must be self-contained, it should not use any test data from current document. If necessary, it must create (or hard code) the data by itself.
Whenever parts of KSpread features are improved or rewritten, it is always a good idea to run the related tests to ensure that all the changes do not do any harm. However, bear in mind that there is no 100% guarantee that the new code is bug-free.
Also, if there is a bug which is not caught by the tester (i.e. it does not fail the tester, but the bug is confirmed), then the relevant tester must be modified to include one or more test cases similar to the offending bug. When the bug is finally fixed, from that point the test should always pass all test cases.
Status: IN PROGRESS.
(to be written in details).
Write clean code. To be correct is better than to be fast. KSpread source code is known to grow very fast in its early days and but later on also more difficult to understand.
Put comment as documentation for classes and member functions. There is still lack of documentation as for now, whoever understands something about the classes and functions should write the documentation.
In complex source files, list of header includes can be very long. Unless there is special reason not do it, try to group them together, i.e. standard C/C++ headers come first, followed by Qt headers, and then KDE headers, KOffice core/UI headers and application specific headers. For each group, sort the header files alphabetically.
Write test cases. This will ease further maintenance. See also the section on Test Framework above.
Do not use the term table. It was incorrectly invented quite likely because of the term Tabelle (German, literally means table). The correct term is sheet or worksheet. The English version of Microsoft uses sheet while the German version uses Tabelle.
Use d-pointer trick (also known pimpl) whenever possible. Such practice will help when later on we want to expose the API and need to maintain binary compatibility. But the most important thing is to separate the interface and the implementation. Furthermore, build time is reduced since modification on the implementation would not cause tons of recompile.
When creating a new class, use namespace KSpread. Do not use KSpread prefix anymore. Example: use KSpread::Foo instead of KSpreadFoo. Also source file name should not contain kspread prefix anymore, i.e. foo.h and foo.cc (but not kspread_foo.h and kspread_foo.cc) for the above example.