Access:

» Developing a custom database - LhimkDB. Part 3

Related categories: Other DB | Database programming

Pawe³ Marciniak
Viewed: 4746 | Article date: 2006-04-21 09:52:27

In this article we will be working on the layer which provides key-based access to data and makes it possible to browse data in certain order.

Two months ago I started writing a series of articles, describing how to build a database from scratch. We are building an embedded database with key-value architecture. In the first article I have presented motivation - why it is worth it to build such a database - and the project as a whole, with its assumptions and elements.

About the author

Pawe³ Marciniak has been developing software, mainly in C/C++, for 20 years. For three years he has been developing a dynamic compilation system called Lhimk. Contact with the author: pawel@lhimk.org

As a reminder, Figure 1 shows a diagram of the whole LhimkDB database. In the previous article I have shown how to construct the data storage layer - an UDB (Unordered Database). In this article we will be working on the layer which provides key-based access to data and makes it possible to browse data in certain order.

Figure 1. Architecture of LhimkDB

Key-based access to data will be carried out using the structure SkipList. First I will show how to build a general class SkipList, after which we will connect it with the UDB using a certain trick, thus obtaining a database. Although all these elements are of course parts of LhimkDB, both their tasks and implementations are independent enough to possibly interest someone who simply wants to construct persistent memory or an interesting structure, like sorted associative memory, for his programs. In other words, both SkipList and UDB can be used as independent libraries.

LhimkDB is written in the Lhimk language. This language resembles C/C++ so much that it doesn't make sense to learn it or to describe how to write programs in it. Elements that are present in Lhimk but absent in C/C++ are discussed in the frame "Lhimk" or within the article itself. Should someone prefer not to use Lhimk (which is perfectly understandable for me), it will be easy to rewrite the code presented here into C++. It is also possible to use Lhimk inside programs written in C/C++.

More about UDB

Of course I wouldn't be myself if I didn't change something that has been described in previous articles. Attentive readers remember that in the UDB solution I have presented we remembered two values for each object - the present and the previous one. This allowed us to simultaneously read and alter data - unfortunately if a reading transaction lasted so long that two writing transactions took place in the meantime, it could case reading conflicts. This problem didn't let me sleep at night and finally I have found a solution which tracks data history for all ongoing transactions. The solution is trivial: whenever a value of an object is changed we simply create a new PID, in which we store a pointer to the old version. This gives us a list of of historic values of object data. They are removed from the database at the moment of termination of the oldest reading transaction which might access that data. As a result, we get something which database vendors advertise as OLAP functionality - the possibility of executing long, analytic queries without collisions with writing transactions.

At this point either of the two solutions for UDB can be used. Incidentally, it has turned out the code is really highly modular - the change involved introducing modifications only to a few methods of UDBIndex.

Assumptions

Just like in the previous article, we will begin by defining what we would like to achieve. We have already written the UDB, which can be compared in handling with heap memory - the difference being that the UDB is persistent and supports transactions. In the UDB every object has an assigned PID (a remote cousin of a pointer), which allows one to read, alter and delete objects. Now we will add the functionality of writing data addressed by arbitrary keys (e.g. text) and browsing it ordered by these keys.

The design of the API is general, with no knowledge of the type of keys (K) and values (T). Since Lhimk supports templates and generic programming, one can provide arbitrary types - we don't have to know them while writing our code.

Here is our API:

operator[[K k]] - read the value stored under the key k

operator[[K k]]=T t - write the value t under the key k

remove(K k) - delete the value stored under the key k

startTransaction() - initiate a writing transaction

commitTransaction() - finalise a writing transaction, committing introduced changes

+ iterator code (end, next, prev etc.)

At this moment I had better explain a bit how one can define custom operators in Lhimk. First of all, what I have just written above is pseudocode - in Lhimk all operator names are legal names of methods, with no additional characters. For example:

operator_bb - means code for the operator [[]]

operator_bba - means code for the operator [[]]=

operator_eq - means code for the operator ==

etc.

Still, this is just details. Much more interesting is the operator [[]]= (operator_bba). The thing is, in C++ when we write code of the class e.g. Vector and want to perform an assignment to the ith cell, it looks more-or-less like this:

T& operator[](K k)

This means that the operator method returns a reference to the object, so in case of code like:

v[10] = 20;

the 10th cell will somehow, "magically", be assigned the value 20. I do not want to criticise C++ because I really like this language and have written quite a few large programs in it. Even so, references is something I hate the guts of; perhaps I'm too stupid to understand them. Therefore, even though I was tempted several times, I didn't add them to Lhimk on purpose.

While writing code for the class Vector, what we really want is to have some operator which will be an equivalent of the method insert. There is such an operator in Lhimk, it's: [[]]= (or: []= or {}=). If we defined such an operator, then the code:

v[[10]] = 20 ;

will call

operator_bba(10,20)

Simple. And what we wanted.

I use [[]] everywhere instead of single [] because there is such a convention in Lhimk (is the word "convention" is appropriate for something that young). The operator [[]] is a reference to an array with an additional "something", the operator [] quickly reads values from an array.

SkipList

First we will take care of building a data structure which will give us access to stored values by key, with the possibility of browsing in accordance with the key order. For now we will forget that we're building a database and write an "ordinary" class.

Databases make use of modified versions of algorithms known from ordinary data structures: hash tables, B-Trees etc. In the most common scenario, a database uses some variant of a B+-tree or B*-tree. More about this can be read in the article by Thomas Niemann, "A compact guide to searching and sorting" (http://epaperpress.com/sortsearch/index.html). One of the reasons for this is that traditional databases want to pack everything into a single page and have as few page read or page write operations as possible.

Personally I am not particularly convinced (and I'm not the only one) that this approach still has any good reasons behind it. Modern hardware and operating systems have greatly changed the situation in the last few years. Discs (or to be exact, their controllers) have several levels of memory, their own resource queuing algorithms etc. It doesn't make much sense any more to include code like this into databases. In my opinion, it is much better to employ the most simple solutions which will make good use of modern hardware - e.g. the processor cache. Therefore, LhimkDB uses the Skip List invented in 1989 by Bill Pugh (http://www.cs.umd.edu/~pugh/). This algorithm is so simple that I can explain it here in just two paragraphs.

Let us take the most simple linked list with the record structure in the form [key, data, pointer to the next record]. In such a list, searching for a record with a certain keys involves walking through subsequent records until we have found the desired key. We can keep those records sorted, then once we have found a record with a greater key we know we don't have to look any further - we won't find our key anywhere (let us call such a sorted list L0 - see Figure 2.).

Figure 2. An ordinary sorted list (L0)

Imagine that we've got 100 records in such a list. We take every other record and connect them with additional pointers, creating another list (let's call it L1 - see Figure 3.).

Figure 3. A sorted list with links between every other element (L1)

Page: 1 2 3
Buy article Buy subscription
Buy now add to cart
add to cart
Standard price: 2€/$3 Standard price: 25€/$30
Buy article for as little as (2€/$3) each allow access to individual articles. Buy a full access to our Software Developers's Journal archive portal. You will be able to read the articles from all archive issues from year 2005 and 2006. For just 25€/$30 you get unrestricted access to the entire website for the whole year.
SDJhakin9

.SDJ Users:


.:Login
.:Password

[Register]
[Forgotten your password?]

...Shopping Cart

sum: 0 €
Choose currency:

...Topics

...Advertisement

www.acunetix.com www.verifysoft.com

...Conferences




...Print Edition Archive

...Affiliate Program



 

 

Subscribe | Contact Us | Newsletter | Privacy policy | Regulations | See all issues | About SDJ
Copyright C 2006 by Software Developer's Journal. All rights reserved.