Search:
|
Access:
» Developing a custom database-LhimkDB. Part 1Related categories: C/C++ | Other DB Pawel MarciniakViewed: 6852 | Article date: 2006-01-13 13:00:06 Lhimk is a dynamic compilation framework for a C-based language of the same name. In this four-part series of articles, Pawe³ goes step by step through the process of designing and implementing a modern embedded database. The first article presents an overview of LhimkDB architecture.
Pawel has been involved with software development for 20 years, using mainly C/C++. For the past 3 years he has been working on the Lhimk dynamic compilation framework. Contact with the author: pawel@software.com.pl Saying there are many database systems available would be stating the obvious. A number of excellent databases are even available for free with full source codes and no licensing problems (PostgreSQL comes to mind). So why am I writing an article on creating yet another database system? There are actually several very good reasons, if you'll bear with me. Some time ago I set about looking for an embedded key-value database. Although many suitable open source solutions exist, very few of them can be freely used in commercial applications. The most popular embedded database is Berkeley DB (http://www.sleepycat.com/), but its licence does not allow use in closed-source applications. The PBL database (http://mission.base.com/peter/source/) is very interesting and LGPL-licensed, but its key size is limited to 255 bytes. MetaKit is another noteworthy (and BSD-licensed) project (http://www.equi4.com/metakit.html), but it is column based, not key-value. Many other projects also exist, but none of them had everything I was looking for - some had restrictive licences, others were missing transaction support or some other features. All in all, I could find no suitable embedded key-value database, so I decided to write one myself. At first glance, it seemed the easiest thing to do would be to take the PostgreSQL code responsible for assigning data to indexes (as key-value pairs). PostgreSQL uses the Gist library, so the necessary code is already in a separate unit. However, PostgreSQL is essentially based on algorithms from the 1970s and 80s - revamped and updated, to be sure, but pretty plain fare nonetheless. Since we're going to build something new altogether, why not reformulate the whole approach? The practical and theoretical groundwork for contemporary database systems was laid in times when computers had just a few kilobytes of RAM and at most several megabytes of hard drive space, with database developers working on the assumption that the most time-consuming operations are reading and writing fixed-size pages on disk. While this generally still holds true today, we can now take advantage of a host of features found in modern operating systems (such as virtual memory or file mapping) and hardware (CAS operations, persistent storage using FLASH and RAM etc.). Just a few years ago, a database had to have its own disk partition and nobody in their right mind would use a standard file system to store data. Database systems that could dynamically resize their data files were a revolutionary step forward (look, no administrator!). What seems obvious today was once a novelty - so maybe it's time for change once again? Of course, no-one even remotely sane is going to write their own database for each application they develop (at least hardly anyone). Even if available solutions have unsuitable licences or are obsolete, it's always better to take a tried and tested product - after all, a database has to be completely reliable. On the other hand, it's well worth taking a closer look at database structure or even taking a crack at writing your own database system. The knowledge of inner database workings gained thereby will help you make more effective use of existing products, especially as all databases are actually pretty similar. As a bonus, you will find that reading advertising materials from Serious Corporate Database Vendors will start to inject a healthy dose of humour into your everyday life. This article is the first of a four-part series looking at the development of a complete modern embedded database. We will start with an overview of the whole project and examine the design decisions that will influence the development process. Subsequent parts will take you step by step through implementing the database system, frequently referring to the decisions outlined in this article.
LhimkLhimk is a dynamic compilation framework for a C-based language of the same name. Lhimk is an object-oriented language with very simple syntax, so practically anyone familiar with C/C++ can get programming straight away. It is basically C with multi-inheritance, operator overloading, templates, signals, exceptions and a few other features. The Lhimk environment provides dynamic compilation, with the program code being compiled at runtime. The compiler is extremely fast, so the whole process takes less time than loading many libraries into traditionally compiled applications. Lhimk is released partly on the LGPL licence and partly on the BSD licence. The duality is due to the use of LGPL-licenced TCC code - it is my intention to release all my own code with a BSD licence. AimsLet's start by outlining the aims for our new database system. As already mentioned, our primary requirement is for an embeddable database (i.e. a library) with a key-value architecture. I don't like self-imposed limitations in software design, so it would be best to have no limits on key and value sizes, nor on the size of the database itself. Naturally, we want support for transactions, preferably including nested transactions or even long-running transactions. The actual structure of the database should be open, allowing extensions (such as new index types) to be easily added. The database must also be capable of quick recovery following a system failure and should support hot backups. The latter feature is especially important in web applications, where the system must be up and running non-stop, without so much as a few minutes offline for administrative tasks. DesignWithout much ado, let me get straight to the most controversial design decision: LhimkDB is written in the Lhimk language and uses the Lhimk runtime environment. Before you stop reading, allow me to explain. I initially started writing LhimkDB in C (based on the code of SkipDB by Steve Dekorte, http://www.dekorte.com/) and intended to develop it as a library that could later be dynamically linked into Lhimk. However, as work progressed, it turned out it would be much more convenient to rewrite the whole thing in Lhimk. If anyone has issues with Lhimk, they can take the original C code and work on that (the C code is a late beta and includes all the features discussed here), but Lhimk shouldn't be a problem for anyone looking for a database library for C/C++ (or any language that can use C libraries). Lhimk is based on C, backwards-compatible with C and can be used as a library in C code. It's also possible to compile a C program as a library and embed it into Lhimk. To cut a long story short, LhimkDB can readily be used in C/C++ applications - see inset for more information on Lhimk itself. With that out of the way, let's get to database design proper. Any database consists of several cooperating layers (see Figure 1). The lowest layer supports physical data storage (typically in disk files) and can be pictured as a memory heap that can be written to disk. Write operations should be arranged so as to allow the database to return to its last consistent state after a system failure. Next comes a layer that organises data so it can be located by key. It is basically a kind of hash table, except we also need the ability to browse data sorted by key. The layer is typically implemented using various types of balanced trees (B-trees), but we will use an alternative data structure - a skip list.
Figure 1. LhimkDB architecture The topmost layer supports application transactions. LhimkDB was developed using non-locking algorithms, so no transaction deadlock is possible - transaction conflicts are resolved by cancelling one of the transactions. This solution is still prone to livelock situations, where a transaction is continuously being cancelled and cannot complete, but the problem can be resolved for example by ensuring that any transaction which commenced earlier will complete successfully. So much for the overview - let's take a closer look at our database layers.
|
|
Copyright C 2006 by Software Developer's Journal. All rights reserved.






SDJ Users:
Shopping Cart









