Where to Store Data | Applying Domain-Driven Design and Patterns: With Examples in C# and .NET

Assume we start with a clean sheet of paper; how would we like to store the data? We have at least four choices:

RAM
File system
Object database
Relational database

Note

I realize that this may be a case of apples and oranges because it is really a two-part problem: what to store (objects, hierarchies such as XML, or tables) and where to store it (RAM, file system, object database, or relational database). But splitting the question that way is not without problems, either. After thinking through it a couple of times, I think my original way of describing it is good enough, so let's move on.

I'll start with RAM.

RAM

Storing the data in RAM isn't necessarily as silly as it might first sound. What I'm thinking about is seeing the Domain Model as the database itself while at the same time keeping persistent storage of all changes as a log on disk so that the Domain Model can be re-created in case of a system crash.

The beauty is that you avoid mapping from the Domain Model onto something else. The Domain Model is persistent by itself.
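To make the idea a bit more concrete, here is a minimal sketch in Python of a Domain Model that acts as its own database and logs every change to disk. The class name CustomerBook, the change format, and the file name changes.log are all made up for this illustration; a real solution would need much more (transactions and concurrency, for example).

```python
import json
import os
import tempfile


class CustomerBook:
    """A tiny in-memory Domain Model; the objects themselves are the database."""

    def __init__(self, log_path):
        self.customers = {}          # customer id -> name
        self.log_path = log_path
        self._replay()               # rebuild state from the log, if there is one

    def _apply(self, change):
        # The single place where a change touches the in-memory state.
        if change["op"] == "add":
            self.customers[change["id"]] = change["name"]
        elif change["op"] == "remove":
            self.customers.pop(change["id"], None)

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                self._apply(json.loads(line))

    def execute(self, change):
        # Write the change to disk first, then apply it in memory.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(change) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._apply(change)


log_path = os.path.join(tempfile.mkdtemp(), "changes.log")
book = CustomerBook(log_path)
book.execute({"op": "add", "id": 1, "name": "Volvo"})
book.execute({"op": "add", "id": 2, "name": "Saab"})
book.execute({"op": "remove", "id": 2})

# Simulate a crash and restart: a new instance rebuilds itself from the log.
recovered = CustomerBook(log_path)
```

After the simulated restart, recovered.customers holds the same single customer as book.customers. Note that nothing is mapped to another model; only the changes are written down.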

You could also store hierarchies in memory, with XML documents as a typical example. Then you do get some impedance mismatch because you are using two models and need to transform between them. On the other hand, you do get some functionality for free, such as querying if you find XPath and XQuery to be good languages for that.
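As a small illustration of the querying you get for free, the following Python sketch keeps a hierarchy in memory as an XML document and queries it with an XPath expression. The document content is invented for the example, and the standard library's ElementTree only supports a limited subset of XPath (and no XQuery), so treat it as a taste rather than a full picture.

```python
import xml.etree.ElementTree as ET

# The in-memory "database": an XML hierarchy of orders (sample data).
doc = ET.fromstring("""
<orders>
  <order id="1" customer="Volvo"><total>100</total></order>
  <order id="2" customer="Saab"><total>250</total></order>
  <order id="3" customer="Volvo"><total>75</total></order>
</orders>
""")

# ElementTree supports a limited XPath subset, including attribute predicates.
volvo_orders = doc.findall("./order[@customer='Volvo']")
volvo_ids = [order.get("id") for order in volvo_orders]

# Values come back as text, so they must be converted before use; this is
# a small instance of the transformation between the two models.
volvo_total = sum(int(order.findtext("total")) for order in volvo_orders)
```

Here volvo_ids contains the ids of the two Volvo orders and volvo_total is their sum.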

No matter what you "store" in RAM, one problem is the size of RAM. If you have 2 GB of RAM left over in your machine when the system and other applications have taken their share, your database shouldn't be larger than 2 GB or performance will suffer.

On the other hand, 64-bit servers are becoming more and more common, and RAM prices are dropping all the time. For a majority of applications, the current as well as the next generation of hardware is often enough, at least when considering the possible maximum size of the database.

Another problem is that it takes time to recreate the Domain Model after a system crash, because working through a huge log will take time. The problem is reduced if the RAM-based solution takes snapshots to disk every now and then of the current Domain Model. But the problem is still there, and taking snapshots might be a problem in itself because it can bring the system to its knees and force other requests to wait while the snapshot is taken.
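The combination of snapshots and a log could be sketched like this in Python. The names SnapshottingStore, snapshot.json, and changes.log are assumptions made for the illustration. The sketch is single-threaded, so every other request would indeed have to wait while take_snapshot runs.

```python
import json
import os
import tempfile


class SnapshottingStore:
    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.log_path = os.path.join(directory, "changes.log")
        self.balances = {}       # account name -> balance
        self.replayed = 0        # number of log entries replayed at startup
        self._recover()

    def _apply(self, change):
        account, amount = change
        self.balances[account] = self.balances.get(account, 0) + amount

    def _recover(self):
        # Load the latest snapshot first, then replay only the newer changes.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path, encoding="utf-8") as f:
                self.balances = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as f:
                for line in f:
                    self._apply(json.loads(line))
                    self.replayed += 1

    def deposit(self, account, amount):
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps([account, amount]) + "\n")
        self._apply((account, amount))

    def take_snapshot(self):
        # Write to a temporary file and rename it, so a crash during the
        # write never leaves a half-written snapshot behind.
        tmp = self.snapshot_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.balances, f)
        os.replace(tmp, self.snapshot_path)
        # The snapshot now covers every logged change, so start a fresh log.
        # (A crash between these two steps is NOT handled in this sketch.)
        open(self.log_path, "w").close()


directory = tempfile.mkdtemp()
store = SnapshottingStore(directory)
for _ in range(1000):
    store.deposit("acme", 1)
store.take_snapshot()
store.deposit("acme", 5)

# Simulated restart: only the one change made after the snapshot is replayed.
restarted = SnapshottingStore(directory)
```

Without the snapshot, the restart would have had to replay all 1,001 log entries; with it, restarted.replayed is 1.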

Note

Gregory Young pointed out that to make the issue of snapshots smaller, context boundaries can be used within the domain and they can be snapshotted separately.

One of the worst problems with this approach comes in making changes to the schema. What you'll have to do is serialize the objects to disk, and then deserialize them into the Domain Model again with the new schema, which is a tedious task. Working with XML instead of pure classes helps with this problem.
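One way to picture the serialize and deserialize round trip is the following Python sketch, where data dumped under an old schema is upgraded while it is loaded into the new one. The Customer class, the version field, and the rule for splitting a name are all invented for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Customer:              # the new schema: the name is split in two
    first_name: str
    last_name: str


def upgrade(record):
    """Bring a stored record up to (hypothetical) schema version 2."""
    if record.get("version", 1) == 1:
        # Version 1 stored the full name in a single field.
        first, _, last = record["name"].partition(" ")
        record = {"version": 2, "first_name": first, "last_name": last}
    return record


def load_customers(serialized):
    customers = []
    for record in json.loads(serialized):
        record = upgrade(record)
        customers.append(Customer(record["first_name"], record["last_name"]))
    return customers


# Data serialized to disk under the old schema (version 1).
old_dump = json.dumps([{"version": 1, "name": "Ada Lovelace"},
                       {"version": 1, "name": "Alan Turing"}])
customers = load_customers(old_dump)
```

Every schema change needs another upgrade step like this one, which hints at why the task gets tedious.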

Another big problem is that of transactions. Fulfilling Atomicity, Consistency, Isolation and Durability (ACID) is not easily done without hurting scalability a lot. First, instead of using the "try and see" approach in this case it's better to prepare for a transaction as much as possible in order to investigate whether the task is likely to succeed. ("I need to do this; will that get me into trouble?") Of course, this won't solve the whole problem, but at least it will reduce it. It's good to be proactive here, especially considering that you have very good efficiency because there are probably no process boundaries or network involved.

Note

This depends on the topology. If you consume the Domain Model (and therefore the "database" in this case) from another machine, there are process boundaries and networks to cross.

Again, being proactive didn't provide any transactional semantics; it just made a rollback less likely. One approach I've heard about requires RAM twice the size of the Domain Model if transactions are needed, so that the changes can be made in a copy of the Domain Model. If all changes are successful, they are redone in the ordinary Domain Model. Meanwhile, the Domain Model is inherently single-user. Even keeping in mind that operations are usually extremely fast, this does not seem scalable to me. I also consider it to be a sign of immaturity, so expect better solutions to this later on. At the time of this writing, there are lightweight transaction managers in the works or being introduced, which might be a good solution to the problem.
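The copy-based approach could look roughly like the Python sketch below. Bank, InsufficientFunds, and run_in_transaction are hypothetical names, and the sketch assumes a single user and deterministic operations, which is exactly the limitation described above.

```python
import copy


class InsufficientFunds(Exception):
    pass


class Bank:
    def __init__(self):
        self.balances = {}

    def deposit(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount

    def withdraw(self, account, amount):
        if self.balances.get(account, 0) < amount:
            raise InsufficientFunds(account)
        self.balances[account] -= amount


def run_in_transaction(model, operations):
    """Try the operations on a copy; redo them on the real model on success."""
    scratch = copy.deepcopy(model)       # this is what doubles the RAM need
    try:
        for name, args in operations:
            getattr(scratch, name)(*args)
    except Exception:
        return False                     # real model untouched: the "rollback"
    # Assumed to succeed: the same operations just succeeded on an identical
    # copy, and nobody else can change the single-user model in between.
    for name, args in operations:
        getattr(model, name)(*args)
    return True


bank = Bank()
bank.deposit("a", 100)

# Both steps succeed on the copy, so they are redone on the real model.
ok = run_in_transaction(bank, [("withdraw", ("a", 60)), ("deposit", ("b", 60))])

# The second step fails on the copy, so the real model never sees the first.
failed = run_in_transaction(bank, [("deposit", ("b", 10)), ("withdraw", ("a", 500))])
```

The second call returns False and leaves the balances exactly as they were after the first one.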

One more problem is that there is no obvious choice for a query language. There are some open source alternatives, but again, this feels a bit immature at the moment. This problem gets bigger when you consider reporting and the need for ad-hoc querying done by end users and not just querying from within the application. For example, typical report writer tools for end users are pretty useless in this case.

Note

If possible, reporting should be done on a dedicated server and dedicated database anyway, so this might be less of a problem than first anticipated. On the other hand, in reality there is often at least a grey zone between what are reports and what are lists for supporting the daily transactional work, so to speak. And voilà, the problem is back.

Yet another problem is that navigation in the Domain Model is typically based on traversing lists. There might not be built-in support for indexing. Sure, you can use hash tables here and there, but they only solve part of the problem. You can, of course, add an in-memory indexing solution if you need it. On the other hand, you should note that this certainly won't be a problem as early on as it is for disk-based solutions.
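The difference between traversing a list and using a hash table can be sketched in Python as follows. The Order class and the sample data are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    customer: str
    total: int


# Sample data: 10,000 orders spread evenly over 100 customers.
orders = [Order(i, f"customer-{i % 100}", i) for i in range(10_000)]


def find_by_scan(customer):
    # Traversing the list: the cost grows with the number of orders.
    return [o for o in orders if o.customer == customer]


# A hash-based index: customer name -> list of that customer's orders.
orders_by_customer = defaultdict(list)
for order in orders:
    orders_by_customer[order.customer].append(order)


def find_by_index(customer):
    # A single dictionary lookup, independent of the total number of orders.
    return orders_by_customer.get(customer, [])
```

Both functions return the same orders for a given customer. The index only helps with equality lookups on the customer name, though; a range query on total would still need a scan or a different kind of index, and the index has to be kept up to date whenever orders change. That is the sense in which hash tables solve only part of the problem.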

Finally, as I've already said a couple of times, I consider this approach to be a bit immature, but very interesting and promising for the future, at least in certain situations.

File System

Another solution is to use the file system instead of RAM. What to persist is the same as with RAM, namely the Domain Model objects or XML. As a matter of fact, this solution could be very close to the RAM solution. It could be the same if the database is small, and it might "only" spill out to disk when the RAM is filled to a certain level.

This approach has similar problems to the previous one, except that the size of RAM isn't as much of a limiting factor in this case. On the other hand, the performance characteristics will probably be less impressive.

I believe that it might be pretty appealing to write your own solution for persisting the Domain Model (or XML documents), but as always when you decide to roll your own infrastructure, you have to be prepared for a lot of work. I know that nothing is more inspiring to a developer than hearing that "it's too hard" or "too complex," so if they haven't already done so, now just about everybody will be writing their own file system-based solution, right? Just be prepared: it looks deceptively simple at first, but the devil is in the details, and the complexity increases as you move forward.

If you do decide to build a Domain Model that could spill out to disk when persisting, what you actually create is quite like an object database. (Perhaps that gives a better sense of the amount of work and complexity.)

Object Database

Historically, there have been many different styles of object databases, but the common denominator was that they tried to avoid the transformation between objects and some other storage format. This was done for more or less the same reasons as I have been talking about when wanting to delay adding infrastructure to the Domain Model, as well as for performance reasons.

Note

The number of styles increases even more if we also consider the hybrids, such as object-relational databases, but I think those hybrids have most often come from a relational background and style rather than from the object-oriented side.

As it turned out, the number of distractions was not zero. In fact, you could say that the impedance mismatch was still there, but compared to bridging the gap between objects and a relational database, using object databases was pretty clean.

So far, the problems with object databases have been as follows:

Lack of standards
No critical mass
Maturity
More of a niche product
Integration with other systems
Reporting

Note

My evil friend Martin Rosén-Lidholm pointed out that many of the same arguments actually could be used against DDD and O/R Mapping compared to data-oriented solutions in a .NET-world, considering Microsoft's application blocks, papers, guidelines, and so on.

I'll pretend I didn't hear that. And a focus on the domain will probably become more and more popular for Microsoft's guidance as well, which Martin totally agrees with.

I'm certainly no expert on object databases. I've played with a couple of them over the years, and that's about the extent of it. For more information, my favorite books on the subject are [Cattell ODM] and [Connolly/Begg DB Systems].

There was a time, around 1994, when I thought object databases were taking over as the de facto standard. But I based that idea on purely technical characteristics, and life isn't as simple as that. Object databases were promising a decade ago. They are used very much in certain situations today, but above all, they are still just promising. As I see it today, the de facto standard is still relational databases.

Relational Database

As I said, the de facto solution for storing data in applications is to use a relational database, and this is the case even if you work with a Domain Model.

Storing the data in a relational database means that the data is stored in tabular format, where everything is data, including the relationships. This has proved to be a simple and yet effective (enough) solution in many applications. But no solution is without problems, and in this case when we want to persist a Domain Model in a relational database, the problem is the impedance mismatch. However, I talked about that at length in Chapter 1, "Values to Value," so I won't repeat it here.

If we go this route, the most common solution is to use an implementation of the Data Mapper pattern [Fowler PoEAA]. The purpose of the Data Mapper pattern is to bridge the gap between the Domain Model and the persistent representation, to shuffle the data both ways. We'll come back to that pattern in a few minutes.
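To give a first feeling for the pattern, here is a deliberately small sketch of a Data Mapper in Python, using the standard library's sqlite3 module as the relational database. The Customer class, the customers table, and the CustomerMapper class are assumptions made for the illustration, and a real implementation handles much more (updates, relationships, identity, and so on).

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Customer:
    """Domain object: knows nothing about tables or SQL."""
    name: str
    credit_limit: int
    id: Optional[int] = None


class CustomerMapper:
    """Moves data between Customer objects and the customers table."""

    def __init__(self, connection):
        self.connection = connection
        connection.execute(
            "CREATE TABLE IF NOT EXISTS customers ("
            "id INTEGER PRIMARY KEY, "
            "name TEXT NOT NULL, "
            "credit_limit INTEGER NOT NULL)"
        )

    def insert(self, customer):
        # Object -> row.
        cursor = self.connection.execute(
            "INSERT INTO customers (name, credit_limit) VALUES (?, ?)",
            (customer.name, customer.credit_limit),
        )
        customer.id = cursor.lastrowid

    def find(self, customer_id):
        # Row -> object.
        row = self.connection.execute(
            "SELECT id, name, credit_limit FROM customers WHERE id = ?",
            (customer_id,),
        ).fetchone()
        if row is None:
            return None
        return Customer(id=row[0], name=row[1], credit_limit=row[2])


connection = sqlite3.connect(":memory:")
mapper = CustomerMapper(connection)
volvo = Customer(name="Volvo", credit_limit=1000)
mapper.insert(volvo)
loaded = mapper.find(volvo.id)
```

The point of the design is that Customer stays free of persistence code; all knowledge of the table lives in the mapper, which shuffles the data both ways.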

Choosing what storage solution to use isn't obvious. Still, we have to make a choice.

Before choosing and moving forward, I'd like to think about a couple of other questions.

One or Several Resource Managers?

Another, completely different, question to ask is whether one resource manager should be used or several. It might turn out that you don't have a choice.

The Domain Model excels in a situation where there are several resource managers because it can completely hide this complexity from the consumers if desired. But we should also be clear that the presence of multiple resource managers adds to the complexity of mapping the Domain Model to its persistence. In order to make things simpler in the discussion, I'll only assume one resource manager here.

Other Factors

In reality, we rarely start with a clean sheet of paper. There are factors that color our decision, such as what we know ourselves. A good way of becoming efficient is to work with technology that you know.

Other typical factors come into play as well, apart from the raw technology factors that we talked about earlier and that didn't produce a clear winner. One example is what systems the customer has already invested in (bought and trained the staff in).

Maturity in solutions is also a very influential factor when it comes to the data. Losing data vital to the business processes just isn't an option, so customers are often picky if you choose what they think is an unproven solution.

Choose and Move On

Taking the technological reasons, as well as the other factors I have mentioned, into consideration, it's no wonder the relational database is a common choice. Therefore, let's assume this and move on. It feels like a decent choice, and will probably be the default choice for a long time to come.

It actually makes me want to add some requirements to the list of our requirements on the persistence infrastructure:

Dealing carefully with the relational database. (Sticking as closely as possible to how we would program it manually.)
Strong querying support. (Which is, to a large degree, what the previous bullet was about.)
Support for concurrency collision detection.
Support for advanced mapping, such as different inheritance strategies (even if I will probably be careful using "inheritance" in the database) and fine-grained types in the Domain Model.
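Of the requirements above, concurrency collision detection is perhaps the easiest to show in a few lines. The Python sketch below uses a version column and a conditional UPDATE, which is one common way of doing it by hand, not necessarily how any particular persistence infrastructure does it. The table layout and the names update_name and ConcurrencyCollision are made up for the example.

```python
import sqlite3


class ConcurrencyCollision(Exception):
    pass


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)"
)
connection.execute("INSERT INTO customers VALUES (1, 'Volvo', 1)")


def read_customer(customer_id):
    return connection.execute(
        "SELECT id, name, version FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()


def update_name(customer_id, new_name, version_read):
    # The UPDATE only matches if nobody has changed the row since it was read.
    cursor = connection.execute(
        "UPDATE customers SET name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_name, customer_id, version_read),
    )
    if cursor.rowcount == 0:
        raise ConcurrencyCollision(customer_id)


# Two users read the same row, then both try to save their change.
_, _, version_seen_by_a = read_customer(1)
_, _, version_seen_by_b = read_customer(1)

update_name(1, "Volvo Cars", version_seen_by_a)         # succeeds
try:
    update_name(1, "Volvo Trucks", version_seen_by_b)   # stale version
    collision_detected = False
except ConcurrencyCollision:
    collision_detected = True
```

The second save is rejected instead of silently overwriting the first, and the application can then decide what to do about it.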

As I said, what is then needed is an implementation of the Data Mapper pattern. The question is how to implement that pattern.