A database legend is going back to the drawing board for new ways to support modern scale-out applications.
Jim Starkey cut his teeth as a college graduate designing a data computer for ARPANET, designed some of the earliest relational databases at Digital Equipment Corp. and developed the Falcon database storage engine for MySQL.
In an interview with SearchDataCenter, Starkey explained why traditional transactional databases and newer NoSQL designs aren't up to the task of supporting modern applications -- and how his startup, NuoDB, aims to change database infrastructure.
How are applications changing? What makes a new kind of database infrastructure necessary?
Jim Starkey: The problem with applications -- especially mobile applications -- is that developers go from two or three dozen test subjects to going live and suddenly facing the possibility that, next week, they might have 10 million users. Dealing with scalability is pretty complicated. As long as your load stays within what an existing database system designed to run on a single machine can handle, you're fine. But once you're past that point, the database system isn't going to help you very much. The scalability problem then falls back to the application developer, and that's not something they're very good at. And when they suddenly find they have a performance problem -- or, really, a scalability problem -- it comes at a bad time. Suddenly, they've got an artificial barrier to reaching all the customers they worked very, very hard to acquire.
What we've tried to do with NuoDB is build a relational database system that's completely familiar, standards-compliant, using a standard interface that somebody can build an application around, take it public and just get more capacity by plugging in more computers. Application developers don't have to worry about the question of scalability; they don't have to go back and redefine the application for partitioning or change the locus of intelligence with sharding. They can just concentrate on the things that are important to their business success, and the database will take care of itself.
There are also NoSQL databases. What makes those inadequate?
Starkey: When you look at the combination of an application program and a database system, the question is, 'Where's the locus of intelligence?' Is it in the application program, or the database system? Do you let the database worry about where stuff is, and what indexes exist and what everyone else wants, and just get back what you need? Or do you have to build all of that into your application?
Typical database systems are high-level systems: Give it a SQL query, it'll figure out how to do it and return results back to you, and manage interaction with any other users on the system, so that you don't have to worry about it. NoSQL databases are really dumb. They're a key-value store, where, essentially, you have a primary key, you give it the primary key and you get back a blob -- and what the blob is, that's your job to figure out, rather than the database's. The database can't do much for you. It's a really good solution to a shopping cart app, [but] it's not a very good solution for other things.
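To make the "locus of intelligence" contrast concrete, here is a minimal sketch in Python. It is an illustration only: the shopping-cart data, the `cart_items` table and both function names are hypothetical, a plain dictionary stands in for a key-value store, and SQLite stands in for a relational database. None of it reflects NuoDB or any particular NoSQL product.

```python
import json
import sqlite3

# Key-value style: the store maps a primary key to an opaque blob.
# Interpreting the blob is the application's job (hypothetical sample data).
kv_store = {
    "cart:1": json.dumps({"owner": "ann", "items": [{"sku": "A1", "qty": 2}]}),
    "cart:2": json.dumps({"owner": "bob", "items": [{"sku": "A1", "qty": 1},
                                                     {"sku": "B7", "qty": 3}]}),
}


def kv_units_in_carts(store, sku):
    """Answer a cross-cart question by decoding every blob in application code."""
    total = 0
    for blob in store.values():
        cart = json.loads(blob)
        total += sum(item["qty"] for item in cart["items"] if item["sku"] == sku)
    return total


# Relational style: the database knows the structure and plans the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cart_items (cart_id INTEGER, owner TEXT, sku TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO cart_items VALUES (?, ?, ?, ?)",
    [(1, "ann", "A1", 2), (2, "bob", "A1", 1), (2, "bob", "B7", 3)],
)


def sql_units_in_carts(connection, sku):
    """Ask the same question declaratively; the database does the work."""
    row = connection.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM cart_items WHERE sku = ?", (sku,)
    ).fetchone()
    return row[0]
```

Fetching one cart by its key is trivial in the key-value version, which is why it suits a shopping cart. Any question that spans carts has to be written, and optimized, by the application.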
How does NuoDB's approach to modern applications differ from traditional database infrastructure?
Starkey: As a standard SQL solution, the way you build an application is not significantly different from how you build a system against other relational database management systems. The difference is that it scales. If you're running Oracle on a single machine, and you reach the capacity of that machine, you switch to Oracle RAC, and that gets you some more performance. But when that gets exhausted, you're done. With NuoDB, you can take an intuitive database application design, and rather than changing the application to handle more scalability, you just plug in more computers.
How is consistency handled in that model?
Starkey: The way MVCC [multiversion concurrency control] works is that when the database stores a record, it puts a transaction number on that record. If someone modifies the record, it keeps the old version there [and] stores a new version in the same place, tagged with a new transaction ID and pointing back to the old one. When a transaction starts, it keeps track of the state of all the other transactions going on concurrently, so when you look at a record, it can tell whether that record was committed when it started. If it wasn't committed when it started, then it'll go to an older version and apply the test until it finds a version of the record that's consistent with when it started. It doesn't hold a record lock at all. Readers don't block writers; writers don't block readers.
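The visibility rule Starkey describes can be sketched in a few lines of Python. This is a simplified illustration, not NuoDB's implementation: the class and method names are hypothetical, each transaction simply remembers which transaction IDs were already committed when it began, and write-write conflict detection, rollback and garbage collection of old versions are all left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    value: str
    txn_id: int                         # transaction that wrote this version
    older: Optional["Version"] = None   # link back to the previous version


class Transaction:
    def __init__(self, txn_id, committed_at_start):
        self.txn_id = txn_id
        # Snapshot of which transactions had committed when this one began.
        self.visible = frozenset(committed_at_start)


class MVCCStore:
    def __init__(self):
        self.records = {}      # key -> newest Version
        self.committed = set()
        self.next_txn_id = 1

    def begin(self):
        txn = Transaction(self.next_txn_id, self.committed)
        self.next_txn_id += 1
        return txn

    def write(self, txn, key, value):
        # Keep the old version and chain the new one in front of it.
        self.records[key] = Version(value, txn.txn_id, self.records.get(key))

    def commit(self, txn):
        self.committed.add(txn.txn_id)

    def read(self, txn, key):
        # Walk back through versions until one is consistent with the
        # reader's starting point. No lock is taken at any step.
        version = self.records.get(key)
        while version is not None:
            if version.txn_id == txn.txn_id or version.txn_id in txn.visible:
                return version.value
            version = version.older
        return None
```

In this sketch, a reader that began before a writer committed keeps seeing the older version for its whole lifetime, while a transaction that begins afterward sees the new one. Neither ever waits on the other.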
So, how does NuoDB use MVCC in a new way?
Starkey: The difference between NuoDB and all other relational database systems is that all other database systems are based on disk pages and disk files. NuoDB isn't. The SQL engine gets layered on what we call atoms, which are distributed objects, and there can be many instances of an atom within a cluster. They all know each other and they all replicate each other -- they're distributed objects. Inside of the data object are records, and it keeps track of old versions and new versions. When you access an atom through a transaction, it figures out which version was the current one when you started, and that's the one that you see.
The thing that's really interesting about this is that, in database systems that are designed around disk files, the disk file only exists in one place. When you make a change in one place, it doesn't automatically go everywhere else. But with distributed objects, each object is consistent and they know about replication to other objects.
When you plug in a new node to a NuoDB cluster, it does a crypto-handshake with somebody else and says, 'Here's your master catalog, take it from there.' Within the catalog, everybody knows each atom, who has it and where to find a copy of it. Everybody pings each other once in a while, and they know who's responsive. It doesn't matter if the machine is overloaded and slow, or at the other end of the world; the node just knows it takes a longer time to get a ping back. So, when a node needs access to an atom, it knows who has it [and] picks the one that's most responsive -- which is likely to be another node in the same rack -- and says, 'Give me a copy of this; everybody else, I've got it.' It's a fairly simple idea, but it works like a charm.
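The selection step described here, picking the most responsive holder of an atom and then announcing the new copy, might look roughly like the following Python sketch. The `Catalog` class, its method names and the node names are hypothetical; the handshake, the actual data transfer and failure handling are omitted, and the real NuoDB protocol may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical stand-in for the shared catalog each node keeps."""
    holders: dict = field(default_factory=dict)   # atom id -> set of node names
    ping_ms: dict = field(default_factory=dict)   # node name -> latest round trip

    def record_ping(self, node, rtt_ms):
        # A slow or distant node simply shows up with a larger round-trip time.
        self.ping_ms[node] = rtt_ms

    def register_copy(self, atom_id, node):
        self.holders.setdefault(atom_id, set()).add(node)

    def pick_source(self, atom_id):
        # Consider only holders that have answered a ping.
        candidates = [n for n in self.holders.get(atom_id, ()) if n in self.ping_ms]
        if not candidates:
            return None
        return min(candidates, key=lambda n: self.ping_ms[n])

    def fetch(self, atom_id, requester):
        source = self.pick_source(atom_id)
        if source is None:
            raise LookupError(f"no responsive holder for atom {atom_id}")
        # "Everybody else, I've got it": advertise the new copy.
        self.register_copy(atom_id, requester)
        return source


catalog = Catalog()
catalog.register_copy(7, "same-rack-node")
catalog.register_copy(7, "far-away-node")
catalog.record_ping("same-rack-node", 0.4)
catalog.record_ping("far-away-node", 180.0)
chosen = catalog.fetch(7, "new-node")
```

Here `chosen` is the same-rack node because its ping time is lowest, and the new node is then listed as a holder that later requests can draw on.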
What the future looks like
In part two of this interview, Starkey explores the future of databases and scalable applications, moving away from the SQL model.