SQL Database Usage and Transaction Essentials for Programmers
This is a quick article designed to be an introduction to some of the basics of using a database, for non-database admins. This is not intended to be a full database course, or to turn you into a database admin or expert – we won’t be covering database or set theory, or getting into the other deeper workings. This is intended to be a high-level overview for people who spend most of their time coding logic, not worrying about backend storage.
For this article we’re mainly talking about SQL or RDBMS databases. NoSQL databases like Object Storage or Document Storage databases often share some of the base characteristics, but may need to be approached differently.
What is a database?
A database at it’s core is storage for sets of related data. The relations between data can be one to one, one to many, many to many, or many to one. This is typically done by storing data in tables, which can then be referenced to other tables.
A quick example is these three tables:
Customer
Name | Address | Phone | CustomerID |
Bob | 132 Court | 555 | 7 |
Jane | 673 Street | 777 | 0 |
Items
ItemID | Description | Price |
8 | Apple | 3 |
6 | Orange | 5 |
Order
OrderID | CustomerID | ItemID | Num |
10 | 7 | 8 | 2 |
10 | 7 | 6 | 1 |
11 | 0 | 6 | 1 |
13 | 7 | 6 | 3 |
One to one data is data like in the ‘Customer’ or ‘Items’ tables: Each row is a single set of data. While this dataset doesn’t have a good example of one-to-many (perhaps each customer should have more than one possible phone number?), you can see it has a many-to-many in the items to orders, where each order is multiple items and item is in multiple orders. There’s also many-to-one, in that there are multiple orders for the customer ‘Bob’, or in that there’s many items in a single order.
Note that this is just one possible way to structure this data in the database. If you want to get into how to structure the data and store it, you may want to read up on database normalization and when to use denormalization instead. (Typically normalization is preferred unless there’s specific performance needs for denormalization.)
Why are you using a database?
The obvious answer here is ‘to store sets of related data’ – however in our experience that’s often not the primary reason to use a database. The primary reason is typically concurrent use. That is, that a database will let multiple processes/users access the storage at the same time without significant slowdowns or interruptions. A secondary reason is often redundancy and scaling – it is typically far easier to have multiple database instances kept in sync, that are all read by multiple front end systems (each running multiple processes) than it is to try to share a filesystem effectively over the same scale in real-time.
These are not bad reasons to use a database – database systems are well-designed to handle large numbers of concurrent requests from multiple requestors, and are designed to scale well. However if you’re focusing solely on them without any understanding of the database you’ll find yourself in trouble.
The Issue: Maintaining Relations
The trouble you’re likely to find yourself in, is that if you’re focused on the ease of use you can forget about the fact that the data in the database is related, and this needs to be maintained while being concurrent.
In our example above, there’s an obvious issue with the orders: To it takes multiple entries in the ‘order’ table to for a single order. The obvious way to enter that is:
INSERT INTO order (OrderID,CustomerID,ItemID,Num) VALUES(10,7,8,2); INSERT INTO order (OrderID,CustomerID,ItemID,Num) VALUES(10,7,6,1);
However, there’s a problem with that: Those are two separate statements. If another process was doing at the same time the following:
SELECT ItemID,Num FROM order WHERE OrderID = 10;
You could get one of three results: No entries, two entries, or one entry, if the second query was handled between the first and the second insert.
If you’re a bit more experienced, you’ll know the simple solution to this issue in this case is the following:
INSERT INTO order (OrderID,CustomerID,ItemID,Num) VALUES(10,7,8,2), (10,7,6,1);
But that’s only applicable in simple cases. More complicated operations, especially if it involved multiple tables, can’t always be combined into a single SQL statement. For the moment we’ll continue using the simple example.
Solution: Transactions.
Databases however have had to have a solution to this for a long time. The solution is fairly simple: Set a start and an end point, and make sure the entire block of statements in between operates as one piece.
START TRANSACTION; INSERT INTO order (OrderID,CustomerID,ItemID,Num) VALUES(10,7,8,2); INSERT INTO order (OrderID,CustomerID,ItemID,Num) VALUES(10,7,6,1); COMMIT;
Now the database will make sure no other query can see what’s happening in between the START TRANSACTION and COMMIT statements until the COMMIT has occurred. No other query will see just one of those inserts: They will either see both or neither.
Transactions can get quite complicated. Consider the following case: Jane wants to add one item to her current order. If it’s an apple, the following will work:
START TRANSACTION; INSERT INTO order (OrderID,CustomerID,ItemID,Num) VALUES(11,0,8,1); COMMIT;
However, that assumes you know the OrderID. You may need to get that first:
START TRANSACTION; SELECT DISTINCT OrderID FROM order WHERE CustomerID = '0'; < Your code receives the OrderID, then passes it to the next statement > INSERT INTO order (OrderID,CustomerID,ItemID,Num) VALUES(11,0,8,1); COMMIT;
Again, you likely could do this with one SQL statement and be more efficient. However there’s a larger problem: What if instead she wanted to add an orange? Then you need something like the following:
START TRANSACTION; SELECT DISTINCT OrderID FROM order WHERE CustomerID = '0'; < Your code receives the OrderID, then passes it to the next statement > SELECT ItemID, Num FROM order WHERE OrderID = '11'; < Your code computes the new number of oranges > UPDATE order SET Num = '2' WHERE OrderID = '11' AND ItemID = '6'; COMMIT;
Except you still have a problem with the above: While the transaction means that changes you make won’t affect other transactions until you are done, it doesn’t mean changes that other transactions won’t affect the table. If there’s another transaction that adds 10 oranges to Jane’s order, depending on what order they run in you could have any of the following:
- Jane’s order has 22 oranges.
- Jane’s order has 21 oranges.
- Jane’s order has 2 oranges.
The problem is that while the database knows that you want to work with a consistent set of data, and that you want all changes to apply together, it doesn’t know when you are looking at data you intend to change later.
Unless, of course, you tell it with a FOR UPDATE clause:
START TRANSACTION; SELECT DISTINCT OrderID FROM order WHERE CustomerID = '0'; < Your code receives the OrderID, then passes it to the next statement > SELECT ItemID, Num FROM order WHERE OrderID = '11' FOR UPDATE; < Your code computes the new number of oranges > UPDATE order SET Num = '2' WHERE OrderID = '11' AND ItemID = '6'; COMMIT;
Now when the database runs the second select, it locks that row so no other updates can be done to it until your transaction ends.
The Downside: Locks and Rollbacks.
So, what happens now when the second transaction tries to do it’s select for update? The simple answer is that it waits for this first transaction (who locked the row) to be done. Assuming nothing goes wrong, that’s a slight delay, and is basically the cost of doing concurrent access to changing data. The database itself will do what it can to minimize that delay – typically for example it will only lock that one row in the one table, so other changes can be made without causing delays. Different databases and database engines will optimize this in different ways – and may optimize it differently for different tables, depending on the amount of data, which indexes are in use, etc.
Of course, you don’t want to wait forever. And it’s entirely possible to get two transactions each waiting on locks that the other is holding to release so that neither can complete. What’s a database to do?
The basic answer in that case is that the database will have to decide when a transaction is waiting to long, and will eventually kill that transaction. To avoid causing issues, when it does the database will roll back all of the changes that were made since the start of the transaction, and fail the commit. (Or last statement.) The end result is the same as if the transaction had never been started at all, and hopefully any deadlocks are freed and will complete. So you can then run the transaction again (possibly getting new values from SELECT statements and similar) and have it complete.
For the programmer using the database, the important part is to know that potentially any operation on a database can fail. And that failure may be temporary and limited to just that transaction in time. You’ll want to be able to handle that failure. That may mean aborting the operation to the user, or it may mean trying the operation again in the database and seeing if it works the second time. Failures should be rare if you’re using the database well – if you’re seeing large numbers of failures due to deadlocks it’s typically a sign that you need to change the design of your database.