@ShahidNShah
Guest Article: Beware of NULLs in healthcare databases
Many readers have been asking for more “practical” advice on their database models so I’ve asked a fellow healthcare data architect to lend a hand. Tom Maloney is a Senior Data Architect for Stockamp and Associates with over 25 years of experience and knowledge working with and designing databases for most industries. Tom has done a lot of freelance contracting through his own company where he lives and breathes data modeling. In this guest article Tom is discussing the pitfalls of allowing NULLs in healthcare databases and his suggestions and arguments have a lot of merit. As usual, recommendations like the ones Tom is making do not apply in all circumstances but they are worth reviewing.
On our patient claim system the hospital wanted to know the average payment amount received between two periods of time. As we all know the average is the sum of the values divided by the number of rows. So we constructed a simple query to calculate the average.
Select SUM(Amount)/COUNT(*) FROM [dbo].[Payment]
WHERE
EffectiveDate BETWEEN ‘20060330’
AND ‘20060410’
After the answer was returned; we remembered that SQL
provides a function that calculates and returns the average, so we decided to
use it. When we ran this query
Select AVG(Amount)
FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN
‘20060330’ AND ‘20060410’
we had a completely different answer. Which one was correct or could they both be wrong? What cause the difference? The culprit turned out to be NULL values in some of the columns.
To prove this we performed a little experiment by creation and population the following table called Payment populated it with ten rows. We allowed some of the column in a few rows to contain NULL values (Note: the syntax is Transact-SQL from Microsoft SQL Server).
Table Creation:
CREATE TABLE [dbo].[Payment](
[Payment_Key] [int] NOT
NULL,
[Amount] [money] NULL,
[EffectiveDate] [datetime] NULL,
[Note] [varchar](1024) NOT NULL DEFAULT (”),
[PaymentRefNumber] [varchar](16) NOT NULL,
PRIMARY KEY CLUSTERED
(
[Payment_Key] ASC
)WITH (IGNORE_DUP_KEY = OFF)
)
Go
Populate the Table:
INSERT INTO [dbo].[Payment]
VALUES (1, $125.00, NULL, ‘Payment 1’, ‘1’) —ß Null in date
INSERT INTO [dbo].[Payment]
VALUES (2, $125.00, ‘20060403’, ‘Payment 2’, ‘2’)
INSERT INTO [dbo].[Payment]
VALUES (3, Null, ‘20060404’, ‘Payment 3’, ‘3’) —ß Null in Amount
INSERT INTO [dbo].[Payment]
VALUES (4, $125.00, ‘20060404’, ‘Payment 4’, ‘4’)
INSERT INTO [dbo].[Payment]
VALUES (5, $125.00, ‘20060405’, ‘Payment 5’, ‘5’)
INSERT INTO [dbo].[Payment]
VALUES (6, $125.00, ‘20060405’, ‘Payment 6’, ‘6’)
INSERT INTO [dbo].[Payment]
VALUES (7, NULL, NULL, ‘Payment 7’, ‘7’) —ß Null Amount and date
INSERT INTO [dbo].[Payment]
VALUES (8, $125.00, ‘20060406’, ‘Payment 8’, ‘8’)
INSERT INTO [dbo].[Payment]
VALUES (9, $125.00, NULL, ‘Payment 9’, ‘9’) —ß Null date
INSERT INTO [dbo].[Payment]
VALUES (10, $125.00, ‘20060407’, ‘Payment 10’, ’10’)
GO
If you add all of the values in the Amount column we have $1000.00, there are ten rows, so the average should be $100.00, right? Not really. We were only looking for rows between Mar 30 and Apr 10, 2006. A NULL can represent any value, it is unknown at this time, and we do not know if NULL means not entered or if it is a place holder for valid date within the range we are interested in.
As we will find out a little later SQL counts NULL sometime and other times it ignores it. Using the BETWEEN predicate SQL ignores NULLs. This gives us actually seven rows to divide into the sum of the amount. Also, one of the rows with a valid date range contains a NULL amount giving us only six Amounts to sum (both the SUM and AVG SQL function ignores NULLs), giving a total amount of $750.00, divided by seven rows we have an average of $107.14 rounded). Not $1000.00 divided by 10 rows yielding $100.00 average.
Select SUM(Amount)/COUNT(*) FROM [dbo].[Payment]
WHERE
EffectiveDate BETWEEN ‘20060330’
AND ‘20060410’
The result is $107.1428, this is the answer we were looking for. Let’s try using the AVG function and see what is returned:
Select AVG(Amount)
FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN
‘20060330’ AND ‘20060410’
This time for the same set of data we got $125.00 as an average. What’s going on here? The answer lies in how SQL treats NULLs.
The following provides a deeper look into how SQL handled NULLs.
Other that the special way the SQL Server has to store a null and the extra logic SQL Server has to do to identify a column containing a null, here are a few favorites (comments from C. J. Date1):
SELECT AVG(Status) FROM S —
Result: x
SELECT SUM(Status) FROM S —
Result: y
SELECT COUNT(*) FROM S —
Result: z
there is no guarantee that x = y/z
- Likewise, the function reference SUM (F) is not semantically equivalent to the expression: f1 + f2 + … + fn where f1, f2…, fn are the values appearing in field F at the time the function is evaluated. Perhaps, even more counter-intuitively, the expression
SUM (F1 + F2)
Is not equivalent to the expression
SUM (F1) + SUM (F2)
- Since by definition NULL represents an unknown value, we define the results in every case to be unknown (i.e., NULL) also, rather than true or false. To deal with NULL values properly, therefore, it is necessary to adopt 3-valued logic in place of the usual 2-valued logic. The 3-valued logic is defined by the truth tables shown below. Note that unknown or null truth-value can reasonably be interpreted as “maybe.”
AND T ? F OR
T ? F NOT
T T ? F
T T T T T F
? ? ? F
? T ? ? ? ?
F F F F F
T ? F F T
- Consider the question of whether set are allowed to contain NULL values. Suppose, for example, that the collection C = {1, 2, 3, ?} is to be permitted as a legal set. There are two possibilities.
- The particular null value appears in C is of course unknown, but is known to be distinct from 1, 2, and 3.
- The NULL value in C is completely unknown (i.e., it may in fact stand for one of the values 1, 2, 3), in which case the cardinality of C in turn is unknown (it may be either 3 or 4).
On the whole I set all data type not to allow NULLs and provide default values when the column is not required. But what about dates and Boolean value columns where the data is not know at time of entry? There are times in designing a healthcare database schema where NULLs values may be allowed, these columns needs to be handled on an exception basis and rather than the norm. Usually I find a workaround, for example, if a Patient’s Gender is not always known at the time of entry, instead of using a Boolean, use a single character to hold a code (F=Female, M=Male, O=Other, U=Unknown, A=Ambiguous, N=Not applicable).
Another example is for dates. If a date is not known at time of entry I provide a default. Most dates record an event in time or a date range (Begin and End dates). For dates that represent a date in time I choose a low (e.g., 01/01/1900) or high (e.g., 11/27/4637) default date depending on the how the column would be used in a query. If developers or anyone else want to see a NULL value returned, I create a view over the table that contains a NULLIF[2] returning a NULL (e.g., NULLIF(MyDate, 19000101). Columns that represent a date range I default the From Date with 01/01/1900 and the Thru Date with 46371127.
When my queries are ran against range date with defaults I get the behavior I expect. When the range date columns contain NULLs the result are not what is expected or I have to write special SQL to handle it. In the example, by changing the table to not allow NULLs and replacing unknown dates with 19000101 and unknown Amounts with $0.00, both AVG and SUM(Amount)/Count(*) returns the same value.
All Character and Variable-Character data types I default with an empty string. When the column is selected, it is displayed with as an invisible value. By using NOT NULL data types you are assured that the results returned will be as you expect without having to handle NULLs differently. In addition, some vendor database’s performance improves because columns that allow NULLs must be evaluated differently for comparisons. If Nulls are used try and minimize it use.
Shahid N. Shah
Shahid Shah is an internationally recognized enterprise software guru that specializes in digital health with an emphasis on e-health, EHR/EMR, big data, iOT, data interoperability, med device connectivity, and bioinformatics.