The Deduplication Numbers Game
By Kevin Davidson,
Netsmart Technologies
What are the numbers?
So Netsmart scored 100 percent on the CDC Immunization Registry deduplication test dataset. Does this mean we're perfect? I wish it were so, but deduplication test scores only provide one measure of the quality of a deduplication system.
Given any arbitrary pair of records in a patient database, a deduplication system might do one of four things: declare the pair a duplicate (both records represent the same patient), declare the pair not a duplicate, declare the pair a possible duplicate requiring manual evaluation, or say nothing at all. A mistake can be made in each of these four cases with different consequences for patient care. Let's look at each.
Finding and declaring a duplicate record accurately is a good thing. Netsmart found all the duplicates in the CDC test database. That means that we recognized all the typographical variations in the test data and successfully modeled the decision process of the test designers. (More on how we did that later.)
The CDC test data is not characteristic of real data because it is consistent. If a certain set of conditions is present, then the pair is always a duplicate. In real data the same set of conditions may have different results. There is no answer sheet, and there are no deterministic rules that work 100 percent of the time. If a patient enters the witness protection program, their new demographic information is not going to match the old demographic information, no matter how good the deduplication system is. Witness protection is an extreme example, but in practice there will be variations in data, for whatever reason, that put records beyond what any automated system or human evaluator can match. Whenever someone claims to have found 99 percent of the duplicates in a database, understand that they are claiming something other than the "real" percentage of duplicates found, because there is no "answer sheet" for real data.
Declaring a duplicate by mistake is a bad thing. In an immunization registry this "false positive" means that a child may be under-immunized (having immunizations on file that belong to someone else) or over-immunized (if their record was deleted as part of merging the "duplicate").
Correctly declaring that a pair is not a duplicate is also a good thing, but not all that important. "False negatives", on the other hand, lead to over-immunization (since the vaccination history is split over two records). Again Netsmart had a perfect score in the CDC test, correctly modeling the test designers' decision process. We didn't declare any duplicates falsely. And again, whenever someone claims a 0.1 percent "false positive" rate in a real-world database, this is an estimate based on some criterion other than the "real" rate, unless every record was individually researched and double-checked.
There are some cases, particularly ones involving suspected twins, where human intervention is needed. It may be necessary to consult a paper chart to resolve the question, since good computer deduplication algorithms already work on pure demographic data as well as or better than human reviewers most of the time. An important measure of a deduplication system is how many records are referred for manual review. If the system refers too many, there may not be enough resources to perform the review. A good deduplication system minimizes records for review. Netsmart was able to resolve all cases in the CDC database and not refer any for manual review (although this would not be the case for real data).
There is a fourth case that causes the numbers to go awry. These are the pairs that the deduplication system never looks at. In a moderately large metropolitan immunization registry, there might be one million patient records. That gives 499,999,500,000 possible pairs of records. With current computing resources, it's just not possible to score that many pairs of records. Deduplication systems use various techniques to limit the number of pairs examined. The fewer pairs, the faster the process. The danger, which is very real, is that some pairs are never examined. The number of duplicate records in these unexamined pairs adds to the true "false negative" number.
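The pair count quoted above is n(n-1)/2 for n = 1,000,000. The following is a minimal Python sketch of one common pair-limiting technique, usually called blocking, in which only records that share a key are compared. It is an illustration, not a description of Netsmart's system; the toy records, the field names and the two-letter surname key are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Unordered pairs among one million records: n * (n - 1) / 2.
total_pairs = comb(1_000_000, 2)
print(total_pairs)  # 499999500000


def candidate_pairs(records, block_key):
    """Return only the pairs of record ids whose records share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[block_key(rec)].append(rec_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


def first_two_letters_of_surname(rec):
    """Hypothetical blocking key: the first two letters of the last name."""
    return rec["last"][:2]


# Hypothetical toy records; the field names are illustrative only.
records = {
    1: {"last": "GARCIA", "dob": "1996-12-09"},
    2: {"last": "GARCIA", "dob": "1996-12-19"},
    3: {"last": "GRACIA", "dob": "1996-12-09"},
    4: {"last": "SMITH", "dob": "2001-03-02"},
}

examined = candidate_pairs(records, first_two_letters_of_surname)
print(examined)                         # [(1, 2)]
print(comb(len(records), 2))            # 6 pairs are possible in total
```

With this key, only one of the six possible pairs is scored. Records 1 and 3 (GARCIA and GRACIA, same date of birth) land in different blocks and are never compared, which is exactly the kind of unexamined pair that adds to the true "false negative" number.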
How Netsmart got 100 Percent
Let me say up front that our standard deduplication system doesn't score 100 percent on the CDC test. It doesn't score 100 percent because we don't agree with the answer sheet. Here's an example:
| Last Name | First Name | Middle Name | DOB | Mother Maiden | Mom First | Mom Last | Mom Middle |
|---|---|---|---|---|---|---|---|
| PAK | SU POK | | 12/9/1996 | HYO | PAK | KYOTA | ONO |
| PAK | SU-POK | | 12/9/1996 | HYO | PAK | KYOTA | ONO |
| PAK | SUE | POK | 12/19/1996 | HYE | PAK | KYO | |
| PAK-POK | SUE | | 12/19/1996 | HYE | PAIK | KYO | |
The answer sheet said the first two were duplicates, and the last two were duplicates. Our standard system says that all four are duplicates of each other. Interpreting a slash (/) as the number one (1) is a very common typographical error, explaining the difference in dates of birth. All of the other name differences are easily within typographic or sound-alike errors. In the CDC answer sheet, the mother's maiden name is law; it is never wrong, it never has a typographical error and it is rarely missing. In real public health data, the mother's maiden name is usually missing, often wrong and as often as not contains a note about something that there's no other data field for.
The way we were able to tune our algorithm to score 100 percent points out the significant difference between the Mass Deduplication system and others on the market. First a look at conventional deduplication systems.
"Expert" Systems
These systems are rule-based: "If the first and last names match and the date of birth and Social Security Number match, then it's a duplicate." Rules sound like a useful approach, but there are exceptions to every rule, and the number of rules grows endlessly to accommodate exceptions. A rule-based system can make a solid decision in declaring a duplicate; it's just that there are many cases it won't be able to decide. Eventually one realizes the inadequacy of trying to maintain all the rules, and then tries a statistically based approach.
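As a rough illustration of why such rules leave many cases undecided, here is a minimal Python sketch of the single rule quoted above. The function name, the field names and the Social Security numbers are hypothetical; only the name and date-of-birth values are borrowed from the example table.

```python
def rule_based_decision(a, b):
    """Apply one rule: first name, last name, DOB and SSN all present and equal."""
    fields = ("first", "last", "dob", "ssn")
    if all(a.get(f) and a.get(f) == b.get(f) for f in fields):
        return "duplicate"
    # The rule is silent about everything else; it cannot say "not a duplicate".
    return "undecided"


# Hypothetical records; the SSN values are made up for the example.
rec_a = {"first": "SU POK", "last": "PAK", "dob": "12/9/1996", "ssn": "000-00-0001"}
rec_b = dict(rec_a)                  # exact copy
rec_c = dict(rec_a, first="SU-POK")  # one hyphen instead of a space

print(rule_based_decision(rec_a, rec_b))  # duplicate
print(rule_based_decision(rec_a, rec_c))  # undecided
```

A single punctuation difference is enough to defeat the rule, so a new rule (or an exception) is needed for it, and then for the next variation, and so on.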
Statistical Approaches
Statistically based deduplication systems all identify a set of variables and then assign significance to them.
What's the probability that two different patients have the same last name? The same date of birth, or a similar Social Security number? One can take a set of independent variables and multiply the probabilities to arrive at an overall score. Distributions of values in the database can be used to assign probabilities based on how common a value is; for example, two different patients named "MXYZPTLK" are unlikely to occur by chance. This approach was described by Fellegi and Sunter in "A Theory for Record Linkage" (JASA, 1969) and advanced in many other articles. Statistical systems work well when the information is very limited (for example just a name and date of birth).
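The idea of multiplying value frequencies can be sketched in a few lines of Python. This is a deliberately simplified illustration under the independence assumption, not the full Fellegi and Sunter model; the function names and the tiny ten-record "database" are assumptions made for the example.

```python
from collections import Counter


def value_frequencies(values):
    """Relative frequency of each distinct value in a column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def chance_agreement(shared_values, freq_by_field):
    """Probability that an unrelated record shows these same values by chance,
    assuming the fields are independent (the shortcut discussed in the text)."""
    p = 1.0
    for field, value in shared_values.items():
        p *= freq_by_field[field][value]
    return p


# Hypothetical ten-record database.
last_names = ["SMITH"] * 6 + ["GARCIA"] * 3 + ["MXYZPTLK"]
birth_years = ["1996"] * 5 + ["1997"] * 5

freq_by_field = {
    "last": value_frequencies(last_names),
    "birth_year": value_frequencies(birth_years),
}

common = chance_agreement({"last": "SMITH", "birth_year": "1996"}, freq_by_field)
rare = chance_agreement({"last": "MXYZPTLK", "birth_year": "1996"}, freq_by_field)
print(common > rare)  # True
```

The smaller the chance-agreement figure, the stronger the evidence that two records sharing those values describe the same patient, which is why agreement on a rare name counts for more than agreement on a common one.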
Fellegi and Sunter provided a sound statistical underpinning for deduplication work, but the application of their work is difficult. Probabilistic deduplication systems often take a shortcut: they assume variables are independent when they are not.
For example, let's take a very simple question: what's the probability that a randomly selected person is a male named John? In mathematical notation we write "P(A)" for "the probability that A is true" and "P(A|B)" to mean "the probability that A is true given that B is true". In this case, if A denotes "the person's name is John" and B "the person is male", then the probability that the random person is a male named John is computed:
P(AB) = P(A) * P(B|A)
If A and B are independent variables, the equation can be simplified to:
P(AB) = P(A) * P(B)
Let's see how that plays out in our example:
In the 1990 Census, persons named John made up about 3.318 percent of the US population, so P(A) = .03318. Let's assume that half the population is male. The simplified answer comes out to .03318 * .5 = .01659. But in reality virtually everyone named John is male, so P(B|A) is probably on the order of .99999, so the correct answer is around .03318. In practice, if you have gender-based name statistics you could accurately compute the conditional probability for John, and for Mary and for Pat. What you probably won't do is figure out the probability that someone is named John given that their surname is Garcia or Smith.
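The arithmetic above can be checked with a few lines of Python. The three input values are the ones quoted or assumed in the text; the variable names are only for illustration.

```python
p_john = 0.03318             # P(A): share of the population named John, as quoted
p_male = 0.5                 # P(B): assumed share of the population that is male
p_male_given_john = 0.99999  # P(B|A): rough order of magnitude suggested in the text

independent = p_john * p_male             # shortcut: P(A) * P(B)
conditional = p_john * p_male_given_john  # correct:  P(A) * P(B|A)

print(f"{independent:.5f}")  # 0.01659
print(f"{conditional:.5f}")  # 0.03318
```

The independence shortcut understates the probability by roughly a factor of two in this case.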
Variables are not truly independent. Asian surnames may be concentrated in "Chinatown" and that means that surnames can be correlated with street addresses and phone numbers. Social Security Numbers are correlated to dates of birth and address.
It is also very difficult to answer probability questions when you introduce data entry errors. What is the probability that a person named "Gracia" is the same person as one named "Garcia"?
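One common way to at least quantify how far apart two spellings are is an edit distance that counts a swap of adjacent letters as a single error (the optimal string alignment variant of the Damerau-Levenshtein distance). The Python sketch below is offered only as an illustration: it measures typographical closeness and does not answer the probability question, and nothing in the text says Netsmart uses it.

```python
def osa_distance(s, t):
    """Edit distance where insertion, deletion, substitution and
    transposition of two adjacent characters each cost one step."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(osa_distance("GRACIA", "GARCIA"))  # 1 (one transposition)
print(osa_distance("PAK", "PAIK"))       # 1 (one inserted letter)
print(osa_distance("HYO", "HYE"))        # 1 (one substituted letter)
```

A distance of 1 says the two names are a single keystroke slip apart. It does not say how likely it is that they belong to the same person, since "Gracia" and "Garcia" are both real surnames.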
A different statistical approach (I'll call it the "factor" approach) assigns weights to statements about the pairs. For example:
- The first name is similar
- The date of birth is the same
- The child is under two years of age and was born on the west side of town
The statements are constructed by the software designer. Rather than assign probabilities to statements, a weight is assigned. Weights are added when the statement is true. Such systems use a "training database" of known results to select the weights so as to maximize the quality of result (many correct matches, few wrong matches, few for manual review). Mathematical approaches, such as linear programming, can be used to assign weights for best results.
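A minimal Python sketch of the factor approach follows. The statements, the weights and the two thresholds are all hand-picked assumptions for illustration; a real system would select the weights from a training database as described above. The sample records are taken from the name and date-of-birth columns of the example table.

```python
def letters(name):
    """Keep only alphabetic characters so 'SU POK' and 'SU-POK' compare equal."""
    return "".join(ch for ch in name if ch.isalpha())


def first_name_similar(a, b):
    return letters(a["first"])[:3] == letters(b["first"])[:3]


def same_dob(a, b):
    return a["dob"] == b["dob"]


def same_last_name(a, b):
    return a["last"] == b["last"]


# Hypothetical (statement, weight) pairs; the weights are not trained values.
STATEMENTS = [
    (first_name_similar, 2.0),
    (same_dob, 4.0),
    (same_last_name, 3.0),
]


def factor_score(a, b):
    """Add the weight of every statement that is true for the pair."""
    return sum(weight for statement, weight in STATEMENTS if statement(a, b))


def classify(score, match_at=7.0, review_at=4.0):
    """Map a score to match, manual review, or non-match (assumed thresholds)."""
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"
    return "non-match"


row1 = {"last": "PAK", "first": "SU POK", "dob": "12/9/1996"}
row2 = {"last": "PAK", "first": "SU-POK", "dob": "12/9/1996"}
row3 = {"last": "PAK", "first": "SUE", "dob": "12/19/1996"}

print(factor_score(row1, row2), classify(factor_score(row1, row2)))  # 9.0 match
print(factor_score(row1, row3), classify(factor_score(row1, row3)))  # 3.0 non-match
```

Each statement contributes its weight on its own, so the score for rows 1 and 3 stays low even though the date difference and the name difference are each plausible typing errors.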
There are issues with the factor approach. First, a considerable amount of experience, creativity and skill is required of the system designer when selecting the "statements" to consider. The second issue is that the technique requires significant resources to "train it" in order to assign weights--and the weights are only statistically valid insofar as the training database is representative of the real database. But I think the most important limitation of the factor approach is that there is no interaction between the statements--each one is evaluated independently.
The Mass Deduplication Approach
Rather than ignoring the interdependence of variables to simplify our probability calculations, Netsmart exploits the dependencies to tease out more information from the data than could ever be found by treating data values independently. Variability in data is the reflection of real things that happen. By studying data records as a whole, we can assign probabilities that are informed by patterns of behavior in patients, their caregivers, automated systems and their operators.
In the Mass Deduplication system, candidate pairs of records are assigned two scores: a probability score based on the likelihood that the pair is a duplicate and a confidence level based on how much similar data we have studied. The user can choose to link or match records based on their threshold settings for these two values.
Since individual data elements are not considered separately, we don't have to make any assumptions about their probability distribution and independence. Rather than stumbling over problems in the data, we adapt to and thrive in an environment of typographical errors and inconsistency.
Summary
A good deduplication system has these features:
- Effective identification of duplicate candidates from all possible record pairs (the CDC calls this "sensitivity")
- Good performance for real-time deduplication
- Sophisticated typographical error recognition system (including misspelling, transposition, sound errors, swapped fields, missed keys, nicknames, address parsing and date proximity)
- Recognition of twins
- Record scoring based on research into actual data, recognizing patterns in data and not just single factors
- Flexibility to adapt to differing database characteristics (such as disease control databases where patients intentionally try to hide their identity)
- High accuracy of results (the CDC calls this "selectivity")
- Minimum number of records referred for manual review
- Efficient management of manual review process
- Ability to inject human decision policy into the scoring process
Contact us today for additional information about Deduplication.