Randomization—An Interview with Ken Traub—Part 2: Properties of Randomization

April 23, 2014 Ken Traub

This is the second of a five-part interview with Ken Traub, GS1 standards expert and independent consultant, on GS1 serial number randomization. The full series includes essays covering:

GS1 Serial Number Considerations
Properties of Randomization (this essay)
Threat Analysis
Algorithmic Approach
Other Approaches to Randomization

This week Ken introduces three properties of randomization. — Dirk.

_____________________________________

Dirk Rodgers: OK. I think we’re ready to talk about randomization. Let me start off with this question. The Drug Supply Chain Security Act (DSCSA) does not require serial number randomization on drugs, so, why would you think that a manufacturer might want to choose to randomize anyway?

Ken Traub: Well, I think that there are regulations elsewhere which require randomization…I’m aware of the EFPIA regulation in Europe, and so if a manufacturer is either selling in Europe, or they may have a different division that serves Europe but they want to have a consistent policy enterprise-wide, that may be one reason.

Another reason might be that if you assign serial numbers sequentially starting at “1”, then anybody who can see your product in the supply chain can estimate your manufacturing volumes. I can walk into my local pharmacy and look at products from XYZ Pharmaceutical. If I’m a pharmacist I can look at any of them, or if it’s an over-the-counter drug any random person can, and I’ll note that, OK, on January 1st, the serial number for this product was number 3, and on February 1st it was 427, and on March 1st it was such-and-such. With that, I’m not going to get extremely accurate information, but I’m certainly going to get some indication of volume. And particularly if I’m comparing different drugs from the same manufacturer that I expect are being distributed through the supply chain in a similar manner, then the differential analysis might actually be quite accurate.
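As a rough illustration of the estimate described above, the following Python sketch computes an implied production rate from two sequential serial numbers seen on different dates. The serial numbers 3 and 427 come from the example; the year 2014 and the function name `estimate_daily_volume` are assumptions made only so the code runs.

```python
from datetime import date

def estimate_daily_volume(first, second):
    """Estimate units per day from two (date, serial) sightings of sequential serials."""
    (d1, s1), (d2, s2) = first, second
    units = s2 - s1            # serial numbers issued between the two sightings
    days = (d2 - d1).days      # elapsed calendar days
    return units, days, units / days

# The year is hypothetical; the interview gives only the month and day.
units, days, per_day = estimate_daily_volume(
    (date(2014, 1, 1), 3),
    (date(2014, 2, 1), 427),
)
print(units, days, round(per_day, 1))  # 424 31 13.7
```

The figure is only as good as the assumption that units reach the shelf in roughly the order they were serialized.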

Now I’ve not spoken to any pharmaceutical manufacturer that has said that is a concern for them, and, in fact, when I have talked with some pharmaceutical companies and pressed them, “is that a concern?”, everybody has said, “nah, you know, that’s not really a big deal for us”. But that’s the only reason I can imagine somebody wanting to randomize.

DR: I agree that it’s not really a concern in the U.S. pharmaceutical industry. However, you didn’t mention the reason I was thinking of: if serial numbers are sequential, it may be much easier for a counterfeiter to guess valid ones than if they were randomized. If all they had were examples from a random set of serial numbers and they wanted to make up their own, chances are they would guess incorrectly.

KT: Well, before we get to that, let’s tease apart what is really meant by “random,” because I think people use that term kind of informally but there are actually several different properties that a sequence of serial numbers might have that all kind of fall under this heading of “random.”

One property is “sparseness.” By sparseness I mean that, out of the complete available set of serial numbers, you allocate them in such a way that you’re only going to use a fraction of them, scattered through the space. If you imagine putting all the possible serial numbers on a number line, then you could say, “well, now I’m only going to use a small fraction of them and they’re going to be distributed so that out of every 10,000 consecutive numbers I’m only going to use one of them.” Sparseness is what allows you to say that if somebody doesn’t know which ones I’ve selected and they try to pick one at random, they are unlikely to pick one that I would have chosen myself. That may therefore be an effective countermeasure if what I’m trying to do is prevent someone from guessing a serial number that I might use someday but haven’t already used.

Now, if you are asking for sparseness in your serial numbers, that has an effect on the total serial number capacity. Say my serial numbers are going to be 11 digits long. If I use all of the 11-digit serial numbers, that gives me a capacity of 100,000,000,000. But if I’ve said, “I only want one out of every 10,000 possible numbers to be used,” I’ve just cut that capacity by a factor of 10,000, so instead of 100 billion products I can mark, I can only mark 10 million before I run out.
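The capacity arithmetic can be checked with a few lines of Python. This is only a sketch; the function name `sparse_capacity` is made up for illustration, and it counts every 11-digit numeric string from 00000000000 to 99999999999.

```python
def sparse_capacity(digits: int, one_in: int) -> int:
    """Serial numbers available when only one of every `one_in` values is used."""
    total = 10 ** digits      # all numeric serials of this length, leading zeros included
    return total // one_in

print(sparse_capacity(11, 1))       # 100000000000
print(sparse_capacity(11, 10_000))  # 10000000
```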

Now, what’s not clear is this: once I’ve used 10 million, I’ve staked out a claim on one serial number in every segment of size 10,000 along that number line. Is it OK to go back and use more serial numbers from that same number line? That would mean, overall, my serial numbers won’t be one in 10,000, but over any reasonable stretch of time I will have only used one serial number from each block of 10,000.
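To see how the overall density changes if the number line is reused, here is a small Python sketch under the same assumptions (11-digit serials, blocks of 10,000). The name `overall_density` and the idea of counting whole "passes" over the number line are illustrative, not part of any GS1 rule.

```python
from fractions import Fraction

TOTAL = 10 ** 11                 # all 11-digit serial numbers
BLOCK_SIZE = 10_000
BLOCKS = TOTAL // BLOCK_SIZE     # 10,000,000 blocks

def overall_density(passes: int) -> Fraction:
    """Fraction of the number line used after taking one serial per block on each pass."""
    return Fraction(passes * BLOCKS, TOTAL)

print(overall_density(1))  # 1/10000
print(overall_density(2))  # 1/5000
```

Within any single pass the serials are still one in 10,000; only the cumulative figure changes.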

So anyway, that was kind of a digression, but we do have this tradeoff between capacity and sparseness.

Now “sparseness” is a different property than “randomness.” Randomness refers to choosing the serial number in a way that is unpredictable. So imagine I’m choosing one out of every 10,000 numbers on my number line. Well, I could do that according to a completely predictable pattern. I could say my first number is going to be 1, my next one is going to be 10,001, my next one is going to be 20,001, the next one is going to be 30,001, and you can see that I’m only picking one in every 10,000, so if somebody else chooses a number at random it is unlikely they are going to pick one of the numbers I would pick because the last four digits are probably not “0001”. On the other hand, if somebody took the trouble to actually look at a bunch of my serial numbers, they could probably quickly discern the pattern.
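A short Python sketch of this predictable pattern shows why it is sparse but not random: the gap between observed serials is constant, so the next one is easy to predict. The function name and the choice of five samples are assumptions for illustration.

```python
def predictable_sparse_serials(count: int, step: int = 10_000, start: int = 1) -> list[int]:
    """One serial per block of `step`, always at the same position in the block."""
    return [start + i * step for i in range(count)]

observed = predictable_sparse_serials(5)
print(observed)  # [1, 10001, 20001, 30001, 40001]

# An observer only needs the differences between consecutive serials.
gaps = {b - a for a, b in zip(observed, observed[1:])}
if len(gaps) == 1:
    next_guess = observed[-1] + gaps.pop()
    print(next_guess)  # 50001
```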

So randomization refers to doing it in a way that is unpredictable. Random might be: out of each contiguous block of 10,000 numbers, I’m going to either flip a coin or use a random number generator to pick one of those 10,000. Now I’ve got sparseness because I’m using one out of every 10,000, but there is also some degree of unpredictability, and so it’s very difficult, even if somebody looks at a lot of my serial numbers, to guess a serial number and have it be one that I’m actually going to assign.
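One way to sketch this "one random pick per block" scheme in Python is shown below. It assumes the standard-library `secrets` module as the random number generator; the function name is hypothetical, and a real system would also need to record which serials have been issued.

```python
import secrets

BLOCK_SIZE = 10_000

def sparse_random_serials(count: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Pick one unpredictable serial number from each consecutive block."""
    serials = []
    for block in range(count):
        offset = secrets.randbelow(block_size)      # uniform in 0 .. block_size - 1
        serials.append(block * block_size + offset)
    return serials

sample = sparse_random_serials(5)
print(sample)  # five values, one from each of the first five blocks
```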

You can also have randomness without sparseness. Take your 3-digit serial number example, where you don’t have many serial numbers and I actually expect to use all or most of them. I can shuffle them like a deck of cards and assign them in random order. Now, seeing a sequence of them, it is difficult to predict which one comes next, but I’m not getting sparseness because I’m still going to use all those numbers. So if somebody picks one at random, chances are it is going to be one that is used someday.

Now, if the total number of serial numbers you have is far in excess of the number you actually expect to use, and you shuffle them first and then deal them out from the top of the deck, then you’re back to sparseness again. The sparseness comes about not because you’re choosing one out of every 10,000, but because you’re unlikely to use the bottom part of the deck, and it’s unpredictable which numbers are in the top and which are in the bottom. So it will effectively be sparse and random at the same time.
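The "shuffle, then deal from the top" idea can be sketched in Python without materializing the whole deck. Sampling without replacement from the full range gives the same result as shuffling every serial and taking the first few. The space size of one million and the 100 serials dealt are arbitrary illustrative values, and `deal_serials` is a hypothetical name.

```python
import random

def deal_serials(space_size: int, needed: int) -> list[int]:
    """Deal `needed` distinct serials in random order from 0 .. space_size - 1."""
    rng = random.SystemRandom()   # draws from the operating system's entropy source
    # Sampling without replacement is equivalent to shuffling the full deck
    # and taking the top `needed` cards.
    return rng.sample(range(space_size), needed)

dealt = deal_serials(space_size=1_000_000, needed=100)
print(len(dealt), len(set(dealt)))  # 100 100
```

Because only 100 of a million values are used, a blind guess has a 1 in 10,000 chance of hitting an issued serial, and the order of assignment is not ascending.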

Another important contrast is this. If we go back to my sales volume prediction example, suppose I were just taking consecutive blocks of 10,000 and picking one at random out of each block. Then it may be difficult for somebody to guess a number that I’m actually going to assign. But on the other hand it doesn’t prevent an observer at all from estimating the sales volume, because all the observer has to do is disregard the lower four digits, and what remains is a set of numbers going up in sequence, because I’m picking one out of every 10,000. Seeing a progression of serial numbers, it may take the observer a while to guess that’s what’s going on, but someone with some idea of the order of magnitude of the sales volume might be able to figure that pattern out. After that, it’s very easy to predict the sales volume because the numbers are ascending in an expected way. So there you’ve got a kind of randomization but without achieving all of the benefits that may have led you to want randomness.
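The leak described here is easy to demonstrate. In the Python sketch below, serials are assigned one per block of 10,000 with a random offset, yet an observer recovers the unit count exactly by dropping the lower four digits. The function names and the unit indexes 3 and 427 are illustrative assumptions.

```python
import secrets

BLOCK_SIZE = 10_000

def assign_serial(unit_index: int) -> int:
    """Serial for the Nth unit: a random offset inside the Nth block of 10,000."""
    return unit_index * BLOCK_SIZE + secrets.randbelow(BLOCK_SIZE)

def units_between(serial_a: int, serial_b: int) -> int:
    """Observer's estimate: discard the lower four digits and subtract."""
    return serial_b // BLOCK_SIZE - serial_a // BLOCK_SIZE

early, late = assign_serial(3), assign_serial(427)
print(units_between(early, late))  # 424
```

The random offsets make individual serials hard to guess, but they do nothing to hide the count.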

And that gets us to the third property, and that’s whether the numbers are getting assigned monotonically or non-monotonically. “Monotonically” means in an ascending way, where each successive serial number is always greater than the one that preceded it. “Non-monotonically” means the sequence jumps around back and forth. So you can see when people say “random,” you really must tease it apart and ask, “well, what do you mean by ‘random’?” And in order to decide what policy you really want to have, it’s important to look at it from the perspective of threat analysis.
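A monotonicity check is simple to express in Python. The sketch below tests whether each serial is strictly greater than the one before it; the three sample sequences are made-up values chosen to represent a fixed pattern, one random pick per block, and a shuffled assignment.

```python
def is_monotonic(serials: list[int]) -> bool:
    """True if every serial is strictly greater than the one assigned before it."""
    return all(a < b for a, b in zip(serials, serials[1:]))

print(is_monotonic([1, 10001, 20001]))     # True: fixed pattern
print(is_monotonic([4021, 17350, 29998]))  # True: one random pick per block
print(is_monotonic([512, 7, 983]))         # False: shuffled order
```

The second case is the one to watch: the serials are sparse and individually unpredictable, yet still ascending.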

For the next installment of this interview, see Threat Analysis.

3 thoughts on “Randomization—An Interview with Ken Traub—Part 2: Properties of Randomization”

Riz says:

April 23, 2014 at 4:38 pm

Hi Dirk and Ken
Thanks for the explanation on Random numbers. Boy! must admit was heavy stuff.
Can you elaborate when you mentioned this on sparseness “That would mean, overall, my serial numbers won’t be one in 10,000, but over any reasonable stretch of time I will have only used one serial number from each block of 10,000.”
Because I thought the block will reduce over a stretch of time.

Thanks
Riz

Ken Traub says:

April 23, 2014 at 8:06 pm

Dear Riz,
Thanks for your question. Let me explain.
Imagine all the numbers from 0 to 99,999,999,999 stretched out on a line – 100 billion numbers total. Now divide it into blocks of 10,000. Starting with the first block, pick one of the 10,000 numbers and use it on a product. Then go to the next block. When you have done this for 10 million products, you reach the end of your number line. Let’s say that takes you 5 years. But you’re still making products, so you go back to the beginning and choose a second number out of that first block of 10,000 (taking care to be sure it’s different than the first number you picked from that block), then you continue to the next block in the same manner. It takes you another 5 years to reach the end of the number line for a second time, assuming your average volume stays constant.

Now, looking back, you don’t have 1 in 10,000 sparseness overall, because you’ve taken two numbers from each block of 10,000. So your sparseness is 2 in 10,000, or 1 in 5,000. But if you look at any 5-year period, within that 5-year period the numbers you use are 1 in 10,000 sparse. That is what I meant.

Make sense?

Riz says:

April 24, 2014 at 4:05 am

Dear Ken
Thanks for the lucid explanation. That helps in understanding. However I must say that sparseness will decrease over a period of time but that would be a fair trade-off given the time it takes to consume the 10m S/Ns (or whatever I assume on my number line!)

Thanks again
Riz