Randomization—An Interview with Ken Traub—Part 4: The Algorithmic Approach

April 28, 2014 Ken Traub 3 Comments

This is the fourth installment of a five part interview with Ken Traub, GS1 standards expert and independent consultant, on GS1 serial number randomization. The full series includes essays covering:

In this installment, Ken explains the algorithmic approach to serial number randomization. – Dirk.

_____________________________________________

Dirk Rodgers: …I love the fact that you’ve sort of debunked the value of intense randomization (see “Part 3: Threat Analysis“). The way I see it, there are some easy, practical ways that have less complexity or difficulties than some of the more rigorous approaches to randomization where you’re doing fully, or true randomizing, and those are the techniques I was hoping we would be able to talk about.

Ken Traub: So now I’ll play a different devil’s advocate and argue from the other direction.

DR: OK, good.

KT: Somebody might argue, well that’s all well and good—a very nice approach, and all that—but we don’t know what all the threats are and who the threat agents are. We don’t know what’s going to arise in the future. So, yeah, we take the point there may be simpler forms of randomization that are sufficient for the actual threat, but if we don’t know what the threats are, then maybe we should consider the most rigorous possible [approach to] randomization. And if that is not difficult to do, or is not more difficult to do, then we don’t have to think about doing a threat analysis, and just say, “well to the extent that randomization is going to help at all, we’ve done the most that is possible and so we’re prepared for any of that.” You know, it’s kind of a “Pascal’s Wager” argument. That is a way of reasoning and I think really doing full randomization does not have to be onerous, however existing hardware and software systems are not really prepared for it out-of-the-box, and that’s really the only thing that makes it difficult. If hardware and software products for serialization were designed properly, then I think full randomization might be the default for everybody.

DR: Yeah. Well, in fact, I’m aware of a couple of approaches that are at least…maybe not in use yet, but that are being considered, and one an algorithmic approach to randomization.

KT: That’s BayCoder. BayCoder is the product and Bayer Technology Services is the vendor.

DR: ….and I think Kezzler may do the same thing.

The algorithmic approach at least has some benefits. What you’re really trying to do is make it so that it’s not difficult to confirm that the randomized serial number you are about to put on a product has not already been used two years earlier. That’s kind of complex for some total randomizing approaches, but I think this algorithmic approach includes the ability to ensure that every number that you get has not been used at any time in the past.

KT: I can take you through all these variations because I’m familiar with all of them.

DR: Yeah. Good. Do that.

KT: OK. So the question is, if you wanted to randomize…you remember…you asked about practicality and I said there were two aspects to it, one is “what’s worth doing?”, and the other is “what technical approach is easier to implement or harder to implement?”. We’ve addressed the first question so now let’s go back and address the second question.

I think you’ve got to put it into the proper context, because if you had every instance of a particular product that you were going to manufacture sitting in front of you…let’s say it was a limited run of a drug that you were only going to make once, and you were only going to make 100 units of it for the whole world, and you were never going to make it again, and you made all 100 at once and you’re going to put a serial number on all of them before they go out the door. Then you have lots of ways of choosing serial numbers available to you, it’s all pretty easy. I mean, because you’re doing it all at once, you don’t have any kind of long term bookkeeping problem, but that’s not the context in which anything is serialized, or certainly not most things.

Instead, the issue is, I’m going to be assigning numbers over a long period of time, over the life of the product. That could be over a period of years. And moreover, that product could be manufactured in more than one place. That could be because today I have more than one manufacturing line that’s doing the labeling and I need to assign serial numbers simultaneously on those lines; they may be in different buildings; or in different countries; or I may want to shift manufacturing from one place to another. In general, companies do serialization using some kind of centralized bookkeeping mechanism.

You would solve that by having a piece of software at the company headquarters whose job it is to choose the next serial number for each product. And each time you would put a label on a product you would go back to that oracle and ask, “give me the next serial number for this product”.

Well, of course, doing that one at a time is going to be inefficient from a variety of perspectives, so usually the way this works is there is a central system that is issuing blocks of serial numbers, and a manufacturing line will say “I’ve got a work order for 10,000 units of this product, give me 10,000 serial numbers I can use.” And it does that at the start of the production order and then it puts all those serial numbers on products over the course of the day to fill that order without having to go back to the central system. And before it goes back to the well again, some other manufacturing line in a different building or in a different country may have asked for a block of serial numbers for that same product.

Now if you’re assigning serial numbers sequentially, that’s pretty simple to do. What that central system does is keep a database with a row for each product that you manufacture. In the simplest implementation it would keep a number associated with each product that says “what’s the next available serial number”, and then if I go in and say “I need 1,000 serial numbers for Product X”, it’s going to look up in his database for Product X and see that the next serial numbers is number 7000, for example. And so it’s going to say “use serial number 7000 through 7999”, and then it’s going to update the database to say that the next available serial number for product X is 8000. So basically it’s going to take the number that was in there before and add the quantity I’ve asked for and store the sum back. That’s very simple and you can see how, if multiple lines are making such requests, as long as the system handles those requests one at a time, then each line is going to get a different block of serial numbers.

And there are many systems that do that today, and probably in the pharma industry the most widely used is the SAP system. That feature is available in both their AII and OER products, and they have an interface for that responds to that request by saying, “Here is your block of serial numbers, here is the starting number and the ending number”. So in my example, it would say “use serial number 7000 through 7999”. It would give you those two numbers 7000 and 7999 and I would know that all the numbers in between I get to use.

There are lots of Level 2, Level 3 manufacturing systems from companies like Systech and Antares which are capable of using that interface.

Now, when you start saying, “I want to do randomization”, you run into a number of challenges. Some relate to the central system. How is the central system going to keep track of what comes next, if what’s next is a random number? And then there is another issue that says, if I’m delivering serial numbers to the factory floor, I can’t just say “here’s the starting number and the ending number” if the numbers in between are supposed to be random. I’ve either got to let the system on the floor choose the random numbers, in which case it raises some questions about what the central system keeps track of, or, I have to supply the system on the floor with a list of numbers rather than a starting and an ending number. In the latter case, the manufacturing equipment would just put on whatever serial numbers I tell it to. With or without randomization, if you want to use alphanumeric serial numbers having a starting and ending number may not work because the sequential ordering is not as clear, so again an interface that provides a list of individual numbers may be needed. Those are a number of technical challenges that arise.

Let’s get to some of those technical issues. So how would I manage a system where I have central allocation of blocks of serial numbers to manufacturing line if what I have to do is do things randomly? Well, it depends on the method of randomization that you’re going to use. Let’s go from simple to complicated. One possibility is to say that, well, up in the central system, I need to assign things randomly, so I’m going to use some kind of random number generator. Typically the way a random number generator works is, there’s a thing called a seed, which is keeping track of the state of the algorithm being used to generate the numbers. I can store that seed, and if I turn off my computer and turn it on again, if I restore that seed into the random number generator algorithm it will pick up where it left off. If you use an ordinary random number generator algorithm then to keep track of where you are, you only have to store the seed.

Now, two things follow from that. The numbers I’m going to generate won’t be sequential in any way so if I’m going to deliver that to the edge system, I need to be able to provide those numbers in a list rather than just giving the starting and the ending numbers. If you ask for a thousand numbers, I’m going to generate a thousand random numbers. I’m going to give those serial numbers to you in a list and you’ll just use those numbers, so that requires an interface change.

It actually leads to a better architecture to do that because then if you want to change your policy, you do that centrally and the system at the edge is just putting on the products whatever numbers you give it. It doesn’t know whether they’re random of sequential or how the randomness is assigned. You know, the edge systems just do what they’re told. I think that’s a better architecture because it puts all the control in one place rather than scattering it among the configuration of a bunch of manufacturing machines.

Now, but let me go back to what the central system has to do in this example. Just using a random number generator doesn’t quite work because a traditional kind of number generator is not concerned about avoiding duplicates. So if you take an off-the-shelf random number generator and ask it to generate random 11-digit numbers, there is a small but finite chance that it will produce a duplicate. You may say, “well with 11 digits the possibility of one number duplicating the previous number is 1 in 100 billion”, and that’s true, but if you ask the question differently and say if I assign a million numbers, what’s the chance that any two of them are duplicates, the probability of that is actually much higher than you would expect. This is the so-called “birthday paradox”. The way it’s often expressed is, if you’ve got a room full of people, and none of them are twins, what’s the probability that any two of them have the same birthday? Most people are surprised to learn that you only need 23 people in the room for the probability to be 50-50.

DR: Wow.

KT: And the mathematics behind that are kind of interesting. I won’t delve into it here but you can look up “birthday paradox” in Wikipedia and it explains it. It also gives some formulas…

DR: I’ll provide a link to it.

KT: So, you can’t just use a random number generator.

Another technique that you can use, though, is to do a permutation, and this is more like shuffling a deck of cards. What you can imagine doing is when I initialize my central system what I can do is take all the possible serial numbers and I’m going to shuffle them like a pack a cards. You do that with a random number generator…there’s a well-known shuffling algorithm that’s used in all kinds of computer card games.

You basically start with the first card and pick a random number between 1 and the number of cards you have left, and that tells you which one to swap it with, and then you go through the rest of the deck and do that, and that has all possible combinations. Unfortunately, that’s completely impractical to do. You can do that calculation with 52 cards, where you’re choosing 52 random numbers to do that calculation, but if you have 100 billion numbers that you want to shuffle, you’d have to pick 100 billion random numbers and that would take a very long time and then you’d have to store that list so you could just keep track of where you are on the list. But storing a 100 billion numbers is not practical, particularly if you’re going to do it for each product. So that doesn’t really work.

But there is another technique that is kind of mathematical, but the way it works is, imagine you had a mathematical calculation you could do where it takes a number as an input and produces another number as output. And this mathematical function has the property that if I put X into that function I get back X-prime, and if I put Y into that function I get Y-prime. And imagine this function has the property that for any X and Y that are different, X-prime and Y-prime are also going to be different. In other words, if I put two different things in I always get different things out. If I put the same thing in, of course, I always get the same thing out. And furthermore, if the inputs are drawn from a range of numbers and the outputs are also within that same range, then essentially, if I put in numbers from 1 to N in order into that function, I’m going to get back a bunch of numbers in a different order, no two of them will be the same, because that was the property that I posited for this function, and they are within the same range. If I use all of them, what that function actually gives me is a shuffling.

So let’s just pretend that such a function exists. I haven’t shown to you that it is actually possible to construct such a function, but if I had a function like that, then what I would do is, in my database I would keep track of my next available number, just like I did before, and when a request comes in for 1,000 numbers, I add 1,000 to the number in my database and store the new number. But instead of handing that new number directly to my edge system, what I do is, I take that number and pass it through my magic function, and that tells me the actual serial number we’re going to use. And so the guy comes along and says I need 1,000 numbers, and my database says start with number 7,000, what I do then is list out the numbers 7,000 through 7,999, pass each of those numbers through my magic function. I’ll get 1,000 different numbers which won’t be in the range 7,000 through 7,999—they’ll be all over the place—but they’ll be within the overall range that I specified, and I’ll take those 1,000 numbers and give them back to my edge system, and in my database I’ll update the database to say 8,000 is the next number that I’m keeping track of centrally.

You can see that if you do that, my bookkeeping is the same as if I’m doing things sequentially, but the numbers I actually put on the product are random.

DR: Right. Now this is what I referred to earlier as the algorithmic approach.

KT: Yeah, that’s right, and this is what I believe BayCoder and perhaps Kezzler are doing. They’re trying to be secretive about what they’re doing but it seems pretty obvious…

Anyhow, this is a technique that can be used. You can see that because the bookkeeping is the same as for sequential numbers, it’s pretty attractive. The interface to the edge systems have to be changed to provide a list of numbers rather than a range, but that’s actually a positive as well because then it allows you to change policies centrally and not have to reconfigure your edge systems. I think, one of the reasons it’s not more widely adopted is that it’s a little esoteric and people don’t fully understand it, they don’t know that the algorithms for this are actually widely available and published.

As I’ve said, let’s imagine such a function exists, but I haven’t shown how to create such a function. Well, it’s not so easy to make one, however, it turns out that, if you think about what this function is doing: you’re giving it an input and it’s translating it into an output which is different, with the property that it is hard to predict what the input was from what the output is. That’s actually what an encryption cipher does. It takes an input and turns it into an output where every input gives a different output, in a way that doesn’t let you predict the input just looking at the output. And it turns out that there are many cryptographic ciphers that have exactly the properties needed for doing randomization this way, so you just have to find a cipher that has the appropriate input and output range. And there is a little trickiness to that, and which is probably well beyond the scope of this interview so we won’t go into it.

DR: Exactly…no.

KT: But that’s essentially what’s done. But, I mean, to put it into simpler terms, the idea is…you can think of it as translating into a different language. You’re central system is keeping track of the serial numbers in English, where the numbers go sequentially, 1, 2, 3, 4, 5, but what you’re doing is you’re translating each number into French before you put it on the product, and the numbers, when translated into French come out in random order, or seemingly random order. So you’re doing the bookkeeping in English, you’re doing the printing in French, but it all works out, you don’t get any duplicates.

For the next installment of this interview, see Other Approaches to randomization. Stay tuned!

3 thoughts on “Randomization—An Interview with Ken Traub—Part 4: The Algorithmic Approach”

Clive Hohberger says:

April 28, 2014 at 11:05 am

I completely agree with Ken’s algorithmic approach. A set of sequential numbers can generate a set of corresponding random numbers in a very straight forward manner. This greatly simplifies the manufacturing serial number creation database.

My US Patent 7,137,000 describes such a method, designed to generate random numbers to be stored in an HF RFID tag, based on the sequential tag id serial numbers. Here, these were stored in a tag on a ribbon spool. The card printer had a secure but undernourished $2 16-bit microprocessor which read the ribbon tag id, recreated the random number using the same algorithm, and compared it to the one stored on the tag to authenticate the ribbon’s manufacturer.

As seen here, the amount of computation to generate the random number does not necessarily have excessive. The key is to invest the time in the elegance of the randomization algorithm to minimize the computation. That’s also part of what the patent is about. This can give every company a unique way of generating random numbers.
Joe B. says:

May 19, 2014 at 11:47 am

Thank you both for an excellent and informative Q&A. I was wondering what your respective views might be on the potential benefits of a cipher(encrypted) serialization process, insofar as the avoidance of Big Data issues. In particular, indexing challenges and shear data infrastructure costs when dealing with 100’s of millions (or even billions) of product codes per annum. I understand these issues may be mitigated by leveraging cipher keys…thereby avoiding individual code storage after on-line application.
1. Ken Traub says:
  
  May 19, 2014 at 4:20 pm
  
  Using a cipher to generate random serial numbers has the advantage that you can keep track of which numbers you have allocated through a simple counter, instead of keeping the complete list of random serials. In that sense, it reduces the stored data content.
  
  However, it is likely that you *also* want records of which serial numbers were issued on which date, with which batch/lot, which ones were discarded due to manufacturing errors, etc. So you will end up needing some sort of data record (a “commissioning event”) for each serial issued. Once you do that, the fact that the serial was generated by cipher doesn’t really help – you may as well store the actual serial with each commissioning event rather than the pre-ciphered index.