A neat little data trick
Mar. 28th, 2008 03:31 pmI work with data. And I suspect that the sort of thing I do isn't done by many people. So this will probably either be meaningless to you, painfully obvious, or boring - take your pick.
That doesn't mean I won't subject you to it, of course. :D
One of the things I do involves cleaning name data. This generally arrives in Excel format (and if it doesn't, I get pissed off). It can range anywhere from a few dozen names to a few thousand, and I've had files of over a hundred thousand names come in. Fortunately, that's rare.
The problem is that the data is almost always really shitty. In part this is because it was put together or maintained by people who either didn't care about the data, or didn't really understand it. It's also true that in some cases the ultimate source of that data is a competitor who knew that we'd be working with that data. I'm pretty sure they try to make it as difficult as possible.
Just to give you an idea, it's not uncommon to see data like this, all in a single column:
My goal is to create clean last and first name fields as quickly and accurately as possible. I simply don't have the time to go through each record in a thousands-plus name file, though. So I've evolved a lot of little tricks over the years. These generally involve searching and replacing with wildcards in various ways. If you don't know, the wildcard characters are "*" and "?". The asterisk represents any number of characters, and the question mark represents a single character.
But lately I've been seeing a new problem: somebody is throwing wildcards into the data. Specifically, asterisks. The problem is that if I try to do a global search and replace to remove "*", I end up deleting ALL the data - the "*" is taken by Excel as a wildcard. It's quite an annoying little trap.
It may not be intentional, of course; it could just be various idiots trying to "footnote" the data, although I've never seen an actual footnote in a name file.
But I thought of a solution, and had the pleasure of having the first thing I tried work perfectly. I copied the entire column and threw it into a Windows NotePad file. My hunch was that NotePad was just basic enough to not have the wildcard replace option. And it doesn't; a search and replace of "*" with nothing works perfectly. Then I copied the data from NotePad, pasted it back into Excel, and voila! clean data.
Which still has to be worked on, of course, but at least I've gotten rid of those damned asterisks.
That doesn't mean I won't subject you to it, of course. :D
One of the things I do involves cleaning name data. This generally arrives in Excel format (and if it doesn't, I get pissed off). It can range anywhere from a few dozen names to a few thousand, and I've had files of over a hundred thousand names come in. Fortunately, that's rare.
The problem is that the data is almost always really shitty. In part this is because it was put together or maintained by people who either didn't care about the data, or didn't really understand it. It's also true that in some cases the ultimate source of that data is a competitor who knew that we'd be working with that data. I'm pretty sure they try to make it as difficult as possible.
Just to give you an idea, it's not uncommon to see data like this, all in a single column:
Brown, Robert Q.
Harold ZbeodkwsiMr.
Neil
ABSOLOMPANDATHANIMA
John B BodyLiesAMoulderinInTheGrave
OBRIEN, Seamus
My goal is to create clean last and first name fields as quickly and accurately as possible. I simply don't have the time to go through each record in a thousands-plus name file, though. So I've evolved a lot of little tricks over the years. These generally involve searching and replacing with wildcards in various ways. If you don't know, the wildcard characters are "*" and "?". The asterisk represents any number of characters, and the question mark represents a single character.
But lately I've been seeing a new problem: somebody is throwing wildcards into the data. Specifically, asterisks. The problem is that if I try to do a global search and replace to remove "*", I end up deleting ALL the data - the "*" is taken by Excel as a wildcard. It's quite an annoying little trap.
It may not be intentional, of course; it could just be various idiots trying to "footnote" the data, although I've never seen an actual footnote in a name file.
But I thought of a solution, and had the pleasure of having the first thing I tried work perfectly. I copied the entire column and threw it into a Windows NotePad file. My hunch was that NotePad was just basic enough to not have the wildcard replace option. And it doesn't; a search and replace of "*" with nothing works perfectly. Then I copied the data from NotePad, pasted it back into Excel, and voila! clean data.
Which still has to be worked on, of course, but at least I've gotten rid of those damned asterisks.