A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Friday, March 2, 2018

Python pandas repeating character tester

Python pandas repeating character tester

At one of our clients, we are data profiling. They have a mainframe, it's been running for so long, they no longer have SMEs for their data. We've been able to leverage Service Broker to provide a real-time, under 3 seconds, remote file store for their data. It's pretty cool but now they are trying to do something with the data so we need to understand what the data looks like. We're using a mix of TSQL and python to understand nullability, value variances, etc. One of the "interesting" things we've discovered is that they loved placeholder values. Everyone knows a date of 88888888 is a placeholder for the actual date which they'll get two steps later in the workflow. Except sometimes we use 99999999 because the eights are the placeholder for the time.

Initially, we were just searching for one sentinel value, then two values until we saw the bigger pattern of "repeated values probably mean something." For us, this matters because we then need to discard those rows for data type suitability. 88888888 isn't a valid date so our logic might determine that column is best served by a numeric data type. Unless we exclude the eights value in which we get a 100% match rate on the column's ability to be converted to a date.

How can we determine if a string is nothing but repeated values in python? There's a very clever test from StackOverflow

source == source[0] * len(source) I would read that as "is the source variable exactly equal to the the first character of source repeated for the length of source?"

And that was good, until we hit a NULL (None in python-speak). We then took advantage of the ternary equivalent in python to make it

(source == source[0] * len(source)) if source else False

Enter Pandas (series)

Truth is a funny thing in an Pandas Series. Really, it is. The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().. We were trying to apply the above function as we were doing everything else

df.MyColumn.str.len()
# this will fail magnificantly
(df.MyColumn == df.MyColumn[0] * len(df.MyColumn)) if df.MyColumn else False

It took me a while since I hadn't really used the pandas library beyond running what my coworker had done. What I needed to do, was get a row context to apply the calculations for true/false. As it stands, the Series stuff wants to try and aggregate the booleans or something like that. And it makes sense from a SQL perspective, you can't really apply aggregates to bit fields (beyond COUNT).

So, what's the solution? As always, you're likely to say the exact thing you're looking for. In this case, apply was the keyword.

df.MyColumn.apply(lambda source: (source == source[0] * len(source)) if source else False)

Full code you can play with would be

import pandas
import pprint

def isRepeated(src):
    return (src == src[0] * len(src)) if src else False
    
df = pandas.DataFrame({"MyCol":pandas.Series(['AB', 'BC', 'BB', None])})

pprint.pprint(df)

print()
# What rows have the same character in all of them?

pprint.pprint(df.MyCol.apply(lambda source:(source == source[0] * len(source)) if source else False))
#If you'd like to avoid the anonymous function...
pprint.pprint(df.MyCol.apply(isRepeated))

In short, python is glorious and I'm happy to writing in it again ;)


No comments: