A few months ago I had the opportunity to collaborate with some Data Scientists porting PySpark queries to raw Python. One of the primary areas of concern was aggregation statements, which were seen as functionality that would be particularly troublesome to write in Python. As an example, I was provided a Spark SQL query similar to this:
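The original query is not reproduced here, but based on the description later in the post it was shaped roughly like this; the table and column names are invented for illustration:

```sql
-- Hypothetical reconstruction: collapse tokens into lines of text,
-- grouped by their position in the document
SELECT page, block, paragraph, line,
       concat_ws(' ', collect_list(token)) AS line_text
FROM transformed
GROUP BY page, block, paragraph, line
```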
Beyond the raw functionality, I was asked whether the data could be structured to provide an interface with named columns, in a style similar to SQL, rather than having to reference data positionally.
All of this seemed fairly straightforward. Provided with some sample data, I pulled in the UDF that was already written in Python and set out to apply the transformations, first illustrating how we could interact with the data in a SQL-like way using pipeline transformations and named references.
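A minimal sketch of that named-reference idea, assuming a `namedtuple` row shape (the field names and sample data here are invented, not the original dataset):

```python
from collections import namedtuple

# Hypothetical column set; the real data had more fields
Row = namedtuple("Row", ["page", "block", "paragraph", "line", "token"])

raw = [
    (1, 1, 1, 1, "hello"),
    (1, 1, 1, 1, "world"),
    (1, 1, 1, 2, "foo"),
]

# Pipeline of generator transformations, with named access to columns
rows = (Row(*r) for r in raw)
upper = (r._replace(token=r.token.upper()) for r in rows)

result = list(upper)
print(result[0].token)   # "HELLO" -- named access instead of result[0][4]
```

Because each stage is a generator, stages compose lazily, much like chained transformations in Spark.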
With the transformation and named values out of the way, I moved on to the `GROUPBY` aggregations. Conceptually, `GROUPBY` applies an aggregation function to each unique value. That unique value can be represented in multiple ways, but I wanted to show the idea behind what was happening to help with future port efforts. So on my first pass I wrote:
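A sketch of what that first pass could have looked like; the `Row` shape, names, and sample data are assumptions made for illustration:

```python
from collections import namedtuple

# Assumed row shape: positional keys plus a token
Row = namedtuple("Row", ["page", "block", "paragraph", "line", "token"])

transformed = [
    Row(1, 1, 1, 1, "hello"),
    Row(1, 1, 1, 1, "world"),
    Row(1, 1, 1, 2, ""),        # empty token to be dropped
    Row(1, 1, 1, 2, "again"),
]

seen = set()
final = []

for row in transformed:
    # Hash the grouping key, mirroring what GROUP BY does conceptually
    key = hash((row.page, row.block, row.paragraph, row.line))
    if key not in seen:
        seen.add(key)
        # Re-scan the entire dataset for every new key
        matches = filter(
            lambda r: hash((r.page, r.block, r.paragraph, r.line)) == key,
            transformed,
        )
        tokens = [r.token for r in matches if r.token]   # drop empty tokens
        final.append((row.page, row.block, row.paragraph, row.line,
                      " ".join(tokens)))

print(final)
# [(1, 1, 1, 1, 'hello world'), (1, 1, 1, 2, 'again')]
```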
Keeping in mind that this was meant to conceptualize what could be happening behind the scenes for the `AGG` operation: here we loop over our rows, generating a hash from some values. Once we have this hash we check whether we have seen it before by referencing a `set`. If it is a new value, we find all rows matching that hash in our transformed data, append the tokens, handle empty tokens, and finally add the data to our final dataset. At the end we have lines of text (instead of individual tokens) that can be referenced by page, block, paragraph, and line number.
While this works, it's horribly inefficient: we re-iterate over our transformed data every time we find a new key. But the goal here wasn't efficiency. It was to show the ideas expressed in SQL with Python. Specifically, it highlighted how to express a `GROUPBY`/`AGG` operation manually, using hashes of values and tracking what we have and have not seen, producing a final dataset that matches the output of the SQL statement.
Replacing the block above:
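A hypothetical reconstruction of that replacement (the `Row` shape and sample data are assumptions): accumulate tokens in a dict keyed by the grouping values, so the data is scanned only once instead of once per key.

```python
from collections import namedtuple

Row = namedtuple("Row", ["page", "block", "paragraph", "line", "token"])

transformed = [
    Row(1, 1, 1, 1, "hello"),
    Row(1, 1, 1, 1, "world"),
    Row(1, 1, 1, 2, ""),
    Row(1, 1, 1, 2, "again"),
]

# Single pass: accumulate tokens per grouping key instead of re-scanning
groups = {}
for row in transformed:
    key = (row.page, row.block, row.paragraph, row.line)
    tokens = groups.setdefault(key, [])
    if row.token:                        # still drop empty tokens
        tokens.append(row.token)

final = [(*key, " ".join(toks)) for key, toks in groups.items()]
print(final)
# [(1, 1, 1, 1, 'hello world'), (1, 1, 1, 2, 'again')]
```

The tuple key is hashable on its own, so there is no need to call `hash` explicitly or track a separate `seen` set: the dict does both jobs.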
And with that change we have a clean, faster implementation. Additionally, since this was a port of Spark SQL, if the data were to get truly large it wouldn't be much work to iterate through the whole pipeline in batches, since we can use generators all the way through.
So what was the point of sharing this here? Nothing specific. It was a fun exercise at the time, and it made me pause to consider how I would express `GROUPBY` on my own. The exercise also helped introduce some of my colleagues to the `filter` expression and, in turn, `reduce`. Using those, they were able to express many of their pipeline concepts without the explicit iteration structures they were used to having abstracted away. If you find yourself doing a lot of pipelining, I recommend checking out `itertools` and `functools`. Both are built into the Python stdlib and provide a lot of helpful functionality.
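As a contrived illustration of that style (the data here is made up), `filter` and `functools.reduce` can replace explicit loops in a pipeline:

```python
from functools import reduce

tokens = ["hello", "", "world", "", "again"]

# filter(None, ...) drops falsy values (empty tokens) without a loop
non_empty = filter(None, tokens)

# reduce folds the remaining tokens into a single line of text
line = reduce(lambda acc, tok: f"{acc} {tok}" if acc else tok, non_empty, "")
print(line)   # hello world again
```

For this particular fold, `" ".join(...)` is the more idiomatic choice; `reduce` earns its keep when the combining step is something `join` can't express.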