Beyond Pandas: SQL's OVER clause
Despite Python's dominance in data analysis due to its extensive ecosystem of machine learning and deep learning libraries, SQL continues to be a critical tool for processing large volumes of data. Its efficiency in querying and manipulating databases makes it a valuable asset in the data scientist's toolkit.
In certain scenarios, server-side computation is more efficient than processing data locally. For instance, attempting to handle terabytes of data from a cloud database on a local machine can lead to excessive network strain and quickly exhaust local computational resources. This article presents SQL-based alternatives that achieve equivalent results to three commonly used Pandas functions, enabling more efficient data processing directly on the server side.
OVER Clause
Since its first standardization in 1986, SQL has continuously adapted to address new challenges. The OVER clause, standardized with window functions in SQL:2003, exemplifies this evolution: it enables the calculation of non-aggregated values alongside aggregated ones within a single query. Numerous resources offer valuable guidance and practical examples on using the OVER clause. Here are a few helpful references:
Google: Window function calls
Microsoft: SELECT - OVER clause (Transact-SQL)
Oracle: Analytic Functions
This article assumes familiarity with the basic principles of the OVER clause. Still, let's highlight the key components of a window specification. First, the scope of the window is determined by the PARTITION BY clause; if no partition is defined, the scope defaults to the entire source table. Second, the range (frame) of the window is controlled by the ROWS clause, which specifies where the window starts and ends relative to the current row. If the ROWS clause is missing but an ORDER BY clause is present, the window ranges from the first row of the scope up to the current row (including any peer rows that share the current row's ORDER BY value). If neither a ROWS nor an ORDER BY clause is included, the window range defaults to the entire window scope.
To better understand these cases, let's create a temporary table with a category column that we can use for partitioning. We'll then execute an SQL script to calculate various columns with the size of the current window range by including and excluding different components of the OVER clause. This will help us visualize how each component affects the final result.
WITH temp AS (
  SELECT 1 AS x, "A" AS category
  UNION ALL SELECT 2 AS x, "A" AS category
  UNION ALL SELECT 3 AS x, "A" AS category
  UNION ALL SELECT 4 AS x, "A" AS category
  UNION ALL SELECT 100 AS x, "B" AS category
  UNION ALL SELECT 110 AS x, "B" AS category
  UNION ALL SELECT 120 AS x, "B" AS category
  UNION ALL SELECT 130 AS x, "B" AS category
)
SELECT
  x,
  category,
  -- ORDER BY only: frame grows from the first row to the current row
  COUNT(x) OVER (ORDER BY x ASC) AS count_ob,
  -- PARTITION BY + ORDER BY: growing frame, restricted to the category
  COUNT(x) OVER (PARTITION BY category
                 ORDER BY x ASC) AS count_pb_ob,
  -- PARTITION BY only: frame is the whole partition
  COUNT(x) OVER (PARTITION BY category) AS count_pb,
  -- empty OVER (): frame is the entire table
  COUNT(x) OVER () AS count
FROM temp
ORDER BY x ASC
The resulting table from the previous query demonstrates the impact of different OVER clause definitions on window sizes. Based on the category column, the table is partitioned into two groups: category A and category B. Consequently, the count_pb_ob and count_pb columns, which are partitioned by category, have a maximum window size of 4 (the number of rows in each category). In contrast, the count_ob and count columns, which lack a PARTITION BY clause, consider the entire table and thus have a maximum window size of 8.
| x | category | count_ob | count_pb_ob | count_pb | count |
|---|---|---|---|---|---|
| 1 | A | 1 | 1 | 4 | 8 |
| 2 | A | 2 | 2 | 4 | 8 |
| 3 | A | 3 | 3 | 4 | 8 |
| 4 | A | 4 | 4 | 4 | 8 |
| 100 | B | 5 | 1 | 4 | 8 |
| 110 | B | 6 | 2 | 4 | 8 |
| 120 | B | 7 | 3 | 4 | 8 |
| 130 | B | 8 | 4 | 4 | 8 |
It's crucial to note that window functions with an ORDER BY clause but no explicit ROWS clause have a cumulative frame that grows with each row processed.
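The ROWS clause makes this frame explicit and lets us cap it. As a minimal sketch reusing the temp CTE from the example above, the following query limits the frame to the current row and the two rows before it, so the count never exceeds 3:
SELECT
  x,
  -- explicit frame: only the current row and the two preceding rows
  COUNT(x) OVER (ORDER BY x ASC
                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS count_rows3
FROM temp
ORDER BY x ASC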
Function Shift
The shift function in Pandas moves the values of a Series up or down by a specified number of periods along the index. When shifting down, NaN values are introduced at the start of the series to fill the gap; when shifting up, NaNs appear at the end. This function is particularly useful in time series analysis for computing lagged or forward-looking values, which are often used to compare data points across time steps or to create features for predictive modeling.
First approach
Let's construct a sample Pandas DataFrame with dummy data representing sensor measurements over time. The column value_shift3 is calculated by applying a shift operation with a period of three to the value column, so each row receives the value observed three rows earlier and the first three rows become NaN.
import pandas as pd

# Dummy sensor measurements over six consecutive days
data = {
    "day": ["2024-11-01", "2024-11-02", "2024-11-03", "2024-11-04", "2024-11-05", "2024-11-06"],
    "sensor": ["sensor_1", "sensor_1", "sensor_1", "sensor_1", "sensor_1", "sensor_1"],
    "value": [101, 102, 103, 104, 105, 106],
}
df = pd.DataFrame(data=data)
# Each row receives the value from three rows earlier; the first three become NaN
df["value_shift3"] = df["value"].shift(3)
This Python script produces the following output:
|   | day | sensor | value | value_shift3 |
|---|---|---|---|---|
| 0 | 2024-11-01 | sensor_1 | 101 | NaN |
| 1 | 2024-11-02 | sensor_1 | 102 | NaN |
| 2 | 2024-11-03 | sensor_1 | 103 | NaN |
| 3 | 2024-11-04 | sensor_1 | 104 | 101 |
| 4 | 2024-11-05 | sensor_1 | 105 | 102 |
| 5 | 2024-11-06 | sensor_1 | 106 | 103 |
Assuming we already have a table dummy_data in the database containing the dummy data from the previous example (but without the value_shift3 column, which we need to calculate), we can achieve the same result using the following SQL statement:
SELECT
  *,
  -- LAG(value, 3): the value from three rows earlier; NULL when no such row exists
  LAG(value, 3) OVER (ORDER BY day) AS value_shift3
FROM dummy_data
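If the dummy_data table does not exist yet, one way to test the query is to inline the rows with a CTE, mirroring the temp example above (a sketch; the literal values simply reproduce the dummy data):
WITH dummy_data AS (
  SELECT "2024-11-01" AS day, "sensor_1" AS sensor, 101 AS value
  UNION ALL SELECT "2024-11-02", "sensor_1", 102
  UNION ALL SELECT "2024-11-03", "sensor_1", 103
  UNION ALL SELECT "2024-11-04", "sensor_1", 104
  UNION ALL SELECT "2024-11-05", "sensor_1", 105
  UNION ALL SELECT "2024-11-06", "sensor_1", 106
)
SELECT
  *,
  LAG(value, 3) OVER (ORDER BY day) AS value_shift3
FROM dummy_data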
Shifting with partitioning
While the previous approach works, it doesn't differentiate between sensors. To shift values within each sensor individually, we need to add a PARTITION BY clause to the SQL statement. Note that executing this query against the same dummy_data table, which so far contains only one sensor, yields the same result as before.
SELECT
  *,
  LAG(value, 3) OVER (PARTITION BY sensor
                      ORDER BY day) AS value_shift3
FROM dummy_data
To observe a distinct behavior, let's introduce data from a second sensor, sensor_2. This yields the following result:
|   | day | sensor | value | value_shift3 |
|---|---|---|---|---|
| 0 | 2024-11-01 | sensor_1 | 101 | NaN |
| 1 | 2024-11-02 | sensor_1 | 102 | NaN |
| 2 | 2024-11-03 | sensor_1 | 103 | NaN |
| 3 | 2024-11-04 | sensor_1 | 104 | 101 |
| 4 | 2024-11-05 | sensor_1 | 105 | 102 |
| 5 | 2024-11-06 | sensor_1 | 106 | 103 |
| 6 | 2024-11-01 | sensor_2 | 11 | NaN |
| 7 | 2024-11-02 | sensor_2 | 12 | NaN |
| 8 | 2024-11-03 | sensor_2 | 13 | NaN |
| 9 | 2024-11-04 | sensor_2 | 14 | 11 |
| 10 | 2024-11-05 | sensor_2 | 15 | 12 |
| 11 | 2024-11-06 | sensor_2 | 16 | 13 |
Dynamic window
As of this writing, window frames are static and cannot be adjusted per row through the OVER clause. This means we can't pass a column value, such as a per-row window size, as the offset parameter of the LAG function.
However, we can employ a more complex SQL approach to achieve dynamic window behavior. By using the OVER clause to enumerate the rows within each partition and then looking up, for each row, the row whose number equals the current row number minus the window value, we can effectively simulate a dynamic window function.
WITH data_with_rn AS (
  SELECT *,
         -- enumerate rows within each sensor, ordered by day
         ROW_NUMBER() OVER (PARTITION BY sensor
                            ORDER BY day ASC) AS rn
  FROM dummy_data
)
SELECT *,
       -- look up the value sitting `window` rows earlier within the same sensor
       ( SELECT sub.value
         FROM data_with_rn sub
         WHERE sub.sensor = jd.sensor AND
               sub.rn = jd.rn - jd.window
       ) AS value_shifted
FROM data_with_rn jd
The following table shows the result for a dummy_data table that also contains a per-row window column controlling the shift: sensor_1 has a window size of 3, while sensor_2 has a window size of 2.
|   | day | sensor | window | value | value_shifted |
|---|---|---|---|---|---|
| 0 | 2024-11-01 | sensor_1 | 3 | 101 | NaN |
| 1 | 2024-11-02 | sensor_1 | 3 | 102 | NaN |
| 2 | 2024-11-03 | sensor_1 | 3 | 103 | NaN |
| 3 | 2024-11-04 | sensor_1 | 3 | 104 | 101 |
| 4 | 2024-11-05 | sensor_1 | 3 | 105 | 102 |
| 5 | 2024-11-06 | sensor_1 | 3 | 106 | 103 |
| 6 | 2024-11-01 | sensor_2 | 2 | 11 | NaN |
| 7 | 2024-11-02 | sensor_2 | 2 | 12 | NaN |
| 8 | 2024-11-03 | sensor_2 | 2 | 13 | 11 |
| 9 | 2024-11-04 | sensor_2 | 2 | 14 | 12 |
| 10 | 2024-11-05 | sensor_2 | 2 | 15 | 13 |
| 11 | 2024-11-06 | sensor_2 | 2 | 16 | 14 |
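The same result can also be expressed with an explicit LEFT JOIN instead of a correlated subquery. This is a sketch under the same assumed schema; depending on the engine, the optimizer may handle the join form better:
WITH data_with_rn AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY sensor
                            ORDER BY day ASC) AS rn
  FROM dummy_data
)
SELECT jd.*,
       sub.value AS value_shifted
FROM data_with_rn jd
-- match each row with the row `window` positions earlier in the same sensor
LEFT JOIN data_with_rn sub
  ON sub.sensor = jd.sensor
  AND sub.rn = jd.rn - jd.window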
Functions bfill and ffill
In Pandas, the ffill (forward fill) and bfill (backward fill) methods are used to handle missing data by filling NaN values in a DataFrame or Series. Here’s how each method works:
ffill: This method propagates the last valid observation forward to fill any missing values. It replaces each NaN value with the most recent non-null value that appears before it in the column. This is particularly useful when you want to carry forward the last known data point, such as filling gaps in time series data.
bfill: This method fills NaN values by propagating the next valid observation backward. It replaces each missing value with the next non-null value that appears after it in the column. This can be helpful when you want to fill missing data points by looking forward to future values.
Both methods are often used when preparing data for analysis, particularly when dealing with time series or sequential datasets where carrying forward or backward existing values makes sense. The following code applies both functions to the value column, which contains missing values.
import pandas as pd
import numpy as np

# Dummy sensor measurements with gaps in the value column
data = {
    "day": ["2024-11-01", "2024-11-02", "2024-11-03", "2024-11-04", "2024-11-05", "2024-11-06"],
    "sensor": ["sensor_1", "sensor_1", "sensor_1", "sensor_1", "sensor_1", "sensor_1"],
    "value": [np.nan, 102, np.nan, np.nan, 105, np.nan],
}
df = pd.DataFrame(data=data)
df["ffill"] = df["value"].ffill()  # propagate last valid value forward
df["bfill"] = df["value"].bfill()  # propagate next valid value backward
The table below displays the outcome of running the previous script. We can observe that the ffill method, which propagates the last valid observation forward, is unable to replace the first NaN value. Similarly, the bfill method, which propagates the next valid observation backward, cannot fill the last NaN value.
|   | day | sensor | value | ffill | bfill |
|---|---|---|---|---|---|
| 0 | 2024-11-01 | sensor_1 | NaN | NaN | 102 |
| 1 | 2024-11-02 | sensor_1 | 102 | 102 | 102 |
| 2 | 2024-11-03 | sensor_1 | NaN | 102 | 105 |
| 3 | 2024-11-04 | sensor_1 | NaN | 102 | 105 |
| 4 | 2024-11-05 | sensor_1 | 105 | 105 | 105 |
| 5 | 2024-11-06 | sensor_1 | NaN | 105 | NaN |
We can replicate the results using the window function LAST_VALUE. This function returns the last value within a specified window range and includes an option to ignore null values. Both target functions can be implemented using the same operation, differentiated only by the order of processing: ffill applies in ascending order, while bfill applies in descending order. The following SQL script demonstrates this approach:
SELECT *,
       -- ascending order: the last non-null value up to the current row (ffill)
       LAST_VALUE(value IGNORE NULLS)
         OVER (PARTITION BY sensor ORDER BY day ASC) AS ffill,
       -- descending order: the last non-null value up to the current row (bfill)
       LAST_VALUE(value IGNORE NULLS)
         OVER (PARTITION BY sensor ORDER BY day DESC) AS bfill
FROM dummy_data
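Note that IGNORE NULLS is not available in every engine (PostgreSQL, for instance, does not support it in window functions). A common workaround for the ffill case, sketched below, first counts the non-null values seen so far, which assigns every row to the group opened by the last non-null value, and then picks that group's single non-null value:
SELECT *,
       MAX(value) OVER (PARTITION BY sensor, grp) AS ffill
FROM (
  SELECT *,
         -- COUNT(value) skips NULLs, so grp only changes at non-null rows
         COUNT(value) OVER (PARTITION BY sensor
                            ORDER BY day ASC) AS grp
  FROM dummy_data
) t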