Nov. 12, 2024


Despite Python's dominance in data analysis due to its extensive ecosystem of machine learning and deep learning libraries, SQL continues to be a critical tool for processing large volumes of data. Its efficiency in querying and manipulating databases makes it a valuable asset in the data scientist's toolkit.

In certain scenarios, server-side computation is more efficient than processing data locally. For instance, attempting to handle terabytes of data from a cloud database on a local machine can lead to excessive network strain and quickly exhaust local computational resources. This article presents SQL-based alternatives that achieve equivalent results to three commonly used Pandas functions, enabling more efficient data processing directly on the server side.

OVER Clause

Since its standardization in 1986, SQL has continuously adapted to address new challenges. The OVER clause exemplifies this evolution, enabling the calculation of non-aggregated values alongside aggregated ones within a single query. Numerous resources offer valuable guidance and practical examples on using the OVER clause.

This article assumes familiarity with the basic principles of the OVER clause, but let's highlight the key components of a window specification. First, the scope of the window is determined by the PARTITION BY clause; if no partition is defined, the scope defaults to the entire source table. Second, the range of the window is controlled by the ROWS clause, which specifies where the window starts and ends relative to the current row. If the ROWS clause is missing, the range is determined by the ORDER BY clause, running from the first row of the partition up to the current row in the specified order. If neither ROWS nor ORDER BY is included, the range defaults to the entire window scope.
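For instance, here is a minimal sketch of an explicit ROWS specification, written against a hypothetical table t with the same x and category columns used below, that computes a three-row moving count:

SELECT
    x,
    category,
    -- The frame spans the two preceding rows plus the current row,
    -- so the window never holds more than three rows
    COUNT(x) OVER (PARTITION BY category
                   ORDER BY x ASC
                   ROWS BETWEEN 2 PRECEDING
                            AND CURRENT ROW) AS count_moving3
FROM t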

To better understand these cases, let's create a temporary table with a category column that we can use for partitioning. We'll then execute an SQL script to calculate various columns with the size of the current window range by including and excluding different components of the OVER clause. This will help us visualize how each component affects the final result.

WITH temp AS (
    SELECT 1 AS x, "A" AS category
    UNION ALL SELECT 2 AS x, "A" AS category
    UNION ALL SELECT 3 AS x, "A" AS category
    UNION ALL SELECT 4 AS x, "A" AS category
    UNION ALL SELECT 100 AS x, "B" AS category
    UNION ALL SELECT 110 AS x, "B" AS category
    UNION ALL SELECT 120 AS x, "B" AS category
    UNION ALL SELECT 130 AS x, "B" AS category
)

SELECT
    x,
    category,
    COUNT(x) OVER (ORDER BY x ASC) AS count_ob,
    COUNT(x) OVER (PARTITION BY category
                   ORDER BY x ASC) AS count_pb_ob,
    COUNT(x) OVER (PARTITION BY category) AS count_pb,
    COUNT(x) OVER () AS count
FROM temp
ORDER BY x ASC

The resulting table from the previous query demonstrates the impact of different OVER clause definitions on window sizes. Based on the category column, the table is partitioned into two groups: category A and category B. Consequently, the count_pb_ob and count_pb columns, which are partitioned by category, have a maximum window size of 4 (the number of rows in each category). In contrast, the count_ob and count columns, which lack a PARTITION BY clause, consider the entire table and thus have a maximum window size of 8.

x    category  count_ob  count_pb_ob  count_pb  count
1    A         1         1            4         8
2    A         2         2            4         8
3    A         3         3            4         8
4    A         4         4            4         8
100  B         5         1            4         8
110  B         6         2            4         8
120  B         7         3            4         8
130  B         8         4            4         8

It's crucial to note that OVER operations involving an ORDER BY clause (and no explicit ROWS clause) have a dynamic window range that grows with each row processed.
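When the ordering is needed but the growing frame is not, the frame can be pinned explicitly with a ROWS clause. A minimal sketch against the temp table above:

SELECT
    x,
    category,
    -- ORDER BY is present, but the explicit frame covers the whole
    -- partition, so the count stays at 4 for every row
    COUNT(x) OVER (PARTITION BY category
                   ORDER BY x ASC
                   ROWS BETWEEN UNBOUNDED PRECEDING
                            AND UNBOUNDED FOLLOWING) AS count_pinned
FROM temp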

Function Shift

The shift function in pandas is used to move (or "shift") the values in a Pandas Series up or down by a specified number of periods along the index. When shifting down, it introduces NaN values at the start of the series to fill in the missing values, and when shifting up, NaNs are introduced at the end. This function is particularly useful in time series analysis for calculating lagged or forward-looking values, which are often used to compare data points over different time steps or to create features for predictive modeling.

First approach

Let's construct a sample Pandas DataFrame with dummy data representing sensor measurements over time. The column value_shift3 is calculated by applying the shift operation to the value column with a period of 3. This moves each value three positions down the index, so every row sees the value recorded three rows earlier.

import pandas as pd

data = {
    "day": ["2024-11-01", "2024-11-02", "2024-11-03",
            "2024-11-04", "2024-11-05", "2024-11-06"],
    "sensor": ["sensor_1"] * 6,
    "value": [101, 102, 103, 104, 105, 106],
}
df = pd.DataFrame(data=data)
df["value_shift3"] = df["value"].shift(3)

This Python script produces the following output:

   day         sensor    value  value_shift3
0  2024-11-01  sensor_1  101    NaN
1  2024-11-02  sensor_1  102    NaN
2  2024-11-03  sensor_1  103    NaN
3  2024-11-04  sensor_1  104    101.0
4  2024-11-05  sensor_1  105    102.0
5  2024-11-06  sensor_1  106    103.0

Assuming we already have a table dummy_data in the database containing the dummy data from the previous example (but without the value_shift3 column, which we need to calculate), we can achieve the same result using the following SQL statement:

SELECT
    *,
    LAG(value, 3) OVER (ORDER BY day) AS value_shift3
FROM dummy_data
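As an aside, pandas also accepts negative periods (shift(-3)) to pull values from later rows; the SQL counterpart of that is LEAD. A sketch under the same assumptions about dummy_data:

SELECT
    *,
    -- Counterpart of df["value"].shift(-3): the value three rows ahead
    LEAD(value, 3) OVER (ORDER BY day) AS value_shift_minus3
FROM dummy_data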

Shifting with partitioning

While the previous approach works, it doesn't differentiate between sensor names. To shift values within each sensor individually, we need to add a PARTITION BY clause to the SQL statement (the pandas counterpart would be df.groupby("sensor")["value"].shift(3)). Note that executing this query against the same dummy_data table will yield the same result as before, since it contains only one sensor.

SELECT
    *,
    LAG(value, 3) OVER (PARTITION BY sensor
                        ORDER BY day) AS value_shift3
FROM dummy_data

To observe a distinct behavior, let's introduce data from a second sensor, sensor_2. The query then yields the following result:

    day         sensor    value  value_shift3
0   2024-11-01  sensor_1  101    NaN
1   2024-11-02  sensor_1  102    NaN
2   2024-11-03  sensor_1  103    NaN
3   2024-11-04  sensor_1  104    101
4   2024-11-05  sensor_1  105    102
5   2024-11-06  sensor_1  106    103
6   2024-11-01  sensor_2  11     NaN
7   2024-11-02  sensor_2  12     NaN
8   2024-11-03  sensor_2  13     NaN
9   2024-11-04  sensor_2  14     11
10  2024-11-05  sensor_2  15     12
11  2024-11-06  sensor_2  16     13

Dynamic window

As of this writing, window offsets are static: the offset argument of LAG must be a constant, so we can't directly use a per-row column (such as the window column in the next example) as a parameter for the LAG function.

However, we can employ a more complex SQL approach to achieve dynamic window behavior. By using the OVER clause to enumerate rows within each partition, and then looking up, for each row, the row that sits window positions earlier, we can effectively simulate a dynamic window. The query below assumes dummy_data now carries a per-row window column.

WITH data_with_rn AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY sensor
                           ORDER BY day ASC) AS rn
    FROM dummy_data
)
SELECT *,
    (SELECT sub.value
     FROM data_with_rn sub
     WHERE sub.sensor = jd.sensor
       AND sub.rn = jd.rn - jd.window
    ) AS value_shifted
FROM data_with_rn jd
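Correlated subqueries like the one above run once per row and can become a bottleneck on large tables. The same lookup can be phrased as a self left join, which most query planners handle more efficiently; this is an alternative sketch producing the same result, not a different technique from the article's:

WITH data_with_rn AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY sensor
                           ORDER BY day ASC) AS rn
    FROM dummy_data
)
SELECT jd.*,
    -- Rows with no match (rn - window < 1) stay NULL, as with LAG
    sub.value AS value_shifted
FROM data_with_rn jd
LEFT JOIN data_with_rn sub
    ON sub.sensor = jd.sensor
   AND sub.rn = jd.rn - jd.window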

The following table demonstrates how to shift column values based on the specified window size. Sensor 1 has a window size of 3, while Sensor 2 has a window size of 2.

    day         sensor    window  value  value_shifted
0   2024-11-01  sensor_1  3       101    NaN
1   2024-11-02  sensor_1  3       102    NaN
2   2024-11-03  sensor_1  3       103    NaN
3   2024-11-04  sensor_1  3       104    101
4   2024-11-05  sensor_1  3       105    102
5   2024-11-06  sensor_1  3       106    103
6   2024-11-01  sensor_2  2       11     NaN
7   2024-11-02  sensor_2  2       12     NaN
8   2024-11-03  sensor_2  2       13     11
9   2024-11-04  sensor_2  2       14     12
10  2024-11-05  sensor_2  2       15     13
11  2024-11-06  sensor_2  2       16     14

Functions bfill and ffill

In Pandas, the ffill (forward fill) and bfill (backward fill) methods are used to handle missing data by filling NaN values in a DataFrame or Series. Here’s how each method works:

  • ffill: This method propagates the last valid observation forward to fill any missing values. It replaces each NaN value with the most recent non-null value that appears before it in the column. This is particularly useful when you want to carry forward the last known data point, such as filling gaps in time series data.
  • bfill: This method fills NaN values by propagating the next valid observation backward. It replaces each missing value with the next non-null value that appears after it in the column. This can be helpful when you want to fill missing data points by looking forward to future values.

Both methods are often used when preparing data for analysis, particularly when dealing with time series or sequential datasets where carrying existing values forward or backward makes sense. The following code applies both functions to the value column, which contains missing values.

import pandas as pd
import numpy as np

data = {
    "day": ["2024-11-01", "2024-11-02", "2024-11-03",
            "2024-11-04", "2024-11-05", "2024-11-06"],
    "sensor": ["sensor_1"] * 6,
    "value": [np.nan, 102, np.nan, np.nan, 105, np.nan],
}
df = pd.DataFrame(data=data)
df["ffill"] = df["value"].ffill()
df["bfill"] = df["value"].bfill()

The table below displays the outcome of running the previous script. We can observe that the ffill method, which propagates the last valid observation forward, is unable to replace the first NaN value. Similarly, the bfill method, which propagates the next valid observation backward, cannot fill the last NaN value.

   day         sensor    value  ffill  bfill
0  2024-11-01  sensor_1  NaN    NaN    102.0
1  2024-11-02  sensor_1  102.0  102.0  102.0
2  2024-11-03  sensor_1  NaN    102.0  105.0
3  2024-11-04  sensor_1  NaN    102.0  105.0
4  2024-11-05  sensor_1  105.0  105.0  105.0
5  2024-11-06  sensor_1  NaN    105.0  NaN

We can replicate the results using the window function LAST_VALUE. This function returns the last value within a specified window range and includes an option to ignore null values. Both target functions can be implemented using the same operation, differentiated only by the order of processing: ffill applies in ascending order, while bfill applies in descending order. The following SQL script demonstrates this approach:

SELECT *,
    LAST_VALUE(value IGNORE NULLS)
        OVER (PARTITION BY sensor ORDER BY day ASC) AS ffill,
    LAST_VALUE(value IGNORE NULLS)
        OVER (PARTITION BY sensor ORDER BY day DESC) AS bfill
FROM dummy_data