Fast 1D Linear NaN Interpolation over Large 3D Array
Introduction
In this article, we will explore the problem of interpolating missing values in a large 3D array. The data is structured such that each value along the first axis represents a different point in time. Some values are missing due to acquisition failures or other reasons; because the array holds int16 data, which cannot represent NaN (Not a Number), these gaps are encoded with a no-data sentinel of -32768 and converted to NaN for interpolation.
We will discuss the current approach using scipy.interpolate.interp1d with pandas, and then explore an optimized solution using numba.
Background
The interp1d function from scipy.interpolate is a powerful tool for performing linear interpolation over 1D arrays. pandas exposes similar functionality through DataFrame.interpolate, which provides a convenient interface for filling missing values along the time axis.
However, as we will see later, this approach can be computationally expensive, especially when dealing with large datasets like our 3D array.
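To make the interface concrete, the following minimal, self-contained example shows pandas filling NaN gaps in a 1D series by linear interpolation:

import numpy as np
import pandas as pd

# A short series with two gaps; interpolate() fills them linearly.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.interpolate(method="linear").tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]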
Current Approach
The current implementation uses the following code snippet:
import numpy as np
import pandas as pd
def interpolate_nan(arr, method="linear", limit=3):
    """Return array interpolated along the time axis to fill missing values."""
    result = np.zeros_like(arr, dtype=np.int16)
    for i in range(arr.shape[1]):
        # slice along the y axis, interpolate with the pandas wrapper to interp1d
        line_stack = pd.DataFrame(data=arr[:, i, :], dtype=np.float32)
        line_stack.replace(to_replace=-32768, value=np.nan, inplace=True)
        line_stack.interpolate(method=method, axis=0, inplace=True, limit=limit)
        line_stack.replace(to_replace=np.nan, value=-32768, inplace=True)
        result[:, i, :] = line_stack.values.astype(np.int16)
    return result
As we can see, this implementation uses a for loop to iterate over the second axis of the 3D array. For each slice, it creates a pandas.DataFrame from the corresponding values, replaces the no-data sentinel -32768 with NaN, interpolates the missing values along the time axis (up to a gap of limit consecutive steps), and finally writes the sentinel back before casting the result to int16.
However, as mentioned in the question, this approach is computationally expensive, taking around 2 seconds per slice on the original dataset.
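Before optimizing, it helps to have a reproducible way to exercise the function. The following sketch builds a synthetic int16 array with roughly 10% of its values set to the no-data sentinel; the shape (10, 50, 50) is illustrative, not the original dataset:

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 1000, size=(10, 50, 50)).astype(np.int16)
mask = rng.random(data.shape) < 0.1   # knock out roughly 10% of the values
data[mask] = -32768                   # no-data sentinel, as in the code above

filled = interpolate_nan(data)        # fills gaps of up to `limit` time steps
print(filled.shape, filled.dtype)     # (10, 50, 50) int16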
Optimized Solution
To improve performance, we can use numba to optimize the interpolation process. The optimized code snippet is:
import numpy as np
from numba import jit

@jit(nopython=True)
def interpolate_numba(arr, no_data=-32768):
    """Return array interpolated along the time axis to fill missing values."""
    result = np.zeros_like(arr, dtype=np.int16)
    for x in range(arr.shape[2]):
        # slice along the x axis
        for y in range(arr.shape[1]):
            # slice along the y axis
            for z in range(arr.shape[0]):
                value = arr[z, y, x]
                if z == 0:  # don't interpolate the first value
                    new_value = value
                elif z == arr.shape[0] - 1:  # don't interpolate the last value
                    new_value = value
                elif value == no_data:  # interpolate
                    left = arr[z - 1, y, x]
                    right = arr[z + 1, y, x]
                    # look for valid neighbours
                    if left != no_data and right != no_data:  # both neighbours valid
                        new_value = (left + right) / 2
                    elif left == no_data and z == 1:  # boundary condition left
                        new_value = value
                    elif right == no_data and z == arr.shape[0] - 2:  # boundary condition right
                        new_value = value
                    elif left == no_data and right != no_data:  # take the second neighbour to the left
                        more_left = arr[z - 2, y, x]
                        if more_left == no_data:
                            new_value = value
                        else:
                            new_value = (more_left + right) / 2
                    elif left != no_data and right == no_data:  # take the second neighbour to the right
                        more_right = arr[z + 2, y, x]
                        if more_right == no_data:
                            new_value = value
                        else:
                            new_value = (more_right + left) / 2
                    else:  # both neighbours missing: take the second neighbour on both sides
                        more_left = arr[z - 2, y, x]
                        more_right = arr[z + 2, y, x]
                        if more_left != no_data and more_right != no_data:
                            new_value = (more_left + more_right) / 2
                        else:
                            new_value = value
                else:
                    new_value = value
                result[z, y, x] = int(new_value)
    return result
As we can see, this implementation iterates over every element of the 3D array in three nested loops. Both versions visit each element a constant number of times, so the asymptotic complexity does not change; the speedup comes from numba compiling the loops to machine code, which eliminates the per-slice pandas and Python interpreter overhead.
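A rough way to verify the speedup, reusing the synthetic data array from the earlier sketch, is shown below. Absolute numbers depend on hardware and array size, and the first call to interpolate_numba pays a one-off JIT compilation cost, so we warm it up before timing:

import timeit

interpolate_numba(data)  # warm-up call triggers JIT compilation

t_pandas = timeit.timeit(lambda: interpolate_nan(data), number=3) / 3
t_numba = timeit.timeit(lambda: interpolate_numba(data), number=3) / 3
print(f"pandas: {t_pandas:.3f} s per call, numba: {t_numba:.3f} s per call")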
Conclusion
In conclusion, while the original approach using scipy.interpolate.interp1d with pandas provides a convenient interface for interpolating missing values, it can be computationally expensive. The optimized solution using numba significantly improves performance by replacing the per-slice pandas machinery with compiled loops.
We hope this article has provided valuable insights into optimizing interpolation in large 3D arrays. With numba, we can achieve significant performance improvements while producing results that match linear interpolation exactly for single-step gaps and approximate it for wider ones.
Additional Considerations
While the optimized solution using numba provides a significant improvement in performance, there are additional considerations to keep in mind:
- Boundary Conditions: The optimized solution never interpolates the first and last values along the time axis. Depending on the requirements of your application, you may need to modify this behaviour; one possible variant is sketched after this list.
- Neighbour Selection: When an immediate neighbour is missing, the optimized solution falls back to the second neighbour, which only approximates linear interpolation for gaps wider than one step. You can modify this strategy to suit your specific needs.
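As an illustration of the first point, here is a sketch of an alternative edge rule that copies the nearest valid value into a missing first or last time step instead of leaving it untouched. fill_edges_nearest is a hypothetical helper, not part of the original code, and it assumes the array has at least two time steps:

import numpy as np
from numba import jit

@jit(nopython=True)
def fill_edges_nearest(arr, no_data=-32768):
    """Hypothetical variant: copy the nearest valid value into a missing
    first or last entry along the time axis (axis 0), modifying arr in place."""
    n = arr.shape[0]
    for x in range(arr.shape[2]):
        for y in range(arr.shape[1]):
            if arr[0, y, x] == no_data and arr[1, y, x] != no_data:
                arr[0, y, x] = arr[1, y, x]
            if arr[n - 1, y, x] == no_data and arr[n - 2, y, x] != no_data:
                arr[n - 1, y, x] = arr[n - 2, y, x]
    return arr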
By understanding the trade-offs involved in optimizing interpolation processes, you can make informed decisions about which approach best suits your application’s requirements.
Last modified on 2023-11-18