January 2nd, 2020

Optimizing code with pandas and NumPy

by Patryk Golabek in Applied Machine Learning

Intro

Over the years, SciPy has emerged as one of the best frameworks for data science. SciPy defines itself as an ecosystem of open-source software for mathematics, science, and engineering. The core six packages of SciPy are NumPy, SciPy, Matplotlib, IPython, SymPy, and pandas. In this post, I’ll use NumPy and pandas to optimize the slow implementation from the last post.

Test Notebook – https://www.translucentcomputing.com/2020/01/pandas-and-numpy-performance-test-notebook/

Additional Notebook with restructured code – https://www.translucentcomputing.com/2020/01/performance-waveform-generator-starter-notebook/

Problem

The data science task is to generate synthetic time-series data. In the previous post, we created the first naive implementation.

import math
import random

def reallySlowGenerateTimeSeriesData(seconds,samples_per_second):
    """Generate synthetic data"""
    
    time = []
    signal = []   
    
    # generate signal
    sample_time = 0
    for s in range(seconds):        
        for sps in range(samples_per_second):
            sample_time += 1/samples_per_second
            noise = random.random()
            scaled_noise = -1 + (noise * 2)
            sample = math.sin(2*math.pi*10*sample_time) + scaled_noise
            time.append(sample_time)
            signal.append(sample)
    
    # return time and signal
    return [time,signal]

Running locally, it takes about 14 seconds on average to generate one hour of data at 1,000 samples per second. That is really slow! The profiler output does not make it obvious where the optimization should start.

Total time: 14.0719 s
File: <ipython-input-5-ad467de0be46>
Function: reallySlowGenerateTimeSeriesData at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def reallySlowGenerateTimeSeriesData(seconds,samples_per_second):
     3                                               """Generate synthetic data"""
     4                                               
     5         1          4.0      4.0      0.0      time = []
     6         1          1.0      1.0      0.0      signal = []   
     7                                               
     8                                               # generate signal
     9         1          1.0      1.0      0.0      sample_time = 0
    10      3601       1889.0      0.5      0.0      for s in range(seconds):        
    11   3603600    1589537.0      0.4     11.3          for sps in range(samples_per_second):
    12   3600000    2019769.0      0.6     14.4              sample_time += 1/samples_per_second
    13   3600000    2079579.0      0.6     14.8              noise = random.random()
    14   3600000    1921001.0      0.5     13.7              scaled_noise = -1 + (noise * 2)
    15   3600000    2801165.0      0.8     19.9              sample = math.sin(2*math.pi*10*sample_time) + scaled_noise
    16   3600000    1904107.0      0.5     13.5              time.append(sample_time)
    17   3600000    1754810.0      0.5     12.5              signal.append(sample)
    18                                               
    19                                               # return time and signal
    20         1          1.0      1.0      0.0      return [time,signal]

The time is spread evenly across the lines inside the nested for loops. There is no smoking gun, no single statement that is obviously slow. Your software engineering spidey sense might direct your eyes to the nested loops themselves. They are not quadratic: together they run seconds × samples_per_second iterations, which is linear in the number of samples. But each of those 3.6 million iterations pays the overhead of the Python interpreter. Let’s start there.
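
As a back-of-the-envelope check on the profile above, the totals it reports can be turned into a cost per sample. This is only a sketch; the variable names are mine and the numbers are copied from the profiler output.

```python
# Totals copied from the profiler output above.
seconds = 3600               # the outer loop line was hit 3601 times
samples_per_second = 1000    # the loop body ran 3,600,000 times in total
total_time_s = 14.0719

# The loop body runs once per sample, so the work grows linearly
# with the number of samples.
iterations = seconds * samples_per_second
per_sample_us = total_time_s / iterations * 1e6

print(f"{iterations:,} iterations, about {per_sample_us:.2f} microseconds each")
```

A few microseconds per sample is not much on its own; the cost comes from repeating it millions of times in interpreted code.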

Vectorization

The Python community, and the scientific software development community in general, has adopted the term “vectorization” to mean array programming (array-oriented computing), a process where you execute your “business logic” directly on the array without using loops. In Python, NumPy is the go-to library when it comes to vectorization.
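
To make the idea concrete, here is a minimal sketch that computes the same sines once with a Python loop and once with a single array expression. The input array is a made-up example, not data from the notebook.

```python
import math

import numpy as np

# Hypothetical sample input: five angles between 0 and pi.
angles = np.linspace(0.0, math.pi, 5)

# Loop version: Python calls math.sin once per element.
looped = [math.sin(a) for a in angles]

# Vectorized version: one call, and the loop runs inside NumPy's compiled code.
vectorized = np.sin(angles)

# Compare the two results within floating-point tolerance.
print(np.allclose(looped, vectorized))
```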

Optimize with NumPy

NumPy is an extension package to Python for array programming. It provides “closer to the hardware” optimization, which in Python means an implementation in C.

Looking at the first implementation, we see that the nested loop is used to create the time list. We start by refactoring the time list into a NumPy array.

import numpy as np

def slightlyFasterGenerateTimeSeriesData(seconds,samples_per_second):
    """Generate synthetic data"""
    
    # generate time 
    time = np.arange(0,seconds,1/samples_per_second)
    
    # generate signal
    signal = []
    for t in time:   
        noise = random.random()
        scaled_noise = -1 + (noise * 2)
        sample = math.sin(2*math.pi*10*t) + scaled_noise            
        signal.append(sample)
    
    # return time and signal
    return [time,signal]

This optimization reduced the average run time to about 10 seconds. Since it looks like we are going in the right direction, let’s try a full NumPy implementation.

def reallyFastGenerateTimeSeriesData(seconds,samples_per_second):
    """Generate synthetic data"""
    
    # generate time
    time = np.arange(0,seconds,1/samples_per_second)
    
    # generate signal
    noise = -2 * np.random.random(len(time)) + 1
    signal = np.sin(2*np.pi*10*time) + noise
    
    # return time and signal
    return [time,signal]

WOW! This implementation executes in about 0.1 seconds, roughly two orders of magnitude faster than the previous implementations.
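
If you want to reproduce this kind of comparison yourself, one option is the standard library's timeit module. The sketch below is not the notebook's benchmark. The two functions are simplified stand-ins with hypothetical names, the workload is much smaller so it finishes quickly, and the time axis is built from an integer index (an assumption of mine) so that it always has exactly seconds * samples_per_second entries.

```python
import math
import random
import timeit

import numpy as np


def loop_version(seconds, samples_per_second):
    """Hypothetical stand-in: pure-Python loop, one sample per iteration."""
    signal = []
    for i in range(seconds * samples_per_second):
        t = i / samples_per_second
        signal.append(math.sin(2 * math.pi * 10 * t) + random.uniform(-1, 1))
    return signal


def numpy_version(seconds, samples_per_second):
    """Hypothetical stand-in: same signal model, computed on whole arrays."""
    time = np.arange(seconds * samples_per_second) / samples_per_second
    noise = np.random.uniform(-1, 1, len(time))
    return np.sin(2 * np.pi * 10 * time) + noise


# A small workload (10 s at 1 kHz) keeps the comparison quick to run.
args = (10, 1000)
loop_s = min(timeit.repeat(lambda: loop_version(*args), number=1, repeat=3))
numpy_s = min(timeit.repeat(lambda: numpy_version(*args), number=1, repeat=3))
print(f"loop: {loop_s:.4f} s, numpy: {numpy_s:.4f} s")
```

Taking the minimum over a few repeats reduces the influence of other processes on the machine. The absolute numbers will differ from the ones quoted in this post.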

The NumPy library comes with vectorized versions of most of the mathematical functions in core Python, of the random functions, and a lot more. In this implementation, the Python math and random functions were replaced with their NumPy versions, and the signal generation was executed directly on NumPy arrays without any loops.

For data manipulation and analysis, the day-to-day data science work, you reach for another library. While NumPy is the numerical workhorse, pandas is the tool for manipulating and analyzing data.
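
As an aside on the random functions: the code in this post uses np.random.random, but NumPy also offers a Generator interface created with np.random.default_rng, which its documentation suggests for new code. The sketch below shows how the noise could be drawn with it. It is an alternative, not what the notebook does, and the seed value is arbitrary.

```python
import numpy as np

# Seeded generator: the same seed reproduces the same noise.
rng = np.random.default_rng(seed=42)

# One second of samples at 1 kHz.
time = np.arange(0, 1, 1 / 1000)

# Uniform noise between -1 and 1, one value per time sample.
noise = rng.uniform(-1.0, 1.0, size=time.shape)
signal = np.sin(2 * np.pi * 10 * time) + noise

# Re-creating the generator with the same seed yields identical noise.
same_noise = np.random.default_rng(seed=42).uniform(-1.0, 1.0, size=time.shape)
print(np.array_equal(noise, same_noise))
```

A seeded generator makes the synthetic data repeatable between runs, which helps when comparing implementations.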

Optimize with pandas

pandas is the library for data manipulation and analysis, and it is usually the starting point for your data science tasks. It lets you read and write data from and to multiple sources, handle missing data, align and reshape your data, merge and join it with other data, and search, group, and slice it. It really is a Swiss Army knife.
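
A tiny illustration of two of those features, filling in missing data and grouping, on made-up values. The DataFrame below is a hypothetical toy example, not the data generated in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two seconds sampled at 4 Hz, with one missing reading.
df = pd.DataFrame({
    "time": np.arange(8) / 4,
    "signal": [0.0, 1.0, np.nan, -1.0, 0.0, 1.0, 0.0, -1.0],
})

# Fill the gap by linear interpolation between its neighbours.
df["signal"] = df["signal"].interpolate()

# Group the samples by the whole second they fall in and average each group.
per_second = df.groupby(df["time"].astype(int))["signal"].mean()
print(per_second)
```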

Since most likely you will start with pandas, let’s refactor the code to use pandas by starting with the original implementation and adding pandas to it.

import pandas as pd

def pandasReallySlowGenerateTimeSeriesData(seconds,samples_per_second):
    """Generate synthetic data"""
    
    # generate time
    time = np.arange(0,seconds,1/samples_per_second)
    
    # create pandas
    df = pd.DataFrame(data=time, columns=['time'])
    
    def generateSignal(t):
        noise = random.random()
        scaled_noise = -1 + (noise * 2)
        return math.sin(2*math.pi*10*t) + scaled_noise 
    
    # generate signal
    df['signal'] = df['time'].apply(lambda t: generateSignal(t))
       
    # return time and signal
    return [df['time'],df['signal']]

On average this implementation runs in 5 seconds. pandas, like NumPy, has been designed to work with vectorization. Let’s update the code to use vectorization.

def pandasFasterSlowGenerateTimeSeriesData(seconds,samples_per_second):
    """Generate synthetic data"""
    
    # generate time
    time = np.arange(0,seconds,1/samples_per_second)
    
    # create pandas
    df = pd.DataFrame(data=time, columns=['time'])
    
    def generateSignal(t):
        noise = -2 * np.random.random(len(t)) + 1
        return np.sin(2*np.pi*10*t) + noise
    
    # generate signal
    df['signal'] = generateSignal(df['time'])
       
    # return time and signal
    return [df['time'],df['signal']]

This implementation runs in about 0.12 seconds. We are back to reasonable running times. pandas also integrates with NumPy and you can often squeeze out more performance by using NumPy arrays with pandas. Here is one such implementation.

def pandasNumpyFastSlowGenerateTimeSeriesData(seconds,samples_per_second):
    """Generate synthetic data"""
    
    # generate time
    time = np.arange(0,seconds,1/samples_per_second)
    
    # create pandas
    df = pd.DataFrame(data=time, columns=['time'])
    
    def generateSignal(t):
        noise = -2 * np.random.random(len(t)) + 1
        return np.sin(2*np.pi*10*t) + noise
    
    # generate signal
    df['signal'] = generateSignal(df['time'].values)
       
    # return time and signal
    return [df['time'],df['signal']]

The only change is that the function now receives df['time'].values, the underlying NumPy array, instead of the pandas Series itself. This implementation is slightly faster under the same test conditions, and it scales nicely with a lot more data and additional processing.

# Data types
type(dataFrame['time'])
pandas.core.series.Series
type(dataFrame['time'].values)
numpy.ndarray
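
A related note: the pandas documentation recommends the to_numpy() method over the .values attribute for getting a NumPy array out of a Series. The sketch below, on a small made-up DataFrame, shows that for a plain numeric column both give an ndarray with the same contents.

```python
import numpy as np
import pandas as pd

# Hypothetical small DataFrame standing in for the one built in the post.
df = pd.DataFrame({"time": np.arange(5) / 5})

as_values = df["time"].values      # attribute used in the post
as_numpy = df["time"].to_numpy()   # method recommended by the pandas docs

# Both are NumPy arrays holding the same numbers.
print(type(as_values).__name__, type(as_numpy).__name__)
print(np.array_equal(as_values, as_numpy))
```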

Bonus

Since this is already a long post, I’ll get to SymPy in the next post. Here is a visualization of the generated synthetic time-series data.

As you can see, this noisy data, by definition, has no information in it. In the notebook, I show the initial steps of adding information into the data. The data is represented as a sound wave here, but if you change your perspective a bit, the same process and data structure apply to other sources of time-series data, such as a heartbeat.

Conclusion

The Python ecosystem is full of libraries that have been battle-tested and optimized for data science. Data science tasks are computationally intensive. Writing efficient code from the start, rather than buying into the mantra that premature optimization is the root of all evil, will allow you to work efficiently in your local environment before moving to the cloud.
