January 2nd, 2020

# Optimizing code with pandas and NumPy

#### Intro

Over the years, SciPy has emerged as one of the leading ecosystems for data science. SciPy describes itself as an ecosystem of open-source software for mathematics, science, and engineering. Its six core packages are NumPy, the SciPy library, Matplotlib, IPython, SymPy, and pandas. In this post, I'll use NumPy and pandas to optimize the slow implementation from the last post.

Additional Notebook with restructured code – https://www.translucentcomputing.com/2020/01/performance-waveform-generator-starter-notebook/

#### Problem

The data science task is to generate synthetic time-series data. In the previous post, we created the first naive implementation.

```python
import math
import random


def reallySlowGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""

    time = []
    signal = []

    # generate signal
    sample_time = 0
    for s in range(seconds):
        for sps in range(samples_per_second):
            sample_time += 1 / samples_per_second
            noise = random.random()
            scaled_noise = -1 + (noise * 2)
            sample = math.sin(2 * math.pi * 10 * sample_time) + scaled_noise
            time.append(sample_time)
            signal.append(sample)

    # return time and signal
    return [time, signal]
```

On average, running it locally takes about 14 seconds. Really slow! And it might not be clear from the profiler output where to start the refactoring.

```
Total time: 14.0719 s
Function: reallySlowGenerateTimeSeriesData at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def reallySlowGenerateTimeSeriesData(seconds,samples_per_second):
     3                                               """Generate synthetic data"""
     4
     5         1          4.0      4.0      0.0      time = []
     6         1          1.0      1.0      0.0      signal = []
     7
     8                                               # generate signal
     9         1          1.0      1.0      0.0      sample_time = 0
    10      3601       1889.0      0.5      0.0      for s in range(seconds):
    11   3603600    1589537.0      0.4     11.3          for sps in range(samples_per_second):
    12   3600000    2019769.0      0.6     14.4              sample_time += 1/samples_per_second
    13   3600000    2079579.0      0.6     14.8              noise = random.random()
    14   3600000    1921001.0      0.5     13.7              scaled_noise = -1 + (noise * 2)
    15   3600000    2801165.0      0.8     19.9              sample = math.sin(2*math.pi*10*sample_time) + scaled_noise
    16   3600000    1904107.0      0.5     13.5              time.append(sample_time)
    17   3600000    1754810.0      0.5     12.5              signal.append(sample)
    18
    19                                               # return time and signal
    20         1          1.0      1.0      0.0      return [time,signal]
```

The time is evenly distributed across the lines inside the nested loops; there is no smoking gun such as one obviously slow call. Your software engineering spidey sense might instead point you at the loop structure itself: the nested loops execute seconds × samples_per_second iterations of interpreted Python, 3.6 million in this run. Let's start there.
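Output in the per-line format shown above comes from a line profiler (the line_profiler package in the notebook). If you only have the standard library, cProfile gives a coarser, per-function view that still points at the hot spots. A minimal sketch on a scaled-down input:

```python
import cProfile
import io
import math
import pstats
import random


def generate(seconds, samples_per_second):
    """Scaled-down copy of the naive generator, for profiling."""
    time, signal = [], []
    sample_time = 0
    for _ in range(seconds):
        for _ in range(samples_per_second):
            sample_time += 1 / samples_per_second
            noise = -1 + random.random() * 2
            signal.append(math.sin(2 * math.pi * 10 * sample_time) + noise)
            time.append(sample_time)
    return [time, signal]


# Profile a small run and print the functions sorted by total time
profiler = cProfile.Profile()
profiler.enable()
generate(2, 1000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(10)
print(stream.getvalue())
```

Even this coarser report shows `math.sin`, `random`, and `list.append` each being called millions of times on the full workload, which is the real cost.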

#### Vectorization

The Python community, and the scientific software development community in general, has adopted the term "vectorization" to mean array programming (array-oriented computing): executing your "business logic" directly on whole arrays rather than looping over elements. In Python, NumPy is the go-to library for vectorization.
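As a minimal illustration of the idea, here is the same element-wise operation written as a Python loop and as a single NumPy expression:

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Loop version: one interpreted Python operation per element
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2)

# Vectorized version: one array-level operation, executed in compiled C
doubled_vec = np.array(values) * 2

print(doubled_loop)  # [2.0, 4.0, 6.0, 8.0]
print(doubled_vec)   # [2. 4. 6. 8.]
```

Both produce the same numbers, but the vectorized form moves the per-element work out of the interpreter.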

#### Optimize with NumPy

NumPy is an extension package for array programming in Python. It provides "closer to the hardware" optimization, which in Python means the inner loops run in compiled C code.

Looking at the first implementation, we see that the nested loop is used to build the time list. Let's start by refactoring the time list into a NumPy array.

```python
import math
import random

import numpy as np


def slightlyFasterGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""

    # generate time
    time = np.arange(0, seconds, 1 / samples_per_second)

    # generate signal
    signal = []
    for t in time:
        noise = random.random()
        scaled_noise = -1 + (noise * 2)
        sample = math.sin(2 * math.pi * 10 * t) + scaled_noise
        signal.append(sample)

    # return time and signal
    return [time, signal]
```

This optimization reduced the average run time to about 10 seconds. Since we seem to be heading in the right direction, let's try a full NumPy implementation.

```python
import numpy as np


def reallyFastGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""

    # generate time
    time = np.arange(0, seconds, 1 / samples_per_second)

    # generate signal: uniform noise on (-1, 1] plus a 10 Hz sine wave
    noise = -2 * np.random.random(len(time)) + 1
    signal = np.sin(2 * np.pi * 10 * time) + noise

    # return time and signal
    return [time, signal]
```

WOW! This implementation executes in about 0.1 seconds, two orders of magnitude faster than the previous implementations.
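The averages quoted in this post can be reproduced with the standard library's timeit module (absolute numbers will of course vary by machine). A sketch timing the full NumPy version on the same workload the profiler saw, 3600 seconds at 1000 samples per second:

```python
import timeit

import numpy as np


def reallyFastGenerateTimeSeriesData(seconds, samples_per_second):
    """Fully vectorized generator from above."""
    time = np.arange(0, seconds, 1 / samples_per_second)
    noise = -2 * np.random.random(len(time)) + 1
    signal = np.sin(2 * np.pi * 10 * time) + noise
    return [time, signal]


# Average a few runs for a stable measurement
runs = 3
elapsed = timeit.timeit(
    lambda: reallyFastGenerateTimeSeriesData(3600, 1000), number=runs
) / runs
print(f"average run time: {elapsed:.4f} s")
```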

The NumPy library ships with vectorized versions of most of the mathematical functions in the Python core, of the random module, and much more. In this implementation, the Python math and random calls were replaced with their NumPy equivalents, and the signal was computed directly on NumPy arrays without any loops.
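To make the substitution concrete: math.sin accepts only a scalar, while np.sin applies element-wise to a whole array, and np.random.random draws a whole batch of samples in one call. A quick comparison:

```python
import math

import numpy as np

t = np.arange(0, 1, 0.25)  # a small array of sample times

# math.sin works on one scalar at a time ...
scalar_results = [math.sin(2 * math.pi * 10 * x) for x in t]

# ... while np.sin applies element-wise to the whole array at once
array_results = np.sin(2 * np.pi * 10 * t)

# np.random.random(n) draws n samples from [0, 1) in a single call;
# scaling by -2 and shifting by 1 maps them onto (-1, 1]
batch_noise = -2 * np.random.random(len(t)) + 1

print(np.allclose(scalar_results, array_results))  # True
```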

When it comes to data manipulation and analysis, the day-to-day data science work, you reach for another library. NumPy is the workhorse; pandas, built on top of it, is the tool for working with labeled data.

#### Optimize with pandas

pandas is the library for data manipulation and analysis, and it is usually the starting point for your data science tasks. It lets you read and write data from and to multiple sources, handle missing data, align, reshape, merge, and join data, search it, group it, and slice it: a true Swiss Army knife.
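A tiny, self-contained illustration of a few of those everyday operations (the sensor data here is made up for the example):

```python
import numpy as np
import pandas as pd

# A small frame with a missing value
df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "reading": [1.0, np.nan, 3.0, 4.0],
})

filled = df["reading"].fillna(0.0)                   # handle missing data
by_sensor = df.groupby("sensor")["reading"].mean()   # group and aggregate
subset = df[df["reading"] > 2]                       # boolean slicing

print(by_sensor)
```

Note that `mean` skips the missing value by default, so sensor `a` averages to 1.0.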

Since you will most likely start with pandas, let's refactor the code to use it, beginning with the original implementation and adding pandas on top.

```python
import math
import random

import numpy as np
import pandas as pd


def pandasReallySlowGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""

    # generate time
    time = np.arange(0, seconds, 1 / samples_per_second)

    # create pandas DataFrame
    df = pd.DataFrame(data=time, columns=['time'])

    def generateSignal(t):
        noise = random.random()
        scaled_noise = -1 + (noise * 2)
        return math.sin(2 * math.pi * 10 * t) + scaled_noise

    # generate signal, one row at a time
    df['signal'] = df['time'].apply(generateSignal)

    # return time and signal
    return [df['time'], df['signal']]
```

On average, this implementation runs in 5 seconds. pandas, like NumPy, has been designed for vectorization, so let's update the code to use it.

```python
import numpy as np
import pandas as pd


def pandasFasterSlowGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""

    # generate time
    time = np.arange(0, seconds, 1 / samples_per_second)

    # create pandas DataFrame
    df = pd.DataFrame(data=time, columns=['time'])

    def generateSignal(t):
        noise = -2 * np.random.random(len(t)) + 1
        return np.sin(2 * np.pi * 10 * t) + noise

    # generate signal on the whole column at once
    df['signal'] = generateSignal(df['time'])

    # return time and signal
    return [df['time'], df['signal']]
```

This implementation runs in about 0.12 seconds, so we are back to reasonable running times. pandas also integrates tightly with NumPy, and you can often squeeze out a bit more performance by handing raw NumPy arrays to your computation instead of pandas Series. Here is one such implementation.

```python
import numpy as np
import pandas as pd


def pandasNumpyFastSlowGenerateTimeSeriesData(seconds, samples_per_second):
    """Generate synthetic data"""

    # generate time
    time = np.arange(0, seconds, 1 / samples_per_second)

    # create pandas DataFrame
    df = pd.DataFrame(data=time, columns=['time'])

    def generateSignal(t):
        noise = -2 * np.random.random(len(t)) + 1
        return np.sin(2 * np.pi * 10 * t) + noise

    # generate signal on the underlying NumPy array
    df['signal'] = generateSignal(df['time'].values)

    # return time and signal
    return [df['time'], df['signal']]
```

The only change is calling .values on df['time'], which passes the underlying NumPy ndarray to generateSignal instead of a pandas Series. This implementation is slightly faster under the same test conditions, and the gap grows with more data and additional processing.

```python
# Data types
type(df['time'])         # pandas.core.series.Series
type(df['time'].values)  # numpy.ndarray
```
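The gap between operating on a Series and on its underlying ndarray can be measured directly; a quick sketch (the exact difference depends on the pandas version and the data size, and is small for a single ufunc call like this):

```python
import timeit

import numpy as np
import pandas as pd

# One minute of sample times at 1000 samples per second
series = pd.Series(np.arange(0, 60, 0.001), name="time")

runs = 20
series_time = timeit.timeit(lambda: np.sin(2 * np.pi * 10 * series), number=runs)
values_time = timeit.timeit(lambda: np.sin(2 * np.pi * 10 * series.values), number=runs)

print(f"on the Series:  {series_time:.4f} s")
print(f"on the ndarray: {values_time:.4f} s")
```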

#### Bonus

Since this is already a long post, I’ll get to SymPy in the next post. Here is a visualization of the generated synthetic time-series data.

As you can see, and by construction, this noise data carries no information. In the notebook, I show the initial steps of adding information to the data. The data is presented here as a sound wave, but if you shift your perspective a bit, the same process and data structure apply to other sources of time-series data, such as a heartbeat.
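The plot itself isn't reproduced here, but a similar view of the generated waveform can be recreated with matplotlib. A minimal sketch (the figure size, styling, and zoom window are my own choices, not from the original notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders to a file
import matplotlib.pyplot as plt
import numpy as np

# Generate one second of the synthetic signal
samples_per_second = 1000
time = np.arange(0, 1, 1 / samples_per_second)
noise = -2 * np.random.random(len(time)) + 1
signal = np.sin(2 * np.pi * 10 * time) + noise

# Plot only the first 0.2 s so individual samples stay visible
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(time[:200], signal[:200], linewidth=0.8)
ax.set_xlabel("time (s)")
ax.set_ylabel("amplitude")
ax.set_title("Synthetic time series: 10 Hz sine plus uniform noise")
fig.savefig("waveform.png")
```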

#### Conclusion

The Python ecosystem is full of libraries that have been battle-tested and optimized for data science. Data science tasks are computationally intensive, and writing efficient code from the start, rather than buying into the mantra that premature optimization is the root of all evil, will let you work efficiently in your local environment before moving to the cloud.
