Comprehensive guide to profiling, analyzing, and optimizing Python code for better performance, including CPU profiling, memory optimization, and implementation best practices.
When to Use This Skill
- Identifying performance bottlenecks in Python applications
- Reducing application latency and response times
- Optimizing CPU-intensive operations
- Reducing memory consumption and fixing memory leaks
- Improving database query performance
- Optimizing I/O operations
- Speeding up data processing pipelines
- Implementing high-performance algorithms
- Profiling production applications
Core Concepts
1. Profiling Types
- CPU Profiling: Identify time-consuming functions
- Memory Profiling: Track memory allocation and leaks
- Line Profiling: Profile at line-by-line granularity
- Call Graph: Visualize function call relationships
2. Performance Metrics
- Execution Time: How long operations take
- Memory Usage: Peak and average memory consumption
- CPU Utilization: Processor usage patterns
- I/O Wait: Time spent on I/O operations
3. Optimization Strategies
- Algorithmic: Better algorithms and data structures
- Implementation: More efficient code patterns
- Parallelization: Multi-threading/processing
- Caching: Avoid redundant computation
- Native Extensions: C/Rust for critical paths
Quick Start
Basic Timing
```python
import time

def measure_time():
    """Simple timing measurement."""
    start = time.perf_counter()  # monotonic and higher-resolution than time.time()

    # Your code here
    result = sum(range(1000000))

    elapsed = time.perf_counter() - start
    print(f"Execution time: {elapsed:.4f} seconds")
    return result

# Better: use timeit for accurate measurements
import timeit

execution_time = timeit.timeit(
    "sum(range(1000000))",
    number=100
)
print(f"Average time: {execution_time/100:.6f} seconds")
```
Pattern 1: cProfile - CPU Profiling
```python
import cProfile
import pstats
from pstats import SortKey

def slow_function():
    """Function to profile."""
    total = 0
    for i in range(1000000):
        total += i
    return total

def another_function():
    """Another function."""
    return [i**2 for i in range(100000)]

def main():
    """Main function to profile."""
    result1 = slow_function()
    result2 = another_function()
    return result1, result2

# Profile the code
if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()

    main()

    profiler.disable()

    # Print stats
    stats = pstats.Stats(profiler)
    stats.sort_stats(SortKey.CUMULATIVE)
    stats.print_stats(10)  # Top 10 functions

    # Save to file for later analysis
    stats.dump_stats("profile_output.prof")
```
Command-line profiling:
```bash
# Profile a script
python -m cProfile -o output.prof script.py

# View results
python -m pstats output.prof
# In pstats:
#   sort cumtime
#   stats 10
```
Pattern 2: line_profiler - Line-by-Line Profiling
```python
# Install: pip install line-profiler

# Add @profile decorator (kernprof injects it at runtime)
@profile
def process_data(data):
    """Process data with line profiling."""
    result = []
    for item in data:
        processed = item * 2
        result.append(processed)
    return result

# Run with:
# kernprof -l -v script.py
```
Manual line profiling:
```python
from line_profiler import LineProfiler

def process_data(data):
    """Function to profile."""
    result = []
    for item in data:
        processed = item * 2
        result.append(processed)
    return result

if __name__ == "__main__":
    lp = LineProfiler()
    lp.add_function(process_data)

    data = list(range(100000))

    lp_wrapper = lp(process_data)
    lp_wrapper(data)

    lp.print_stats()
```
Pattern 3: memory_profiler - Memory Usage
```python
# Install: pip install memory-profiler

from memory_profiler import profile

@profile
def memory_intensive():
    """Function that uses lots of memory."""
    # Create large list
    big_list = [i for i in range(1000000)]

    # Create large dict
    big_dict = {i: i**2 for i in range(100000)}

    # Process data
    result = sum(big_list)

    return result

if __name__ == "__main__":
    memory_intensive()

# Run with:
# python -m memory_profiler script.py
```
Pattern 4: py-spy - Production Profiling
```bash
# Install: pip install py-spy

# Profile a running Python process
py-spy top --pid 12345

# Generate flamegraph
py-spy record -o profile.svg --pid 12345

# Profile a script
py-spy record -o profile.svg -- python script.py

# Dump current call stack
py-spy dump --pid 12345
```
Optimization Patterns
Pattern 5: List Comprehensions vs Loops
```python
import timeit

# Slow: Traditional loop
def slow_squares(n):
    """Create list of squares using loop."""
    result = []
    for i in range(n):
        result.append(i**2)
    return result

# Fast: List comprehension
def fast_squares(n):
    """Create list of squares using comprehension."""
    return [i**2 for i in range(n)]

# Benchmark
n = 100000

slow_time = timeit.timeit(lambda: slow_squares(n), number=100)
fast_time = timeit.timeit(lambda: fast_squares(n), number=100)

print(f"Loop: {slow_time:.4f}s")
print(f"Comprehension: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x")

# map only beats a comprehension when paired with a built-in function;
# with a lambda, the per-call overhead usually makes it slower
def map_squares(n):
    """map with a lambda -- benchmark before assuming it is faster."""
    return list(map(lambda x: x**2, range(n)))
```
Pattern 6: Generator Expressions for Memory
```python
import sys

def list_approach():
    """Memory-intensive list."""
    data = [i**2 for i in range(1000000)]
    return sum(data)

def generator_approach():
    """Memory-efficient generator."""
    data = (i**2 for i in range(1000000))
    return sum(data)

# Memory comparison
list_data = [i for i in range(1000000)]
gen_data = (i for i in range(1000000))

print(f"List size: {sys.getsizeof(list_data)} bytes")
print(f"Generator size: {sys.getsizeof(gen_data)} bytes")

# Generators use constant memory regardless of size
```
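To see the constant-memory claim concretely, peak allocation can be measured with the standard-library `tracemalloc` module. A minimal sketch (`peak_memory` is a helper defined here, not a library API):

```python
import tracemalloc

def peak_memory(fn):
    """Return (result, peak bytes allocated) for a callable."""
    tracemalloc.start()
    result = fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak

# The list materializes 100,000 squares at once; the generator
# produces them one at a time while sum() consumes them
list_result, list_peak = peak_memory(lambda: sum([i ** 2 for i in range(100_000)]))
gen_result, gen_peak = peak_memory(lambda: sum(i ** 2 for i in range(100_000)))

print(f"List peak:      {list_peak:,} bytes")
print(f"Generator peak: {gen_peak:,} bytes")
```

Both compute the same total, but the generator's peak allocation stays small no matter how long the range is.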
Pattern 7: String Concatenation
```python
import timeit

def slow_concat(items):
    """Slow string concatenation."""
    result = ""
    for item in items:
        result += str(item)
    return result

def fast_concat(items):
    """Fast string concatenation with join."""
    return "".join(str(item) for item in items)

def faster_concat(items):
    """Even faster with list."""
    parts = [str(item) for item in items]
    return "".join(parts)

items = list(range(10000))

# Benchmark
slow = timeit.timeit(lambda: slow_concat(items), number=100)
fast = timeit.timeit(lambda: fast_concat(items), number=100)
faster = timeit.timeit(lambda: faster_concat(items), number=100)

print(f"Concatenation (+): {slow:.4f}s")
print(f"Join (generator): {fast:.4f}s")
print(f"Join (list): {faster:.4f}s")
```
Pattern 8: Dictionary Lookups vs List Searches
```python
import timeit

# Create test data
size = 10000
items = list(range(size))
lookup_dict = {i: i for i in range(size)}

def list_search(items, target):
    """O(n) search in list."""
    return target in items

def dict_search(lookup_dict, target):
    """O(1) search in dict."""
    return target in lookup_dict

target = size - 1  # Worst case for list

# Benchmark
list_time = timeit.timeit(
    lambda: list_search(items, target),
    number=1000
)
dict_time = timeit.timeit(
    lambda: dict_search(lookup_dict, target),
    number=1000
)

print(f"List search: {list_time:.6f}s")
print(f"Dict search: {dict_time:.6f}s")
print(f"Speedup: {list_time/dict_time:.0f}x")
```
Pattern 9: Local Variable Access
```python
import timeit

# Global variable (slow)
GLOBAL_VALUE = 100

def use_global():
    """Access global variable."""
    total = 0
    for i in range(10000):
        total += GLOBAL_VALUE
    return total

def use_local():
    """Use local variable."""
    local_value = 100
    total = 0
    for i in range(10000):
        total += local_value
    return total

# Local is faster
global_time = timeit.timeit(use_global, number=1000)
local_time = timeit.timeit(use_local, number=1000)

print(f"Global access: {global_time:.4f}s")
print(f"Local access: {local_time:.4f}s")
print(f"Speedup: {global_time/local_time:.2f}x")
```
Pattern 10: Function Call Overhead
```python
import timeit

def calculate_inline():
    """Inline calculation."""
    total = 0
    for i in range(10000):
        total += i * 2 + 1
    return total

def helper_function(x):
    """Helper function."""
    return x * 2 + 1

def calculate_with_function():
    """Calculation with function calls."""
    total = 0
    for i in range(10000):
        total += helper_function(i)
    return total

# Inline is faster due to no call overhead
inline_time = timeit.timeit(calculate_inline, number=1000)
function_time = timeit.timeit(calculate_with_function, number=1000)

print(f"Inline: {inline_time:.4f}s")
print(f"Function calls: {function_time:.4f}s")
```
Advanced Optimization
Pattern 11: NumPy for Numerical Operations
```python
import timeit
import numpy as np

def python_sum(n):
    """Sum using pure Python."""
    return sum(range(n))

def numpy_sum(n):
    """Sum using NumPy."""
    return np.arange(n).sum()

n = 1000000

python_time = timeit.timeit(lambda: python_sum(n), number=100)
numpy_time = timeit.timeit(lambda: numpy_sum(n), number=100)

print(f"Python: {python_time:.4f}s")
print(f"NumPy: {numpy_time:.4f}s")
print(f"Speedup: {python_time/numpy_time:.2f}x")

# Vectorized operations
def python_multiply():
    """Element-wise multiplication in Python."""
    a = list(range(100000))
    b = list(range(100000))
    return [x * y for x, y in zip(a, b)]

def numpy_multiply():
    """Vectorized multiplication in NumPy."""
    a = np.arange(100000)
    b = np.arange(100000)
    return a * b

py_time = timeit.timeit(python_multiply, number=100)
np_time = timeit.timeit(numpy_multiply, number=100)

print(f"\nPython multiply: {py_time:.4f}s")
print(f"NumPy multiply: {np_time:.4f}s")
print(f"Speedup: {py_time/np_time:.2f}x")
```
Pattern 12: Caching with functools.lru_cache
```python
from functools import lru_cache
import timeit

def fibonacci_slow(n):
    """Recursive fibonacci without caching."""
    if n < 2:
        return n
    return fibonacci_slow(n-1) + fibonacci_slow(n-2)

@lru_cache(maxsize=None)
def fibonacci_fast(n):
    """Recursive fibonacci with caching."""
    if n < 2:
        return n
    return fibonacci_fast(n-1) + fibonacci_fast(n-2)

# Massive speedup for recursive algorithms
n = 30

slow_time = timeit.timeit(lambda: fibonacci_slow(n), number=1)
fast_time = timeit.timeit(lambda: fibonacci_fast(n), number=1000)

print(f"Without cache (1 run): {slow_time:.4f}s")
print(f"With cache (1000 runs): {fast_time:.4f}s")

# Cache info
print(f"Cache info: {fibonacci_fast.cache_info()}")
```
Pattern 13: Using __slots__ for Memory
```python
import sys

class RegularClass:
    """Regular class with __dict__."""
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

class SlottedClass:
    """Class with __slots__ for memory efficiency."""
    __slots__ = ['x', 'y', 'z']

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# Memory comparison
regular = RegularClass(1, 2, 3)
slotted = SlottedClass(1, 2, 3)

print(f"Regular class size: {sys.getsizeof(regular)} bytes")
print(f"Slotted class size: {sys.getsizeof(slotted)} bytes")

# Note: getsizeof(regular) does NOT include the instance's __dict__,
# so the real per-instance savings are larger than these numbers suggest.
# The difference becomes significant with many instances:
regular_objects = [RegularClass(i, i+1, i+2) for i in range(10000)]
slotted_objects = [SlottedClass(i, i+1, i+2) for i in range(10000)]
```
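Because `sys.getsizeof` does not count an instance's `__dict__`, `tracemalloc` gives a truer picture of total allocation. A sketch under that approach (the `Point` class names are illustrative):

```python
import tracemalloc

class RegularPoint:
    """Instances carry a per-object __dict__."""
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class SlottedPoint:
    """Fixed attribute slots, no __dict__."""
    __slots__ = ("x", "y", "z")
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

def allocated(factory, n=10_000):
    """Peak bytes allocated while building n instances."""
    tracemalloc.start()
    objs = [factory(i, i + 1, i + 2) for i in range(n)]
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

regular_bytes = allocated(RegularPoint)
slotted_bytes = allocated(SlottedPoint)

print(f"Regular instances: {regular_bytes:,} bytes")
print(f"Slotted instances: {slotted_bytes:,} bytes")
```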
Pattern 14: Multiprocessing for CPU-Bound Tasks
```python
import multiprocessing as mp
import time

def cpu_intensive_task(n):
    """CPU-intensive calculation."""
    return sum(i**2 for i in range(n))

def sequential_processing():
    """Process tasks sequentially."""
    start = time.time()
    results = [cpu_intensive_task(1000000) for _ in range(4)]
    elapsed = time.time() - start
    return elapsed, results

def parallel_processing():
    """Process tasks in parallel."""
    start = time.time()
    with mp.Pool(processes=4) as pool:
        results = pool.map(cpu_intensive_task, [1000000] * 4)
    elapsed = time.time() - start
    return elapsed, results

if __name__ == "__main__":
    seq_time, seq_results = sequential_processing()
    par_time, par_results = parallel_processing()

    print(f"Sequential: {seq_time:.2f}s")
    print(f"Parallel: {par_time:.2f}s")
    print(f"Speedup: {seq_time/par_time:.2f}x")
```
Pattern 15: Async I/O for I/O-Bound Tasks
```python
# Install: pip install aiohttp requests
import asyncio
import time

import aiohttp
import requests

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/1",
]

def synchronous_requests():
    """Synchronous HTTP requests."""
    start = time.time()
    results = []
    for url in urls:
        response = requests.get(url)
        results.append(response.status_code)
    elapsed = time.time() - start
    return elapsed, results

async def async_fetch(session, url):
    """Async HTTP request."""
    async with session.get(url) as response:
        return response.status

async def asynchronous_requests():
    """Asynchronous HTTP requests."""
    start = time.time()
    async with aiohttp.ClientSession() as session:
        tasks = [async_fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    elapsed = time.time() - start
    return elapsed, results

# Async is much faster for I/O-bound work
if __name__ == "__main__":
    sync_time, sync_results = synchronous_requests()
    async_time, async_results = asyncio.run(asynchronous_requests())

    print(f"Synchronous: {sync_time:.2f}s")
    print(f"Asynchronous: {async_time:.2f}s")
    print(f"Speedup: {sync_time/async_time:.2f}x")
```
Database Optimization
Pattern 16: Batch Database Operations
```python
import sqlite3
import time

def create_db():
    """Create test database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return conn

def slow_inserts(conn, count):
    """Insert records one at a time."""
    start = time.time()
    cursor = conn.cursor()
    for i in range(count):
        cursor.execute("INSERT INTO users (name) VALUES (?)", (f"User {i}",))
        conn.commit()  # Commit each insert
    elapsed = time.time() - start
    return elapsed

def fast_inserts(conn, count):
    """Batch insert with single commit."""
    start = time.time()
    cursor = conn.cursor()
    data = [(f"User {i}",) for i in range(count)]
    cursor.executemany("INSERT INTO users (name) VALUES (?)", data)
    conn.commit()  # Single commit
    elapsed = time.time() - start
    return elapsed

# Benchmark
conn1 = create_db()
slow_time = slow_inserts(conn1, 1000)

conn2 = create_db()
fast_time = fast_inserts(conn2, 1000)

print(f"Individual inserts: {slow_time:.4f}s")
print(f"Batch insert: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.2f}x")
```
Pattern 17: Query Optimization
```python
# Use indexes for frequently queried columns
"""
-- Slow: No index
SELECT * FROM users WHERE email = 'user@example.com';

-- Fast: With index
CREATE INDEX idx_users_email ON users(email);
SELECT * FROM users WHERE email = 'user@example.com';
"""

# Use query planning
import sqlite3

conn = sqlite3.connect("example.db")
cursor = conn.cursor()

# Analyze query performance
cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("test@example.com",))
print(cursor.fetchall())

# SELECT only the columns you need
# Slow: SELECT *
# Fast: SELECT id, name
```
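The index advice above can be checked end-to-end with the standard-library `sqlite3` module. A small benchmark sketch (the table and column names are illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(50_000)],
)
conn.commit()

def lookup(n=200):
    """Time n point lookups by email."""
    cur = conn.cursor()
    start = time.perf_counter()
    for i in range(n):
        cur.execute("SELECT id FROM users WHERE email = ?", (f"user{i}@example.com",))
        cur.fetchone()
    return time.perf_counter() - start

no_index = lookup()    # each query is a full table scan
conn.execute("CREATE INDEX idx_users_email ON users(email)")
with_index = lookup()  # each query is a B-tree lookup

print(f"No index:   {no_index:.4f}s")
print(f"With index: {with_index:.4f}s")
```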
Memory Optimization
Pattern 18: Detecting Memory Leaks
```python
import tracemalloc
import gc

_leaked = []  # module-level, so references survive the function call

def memory_leak_example():
    """Example that leaks memory."""
    for i in range(100000):
        # Objects added but never removed -- in real code this would
        # be an unintended long-lived reference
        _leaked.append([i] * 100)

def track_memory_usage():
    """Track memory allocations."""
    tracemalloc.start()

    # Take snapshot before
    snapshot1 = tracemalloc.take_snapshot()

    # Run code
    memory_leak_example()

    # Take snapshot after
    snapshot2 = tracemalloc.take_snapshot()

    # Compare
    top_stats = snapshot2.compare_to(snapshot1, 'lineno')

    print("Top 10 memory allocations:")
    for stat in top_stats[:10]:
        print(stat)

    tracemalloc.stop()

# Monitor memory
track_memory_usage()

# Force garbage collection
gc.collect()
```
Pattern 19: Iterators vs Lists
```python
def process_file_list(filename):
    """Load entire file into memory."""
    with open(filename) as f:
        lines = f.readlines()  # Loads all lines
    return sum(1 for line in lines if line.strip())

def process_file_iterator(filename):
    """Process file line by line."""
    with open(filename) as f:
        return sum(1 for line in f if line.strip())

# Iterator uses constant memory
# List loads entire file into memory
```
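A quick self-contained run of the iterator version, using a throwaway temp file (the sample contents are arbitrary):

```python
import os
import tempfile

def process_file_iterator(filename):
    """Count non-blank lines without loading the whole file."""
    with open(filename) as f:
        return sum(1 for line in f if line.strip())

# Build a small throwaway file to run the function against
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\n\nbeta\n\n\ngamma\n")
    path = tmp.name

try:
    count = process_file_iterator(path)
    print(count)  # 3 non-blank lines
finally:
    os.remove(path)
```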
Pattern 20: Weakref for Caches
```python
import weakref

class CachedResource:
    """Resource that can be garbage collected."""
    def __init__(self, data):
        self.data = data

# Regular cache prevents garbage collection
regular_cache = {}

def get_resource_regular(key):
    """Get resource from regular cache."""
    if key not in regular_cache:
        regular_cache[key] = CachedResource(f"Data for {key}")
    return regular_cache[key]

# Weak reference cache allows garbage collection
weak_cache = weakref.WeakValueDictionary()

def get_resource_weak(key):
    """Get resource from weak cache."""
    resource = weak_cache.get(key)
    if resource is None:
        resource = CachedResource(f"Data for {key}")
        weak_cache[key] = resource
    return resource

# When no strong references exist, objects can be GC'd
```
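That last point can be demonstrated directly: once the only strong reference is dropped, the `WeakValueDictionary` entry disappears. A minimal sketch reusing the `CachedResource` idea:

```python
import gc
import weakref

class CachedResource:
    """Value type stored in the cache."""
    def __init__(self, data):
        self.data = data

weak_cache = weakref.WeakValueDictionary()

resource = CachedResource("payload")  # the only strong reference
weak_cache["key"] = resource
assert weak_cache.get("key") is resource  # alive while referenced

del resource  # drop the strong reference
gc.collect()  # CPython frees immediately; collect() helps other runtimes

print(weak_cache.get("key"))  # None -- the cache did not keep the object alive
```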
Custom Benchmark Decorator
```python
import time
from functools import wraps

def benchmark(func):
    """Decorator to benchmark function execution."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.6f} seconds")
        return result
    return wrapper

@benchmark
def slow_function():
    """Function to benchmark."""
    time.sleep(0.5)
    return sum(range(1000000))

result = slow_function()
```
Benchmarking with pytest-benchmark
```python
# Install: pip install pytest-benchmark

def test_list_comprehension(benchmark):
    """Benchmark list comprehension."""
    result = benchmark(lambda: [i**2 for i in range(10000)])
    assert len(result) == 10000

def test_map_function(benchmark):
    """Benchmark map function."""
    result = benchmark(lambda: list(map(lambda x: x**2, range(10000))))
    assert len(result) == 10000

# Run with: pytest test_performance.py --benchmark-compare
```
Best Practices
- Profile before optimizing - Measure to find real bottlenecks
- Focus on hot paths - Optimize code that runs most frequently
- Use appropriate data structures - Dict for lookups, set for membership
- Avoid premature optimization - Clarity first, then optimize
- Use built-in functions - They're implemented in C
- Cache expensive computations - Use lru_cache
- Batch I/O operations - Reduce system calls
- Use generators for large datasets
- Consider NumPy for numerical operations
- Profile production code - Use py-spy for live systems
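The "use built-in functions" point is easy to verify: `sum` runs in C and typically beats an equivalent Python-level loop. A minimal benchmark sketch:

```python
import timeit

def manual_sum(values):
    """Accumulate in a Python-level loop."""
    total = 0
    for v in values:
        total += v
    return total

values = list(range(100_000))

loop_time = timeit.timeit(lambda: manual_sum(values), number=200)
builtin_time = timeit.timeit(lambda: sum(values), number=200)

print(f"Manual loop:  {loop_time:.4f}s")
print(f"Built-in sum: {builtin_time:.4f}s")
```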
Common Pitfalls
- Optimizing without profiling
- Using global variables unnecessarily
- Not using appropriate data structures
- Creating unnecessary copies of data
- Not using connection pooling for databases
- Ignoring algorithmic complexity
- Over-optimizing rare code paths
- Not considering memory usage
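One common source of unnecessary copies is slicing large byte strings: slicing `bytes` duplicates the data, while a `memoryview` shares the underlying buffer. A small sketch:

```python
import sys

data = bytes(10_000_000)  # 10 MB of zeros

copy_slice = data[:5_000_000]               # slicing bytes copies 5 MB
view_slice = memoryview(data)[:5_000_000]   # shares the original buffer

print(f"bytes slice:      {sys.getsizeof(copy_slice):,} bytes")
print(f"memoryview slice: {sys.getsizeof(view_slice):,} bytes")
```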
Resources
- cProfile: Built-in CPU profiler
- memory_profiler: Memory usage profiling
- line_profiler: Line-by-line profiling
- py-spy: Sampling profiler for production
- NumPy: High-performance numerical computing
- Cython: Compile Python to C
- PyPy: Alternative Python interpreter with JIT