How to implement PyTorch 1D cross-correlation for long signals in the Fourier domain?

Updated: 2023-02-27 13:20:23


I have a series of signals, each of length n = 36,000, on which I need to perform cross-correlation. Currently, my CPU implementation in numpy is a little slow. I've heard PyTorch can greatly speed up tensor operations, and provides a way to perform computations in parallel on the GPU. I'd like to explore this option, but I'm not quite sure how to accomplish it using the framework.

Because of the length of these signals, I'd prefer to perform the cross-correlation operation in the frequency domain.

Normally using numpy I'd perform the operation like so:

import numpy as np
from scipy.fftpack import next_fast_len

signal_length = 36000

# make the signals
signal_1 = np.random.uniform(-1, 1, signal_length)
signal_2 = np.random.uniform(-1, 1, signal_length)

# output target length of crosscorrelation
x_cor_sig_length = signal_length*2 - 1

# get optimized array length for fft computation
fast_length = next_fast_len(x_cor_sig_length)

# move data into the frequency domain. axis=-1 to perform
# along last dimension
fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
fft_2 = np.fft.rfft(signal_2, fast_length, axis=-1)

# take the complex conjugate of one of the spectrums. Which one you choose depends on domain specific conventions
fft_1 = np.conj(fft_1)

fft_multiplied = fft_1 * fft_2

# back to time domain.
prelim_correlation = np.fft.irfft(fft_multiplied, fast_length, axis=-1)

# shift the signal to make it look like a proper crosscorrelation,
# and transform the output to be purely real
final_result = np.real(np.fft.fftshift(prelim_correlation, axes=-1)).astype(np.float64)

Looking at the PyTorch documentation, there doesn't seem to be an equivalent for numpy.conj(). I'm also not sure if/how I need to implement an fftshift after the irfft operation.

So how would you go about writing a 1D cross-correlation in PyTorch using the Fourier method?

A few things to consider.

The Python interpreter is very slow; what those vectorization libraries do is move the workload to a native implementation. To make any difference, you need to be able to perform many operations in a single Python instruction. Evaluating things on the GPU follows the same principle: while the GPU has more compute power, it is slower to copy data to/from the GPU.
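As a small illustration of that principle (my addition, not part of the original answer), one batched numpy call produces the same FFTs as a Python-level loop, with far fewer interpreter round trips; the sizes below are arbitrary:

```python
import numpy as np

# arbitrary sizes, kept small so the check runs quickly
BATCH, L = 100, 3600
signals = np.random.uniform(-1, 1, (BATCH, L))

# looped: one Python-level FFT call per signal
looped = np.stack([np.fft.rfft(s) for s in signals])

# batched: a single call, the FFT is applied along the last axis
batched = np.fft.rfft(signals, axis=-1)

assert np.allclose(looped, batched)
```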

I adapted your example to process multiple signals simultaneously.

import numpy as np
def numpy_xcorr(BATCH=1, signal_length=36000, factors=[2, 3, 5], dtype=np.float64):
    # make the signals
    signal_1 = np.random.uniform(-1, 1, (BATCH, signal_length)).astype(dtype)
    signal_2 = np.random.uniform(-1, 1, (BATCH, signal_length)).astype(dtype)

    # output target length of crosscorrelation
    x_cor_sig_length = signal_length*2 - 1

    # get optimized array length for fft computation
    # (next_fast_len is defined below)
    fast_length = next_fast_len(x_cor_sig_length, factors)

    # move data into the frequency domain. axis=-1 to perform
    # along last dimension
    fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
    # add a scaled copy of signal_1 just to make the cross correlation more interesting
    fft_2 = np.fft.rfft(signal_2 + 0.1 * signal_1, fast_length, axis=-1)

    # take the complex conjugate of one of the spectrums.
    fft_1 = np.conj(fft_1)

    fft_multiplied = fft_1 * fft_2

    # back to time domain.
    prelim_correlation = np.fft.irfft(fft_multiplied, fast_length, axis=-1)

    # shift the signal to make it look like a proper crosscorrelation,
    # and transform the output to be purely real
    final_result = np.fft.fftshift(np.real(prelim_correlation), axes=-1)
    return final_result, np.sum(final_result)
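As a quick sanity check (my addition, not part of the original answer), the conjugate-multiply-in-frequency route can be verified against the direct definition of circular cross-correlation on a small example; the sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8
N = 2 * L - 1          # full linear cross-correlation length

a = rng.uniform(-1, 1, L)
b = rng.uniform(-1, 1, L)

# FFT route: irfft(conj(rfft(a)) * rfft(b)) is the circular
# cross-correlation of the zero-padded signals
xc_fft = np.fft.irfft(np.conj(np.fft.rfft(a, N)) * np.fft.rfft(b, N), N)

# direct definition: xc[k] = sum_n a[n] * b[(n + k) mod N]
ap = np.pad(a, (0, N - L))
bp = np.pad(b, (0, N - L))
xc_direct = np.array([np.sum(ap * np.roll(bp, -k)) for k in range(N)])

assert np.allclose(xc_fft, xc_direct)
```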

Since torch 1.7 we have the torch.fft module, which provides an interface similar to numpy.fft; fftshift is missing, but the same result can be obtained with torch.roll. Another point is that numpy uses 64-bit precision by default, while torch defaults to 32-bit precision.
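The roll-based replacement can be sketched in numpy (torch.roll with `dims=-1` has the same semantics): shifting circularly by `n // 2` along the transformed axis reproduces fftshift for both even and odd lengths. This check is my addition:

```python
import numpy as np

# fftshift is a circular shift by n // 2 along the given axis,
# which torch.roll(x, n // 2, dims=-1) performs in torch
for n in (5, 6, 71999):
    x = np.arange(n)
    assert np.array_equal(np.fft.fftshift(x), np.roll(x, n // 2))
```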

The fast length is chosen among smooth numbers (those that factor into small primes; I suppose you are familiar with this subject).

def next_fast_len(n, factors=[2, 3, 5, 7]):
    '''
      Returns the minimum integer not smaller than n that can
      be written as a product (possibly with repetitions) of
      the given factors.
    '''
    best = float('inf')
    stack = [1]
    while stack:
        a = stack.pop()
        if a >= n:
            if a < best:
                best = a
                if best == n:
                    break  # no reason to keep searching
        else:
            for p in factors:
                b = a * p
                if b < best:
                    stack.append(b)
    return best
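For a self-contained sanity check (my addition; the function is repeated so the snippet runs on its own): the smallest 5-smooth length not below 2*36000 - 1 = 71999 is 72000 = 2**6 * 3**2 * 5**3.

```python
def next_fast_len(n, factors=[2, 3, 5, 7]):
    # same search as above: smallest product of the given
    # factors that is not smaller than n
    best = float('inf')
    stack = [1]
    while stack:
        a = stack.pop()
        if a >= n:
            if a < best:
                best = a
                if best == n:
                    break
        else:
            for p in factors:
                b = a * p
                if b < best:
                    stack.append(b)
    return best

print(next_fast_len(71999, factors=[2, 3, 5]))  # -> 72000 == 2**6 * 3**2 * 5**3
```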

Then the torch implementation goes as follows:

import torch
import torch.fft

def torch_xcorr(BATCH=1, signal_length=36000, device='cpu', factors=[2, 3, 5], dtype=torch.float):
    # torch.rand is random in the range (0, 1)
    signal_1 = 1 - 2*torch.rand((BATCH, signal_length), device=device, dtype=dtype)
    signal_2 = 1 - 2*torch.rand((BATCH, signal_length), device=device, dtype=dtype)

    # just make the cross correlation more interesting
    signal_2 += 0.1 * signal_1

    # output target length of crosscorrelation
    x_cor_sig_length = signal_length*2 - 1

    # get optimized array length for fft computation
    fast_length = next_fast_len(x_cor_sig_length, factors)

    # the FFT is taken along the last dimension
    fft_1 = torch.fft.rfft(signal_1, fast_length, dim=-1)
    fft_2 = torch.fft.rfft(signal_2, fast_length, dim=-1)

    # take the complex conjugate of one of the spectrums. Which one you choose depends on domain specific conventions
    fft_multiplied = torch.conj(fft_1) * fft_2

    # back to time domain.
    prelim_correlation = torch.fft.irfft(fft_multiplied, fast_length, dim=-1)

    # shift the signal to make it look like a proper crosscorrelation,
    # and transform the output to be purely real
    final_result = torch.roll(prelim_correlation, fast_length//2, dims=-1)
    return final_result, torch.sum(final_result)

And here is the code to test the results:

import time
funcs = {'numpy-f64': lambda b: numpy_xcorr(b, factors=[2,3,5], dtype=np.float64), 
         'numpy-f32': lambda b: numpy_xcorr(b, factors=[2,3,5], dtype=np.float32), 
         'torch-cpu-f64': lambda b: torch_xcorr(b, device='cpu', factors=[2,3,5], dtype=torch.float64), 
         'torch-cpu': lambda b: torch_xcorr(b, device='cpu', factors=[2,3,5], dtype=torch.float32), 
         'torch-gpu-f64': lambda b: torch_xcorr(b, device='cuda', factors=[2,3,5], dtype=torch.float64),
         'torch-gpu': lambda b: torch_xcorr(b, device='cuda', factors=[2,3,5], dtype=torch.float32),
         }
times ={}
for batch in [1, 10, 100]:
    times[batch] = {}
    for l, f in funcs.items():
        t0 = time.time()
        t1, t2 = f(batch)
        tf = time.time()
        del t1
        del t2
        times[batch][l] = 1000 * (tf - t0) / batch  # ms per signal

I obtained the following results (the timing plot is not reproduced in this copy):

What surprised me is the result when the lengths are not so smooth: e.g. with a 17-smooth length, the torch implementation is so much better that I used a logarithmic scale there (with batch size 100, torch on the GPU was 10000 times faster than numpy with batch size 1).

Remember that these functions generate the data on the GPU; in general, we want to copy the final results back to the CPU. If we account for the time spent copying the final result to the CPU, I observed times up to 10x higher than the cross-correlation computation itself (random data generation + three FFTs).