How to calculate the distance between zip codes in R using doParallel?

Updated: 2023-02-01 23:04:30

R is a vectorized language, so the function operates on all elements of the vectors at once. Since you are calculating the distance between the origin and the destination for each row, the loop is unnecessary. The vectorized approach is approximately 1000 times faster than the loop.
Calling distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm function should also improve performance.

Without any sample data, this snippet is untested.

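library(dplyr)      # provides select()
library(geosphere)  # provides distVincentyEllipsoid()

zipdata <- select(fulldata, originlat, originlong, destlat, destlong)

## Very basic approach: one distance (in meters) per row.
## cbind() builds two-column (longitude, latitude) matrices;
## c() would flatten all coordinates into a single vector.
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat),
                                       cbind(zipdata$destlong, zipdata$destlat))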
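Since the snippet above has no sample data, here is a minimal sketch that checks the idea on a tiny data frame. The name zipsample and its coordinates are made up for illustration and are not from the original question. It compares a row-by-row loop with a single vectorized call and confirms that both return the same distances.

```r
library(geosphere)

# Made-up sample data: three origin/destination pairs in decimal degrees.
zipsample <- data.frame(
  originlong = c(-74.0, -87.6, -77.0),
  originlat  = c( 40.7,  41.9,  38.9),
  destlong   = c(-80.2, -90.2, -71.1),
  destlat    = c( 25.8,  38.6,  42.4)
)

# Two-column matrices in (longitude, latitude) order.
origin <- cbind(zipsample$originlong, zipsample$originlat)
dest   <- cbind(zipsample$destlong,  zipsample$destlat)

# Loop version: one function call per row.
dist_loop <- numeric(nrow(zipsample))
for (i in seq_len(nrow(zipsample))) {
  dist_loop[i] <- distVincentyEllipsoid(origin[i, ], dest[i, ])
}

# Vectorized version: a single call over all rows.
dist_vec <- distVincentyEllipsoid(origin, dest)

# Compare the two results (distances are in meters).
all.equal(dist_loop, dist_vec)
```

On three rows the timing difference is negligible; the speedup only shows up on large inputs, where the per-call overhead of the loop dominates.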

Note: for most geosphere functions to work properly, the correct order is longitude first, then latitude.
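To illustrate why the order matters, the sketch below computes the same distance twice, once with (longitude, latitude) and once with the coordinates swapped. The two points are approximate, illustrative coordinates chosen so that the swapped values are still valid latitudes, which means the mistake produces a different number rather than an obvious failure.

```r
library(geosphere)

# Illustrative points in decimal degrees (approximate city coordinates).
nyc_lon <- -74.0; nyc_lat <- 40.7
mia_lon <- -80.2; mia_lat <- 25.8

# Correct order: c(longitude, latitude).
d_lonlat <- distHaversine(c(nyc_lon, nyc_lat), c(mia_lon, mia_lat))

# Swapped order: the call still runs, but it measures the distance
# between two entirely different points on the globe.
d_latlon <- distHaversine(c(nyc_lat, nyc_lon), c(mia_lat, mia_lon))

# Show both results side by side, converted from meters to kilometers.
c(lonlat = d_lonlat, latlon = d_latlon) / 1000
```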

The tidyverse approach listed above is slow because the distm function calculates the distance between every origin and every destination, which would result in a 2 million by 2 million element matrix.
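The difference can be sketched on three made-up points. distm returns a full 3 by 3 matrix, and only its diagonal holds the origin-to-destination distance for each row, which is what the direct vectorized call returns. The last lines estimate the memory a 2 million by 2 million matrix of doubles would need, assuming 8 bytes per element.

```r
library(geosphere)

# Made-up coordinates, columns in (longitude, latitude) order.
origin <- cbind(c(-74.0, -87.6, -77.0), c(40.7, 41.9, 38.9))
dest   <- cbind(c(-80.2, -90.2, -71.1), c(25.8, 38.6, 42.4))

# distm computes every origin against every destination.
all_pairs <- distm(origin, dest, fun = distHaversine)
dim(all_pairs)  # 3 rows by 3 columns

# The row-wise distances are only the diagonal of that matrix.
row_wise <- distHaversine(origin, dest)
all.equal(diag(all_pairs), row_wise)

# Rough memory estimate for n = 2 million rows, 8 bytes per double.
n <- 2e6
n^2 * 8 / 1e12  # terabytes
```

With n = 2 million the estimate comes to about 32 TB, while the row-wise result is a single vector of 2 million numbers.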