且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

根据参数或数据框行下方列中的差异对df进行细分或分组?

更新时间:2023-11-26 10:36:40

我是通过


  1. 计算相对于第一行的距离(以英里为单位)和时间


    • 我的逻辑


      • 如果n行位于第一行的 x分钟/英里之内,则那n行位于之内 x彼此的分钟/英里

  1. calculating the distances (in miles) and time relative to the first row
    • My logic
      • if n rows are within x minutes/miles of the first row, then those n rows are within x minutes/miles of each other

生成一些虚拟数据

  • random co-ordinates
# Generate random Lat-Long points
def newpoint():
   return uniform(-180,180), uniform(-90, 90)
home_points = (newpoint() for x in range(289))
work_points = (newpoint() for x in range(289))

df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude'])
df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points)

# Insert `ID` column as sequence of integers
df.insert(0, 'ID', range(289))

# Generate random datetimes, separated by 5 minute intervals
# (you can choose your own interval)
times = pd.date_range('2012-10-01', periods=289, freq='5min')
df.insert(1, 'Created_Time', times)
print(df.head())

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude
0   0 2012-10-01 00:00:00      -48.885981     -39.412351      -68.756244      24.739860
1   1 2012-10-01 00:05:00       58.584893      59.851739     -119.978429     -87.687858
2   2 2012-10-01 00:10:00      -18.623484      85.435248      -14.204142      -3.693993
3   3 2012-10-01 00:15:00      -29.721788      71.671103      -69.833253     -12.446204
4   4 2012-10-01 00:20:00      168.257968     -13.247833       60.979050     -18.393925

使用Haversine距离公式(向量化的Haversine距离公式,以公里为单位

Create Python helper function with haversine distance formula (vectorized haversine distance formula, in km)

def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371):
    """
    slightly modified version: of http://***.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.

    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

使用Haversine公式计算距离(相对于第一行),然后将km转换为英里

Calculate distances (relative to first row) in km, using haversine formula. Then, convert km to miles

df['Home_dist_miles'] = \
    haversine(df.Home_Longitude, df.Home_Latitude,
                 df.loc[0, 'Home_Longitude'], df.loc[0, 'Home_Latitude'])*0.621371
df['Work_dist_miles'] = \
    haversine(df.Work_Longitude, df.Work_Latitude,
                 df.loc[0, 'Work_Longitude'], df.loc[0, 'Work_Latitude'])*0.621371

计算时差,以分钟为单位(相对于第一行)


  • 这里的虚拟数据,时间差将是5分钟的倍数(但在实际数据中,它们可能是任意值)

df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time']
df['time_min'] = (df['time'].dt.days * 24 * 60 * 60 + df['time'].dt.seconds)/60

应用过滤器(方法1),然后选择满足OP中所述条件的任意两行

Apply filters (method 1) and then select any 2 rows that satisfy the conditions stated in the OP

home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles
work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles
time_filter = df['time_min']<=25 # within 25 minutes
df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)]

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0

应用过滤器(方法2),然后选择任何满足以下条件的行在OP中

multi_query = """Home_dist_miles<=12000 & \
                Work_dist_miles<=8000 & \
                time_min<=25"""
df_filtered = df.query(multi_query)

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0