且构网

Does Hive preserve file order when selecting data

Updated: 2023-11-18 20:54:52

Without ORDER BY, the order is not guaranteed.

Data is read in parallel by many processes (mappers). After the splits are calculated, each process starts reading a piece of a file or several files, depending on the splits it was assigned.

The parallel processes can handle different volumes of data and run on different nodes, and the load is not the same each time. They therefore start returning rows and finish at different times, depending on many factors such as node load, network load, and the volume of data per process.
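The effect described above can be imitated outside Hive. The following is only a rough Python sketch, not how Hive is implemented: threads stand in for mappers, a random sleep stands in for node and network load, and the names ROWS, make_splits, read_split and parallel_read are hypothetical helpers invented for this illustration.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical data: (line_number, value) pairs standing in for rows of a file.
ROWS = [(i, f"row-{i}") for i in range(12)]


def make_splits(rows, split_size):
    """Cut the row list into consecutive chunks, one per simulated mapper."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def read_split(split):
    """Simulated mapper: a random delay stands in for node and network load."""
    time.sleep(random.uniform(0.0, 0.05))
    return list(split)


def parallel_read(rows, split_size=3):
    """Return rows in the order the simulated mappers happen to finish."""
    splits = make_splits(rows, split_size)
    result = []
    with ThreadPoolExecutor(max_workers=len(splits)) as pool:
        futures = [pool.submit(read_split, s) for s in splits]
        # as_completed yields futures as they finish, not as they were submitted.
        for future in as_completed(futures):
            result.extend(future.result())
    return result


if __name__ == "__main__":
    arrived = parallel_read(ROWS)
    print("arrival order:", [n for n, _ in arrived])
    # Sorting on an explicit key plays the role of ORDER BY.
    print("after sort:   ", [n for n, _ in sorted(arrived)])
```

Running this several times will usually print a different arrival order on each run, while the sorted output stays the same. Every row always arrives; only the sequence varies, which is why an explicit sort key is needed when order matters.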

Removing all of these factors makes the order more predictable. For example, a single-threaded sequential read of a file may return rows in the same order as they appear in the file. But this is not how the database works.
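For contrast, here is a minimal Python sketch of the single-threaded case. It reads a plain text file directly and does not involve Hive at all; sequential_read is a hypothetical helper and the sample rows are made up for the illustration.

```python
import os
import tempfile


def sequential_read(path):
    """Read a text file line by line in one thread, keeping file order."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    # Hypothetical sample data written to a temporary file.
    lines = [f"row-{i}" for i in range(5)]
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("\n".join(lines) + "\n")
    try:
        # One reader and one file: rows come back in the order they were written.
        print(sequential_read(tmp.name) == lines)
    finally:
        os.remove(tmp.name)
```

With a single reader there is nothing to race against, so the order is stable. A query engine gives no such promise unless the query asks for an order explicitly.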

Also, according to Codd's relational theory, the order of columns and rows is immaterial.