
Export a SQL Server table to multiple part files



I need to export a fairly large SQL Server table (~100GB) to a CSV file. Rather than the output being a single CSV file, it should ideally be multiple files, say 10 files of 10GB each.

I see BCP has a batch_size argument, but this still writes all data to a single file? Are there other free utilities for doing what I require, where the file size can be specified in bytes or in number of rows?

For a bit of context, this is so the data can be combined with other sources in a Hive/Hadoop platform, so if there are better ways of exporting the data I'm open to suggestions.

BCP's batch_size argument does not control the output, unfortunately.

Ways I've done this type of splitting:

1 - Simple but non-repeatable: Create a command file (.cmd) that runs multiple BCP commands over the table for specific row ranges. This probably requires an IDENTITY(1,1)-based primary key on the table.

bcp "SELECT * FROM MyTable WHERE Id BETWEEN 0 AND 10000000" queryout …  
bcp "SELECT * FROM MyTable WHERE Id BETWEEN 10000000 AND 20000000" queryout …   

2 - Simple and repeatable, uses a lot of disk: BCP out the entire table to a single file and use split to create as many new files as needed, each with a given number of bytes; 10737418240 bytes below is 10GB (note: splitting by lines would be a better idea IMO; see the sketch after the file listing below). Use Cygwin (GnuWin32 is no longer maintained) to install split and any other utilities you want.

 bcp MyDb.MySchema.MyTable out C:\MyFile.csv -T -w  
 split -b 10737418240 C:\MyFile.csv C:\MySplitFile_  

This generates the following files:

 C:\MySplitFile_aaa
 C:\MySplitFile_aab
 …
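
As noted above, a line-based split avoids cutting a row in half at a file boundary. A minimal sketch, assuming roughly 10 million rows per part (-l counts lines and -d uses numeric suffixes, both standard in GNU split; forward slashes keep Cygwin's bash from treating the backslashes as escapes):

 split -l 10000000 -d C:/MyFile.csv C:/MySplitFile_

One caveat: if the file was exported with -w it is UTF-16LE, where newlines are two bytes, so convert it to UTF-8 (see the final note) before doing a line-based split.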

3 - Complex but repeatable, requires possibly insecure T-SQL: Use the xp_cmdshell extended stored procedure to call BCP inside a stored procedure that iterates through the table; a fuller sketch follows the skeleton below.

 DECLARE @loop AS INT;   
 --Use WHILE to loop as needed--   
 DECLARE @sql AS VARCHAR(MAX);   
 --Add code to generate dynamic SQL here--   
 DECLARE @bcp AS VARCHAR(MAX);   
 SELECT @bcp='BCP "'+@sql+'" queryout C:\MyFolder\MyFile_'+CAST(@loop AS VARCHAR(10))+'.csv';   
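
A fuller sketch of that loop, with assumed table name, row ranges, and output folder; xp_cmdshell must be enabled on the instance, which is part of why this approach is possibly insecure:

 DECLARE @loop AS INT = 0;
 DECLARE @sql AS VARCHAR(MAX);
 DECLARE @bcp AS VARCHAR(MAX);
 WHILE @loop < 10 --10 parts of ~10M rows each (assumed)
 BEGIN
     --Build the range query for this part
     SET @sql = 'SELECT * FROM MyDb.MySchema.MyTable WHERE Id BETWEEN '
              + CAST(@loop * 10000000 AS VARCHAR(20)) + ' AND '
              + CAST(@loop * 10000000 + 9999999 AS VARCHAR(20));
     --Build and run the BCP command line (CAST avoids an int-to-string conversion error)
     SET @bcp = 'BCP "' + @sql + '" queryout C:\MyFolder\MyFile_'
              + CAST(@loop AS VARCHAR(10)) + '.csv -T -w';
     EXEC master..xp_cmdshell @bcp;
     SET @loop = @loop + 1;
 END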

FINAL NOTE: If you are using any NVARCHAR fields in your data, then you need to use the -w flag, and be aware that the output will be UTF-16LE. I would strongly recommend converting that to UTF-8 using iconv (from Cygwin again) before trying to do anything with it in Hadoop.
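
The iconv conversion might look like this; the file names are placeholders, and -f and -t give the source and target encodings:

 iconv -f UTF-16LE -t UTF-8 C:/MyFile.csv > C:/MyFile_utf8.csv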