Updated: 2023-11-03 08:54:22
I'd hazard a guess that the invalid rows' locations are not known. In such a case, it might be sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit from manipulation, that can be done before reading it into R.
A file as large as 3.5 GiB is a bit on the large side to read into memory as such. Sure, it can be done in this age of 64-bit systems, but for simple row processing it's unwieldy. A scalable solution uses .NET methods and a row-by-row approach.
To process a file row by row, use .NET methods for efficient line reading. A StringBuilder is created to store lines that contain valid data; the others are discarded. The StringBuilder is flushed to disk every so often. Even in the age of SSDs, a write operation per row is relatively slow compared to writing in bulk, say, 10,000 rows at a time.
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
$columnCount = 30
while($null -ne ($line = $reader.ReadLine())) {
    # Split the line on semicolons
    $elements = $line -split ';'
    # If there were $columnCount elements, add the line to the builder
    if($elements.Count -eq $columnCount) {
        # If $line's contents need modifications, do it here
        # before adding it into the builder
        [void]$sb.AppendLine($line)
        ++$i
    }
    # Write builder contents into the file every now and then.
    # -NoNewline is used because AppendLine already added a newline
    # after each row; without it, every flush would leave a blank line.
    if($i -ge $MaxRows) {
        Add-Content "MyCleanCsvFile.csv" $sb.ToString() -NoNewline
        [void]$sb.Clear()
        $i = 0
    }
}
$reader.Close()
# Flush the builder after the loop if there's data left
if($sb.Length -gt 0) {
    Add-Content "MyCleanCsvFile.csv" $sb.ToString() -NoNewline
}