且构网

Sharing the things programmers deal with in development...

Delete unwanted records and strings based on multiple conditions using a Python or R script

Updated: 2023-12-05 11:23:40


I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete the unnecessary records (rows) and strings based on multiple conditions, using a Python or R script, and save the remaining records into a new .csv file named resultFile.csv.

What I want to do is as follows:

  1. Delete the first column.

  2. Split column BB into two columns named a_id and b_id. Split the value at the _ (underscore): the left side goes to a_id and the right side goes to b_id.

  3. Keep only records that have the .csv file extension in the BB (file name) column and do not contain No Bi in the CC column.

  4. Assign new name to each of the columns.

  5. Delete the records that contain strings like less in the CC column.

  6. Trim all other unnecessary strings from the records.

  7. Delete the remaining fields of each row once "Mi" is found in that row.

My fileOne.csv is as follows:

   AA      BB       CC       DD     EE      FF    GG
   1       1_1.csv  (=0      =10"   27"     =57   "Mi"
   0.97    0.9      0.8      NaN    0.9     od    0.2
   2       1_3.csv  (=0      =10"   27"     "Mi"  0.5
   0.97    0.5      0.8      NaN    0.9     od    0.4
   3       1_6.csv  (=0      =10"   "Mi"    =53   cnt
   0.97    0.9      0.8      NaN    0.9     od    0.6
   4       2_6.csv  No Bi    000    000     000   000
   5       2_8.csv  No Bi    000    000     000   000
   6       6_9.csv  less     000    000     000   000
   7       7_9.csv  s(=0     =26"   =46"    "Mi"  121     

My first expected result file would be as follows:

a_id    b_id    CC    DD    EE    FF    GG             
1       1       0     10    27    57    Mi              
1       3       0     10    27    Mi    0.5
1       6       0     10    Mi    53    cnt 
7       9       0     26    46    Mi    121  

My final expected result file would be as follows:

a_id    b_id    CC    DD    EE    FF    GG             
1       1       0     10    27    57              
1       3       0     10    27
1       6       0     10 
7       9       0     26    46  

This can be achieved with the following Python script:

import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

sanitise_table = string.maketrans("","")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)       # Keep digits

with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    input_header = next(csv_input)      # Skip the original header row
    csv_output.writerow(output_header)

    for row in csv_input:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])

        if bb and row[2] not in ['No Bi', 'less']:
            # Blank 'Mi' and all columns after it, if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass

            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)
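
The script above relies on string.maketrans and the two-argument form of str.translate, both of which exist only in Python 2. A minimal Python 3 sketch of the same filtering logic might look like the following. It is an adaptation, not the original answer's code: the helper names clean_row and clean_file are hypothetical, and re.sub(r'\D', '', cell) is swapped in for the translate tables to keep digits only.

```python
import csv
import re

OUTPUT_HEADER = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

# Matches file names such as "1_3.csv" and captures the two numeric parts.
FILE_PATTERN = re.compile(r'(\d+)_(\d+)\.csv')


def clean_row(row):
    """Return the cleaned row, or None if the row should be dropped.

    Hypothetical helper: mirrors the per-row logic of the Python 2 script.
    """
    if len(row) < 3:
        return None

    bb = FILE_PATTERN.match(row[1])
    if not bb or row[2] in ('No Bi', 'less'):
        return None

    # Blank 'Mi' and every column after it, if present.
    if 'Mi' in row:
        mi = row.index('Mi')
        row = row[:mi] + [''] * (len(row) - mi)

    # Keep digits only (assumption: re.sub replaces the translate tables).
    row = [re.sub(r'\D', '', cell) for cell in row]
    row[0] = bb.group(1)
    row[1] = bb.group(2)
    return row


def clean_file(src, dst):
    """Hypothetical wrapper: filter src into dst using clean_row."""
    # Python 3's csv module expects text-mode files opened with newline=''.
    with open(src, newline='') as f_input, \
            open(dst, 'w', newline='') as f_output:
        csv_input = csv.reader(f_input)
        csv_output = csv.writer(f_output)

        next(csv_input, None)               # Skip the original header row
        csv_output.writerow(OUTPUT_HEADER)

        for row in csv_input:
            cleaned = clean_row(row)
            if cleaned is not None:
                csv_output.writerow(cleaned)
```

Calling clean_file('fileOne.csv', 'resultFile.csv') would then write the filtered rows, assuming the input is comma-separated as csv.reader expects by default.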

To simply blank the Mi cell and every column after it in an existing file, the following can be used:

import csv

with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        try:
            mi = row.index('Mi')
            row[:] = row[:mi] + [''] * (len(row) - mi)
        except ValueError:
            pass

        csv_output.writerow(row)

The two Python 2 scripts in this answer were tested using Python 2.7.9.
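
The 'wb' output mode used above works with the csv module only under Python 2. A minimal Python 3 sketch of the Mi truncation, under that assumption, could look like this; blank_from_mi and strip_mi_columns are hypothetical names introduced for illustration.

```python
import csv


def blank_from_mi(row, marker='Mi'):
    """Return a copy of row with marker and all later cells blanked."""
    try:
        mi = row.index(marker)
    except ValueError:
        # Marker not present: leave the row unchanged.
        return list(row)
    return row[:mi] + [''] * (len(row) - mi)


def strip_mi_columns(src, dst):
    """Copy src to dst, blanking each row from its 'Mi' cell onwards."""
    # Python 3's csv module expects text-mode files opened with newline=''.
    with open(src, newline='') as f_input, \
            open(dst, 'w', newline='') as f_output:
        csv_output = csv.writer(f_output)
        for row in csv.reader(f_input):
            csv_output.writerow(blank_from_mi(row))
```

Keeping the row length constant (blank cells rather than a shorter list) matches the behaviour of the scripts above, so every output row still has the same number of columns.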