且构网

Python: saving a file to CSV

Updated: 2023-02-02 21:13:39


I have the following code, which reads in Twitter tweets, processes them, and should then save the result to a new file.

This is the code:

#import regex
import re

#start process_tweet
def processTweet(tweet):
    # process the tweets

    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))','URL',tweet)
    #Convert @username to AT_USER
    tweet = re.sub('@[^\s]+','AT_USER',tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end

#Read the tweets one by one and process it
input = open('withoutEmptylines.csv', 'rb')
output = open('editedTweets.csv','wb')

line = input.readline()

while line:
    processedTweet = processTweet(line)
    print (processedTweet)
    output.write(processedTweet)
    line = input.readline()

input.close()
output.close()

The data in my input file looks like this, with one tweet per line:

She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
BMW Sees U.S. As Top Market For 2015 i8 http://t.co/kkFyiBDcaP

My function works, but I am not happy with the output, which looks like this:

she wants to ride my bmw the go for a ride in my bmw lol URL rt AT_USER Ðun bmw es mucho? yo: bmw. -AT_USER veeergaaa!. hahahahahahahahaha nos hiciste la noche caray! 

It puts everything on one line instead of keeping one tweet per line, as in the input file.

Does anyone have an idea how to get each tweet on its own line?

With an example file like this:

tweet number one
tweet number two
tweet number three

This code:

file = open('tweets.txt')
for line in file:
    print(line)

Produces this output:

tweet number one

tweet number two

tweet number three

Python is reading in the line endings just fine, but your script is replacing them via regular expression substitution.

This regex substitution:

tweet = re.sub('[\s]+', ' ', tweet)

is converting every run of whitespace characters (e.g. tabs and newlines) into a single space, and that includes the newline at the end of each tweet.

Either add a newline to the tweet before you write it out, or modify your regex so that it does not replace newlines, like so:
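To see the difference concretely, here is a minimal sketch that applies both patterns to one made-up line. The sample string and the variable names are illustrative assumptions, not taken from the question's data.

```python
import re

# Illustrative input: a double space, a tab, and a trailing newline.
line = "tweet  number\tone\n"

# \s matches spaces, tabs and newlines, so the trailing "\n" becomes a space.
collapsed_all = re.sub(r'[\s]+', ' ', line)

# A class containing only a space leaves the tab and the newline untouched.
collapsed_spaces = re.sub(r'[ ]+', ' ', line)

print(repr(collapsed_all))
print(repr(collapsed_spaces))
```

Note that the space-only pattern also stops collapsing tabs, which may or may not matter for your data.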

tweet = re.sub('[ ]+', ' ', tweet)

EDIT: I had left my test substitution command in there. The suggestion has been fixed.
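The first option (adding the newline back before writing) could look roughly like the sketch below. It assumes Python 3 and text-mode files rather than the 'rb'/'wb' modes in the question. The names process_tweet, process_lines and convert_file are hypothetical, and the extra strip() of whitespace is an addition that is not in the original function.

```python
import re

def process_tweet(tweet):
    """Hypothetical variant of processTweet that returns one clean line."""
    tweet = tweet.lower()
    tweet = re.sub(r'(www\.[^\s]+)|(https?://[^\s]+)', 'URL', tweet)
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # This still turns the trailing newline into a space ...
    tweet = re.sub(r'[\s]+', ' ', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    # ... so strip surrounding whitespace (an added step), then quotes.
    return tweet.strip().strip('\'"')

def process_lines(lines):
    # Put exactly one newline back on every processed tweet.
    return [process_tweet(line) + '\n' for line in lines]

def convert_file(src_path, dst_path):
    # Text mode, so each line is a str and newlines are handled for us.
    with open(src_path) as src, open(dst_path, 'w') as dst:
        dst.writelines(process_lines(src))
```

With the file names from the question, this would be called as convert_file('withoutEmptylines.csv', 'editedTweets.csv'). Keeping the `[\s]+` pattern and re-adding the newline still collapses tabs, which the `[ ]+` variant would not.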