Updated: 2023-02-02 21:13:39
I have the following code that reads in Twitter tweets, processes them, and then saves the result to a new file.
This is the code:
#import regex
import re
#start process_tweet
def processTweet(tweet):
    # process the tweets
    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','URL',tweet)
    #Convert @username to AT_USER
    tweet = re.sub('@[^\s]+','AT_USER',tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end
#Read the tweets one by one and process it
input = open('withoutEmptylines.csv', 'rb')
output = open('editedTweets.csv','wb')
line = input.readline()
while line:
    processedTweet = processTweet(line)
    print (processedTweet)
    output.write(processedTweet)
    line = input.readline()
input.close()
output.close()
The data in my input file looks like this, with one tweet per line:
She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
BMW Sees U.S. As Top Market For 2015 i8 http://t.co/kkFyiBDcaP
My function works, but I am not happy with the output, which looks like this:
she wants to ride my bmw the go for a ride in my bmw lol URL rt AT_USER Ðun bmw es mucho? yo: bmw. -AT_USER veeergaaa!. hahahahahahahahaha nos hiciste la noche caray!
So it puts everything on one line instead of keeping one tweet per line, as in the input file.
Does anyone have an idea how to get each tweet on its own line?
With an example file like this:
tweet number one
tweet number two
tweet number three
This code:
file = open('tweets.txt')
for line in file:
    print line
Produces this output:
tweet number one
tweet number two
tweet number three
Python is reading in the line endings just fine, but your script is replacing them via regular expression substitution.
This regex substitution:
tweet = re.sub('[\s]+', ' ', tweet)
is converting every run of whitespace characters (e.g. spaces, tabs and newlines) into a single space, so the newline at the end of each tweet is lost.
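To see the effect in isolation, here is a minimal sketch. The sample string is made up for illustration and stands in for one line as returned by readline(), which still carries its trailing newline.

```python
import re

# Hypothetical sample line, as readline() would return it (ends with '\n').
line = "tweet number one\n"

# \s matches spaces, tabs and newlines, so the trailing '\n' is turned into a space.
collapsed = re.sub(r'[\s]+', ' ', line)

# repr() makes the trailing character visible.
print(repr(collapsed))
```

The printed value ends in a space rather than a newline, which is why consecutive tweets run together when they are written to the output file.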
Either add a newline to the tweet before you write it out, or modify your regex so that it does not replace newlines, like so:
tweet = re.sub('[ ]+', ' ', tweet)
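Both options could look roughly like the sketch below. The function names and the sample string are hypothetical and only serve the illustration; the strip() call in the second function is an added assumption to drop the space left where the newline used to be.

```python
import re

def collapse_spaces_only(tweet):
    # Option 1: only collapse runs of plain spaces, so '\n' is left untouched.
    return re.sub(r'[ ]+', ' ', tweet)

def collapse_all_then_add_newline(tweet):
    # Option 2: collapse all whitespace, trim the ends (assumption: the leftover
    # trailing space is unwanted), then append the newline again.
    return re.sub(r'\s+', ' ', tweet).strip() + '\n'

# Hypothetical sample line with extra spaces and a trailing newline.
sample = "BMW   Sees U.S. As Top Market\n"

print(repr(collapse_spaces_only(sample)))
print(repr(collapse_all_then_add_newline(sample)))
```

Either way the processed tweet still ends with a newline, so each one lands on its own line when written to the output file. Note that the first option leaves tabs alone, while the second also collapses them.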
EDIT: I had left my test substitution command in there. The suggestion has been fixed.