
How to stream data from Amazon SQS into files in Amazon S3

Updated: 2022-04-18 21:25:44

You can write an AWS Lambda function that gets triggered by a message being sent to an Amazon SQS queue. You are responsible for writing that code, so the answer is that it depends on your coding skill.
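A minimal sketch of such a Lambda handler, assuming JSON message bodies and a hypothetical bucket name (`my-example-bucket`); the `s3_client` parameter is only there so the sketch can be exercised without AWS access:

```python
def lambda_handler(event, context, s3_client=None):
    """Write each SQS record in the triggering event to its own S3 object."""
    if s3_client is None:
        import boto3  # deferred so the sketch can run against a stub client
        s3_client = boto3.client("s3")
    written = []
    for record in event.get("Records", []):
        # One object per message: simple, but produces many tiny files.
        key = f"sqs-messages/{record['messageId']}.json"
        s3_client.put_object(
            Bucket="my-example-bucket",  # hypothetical bucket name
            Key=key,
            Body=record["body"].encode("utf-8"),
        )
        written.append(key)
    return {"written": written}
```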

However, if each message is processed individually, you will end up with one Amazon S3 object per SQS message, which is quite inefficient to process. The fact that the file is in Avro format is irrelevant because each file will be quite small. This will add a lot of overhead when processing the files.

An alternative could be to send the messages to an Amazon Kinesis Data Stream, which can aggregate messages together by size (e.g. every 5 MB) or time (e.g. every 5 minutes). This will result in fewer, larger objects in S3, but they will not be partitioned, nor will they be in Avro format.
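A sketch of forwarding a message onto a stream instead of processing it directly; the stream name is an assumption, and the client is injectable so the call can be exercised without AWS access:

```python
import json


def forward_to_kinesis(message: dict, stream_name: str, kinesis_client=None):
    """Put one message onto a Kinesis data stream; a downstream consumer
    can then buffer records by size or time before writing them to S3."""
    if kinesis_client is None:
        import boto3  # deferred so the sketch can run against a stub client
        kinesis_client = boto3.client("kinesis")
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(message).encode("utf-8"),
        # Partition key assumed to come from a hypothetical 'id' field.
        PartitionKey=str(message.get("id", "default")),
    )
```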

To get the best performance out of a format like Avro, combine the data into larger files that are more efficient to process. So, for example, you could use Kinesis for collecting the data, then a daily Amazon EMR job to combine those files into partitioned Avro files.
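The combined files are typically laid out in date partitions. A plain-Python sketch of that grouping step, assuming each record carries an epoch-seconds `timestamp` field (a hypothetical schema); an EMR/Spark job would do the equivalent with a partitioned write:

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_records(records, prefix="combined"):
    """Group records into Hive-style date partitions (dt=YYYY-MM-DD),
    the S3 key layout a daily combine job might write to."""
    partitions = defaultdict(list)
    for rec in records:
        day = datetime.fromtimestamp(rec["timestamp"], tz=timezone.utc).date()
        partitions[f"{prefix}/dt={day.isoformat()}/"].append(rec)
    return dict(partitions)
```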

So, the answer is: "It's pretty easy, but you probably don't want to do it."

Your question does not define how the data gets into SQS. If, rather than processing messages as soon as they arrive, you are willing for the data to accumulate in SQS for some period of time (e.g. 1 hour or 1 day), you could write a program that reads all of the messages and outputs them into partitioned Avro files. This uses SQS as a temporary holding area, allowing data to accumulate before being processed. However, it would lose any real-time reporting aspect.
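A sketch of such a batch reader, assuming standard `boto3`-style `receive_message`/`delete_message` calls; the client is passed in so the loop can be tested without a real queue, and writing the collected bodies out as partitioned Avro files is left as a separate step:

```python
def drain_queue(sqs_client, queue_url, max_empty_polls=3):
    """Read and delete all currently available messages from an SQS queue,
    returning their bodies so they can be combined into larger files."""
    bodies = []
    empty_polls = 0
    while empty_polls < max_empty_polls:
        resp = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS batch maximum
            WaitTimeSeconds=2,       # short long-poll to catch stragglers
        )
        messages = resp.get("Messages", [])
        if not messages:
            empty_polls += 1
            continue
        empty_polls = 0
        for msg in messages:
            bodies.append(msg["Body"])
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
    return bodies
```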