且构网

Handling files in Java EE

Updated: 2023-12-03 13:19:34

I have a system that is supposed to take large files containing documents and process these to split up the individual documents and create document objects to be persisted with JPA (or at least it is assumed in this question).

Each file contains between 1 and 100,000 documents. The files come in various types:

  • Compressed
    • Zip
    • Tar + gzip
    • Gzip
  • Plain-text
  • XML
  • PDF

Now the biggest concern is that the specification forbids accessing local files, at least in the way that I'm used to.

I could save the files to a database table, but is that really a good way to do it? The files can be up to 2GB and accessing the files from the database would require that you download the whole file, either into memory or onto disk.

My first thought was to separate this process from the application server and use a more traditional approach, but I've been thinking about how to keep it on the application server for future purposes such as clustering.

My questions are basically:

  1. Is there a standard way or a recommended way of dealing with this in Java EE?
  2. Is there an application server specific way around this?
  3. Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?

Here I sketch a few more options, weighing the following concerns:

  • scalability (file size, clustering, etc.)
  • batch architecture (job recovery, error handling, monitoring, etc.)
  • compliance with J2EE

With JCA

JCA connectors belong to the Java EE stack and permit inbound/outbound connectivity from/to the EJB world. JDBC and JMS are usually implemented as JCA connectors. An inbound JCA connector can use threads (through the work abstraction) and transactions. It can then forward any processing to a message-driven bean (MDB).

  • Write a JCA connector that polls for new files, processes them, and delegates further processing to a message-driven bean in a synchronous way.
  • The MDB can then persist the information in the database with JPA.
  • The JCA connector has control over the transaction, and several MDB invocations can take place in the same transaction.
  • The file system is not transactional, so you will need to figure out how to deal with errors such as faulty input files.
  • You can probably use streaming (InputStream) all along the pipeline.
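
To make the streaming point concrete, here is a minimal sketch of splitting a Zip archive into individual documents without loading the archive into memory. It uses only the JDK and is independent of JCA. The names `ZipDocumentSplitter` and `DocumentHandler` are hypothetical, and the sketch assumes that each Zip entry is one document, which may not match your file layout.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ZipDocumentSplitter {

    /** Hypothetical callback: receives one document's name and its content as a stream. */
    interface DocumentHandler {
        void handle(String name, InputStream content) throws IOException;
    }

    /** Shields the archive stream from a handler that closes its input. */
    private static final class NonClosing extends FilterInputStream {
        NonClosing(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the archive must stay open for the next entry.
        }
    }

    /** Streams every non-directory entry to the handler; returns the number of entries handled. */
    static int split(InputStream archive, DocumentHandler handler) throws IOException {
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // ZipInputStream reports end-of-stream at the end of the current entry,
                // so the handler only ever sees one document at a time.
                handler.handle(entry.getName(), new NonClosing(zip));
                count++;
            }
        }
        return count;
    }
}
```

In the JCA or thread-based variants, the handler would be the place where the document is handed to the MDB or SLSB. Only the entry currently being read is buffered, so the 2GB archive size is not a memory concern in itself.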

With plain threads

We can achieve more or less the same as with JCA by using threads launched from a web servlet context listener (or possibly an EJB timer).

  • The thread polls for new files. If a file is found, the thread processes it and delegates further processing to a regular SLSB in a synchronous way.
  • Threads in the web container have access to UserTransaction and can control the transaction.
  • The EJB can be local, so that the InputStream is passed by reference.
  • The web module and the EJB can be deployed together in an EAR.
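
The polling loop described above could look roughly like the following sketch. It uses only the JDK so that it runs outside a container. `DirectoryPoller` and `DocumentProcessor` are hypothetical names, `DocumentProcessor` stands in for the local SLSB interface, and the `done`/`failed` directories are an assumed convention for dealing with the non-transactional file system. The comments mark where the UserTransaction calls would go.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class DirectoryPoller {

    /** Stand-in for the local SLSB business interface (hypothetical). */
    interface DocumentProcessor {
        void process(String fileName, InputStream content) throws Exception;
    }

    private final Path inbox;
    private final Path done;
    private final Path failed;
    private final DocumentProcessor processor;

    DirectoryPoller(Path inbox, Path done, Path failed, DocumentProcessor processor) {
        this.inbox = inbox;
        this.done = done;
        this.failed = failed;
        this.processor = processor;
    }

    /** One polling pass; returns the number of files processed successfully. */
    int pollOnce() throws IOException {
        // Snapshot the directory first so that moving files does not disturb the listing.
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    candidates.add(file);
                }
            }
        }
        int succeeded = 0;
        for (Path file : candidates) {
            Path target = done;
            try (InputStream in = Files.newInputStream(file)) {
                // In a container: begin the UserTransaction here.
                processor.process(file.getFileName().toString(), in);
                // In a container: commit here.
                succeeded++;
            } catch (Exception e) {
                // In a container: roll back here. Faulty input is set aside for inspection.
                target = failed;
            }
            Files.move(file, target.resolve(file.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        }
        return succeeded;
    }

    /** Runs pollOnce repeatedly on a single background thread. */
    ScheduledExecutorService start(long delaySeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                pollOnce();
            } catch (IOException e) {
                // Swallowed on purpose: an exception escaping the task would cancel all future runs.
            }
        }, 0, delaySeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

A servlet context listener would call `start(...)` when the context is initialized and shut the returned scheduler down when it is destroyed. With several cluster nodes polling the same directory, some form of file claiming (for example an atomic rename into a per-node directory) would be needed in addition; the sketch does not cover that.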

With JMS

To avoid the need for several concurrent polling threads and the problem of job acquisition/locking, the actual processing can be done asynchronously using JMS. JMS can also be useful for splitting the processing into smaller tasks.

  • A periodic task polls for new files. If a file is found, a JMS message is queued.
  • When the JMS message is delivered, the file is read and processed, and the information is persisted in the database with JPA.
  • If JMS processing fails, the application server may retry automatically or put the message in the dead message queue.
  • Monitoring and error handling are more complicated.
  • You can probably use streaming.
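
The retry and dead-message behaviour can be illustrated with a small in-memory analogue. This is not JMS and uses no JMS API: `RedeliveringQueue` is a hypothetical, single-threaded stand-in whose only purpose is to show the flow of "deliver, redeliver on failure, give up after a limit". In a real deployment the queue, the redelivery limit, and the dead message queue are provided and configured by the application server.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

final class RedeliveringQueue {

    /** A queued message: the path of the file to process plus a delivery counter. */
    private static final class Message {
        final String filePath;
        int attempts;

        Message(String filePath) {
            this.filePath = filePath;
        }
    }

    private final Queue<Message> pending = new ArrayDeque<>();
    private final List<String> deadLetters = new ArrayList<>();
    private final int maxAttempts;

    RedeliveringQueue(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** What the periodic polling task would do when it finds a new file. */
    void send(String filePath) {
        pending.add(new Message(filePath));
    }

    /**
     * Delivers messages until none are pending. A listener exception counts as a
     * failed delivery: the message is requeued, or dead-lettered once the limit is hit.
     */
    void drain(Consumer<String> listener) {
        Message message;
        while ((message = pending.poll()) != null) {
            message.attempts++;
            try {
                listener.accept(message.filePath);
            } catch (RuntimeException e) {
                if (message.attempts < maxAttempts) {
                    pending.add(message);
                } else {
                    deadLetters.add(message.filePath);
                }
            }
        }
    }

    /** Paths of the messages that exhausted their delivery attempts. */
    List<String> deadLetters() {
        return new ArrayList<>(deadLetters);
    }
}
```

Note that the message carries only the file path, not the file content, which keeps messages small. The listener plays the role of the MDB that opens the file, streams it, and persists the documents.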

With ESB

Many projects have emerged in the past year to deal with integration: JBI, ServiceMix, OpenESB, Mule, Spring Integration, Java CAPS, BPEL. Some are technologies, some are platforms, and there is some overlap between them. They all come with a host of connectors to route, transform, and orchestrate message flows. IMHO, messages are supposed to be small pieces of information, so it may be hard to use these technologies to process your large data files. The website on patterns of enterprise application integration is an excellent source of further information.

IMO, the approach that best fits the Java EE philosophy is JCA, but the effort to invest is relatively high. In your case, using plain threads that delegate further processing to an SLSB is probably the easiest solution. The JMS approach (close to the proposition of P. Thivent) can be interesting if the processing pipeline gets more complicated. Using an ESB seems overkill to me.