
JPA: What is the proper pattern for iterating over large result sets?

itbloger 2020. 7. 29. 07:54



Let's say I have a table with millions of rows. Using JPA, what is the proper way to iterate over a query against that table, so that I don't end up with an in-memory List containing millions of objects?

For example, I suspect the following will blow up if the table is large:

List<Model> models = entityManager().createQuery("from Model m", Model.class).getResultList();

for (Model model : models)
{
     System.out.println(model.getId());
}

Is pagination (looping and manually updating setFirstResult()/setMaxResults()) really the best solution?

Edit: the primary use case I'm targeting is a kind of batch job. It's fine if it takes a long time to run; there is no web client involved. I just need to "do something" for each row, one at a time (or some small N at a time). I'm just trying to avoid having them all in memory at the same time.


Page 537 of Java Persistence with Hibernate gives a solution using ScrollableResults, but alas it's Hibernate-only.

So it seems that using setFirstResult/setMaxResults and manual iteration really is necessary. Here's my solution using JPA:

private List<Model> getAllModelsIterable(int offset, int max)
{
    return entityManager.createQuery("from Model m", Model.class).setFirstResult(offset).setMaxResults(max).getResultList();
}

Then, use it like this:

private void iterateAll()
{
    int offset = 0;

    List<Model> models;
    while ((models = getAllModelsIterable(offset, 100)).size() > 0)
    {
        entityManager.getTransaction().begin();
        for (Model model : models)
        {
            log.info("do something with model: " + model.getId());
        }

        entityManager.flush();
        entityManager.clear();
        entityManager.getTransaction().commit();
        offset += models.size();
    }
}

I tried the answers presented here, but JBoss 5.1 + MySQL Connector/J 5.1.15 + Hibernate 3.3.2 didn't work with them. We just migrated from JBoss 4.x to JBoss 5.1, so we're stuck with it for now, which means the latest Hibernate we can use is 3.3.2.

Adding a couple of extra parameters did the job, and code like this runs without OOMEs:

        StatelessSession session = ((Session) entityManager.getDelegate()).getSessionFactory().openStatelessSession();

        Query query = session
                .createQuery("SELECT a FROM Address a WHERE .... ORDER BY a.id");
        query.setFetchSize(Integer.valueOf(1000));
        query.setReadOnly(true);
        query.setLockMode("a", LockMode.NONE);
        ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
        while (results.next()) {
            Address addr = (Address) results.get(0);
            // Do stuff
        }
        results.close();
        session.close();

The important lines are the query parameters set between createQuery and scroll. Without them, the scroll call tries to load everything into memory and either runs into an OutOfMemoryError or never finishes.


You can't really do this in plain JPA, but Hibernate has support for stateless sessions and scrollable result sets.

We routinely process billions of rows with their help.

Here is a link to the documentation: http://docs.jboss.org/hibernate/core/3.3/reference/en/html/batch.html#batch-statelesssession


To be honest, I would suggest leaving JPA and sticking with JDBC (but certainly using a support class like JdbcTemplate or similar). JPA (and other ORM providers/specifications) is not designed to operate on many objects within one transaction, as it assumes everything loaded should stay in the first-level cache (hence the need for clear() in JPA).

I am also recommending a more low-level solution because the overhead of an ORM (reflection is only the tip of the iceberg) might be so significant that iterating over a plain ResultSet, even with some lightweight support like the aforementioned JdbcTemplate, will be much faster.

JPA is simply not designed to perform operations on a large number of entities. You might play with flush()/clear() to avoid an OutOfMemoryError, but consider this once again: you gain very little while paying the price of huge resource consumption.
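As a dependency-free illustration of the row-callback style this answer recommends (Spring's JdbcTemplate with a RowCallbackHandler pushes rows to you one at a time in this fashion), here is a sketch. `RowSource` is a hypothetical stand-in for a forward-only JDBC ResultSet, not a real API:

```java
import java.util.function.LongConsumer;

// Sketch of the row-callback style: rows are pushed to a handler one at a
// time instead of being collected into a list, so memory use stays flat
// no matter how many rows the query returns.
public class RowCallbackSketch {

    // Hypothetical stand-in mimicking ResultSet.next()/getLong("id").
    interface RowSource {
        boolean next();
        long currentId();
    }

    public static long streamRows(RowSource rows, LongConsumer handler) {
        long count = 0;
        while (rows.next()) {          // only one row "in memory" at a time
            handler.accept(rows.currentId());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Fake source producing ids 1..1000, standing in for a DB cursor.
        RowSource fake = new RowSource() {
            private long id = 0;
            public boolean next() { return ++id <= 1000; }
            public long currentId() { return id; }
        };
        System.out.println(streamRows(fake, id -> { /* process one row */ }));
    }
}
```

The point of the shape is that the caller never sees a List; with a real JDBC driver you would additionally set a small fetch size so the driver streams rows from the server.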


If you use EclipseLink, you can use this method to get the result as an Iterable:

private static <T> Iterable<T> getResult(TypedQuery<T> query)
{
  //eclipseLink
  if(query instanceof JpaQuery) {
    JpaQuery<T> jQuery = (JpaQuery<T>) query;
    jQuery.setHint(QueryHints.RESULT_SET_TYPE, ResultSetType.ForwardOnly)
       .setHint(QueryHints.SCROLLABLE_CURSOR, true);

    final Cursor cursor = jQuery.getResultCursor();
    return new Iterable<T>()
    {     
      @SuppressWarnings("unchecked")
      @Override
      public Iterator<T> iterator()
      {
        return cursor;
      }
    }; 
   }
  return query.getResultList();  
}  

Close method:

static void closeCursor(Iterable<?> list)
{
  Iterator<?> it = list.iterator();
  if (it instanceof Cursor)
  {
    ((Cursor) it).close();
  }
}

It depends upon the kind of operation you have to do. Why are you looping over millions of rows? Are you updating something in batch mode? Are you going to display all the records to a client? Are you computing some statistics on the retrieved entities?

If you are going to display a million records to the client, please reconsider your user interface. In this case, the appropriate solution is paginating your results and using setFirstResult() and setMaxResult().

If you have to run an update over a large number of records, you'd better keep the update simple and use Query.executeUpdate(). Optionally, you can execute the update asynchronously using a Message-Driven Bean or a Work Manager.

If you are computing some statistics on the retrieved entities, you can take advantage of the grouping functions defined by the JPA specification.

For any other case, please be more specific :)


There is no "proper" way to do this; it isn't what JPA, JDO, or any other ORM is intended to do. Straight JDBC will be your best alternative, as you can configure it to bring back a small number of rows at a time and release them as they are used; that is why server-side cursors exist.

ORM tools are not designed for bulk processing. They are designed to let you manipulate objects and to make the RDBMS the data is stored in as transparent as possible; most fail at the transparent part, at least to some degree. At this scale, there is no way to process hundreds of thousands of rows (objects), much less millions, with any ORM and have it execute in a reasonable amount of time, because of the object-instantiation overhead, plain and simple.

Use the appropriate tool. Straight JDBC and Stored Procedures definitely have a place in 2011, especially at what they are better at doing versus these ORM frameworks.

Pulling a million of anything, even into a simple List<Integer>, is not going to be very efficient regardless of how you do it. The correct way to do what you are asking is a simple SELECT id FROM table with a server-side (vendor-dependent) cursor set to FORWARD_ONLY and READ_ONLY, and to iterate over that.

If you are really pulling millions of id's to process by calling some web server with each one, you are going to have to do some concurrent processing as well for this to run in any reasonable amount of time. Pulling with a JDBC cursor and placing a few of them at a time in a ConcurrentLinkedQueue and having a small pool of threads ( # CPU/Cores + 1 ) pull and process them is the only way to complete your task on a machine with any "normal" amount of RAM, given you are already running out of memory.
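The producer/consumer setup described above can be sketched in plain Java. The bounded queue and the poison-pill shutdown are illustrative choices, and the loop feeding the queue stands in for iterating a server-side JDBC cursor:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a cursor-like source feeds ids into a bounded queue while a small
// worker pool drains it. The bounded queue keeps memory flat even if the
// producer is faster than the workers.
public class ConcurrentIdProcessor {

    public static int processIds(int totalIds, int workers) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(100); // bounded on purpose
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        Integer POISON = -1; // sentinel signalling "no more ids"

        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        Integer id = queue.take();
                        if (id.equals(POISON)) { queue.put(POISON); return; } // propagate shutdown
                        // "do something" with the id, e.g. call the web service
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: stands in for iterating a forward-only JDBC cursor.
        for (int id = 0; id < totalIds; id++) queue.put(id);
        queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processIds(10_000, Runtime.getRuntime().availableProcessors() + 1));
    }
}
```

In the real job each worker would call the web server for its id; the pool size of cores + 1 follows the suggestion in the answer above.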

See this answer as well.


You can use another "trick": load only the collection of identifiers of the entities you're interested in. Say the identifier is of type long, i.e. 8 bytes; then a list of 10^6 such identifiers takes around 8 MB. If it is a batch process (one instance at a time), that's bearable. Then just iterate and do the job.

One other remark: you should do this in chunks anyway, especially if you modify records; otherwise the rollback segment in the database will grow.

As for the setFirstResult/setMaxResults strategy: it will be VERY, VERY slow for results far from the top.

Also take into consideration that the database is probably operating in read committed isolation, so to avoid phantom reads, load the identifiers first and then load the entities one by one (or 10 by 10, or whatever).
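A minimal sketch of this "ids first" trick, assuming a loader function that in real code would run a `WHERE id IN (:ids)` query per chunk (the names here are illustrative, not from any particular API):

```java
import java.util.*;
import java.util.function.Function;

// Sketch: materialize only the cheap id list, then resolve entities chunk
// by chunk through a loader function.
public class IdChunkLoader {

    public static <T> List<T> loadInChunks(List<Long> ids, int chunkSize,
                                           Function<List<Long>, List<T>> loader) {
        List<T> result = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, ids.size());
            result.addAll(loader.apply(ids.subList(from, to)));
            // In a real batch job you would process-and-discard each chunk
            // here (and clear() the persistence context) instead of
            // accumulating; accumulation is only for demonstration.
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= 25; i++) ids.add(i);
        // Loader stand-in: "loads" an entity as a string per id.
        List<String> loaded = loadInChunks(ids, 10, chunk -> {
            List<String> out = new ArrayList<>();
            for (Long id : chunk) out.add("entity-" + id);
            return out;
        });
        System.out.println(loaded.size()); // 25
    }
}
```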


I was surprised to see that the use of stored procedures was not more prominent in the answers here. In the past, when I've had to do something like this, I've created a stored procedure that processes data in small chunks, then sleeps for a bit, then continues. The reason for the sleeping is to avoid overwhelming the database, which is presumably also being used for more real-time types of queries, such as serving a web site. If no one else is using the database, you can leave out the sleep. If you need to ensure you process each record once and only once, you will need to create an additional table (or field) to store which records you have processed, in order to be resilient across restarts.

The performance savings here are significant, possibly orders of magnitude faster than anything you could do in JPA/Hibernate/AppServer land, and your database server will most likely have its own server side cursor type of mechanism for processing large result sets efficiently. The performance savings come from not having to ship the data from the database server to the application server, where you process the data, and then ship it back.

There are some significant downsides to using stored procedures which may completely rule this out for you, but if you've got that skill in your personal toolbox and can use it in this kind of situation, you can knock out these kinds of things fairly quickly.


To expand on @Tomasz Nurkiewicz's answer: you have access to the DataSource, which in turn can provide you with a connection:

@Resource(name = "myDataSource",
    lookup = "java:comp/DefaultDataSource")
private DataSource myDataSource;

In your code you have

try (Connection connection = myDataSource.getConnection()) {
    // raw jdbc operations
}

This will allow you to bypass JPA for some specific large batch operations like import/export, however you still have access to the entity manager for other JPA operations if you need it.


Use the pagination concept for retrieving the result.


I have wondered this myself. It seems to matter:

  • how big your dataset is (rows)
  • what JPA implementation you are using
  • what kind of processing you are doing for each row.

I have written an Iterator to make it easy to swap out both approaches (findAll vs findEntries).

I recommend you try both.

Long count = entityManager().createQuery("select count(o) from Model o", Long.class).getSingleResult();
ChunkIterator<Model> it1 = new ChunkIterator<Model>(count, 2) {

    @Override
    public Iterator<Model> getChunk(long index, long chunkSize) {
        //Do your setFirst and setMax here and return an iterator.
    }

};

Iterator<Model> it2 = entityManager().createQuery("from Model m", Model.class).getResultList().iterator();


public static abstract class ChunkIterator<T> 
    extends AbstractIterator<T> implements Iterable<T>{
    private Iterator<T> chunk;
    private Long count;
    private long index = 0;
    private long chunkSize = 100;

    public ChunkIterator(Long count, long chunkSize) {
        super();
        this.count = count;
        this.chunkSize = chunkSize;
    }

    public abstract Iterator<T> getChunk(long index, long chunkSize);

    @Override
    public Iterator<T> iterator() {
        return this;
    }

    @Override
    protected T computeNext() {
        if (count == 0) return endOfData();
        if (chunk != null && chunk.hasNext() == false && index >= count) 
            return endOfData();
        if (chunk == null || chunk.hasNext() == false) {
            chunk = getChunk(index, chunkSize);
            index += chunkSize;
        }
        if (chunk == null || chunk.hasNext() == false) 
            return endOfData();
        return chunk.next();
    }

}

I ended up not using my chunk iterator (so it might not be that well tested). By the way, you will need Google Collections (Guava) if you want to use it.
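For readers who don't want the Google Collections dependency, roughly the same idea can be sketched without AbstractIterator. Here an in-memory list plays the role of the paged query, and getChunk is just an (offset, max) -> page function; the names are illustrative:

```java
import java.util.*;
import java.util.function.BiFunction;

// Dependency-free variant of the chunk-iterator idea: pull one page at a
// time and hand out its elements, fetching the next page lazily. In real
// code getChunk would be the setFirstResult/setMaxResults query.
public class ChunkedIteration {

    public static <T> Iterable<T> chunked(long count, int chunkSize,
                                          BiFunction<Integer, Integer, List<T>> getChunk) {
        return () -> new Iterator<T>() {
            private int offset = 0;
            private Iterator<T> page = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                while (!page.hasNext() && offset < count) {
                    List<T> next = getChunk.apply(offset, chunkSize);
                    if (next.isEmpty()) return false; // defensive: table shrank
                    offset += next.size();
                    page = next.iterator();
                }
                return page.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                return page.next();
            }
        };
    }

    public static void main(String[] args) {
        List<Integer> table = new ArrayList<>();
        for (int i = 0; i < 250; i++) table.add(i);
        int seen = 0;
        for (int row : chunked(table.size(), 100,
                (offset, max) -> table.subList(offset, Math.min(offset + max, table.size())))) {
            seen++;
        }
        System.out.println(seen); // 250
    }
}
```

As with the original ChunkIterator, this only helps if each page is processed and discarded; with JPA you would also clear() the persistence context between pages.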


With hibernate there are 4 different ways to achieve what you want. Each has design tradeoffs, limitations, and consequences. I suggest exploring each and deciding which is right for your situation.

  1. Use stateless session with scroll()
  2. Use session.clear() after every iteration. When other entities need to be attached, load them in a separate session. Effectively, the first session emulates a stateless session but retains all the features of a stateful session until the objects are detached.
  3. Use iterate() or list() but get only ids in the first query, then in a separate session in each iteration, do session.load and close the session at the end of the iteration.
  4. Use Query.iterate() with EntityManager.detach(), aka Session.evict().

Reference URL: https://stackoverflow.com/questions/5067619/jpa-what-is-the-proper-pattern-for-iterating-over-large-result-sets
