又踩坑了。ThreadPoolExecutor？

问题是出现在 24 号的时候，当时有台 weblogic 实例出现阻塞，运维 dump 线程栈后重启了，有个同事进行分析。

该同事分析线程栈后认为问题出在一个被外部系统调用的接口，这个接口收到请求后会从数据库查询数据，然后把数据处理后发提交到线程池，再由线程池异步发送到 MQ 服务器，调用方监听 MQ 进行数据处理，接口代码大致如下：

@RequestMapping("publish")
public void publish() {
    String msg = RandomStringUtils.randomAlphabetic(32);
    log.info("before submit task :" + msg);
    ThreadPoolUtil.getPoolExecutor().execute(() -> {
        log.info("publish msq to mq done:" + msg);
    });
    log.info("submit task done:" + msg);
}

public class ThreadPoolUtil {
    private static ThreadPoolExecutor pool = new ThreadPoolExecutor(10, 30,
            20, TimeUnit.SECONDS, new ArrayBlockingQueue<>(5),
            new CustomThreadFactory(), new CustomRejectHandler());

    private static AtomicInteger counter = new AtomicInteger(0);

    public static ThreadPoolExecutor getPoolExecutor() {
        return pool;
    }

    public static void shutdown() {
        pool.shutdown();
    }

    static class CustomThreadFactory implements ThreadFactory {
        public Thread newThread(Runnable r) {
            Thread thread = new Thread(r);
            thread.setName(ThreadPoolUtil.class.getName() + counter.incrementAndGet());
            return thread;
        }
    }

    static class CustomRejectHandler implements RejectedExecutionHandler {
        public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
            try {
                pool.getQueue().put(r);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}

那同事问了发起调用的系统的人，对方每次取 300 个任务来发起调用，他觉得是对方调用的并发太大导致的。

虽然隔得有点远，我听到后觉得有疑惑，调用方每次取300个任务更可能是串行调用、而不是300个并发。

我找他要了线程栈文件过来，发现大多数工作线程都是空闲，阻塞的线程的线程栈与下面的相似：

"http-nio-8001-exec-10" #36 daemon prio=5 os_prio=0 tid=0x0000000026b8c800 nid=0x2f64 waiting on condition [0x000000002778c000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000837cd8b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
        at net.coderbee.cloud.controller.ThreadPoolUtil$CustomRejectHandler.rejectedExecution(ThreadPoolUtil.java:34)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
        at net.coderbee.cloud.controller.PublishController.publish(PublishController.java:22)

看了又看之后，结合线程栈和源码，我觉得这个接口只是一个受害者，根本的问题不是这个接口，而在于线程池，线程池为什么不再继续处理提交的任务呢。

我一开始是怀疑是不是踩了 ThreadPoolExecutor 的 bug（滑稽脸，太膨胀了，竟然怀疑 JUC 的 bug），搜了下并没有什么启发。

因为线程是被阻塞了， weblogic 里线程的状态是 stuck，再拿 “ThreadPoolExecutor stuck” 去搜，在 stackoverflow 看到这个 Java ThreadPoolExecutor getting stuck while using ArrayBlockingQueue 。

回答里提到的调用 shutdown 方法敲醒了我，这个线程池果然在一个定时任务调用的地方被调用：ThreadPoolUtil.getPoolExecutor()，然后调用了 shutdown 方法（就在上一个版本引入的代码）。

线程池被 shuadown 后，提交上来的任务被执行 reject ，由于 RejectHandler 是定制的，把提交任务的线程阻塞住了。

到此问题算是搞清楚了，有如下反思：
1、只要方法、对象暴露出来了，就可能会被调用，像这种全局的线程池就不应暴露出来。
2、定位问题，最好是有证据的，不要只是“应该就是这个”。
3、碰到没什么思路的问题，多检查下最近版本的代码，问题很多都是在最近版本引进的。

coderbee笔记

日拱一卒，不期速成，厚积薄发。 wen866595@gmail.com

又踩坑了。ThreadPoolExecutor？

发表回复取消回复

发表回复 取消回复

发表回复取消回复