Bugzilla@Mozilla – Bug 165586
pthreads: PR_Interrupt may not interrupt PR_WaitCondVar (the join test hangs)
Last modified: 2007-05-08 17:40:45 PDT
This bug affects the pthreads version of NSPR, which is used on most Unix platforms. There is a race condition when we use PR_Interrupt to interrupt PR_WaitCondVar. Suppose thread A is calling PR_WaitCondVar and thread B is interrupting thread A. The following event sequence is problematic.

Thread A                      Thread B
==========================    ========================
Test its interrupt flag
Set thred->waiting to cvar
                              Set thread A's interrupt flag
                              Call pthread_cond_broadcast
                              on thread A's 'waiting' cvar
Call pthread_cond_wait
=====================================================

Thread A misses the broadcast and blocks in pthread_cond_wait forever.

This can be reproduced with the 'join' test program, at least on Red Hat Linux 6.2.
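The sequence above is a lost wakeup: the flag test and the call to pthread_cond_wait are not atomic with respect to the interrupter. A minimal sketch of one conventional way to close such a window follows. It is not NSPR code and not necessarily the fix for this bug; waiter_t, waiter_main, interrupt_waiter, and run_interrupt_demo are hypothetical names. The assumption is that the interrupter is allowed to take the same mutex the waiter holds around its flag test and wait.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a waiting thread's state; not NSPR's PRThread. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cvar;
    bool interrupted;   /* protected by lock */
    bool woke_up;       /* set by the waiter once it leaves the wait */
} waiter_t;

static void *waiter_main(void *arg)
{
    waiter_t *w = arg;

    pthread_mutex_lock(&w->lock);
    /* The flag test and pthread_cond_wait both happen under w->lock, so
     * the interrupter cannot set the flag and broadcast in between.
     * The loop also re-tests the flag after any spurious wakeup. */
    while (!w->interrupted)
        pthread_cond_wait(&w->cvar, &w->lock);
    w->woke_up = true;
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

static void interrupt_waiter(waiter_t *w)
{
    /* Taking the lock orders this update against the waiter's flag test. */
    pthread_mutex_lock(&w->lock);
    w->interrupted = true;
    pthread_cond_broadcast(&w->cvar);
    pthread_mutex_unlock(&w->lock);
}

/* Returns 1 if the waiter observed the interrupt and returned. */
int run_interrupt_demo(void)
{
    waiter_t w;
    pthread_t tid;
    int rc;

    rc = pthread_mutex_init(&w.lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&w.cvar, NULL);
    assert(rc == 0);
    w.interrupted = false;
    w.woke_up = false;

    rc = pthread_create(&tid, NULL, waiter_main, &w);
    assert(rc == 0);
    interrupt_waiter(&w);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_cond_destroy(&w.cvar);
    pthread_mutex_destroy(&w.lock);
    return w.woke_up ? 1 : 0;
}
```

Calling run_interrupt_demo() from a main function should return 1 regardless of which thread runs first: either the waiter sees the flag already set and never waits, or it is already inside pthread_cond_wait when the broadcast arrives.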
Created attachment 97240
For testing: a patch for NSPR that makes it easy to reproduce the bug with the join test

Apply the patch to mozilla/nsprpub. It inserts a 2 second delay into PR_WaitCondVar after it sets thred->waiting to cvar, and a 1 second delay at the very beginning of PR_Interrupt. With the patched NSPR library, run the 'join' test. The events will happen at the following time instants:

Thread A                          Thread B
===========================       ========================
T0: Test its interrupt flag       T0: Sleep 1 second
T0: Set thred->waiting to cvar
T0: Sleep 2 seconds
                                  T1: Set thread A's interrupt flag
                                  T1: Call pthread_cond_broadcast
                                      on thread A's 'waiting' cvar
T2: Call pthread_cond_wait
This bug can also be reproduced rather easily with the 'join' test program on Mac OS X.
I am seeing a problem which may be related to this bug, though I don't know enough about threads to be sure. I am getting frequent hangs of Mozilla, where it will not respond at all, if I open several pages in new tabs in the background in quick succession. I first noticed this problem with Mozilla 1.4 beta, but I have now reproduced it with 1.4rc and 1.0.0 under Linux as well. Here's a copy of the backtrace I get after this hang in 1.4rc:

#0 0x403b6ae2 in sigsuspend () from /lib/libc.so.6
#1 0x400d6f35 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2 0x400d3f05 in pthread_cond_wait () from /lib/libpthread.so.0
#3 0x400af15e in PR_WaitCondVar () from /usr/local/mozilla/libnspr4.so
#4 0x4057c8ee in nsThreadPool::GetRequest () from /usr/local/mozilla/libxpcom.so
#5 0x4057d060 in nsThreadPoolRunnable::Run () from /usr/local/mozilla/libxpcom.so
#6 0x4057b9bd in nsThread::Main () from /usr/local/mozilla/libxpcom.so
#7 0x400b42b9 in PR_Select () from /usr/local/mozilla/libnspr4.so
#8 0x400d4d53 in pthread_start_thread () from /lib/libpthread.so.0
#9 0x400d4d99 in pthread_start_thread_event () from /lib/libpthread.so.0

The backtrace is essentially the same for 1.0.0. I am not sure what I have changed about my system that caused this problem to start only recently, within the past month. My guess is libc6 or the kernel. I am now running Linux kernel 2.5.69 and libc6 2.3.1-16 from Debian testing.

Since the backtrace involves pthreads and the interrupt and wait conditions mentioned in this bug, it seems that my problem might be related. If you don't think so, let me know and I'll open a new bug.
You should open a new bug. PR_Interrupt is only used when shutting down an nsThreadPool. I am not sure if PR_Interrupt is involved in the hang you are seeing.