Perhaps I jumped to the wrong conclusion

Yesterday I wrote an entry `wherein I blamed power draw`_ for causing the hard hangs I’ve experienced since I LU’d to snv_60. After a bit more downtime overnight and some serious worrying about the safety of my data (photos… I can’t lose them!), I’ve done a bit more analysis and come up with this explanation: it’s a dodgy disk.

Not my preferred explanation, but one which fits the evidence better than the power draw theory.

After I removed the two extra disks, I still had the problem. Ergo, power draw is unlikely to be the cause.

I tried copying files off my `camera’s`_ CF card into my photo storage area three times. Each time I did, the cp process, and then every other process using anything under /scratch, would hang with a stack trace like this:

> fffffffed2910340::print proc_t p_tlist|::findstack -v
stack pointer for thread fffffffed0ca5280: ffffff0005673b60
[ ffffff0005673b60 _resume_from_idle+0xf8() ]
ffffff0005673ba0 swtch+0x17f()
ffffff0005673bd0 cv_wait+0x61(fffffffec7dd2b16, fffffffec7dd2ad0)
ffffff0005673c20 txg_wait_open+0x7f(fffffffec7dd2a00, 2a3968)
ffffff0005673c60 dmu_tx_wait+0x92(ffffffff23b55800)
ffffff0005673d60 zfs_write+0x2de(fffffffef1bc9680, ffffff0005673e20, 0, fffffffee7401e88, 0)
ffffff0005673dd0 fop_write+0x3f(fffffffef1bc9680, ffffff0005673e20, 0, fffffffee7401e88, 0)
ffffff0005673e90 write+0x2ad(4, fe400000, 800000)
ffffff0005673ec0 write32+0x1e(4, fe400000, 800000)
ffffff0005673f10 sys_syscall32+0x101()
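
Reading the trace from the bottom up, the write() call ends up asleep in cv_wait underneath txg_wait_open, which is consistent with IO to the pool not completing. As a rough sketch (not part of my original debugging), here is how one could pick the function names out of saved ``::findstack -v`` output to check several hung threads for the same pattern. The names ``parse_frames`` and ``is_waiting_on_txg`` are made up for illustration, and the regular expression only assumes the frame layout shown above.

```python
import re

# Assumed frame layout, taken from the trace above:
#   <frame pointer> <function>+0x<offset>(<args>)
# The first frame may be wrapped in square brackets.
FRAME_RE = re.compile(
    r"^\[?\s*([0-9a-f]+)\s+(\w+)\+0x([0-9a-f]+)\((.*)\)\s*\]?$"
)


def parse_frames(text):
    """Return the function names in a trace, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(2))
    return frames


def is_waiting_on_txg(frames):
    """True if the thread sleeps in cv_wait on behalf of txg_wait_open."""
    return "cv_wait" in frames and "txg_wait_open" in frames


# The trace from this entry, pasted verbatim (hypothetical usage).
TRACE = """\
stack pointer for thread fffffffed0ca5280: ffffff0005673b60
[ ffffff0005673b60 _resume_from_idle+0xf8() ]
ffffff0005673ba0 swtch+0x17f()
ffffff0005673bd0 cv_wait+0x61(fffffffec7dd2b16, fffffffec7dd2ad0)
ffffff0005673c20 txg_wait_open+0x7f(fffffffec7dd2a00, 2a3968)
ffffff0005673c60 dmu_tx_wait+0x92(ffffffff23b55800)
ffffff0005673d60 zfs_write+0x2de(fffffffef1bc9680, ffffff0005673e20, 0, fffffffee7401e88, 0)
ffffff0005673dd0 fop_write+0x3f(fffffffef1bc9680, ffffff0005673e20, 0, fffffffee7401e88, 0)
ffffff0005673e90 write+0x2ad(4, fe400000, 800000)
ffffff0005673ec0 write32+0x1e(4, fe400000, 800000)
ffffff0005673f10 sys_syscall32+0x101()
"""

if __name__ == "__main__":
    names = parse_frames(TRACE)
    print(" <- ".join(names))
    print("blocked on txg:", is_waiting_on_txg(names))
```

This only tells you where a thread is sleeping, not why the transaction group isn’t making progress, so it narrows things down to the storage path without identifying the disk itself.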

After managing to pull my head out and think about this for a moment, I realised that the problem had not occurred before I LU’d to snv_60. When I did the LU, I activated my alternate BE (on my second disk) and made it the logical lefthand side of the mirror. Whenever I hit a specific part of the filesystem in multiuser/64bit mode, all IOs would hang.

A failsafe boot to 32bit followed by a zpool scrub didn’t find anything wrong with the pool or its filesystems, but when I rebooted again I saw the dreaded GRUB error message that it couldn’t find my root partition. Another failsafe boot and “format…label” later and once more, a hard hang while doing heavy IO to and from the pool.

I removed the disk, attached the jumper to the end of it that forces 1.5Gbps (SATA-I) speeds, re-inserted it and rebooted. I’ve now been up and running for nearly 45 minutes, doing some fairly heavy IO… it looks OK for the moment.

I’m more confident that I’ve nailed the source of the problem, but we’ll just have to wait and see.

Update: Another possibility, given that I’ve stumbled across 6536905 “biosdev 1.4,1.5 changes render SATA disks under old framework invisible to LU”, is that there’s a BIOS bug which prevents the second onboard SATA channel from operating at full SATA-II speeds. I’m not exactly sure how I’m going to investigate this idea, though.

.. _camera’s: http://www.jmcpdotcom.com/roller/jmcp/entry/why_are_black_eos400d_bodies
.. _wherein I blamed power draw: http://www.jmcpdotcom.com/roller/jmcp/entry/20070325
