On the master01, media02, media03, media04 and media05 servers we have 9 shared tape drives for the SLPs.
Most of the time only 3 or 4 would be working. Every day I would start in the evening with 8 or 9 working tape drives and wake up in the morning to 2 or 3 working drives.

On a daily basis I would try to UP the drives, vary them in the robot, get Vendor Support to check them for errors, complain to the storage team about paths going missing and have them check and re-check them, and go through all the troubleshooting steps I could think of: reboot, restart NetBackup, rescan, release allocations, and so on.

Attaching the tapes to new fibre (about two years ago) gave some hope that things would improve, but it didn't seem to. In fact, it may have made things worse: some drives now had multiple paths which would appear and disappear, which made things even more complicated.

It is possible there is a reason joining the new fabric made things worse, though: SCSI reservation.

I finally found out that if the entire end-to-end fabric, including switches, ISLs and any device in between, is not fully enabled and capable of handling SCSI reservation, then reservations can be left stale or never acquired, and the drives end up DOWN.
<WRAP center round important 60%>
"it is imperative that SCSI traffic passes through the network/SAN infrastructure unmolested and without corruption, otherwise issues such as SCSI reservations can arise due to devices not releasing their reservations correctly and/or being unable to acquire them"
</WRAP>

https://

With the drives dedicated to this backup server set only, NetBackup itself is actually capable of managing drive allocation without SCSI reservation, so I disabled it.

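For completeness: on the NetBackup versions I have worked with, the reserve behaviour is controlled by the ENABLE_SCSI_RESERVE entry in vm.conf. Treat the entry name and path below as assumptions and verify them against the Device Configuration Guide for your release; a minimal sketch:

<code bash>
# /usr/openv/volmgr/vm.conf  (assumed default install path)
# Assumption: ENABLE_SCSI_RESERVE = 0 disables SCSI reserve/release on
# drive paths; check the documented values for your NetBackup release.
ENABLE_SCSI_RESERVE = 0
</code>

Restart the device daemon (ltid) afterwards so the change takes effect.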
Things seemed to still work, but I still had at least 4 drives down.
<code bash>
vmoprcmd -crawlreleasebyname <drive_name>
</code>
Running this command against each affected drive released all the stale or broken reservations, and within 10 minutes *all* drives were UP and in ACS.
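A loop over all the DOWN drives saves doing this by hand. The drive names and the three-column status layout below are made up for illustration (loosely modelled on `vmoprcmd -d` output); check the DRIVE STATUS section of your own output before trusting the awk field numbers:

<code bash>
# Hypothetical drive-status snippet; fields: index, drive name, status.
drive_status='0 HP.ULTRIUM6-SCSI.000 DOWN
1 HP.ULTRIUM6-SCSI.001 UP
2 HP.ULTRIUM6-SCSI.002 DOWN'

# Pick out the names of drives whose status column reads DOWN.
down_drives=$(printf '%s\n' "$drive_status" | awk '$3 == "DOWN" {print $2}')

for drv in $down_drives; do
    echo "would release reservations on $drv"
    # vmoprcmd -crawlreleasebyname "$drv"   # uncomment on a real media server
done
</code>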

Since this change on Saturday night, the servers have caught up the entire backlog of 300 SLP jobs (120-odd TB).
All new backup jobs complete with plenty of space and their duplications kick off within 10 minutes and finish relatively soon after.

There is no longer any backlog on SLPs for the first time in years. \\
Only one drive has gone down, but this is a confirmed hard failure in the robot. All other drives are performing beautifully and can be used on any of the media servers as NetBackup desires.

The lesson seems to be that, at least in this case, the fibre infrastructure does not support SCSI reservation properly.

If your drives are DOWNing, you may want to check this out.

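One quick check on Linux: the kernel's SCSI layer logs this failure as "reservation conflict" against the tape devices. The sample log line below is invented for illustration; on a live media server, grep dmesg and syslog instead:

<code bash>
pattern='reservation conflict'
# Invented example of what such a kernel log line might look like:
sample='kernel: st4: Error with sense data: reservation conflict'
printf '%s\n' "$sample" | grep -ci "$pattern"    # prints 1: pattern found

# On a real media server:
# dmesg | grep -i "$pattern"
# grep -i "$pattern" /var/log/messages
</code>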
So far so good - let's hope this level of performance from the drives continues.