Read.IT & Reread.IT – Typographical errors, or use an English proofreader

This morning I was entering the serial numbers of computer parts into my spreadsheet. I picked up the Thermaltake Toughpower 1200W power supply box. It comes with a 5-year warranty (definitely a keeper), so as I was entering the serial number, my eye was drawn to an interesting typographical error.

[Photo: the typographical error on the Thermaltake Toughpower 1200W box]

Since I have been talking about capacitors lately, I would dearly love to get some of these fantastic Japanese capacitors.

My suggestion for companies that manufacture products for the English market is to “read.it” & “reread.it”, and preferably to get marketing and product information checked by English proofreaders. Don’t just rely on spell-checkers and the like. A lot of documentation coming out of Asia has these seemingly obvious typos. Have you seen an interesting typo lately?


Retell.IT – 2013 – Faulty capacitor on Toshiba laptop

Speaking of faulty capacitors reminds me of a Toshiba laptop that I repaired in August 2013. This Toshiba laptop belonged to a friend’s landlord. If my friend could get it fixed economically, he could score brownie points; it is always an advantage to be on good terms with your landlord, especially when rents look set to rise.

[Photo: the Toshiba laptop taken apart]

This is the laptop taken apart. Note the massive amount of dust in the CPU heatsink/heat distributor and in the fan just above it. The laptop would not boot, and research on the internet pointed to a failed capacitor: the rectangular NEC TOKIN inside the metal plate near the bottom left. This capacitor was a bugger to remove. I tried infrared heating, but it just started to cook the plastic top and the capacitor did not budge. These capacitors are often glued down, and because this one had a large contact area on the board, it was impossible to remove without more specialized equipment that I don’t have, i.e. a dark infrared rework station with an under-board heater and a good temperature controller.

I had to effectively destroy the capacitor piece by piece, layer by layer. Actually, capacitors are fragile and easy to destroy, especially when I am wielding a scalpel. Eventually it was removed, and I replaced it with four SMD capacitors. I had to use a fibreglass pen to remove the green coating (the solder mask) from the board in order to solder the replacement capacitors.

[Photo: closeup of the replacement capacitors]

Here is a closeup of the board with the replacement capacitors. After cleaning out all the dust, of course, I reassembled the motherboard in the laptop and powered it on, and it worked. Total cost to the friend’s landlord was $60, plus a lot of brownie points.

[Note]  I have since learned that I could have used my existing infrared rework station: use the under-board heater, heat the capacitor with the infrared, and use my hot-air rework station to add heat and keep the temperature steady. But that comes with a risk of damaging the board, because the heat is applied for much longer, so my scalpel was the better choice.

Reveal.IT – Faulty capacitors on Presario SR5120AN motherboard

Last night, when I was swapping out the Corsair HX650W power supply from my Compaq Presario SR5120AN desktop computer to use in my VMware server, I noticed a bad or failed capacitor. So, you might ask, what are the obvious signs of a failed or failing capacitor that you can see with the naked eye? Here is a photo of the motherboard that I removed from the desktop this morning.

[Photo: the Compaq Presario SR5120AN motherboard]

The electrolytic capacitors that most often fail are those with aluminium cans; these are the common types that fail due to overheating. Inside is a liquid electrolyte (strictly speaking, the dielectric is a thin oxide layer on the aluminium foil, and the electrolyte acts as the cathode) that dries out over time, and overheating both speeds this up and can build enough pressure inside the can to open the vent. The venting of capacitors can occasionally be an explosive event with a loud bang, if anyone is around to hear it. Sometimes, if you are lucky, you might hear a fizzle. Can anyone with sharp eyes see anything unusual on the motherboard?

Sounds of suspense here: tick tock tick tock… OK, do you see the capacitor at the top left with a bulged and blackened top, and another four in a row beneath the CPU? The usual sign is a bulging top, which indicates over-pressure caused by overheating. If the vent opens, electrolyte is released, and that is the black stuff. Bulging, with or without dark/black discolouration, is how a bad or failed capacitor will “reveal.it”self. On the tops of the other capacitors, you can see something like a cross or T-shaped logo. This logo is actually the vent: the aluminium is deliberately weakened there so that failure occurs along the crease marks. These capacitors are rated 1800µF, 6.3V, up to 105 degrees Celsius. There are another four of these on this motherboard that appear to be intact. I don’t have any of these in my parts stock, so I will need to order some.

Usually we replace all of the failed capacitors, including those of the same brand and type that have not failed yet, so I will need to order nine of them. I don’t have to replace the other four, as they don’t appear to have failed yet, but generally speaking it is just a matter of time, so it is best to get the hard work done now instead of doing it again later. Why hard work? Motherboards are multi-layered boards that can be very difficult to work on, even with specialized desoldering equipment. Sometimes, no matter what, a capacitor may stubbornly resist being desoldered, and when this happens we usually have to get the cutters out, cut it off the board, then use a soldering iron on one side of the board and a desoldering iron/tool on the other side. I have had to do this once or twice.

You may also notice that I had removed the CPU heatsink. The thermal compound had become quite old, brittle and flaky, so I cleaned it off and applied Arctic Silver 5. I then reinstalled the motherboard, since it is still working apart from the occasional freeze and blue screen; in fact, I am typing this on that desktop right now. Anyway, that’s it for now.

[Note]  Desoldering equipment should be maintained regularly. Due to the higher temperatures involved, the desoldering tip can go black from oxidation, and then it will not transfer heat well. That tempts you to increase the temperature, which makes the oxidation worse, and so on. I use a Chemtools Tip Tinner, a mix of powdered flux and solder that reactivates the desoldering tip; it can also be used on soldering irons.

[Note2]  When the electrolyte dries out, the capacitor’s capacitance is reduced, which means that its function in the circuit, reducing ripple, is degraded. The drying electrolyte also raises the capacitor’s ESR (Equivalent Series Resistance). The higher ESR makes the capacitor dissipate more power as heat, which dries the electrolyte even faster; this is the mechanism that means that once it starts, it will continue until failure. I have an ESR meter, so I could test the other four capacitors to see whether they are still functioning well and then decide whether to replace them, but the five failed ones definitely need replacing.
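A rough way to see that feedback loop: the heat generated inside a capacitor by ripple current is approximately P = I²·ESR, where I is the RMS ripple current through it. The Python sketch below uses assumed, illustrative values (a 1.5 A RMS ripple current, and an ESR of 0.02 Ω when healthy versus 0.2 Ω when dried out); they are not measurements from this motherboard.

```python
def ripple_heating_watts(ripple_current_rms_a: float, esr_ohms: float) -> float:
    """Power dissipated inside a capacitor by ripple current: P = I_rms^2 * ESR."""
    return ripple_current_rms_a ** 2 * esr_ohms


# Illustrative (assumed) values, not measurements from the Presario board.
ripple = 1.5        # amps RMS of ripple current through the capacitor
healthy_esr = 0.02  # ohms, a hypothetical healthy capacitor
dried_esr = 0.2     # ohms, the same hypothetical capacitor after drying out

for label, esr in (("healthy", healthy_esr), ("dried out", dried_esr)):
    print(f"{label}: {ripple_heating_watts(ripple, esr):.3f} W")
```

Because the ripple current is set by the circuit rather than by the capacitor, a tenfold rise in ESR means roughly ten times the heat, which in turn dries the electrolyte further.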

Replace.IT – FSP Aurum Pro 1000W Power Supply with Corsair HX650W

The last day and a half was quiet, as our internet had been capped until this morning. During that time I have been struggling with the FSP Aurum Pro 1000W power supply unit that I installed in my new VMware ESXi 5.5U2 server. I was using the VMware OVF Tool to export virtual machines from my old server to the new server, and large machines would take a long time to copy, so I left it running overnight. In the morning I kept finding the new server switched off.

It seems that after running for a while the power supply shuts down. I remembered facing this problem before, during my cryptomining phase, but I am not drawing anywhere near that sort of power now: my power meter indicates that the new server consumes only 120W, low enough that the power supply should handle it with ease. Unfortunately, that is not the case; it kept tripping out. I know that it has tripped because the power button on the case does nothing until the switch on the power supply is turned off and then back on after a few seconds.

This power supply had been working, so what had changed? Then I had a hunch. The power supply also has two dedicated connectors for attaching fans, and I do have a couple of fans on one of those cables. Could they be causing the problem?

The Antec 1100 case has a fan power hub that takes a Molex connection and splits it out to four 3-pin fan power connectors, allowing additional case fans to be run. I disconnected those fan cables and the power supply started running consistently. That is, until I decided to test whether the fan cable could handle just one fan: I connected a single fan while everything was running, and immediately everything went off.

Bad move! After that, no matter what I did (removing all the fans, removing the modular cables going to the DVD drive and the hard disks, leaving only the motherboard connected), the power supply was not going to work this time. After resetting the power supply and pressing the case power button, the CPU fan would start spinning, then everything would stop. It happened more than 10 times in a row last night, so I was ready to just throw the FSP power supply into the pool. After calming down a little, I decided to “replace.it” with the Corsair HX650W power supply from my Compaq Presario desktop.

I got out the Presario’s original 300W Compaq power supply, which I had stored in the Corsair box. I had bought the Corsair as a replacement when I upgraded to an nVidia 8800GT graphics card that needed more power, but I have since gone to an AMD Radeon 7850, which uses less power. I removed the Corsair power supply (ever notice how much dust computers accumulate?) and also noticed a blown capacitor on the motherboard. That might explain the occasional blue screen that I had been getting recently, but that is another story.

Now, with the Corsair HX650W installed in the server, it powers up and I am happy to continue exporting and importing virtual machines.

One thing that does worry me is that, because VMware crashed multiple times when the power supply shut down, some of the data on the hard disks could be corrupted. I don’t have a battery backup for the Adaptec 5805 array controller, so any data still on its way to the hard disks when the power cuts out is lost. It might be a good idea to buy the battery: it costs about $150 and would allow data to be held in the array controller cache and written to the disks on the next power-up. Also, the Corsair power supply, being from the Professional Series, has a 7-year warranty, so I will need to find the receipt and keep it handy, just in case.

[Note] Due to the multiple unintentional shutdowns, it might also be a good idea to reinstall VMware ESXi to ensure that all the data is valid. I might do this after I get the battery and enable all the write caching, now that I have a good, reliable power supply in the server.

Review.IT – To CrossFire AMD Radeon 7850s or not

As mentioned previously, I had dabbled in cryptocurrency mining. I went as far as building a few dedicated machines with 3x Radeon 7950s and 3x Radeon 7850s, plus the odd Radeon 7870. Since the bitcoin price went down the gurgler (although it has since recovered a little), I had decided to stop mining for a while. What I have been left with is…

[Photo: the leftover graphics cards]

What to do with all of these graphics cards? They are in almost new condition: they were used for about six months, then cleaned and packed up. My younger son has called dibs on one of the MSI HD 7850s for his new computer, with the option of taking another one (for CrossFire). I had plans to use one of the MSI Twin Frozr 7950s for my new desktop and gaming machine. The questions in my mind were:

Q1: Could I sell the MSI HD 7850 cards and recoup some cash?

Q2: If the MSI HD 7850s aren’t worth selling because the return is very low, should I keep them and sell the MSI R7950s instead?

Q3: Should I use the MSI HD 7850s instead, running two of them in CrossFire?

With these questions in mind, I searched the internet for reviews of any sort comparing 7850 CrossFire against a single 7950. Most of the sites that I visited simply asserted that a single card is better than two, with little justification. Then I came across this old review from TweakTown, which I found very interesting.

HIS Radeon HD 7850 IceQ X TurboX 2GB in CrossFire Video Card Review

http://www.tweaktown.com/reviews/4798/his_radeon_hd_7850_iceq_x_turbox_2gb_in_crossfire_video_card_review/index.html

There are a couple of differences. I don’t have that HIS card, but mine are equivalent in specifications. The review doesn’t compare against the Radeon 7950, but it does show performance relative to the Radeon 7970, which is slightly better than the Radeon 7950. It also doesn’t test Borderlands 2, since the review was published before that game’s release. I don’t plan to run at a very high resolution, so the review suggests that CrossFire would be a good option for me. There is of course talk about how CrossFire can be problematic, but surely it has matured by now.

So, let’s “review.it”.

Q1 answer: for some reason, the 7850s have a low resale value. Gumtree has them for $75 each and I bought them at $169, so I would potentially recoup only 44% of the purchase price.

Q2 answer: the cheapest 7950 on sale is $150 and I bought mine for $289, so I would potentially recoup 52%.

Q3 answer: possibly; it certainly looks interesting. Power consumption should be slightly less than a single MSI R7950. I will need to investigate this further, as it would not hurt to actually try it out and compare the performance myself.
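The Q1 and Q2 percentages are just the second-hand price divided by the original purchase price. A tiny Python check of that arithmetic, using only the Gumtree and purchase prices quoted above:

```python
def resale_fraction(resale_price: float, purchase_price: float) -> float:
    """Fraction of the original purchase price recouped by selling second-hand."""
    return resale_price / purchase_price


# (second-hand price, price I paid) from the Q1 and Q2 answers.
cards = {
    "MSI HD 7850": (75, 169),
    "MSI R7950": (150, 289),
}

for name, (resale, paid) in cards.items():
    print(f"{name}: recoup {resale_fraction(resale, paid):.0%} of ${paid}")
```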

Asking questions also raises additional questions. Apparently, if we were really smart enough, we could ask the right question, but then we would already know the answer. Isn’t that strange!

[Note]  AMD CrossFire technology allows you to harness the power of multiple AMD graphics cards. In my case, the MSI HD 7850 has only one CrossFire connector, so it can run two cards in CrossFire, whereas the MSI R7950 has two CrossFire connectors, allowing a three-card configuration. Two cards do not give double the performance, but there is a significant increase. People choose to do this if they already have one card and just want to improve performance by adding another similar card. The downside is that buying two cards is usually slightly more expensive than buying one better card, especially now that the prices of AMD Radeon cards have fallen. In my situation, I already have these cards, so price is not really relevant unless I wish to get some return on my initial investment.

Restore.IT, Recover.IT – 2006 – When Murphy’s Law just wasn’t funny anymore! Or pages from the diary of a high-flying IT consultant and troubleshooter!

I was going through my old diaries with a view to putting the pages into the recycle bin when I came across an entry from 2006 that brought back painful memories (and perhaps tears to my eyes). This was a case where Murphy raised his head, and kept on doing so, with near-disastrous consequences (OK, I exaggerate; you can be the judge). The names have been changed to protect the innocent. A warning: this post is a long one, so feel free to zone out and zone back in again further down. I have put everything into a timeline, since that is how my diary records it, and racked my memory to fill in the gaps in my notes. I don’t have my original notes and detailed documents, because all files had to be returned to the company when I got WFR’ed in 2010.

February 2006 – I got outsourced to a well-known IT company. Basically we were given a choice: move over or leave. Leaving was not really an option for me, so I was the only one in Australia who got outsourced. 1 out of 3, not bad!

April 2006 – The customer’s office in Malaysia moved to another location, including servers, networks, etc., the whole kit and caboodle.

Fast forward to October 2006…

25th October 2006 – Malaysia server NTS4 (not its real name, but similar) shut down during the afternoon. Yes, NT does stand for Windows NT.

26th October 2006 – During the late afternoon, I hear about the outage – NTS4 was down at 13:09 Malaysia time yesterday. We arrange to get it restarted.

27th October 2006 – We find that the tape backup drive is not connected. We also find that the disk drive in slot 4 has failed. After some conference calls, we determine that the disk had failed before the site move, but apparently nothing had been done about it [Murphy 1: Customer 0]. This was escalated to get the drive replaced. The drive was part of a 5-disk RAID-5 array. RAID-5 can survive the loss of one disk, but the failed drive must be replaced as soon as possible; until then, the array operates in a degraded state with no fault tolerance.
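The reason RAID-5 survives exactly one failed disk is that each stripe stores a parity block equal to the XOR of its data blocks, so any single missing block can be recomputed from the others. The Python sketch below illustrates that principle on one stripe of a five-disk array (four data blocks plus parity). The 4-byte block contents are made-up values; real controllers work on much larger blocks and rotate the parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of a hypothetical 5-disk RAID-5 array: four data blocks plus parity.
data = [b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd", b"\x01\x02\x03\x04", b"\xf0\x0f\xf0\x0f"]
parity = xor_blocks(data)

# The disk in slot 4 (index 3) fails: rebuild its block from the survivors plus parity.
survivors = data[:3] + [parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[3]

# With a second block missing, the XOR has two unknowns and cannot be solved,
# which is why a degraded RAID-5 array has no fault tolerance left.
```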

28th October 2006 – An engineer is scheduled onsite in Malaysia in the morning. A short time later, I get a call: the NTS4 server is down. Conference calls for the next three hours. It appears that the server is unable to boot because it has lost another drive, although this time the error report indicates that a drive was removed. We check with the engineer, and he denies touching anything. All drives are still in the server, so if the drive was removed, it was put back in, but by then it was already too late, because the array controller now had two drives down. So what really happened? Anyway, I get on the phone to the regional service manager. I tell him that if the drive was removed, the data on it should still be intact, and I would (with 99% certainty) be able to recreate the array using the data from each disk drive except the one that had failed a long time ago. I also tell him not to let any engineers do anything to the server before I get there.

A few hours later, after more conference calls, we decide that we need to bring services back online. NTS4 was both a SQL database server and a Microsoft Exchange server. I commence copying Exchange installation files from Singapore to another Malaysia server, NTS8, which will become the replacement Exchange server.

Fast forward a couple of hours – I get a call that the server NTS4 has been fixed. That was when I got the knot in my stomach and the shivers down my spine, knowing what would come next: like standing on the edge of an abyss with the wind behind you getting stronger and stronger and nowhere to go.

What happened is a case of pride before prudence (not prejudice; OK, pun intended). My company’s Wintel Level 3 team is based in Malaysia. They are supposed to know everything there is to know about Windows and Intel servers; however, as I found out, they know little about data recovery. Pressure was put on them to resolve the problem: why should an outsider (yes, me, a newcomer with only 8 months in the company) be the only one who could fix it? When a big boss puts that to you, how can you say that you can’t fix it? Of course you can fix it. So what they did was replace both the long-failed drive and the one that had been removed. The array begins rebuilding, and there are smiles all around… except that the server does not boot. Of course, it was obvious to me, and I knew then that the data was essentially lost. [Murphy 2: Customer 0]

OK, no point crying over spilt milk. The only other course of action (with little hope) that I could suggest was to have the server looked at by a data recovery company. My company does not have a data recovery department (surprise), something that I had suggested, so an external company was required. A suitable company was located in Malaysia, and the server was packaged up to go to them; the cost would be 8,000-20,000 Malaysian Ringgits (irrelevant) with about a 1-week turnaround. I finally got to bed that Saturday at about midnight to try to get some sleep before a scheduled conference call 4 hours later.

29th October 2006 – It looked like that particular Sunday would be full-on, and I was right. A 04:00 conference call, followed by more work and more calls. I forgot to mention that I am also responsible for Microsoft Exchange 5.5 Level 3 support, especially for emergencies like server restorations. For about 7 hours, I work on installing Exchange 5.5 on NTS8, and finally, around 21:30, I get all the mailboxes created. I then spent the next 3 hours getting replication and the X.400 Connector working to the Singapore regional bridgehead. Got to bed at about 1AM.

30th October 2006 – Got up early on a fine Monday morning and started installing the Trend Micro ScanMail and End User Quarantine software for Exchange. I installed backup and service monitoring agents; yes, I basically install software for the entire infrastructure. Then, to prepare for Microsoft SQL Server, I copied install files to Malaysia from Singapore. We would use NTS8 for SQL, as the Malaysia customer office uses SQL as the database backend for its AccPac accounting software.

31st October 2006 – More work getting SQL installed, and finally ready to look at restoring databases. A problem arose trying to read from the NTS4 tapes: it looks like the tape drive wasn’t working for some reason. I would probably have to actually go there, since I am also the level 3 support for Arcserve backup software. My company didn’t really have people who knew much about these old applications, and I had been supporting and installing them since 1996. Anyway, the Malaysia customer office had email working, and the accounting database could wait until I got there to restore the NTS4 server from backup tapes.

Over the next 24 hours, I work on a site recovery and contingency plan. I knew that I would have to restore the NTS4 server from tape, so I would need to export the mailboxes from the restored server and import them into the new server. There were quite a few steps needed to effect a good recovery and minimize any further downtime. Towards the end of the job, I expected that there would be a number of late nights involved.

2nd November 2006 – The report back from the data recovery company was not good. They could not do anything because the array had been reinitialized. There were lots of files that could be recovered, but the main files we wanted were the Exchange & SQL databases and associated log files. These are very large, and much of their data had been lost due to the striping of the array: introducing two new drives forced a rebuild, which is basically a reinitialization. I estimated that about a quarter of the actual data had effectively been replaced with zeroes. The server would be returned to the customer site.

6th November 2006 – The NTS8 server is down. Oh no! An IBM engineer is requested, since this server is an IBM xSeries server. I thought at the time that I should start arranging my travel and book flights. I get approval from my manager to fly to Malaysia from Sydney to rebuild and restore NTS4 and to resolve NTS8. I get a call from the IBM engineer: the server is down due to a bad stripe. [Murphy 3: Customer 0] How can that happen? [It seems that if data within a stripe becomes inconsistent due to media errors, i.e. a bad block (or part of one) on a hard disk, then the stripe becomes bad. For instance, with three drives in RAID-5 and a block size of 16KB, each stripe holds two data blocks, so 32KB becomes unavailable, and if this is part of an operating system file, then that could be preventing the server from booting.] Flights arranged, SYD-SIN, SIN-KUL for the next day.

7th November 2006 – Left home at 05:30, heading to the airport for an 08:30 flight to Singapore. Arrived at about 13:30 Singapore time and waited for my 17:00 flight to KL. I get a call from the IBM engineer: he can fix the bad stripe. Really? OK, how? Delete the array and recreate it. Yeah, right! What about the data? No problem, the data should be fine. No thanks! I forbid him to do this, as I am on my way to Malaysia: don’t touch the server until I tell you to! I can be forceful when I need to be. Deleting and recreating the array would definitely lose the data, and I was not going to lose two servers in a row if I could help it! I finally get my flight, arrive in KL and head to the hotel, arriving around 19:30, just in time for dinner. Best to eat and get a good night’s sleep, because tomorrow would be a long day.

8th November 2006 – Arrived at the Malaysia customer site at 08:25. I have a look at both the NTS4 and NTS8 servers. I carry a couple of Linux CDs with me all the time. I planned to boot each server with a Knoppix live CD and run a “cat /etc/fstab” command; this would list the drives and file systems that Knoppix (Linux) recognizes as being available.

NTS8 – single drive, 2 partitions. /dev/sda1, i.e. the C: drive on NTS8, is corrupted at about the 7.5GB point. /dev/sda2, the D: drive, appears intact. Fantastic, because this is where the Exchange server databases and logfiles are stored, and it means that I can “recover.it”. If I could get those Exchange databases and logfiles copied and restored successfully in the correct manner, the users would have all their email up to the point of failure, which was the best that anyone could hope to achieve. I scrounged around looking for a machine with sufficient storage capacity and finally found a relatively new desktop with enough space. I enabled Samba, shared /dev/sda2 and started copying the Exchange databases, logfiles, etc. to the desktop machine. I also wanted the Arcserve databases and logs. It took a while because the files were quite large, especially the Arcserve ones, even though Exchange had only been running for about a week before it went down.

I reconfigured NTS4 to connect all of the disks to the inbuilt SCSI controller instead of the Smart Array controller. Knoppix recognized 4 drives (18GB, 18GB, 18GB and 36GB), and I set up a copy of each disk’s contents across the network to my laptop. I would use this data to test my Perl script, the one that I would have used to rebuild the data if the disk array had been left alone as I had requested, instead of being interfered with and effectively destroyed by the reinitialization process.
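The Perl script itself isn’t reproduced here, but the idea behind rebuilding a RAID-5 logical drive from raw disk images can be sketched in Python. This is a minimal illustration under stated assumptions, not the original script: it assumes a left-symmetric layout (parity rotating backwards from the last disk, with data continuing on the disk after the parity), a known chunk size, the member images in their original array order, and data starting at offset 0 on every member. The real Smart Array layout, chunk size and any metadata offset would have to be worked out from the disk images first. The function names are made up for this sketch. At most one member may be missing (passed as None); its chunks are recomputed from parity.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR equal-length byte chunks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def stripe_raid5(volume, n_disks, chunk_size):
    """Lay a volume out across n_disks members (assumed left-symmetric RAID-5).

    Test helper only: builds synthetic member images to feed destripe_raid5.
    """
    per_stripe = (n_disks - 1) * chunk_size
    if len(volume) % per_stripe:
        raise ValueError("volume length must be a whole number of stripes")
    disks = [bytearray() for _ in range(n_disks)]
    for stripe in range(len(volume) // per_stripe):
        base = stripe * per_stripe
        data = [volume[base + k * chunk_size: base + (k + 1) * chunk_size]
                for k in range(n_disks - 1)]
        parity_disk = (n_disks - 1) - (stripe % n_disks)
        for k, chunk in enumerate(data):
            disks[(parity_disk + 1 + k) % n_disks] += chunk
        disks[parity_disk] += xor_chunks(data)
    return [bytes(d) for d in disks]


def destripe_raid5(disks, chunk_size):
    """Reassemble the logical volume from RAID-5 member images.

    disks: member images in array order; at most one may be None (missing).
    """
    n = len(disks)
    missing = [i for i, d in enumerate(disks) if d is None]
    if len(missing) > 1:
        raise ValueError("RAID-5 cannot be rebuilt with more than one member missing")
    # The array only uses the capacity of its smallest member.
    size = min(len(d) for d in disks if d is not None)
    volume = bytearray()
    for stripe in range(size // chunk_size):
        offset = stripe * chunk_size
        chunks = [None if d is None else d[offset:offset + chunk_size] for d in disks]
        if missing:
            # The missing chunk is the XOR of every surviving chunk (data and parity).
            chunks[missing[0]] = xor_chunks([c for c in chunks if c is not None])
        parity_disk = (n - 1) - (stripe % n)
        for k in range(n - 1):
            volume += chunks[(parity_disk + 1 + k) % n]
    return bytes(volume)


if __name__ == "__main__":
    original = bytes(range(256)) * 4            # synthetic 1 KiB "logical drive"
    members = stripe_raid5(original, n_disks=5, chunk_size=16)
    members[3] = None                           # pretend one member is unreadable
    assert destripe_raid5(members, chunk_size=16) == original
    print("logical drive reassembled from 4 of 5 members")
```

The same XOR step also shows why the rebuild with two replaced drives was fatal: once two members hold new contents instead of their original data, there is no longer a single unknown per stripe to solve for.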

When the copying from NTS8 had completed, I started copying the files from the desktop to my USB disk. Careful is my middle name, especially when it comes to critical customer data.

I rebooted NTS8, as it was time to “restore.it”, and booted from the IBM ServerGuide CD. I erased the disk array and then started the install, which would create a new array and then install Windows 2000, since this is what had been running on NTS8. However, it hung at “Setup is starting Windows 2000”. Bummer! Anyway, it was late (20:45), so better to get some rest and start afresh in the morning. I called the IBM engineer, explained what I had done, and told him that his services were no longer required. He could go ahead and close the call-out ticket.

9th November 2006 – In the office early again. I worked on NTS8 again, trying the install a couple of times until finally the penny dropped: I disconnected the tape drive and tried again. Success! It seemed that the ServerGuide installation would hang trying to detect additional hardware, so it was best not to give it hardware that it would find and not know what to do with. Windows 2000 Server installed, great. I then quickly installed the Arcserve backup application and restored the D: drive, then the C: drive including the system state. This overwrote the fresh installation with what had been backed up during the last full backup, which fortunately was the night before the crash. When the server was ready, I rebooted it and then stopped all of the Exchange related services. I started copying the databases and logfiles from the desktop machine, which should put back the files up to the point of failure, at least for the email system.

Done: the files are back in place. A quick check showed they looked OK, with the same file sizes as on the desktop. It was necessary to run a recovery process so that the files could be fully integrated into Exchange and the system registry. I ran the following commands. Unfortunately, I cannot give you a lot of detail on them, as it isn’t relevant to this post, but suffice to say that the commands and their specific order are necessary, as any Exchange 5.5 level 3 engineer will tell you.

“eseutil /g” – a few errors seen, not a problem as they were expected. “eseutil /r” – soft recovery completed successfully. Started the System Attendant and Directory Service services for Exchange, then logged off and logged on with the Exchange service account. “isinteg -patch” – completed, no errors. Started all remaining Exchange services and, voila, Exchange is running. Then I fixed Trend Micro ScanMail, whose antivirus patches were not updating.

All users are informed that email is now accessible and that their mail should be intact up to the point of failure. Hooray! [Murphy 3: Customer 1]

I then copied the Arcserve databases back into place, so that Arcserve was back to its state at the time of the server crash. All done. It was time to look further at NTS4. I reconnected the disk drives to the array controller, as by then I had copies of all the disk contents and could work on rebuilding the server. I installed Windows NT 4.0. While that was happening, I had a look at the tape drive to find out why it was not being recognized. I saw some bent pins in the SCSI connector. How did that happen?

[Photo: bent pins in the tape drive’s SCSI connector]

The penny dropped (it happens a lot!). During the site move in April 2006, they would have disconnected the cables to move the equipment and then reconnected them. Whoever reconnected the cable to the tape backup unit obviously did so very clumsily, and the backup unit was not tested afterwards. [Murphy 4: Customer 1]

10th November 2006 – I had to check out of the Crystal Crown Hotel and move to another hotel, the Hilton PJ, later in the day. When booking flights and accommodation at short notice, we could not always get the one hotel for the entire stay. Flights to Singapore and then back to Sydney were reserved. Installed SQL Server 2000 onto NTS8 in preparation for restoring the AccPac databases. A slight (to put it conservatively) hitch had to be resolved: the last backup of NTS4 was probably the one before the office move in April. What to do? OK, not my problem; someone else could worry about that. I continued with my recovery plan to finish the NTS4 reinstallation in preparation for data restoration from tape.

11th November 2006 – The last backup tape of NTS4 (17/04/2006) was merged into the Arcserve database on NTS8, which was needed before restoring from the tape was possible. Restored two backup sessions to a temporary folder on NTS8. An attempt to restore session 3 resulted in session 2 being found instead. What gives? [Murphy 5: Customer 1]

12th November 2006 – It appears that Arcserve 6.61, when doing a full drive backup, would allocate space on the tape based on the expected backup size; however, during the backup some files may be unavailable, so the actual backup is smaller, leaving slack space on the tape. This was causing a problem with the restore, because the tape could not be positioned to session 3 properly. Actually, on further analysis, there appeared to be an extra session between 2 and 3, so that 3 was not real, and trying to restore 3 ended up with 2. Restoring session 4 just failed, because if it got to 4, it would see 3 and fail. Pulling my hair out just didn’t help. To rule out a tape drive problem, I decided to copy the tape to different tape media. I used the tapecopy command to copy all sessions from the DLT4 tape to an SDLT1 tape. As it was going to take some time, I began analyzing the data I had collected from the NTS4 disks before the reinstallation. I updated my Perl script so that I could recreate the logical drive, as an academic exercise.

13th November 2006 – The tapecopy had completed. After deleting the tape from the database, I re-merged the tape in Arcserve – to my immediate relief, all backup sessions were visible and in the correct order. [Murphy 5: Customer 2]

I was able to restore the first three sessions, comprising C:, D: and F:, and the restore of the fourth session, the System Registry, was also successful. Next on the list was restoring the SQL databases, and with it another hitch: the restore failed with “no valid destination”. I could not restore the databases to NTS8 when they had been backed up on NTS4; this was apparently a limitation of the backup agent. NTS4 and NTS8 were on different Windows domains, so I had to establish a trust between the two domains, after which I was able to restore from NTS8 directly to NTS4, to the original location. It wasn't quite that straightforward, as a reboot was involved and the Master database had to be restored first before restores of the other SQL databases would work, but it was done. [Murphy 5: Customer 3]

Unfortunately, we didn't really want the SQL databases back on NTS4 because that server was already obsolete, so we decided at the time that another server, NTS5, would become the SQL database server. Since SQL Server was no longer needed on NTS8, it was uninstalled; it had only ever been intended as a temporary install for restoring the databases.

14th November 2006 – It was time for the Exchange database restoration on NTS4. The Exchange site was isolated to avoid replication, which is essential when doing an online restoration of old databases. The Exchange database restore was started. In the meantime, I worked on NTS5 to install SQL Server and the Arcserve backup agents. I also did some further work on my Perl script for the RAID recovery test.

Whew! Still reading this? I did say that this was a long post. Anyway, to cut a long story short, in the remaining days of that week the Exchange databases were restored to NTS4. Exchange was brought up and we verified that the mailboxes were intact, which was fantastic. All mailboxes were then exported to PST files using the ExMerge program, and these PST files were uploaded to the NTS8 Exchange server. All of the users were happy to get more emails back, but not so happy that the emails from 18/04/2006 to 28/10/2006 were irretrievably lost. The SQL databases were also moved to NTS5, and my job in Malaysia was done except for some cleanup actions that could be done remotely. [Murphy 5: Customer 4]

This was an example of some of the things that I encountered during my roving life as an IT consultant and troubleshooter. In those couple of weeks I had to contend with multiple failures involving disk arrays and had to perform server recoveries and restorations under difficult circumstances.

Did we break even at Murphy 5: Customer 4? Doesn't look like it. Oh yes, the backup tape was six months old, so what about the AccPac accounting databases, I can hear you asking? My company had to hire a number of data entry people to input all the accounts for the six months or more, based on the accounting printouts they had. Lucky they had hard copies, right? And yes, the whole accounting process had to be followed: April data entry, then April end-of-month closure, printout, May data entry… A month or so later, the data entry was completed, and AccPac was rolling ahead! [Murphy 5: Customer 5]

[PS] I feel a bit sorry to put you through all of this, but I hope you understand that an IT problem is not always straightforward. I also tried to keep the relevant details, as others may encounter this situation in the future and may find some help in this post. I forgot to mention that I did finish my Perl script to recreate the logical drive of the failed array; then, during analysis, I was able to show conclusively that the array had been reinitialized, which was why the data was lost. Further to this, I was able to confirm during testing on equivalent hardware that, taking a 5-drive RAID-5 array, I could pull out one drive and lock it away to simulate an old failed drive, then pull out a second drive to crash the array, and read the contents of the four disks still available (including the one pulled to crash the array). I could then use my Perl script to recreate the data on the locked-away drive, and also to recreate a logical drive, the same as converting the array into a single larger drive. All this using Perl, a scripting language that is over 20 years old, with the script comprising only a small number of actual commands. For those of you who know Perl, you will understand “$b5 = $b1 ^ $b2 ^ $b3 ^ $b4;”: that is the magic line. Everything else was just definitions, reading, writing and looping.
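The magic line works because RAID-5 parity is the XOR of the data chunks in each stripe, so any one missing chunk (data or parity) equals the XOR of the chunks on the other disks. The Python below is a minimal sketch of the same idea, not the Perl script used on the failed array: the function names are hypothetical, the disk images are held in memory, and the de-striping step assumes a left-symmetric parity layout (the Linux md default) with a tiny 16-byte chunk for the demo. A real controller's layout, chunk size and disk order would have to be worked out from the disk contents first.

```python
"""Sketch: RAID-5 rebuild by XOR, plus de-striping into one logical image.

Hypothetical helper names; assumes equal-sized in-memory disk images and a
left-symmetric parity layout (the real controller's layout may differ).
"""


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def parity_disk(stripe, n_disks):
    """Left-symmetric layout (assumed): parity starts on the last disk and moves left."""
    return (n_disks - 1) - (stripe % n_disks)


def stripe_logical(data, n_disks, chunk):
    """Split a logical image across n_disks with rotating parity (test helper)."""
    per_stripe = (n_disks - 1) * chunk
    if len(data) % per_stripe:
        raise ValueError("data must fill whole stripes")
    disks = [bytearray() for _ in range(n_disks)]
    for s in range(len(data) // per_stripe):
        base = s * per_stripe
        chunks = [data[base + i * chunk: base + (i + 1) * chunk]
                  for i in range(n_disks - 1)]
        p = parity_disk(s, n_disks)
        for i, c in enumerate(chunks):
            disks[(p + 1 + i) % n_disks] += c  # data chunks follow the parity disk
        disks[p] += xor_blocks(chunks)         # parity = XOR of the data chunks
    return [bytes(d) for d in disks]


def reconstruct_disk(disks):
    """Rebuild the single missing disk (None) as the XOR of all the others.

    This is the whole-image version of $b5 = $b1 ^ $b2 ^ $b3 ^ $b4, and it
    works for data and parity chunks alike, whatever the layout.
    """
    missing = [i for i, d in enumerate(disks) if d is None]
    if len(missing) != 1:
        raise ValueError("RAID-5 can rebuild exactly one missing disk")
    return xor_blocks([d for d in disks if d is not None])


def assemble_logical(disks, chunk):
    """Read data chunks in logical order, skipping parity (left-symmetric)."""
    n = len(disks)
    out = bytearray()
    for s in range(len(disks[0]) // chunk):
        p = parity_disk(s, n)
        for i in range(n - 1):
            d = (p + 1 + i) % n
            out += disks[d][s * chunk:(s + 1) * chunk]
    return bytes(out)


if __name__ == "__main__":
    logical = bytes((i * 7 + 3) % 256 for i in range(640))  # 10 stripes of 4 x 16 bytes
    disks = stripe_logical(logical, n_disks=5, chunk=16)
    damaged = list(disks)
    damaged[2] = None                       # the drive "locked away"
    damaged[2] = reconstruct_disk(damaged)  # recreate it from the other four
    print(damaged[2] == disks[2], assemble_logical(damaged, 16) == logical)  # prints: True True
```

The rebuild step does not depend on the layout at all, because every stripe sits at the same offset on every disk; only `assemble_logical`, which produces the single larger drive, needs to know where the parity rotates.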

Maybe we could make a movie out of this – but of course, no car chases, no martial arts, no gunplay, no scantily clad women – no fun, right?

Repair.IT of course – Corsair H100i liquid cooling standoff

There is a kit from Corsair that contains the standoffs we could use to replace the one that broke. We checked a couple of computer shops, but the kit was not in stock, with no information about when it might be available.

The results are in – the majority of votes (being one) are to “repair.it” – great, another use for my machinery.

Alright, let's get into it. The standoff whose screw stud snapped off has an M3 thread, i.e. a metric 3mm thread. I happened to have M3 machine screws of different lengths in my cupboard, and I chose a 20mm machine screw. The first thing is to put the standoff in my lathe chuck, then face it off.

To face off, in metalworking terms, means to machine the end face flat, i.e. remove the remaining threaded portion. Next, I use a 2mm centre drill to start a hole in the end face. A centre drill, as the name suggests, is used to start a hole exactly in the centre; if I started with the 2.5mm drill that I need, I might end up with an off-centre hole. After this, I follow up with the 2.5mm drill. To work out what size hole is needed to tap a particular thread, we refer to a tapping chart: to tap an M3 thread, I need to drill a hole 2.5mm in diameter. Strictly, a 2.46mm drill would be ideal, but these aren't commonly available, hence the nearest size, 2.5mm.
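The two drill sizes can be checked with a little arithmetic. A minimal sketch, assuming the ISO metric coarse thread (M3 has a 0.5mm pitch) and a hypothetical helper name: the basic minor diameter of an internal thread is D1 = D - 1.0825P, which for M3 comes out at about 2.46mm and is presumably where that figure comes from, while the usual tapping-chart rule of thumb, D - P, gives exactly 2.5mm.

```python
import math


def metric_tap_drill(major_mm, pitch_mm):
    """Return (basic minor diameter D1, rule-of-thumb drill D - P) in mm.

    Assumes an ISO metric 60-degree thread; the function name is hypothetical.
    """
    # Height of the fundamental triangle of the 60-degree thread form.
    h = math.sqrt(3) / 2 * pitch_mm
    # Internal thread basic minor diameter: D1 = D - 2 * (5/8) * H, about D - 1.0825 * P.
    minor = major_mm - 2 * (5 / 8) * h
    # Common workshop rule of thumb for the tap drill.
    rule_of_thumb = major_mm - pitch_mm
    return minor, rule_of_thumb


# M3 coarse thread: 3mm major diameter, 0.5mm pitch.
d1, drill = metric_tap_drill(3.0, 0.5)
print(f"M3 x 0.5: D1 = {d1:.2f}mm, D - P = {drill:.1f}mm")  # prints: M3 x 0.5: D1 = 2.46mm, D - P = 2.5mm
```

Drilling at D - P rather than D1 leaves the internal thread very slightly shallower than the full basic profile, which also makes the tap a little easier to turn.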

The above photo shows the standoff after tapping. I use cutting fluid on the tap while tapping, backing off from time to time as recommended. After tapping is completed, it is necessary to clean the tapped hole; I use a duster spray with a thin nozzle to blow out the metal bits. Next step: screw my M3 machine screw into the tapped hole.

Now I cut the end off with a hacksaw. The length is not critical as long as it is similar to the original; it does not have to be exact.

Then the end is cleaned up with a metal file. I try an M3 nut to make sure that the threads will engage without difficulty.

This photo shows the two side by side; the repaired standoff is on the right, where you can see the different, silvery screw. And yes, on the one on the left, the bottom thread appears slightly bent because it is. It shouldn't be that way, but I won't try to straighten it. Great, I have “repaired.it”. Overall it took about 20 minutes, because I did not want to rush.

[Note]  When using metalworking machinery, it is always important to be safe. Wear eye protection: the little bits of metal can be so small that you might need a magnifying glass to see them, especially when spraying into the tapped hole. My lathe does not have a very low speed; the minimum is 100rpm, which is a bit too high for tapping, so it is necessary to tap manually. I disconnect the lathe's power plug, then install a spindle handle so that I can manually turn the chuck holding the standoff. In this way, I can hold the tap handle in my right hand and turn the chuck with my left. When working manually on machines like this, it is so easy to forget what you are doing and accidentally hit the power button, often with disastrous results.