Recover.IT – Asus Taichi 21 Notebook

I haven’t been writing much lately, so it is time to get a few posts out of the way. Some weeks ago, I was asked about an Asus Taichi 21 notebook that had suddenly stopped working. This notebook has a dual screen: open it up and it works as a normal notebook, close the lid and the back screen comes up as a tablet. Neither screen was operating, and it had been sent to Asus to look at. I suggested that I should be able to get the owner’s data off the notebook, as Asus would not provide this service. Eventually a quotation was received which was quite high – you could buy a second-hand Asus Taichi 21 on eBay for much less than the quote – so in the end it came to me to look at and to get some very important files from it.

On inspection, the notebook has an internal SSD which at first glance looks like a normal mSATA or M.2 SSD; on closer inspection, however, it is quite different. Further research indicated that there were adapters available that would convert this SSD to standard SATA – and I was fortunate enough to find a local Sydney supplier that had one of these in stock for $20 or so. I ordered one, and when it was ready – went for a short drive to pick it up. Now the adapter looked like it wasn’t the right one, but they assured me that it would work. The socket is much larger and is not quite the same as the socket on the motherboard, but after some further research, I decided that it should work. Of course, this carries a risk of destroying much wanted data – but there were no indications on the internet that these adapters posed a problem.

dsc_0285

This shows the adapter with the SSD installed.  Note the size of the socket.  This adapter is used for the Asus Taichi and UX21/31 notebooks.  See below, for a photo of the motherboard with its socket.

dsc_0279

You can clearly see the difference, as the motherboard socket has 6 and 12 pins, but the adapter socket has many more pins. Anyway, I connected the adapter to my recovery machine, and it was recognized by the BIOS and by my Ubuntu operating system. I went to mount the disk, but it complained that the partition had not been cleanly unmounted.

No real problem – the way to get around this is to mount it as read-only, which will ignore the dirty bits, as I only want to copy data from it. After doing an “fdisk -l” to list the partitions, I eventually used the “mount -t ntfs -o ro” command to mount the partition and was then able to copy the required data to an external USB disk. I copied the folders and contents of the D: drive, as this was what was required.
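For reference, the steps above can be sketched roughly as follows. The device name /dev/sdb2 and the mount point /mnt/taichi are assumptions for illustration only, since the post does not record them. The small run helper is also just an illustration: it prints each command unless RUN=1 is set, so nothing touches a disk by accident.

```shell
#!/bin/sh
# Assumed names: the post does not say which device or mount point was used.
DEV="${DEV:-/dev/sdb2}"
MNT="${MNT:-/mnt/taichi}"

# Print each command by default; set RUN=1 to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

run sudo fdisk -l                            # list the disks and their partitions
run sudo mkdir -p "$MNT"                     # create a mount point
run sudo mount -t ntfs -o ro "$DEV" "$MNT"   # read-only mount of the NTFS partition
```

Mounting read-only means the source volume is never written to, which is what you want when the only goal is to copy files off it.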

After that, I reassembled the notebook and that was that, or was it?  A quick search of the internet showed that the motherboard “60-NTFMB1102-D07” was available for a few hundred dollars which would likely fix the notebook, but that is another story.

Recover.IT – HP EX490 MediaSmart Server – Part 3

Ok, where was I? Yes, the faulty disk in my MediaSmart Server – eventually the removal process came to an end where there were only a few files left over – about 10 – which were old and unnecessary. I tried to delete them, but each time I deleted them, they came back. So, I had to log onto the server through the Windows console, and then run a chkdsk command on the disk. I had to run it twice and after that, was able to do the removal again, and finally the disk was blinking – indicating that it was ready for removal.

As this disk had some bad sectors in one area, sometimes we can do a security erase on the disk. The security erase is an internal function of the disk drive firmware and we have to go through a little process to do this. The security erase will effectively perform a factory format of the disk surface which in general should rewrite the entire data surface and we should (theoretically) end up with no bad sectors.

ScreenShot074

The first hdparm command interrogates the disk and writes the output to a file; we need to check that the drive isn’t frozen. The second hdparm command sets a master password, which I have called llformat (meaning low level format). Then we check the drive again to confirm that it is enabled for erasing – which was confirmed.

The final hdparm command tells the drive to commence a security erase, which would take approximately 322 minutes, so it was time to leave it and let it run and check it in the morning, which I did. Afterwards, I ran a diagnostic on the drive, but had some strange error – it seems that the drive now thinks that it is a 1TB drive and not a 2TB drive. I checked the hdparm output from the initial command.
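Since the screenshot is not reproduced here, the following is only a sketch of the commonly documented hdparm security erase sequence, not a copy of what was on screen. The device name /dev/sdX is a placeholder, and the use of the user password slot (--user-master u) is an assumption that may differ from the original commands. The run helper prints the commands unless RUN=1 is set, because the last one destroys all data on the drive.

```shell
#!/bin/sh
# Assumptions: placeholder device name, and the user password slot (u),
# which is the commonly documented sequence; the original may have differed.
DEV="${DEV:-/dev/sdX}"
PASS="llformat"

# Print each command by default; set RUN=1 to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

run sudo hdparm -I "$DEV"    # Security section should say "not frozen"
run sudo hdparm --user-master u --security-set-pass "$PASS" "$DEV"
run sudo hdparm -I "$DEV"    # Security section should now say "enabled"
run sudo hdparm --user-master u --security-erase "$PASS" "$DEV"   # erases the whole drive
```

If the drive reports that it is frozen, the set-pass and erase commands will be rejected, which is why the first check matters.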

ScreenShot072

Definitely it shows that it is 2TB (2000GB) as indicated by the device size with M = 1000*1000 – i.e. M = 1 million bytes.  The hdparm command I ran after the erase had finished is shown here.

ScreenShot073

Definitely, here again, it says it is a 1TB = 1000GB drive – what is going on? The serial number is the same, so I am not dreaming – and definitely, it thinks it is 1TB. I also ran a smartctl command to check the status of the SMART data on the drive.

ScreenShot071

It shows that parameter 184 End-to-End_Error is FAILING_NOW, so basically the drive is failing – so should not be used for anything critical as it could stop working at any time. A pity because a 2TB drive could still be handy to play around with, but now it is 1TB.

I performed an erase again, and it seems that it still is a 1TB afterwards, so definitely there is a problem somewhere – maybe in the firmware. Suspiciously, there is nothing like this when I do a Google search. Maybe if someone knows how this happened, they can let me know and we can try to reverse it. I have performed security erase on numerous drives without this happening, so it would be good to know.

 

Recover.IT – HP EX490 MediaSmart Server – Part 2

This is part 2 of the recovery of the HP EX490 MediaSmart Server – which is a Windows Home Server machine. The second drive on this server was seen to be offline, so I had shut down this server to investigate the problem.

Last Saturday, I ran some tests on the drive as previously mentioned. On Wednesday night, I decided to copy the disk to a new 3TB disk that I had just bought – a Toshiba 3TB drive with 3 years warranty for $127 each from a local computer shop. I thought that this was a good price.

Anyway, as you might have guessed – I connected this disk on a Linux machine. The Linux in this case was Ubuntu. I used the dd command (that I have previously mentioned) to copy raw data from the disk directly to the new disk.

dd if=/dev/sdb of=/dev/sdc conv=noerror,sync 2>&1 | tee -a ./logfile.txt

Now, of course – I did a smartctl -a /dev/sdb first to check the source disk, and then another one – smartctl -a /dev/sdc to confirm the destination disk. The source disk is a Seagate – correct, and the destination disk is a Toshiba – also correct, so I was good to go. It is good to check, and don’t assume that because the Seagate is connected to SATA0 and the Toshiba is connected to SATA1 that the disk designations will be in the right order.

Ok, so on Wednesday, the copy was started – then I went back to the machine some time later to check on its progress and I see these errors on the display.

dd: error reading ‘/dev/sdb’: Input/output error
6364552+0 records in
6364552+0 records out
3258650624 bytes (3.3 GB) copied, 346.982 s, 9.4 MB/s
dd: error reading ‘/dev/sdb’: Input/output error
6364552+1 records in
6364553+0 records out

6,364,552 sectors were read and copied before an error occurred. The noerror parameter means that it will continue, and sync means that the unreadable sector will be replaced on the destination with a blank sector. I stopped the copy at that time, since it is not a good idea to keep trying to read bad sectors in case the drive decides to quit permanently.
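As a quick sanity check on those numbers: when no bs= is given, dd uses 512-byte blocks, so each record is one 512-byte sector. A minimal sketch of the arithmetic:

```shell
#!/bin/sh
# dd's default block size is 512 bytes when bs= is not specified.
records=6364552
block_size=512

bytes=$((records * block_size))
echo "$bytes bytes copied before the first read error"
```

The result matches the 3258650624 bytes that dd reported.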

Then last night, I decided to copy from a point after this sector. This time I used this command line and let it run overnight after it seemed to start without throwing up any errors.

dd if=/dev/sdb of=/dev/sdc conv=noerror,sync bs=1M skip=4000 seek=4000 2>&1 | tee -a ./logfile.txt
1903728+1 records in
1903729+0 records out
1996204539904 bytes (2.0 TB) copied, 24148.3 s, 82.7 MB/s

For that command, I set a block size (bs) of 1MB, then used the skip and seek parameters to begin at a point 4000MB into the drive, on both the source and the destination. I checked this morning when I woke up, and found that it had completed successfully – the time taken for the copy works out to about 6.7 hours.
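To relate the second pass to the first, it helps to convert the 4000-block offset into a 512-byte sector number, and the elapsed seconds into hours. A small sketch of that arithmetic (dd treats bs=1M as 1024*1024 bytes):

```shell
#!/bin/sh
skip_blocks=4000
bs=$((1024 * 1024))                 # bs=1M in dd means 1048576 bytes

start_byte=$((skip_blocks * bs))
start_sector=$((start_byte / 512))  # offset expressed in 512-byte sectors
hours=$(awk 'BEGIN { printf "%.1f", 24148.3 / 3600 }')

echo "second pass starts at byte $start_byte (sector $start_sector)"
echo "elapsed time: $hours hours"
```

So the second pass started at sector 8192000, comfortably past the area where the first pass hit errors.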

This evening, I also bought a Toshiba 2TB disk drive on my way home – I will talk about this later on. Ok, so I had copied about 3.3GB on Wednesday before it hit the bad sectors. Last night – I started the copy at 4GB or thereabouts onwards and it copied to the end. Now I did a few more copying commands – I won’t bore you with all of the details however the result was to copy the remaining good sectors, using the count parameter to specify how many blocks to copy.

Eventually, I had copied every sector that was able to be copied. It turns out that sectors 6,364,553 to 6,364,568 – 16 of them – were unable to be read, which is not too bad. I also copied a couple of blocks before and after the bad sectors and had a look at the data – it seems to be file information, most likely parts of the Master File Table – which means that a few files are potentially lost.
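The exact follow-up commands are not listed in the post, so the following is only an illustration of how the skip, seek and count parameters could be worked out for the gap between the bad sectors and the start of the overnight copy. It takes the sector numbers quoted above at face value; the variable names are made up for this sketch, and the dd command is only printed, not executed.

```shell
#!/bin/sh
# Sector numbers as quoted in the post (taken at face value).
last_bad=6364568
resume=$((last_bad + 1))                     # first sector after the bad run
second_pass_start=$((4000 * 1024 * 1024 / 512))  # where the bs=1M skip=4000 copy began

count=$((second_pass_start - resume))        # sectors still missing in between

# Hypothetical command for the gap; printed only.
cmd="dd if=/dev/sdb of=/dev/sdc bs=512 skip=$resume seek=$resume count=$count conv=noerror,sync"
echo "$cmd"
```

Using the same value for skip and seek keeps every sector at the same offset on the destination as on the source.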

Ok, this is where my new 2TB drive comes in. I put the faulty Seagate drive back into the EX490, and then added the new Toshiba drive into the top-most bay. After powering up the MediaSmart Server, and waiting – I was eventually shown two solid green lights – which means that the Seagate drive is now online together with the main WD drive, and one blinking green light which was the new Toshiba drive. I logged onto my Windows Home Server console and went into Server Storage and proceeded to add the new drive.

Screenshot 2016-08-05 19.45.01

The idea is to add the new Toshiba drive, so that WHS knows that it is available for storage, and then tell WHS that I want to remove the Seagate drive.

Screenshot 2016-08-05 19.45.51

You might ask, why am I doing this? The drive has bad sectors – it isn’t a good idea to keep using it. Also WHS allows me to remove this disk – by moving and redistributing the files on the disk to other available disks, like the new one that I just added.

Screenshot 2016-08-05 19.46.11

Great, it says that I have sufficient storage space to have this drive removed.

Screenshot 2016-08-05 21.42.40

Ok, I am not actually going to sit here and wait for it, but eventually it will (hopefully) tell me that the drive is ready to be removed. Depending on how full the disk drive was, it can definitely take many hours. Windows Home Server is actually really good, because most storage systems don’t allow you to remove disk drives once they have been used for storing data.

What about the 3TB drive, you are thinking? That is for insurance – in case the disk stops working during the removal, then I have a copy of it that I can use to copy files from. If this removal works successfully, then my 3TB drive can be retasked. By the way, Windows Home Server cannot use disk drives larger than 2TB without major surgery. The reason for this is that WHS uses partitioning based on the Master Boot Record. In order to use drives larger than 2TB, it is necessary to use GPT partitioning – but that is another story.
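The 2TB figure comes from the MBR partition table storing sector addresses as 32-bit numbers. A minimal sketch of the arithmetic, assuming the usual 512-byte logical sectors:

```shell
#!/bin/sh
# MBR partition entries use 32-bit sector counts.
max_sectors=$((1 << 32))
sector_size=512                      # assumption: 512-byte logical sectors

max_bytes=$((max_sectors * sector_size))
tib=$((max_bytes / 1099511627776))   # 1 TiB = 1024^4 bytes
tb=$(awk -v b="$max_bytes" 'BEGIN { printf "%.1f", b / 1e12 }')

echo "MBR limit: $max_bytes bytes = $tib TiB (about $tb TB)"
```

A 3TB drive is well beyond that limit, which is why it needs GPT.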

What about the 16 bad sectors on this Seagate drive? Once I take it out, I plan to do a factory erase on the Seagate drive – this should rewrite every sector on the disk, including the bad ones, and I should end up with a disk drive without bad sectors. I can then use it either for temporary storage of non-critical data or run lots of diagnostics on it to see if it is continuing to fail. If it holds up to the diagnostics, maybe it gets a second chance at life.

In the meantime, I am off to bed!

Recover.IT – HP EX490 MediaSmart Server

Yesterday, I noticed that the Windows Home Server icon in my taskbar was red.  I opened it up and saw some file conflicts – that is strange.  I could access the files in the server, so what is going on – then the penny dropped, it says that a disk drive is missing. I went out to the computer area and could see only one disk was lit up, the second one is not lit – meaning that it is offline. I went back to the console and shut down the server – which eventually it did, albeit slowly because it had stopped responding for a long time before I could hit the Shutdown button.

DSC_0098

Some of you may have heard about Windows Home Server, many probably haven’t. WHS was a great product for its time – a semi-redundant network storage device that could be packaged like a NAS. I bought this HP EX490 MediaSmart Server back when it was available in 2009. That is the box on the right in the photo above, ok – a little dusty even though it sits on a shelf 2m above the floor. It came with a single Seagate 1TB disk drive, and over the next few years went to 4x1TB drives, then eventually to 2x2TB drives. The files can be stored in folders that are shared out – and each folder/share can be configured to be redundant or not.

Ok – back to the problem at hand: one of the two drives – the Seagate 2TB – had apparently stopped working. After the server had shut down, I pulled out the second drive and connected it to my test/recovery machine. This second drive was able to spin up, and I ran a few commands on it to determine what the issue with the drive was, and then shut down. I didn’t want to keep the drive running until I had a way to copy its contents – having temporarily run out of disk storage space recently.

One of the commands that I run is “smartctl -a /dev/sdb”, which on Linux will display the SMART data from the disk drive that is connected as /dev/sdb. The interesting things I am looking for are the Reallocated Sector Count and whether any of the SMART attributes show that the drive has failed. None of them did, and the Reallocated Sector Count was 14760, which is a little high – but this can be normal for the drive. The Power On Hours was 34,235, which equates to nearly 4 years – the drive itself is 5 years old. If I hadn’t used the drive straight away – this might be ok.
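The conversion from power-on hours to years is simple enough to check on the command line. A tiny sketch, using 365-day years:

```shell
#!/bin/sh
power_on_hours=34235

# 24 hours a day, 365 days a year.
years=$(awk -v h="$power_on_hours" 'BEGIN { printf "%.1f", h / (24 * 365) }')
echo "$power_on_hours hours is about $years years of running time"
```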

Of course, there were other values to be considered. Attribute 187 – Reported Uncorrectable was 0, 188 Command Timeout was 1, 197 Current Pending Sector Count was 216 and 198 Offline Uncorrectable Sector Count was also 216. Now – these last two are concerning – generally a non-zero number on these can indicate that the drive is having issues, and we should plan to replace it.

Smartctl also reports SMART errors that the drive has recorded – the main one occurred at 34,227 hours – like 8 hours before I noticed the problem and shut it down. This was error 8170 – WP at LBA = 0x00611d8f = 6364559 – this probably means that it couldn’t access this particular sector – which is a concern. What I need to do now is to get a spare disk of at least 2TB and make a disk to disk copy – in order to ensure that my data is copied. I have a few 3TB disks lying around – maybe I can free one up for a little while. I think I will do that during the week.
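Smartctl prints the LBA in hexadecimal, and the shell can convert it to decimal directly. The sketch below also compares it against the range of unreadable sectors reported in Part 2 (6,364,553 to 6,364,568); the variable names are just for illustration.

```shell
#!/bin/sh
# LBA from the SMART error log, converted from hex to decimal.
lba=$((0x00611d8f))
echo "LBA in decimal: $lba"

# Range of unreadable sectors found later when copying the disk (Part 2).
first_bad=6364553
last_bad=6364568

if [ "$lba" -ge "$first_bad" ] && [ "$lba" -le "$last_bad" ]; then
    in_range=yes
else
    in_range=no
fi
echo "inside the unreadable range: $in_range"
```

So the sector that SMART logged is one of the same sectors that dd later could not read.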

Remember that I mentioned that we can specify some folders or shares to be redundant – meaning that the contents of those folders have copies that reside on the other disk? Well – not all folders were marked to be redundant, so any of those folders that reside on this particular disk might well be inaccessible. Fortunately, Windows Home Server creates an NTFS file system on each drive, so these drives can be connected to any Windows machine and be accessible – unlike some versions of RAID, where the data is striped across the disks.

The other thing I want to think about is – what I would replace this WHS with. I currently run a virtual Freenas on ESXi server – but I was thinking about building a new standalone network storage appliance. Freenas is great if we can get the right hardware – such as ECC memory, a CPU and motherboard that supports ECC memory – and run ZFS but then I was reading about issues on ZFS – which caused me to look at what other people are using.

I could stay with Linux and run something like MergerFS and SnapRaid or I could go the Windows way – with Storage Spaces which is looking very tempting, except I don’t have a spare Windows 10 machine to play with – since the Free Upgrade from Windows 7/8.1 was over a couple of days ago. Decisions, decisions…

 

Recover.IT – Netgear Stora MS2110 NAS

Just last week, I was given a Netgear Stora MS2110 NAS to look at. I was shown that when powered on, only the orange HDDLED2 would light up – a symptom of it not working. They wanted the photos that were stored on it, that I expect were from many years ago. I could hear the internal disk drive spinning up, which was a good sign.

The disk drive is a Seagate Barracuda LP of size 1TB. The front cover slides off the NAS and then the single disk drive can be ejected using a lever at the back. This NAS could handle two disks, but only one had ever been installed. The first thing that I usually do, is to connect it to my Linux test machine. However, in this case – my test machine would not boot. It had occasionally done this in the past, but usually has then worked after a couple of tries. I plugged in my diagnostic card – which goes into a PCI slot, then powered on to see what was going on.

PC Diagnostic result

The diagnostic card shows that the motherboard has stopped when checking the memory – DIMM. Ok, so what is wrong with the memory – it was a pair of 2GB DDR2-800 memory sticks. I checked my stock, and I had a Kingston 4GB set of DDR2-800, so swapped them in.

Now, it boots – ok, to proceed with the data recovery. I opened a terminal session, then ran “dmesg | grep sd” to check which disk is the 1TB drive – it was sdb. Here is the result of “dmesg | grep sdb”

Screenshot - 080616 - 114907

From this output, I can see that there is a partition called sdb1 on this disk drive. Next is to see what kind of partition it is.

Screenshot - 080616 - 133159

It is a partition type “0xfd” – Linux RAID partition. I needed to run the fdisk with the sudo command as a normal user cannot access disk devices directly. Since this was a single disk RAID, it is safe to assume that if the partition can be accessed, that the data is most likely intact.

Generally, the Netgear Stora uses an xfs partition, which is slightly different to other common Linux file systems. Ubuntu, which is running on my test machine, can handle xfs partition types, so to get at the data – I need to mount it. Just a word of caution – you will note that I haven’t copied the disk as yet – the reason is that my network storage is a bit full and cannot handle an extra 1TB of disk image. Anyway, I am just going to mount the partition and check the contents – if it is small, I can copy everything from it quickly without having to image the disk.

Screenshot - 080616 - 133332

The first command creates a mount point for the disk, which I have called /mnt/nasdisk. The next command mounts the /dev/sdb1 partition as an xfs file system in read-only mode. Then an ls command lists the files and folders, and shows the folders on the disk.
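In text form, those three commands would look something like the sketch below. The device, mount point, file system type and read-only mode come from the description above; the exact flags typed in the screenshot are an assumption. The run helper prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
DEV="/dev/sdb1"
MNT="/mnt/nasdisk"

# Print each command by default; set RUN=1 to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

run sudo mkdir -p "$MNT"                    # create the mount point
run sudo mount -t xfs -o ro "$DEV" "$MNT"   # mount the xfs partition read-only
run ls -l "$MNT"                            # list the files and folders
```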

Screenshot - 080616 - 133432

I opened this mount point in the graphical File Manager and can see that I can access 0common and going in, I could see various folders and the like – so I quickly copied those out to my network drive. The other folders with the X have permissions which I cannot access, so I will need to be the root user to get to them. I also checked with the df command to see how much was used in all of the file systems – only 11GB or so on this disk.

To do this, you need to run the File Manager with root permissions – which requires the package gksu to be installed in Ubuntu. I did this, then ran the command “gksudo thunar” and then navigated to this mount point.

Screenshot - 080616 - 133620

Success, the other folders were now accessible. This time, the File Manager has this orange line that tells you to be extra careful – as it is easy to do things when using the root account that could destroy your computer. I then proceeded to copy any other folders that contained files. All in all, there was just over 10GB of data – I ignored anything that had to do with the Stora software – like web pages and the like.

The Netgear Stora hardware seems to have failed – it could occur for many reasons, and could possibly be repaired but as I was just asked to get the data, then Recover.IT is what I did.

[Note]  It seems that many people on the internet have had problems with the Netgear Stora – but it might be that if you look for anything at all, you might find that everything also has problems. I did see though, that a lot of people had problems accessing xfs partitions – but that may be due to the various linux flavours. FreeBSD for example had read-only support for xfs in 2005, then removed it from version 10 in 2013. Maybe that is why I stay with Ubuntu.

[Note 2] My original memory seems to have developed a problem. I cleaned the contacts on the DDR2 memory using alcohol wipes, as some contacts were dirty. Eventually after further testing, I could have one memory DIMM installed and working but not both. The Kingston memory though would work with both DIMM slots occupied.

Restore.IT, Recover.IT – 2006 – When Murphy’s Law just wasn’t funny anymore! Or pages from the diary of a high-flying IT consultant and troubleshooter!

I was going through my old diaries with the view to putting the pages into the recycle bin when I came across an entry in 2006 that brought back painful memories (and perhaps tears to my eyes). This was an example when Murphy raised his head and continued doing so with near disastrous consequences – ok, my exaggeration – you can be the judge. The names have been changed to protect the innocent. A warning – this post is a long one, feel free to zone out and zone back in again further down. I have put everything into a timeline since that is what I get from my diary – also racking my memory to fill in gaps in my notes. I don’t have my original notes and detailed documents because all files had to be returned to the company when I got WFR’ed in 2010.

February 2006 – I got outsourced to a well known IT company. Basically we were given a choice, we move over or we leave. Leaving was not really an option for me, so I was the only one in Australia that got outsourced. 1 out of 3 – not bad!

April 2006 – Customer office in Malaysia was moved to another location. This included servers, networks etc, the whole kit and caboodle.

Fast forward to October 2006…

25th October 2006 – Malaysia server NTS4 (not its real name but similar) had shut down during the afternoon. Yes, NT does stand for Windows NT.

26th October 2006 – During the late afternoon, I hear about the outage – NTS4 was down at 13:09 Malaysia time yesterday. We arrange to get it restarted.

27th October 2006 – We find that the tape backup drive is not connected. Also we find that the disk drive in slot 4 has failed. After some conference calls, we determine that the disk had failed before the site move but apparently nothing had been done about it – [Murphy 1: Customer 0]. This was escalated to get the drive replaced. This drive was part of a 5-disk RAID-5 array.  RAID-5 can handle a disk outage but requires replacing the failed drive as soon as possible, otherwise it operates in a degraded state with no fault-tolerance.
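[To illustrate why RAID-5 survives one failed drive but not two: the parity block in each stripe is the XOR of the data blocks, so any one missing block can be recomputed from the others. The sketch below uses single made-up byte values to stand in for the four data blocks of one stripe on a 5-disk array; it is a toy demonstration, not how a controller is implemented.]

```shell
#!/bin/sh
# One stripe on a 5-disk RAID-5: four data blocks plus one parity block.
# The byte values are made up for the demonstration.
d0=$((0x41)); d1=$((0x42)); d2=$((0x43)); d3=$((0x44))

parity=$((d0 ^ d1 ^ d2 ^ d3))        # parity is the XOR of all data blocks

# Suppose the disk holding d2 fails: XOR the survivors with the parity.
rebuilt=$((d0 ^ d1 ^ d3 ^ parity))

echo "original d2 = $d2, rebuilt d2 = $rebuilt"
# With two blocks missing there is one equation and two unknowns,
# so the stripe cannot be reconstructed.
```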

28th October 2006 – Engineer is scheduled onsite in Malaysia in the morning. A short time later, I get a call – the NTS4 server is down. Conference calls for the next three hours – it appears that the server is unable to boot as it has lost another drive, although this time the error report indicates that a drive was removed. We check with the engineer – he denies touching anything. All drives are still in the server – so, if the drive was removed, it was put back in – but that was already too late because the array controller now has two drives down, so what really happened? Anyway, I get on the phone to the regional service manager – I tell him that if the drive was removed, the data on the drive should still be intact and I would (with 99% certainty) be able to recreate the array using the data from each disk drive except the one that had failed a long time ago. I also tell him not to let any engineers do anything to the server before I get there.

A few hours later, after more conference calls – we decide that we need to bring services back online. NTS4 was both a SQL database server and a Microsoft Exchange server. I commence copying Exchange installation files from Singapore to another Malaysia server NTS8 which will become the replacement Exchange server.

Fast forward a couple of hours – I get a call that the server NTS4 has been fixed. That was when I had the knot in my stomach, shivers down my spine and the knowledge of what comes next – like when you are standing on the edge of an abyss with the wind behind you getting stronger and stronger and nowhere to go.

What happened is a case of pride before prudence (not prejudice – ok, pun). My company’s Wintel Level 3 is based in Malaysia – they are supposed to know everything there is to know about Windows and Intel servers – however, as I found out, they know little about data recovery. Pressure was put on them to resolve a problem – why should an outsider (yes, myself – a newcomer with only 8 months in the company) be the only one that could fix the problem. When this was put to you by a big boss, how can you say that you can’t fix it – of course you can fix it. So what they did was to replace the long time failed drive and the one that had been removed. The array begins rebuilding – smiles all around… Except that the server does not boot – of course it was obvious to me, but I knew then that the data was essentially lost. [Murphy 2: Customer 0]

Ok – no point in crying over spilt milk – the only other course of action (with little hope) I could suggest is to have the server looked at by a data recovery company. My company does not have a data recovery department (surprise) – something that I have suggested, so an external company was required. A suitable company was located in Malaysia – and the server is being packaged up to go to them – cost would be 8,000-20,000 Malaysian Ringgits (irrelevant) and about 1 week turnaround. I finally got to bed on that Saturday at about midnight to try to get some sleep before a scheduled conference call 4 hours later.

29th October 2006 – Looks like that particular Sunday would be full-on. I am right. 04:00 conference call, followed by more work and more calls. I forgot to mention that I am also responsible for Microsoft Exchange 5.5 Level 3 support especially for these emergencies like server restorations. For about 7 hours, I work on installing Exchange 5.5 on NTS8 and finally around 21:30 I get all the mailboxes created. Then spent the next 3 hours getting replication and the X.400 Connector working to the Singapore regional bridgehead. Got to bed at about 1AM.

30th October 2006 – Got up early on a fine Monday morning and started installing the Trend Micro ScanMail and End User Quarantine software for Exchange. Installed Backup and service monitoring agents – yes, I basically install software for the entire infrastructure. Then to prepare for Microsoft SQL – copying install files to Malaysia from Singapore. We would use NTS8 for SQL – the Malaysia customer office uses SQL as the database backend for AccPac accounting software.

31st October 2006 – More work getting SQL installed and finally ready to look at restoring databases. A problem arose trying to read from the NTS4 tapes – it looks like the tape drive wasn’t working for some reason. I would probably have to actually go there, since I am also the level 3 support for Arcserve backup software – my company didn’t really have people who know much about these old applications, and I had been supporting and installing those applications since 1996. Anyway, the Malaysia customer office had email working and the accounting database could wait until I get there to restore the NTS4 server from backup tapes.

Over the next 24 hours, I work on a site recovery and contingency plan. I knew that I would have to restore the NTS4 server from tape, so would need to export and import the mailboxes from the restored server to the new server. There were quite a few steps that would be needed in order to effect a good recovery and minimize any further downtime. Towards the end of the job, I expected that there would be a number of late nights involved.

2nd November 2006 – The report back from the data recovery company was not good. They cannot do anything because the array had been reinitialized. There were lots of files that could be recovered, but the main files we are wanting are the Exchange & SQL databases and associated log files – these are very large and much of the data had been lost due to the data striping of the array. That is, introducing two new drives forced a rebuild, which is basically a reinitialize. I estimated that a quarter of the actual data had effectively been replaced with zeroes. The server would be returned to the customer site.

6th November 2006 – The NTS8 server is down. Oh no! An IBM engineer is requested since this server is an IBM xSeries server. I thought at the time, that I should start arranging my travel and book flights. I get approval from my manager to fly to Malaysia from Sydney with the purpose of rebuilding and restoring NTS4 and to resolve NTS8. I get a call from the IBM engineer – the server is down due to a bad stripe. [Murphy 3: Customer 0]  How can that happen?  [It seems that if data within a stripe becomes inconsistent due to media errors, i.e. bad block (or part of) on the hard disk, then the stripe becomes bad. For instance, with three drives in RAID-5 and a block size of 16KB, this means that 32KB has become unavailable – and if this is part of an operating system file, then that could be preventing the server from booting.]  Flights arranged, SYD-SIN, SIN-KUL for the next day.
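[The 32KB figure in that example is simply the data portion of one stripe: in RAID-5, one block per stripe holds parity, so the data affected is (number of drives minus one) times the block size. A quick sketch of that arithmetic, using the three-drive, 16KB example from above:]

```shell
#!/bin/sh
# Example values from the explanation above.
drives=3
block_kb=16

# One block per stripe is parity; the rest hold data.
data_kb=$(( (drives - 1) * block_kb ))
echo "data lost with one bad stripe: ${data_kb}KB"
```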

7th November 2006 – Left home at 05:30 heading to the airport for a 08:30 flight to Singapore. Arrived about 13:30 Singapore time and waiting for my 17:00 flight to KL. I get a call from the IBM engineer – he can fix the bad stripe. Really? Ok – how? Delete the array and recreate the array – yeah, right! What about the data? No problem – the data should be fine – no thanks! I forbid him to do this as I am on the way to Malaysia – don’t touch the server until I tell you to! I can be forceful when I need to be.  Deleting and recreating the array will definitely lose the data – I was not going to lose two servers in a row, no way, if I could help it!  I finally get my flight and arrive in KL and head to the hotel – arriving around 19:30 just in time for dinner – best to eat and get a good night’s sleep because tomorrow would be a long day.

8th November 2006 – Arrived at Malaysia customer site at 08:25. I have a look at both NTS4 and NTS8 servers. I carry a couple of Linux CD’s with me all the time. I planned to boot each server with a Knoppix live CD and run a “cat /etc/fstab” command – this would list the drives and file systems that Knoppix (Linux) recognizes as being available.

NTS8 – single drive, 2 partitions. /dev/sda1, i.e. C: drive on NTS8 is corrupted at about the 7.5GB point. /dev/sda2, D: drive appears intact – fantastic, because this is where the Exchange server databases and logfiles are stored. This is great news because it means that I can “recover.it“. If I could get those Exchange databases and logfiles copied and restored successfully in the correct manner, the users will have all their email up to the point of failure – that was the best that anyone could hope to achieve. I scrounged around looking for a machine with sufficient storage capacity – I finally found a relatively new desktop with enough space. I enabled Samba and then shared /dev/sda2 and started the copy of the Exchange databases and logfiles, etc to the desktop machine. I also wanted the Arcserve databases and logs. It took a while because the files are quite large – especially Arcserve – even though Exchange had only been running for approximately a week until it went down.

I reconfigured NTS4 to connect all of the disks to the inbuilt SCSI controller instead of the Smart Array controller.  Knoppix recognized 4 drives, 18GB, 18GB, 18GB, 36GB – I set up a copy of the contents of each disk across the network to my laptop. I would use this data to test my perl script – the one that I would have used to rebuild the data if the disk array had been left as I had requested instead of being interfered with and effectively destroyed by the reinitialization process.

When the copying from NTS8 had completed, I started copying the files from the desktop to my USB disk – careful is my middle name, especially when it comes to critical customer data.

I rebooted NTS8 as it was time to “restore.it” and booted from the IBM ServerGuide CD. I erased the disk array and then started the install, which would create a new array and then install Windows 2000 – since this is what had been running on NTS8. However, it hung at “Setup is starting Windows 2000”. Bummer!  Anyway, it was late – 20:45, better to get some rest and start afresh in the morning.  I called the IBM engineer, explained what I had done, and told him that his services were no longer required.  He could go ahead and close the call-out ticket.

9th November 2006 – In the office early again.  I worked on NTS8 again, trying the install a couple of times until finally the penny dropped: I disconnected the tape drive and tried again. Success – it seemed that during the ServerGuide installation, it would hang trying to detect additional hardware, so best not to give it hardware to find and not know what to do with.  Windows 2000 Server installed – great.  I then quickly installed the Arcserve backup application and restored the D: drive, then restored the C: drive including the system state. This overwrote the fresh installation with what had been backed up during the last full backup, which fortunately was the night before the crash. I rebooted when the server was ready, and then stopped all of the Exchange related services.  I started copying the databases and logfiles back from the desktop machine – this should put back into place the files up to the point of failure – at least for the email system.

Done – files are back in place – quick check of the files – they looked ok, file sizes the same as on the desktop. It was necessary to run a recovery process so that the files could be fully integrated into Exchange and the system registry. I ran the following commands – unfortunately I cannot give you a lot of detail on them as it isn’t relevant to this post, but suffice to say that the commands and specific order are necessary – as any Exchange 5.5 level 3 engineer will tell you.

“eseutil /g” – a few errors seen, not a problem as they were expected. “eseutil /r” – soft recovery completed successfully. Started the System Attendant and Directory Service services for Exchange, then logged off and logged on with the Exchange service account. “isinteg -patch” – completed, no errors. Started all remaining Exchange services – voila! Exchange is running.  Fixed Trend Micro ScanMail due to the antivirus patches not updating.

All users are informed that email is now accessible and that mail should be at the point of failure – hooray! [Murphy 3: Customer 1]

I then copied the databases for Arcserve so that Arcserve was back to the state at the time of the server crash.  All done.  It was time to look further at NTS4. I reconfigured the disk drives back to the array controller as by then I had all of the disk contents and could work on rebuilding the server. I installed Windows NT 4.0. While that was happening, I had a look at the tape drive to find out why it was not being recognized.  I saw some bent pins in the SCSI connector – how did that happen?

DSCN2696

The penny dropped – it happens a lot!  During the site move in April 2006, they would have disconnected the cables to move the equipment and reconnect.  Whoever reconnected the cable to the tape backup unit obviously did so very clumsily and the backup unit was not tested afterwards. [Murphy 4: Customer 1]

10th November 2006 – I had to check out of the Crystal Crown Hotel – and would move to another hotel – Hilton PJ, later in the day.  When booking flights and accommodation on short notice, we could not always get the one hotel for the entire stay.  Flights to Singapore and then back to Sydney were reserved.  Installed SQL Server 2000 onto NTS8 in preparation for restoration of AccPac databases. A slight (conservatively) hitch had to be resolved: the last backup of NTS4 was probably the one before the office move in April – what to do?  Ok – not my problem, someone else could worry about that. I continued with my recovery plan to finish the NTS4 reinstallation in preparation for data restoration from tape.

11th November 2006 – The last backup tape of NTS4 (17/04/2006) was merged into the Arcserve database on NTS8 – this was needed before restoration from the tape was possible.  Restored two backup sessions to a temporary folder on NTS8. Attempts to restore session 3 resulted in session 2 being found instead – what gives? [Murphy 5: Customer 1]

12th November 2006 – It appears that Arcserve 6.61, when doing a full drive backup, would allocate space on the tape based on the expected backup size requirement. However, during the backup some files may be unavailable, hence the actual backup is smaller, resulting in slack space on the tape. This was causing a problem with the restore because the tape could not be positioned to session 3 properly.  Actually on further analysis, there appeared to be an extra session in between 2 and 3, so that 3 was not real, but trying to restore 3 ended up with 2. Restoring session 4 just failed because if it got to 4, it would see 3 and fail – pulling my hair out just didn’t help.  To rule out a tape drive problem, I decided to copy the tape to a different tape. I used the tapecopy command to copy all sessions from the DLT4 tape to the SDLT1 tape. As it was going to take some time, I began analyzing the data I had collected from the NTS4 disks before the reinstallation.  I updated my perl script so that I could recreate the logical drive – as an academic exercise.

13th November 2006 – The tapecopy had completed. After deleting the tape from the database, I re-merged the tape in Arcserve – to my immediate relief, all backup sessions were visible and in the correct order. [Murphy 5: Customer 2]

I was able to restore the first three sessions comprising C:, D: and F:, then the fourth session, being the System Registry, was also successful. Next on the list was to restore the SQL databases – another hitch – the restore failed with “no valid destination” – I could not restore the databases to NTS8 when they were backed up on NTS4. This was apparently a limitation of the backup agent.  NTS4 and NTS8 were on different Windows domains – I had to establish a trust between the two domains, then was able to restore from NTS8 directly to NTS4 when restoring to the original location. It wasn’t quite that straightforward as a reboot was involved and the Master database had to be restored first before restores of other SQL databases could work – but it was done. [Murphy 5: Customer 3].

Unfortunately we didn’t really want the SQL databases back on NTS4 because that server was already obsolete, so we decided at the time, that another server NTS5 would become the SQL database server. Since SQL Server was no longer needed on NTS8, it was uninstalled as it was intended to be temporary anyway for the purpose of restoring the databases.

14th November 2006 – It was time for the Exchange database restoration on NTS4. The Exchange site was isolated, to avoid replication – essential when doing an online restoration of old databases. The Exchange database restore was commenced. In the meantime NTS5 was worked on to install SQL Server and Arcserve backup agents. I also did some further work on my perl script for the raid recovery test.

Whew! Still reading this? I did say that this was a long post. Anyway to cut a long story short – in the remaining days of that week, the Exchange databases were restored to NTS4. Exchange was brought up and I verified that the mailboxes were intact – which was fantastic. All mailboxes were then exported to pst files using the Exmerge program. These pst files were uploaded to the NTS8 Exchange server.  All of the users were happy to get more emails back, but not so happy that the emails from 18/04/2006 to 28/10/2006 were irretrievably lost. SQL databases were also moved to NTS5 and my job in Malaysia was done except for some cleanup actions that could be done remotely. [Murphy 5: Customer 4]

This was an example of some of the things that I encountered during my roving life as an IT consultant and troubleshooter. In those couple of weeks I had to contend with multiple failures involving disk arrays and had to perform server recoveries and restorations under difficult circumstances.

Did we break even at Murphy 5: Customer 4 – doesn’t look like it?  Oh yes, the backup tape was six months old – what about the AccPac accounting databases, I can hear you asking?  My company had to hire a number of data entry people to input all the accounts for the six months or more based on the accounting printouts that they had – lucky they had hard-copies, right?  And yes, the whole accounting process had to be followed, April data entry, then April end of month closure, printout, May data entry… A month or so later, the data entry was completed, and AccPac was rolling ahead! [Murphy 5: Customer 5]

[PS] I feel a bit sorry to put you through all of this, but I hope you understand that an IT problem is not always straightforward. I also tried to keep the relevant parts as it is possible that others may encounter this situation in the future and may find some help in this post.  I forgot to mention that I did finish my perl script to recreate the logical drive of the failed array, and during analysis I was able to show conclusively that the array had been reinitialized, which was why the data was lost. Further to this, I was able to confirm during testing on equivalent hardware that, taking a 5 drive RAID-5 array, I could pull out one drive and lock it away to simulate an old failed drive, then pull out a second drive to crash the array. After reading the contents of the four available disks, I could use my perl script to recreate the data on the locked away drive, and also to recreate a logical drive that is the same as converting the array into a single larger drive. All this using perl, a scripting language that is over 20 years old – and the script comprised only a small number of actual commands. For those of you who know perl, you will understand “$b5 = $b1 ^ $b2 ^ $b3 ^ $b4;” – that is the magic line. Everything else was just definitions, reading, writing and looping.
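For readers who don’t know perl, here is a minimal Python sketch of the same XOR idea. It is not my original script: the block contents and the xor_blocks helper are made up for illustration, and it ignores the stripe size and rotating parity layout that a real RAID-5 controller uses.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR several equal-length blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical stripe: four data blocks plus the parity block the array stores.
d1, d2, d3, d4 = b"ABCD", b"EFGH", b"IJKL", b"MNOP"
parity = xor_blocks(d1, d2, d3, d4)

# Pretend the drive holding d3 is gone: XOR of the four survivors gives it back.
rebuilt = xor_blocks(d1, d2, d4, parity)
assert rebuilt == d3
```

The same call works whichever of the five blocks is missing, because XOR-ing all five blocks of a stripe together always gives zeros.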

Maybe we could make a movie out of this – but of course, no car chases, no martial arts, no gunplay, no scantily clad women – no fun, right?

Recover.IT – HP dc7700 Small Form Factor PC

A neighbour asked me to look at their PC late last week. Apparently it had stopped working and contained very important data. They also gave me a Seagate external hard disk to put data that I recover from it.

The PC was very dusty – I noticed because my white T-Shirt had a bit of a stain on it after bringing the PC inside. First step was to open it up and vacuum it out. The CPU fan and heatsink were covered in lint and dust – a sure sign of being operated in a home environment – after a few years this can happen no matter how clean the house is kept.

When I power on the HP dc7700 SFF machine, the fans start up then stop, power goes out, and then it beeps nine times. The power supply does not stay on, but the 5V standby power is still available, which is why it can beep. After searching online for the diagnostic codes for this machine, it turns out that the resolution is to:

1. check the power supply voltage selector – none in this case

2. replace the motherboard – I don’t have a spare

3. replace the power supply – not available, but could be repaired.

Next – to “recover.it” the data, I mean.  Removing the hard disk from the machine was quite easy.  I connected the hard disk to an external dock on my Linux laptop – it seemed to be visible. First thing now is to make a raw image of the disk drive. It was a Samsung HD082GJ which is 80GB in size. That should fit on my laptop as I have about 140GB free. I run Ubuntu 14.04 on my laptop for recovery operations like this. I use the dd utility to copy the raw disk. Specifically the command I use is

dd if=/dev/sdb of=may-hd082gj.dd bs=1M conv=noerror,sync

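The parameters are to use a block size of 1MB and to continue even if errors are encountered, while the “sync” option pads any input block that could not be read in full with zeros. I will end up with an image file in which everything sits at the same offset as on the original disk drive; the file is the size of the disk, rounded up to a whole block because the final partial block is padded too. Without sync, the image file would be missing the unreadable data, so everything after it would be shifted out of line, which is not good. The trade-off of a 1MB block size is that one bad sector can blank up to the whole 1MB block around it in the image.  Fortunately in this case, the dd utility completed with no errors detected.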
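To make the padding behaviour concrete, here is a small Python toy model of it. This is not how dd is implemented; the image_with_padding function, the fake 10 byte “disk” and the simulated read error are all assumptions for illustration only.

```python
import io


def image_with_padding(read_block, total_size, block_size, out):
    """Copy total_size bytes in block_size chunks.

    A chunk that fails to read, or comes back short, is padded with zero
    bytes so that every later chunk keeps its original offset.
    Returns the number of chunks that could not be read at all.
    """
    bad = 0
    for offset in range(0, total_size, block_size):
        try:
            data = read_block(offset, block_size)
        except OSError:
            data = b""
            bad += 1
        out.write(data.ljust(block_size, b"\x00"))
    return bad


# Hypothetical 10 byte "disk" whose second 4 byte block is unreadable.
disk = b"0123456789"


def flaky_read(offset, size):
    if offset == 4:
        raise OSError("simulated read error")
    return disk[offset:offset + size]


out = io.BytesIO()
bad_blocks = image_with_padding(flaky_read, len(disk), 4, out)
image = out.getvalue()

# Data after the bad block is still at its original offset in the image.
assert image[8:10] == disk[8:10]
```

In this toy run the unreadable block becomes four zero bytes, and the image comes out at 12 bytes rather than 10 because the last short block is padded as well.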

Next I copied the image file to a network location, so that I have another copy of it in case it is needed.  I then mounted the disk, and browsed the contents – it appears to be Windows XP, therefore the data that should be collected will be files and folders within C:\Documents and Settings\Edwin – Edwin being the user login name.  Typically My Documents, Desktop, Pictures etc are what should be copied, but other folders on the C: drive were also copied. I did the same for the D: drive – the original disk was partitioned into two drives.  There is no point in copying the Windows folders since most of the files are not user files or data. However it is good to have a look in case documents or spreadsheets are stored there.
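If the image file is mounted rather than the physical disk, each partition’s starting offset inside the image is needed. The following Python sketch reads those offsets from a classic MBR partition table. It assumes 512 byte sectors and only looks at the four primary entries, so a D: drive stored as a logical partition inside an extended partition would need extra handling; the fake sector in the example is made up.

```python
import struct

SECTOR_SIZE = 512  # assumption: 512 byte logical sectors


def mbr_partitions(sector0: bytes):
    """Return (number, type, start_lba, sector_count) for each used primary entry."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    parts = []
    for i in range(4):
        # The table starts at byte 446 and holds four 16 byte entries.
        entry = sector0[446 + 16 * i: 446 + 16 * (i + 1)]
        ptype = entry[4]
        start_lba, count = struct.unpack_from("<II", entry, 8)
        if ptype != 0 and count > 0:
            parts.append((i + 1, ptype, start_lba, count))
    return parts


# Made-up first sector with one NTFS-type (0x07) partition starting at sector 63.
fake = bytearray(512)
fake[446 + 4] = 0x07
struct.pack_into("<II", fake, 446 + 8, 63, 1000)
fake[510:512] = b"\x55\xaa"

for num, ptype, start, count in mbr_partitions(bytes(fake)):
    print(f"partition {num}: type 0x{ptype:02x}, byte offset {start * SECTOR_SIZE}")
```

For a real image, the first 512 bytes of the file would be passed in instead of the fake sector, and the byte offset can then be given to mount’s loop option, for example “mount -o ro,loop,offset=32256 may-hd082gj.dd /mnt/recover” (the mount point is just an example).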

Ok – job done. I returned the external disk to my neighbour on Tuesday evening. They were happy to get their data, but they mentioned that they use that machine to access their Optus email, and they don’t know the password as it was stored on the machine – ok… If I can’t get the password for them, they could contact Optus to get it reset – so how do I go about this? More on this next.