I have ended up with a bit of a mess in my XFS root! I have a directory full of illegal files and trash; it looks like inodes have somehow become cross-linked from some data files. This is somewhat of a pain as the box is up and running and I don’t want to take it offline to repair it. Of course, the proper way to repair XFS is with the filesystem offline.
Like that’s going to stop me trying. On the off-chance I manage to do this without turning the file system into jam, I am going to document it here (plus it may be handy if I need to do it again).
In my situation my root is very small (about 380MB). All of you who just have one epic root dir with everything in it, let this be a lesson to you. It is on an LVM volume, so I am sure there are nicer ways to do some parts of this with LVM, but I have no unallocated space to make clones of volumes etc.
On my system I have a volume group called ‘primary’ and the root volume is called ‘root’.
An ls shows me the mess :
sulaco:/home/graeme# ls /etc/vmware.bork/pam.d/
001qr-RN-D 07Sd-Eu-H Di-H ? L
005Rn-Di-D 3TO-GM-H -GM-D ^ùì
008D6-E3-H 4aJ-Fc-D L
05Qb-4z-H 7SK-SD-D ? L
? 06fY-RJ-D AES (AdvA?d EnR·0¦ ?qSt0àard)?$? ? L
(I have renamed the directory to vmware.bork)
First off, we need to get the currently running FS as consistent as possible. So sync what’s outstanding and make it read-only.
sulaco:/home/graeme# sync;sync;mount / -o remount,ro
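Before trusting that, it may be worth checking that the remount really took effect, since a remount to read-only can be refused if something still has files open for writing. A minimal sketch, assuming a Linux box with /proc/mounts; the function name is_readonly is made up for illustration and the second argument exists only so it can be tried against a sample file:

```shell
# Succeed if MOUNTPOINT is currently mounted read-only.
# Usage: is_readonly MOUNTPOINT [MOUNTS_FILE]
# MOUNTS_FILE defaults to /proc/mounts (an assumption: Linux only).
is_readonly() {
    mnt=$1
    table=${2:-/proc/mounts}
    # Field 2 is the mount point, field 4 the comma-separated options.
    # Keep the last matching line, since later mounts shadow earlier ones.
    opts=$(awk -v m="$mnt" '$2 == m { o = $4 } END { print o }' "$table")
    case ",$opts," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example: is_readonly / && echo "root is read-only"
```

If it reports the root as still writable, I would stop there rather than carry on with the copy.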
Now we have a static target (I hope). Let’s get a nice block-level copy of it whilst it’s not moving. I hate working on live stuff and would much rather make copies of anything I am about to break.
sulaco:/home/graeme# dd if=/dev/primary/root of=root.fs
720896+0 records in
720896+0 records out
369098752 bytes (369 MB) copied, 45.8338 seconds, 8.1 MB/s
We now have a file ‘root.fs’ which is a block-level copy of our broken file system. Depending on your confidence it may be worth making a copy of that elsewhere; should the repair go horribly wrong, you can always dd it back in place and be where you were when you started.
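One way to do that spare copy with a bit of a safety net is to compare the image with the source first and only then copy it. This is just a sketch: keep_spare is a made-up helper name, and the destination path in the example is a placeholder for wherever you have room.

```shell
# Confirm IMAGE is byte-identical to SOURCE, then keep an untouched spare.
# Usage: keep_spare SOURCE IMAGE SPARE
keep_spare() {
    src=$1
    img=$2
    spare=$3
    # cmp -s is silent and exits non-zero at the first differing byte.
    if ! cmp -s "$src" "$img"; then
        echo "image differs from source, not copying" >&2
        return 1
    fi
    cp -- "$img" "$spare"
}

# Example (hypothetical destination path):
# keep_spare /dev/primary/root root.fs /some/other/disk/root.fs.orig
```

The comparison only means much while the source is still read-only; once it is writable again the two will drift apart.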
Now we have an unmounted copy of our broken XFS. We should now be able to use the -f flag with xfs_check to see what’s what.
(NB: If you get some error about there being changes and the log needing to be replayed, simply mount the file as a loopback device and unmount it again. You may need to use the nouuid option to be able to mount it. I will cover mounting the FS later)
sulaco:/home/graeme# xfs_check -f root.fs
agi unlinked bucket 38 is 96038 in ag 2 (inode=620326)
agi unlinked bucket 40 is 96040 in ag 2 (inode=620328)
agi unlinked bucket 41 is 96041 in ag 2 (inode=620329)
agi unlinked bucket 42 is 96042 in ag 2 (inode=620330)
agi unlinked bucket 44 is 96044 in ag 2 (inode=620332)
agi unlinked bucket 45 is 96045 in ag 2 (inode=620333)
agi unlinked bucket 46 is 96046 in ag 2 (inode=620334)
agi unlinked bucket 47 is 96047 in ag 2 (inode=620335)
agi unlinked bucket 49 is 96049 in ag 2 (inode=620337)
agi unlinked bucket 52 is 244 in ag 2 (inode=524532)
agi unlinked bucket 53 is 96053 in ag 2 (inode=620341)
agi unlinked bucket 57 is 96057 in ag 2 (inode=620345)
dir 894700 bad size in entry at 10
dir 894700 entry .. bad inode number 18158138244107078
dir 894700 i8count mismatch is 181 should be 1
agi unlinked bucket 18 is 338 in ag 6 (inode=1573202)
agi unlinked bucket 29 is 157 in ag 6 (inode=1573021)
agi unlinked bucket 58 is 80890 in ag 6 (inode=1653754)
agi unlinked bucket 62 is 19454 in ag 6 (inode=1592318)
link count mismatch for inode 524426 (name ?), nlink 3, counted 2
allocated inode 524532 has 0 link count
allocated inode 620326 has 0 link count
allocated inode 620328 has 0 link count
allocated inode 620329 has 0 link count
allocated inode 620330 has 0 link count
allocated inode 620332 has 0 link count
allocated inode 620333 has 0 link count
allocated inode 620334 has 0 link count
allocated inode 620335 has 0 link count
allocated inode 620337 has 0 link count
allocated inode 620340 has 0 link count
allocated inode 620341 has 0 link count
allocated inode 620345 has 0 link count
disconnected inode 894702, nlink 1
allocated inode 1573202 has 0 link count
allocated inode 1653754 has 0 link count
allocated inode 1573021 has 0 link count
allocated inode 1592318 has 0 link count
So, as I said... broken, then.
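That is a lot of output to eyeball. If you save it to a file, something like the following pulls out just the inode numbers reported as having a zero link count or as disconnected, which is handy for comparing against whatever turns up in lost+found later. It is a sketch that assumes the message wording shown above; flagged_inodes and check.log are names I have made up.

```shell
# Print, one per line and numerically sorted, the inode numbers that
# xfs_check reported as "has 0 link count" or "disconnected".
# Reads the named files, or standard input if none are given.
flagged_inodes() {
    sed -n \
        -e 's/^allocated inode \([0-9][0-9]*\) has 0 link count$/\1/p' \
        -e 's/^disconnected inode \([0-9][0-9]*\), nlink [0-9][0-9]*$/\1/p' \
        "$@" | sort -n -u
}

# Example (hypothetical log file name):
# xfs_check -f root.fs > check.log 2>&1
# flagged_inodes check.log
```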
On with the repair then, xfs_repair -f (as we are using a file for our block device) should work.
sulaco:/home/graeme# xfs_repair -f root.fs
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
error following ag 2 unlinked list
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3 \Ð" in shortform directory 894700 references invalid inode 185706064904960>ßýÐPÏ
size of entry #0 overflows space left in in shortform dir 894700
junking 255 entries
corrected entry count in directory 894700, was 63, now 0
corrected i8 count in directory 894700, was 181, now 1
corrected directory 894700 size, was 25, now 10
bogus .. inode number (18158138244107078) in directory inode 894700, clearing inode number
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- clear lost+found (if it exists) ...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
corrected i8 count in directory 894700, was 1, now 0
corrected directory 894700 size, was 10, now 6
bogus .. inode number (0) in directory inode 894700, clearing inode number
- agno = 4
- agno = 5
- agno = 6
- agno = 7
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- ensuring existence of lost+found directory
- traversing filesystem starting at / ...
entry "pam.d" in directory inode 524426 not consistent with .. value (18446744073709551615) in inode 894700,
junking entry "pam.d" in directory inode 894700
- traversal finished ...
- traversing all unattached subtrees ...
- traversals finished ...
- moving disconnected inodes to lost+found ...
disconnected inode 524532, moving to lost+found
disconnected inode 620326, moving to lost+found
disconnected inode 620328, moving to lost+found
disconnected inode 620329, moving to lost+found
disconnected inode 620330, moving to lost+found
disconnected inode 620332, moving to lost+found
disconnected inode 620333, moving to lost+found
disconnected inode 620334, moving to lost+found
disconnected inode 620335, moving to lost+found
disconnected inode 620337, moving to lost+found
disconnected inode 620340, moving to lost+found
disconnected inode 620341, moving to lost+found
disconnected inode 620345, moving to lost+found
disconnected dir inode 894700, moving to lost+found
disconnected inode 894702, moving to lost+found
disconnected inode 1573021, moving to lost+found
disconnected inode 1573202, moving to lost+found
disconnected inode 1592318, moving to lost+found
disconnected inode 1653754, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 524426 nlinks from 3 to 2
done
A few errors in there, but it’s better than nothing I guess. Now to mount it and have a look around to see if things match the existing root etc.
sulaco:/home/graeme# mkdir testroot
sulaco:/home/graeme# mount root.fs testroot -o loop
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
missing codepage or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
Aha! Our first gotcha! Why ever didn’t that work? Following its advice and tailing the logs gets us:
sulaco:/home/graeme# tail /var/log/messages
Aug 24 16:45:15 sulaco kernel: Filesystem "loop0": Disabling barriers, not supported by the underlying device
Aug 24 16:45:15 sulaco kernel: XFS: Filesystem loop0 has duplicate UUID - can't mount
A UUID conflict: the copy carries the same UUID as the still-mounted root. Simple to overcome, just pass ‘nouuid’ as an option to the mount:
sulaco:/home/graeme# mount root.fs testroot -o loop,nouuid
sulaco:/home/graeme# ls testroot
bin data home initrd.img.old media proc srv tmp vmlinuz
boot dev initrd lib mnt root sys usr vmlinuz.old
cdrom etc initrd.img lost+found opt sbin tftpboot var
Much better. Have a quick look through this and check to see if the important stuff looks OK. Maybe diff some files/directories with your still mounted root etc.
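For the diffing, a small loop saves some typing. This is a rough sketch: compare_roots is a made-up name, and which directories are worth comparing is your call (I would stick to things like etc, bin and sbin and steer clear of dev, proc and sys).

```shell
# Compare chosen directories under a live root against a repaired copy.
# Usage: compare_roots LIVE_ROOT COPY_ROOT DIR...
# Returns non-zero if any directory differs.
compare_roots() {
    live=$1
    copy=$2
    shift 2
    status=0
    for d in "$@"; do
        # -r recurses, -q reports only which files differ or are missing.
        diff -rq "$live/$d" "$copy/$d" || status=1
    done
    return "$status"
}

# Example: compare_roots / testroot etc bin sbin
```

Expect it to complain about the directory that was damaged in the first place; anything else it reports deserves a closer look before going further.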
Before making the crazy step, unmount the newly repaired filesystem.
sulaco:/home/graeme# umount testroot
Now we have a fixed root file system; it’s just not the one we are using. So, time to cross your limbs and dd it back. Hopefully the fact that the mounted one is read-only should allow us to change the disk underneath it without too much hassle. I hope.
So here we go...
sulaco:/home/graeme# dd if=root.fs of=/dev/primary/root
720896+0 records in
720896+0 records out
369098752 bytes (369 MB) copied, 30.9349 seconds, 11.9 MB/s
Let’s have a little look around…
sulaco:/etc# ls /etc/vmware.bork/pam.d/
sulaco:/etc# ls /lost+found/
1573021 1592318 524532 620328 620330 620333 620335 620340 620345 894702
1573202 1653754 620326 620329 620332 620334 620337 620341 894700
Well, the trash has gone from the /etc/vmware.bork directory, and there is a chunk of lost inodes in lost+found… So far so good. Perhaps try running a few commands from /bin, /sbin etc. to check that all is well; as we haven’t reorganised the data, I wouldn’t imagine there would be too much broken.
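The names in lost+found are just inode numbers, so it helps to get a quick idea of what each one is before deciding whether any of it is worth rescuing. A minimal sketch using only standard tools; describe_lost is a made-up name:

```shell
# List each entry in a lost+found directory with its type or size.
# Usage: describe_lost [DIR]   (DIR defaults to /lost+found)
describe_lost() {
    dir=${1:-/lost+found}
    for entry in "$dir"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$entry" ] || continue
        name=${entry##*/}
        if [ -d "$entry" ]; then
            printf '%s\tdirectory\n' "$name"
        else
            # tr strips the padding some wc implementations add.
            size=$(wc -c < "$entry" | tr -d ' ')
            printf '%s\t%s bytes\n' "$name" "$size"
        fi
    done
}

# Example: describe_lost /lost+found
```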
All that’s left now is to remount it read-write and go and have a nice lie down/brew.
sulaco:/bin# mount / -o remount,rw
(Remember, it’s mounted read-only, so there is no need to sync.)
There we go, XFS root repaired with no rebooting, or downtime.
Think I will go and be chuffed with myself now for a bit and wait for this to bite me in the arse. (If it does I will post here and let you know about it.)
Hope someone else finds that useful, or at least interesting.