summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2015-08-03ext4: replace open coded nofail allocation in ext4_free_blocks()Michal Hocko
commit 7444a072c387a93ebee7066e8aee776954ab0e41 upstream. ext4_free_blocks is looping around the allocation request and mimics __GFP_NOFAIL behavior without any allocation fallback strategy. Let's remove the open coded loop and replace it with __GFP_NOFAIL. Without the flag the allocator has no way to find out never-fail requirement and cannot help in any way. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: correctly migrate a file with a hole at the beginningEryu Guan
commit 8974fec7d72e3e02752fe0f27b4c3719c78d9a15 upstream. Currently ext4_ind_migrate() doesn't correctly handle a file which contains a hole at the beginning of the file. This caused the migration to be done incorrectly, and then if there is a subsequent following delayed allocation write to the "hole", this would reclaim the same data blocks again and results in fs corruption. # assmuing 4k block size ext4, with delalloc enabled # skip the first block and write to the second block xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile # converting to indirect-mapped file, which would move the data blocks # to the beginning of the file, but extent status cache still marks # that region as a hole chattr -e /mnt/ext4/testfile # delayed allocation writes to the "hole", reclaim the same data block # again, results in i_blocks corruption xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile umount /mnt/ext4 e2fsck -nf /dev/sda6 ... Inode 53, i_blocks is 16, should be 8. Fix? no ... Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: be more strict when migrating to non-extent based fileEryu Guan
commit d6f123a9297496ad0b6335fe881504c4b5b2a5e5 upstream. Currently the check in ext4_ind_migrate() is not enough before doing the real conversion: a) delayed allocated extents could bypass the check on eh->eh_entries and eh->eh_depth This can be demonstrated by this script xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile where testfile has two extents but still be converted to non-extent based file format. b) only extent length is checked but not the offset, which would result in data lose (delalloc) or fs corruption (nodelalloc), because non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30) blocks This can be demostrated by xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile sync If delalloc is enabled, dmesg prints EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53 EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5 EXT4-fs (dm-4): This should not happen!! Data will be lost If delalloc is disabled, e2fsck -nf shows corruption Inode 53, i_size is 5497558142976, should be 4096. Fix? no Fix the two issues by a) forcing all delayed allocation blocks to be allocated before checking eh->eh_depth and eh->eh_entries b) limiting the last logical block of the extent is within direct map Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: fix reservation release on invalidatepage for delalloc fsLukas Czerner
commit 9705acd63b125dee8b15c705216d7186daea4625 upstream. On delalloc enabled file system on invalidatepage operation in ext4_da_page_release_reservation() we want to clear the delayed buffer and remove the extent covering the delayed buffer from the extent status tree. However currently there is a bug where on the systems with page size > block size we will always remove extents from the start of the page regardless where the actual delayed buffers are positioned in the page. This leads to the errors like this: EXT4-fs warning (device loop0): ext4_da_release_space:1225: ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data blocks This however can cause data loss on writeback time if the file system is in ENOSPC condition because we're releasing reservation for someones else delayed buffer. Fix this by only removing extents that corresponds to the part of the page we want to invalidate. This problem is reproducible by the following fio receipt (however I was only able to reproduce it with fio-2.1 or older. [global] bs=8k iodepth=1024 iodepth_batch=60 randrepeat=1 size=1m directory=/mnt/test numjobs=20 [job1] ioengine=sync bs=1k direct=1 rw=randread filename=file1:file2 [job2] ioengine=libaio rw=randwrite direct=1 filename=file1:file2 [job3] bs=1k ioengine=posixaio rw=randwrite direct=1 filename=file1:file2 [job5] bs=1k ioengine=sync rw=randread filename=file1:file2 [job7] ioengine=libaio rw=randwrite filename=file1:file2 [job8] ioengine=posixaio rw=randwrite filename=file1:file2 [job10] ioengine=mmap rw=randwrite bs=1k filename=file1:file2 [job11] ioengine=mmap rw=randwrite direct=1 filename=file1:file2 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: don't retry file block mapping on bigalloc fs with non-extent fileDarrick J. Wong
commit 292db1bc6c105d86111e858859456bcb11f90f91 upstream. ext4 isn't willing to map clusters to a non-extent file. Don't signal this with an out of space error, since the FS will retry the allocation (which didn't fail) forever. Instead, return EUCLEAN so that the operation will fail immediately all the way back to userspace. (The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: call sync_blockdev() before invalidate_bdev() in put_super()Theodore Ts'o
commit 89d96a6f8e6491f24fc8f99fd6ae66820e85c6c1 upstream. Normally all of the buffers will have been forced out to disk before we call invalidate_bdev(), but there will be some cases, where a file system operation was aborted due to an ext4_error(), where there may still be some dirty buffers in the buffer cache for the device. So try to force them out to memory before calling invalidate_bdev(). This fixes a warning triggered by generic/081: WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f() Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: fix race between truncate and __ext4_journalled_writepage()Theodore Ts'o
commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream. The commit cf108bca465d: "ext4: Invert the locking order of page_lock and transaction start" caused __ext4_journalled_writepage() to drop the page lock before the page was written back, as part of changing the locking order to jbd2_journal_start -> page_lock. However, this introduced a potential race if there was a truncate racing with the data=journalled writeback mode. Fix this by grabbing the page lock after starting the journal handle, and then checking to see if page had gotten truncated out from under us. This fixes a number of different warnings or BUG_ON's when running xfstests generic/086 in data=journalled mode, including: jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7 c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0 - and - kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200! ... Call Trace: [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c027d883>] ? lock_buffer+0x36/0x36 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22 [<c0229139>] do_invalidatepage+0x22/0x26 [<c0229198>] truncate_inode_page+0x5b/0x85 [<c022934b>] truncate_inode_pages_range+0x156/0x38c [<c0229592>] truncate_inode_pages+0x11/0x15 [<c022962d>] truncate_pagecache+0x55/0x71 [<c02b913b>] ext4_setattr+0x4a9/0x560 [<c01ca542>] ? current_kernel_time+0x10/0x44 [<c026c4d8>] notify_change+0x1c7/0x2be [<c0256a00>] do_truncate+0x65/0x85 [<c0226f31>] ? file_ra_state_init+0x12/0x29 - and - WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396 irty_metadata+0x14a/0x1ae() ... Call Trace: [<c01b879f>] ? console_unlock+0x3a1/0x3ce [<c082cbb4>] dump_stack+0x48/0x60 [<c0178b65>] warn_slowpath_common+0x89/0xa0 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae [<c0178bef>] warn_slowpath_null+0x14/0x18 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d [<c02b2f44>] write_end_fn+0x40/0x53 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a [<c02b59e7>] ext4_writepage+0x354/0x3b8 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b5a5b>] __writepage+0x10/0x2e [<c0225956>] write_cache_pages+0x22d/0x32c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b6ee8>] ext4_writepages+0x102/0x607 [<c019adfe>] ? sched_clock_local+0x10/0x10e [<c01a8a7c>] ? __lock_is_held+0x2e/0x44 [<c01a8ad5>] ? lock_is_held+0x43/0x51 [<c0226dff>] do_writepages+0x1c/0x29 [<c0276bed>] __writeback_single_inode+0xc3/0x545 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d ... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-06ext4: check for zero length extent explicitlyEryu Guan
commit 2f974865ffdfe7b9f46a9940836c8b167342563d upstream. The following commit introduced a bug when checking for zero length extent 5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries() Zero length extent could pass the check if lblock is zero. Adding the explicit check for zero length back. Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-06ext4: fix NULL pointer dereference when journal restart failsLukas Czerner
commit 9d506594069355d1fb2de3f9104667312ff08ed3 upstream. Currently when journal restart fails, we'll have the h_transaction of the handle set to NULL to indicate that the handle has been effectively aborted. We handle this situation quietly in the jbd2_journal_stop() and just free the handle and exit because everything else has been done before we attempted (and failed) to restart the journal. Unfortunately there are a number of problems with that approach introduced with commit 41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart() fails" First of all in ext4 jbd2_journal_stop() will be called through __ext4_journal_stop() where we would try to get a hold of the superblock by dereferencing h_transaction which in this case would lead to NULL pointer dereference and crash. In addition we're going to free the handle regardless of the refcount which is bad as well, because others up the call chain will still reference the handle so we might potentially reference already freed memory. Moreover it's expected that we'll get aborted handle as well as detached handle in some of the journalling function as the error propagates up the stack, so it's unnecessary to call WARN_ON every time we get detached handle. And finally we might leak some memory by forgetting to free reserved handle in jbd2_journal_stop() in the case where handle was detached from the transaction (h_transaction is NULL). Fix the NULL pointer dereference in __ext4_journal_stop() by just calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix the potential memory leak in jbd2_journal_stop() and use proper handle refcounting before we attempt to free it to avoid use-after-free issues. And finally remove all WARN_ON(!transaction) from the code so that we do not get random traces when something goes wrong because when journal restart fails we will get to some of those functions. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-13ext4: fix data corruption caused by unwritten and delayed extentsLukas Czerner
commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream. Currently it is possible to lose whole file system block worth of data when we hit the specific interaction with unwritten and delayed extents in status extent tree. The problem is that when we insert delayed extent into extent status tree the only way to get rid of it is when we write out delayed buffer. However there is a limitation in the extent status tree implementation so that when inserting unwritten extent should there be even a single delayed block the whole unwritten extent would be marked as delayed. At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when a we write into said unwritten extent we will convert it to written, but it still remains delayed. When we try to write into that block later ext4_da_map_blocks() will set the buffer new and delayed and map it to invalid block which causes the rest of the block to be zeroed loosing already written data. For now we can fix this by simply not allowing to set delayed status on written extent in the extent status tree. Also add WARN_ON() to make sure that we notice if this happens in the future. This problem can be easily reproduced by running the following xfs_io. xfs_io -f -c "pwrite -S 0xaa 4096 2048" \ -c "falloc 0 131072" \ -c "pwrite -S 0xbb 65536 2048" \ -c "fsync" /mnt/test/fff echo 3 > /proc/sys/vm/drop_caches xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff This can be theoretically also reproduced by at random by running fsx, but it's not very reliable, though on machines with bigger page size (like ppc) this can be seen more often (especially xfstest generic/127) Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06ext4: make fsync to sync parent dir in no-journal for real this timeLukas Czerner
commit e12fb97222fc41e8442896934f76d39ef99b590a upstream. Previously commit 14ece1028b3ed53ffec1b1213ffc6acaf79ad77c added a support for for syncing parent directory of newly created inodes to make sure that the inode is not lost after a power failure in no-journal mode. However this does not work in majority of cases, namely: - if the directory has inline data - if the directory is already indexed - if the directory already has at least one block and: - the new entry fits into it - or we've successfully converted it to indexed So in those cases we might lose the inode entirely even after fsync in the no-journal mode. This also includes ext2 default mode obviously. I've noticed this while running xfstest generic/321 and even though the test should fail (we need to run fsck after a crash in no-journal mode) I could not find a newly created entries even when if it was fsynced before. Fix this by adjusting the ext4_add_entry() successful exit paths to set the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the parent directory as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Frank Mayhar <fmayhar@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-11ext4: prevent bugon on race between write/fcntlDmitry Monakhov
commit a41537e69b4aa43f0fea02498c2595a81267383b upstream. O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked twice inside ext4_file_write_iter() and __generic_file_write() which result in BUG_ON inside ext4_direct_IO. Let's initialize iocb->private unconditionally. TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/ #TYPICAL STACK TRACE: kernel BUG at fs/ext4/inode.c:2960! invalid opcode: 0000 [#1] SMP Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000 RIP: 0010:[<ffffffff811fabf2>] [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246 RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818 RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001 RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80 R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28 FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0 Stack: ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30 0000000000000200 0000000000000200 0000000000000001 0000000000000200 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200 Call Trace: [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410 [<ffffffff810990e5>] ? local_clock+0x25/0x30 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740 [<ffffffff811be030>] SyS_io_submit+0x10/0x20 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c 24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89 RIP [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0 RSP <ffff88080f90bb58> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> [hujianyang: Backported to 3.10 - Move initialization of iocb->private to ext4_file_write() as we don't have ext4_file_write_iter(), which is introduced by commit 9b884164. - Adjust context to make 'overwrite' changes apply to ext4_file_dio_write() as ext4_file_dio_write() is not move into ext4_file_write()] Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: non-atomically mark page accessed during page cache allocation where ↵Mel Gorman
possible commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream. aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: fix oops when loading block bitmap failedJan Kara
commit 599a9b77ab289d85c2d5c8607624efbe1f552b0f upstream. When we fail to load block bitmap in __ext4_new_inode() we will dereference NULL pointer in ext4_journal_get_write_access(). So check for error from ext4_read_block_bitmap(). Coverity-id: 989065 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: enable journal checksum when metadata checksum feature enabledDarrick J. Wong
commit 98c1a7593fa355fda7f5a5940c8bf5326ca964ba upstream. If metadata checksumming is turned on for the FS, we need to tell the journal to use checksumming too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: fix overflow when updating superblock backups after resizeJan Kara
commit 9378c6768e4fca48971e7b6a9075bc006eda981d upstream. When there are no meta block groups update_backups() will compute the backup block in 32-bit arithmetics thus possibly overflowing the block number and corrupting the filesystem. OTOH filesystems without meta block groups larger than 16 TB should be rare. Fix the problem by doing the counting in 64-bit arithmetics. Coverity-id: 741252 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: check s_chksum_driver when looking for bg csum presenceDarrick J. Wong
commit 813d32f91333e4c33d5a19b67167c4bae42dae75 upstream. Convert the ext4_has_group_desc_csum predicate to look for a checksum driver instead of the metadata_csum flag and change the bg checksum calculation function to look for GDT_CSUM before taking the crc16 path. Without this patch, if we mount with ^uninit_bg,^metadata_csum and later metadata_csum gets turned on by accident, the block group checksum functions will incorrectly assume that checksumming is enabled (metadata_csum) but that crc16 should be used (!s_chksum_driver). This is totally wrong, so fix the predicate and the checksum formula selection. (Granted, if the metadata_csum feature bit gets enabled on a live FS then something underhanded is going on, but we could at least avoid writing garbage into the on-disk fields.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: Replace open coded mdata csum feature to helper functionDmitry Monakhov
commit 9aa5d32ba269bec0e7eaba2697a986a7b0bc8528 upstream. Besides the fact that this replacement improves code readability it also protects from errors caused direct EXT4_S(sb)->s_es manipulation which may result attempt to use uninitialized csum machinery. #Testcase_BEGIN IMG=/dev/ram0 MNT=/mnt mkfs.ext4 $IMG mount $IMG $MNT #Enable feature directly on disk, on mounted fs tune2fs -O metadata_csum $IMG # Provoke metadata update, likey result in OOPS touch $MNT/test umount $MNT #Testcase_END # Replacement script @@ expression E; @@ - EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) + ext4_has_metadata_csum(E) https://bugzilla.kernel.org/show_bug.cgi?id=82201 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: fix reservation overflow in ext4_da_write_beginEric Sandeen
commit 0ff8947fc5f700172b37cbca811a38eb9cb81e08 upstream. Delalloc write journal reservations only reserve 1 credit, to update the inode if necessary. However, it may happen once in a filesystem's lifetime that a file will cross the 2G threshold, and require the LARGE_FILE feature to be set in the superblock as well, if it was not set already. This overruns the transaction reservation, and can be demonstrated simply on any ext4 filesystem without the LARGE_FILE feature already set: dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \ conv=notrunc of=testfile sync dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \ conv=notrunc of=testfile leads to: EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28 EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28 EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28 Adjust the number of credits based on whether the flag is already set, and whether the current write may extend past the LARGE_FILE limit. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: add ext4_iget_normal() which is to be used for dir tree lookupsTheodore Ts'o
commit f4bb2981024fc91b23b4d09a8817c415396dbabb upstream. If there is a corrupted file system which has directory entries that point at reserved, metadata inodes, prohibit them from being used by treating them the same way we treat Boot Loader inodes --- that is, mark them to be bad inodes. This prohibits them from being opened, deleted, or modified via chmod, chown, utimes, etc. In particular, this prevents a corrupted file system which has a directory entry which points at the journal inode from being deleted and its blocks released, after which point Much Hilarity Ensues. Reported-by: Sami Liedes <sami.liedes@iki.fi> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: grab missed write_count for EXT4_IOC_SWAP_BOOTDmitry Monakhov
commit 3e67cfad22230ebed85c56cbe413876f33fea82b upstream. Otherwise this provokes complain like follows: WARNING: CPU: 12 PID: 5795 at fs/ext4/ext4_jbd2.c:48 ext4_journal_check_start+0x4e/0xa0() Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod CPU: 12 PID: 5795 Comm: python Not tainted 3.17.0-rc2-00175-gae5344f #158 Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011 0000000000000030 ffff8808116cfd28 ffffffff815c7dfc 0000000000000030 0000000000000000 ffff8808116cfd68 ffffffff8106ce8c ffff8808116cfdc8 ffff880813b16000 ffff880806ad6ae8 ffffffff81202008 0000000000000000 Call Trace: [<ffffffff815c7dfc>] dump_stack+0x51/0x6d [<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0 [<ffffffff81202008>] ? ext4_ioctl+0x9e8/0xeb0 [<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20 [<ffffffff8122867e>] ext4_journal_check_start+0x4e/0xa0 [<ffffffff81228c10>] __ext4_journal_start_sb+0x90/0x110 [<ffffffff81202008>] ext4_ioctl+0x9e8/0xeb0 [<ffffffff8107b0bd>] ? ptrace_stop+0x24d/0x2f0 [<ffffffff81088530>] ? alloc_pid+0x480/0x480 [<ffffffff8107b1f2>] ? ptrace_do_notify+0x92/0xb0 [<ffffffff81186545>] do_vfs_ioctl+0x4e5/0x550 [<ffffffff815cdbcb>] ? _raw_spin_unlock_irq+0x2b/0x40 [<ffffffff81186603>] SyS_ioctl+0x53/0x80 [<ffffffff815ce2ce>] tracesys+0xd0/0xd5 Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: fix mmap data corruption when blocksize < pagesizeJan Kara
commit d6320cbfc92910a3e5f10c42d98c231c98db4f60 upstream. Use truncate_isize_extended() when hole is being created in a file so that ->page_mkwrite() will get called for the partial tail page if it is mmaped (see the first patch in the series for details). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: don't check quota format when there are no quota filesJan Kara
commit 279bf6d390933d5353ab298fcc306c391a961469 upstream. The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: check EA value offset when loadingDarrick J. Wong
commit a0626e75954078cfacddb00a4545dde821170bc5 upstream. When loading extended attributes, check each entry's value offset to make sure it doesn't collide with the entries. Without this check it is easy to crash the kernel by mounting a malicious FS containing a file with an EA wherein e_value_offs = 0 and e_value_size > 0 and then deleting the EA, which corrupts the name list. (See the f_ea_value_crash test's FS image in e2fsprogs for an example.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05jbd2: fix descriptor block size handling errors with journal_csumDarrick J. Wong
commit db9ee220361de03ee86388f9ea5e529eaad5323c upstream. It turns out that there are some serious problems with the on-disk format of journal checksum v2. The foremost is that the function to calculate descriptor tag size returns sizes that are too big. This causes alignment issues on some architectures and is compounded by the fact that some parts of jbd2 use the structure size (incorrectly) to determine the presence of a 64bit journal instead of checking the feature flags. Therefore, introduce journal checksum v3, which enlarges the descriptor block tag format to allow for full 32-bit checksums of journal blocks, fix the journal tag function to return the correct sizes, and fix the jbd2 recovery code to use feature flags to determine 64bitness. Add a few function helpers so we don't have to open-code quite so many pieces. Switching to a 16-byte block size was found to increase journal size overhead by a maximum of 0.1%, to convert a 32-bit journal with no checksumming to a 32-bit journal with checksum v3 enabled. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: TR Reardon <thomas_reardon@hotmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05ext4: update i_disksize coherently with block allocation on error pathDmitry Monakhov
commit 6603120e96eae9a5d6228681ae55c7fdc998d1bb upstream. In case of delalloc block i_disksize may be less than i_size. So we have to update i_disksize each time we allocated and submitted some blocks beyond i_disksize. We weren't doing this on the error paths, so fix this. testcase: xfstest generic/019 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05ext4: fix BUG_ON in mb_free_blocks()Theodore Ts'o
commit c99d1e6e83b06744c75d9f5e491ed495a7086b7b upstream. If we suffer a block allocation failure (for example due to a memory allocation failure), it's possible that we will call ext4_discard_allocated_blocks() before we've actually allocated any blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0) triggering the BUG_ON on mb_free_blocks(): BUG_ON(last >= (sb->s_blocksize << 3)); Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len is zero. Also fix a missing ext4_mb_unload_buddy() call in ext4_discard_allocated_blocks(). Google-Bug-Id: 16844242 Fixes: 86f0afd463215fc3e58020493482faa4ac3a4d69 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa structTheodore Ts'o
commit 86f0afd463215fc3e58020493482faa4ac3a4d69 upstream. If there is a failure while allocating the preallocation structure, a number of blocks can end up getting marked in the in-memory buddy bitmap, and then not getting released. This can result in the following corruption getting reported by the kernel: EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126, 12793 clusters in bitmap, 12729 in gd In that case, we need to release the blocks using mb_free_blocks(). Tested: fs smoke test; also demonstrated that with injected errors, the file system is no longer getting corrupted Google-Bug-Id: 16657874 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: fix a potential deadlock in __ext4_es_shrink()Theodore Ts'o
commit 3f1f9b851311a76226140b55b1ea22111234a7c2 upstream. This fixes the following lockdep complaint: [ INFO: possible circular locking dependency detected ] 3.16.0-rc2-mm1+ #7 Tainted: G O ------------------------------------------------------- kworker/u24:0/4356 is trying to acquire lock: (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0 but task is already holding lock: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180 which lock already depends on the new lock. Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->i_es_lock); lock(&(&sbi->s_es_lru_lock)->rlock); lock(&ei->i_es_lock); lock(&(&sbi->s_es_lru_lock)->rlock); *** DEADLOCK *** 6 locks held by kworker/u24:0/4356: #0: ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560 #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560 #2: (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90 #3: (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0 #4: (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550 #5: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180 stack backtrace: CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G O 3.16.0-rc2-mm1+ #7 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback bdi_writeback_workfn (flush-253:0) ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007 ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568 ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930 Call Trace: [<ffffffff815df0bb>] dump_stack+0x4e/0x68 [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00 [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff810a8707>] lock_acquire+0x87/0x120 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70 [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0 [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0 [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180 [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550 [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00 ... Reported-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Minchan Kim <minchan@kernel.org> Cc: Zheng Liu <gnehzuil.liu@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: disable synchronous transaction batching if max_batch_time==0Eric Sandeen
commit 5dd214248f94d430d70e9230bda72f2654ac88a8 upstream. The mount manpage says of the max_batch_time option, This optimization can be turned off entirely by setting max_batch_time to 0. But the code doesn't do that. So fix the code to do that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()Theodore Ts'o
commit 94d4c066a4ff170a2671b1a9b153febbf36796f6 upstream. We are spending a lot of time explaining to users what this error means. Let's try to improve the message to avoid this problem. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: clarify error count warning messagesTheodore Ts'o
commit ae0f78de2c43b6fadd007c231a352b13b5be8ed2 upstream. Make it clear that values printed are times, and that it is error since last fsck. Also add note about fsck version required. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: fix unjournalled bg descriptor while initializing inode bitmapTheodore Ts'o
commit 61c219f5814277ecb71d64cb30297028d6665979 upstream. The first time that we allocate from an uninitialized inode allocation bitmap, if the block allocation bitmap is also uninitalized, we need to get write access to the block group descriptor before we start modifying the block group descriptor flags and updating the free block count, etc. Otherwise, there is the potential of a bad journal checksum (if journal checksums are enabled), and of the file system becoming inconsistent if we crash at exactly the wrong time. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09ext4: Fix hole punching for files with indirect blocksJan Kara
commit a93cd4cf86466caa49cfe64607bea7f0bde3f916 upstream. Hole punching code for files with indirect blocks wrongly computed number of blocks which need to be cleared when traversing the indirect block tree. That could result in punching more blocks than actually requested and thus effectively cause a data loss. For example: fallocate -n -p 10240000 4096 will punch the range 10240000 - 12632064 instead of the range 1024000 - 10244096. Fix the calculation. Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09ext4: Fix buffer double free in ext4_alloc_branch()Jan Kara
commit c5c7b8ddfbf8cb3b2291e515a34ab1b8982f5a2d upstream. Error recovery in ext4_alloc_branch() calls ext4_forget() even for buffer corresponding to indirect block it did not allocate. This leads to brelse() being called twice for that buffer (once from ext4_forget() and once from cleanup in ext4_ind_map_blocks()) leading to buffer use count misaccounting. Eventually (but often much later because there are other users of the buffer) we will see messages like: VFS: brelse: Trying to free free buffer Another manifestation of this problem is an error: JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh); inconsistent data on disk The fix is easy - don't forget buffer we did not allocate. Also add an explanatory comment because the indexing at ext4_alloc_branch() is somewhat subtle. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30ext4: fix wrong assert in ext4_mb_normalize_request()Maurizio Lombardi
commit b5b60778558cafad17bbcbf63e0310bd3c68eb17 upstream. The variable "size" is expressed as number of blocks and not as number of clusters, this could trigger a kernel panic when using ext4 with the size of a cluster different from the size of a block. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30ext4: fix zeroing of page during writebackJan Kara
commit eeece469dedadf3918bad50ad80f4616a0064e90 upstream. Tail of a page straddling inode size must be zeroed when being written out due to POSIX requirement that modifications of mmaped page beyond inode size must not be written to the file. ext4_bio_write_page() did this only for blocks fully beyond inode size but didn't properly zero blocks partially beyond inode size. Fix this. The problem has been uncovered by mmap_11-4 test in openposix test suite (part of LTP). Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com> Fixes: 5a0dc7365c240 Fixes: bd2d0210cf22f CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30ext4: fix data integrity sync in ordered modeNamjae Jeon
commit 1c8349a17137b93f0a83f276c764a6df1b9a116e upstream. When we perform a data integrity sync we tag all the dirty pages with PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages. Later we check for this tag in write_cache_pages_da and creates a struct mpage_da_data containing contiguously indexed pages tagged with this tag and sync these pages with a call to mpage_da_map_and_submit. This process is done in while loop until all the PAGECACHE_TAG_TOWRITE pages are synced. We also do journal start and stop in each iteration. journal_stop could initiate journal commit which would call ext4_writepage which in turn will call ext4_bio_write_page even for delayed OR unwritten buffers. When ext4_bio_write_page is called for such buffers, even though it does not sync them but it clears the PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages are also not synced by the currently running data integrity sync. We will end up with dirty pages although sync is completed. This could cause a potential data loss when the sync call is followed by a truncate_pagecache call, which is exactly the case in collapse_range. (It will cause generic/127 failure in xfstests) To avoid this issue, we can use set_page_writeback_keepwrite instead of set_page_writeback, which doesn't clear TOWRITE tag. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06ext4: use i_size_read in ext4_unaligned_aio()Theodore Ts'o
commit 6e6358fc3c3c862bfe9a5bc029d3f8ce43dc9765 upstream. We haven't taken i_mutex yet, so we need to use i_size_read(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent()Theodore Ts'o
commit 622cad1325e404598fe3b148c3fa640dbaabc235 upstream. The function ext4_update_i_disksize() is used in only one place, in the function mpage_map_and_submit_extent(). Move its code to simplify the code paths, and also move the call to ext4_mark_inode_dirty() into the i_data_sem's critical region, to be consistent with all of the other places where we update i_disksize. That way, we also keep the raw_inode's i_disksize protected, to avoid the following race: CPU #1 CPU #2 down_write(&i_data_sem) Modify i_disk_size up_write(&i_data_sem) down_write(&i_data_sem) Modify i_disk_size Copy i_disk_size to on-disk inode up_write(&i_data_sem) Copy i_disk_size to on-disk inode Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06ext4: fix jbd2 warning under heavy xattr loadJan Kara
commit ec4cb1aa2b7bae18dd8164f2e9c7c51abcf61280 upstream. When heavily exercising xattr code the assertion that jbd2_journal_dirty_metadata() shouldn't return error was triggered: WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237 jbd2_journal_dirty_metadata+0x1ba/0x260() CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G W 3.10.0-ceph-00049-g68d04c9 #1 Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978 Call Trace: [<ffffffff816311b0>] dump_stack+0x19/0x1b [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0 [<ffffffff8119421e>] setxattr+0x13e/0x1e0 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b The reason for the warning is that buffer_head passed into jbd2_journal_dirty_metadata() didn't have journal_head attached. This is caused by the following race of two ext4_xattr_release_block() calls: CPU1 CPU2 ext4_xattr_release_block() ext4_xattr_release_block() lock_buffer(bh); /* False */ if (BHDR(bh)->h_refcount == cpu_to_le32(1)) } else { le32_add_cpu(&BHDR(bh)->h_refcount, -1); unlock_buffer(bh); lock_buffer(bh); /* True */ if (BHDR(bh)->h_refcount == cpu_to_le32(1)) get_bh(bh); ext4_free_blocks() ... jbd2_journal_forget() jbd2_journal_unfile_buffer() -> JH is gone error = ext4_handle_dirty_xattr_block(handle, inode, bh); -> triggers the warning We fix the problem by moving ext4_handle_dirty_xattr_block() under the buffer lock. Sadly this cannot be done in nojournal mode as that function can call sync_dirty_buffer() which would deadlock. Luckily in nojournal mode the race is harmless (we only dirty already freed buffer) and thus for nojournal mode we leave the dirtying outside of the buffer lock. Reported-by: Sage Weil <sage@inktank.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06ext4: note the error in ext4_end_bio()Matthew Wilcox
commit 9503c67c93ed0b95ba62d12d1fd09da6245dbdd6 upstream. ext4_end_bio() currently throws away the error that it receives. Chances are this is part of a spate of errors, one of which will end up getting the error returned to userspace somehow, but we shouldn't take that risk. Also print out the errno to aid in debug. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06ext4: FIBMAP ioctl causes BUG_ON due to handle EXT_MAX_BLOCKSKazuya Mio
commit 4adb6ab3e0fa71363a5ef229544b2d17de6600d7 upstream. When we try to get 2^32-1 block of the file which has the extent (ee_block=2^32-2, ee_len=1) with FIBMAP ioctl, it causes BUG_ON in ext4_ext_put_gap_in_cache(). To avoid the problem, ext4_map_blocks() needs to check the file logical block number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot handle 2^32-1 because the maximum file logical block number is 2^32-2. Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid. So ext4_map_blocks() should also return the same errno. Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26ext4: fix premature freeing of partial clusters split across leaf blocksEric Whitney
commit ad6599ab3ac98a4474544086e048ce86ec15a4d1 upstream. Xfstests generic/311 and shared/298 fail when run on a bigalloc file system. Kernel error messages produced during the tests report that blocks to be freed are already on the to-be-freed list. When e2fsck is run at the end of the tests, it typically reports bad i_blocks and bad free blocks counts. The bug that causes these failures is located in ext4_ext_rm_leaf(). Code at the end of the function frees a partial cluster if it's not shared with an extent remaining in the leaf. However, if all the extents in the leaf have been removed, the code dereferences an invalid extent pointer (off the front of the leaf) when the check for sharing is made. This generally has the effect of unconditionally freeing the partial cluster, which leads to the observed failures when the partial cluster is shared with the last extent in the next leaf. Fix this by attempting to free the cluster only if extents remain in the leaf. Any remaining partial cluster will be freed if possible when the next leaf is processed or when leaf removal is complete. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26ext4: fix partial cluster handling for bigalloc file systemsEric Whitney
commit c06344939422bbd032ac967223a7863de57496b5 upstream. Commit 9cb00419fa, which enables hole punching for bigalloc file systems, exposed a bug introduced by commit 6ae06ff51e in an earlier release. When run on a bigalloc file system, xfstests generic/013, 068, 075, 083, 091, 100, 112, 127, 263, 269, and 270 fail with e2fsck errors or cause kernel error messages indicating that previously freed blocks are being freed again. The latter commit optimizes the selection of the starting extent in ext4_ext_rm_leaf() when hole punching by beginning with the extent supplied in the path argument rather than with the last extent in the leaf node (as is still done when truncating). However, the code in rm_leaf that initially sets partial_cluster to track cluster sharing on extent boundaries is only guaranteed to run if rm_leaf starts with the last node in the leaf. Consequently, partial_cluster is not correctly initialized when hole punching, and a cluster on the boundary of a punched region that should be retained may instead be deallocated. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26ext4: fix error return from ext4_ext_handle_uninitialized_extents()Eric Whitney
commit ce37c42919608e96ade3748fe23c3062a0a966c5 upstream. Commit 3779473246 breaks the return of error codes from ext4_ext_handle_uninitialized_extents() in ext4_ext_map_blocks(). A portion of the patch assigns that function's signed integer return value to an unsigned int. Consequently, negatively valued error codes are lost and can be treated as a bogus allocated block count. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-30ext4: atomically set inode->i_flags in ext4_set_inode_flags()Theodore Ts'o
Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time. Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-16ext4: don't leave i_crtime.tv_sec uninitializedTheodore Ts'o
If the i_crtime field is not present in the inode, don't leave the field uninitialized. Fixes: ef7f38359 ("ext4: Add nanosecond timestamps") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-02-15ext4: fix online resize with a non-standard blocks per group settingTheodore Ts'o
The set_flexbg_block_bitmap() function assumed that the number of blocks in a blockgroup was sb->blocksize * 8, which is normally true, but not always! Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block bitmap corruption after: mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Jon Bernard <jbernard@tuxion.com> Cc: stable@vger.kernel.org
2014-02-15ext4: fix online resize with very large inode tablesTheodore Ts'o
If a file system has a large number of inodes per block group, all of the metadata blocks in a flex_bg may be larger than what can fit in a single block group. Unfortunately, ext4_alloc_group_tables() in resize.c was never tested to see if it would handle this case correctly, and there were a large number of bugs which caused the following sequence to result in a BUG_ON: kernel bug at fs/ext4/resize.c:409! ... call trace: [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830 [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80 [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00 [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0 [<ffffffff811b9df2>] ? final_putname+0x22/0x50 [<ffffffff811c1371>] sys_ioctl+0x81/0xa0 [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0 rip [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180 This can be reproduced with the following command sequence: mke2fs -t ext4 -i 4096 /dev/vdd 1G mount -t ext4 /dev/vdd /vdd resize2fs /dev/vdd 8G To fix this, we need to make sure the right thing happens when a block group's inode table straddles two block groups, which means the following bugs had to be fixed: 1) Not clearing the BLOCK_UNINIT flag in the second block group in ext4_alloc_group_tables --- the was proximate cause of the BUG_ON. 2) Incorrectly determining how many block groups contained contiguous free blocks in ext4_alloc_group_tables(). 3) Incorrectly setting the start of the next block range to be marked in use after a discontinuity in setup_new_flex_group_blocks(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org