summaryrefslogtreecommitdiff
path: root/fs/jbd2
AgeCommit message (Collapse)Author
2012-10-07jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_bufferEric Sandeen
commit 15291164b22a357cb211b618adfef4fa82fc0de3 upstream. journal_unmap_buffer()'s zap_buffer: code clears a lot of buffer head state ala discard_buffer(), but does not touch _Delay or _Unwritten as discard_buffer() does. This can be problematic in some areas of the ext4 code which assume that if they have found a buffer marked unwritten or delay, then it's a live one. Perhaps those spots should check whether it is mapped as well, but if jbd2 is going to tear down a buffer, let's really tear it down completely. Without this I get some fsx failures on sub-page-block filesystems up until v3.2, at which point 4e96b2dbbf1d7e81f22047a50f862555a6cb87cb and 189e868fa8fdca702eb9db9d8afc46b5cb9144c9 make the failures go away, because buried within that large change is some more flag clearing. I still think it's worth doing in jbd2, since ->invalidatepage leads here directly, and it's the right place to clear away these flags. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org BugLink: http://bugs.launchpad.net/bugs/929781 CVE-2011-4086 Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2011-12-21jbd/jbd2: validate sb->s_first in journal_get_superblock()Eryu Guan
commit 8762202dd0d6e46854f786bdb6fb3780a1625efe upstream. I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when mounting a fsfuzzed ext3 image. It turns out that the corrupted ext3 image has s_first = 0 in journal superblock, and the 0 is passed to journal->j_head in journal_reset(), then to blocknr in cleanup_journal_tail(), in the end the J_ASSERT failed. So validate s_first after reading journal superblock from disk in journal_get_superblock() to ensure s_first is valid. The following script could reproduce it: fstype=ext3 blocksize=1024 img=$fstype.img offset=0 found=0 magic="c0 3b 39 98" dd if=/dev/zero of=$img bs=1M count=8 mkfs -t $fstype -b $blocksize -F $img filesize=`stat -c %s $img` while [ $offset -lt $filesize ] do if od -j $offset -N 4 -t x1 $img | grep -i "$magic";then echo "Found journal: $offset" found=1 break fi offset=`echo "$offset+$blocksize" | bc` done if [ $found -ne 1 ];then echo "Magic \"$magic\" not found" exit 1 fi dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1 mkdir -p ./mnt mount -o loop $img ./mnt Cc: Jan Kara <jack@suse.cz> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Moritz Mühlenhoff <jmm@inutil.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-02ext4, jbd2: Add barriers for file systems with exernal journalsTheodore Ts'o
commit cc3e1bea5d87635c519da657303690f5538bb4eb upstream (as of v2.6.33-rc3) This is a bit complicated because we are trying to optimize when we send barriers to the fs data disk. We could just throw in an extra barrier to the data disk whenever we send a barrier to the journal disk, but that's not always strictly necessary. We only need to send a barrier during a commit when there are data blocks which are must be written out due to an inode written in ordered mode, or if fsync() depends on the commit to force data blocks to disk. Finally, before we drop transactions from the beginning of the journal during a checkpoint operation, we need to guarantee that any blocks that were flushed out to the data disk are firmly on the rust platter before we drop the transaction from the journal. Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05jbd: jbd-debug and jbd2-debug should be writableYin Kangkai
commit 765f8361902d015c864d5e62019b2f139452d7ef upstream. jbd-debug and jbd2-debug is currently read-only (S_IRUGO), which is not correct. Make it writable so that we can start debuging. Signed-off-by: Yin Kangkai <kangkai.yin@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18jbd2: don't wipe the journal on a failed journal checksumTheodore Ts'o
commit e6a47428de84e19fda52f21ab73fde2906c40d09 upstream. If there is a failed journal checksum, don't reset the journal. This allows for userspace programs to decide how to recover from this situation. It may be that ignoring the journal checksum failure might be a better way of recovering the file system. Once we add per-block checksums, we can definitely do better. Until then, a system administrator can try backing up the file system image (or taking a snapshot) and and trying to determine experimentally whether ignoring the checksum failure or aborting the journal replay results in less data loss. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14ext4: Wait for proper transaction commit on fsyncJan Kara
(cherry picked from commit b436b9bef84de6893e86346d8fbf7104bc520645) We cannot rely on buffer dirty bits during fsync because pdflush can come before fsync is called and clear dirty bits without forcing a transaction commit. What we do is that we track which transaction has last changed the inode and which transaction last changed allocation and force it to disk on fsync. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-14jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()Theodore Ts'o
(cherry picked from commit e6ec116b67f46e0e7808276476554727b2e6240b) OOM happens. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-11-11JBD/JBD2: free j_wbuf if journal init fails.Tao Ma
If journal init fails, we need to free j_wbuf. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-10-01const: constify remaining file_operationsAlexey Dobriyan
[akpm@linux-foundation.org: fix KVM] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-30jbd2: Use tracepoints for history fileTheodore Ts'o
The /proc/fs/jbd2/<dev>/history was maintained manually; by using tracepoints, we can get all of the existing functionality of the /proc file plus extra capabilities thanks to the ftrace infrastructure. We save memory as a bonus. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29ext4, jbd2: Drop unneeded printks at mount and unmount timeTheodore Ts'o
There are a number of kernel printk's which are printed when an ext4 filesystem is mounted and unmounted. Disable them to economize space in the system logs. In addition, disabling the mballoc stats by default saves a number of unneeded atomic operations for every block allocation or deallocation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-23seq_file: constify seq_operationsJames Morris
Make all seq_operations structs const, to help mitigate against revectoring user-triggerable function pointers. This is derived from the grsecurity patch, although generated from scratch because it's simpler than extracting the changes from there. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-11ext4: Fix async commit mode to be safe by using a barrierTheodore Ts'o
Previously the journal_async_commit mount option was equivalent to using barrier=0 (and just as unsafe). This patch fixes it so that we eliminate the barrier before the commit block (by not using ordered mode), and explicitly issuing an empty barrier bio after writing the commit block. Because of the journal checksum, it is safe to do this; if the journal blocks are not all written before a power failure, the checksum in the commit block will prevent the last transaction from being replayed. Using the fs_mark benchmark, using journal_async_commit shows a 50% improvement: FSUse% Count Size Files/sec App Overhead 8 1000 10240 30.5 28242 vs. FSUse% Count Size Files/sec App Overhead 8 1000 10240 45.8 28620 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17jbd2: Annotate transaction start also for jbd2_journal_restart()Jan Kara
lockdep annotation for a transaction start has been at the end of jbd2_journal_start(). But a transaction is also started from jbd2_journal_restart(). Move the lockdep annotation to start_this_handle() which covers both cases. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-10jbd2: round commit timer up to avoid uncommitted transactionAndreas Dilger
fix jiffie rounding in jbd commit timer setup code. Rounding down could cause the timer to be fired before the corresponding transaction has expired. That transaction can stay not committed forever if no new transaction is created or expicit sync/umount happens. Signed-off-by: Alex Zhuravlev (Tomas) <alex.zhuravlev@sun.com> Signed-off-by: Andreas Dilger <adilger@sun.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-17jbd2: Fail to load a journal if it is too shortJan Kara
Due to on disk corruption, it can happen that journal is too short. Fail to load it in such case so that we don't oops somewhere later. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16writeback: get rid of wbc->for_writepagesJens Axboe
It's only set, it's never checked. Kill it. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-13jbd2: fix race between write_metadata_buffer and get_write_accessdingdinghua
The function jbd2_journal_write_metadata_buffer() calls jbd_unlock_bh_state(bh_in) too early; this could potentially allow another thread to call get_write_access on the buffer head, modify the data, and dirty it, and allowing the wrong data to be written into the journal. Fortunately, if we lose this race, the only time this will actually cause filesystem corruption is if there is a system crash or other unclean shutdown of the system before the next commit can take place. Signed-off-by: dingdinghua <dingdinghua85@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13jbd2: Fix a race between checkpointing code and journal_get_write_access()Jan Kara
The following race can happen: CPU1 CPU2 checkpointing code checks the buffer, adds it to an array for writeback do_get_write_access() ... lock_buffer() unlock_buffer() flush_batch() submits the buffer for IO __jbd2_journal_file_buffer() So a buffer under writeout is returned from do_get_write_access(). Since the filesystem code relies on the fact that journaled buffers cannot be written out, it does not take the buffer lock and so it can modify buffer while it is under writeout. That can lead to a filesystem corruption if we crash at the right moment. We fix the problem by clearing the buffer dirty bit under buffer_lock even if the buffer is on BJ_None list. Actually, we clear the dirty bit regardless the list the buffer is in and warn about the fact if the buffer is already journalled. Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>. Reported-by: dingdinghua <dingdinghua85@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-20jbd2: Remove GFP_ATOMIC kmalloc from inside spinlock critical regionTheodore Ts'o
Fix jbd2_dev_to_name(), a function used when pretty-printting jbd2 and ext4 tracepoints. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17jbd2: clean up jbd2_journal_try_to_free_buffers()Hisashi Hifumi
This patch reverts 3f31fddf, which is no longer needed because if a race between freeing buffer and committing transaction functionality occurs and dio gets error, currently dio falls back to buffered IO due to the commit 6ccfa806. Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Mingming Cao <cmm@us.ibm.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17jbd2: convert instrumentation from markers to tracepointsTheodore Ts'o
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09jbd2: Fix minor typos in comments in fs/jbd2/journal.cAlberto Bertogli
Signed-off-by: Alberto Bertogli <albertito@blitiri.com.ar> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2009-04-14jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke recordsTheodore Ts'o
The revoke records must be written using the same way as the rest of the blocks during the commit process; that is, either marked as synchronous writes or as asynchornous writes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-06jbd2: use WRITE_SYNC_PLUG instead of WRITE_SYNCJens Axboe
When you are going to be submitting several sync writes, we want to give the IO scheduler a chance to merge some of them. Instead of using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG and rely on sync_buffer() doing the unplug when someone does a wait_on_buffer()/lock_buffer(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-27jbd2: Update locking comentsJan Kara
Update information about locking in JBD2 revoke code. Inconsistency in comments found by Lin Tan <tammy000@gmail.com>. CC: Lin Tan <tammy000@gmail.com>. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-25ext4: Use WRITE_SYNC for commits which are caused by fsync()Theodore Ts'o
If a commit is triggered by fsync(), set a flag indicating the journal blocks associated with the transaction should be flushed out using WRITE_SYNC. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-10jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()Jan Kara
If we race with commit code setting i_transaction to NULL, we could possibly dereference it. Proper locking requires the journal pointer (to access journal->j_list_lock), which we don't have. So we have to change the prototype of the function so that filesystem passes us the journal pointer. Also add a more detailed comment about why the function jbd2_journal_begin_ordered_truncate() does what it does and how it should be used. Thanks to Dan Carpenter <error27@gmail.com> for pointing to the suspitious code. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Joel Becker <joel.becker@oracle.com> CC: linux-ext4@vger.kernel.org CC: ocfs2-devel@oss.oracle.com CC: mfasheh@suse.de CC: Dan Carpenter <error27@gmail.com>
2009-02-10jbd2: Fix return value of jbd2_journal_start_commit()Jan Kara
The function jbd2_journal_start_commit() returns 1 if either a transaction is committing or the function has queued a transaction commit. But it returns 0 if we raced with somebody queueing the transaction commit as well. This resulted in ext4_sync_fs() not functioning correctly (description from Arthur Jones): In the case of a data=ordered umount with pending long symlinks which are delayed due to a long list of other I/O on the backing block device, this causes the buffer associated with the long symlinks to not be moved to the inode dirty list in the second phase of fsync_super. Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are never written to the backing block device, causing long symlink corruption and exposing new or previously freed block data to userspace. This can be reproduced with a script created by Eric Sandeen <sandeen@redhat.com>: #!/bin/bash umount /mnt/test2 mount /dev/sdb4 /mnt/test2 rm -f /mnt/test2/* dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512 touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename /mnt/test2/link umount /mnt/test2 mount /dev/sdb4 /mnt/test2 ls /mnt/test2/ This patch fixes jbd2_journal_start_commit() to always return 1 when there's a transaction committing or queued for commit. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> CC: Eric Sandeen <sandeen@redhat.com> CC: linux-ext4@vger.kernel.org
2009-01-11ext4: fix wrong use of do_divSimon Holm Thøgersen
the following warning: fs/jbd2/journal.c: In function ‘jbd2_seq_info_show’: fs/jbd2/journal.c:850: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘uint32_t’ is caused by wrong usage of do_div that modifies the dividend in-place and returns the quotient. So not only would an incorrect value be displayed, but s->journal->j_average_commit_time would also be changed to a wrong value! Fix it by using div_u64 instead. Signed-off-by: Simon Holm Thøgersen <odie@cs.aau.dk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-08Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits) jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs ext4: Remove "extents" mount option block: Add Kconfig help which notes that ext4 needs CONFIG_LBD ext4: Make printk's consistently prefixed with "EXT4-fs: " ext4: Add sanity checks for the superblock before mounting the filesystem ext4: Add mount option to set kjournald's I/O priority jbd2: Submit writes to the journal using WRITE_SYNC jbd2: Add pid and journal device name to the "kjournald2 starting" message ext4: Add markers for better debuggability ext4: Remove code to create the journal inode ext4: provide function to release metadata pages under memory pressure ext3: provide function to release metadata pages under memory pressure add releasepage hooks to block devices which can be used by file systems ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc ext4: Init the complete page while building buddy cache ext4: Don't allow new groups to be added during block allocation ext4: mark the blocks/inode bitmap beyond end of group as used ext4: Use new buffer_head flag to check uninit group bitmaps initialization ext4: Fix the race between read_inode_bitmap() and ext4_new_inode() ext4: code cleanup ...
2009-01-06jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fsJan Kara
On 32-bit system with CONFIG_LBD getblk can fail because provided block number is too big. Add error checks so we fail gracefully if getblk() returns NULL (which can also happen on memory allocation failures). Thanks to David Maciejak from Fortinet's FortiGuard Global Security Research Team for reporting this bug. http://bugzilla.kernel.org/show_bug.cgi?id=12370 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> cc: stable@kernel.org
2009-01-05jbd2: Add buffer triggersJoel Becker
Filesystems often to do compute intensive operation on some metadata. If this operation is repeated many times, it can be very expensive. It would be much nicer if the operation could be performed once before a buffer goes to disk. This adds triggers to jbd2 buffer heads. Just before writing a metadata buffer to the journal, jbd2 will optionally call a commit trigger associated with the buffer. If the journal is aborted, an abort trigger will be called on any dirty buffers as they are dropped from pending transactions. ocfs2 will use this feature. Initially I tried to come up with a more generic trigger that could be used for non-buffer-related events like transaction completion. It doesn't tie nicely, because the information a buffer trigger needs (specific to a journal_head) isn't the same as what a transaction trigger needs (specific to a tranaction_t or perhaps journal_t). So I implemented a buffer set, with the understanding that journal/transaction wide triggers should be implemented separately. There is only one trigger set allowed per buffer. I can't think of any reason to attach more than one set. Contrast this with a journal or transaction in which multiple places may want to watch the entire transaction separately. The trigger sets are considered static allocation from the jbd2 perspective. ocfs2 will just have one trigger set per block type, setting the same set on every bh of the same type. Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-04jbd2: Submit writes to the journal using WRITE_SYNCTheodore Ts'o
Since we will be waiting the write of the commit record to the journal to complete in journal_submit_commit_record(), submit it using WRITE_SYNC. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-03jbd2: Add pid and journal device name to the "kjournald2 starting" messageTheodore Ts'o
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-06ext4: Remove code to create the journal inodeTheodore Ts'o
This code has been obsolete in quite some time, since the supported method for adding a journal inode is to use tune2fs (or to creating new filesystem with a journal via mke2fs or mkfs.ext4). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-06jbd2: Call journal commit callback without holding j_list_lockAneesh Kumar K.V
Avoid freeing the transaction in __jbd2_journal_drop_transaction() so the journal commit callback can run without holding j_list_lock, to avoid lock contention on this spinlock. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-05jbd2: Remove a large array of bh's from the stack of the checkpoint routineTheodore Ts'o
jbd2_log_do_checkpoint()n is one of the kernel's largest stack users. Move the array of buffer head's from the stack of jbd2_log_do_checkpoint() to the in-core journal structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-03ext4: add fsync batch tuning knobsTheodore Ts'o
Add new mount options, min_batch_time and max_batch_time, which controls how long the jbd2 layer should wait for additional filesystem operations to get batched with a synchronous write transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-12-17ext4: display average commit timeTheodore Ts'o
Display the average commit time (which is used by the ext4 fsync batching patch) in /proc/fs/jbd2/*/info for performance tuning purposes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-26jbd2: improve jbd2 fsync batchingJosef Bacik
This patch removes the static sleep time in favor of a more self optimizing approach where we measure the average amount of time it takes to commit a transaction to disk and the ammount of time a transaction has been running. If somebody does a sync write or an fsync() traditionally we would sleep for 1 jiffies, which depending on the value of HZ could be a significant amount of time compared to how long it takes to commit a transaction to the underlying storage. With this patch instead of sleeping for a jiffie, we check to see if the amount of time this transaction has been running is less than the average commit time, and if it is we sleep for the delta using schedule_hrtimeout to give us a higher precision sleep time. This greatly benefits high end storage where you could end up sleeping for longer than it takes to commit the transaction and therefore sitting idle instead of allowing the transaction to be committed by keeping the sleep time to a minimum so you are sure to always be doing something. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05jbd2: Add barrier not supported test to journal_wait_on_commit_recordTheodore Ts'o
Xen doesn't report that barriers are not supported until buffer I/O is reported as completed, instead of when the buffer I/O is submitted. Add a check and a fallback codepath to journal_wait_on_commit_record() to detect this case, so that attempts to mount ext4 filesystems on LVM/devicemapper devices on Xen guests don't blow up with an "Aborting journal on device XXX"; "Remounting filesystem read-only" error. Thanks to Andreas Sundstrom for reporting this issue. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2008-11-02jbd2: deregister proc on failure in jbd2_journal_init_inodeSami Liedes
jbd2_journal_init_inode() does not call jbd2_stats_proc_exit() on all failure paths after calling jbd2_stats_proc_init(). This leaves dangling references to the fs in proc. This patch fixes a bug reported by Sami Leides at: http://bugzilla.kernel.org/show_bug.cgi?id=11493 Signed-off-by: Sami Liedes <sliedes@cc.hut.fi> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-06jbd2: don't give up looking for space so easily in __jbd2_log_wait_for_spaceTheodore Ts'o
Commit 23f8b79e introducd a regression because it assumed that if there were no transactions ready to be checkpointed, that no progress could be made on making space available in the journal, and so the journal should be aborted. This assumption is false; it could be the case that simply calling jbd2_cleanup_journal_tail() will recover the necessary space, or, for small journals, the currently committing transaction could be responsible for chewing up the required space in the log, so we need to wait for the currently committing transaction to finish before trying to force a checkpoint operation. This patch fixes a bug reported by Mihai Harpau at: https://bugzilla.redhat.com/show_bug.cgi?id=469582 This patch fixes a bug reported by François Valenduc at: http://bugzilla.kernel.org/show_bug.cgi?id=11840 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Duane Griffin <duaneg@dghda.com> Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
2008-10-28jbd2: Call the commit callback before the transaction could get droppedTheodore Ts'o
The transaction can potentially get dropped if there are no buffers that need to be written. Make sure we call the commit callback before potentially deciding to drop the transaction. Also avoid dereferencing the commit_transaction pointer in the marker for the same reason. This patch fixes the bug reported by Eric Paris at: http://bugzilla.kernel.org/show_bug.cgi?id=11838 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Eric Sandeen <sandeen@redhat.com> Tested-by: Eric Paris <eparis@redhat.com>
2008-10-20fs/Kconfig: move ext2, ext3, ext4, JBD, JBD2 outAlexey Dobriyan
Use fs/*/Kconfig more, which is good because everything related to one filesystem is in one place and fs/Kconfig is quite fat. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16ext4: Replace hackish ext4_mb_poll_new_transaction with commit callbackTheodore Ts'o
The multiblock allocator needs to be able to release blocks (and issue a blkdev discard request) when the transaction which freed those blocks is committed. Previously this was done via a polling mechanism when blocks are allocated or freed. A much better way of doing things is to create a jbd2 callback function and attaching the list of blocks to be freed directly to the transaction structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10ext4: add an option to control error handling on file dataHidehiro Kawai
If the journal doesn't abort when it gets an IO error in file data blocks, the file data corruption will spread silently. Because most of applications and commands do buffered writes without fsync(), they don't notice the IO error. It's scary for mission critical systems. On the other hand, if the journal aborts whenever it gets an IO error in file data blocks, the system will easily become inoperable. So this patch introduces a filesystem option to determine whether it aborts the journal or just call printk() when it gets an IO error in file data. If you mount an ext4 fs with data_err=abort option, it aborts on file data write error. If you mount it with data_err=ignore, it doesn't abort, just call printk(). data_err=ignore is the default. Here is the corresponding patch of the ext3 version: http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374 Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10jbd2: don't dirty original metadata buffer on abortHidehiro Kawai
Currently, original metadata buffers are dirtied when they are unfiled whether the journal has aborted or not. Eventually these buffers will be written-back to the filesystem by pdflush. This means some metadata buffers are written to the filesystem without journaling if the journal aborts. So if both journal abort and system crash happen at the same time, the filesystem would become inconsistent state. Additionally, replaying journaled metadata can overwrite the latest metadata on the filesystem partly. Because, if the journal gets aborted, journaled metadata are preserved and replayed during the next mount not to lose uncheckpointed metadata. This would also break the consistency of the filesystem. This patch prevents original metadata buffers from being dirtied on abort by clearing BH_JBDDirty flag from those buffers. Thus, no metadata buffers are written to the filesystem without journaling. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10jbd2: fix error handling for checkpoint ioHidehiro Kawai
When a checkpointing IO fails, current JBD2 code doesn't check the error and continue journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of jbd2_log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list where in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. jbd2_log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort the journal 3. when checkpointing fails, don't update the journal super block to prevent the journaled contents from being cleaned. For safety, don't update j_tail and j_tail_sequence either 4. when checkpointing fails, notify this error to the ext4 layer so that ext4 don't clear the needs_recovery flag, otherwise the journaled contents are ignored and cleaned in the recovery phase 5. if the recovery fails, keep the needs_recovery flag 6. prevent jbd2_cleanup_journal_tail() from being called between __jbd2_journal_drop_transaction() and jbd2_journal_abort() (a possible race issue between jbd2_log_do_checkpoint()s called by jbd2_journal_flush() and __jbd2_log_wait_for_space()) Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>