                        RAMSTER HOW-TO

Author: Dan Magenheimer
Ramster maintainer: Konrad Wilk <konrad.wilk@oracle.com>

This is a HOWTO document for ramster which, as of this writing, is in
the kernel as a subdirectory of zcache in drivers/staging, called ramster.
(Zcache can be built with or without ramster functionality.)  If enabled
and properly configured, ramster allows memory capacity load balancing
across multiple machines in a cluster.  Further, the ramster code serves
as an example of asynchronous access for zcache (as well as cleancache and
frontswap) that may prove useful for future transcendent memory
implementations, such as KVM and NVRAM.  While ramster works today on
any network connection that supports kernel sockets, its features may
become more interesting on future high-speed fabrics/interconnects.

Ramster requires both kernel and userland support.  The userland support,
called ramster-tools, is known to work with EL6-based distros, but is a
set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
includes an init file, a config file, and a userland binary that interfaces
to the kernel.  This state of userland support reflects the abysmal userland
skills of this suitably-embarrassed author; any help/patches to turn
ramster-tools into more distributable rpms/debs useful for a wider range
of distros would be appreciated.  The source RPM that can be used as a
starting point is available at:
    http://oss.oracle.com/projects/tmem/files/RAMster/

As a result of this author's ignorance, the userland setup described in
this HOWTO assumes an EL6 distro and is described in EL6 syntax.
Apologies if this offends anyone!

Kernel support has only been tested on x86_64.  Systems with an active
ocfs2 filesystem should work, but since ramster leverages a lot of
code from ocfs2, there may be latent issues.  A kernel configuration that
includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
if no ocfs2 filesystem is mounted.

This HOWTO demonstrates memory capacity load balancing for a two-node
cluster, where one node, called the "local" node, becomes overcommitted
and the other node, called the "remote" node, provides additional RAM
capacity for use by the local node.  Ramster is capable of more complex
topologies; see the last section, "ADVANCED RAMSTER TOPOLOGIES".

If you find any terms in this HOWTO unfamiliar or don't understand the
motivation for ramster, the following LWN reading is recommended:
-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
-- The future calculus of memory management (lwn.net/Articles/475681)
And since ramster is built on top of zcache, this article may be helpful:
-- In-kernel memory compression (lwn.net/Articles/545244)

Now that you've memorized the contents of those articles, let's get started!

A. PRELIMINARY

1) Install two x86_64 Linux systems that are known to work when
   upgraded to a recent upstream Linux kernel version.

On each system:

2) Configure, build and install, then boot Linux, just to ensure it
   can be done with an unmodified upstream kernel.  Confirm you booted
   the upstream kernel with "uname -a".

3) Unless you plan to test only swapping, or if you plan to do any
   performance testing, the "WasActive" patch is also highly recommended.
   (Search lkml.org for WasActive, apply the patch, rebuild your kernel.)
   For a demo or simple testing, the patch can be ignored.

4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
   can be found at:
    http://oss.oracle.com/projects/tmem/files/RAMster/
   (Sorry, but for now, non-EL6 users must recreate ramster-tools on
   their own from source.  See above.)

5) Ensure that debugfs is mounted at each boot.  Examples below assume it
   is mounted at /sys/kernel/debug.
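If you like, the per-boot checks from steps 2 and 5 can be combined into
a tiny script run on each node.  The following is merely an illustrative
sketch (the script itself is this HOWTO's invention, not part of
ramster-tools):

    #!/bin/sh
    # sanity-check.sh: hypothetical preliminary checks for each node
    uname -a                      # confirm the upstream kernel booted
    # ensure debugfs is mounted where the examples below expect it
    mount | grep -q debugfs || \
        mount -t debugfs none /sys/kernel/debug
    # confirm ramster-tools is installed (EL6/rpm-based systems)
    rpm -q ramster-tools || echo "ramster-tools not installed"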
B. BUILDING RAMSTER INTO THE KERNEL

Do the following on each system:

1) Using the kernel configuration mechanism of your choice, change
   your config to include:

    CONFIG_CLEANCACHE=y
    CONFIG_FRONTSWAP=y
    CONFIG_STAGING=y
    CONFIG_CONFIGFS_FS=y    # NOTE: MUST BE y, not m
    CONFIG_ZCACHE=y
    CONFIG_RAMSTER=y

   For a linux-3.10 or later kernel, you should also set:

    CONFIG_ZCACHE_DEBUG=y
    CONFIG_RAMSTER_DEBUG=y

   Before building the kernel, please double-check your kernel config
   file to ensure all of the settings are correct.

2) Build this kernel and change your boot file (e.g. /etc/grub.conf)
   so that the new kernel will boot.

3) Add "zcache" and "ramster" as kernel boot parameters for the new
   kernel.  (An example grub.conf stanza is shown at the end of this
   section.)

4) Reboot each system approximately simultaneously.

5) Check dmesg to ensure there are some messages from ramster, prefixed
   by "ramster:"

    # dmesg | grep ramster

   You should also see a lot of files in:

    # ls /sys/kernel/debug/zcache
    # ls /sys/kernel/debug/ramster

   These are mostly counters for various zcache and ramster activities.
   You should also see files in:

    # ls /sys/kernel/mm/ramster

   These are sysfs files that control ramster, as we shall see.

   Ramster now will act as a single-system zcache on each system,
   but it doesn't yet know anything about the cluster, so it can't yet
   do anything remotely.
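For illustration, here is what the relevant stanza of a legacy
/etc/grub.conf might look like after steps 2 and 3.  The kernel version,
root device, and file names below are placeholders for this example, not
anything ramster requires:

    title Linux with zcache/ramster
        root (hd0,0)
        kernel /vmlinuz-3.10.0-ramster ro root=/dev/sda1 zcache ramster
        initrd /initramfs-3.10.0-ramster.img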
C. CONFIGURING THE RAMSTER CLUSTER

This part can be error-prone unless you are familiar with clustering
filesystems.  We need to describe the cluster in an /etc/ramster.conf
file, and the init scripts that parse it are extremely picky about
the syntax.

1) Create an /etc/ramster.conf file and ensure it is identical on both
   systems.  This file mimics the ocfs2 format, and there is a good
   amount of documentation that can be found by searching for
   ocfs2.conf, but you can use:

    cluster:
        name = ramster
        node_count = 2
    node:
        name = system1
        cluster = ramster
        number = 0
        ip_address = my.ip.ad.r1
        ip_port = 7777
    node:
        name = system2
        cluster = ramster
        number = 1
        ip_address = my.ip.ad.r2
        ip_port = 7777

   You must ensure that the "name" field in the file exactly matches
   the output of "hostname" on each system; if "hostname" shows a
   fully-qualified hostname, ensure the name is fully qualified in
   /etc/ramster.conf.  Obviously, substitute my.ip.ad.rx with proper
   ip addresses.

2) Enable the ramster service and configure it.  If you used the
   EL6 ramster-tools, this would be:

    # chkconfig --add ramster
    # service ramster configure

   Set "load on boot" to "y", the cluster to start to "ramster" (or
   whatever name you chose in ramster.conf), the heartbeat dead
   threshold to "500", and the network idle timeout to "1000000".
   Leave the others at their defaults.

3) Reboot both systems.  After reboot, try (assuming EL6 ramster-tools):

    # service ramster status

   You should see "Checking RAMSTER cluster "ramster": Online".  If you
   do not, something is wrong and ramster will not work.  Note that you
   should also see that the driver for "configfs" is loaded and mounted,
   the driver for ocfs2_dlmfs is not loaded, and some numbers for network
   parameters.  You will also see "Checking RAMSTER heartbeat: Not
   active".  That's all OK.

4) Now you need to start the cluster heartbeat; the cluster is not "up"
   until all nodes detect a heartbeat.  In a real cluster, heartbeat
   detection is done via a cluster filesystem, but ramster doesn't
   require one.  Some hack-y kernel code in ramster can start the
   heartbeat for you, though, if you tell it which nodes are "up".  To
   enable the heartbeat, do:

    # echo 0 > /sys/kernel/mm/ramster/manual_node_up
    # echo 1 > /sys/kernel/mm/ramster/manual_node_up

   This must be done on BOTH nodes and, to avoid timeouts, must be done
   approximately concurrently on both nodes.  On an EL6 system, it is
   convenient to put these lines in /etc/rc.local.  To confirm that the
   cluster is now up, on both systems do:

    # dmesg | grep ramster

   You should see ramster "Accepted connection" messages in dmesg on both
   nodes after this.  Note that if you check userland status again with

    # service ramster status

   you will still see "Checking RAMSTER heartbeat: Not active".  That's
   still OK... the ramster kernel heartbeat hack doesn't communicate to
   userland.

5) You now must tell each node the node to which it should "remotify"
   pages.  In this two-node cluster, we will assume the "local" node,
   node 0, has memory overcommitted and will use ramster to utilize RAM
   capacity on the "remote" node, node 1.  To configure this, on node 0,
   you do:

    # echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

   You should see "ramster: node 1 set as remotification target" in dmesg
   on node 0.  Again, on EL6, /etc/rc.local is a good place to put this
   on node 0 so you don't forget to do it at each boot.

6) One more step: By default, the ramster code does not "remotify" any
   pages; this is primarily for testing purposes, but sometimes it is
   useful.  This may change in the future, but for now, on node 0, you do:

    # echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
    # echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable

   The first enables remotifying of swap (persistent, aka frontswap)
   pages, the second enables remotifying of page cache (ephemeral,
   cleancache) pages.

   On EL6, these lines can also be put in /etc/rc.local (AFTER the
   node_up lines), or at the beginning of a script that runs a workload.
   (A consolidated /etc/rc.local sketch appears at the end of this
   section.)

7) Note that most testing has been done with both/all machines booted
   roughly simultaneously to avoid cluster timeouts.  Ideally, you should
   do this too unless you are trying to break ramster rather than just
   use it. ;-)
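Putting steps 4 through 6 together, a suitable /etc/rc.local fragment
for node 0 might look like the following (remember that node 1 must run
the two manual_node_up lines at approximately the same time, and omits
the rest unless it is configured symmetrically):

    # ramster cluster bring-up: mark both nodes "up" (run on BOTH
    # nodes, approximately concurrently)
    echo 0 > /sys/kernel/mm/ramster/manual_node_up
    echo 1 > /sys/kernel/mm/ramster/manual_node_up
    # node 0 only: send overflow pages to node 1
    echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum
    # node 0 only: actually enable remotification of frontswap
    # (persistent) and cleancache (ephemeral) pages
    echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
    echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable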
In some cases, "mem=xxxM" on the - kernel command line restricts memory, but for some values of xxx - the kernel may fail to boot. One may also try creating a fixed - RAMdisk, doing nothing with it, but ensuring that it eats up a fixed - amount of RAM. - -2) To see if ramster is working, on the "remote node", node 1, try: - - # grep . /sys/kernel/debug/ramster/foreign_* - # # note, that is space-dot-space between grep and the pathname - - to monitor the number (and max) ephemeral and persistent pages - that ramster has sent. If these stay at zero, ramster is not working - either because the workload on the local node (node 0) isn't creating - enough memory pressure or because "remotifying" isn't working. On the - local system, node 0, you can watch lots of useful information also. - Try: - - grep . /sys/kernel/debug/zcache/*pageframes* \ - /sys/kernel/debug/zcache/*zbytes* \ - /sys/kernel/debug/zcache/*zpages* \ - /sys/kernel/debug/ramster/*remote* - - Of particular note are the remote_*_pages_succ_get counters. These - show how many disk reads and/or disk writes have been avoided on the - overcommitted local system by storing pages remotely using ramster. - - At the risk of information overload, you can also grep: - - /sys/kernel/debug/cleancache/* and /sys/kernel/debug/frontswap/* - - These show, for example, how many disk reads and/or disk writes have - been avoided by using zcache to optimize RAM on the local system. - - -AUTOMATIC SWAP REPATRIATION - -You may notice that while the systems are idle, the foreign persistent -page count on the remote machine slowly decreases. This is because -ramster implements "frontswap selfshrinking": When possible, swap -pages that have been remotified are slowly repatriated to the local -machine. This is so that local RAM can be used when possible and -so that, in case of remote machine crash, the probability of loss -of data is reduced. - -REBOOTING / POWEROFF - -If a system is shut down while some of its swap pages still reside -on a remote system, the system may lock up during the shutdown -sequence. This will occur if the network is shut down before the -swap mechansim is shut down, which is the default ordering on many -distros. To avoid this annoying problem, simply shut off the swap -subsystem before starting the shutdown sequence, e.g.: - - # swapoff -a - # reboot - -Ideally, this swapoff-before-ifdown ordering should be enforced permanently -using shutdown scripts. - -KNOWN PROBLEMS - -1) You may periodically see messages such as: - - ramster_r2net, message length problem - - This is harmless but indicates that a node is sending messages - containing compressed pages that exceed the maximum for zcache - (PAGE_SIZE*15/16). The sender side needs to be fixed. - -2) If you see a "No longer connected to node..." message or a "No connection - established with node X after N seconds", it is possible you may - be in an unrecoverable state. If you are certain all of the - appropriate cluster configuration steps described above have been - performed, try rebooting the two servers concurrently to see if - the cluster starts. - - Note that "Connection to node... shutdown, state 7" is an intermediate - connection state. As long as you later see "Accepted connection", the - intermediate states are harmless. - -3) There are known issues in counting certain values. As a result - you may see periodic warnings from the kernel. Almost always you - will see "ramster: bad accounting for XXX". There are also "WARN_ONCE" - messages. 
KNOWN PROBLEMS

1) You may periodically see messages such as:

    ramster_r2net, message length problem

   This is harmless, but it indicates that a node is sending messages
   containing compressed pages that exceed the maximum for zcache
   (PAGE_SIZE*15/16).  The sender side needs to be fixed.

2) If you see a "No longer connected to node..." message or a "No
   connection established with node X after N seconds" message, it is
   possible that you are in an unrecoverable state.  If you are certain
   all of the appropriate cluster configuration steps described above
   have been performed, try rebooting the two servers concurrently to
   see if the cluster starts.

   Note that "Connection to node... shutdown, state 7" is an intermediate
   connection state.  As long as you later see "Accepted connection", the
   intermediate states are harmless.

3) There are known issues in counting certain values.  As a result,
   you may see periodic warnings from the kernel.  Almost always you
   will see "ramster: bad accounting for XXX".  There are also
   "WARN_ONCE" messages.  If you see kernel warnings with a tombstone,
   please report them.  They are harmless, but they reflect bugs that
   eventually need to be fixed.

ADVANCED RAMSTER TOPOLOGIES

The kernel code for ramster can support up to eight nodes in a cluster,
but no testing has been done with more than three nodes.

In the example described above, the "remote" node serves as a RAM
overflow for the "local" node.  This can be made symmetric by appropriate
settings of the sysfs remote_target_nodenum file.  For example, by setting:

    # echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 0, and

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, each node can serve as a RAM overflow for the other.

For more than two nodes, a "RAM server" can be configured.  For a
three-node system, set:

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, and

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 2.  Then node 0 is a RAM server for node 1 and node 2.  (A
sketch of the corresponding three-node /etc/ramster.conf appears at the
end of this document.)

In this implementation of ramster, any remote node is potentially a
single point of failure (SPOF).  Though the probability of failure is
reduced by automatic swap repatriation (see above), a proposed future
enhancement to ramster would improve high availability for the cluster
by sending a copy of each page of data to two other nodes.  Patches
welcome!
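For reference, the matching three-node /etc/ramster.conf for that
RAM-server example might look like the following.  This is an untested
extrapolation of the two-node example in section C, with placeholder
hostnames and addresses subject to the same rules (names must match
"hostname" output; substitute real ip addresses):

    cluster:
        name = ramster
        node_count = 3
    node:
        name = system1
        cluster = ramster
        number = 0
        ip_address = my.ip.ad.r1
        ip_port = 7777
    node:
        name = system2
        cluster = ramster
        number = 1
        ip_address = my.ip.ad.r2
        ip_port = 7777
    node:
        name = system3
        cluster = ramster
        number = 2
        ip_address = my.ip.ad.r3
        ip_port = 7777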