--- /dev/null
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
+ "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
+ <!ENTITY rapidio SYSTEM "rapidio.xml">
+ ]>
+
+<book id="RapidIO-Guide">
+ <bookinfo>
+ <title>RapidIO Subsystem Guide</title>
+
+ <authorgroup>
+ <author>
+ <firstname>Matt</firstname>
+ <surname>Porter</surname>
+ <affiliation>
+ <address>
+ <email>mporter@kernel.crashing.org</email>
+ <email>mporter@mvista.com</email>
+ </address>
+ </affiliation>
+ </author>
+ </authorgroup>
+
+ <copyright>
+ <year>2005</year>
+ <holder>MontaVista Software, Inc.</holder>
+ </copyright>
+
+ <legalnotice>
+ <para>
+ This documentation is free software; you can redistribute
+ it and/or modify it under the terms of the GNU General Public
+ License version 2 as published by the Free Software Foundation.
+ </para>
+
+ <para>
+ This program is distributed in the hope that it will be
+ useful, but WITHOUT ANY WARRANTY; without even the implied
+ warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ See the GNU General Public License for more details.
+ </para>
+
+ <para>
+ You should have received a copy of the GNU General Public
+ License along with this program; if not, write to the Free
+ Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
+ MA 02111-1307 USA
+ </para>
+
+ <para>
+ For more details see the file COPYING in the source
+ distribution of Linux.
+ </para>
+ </legalnotice>
+ </bookinfo>
+
+<toc></toc>
+
+ <chapter id="intro">
+ <title>Introduction</title>
+ <para>
+ RapidIO is a high speed switched fabric interconnect with
+ features aimed at the embedded market. RapidIO provides
+ support for memory-mapped I/O as well as message-based
+ transactions over the switched fabric network. RapidIO has
+ a standardized discovery mechanism not unlike the PCI bus
+ standard that allows simple detection of devices in a
+ network.
+ </para>
+ <para>
+	This documentation is provided for developers intending
+	to support RapidIO on new architectures, to write new drivers,
+	or to understand the subsystem internals.
+ </para>
+ </chapter>
+
+ <chapter id="bugs">
+ <title>Known Bugs and Limitations</title>
+
+ <sect1>
+ <title>Bugs</title>
+ <para>None. ;)</para>
+ </sect1>
+ <sect1>
+ <title>Limitations</title>
+ <para>
+ <orderedlist>
+ <listitem><para>Access/management of RapidIO memory regions is not supported</para></listitem>
+ <listitem><para>Multiple host enumeration is not supported</para></listitem>
+ </orderedlist>
+ </para>
+ </sect1>
+ </chapter>
+
+ <chapter id="drivers">
+ <title>RapidIO driver interface</title>
+ <para>
+	Drivers are provided with a set of calls to interface
+	with the subsystem in order to gather information on
+	devices, request/map memory region resources,
+	and manage mailboxes/doorbells.
+ </para>
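+	<para>
+	As a minimal, hypothetical sketch (the device IDs and the
+	example_* names below are placeholders, and the probe/remove
+	signatures should be checked against include/linux/rio_drv.h),
+	a RapidIO device driver declares an ID table and a rio_driver
+	structure and registers it with rio_register_driver():
+	</para>
+	<programlisting>
+static struct rio_device_id example_id_table[] = {
+	{ .did = 0x1234, .vid = 0x5678 },	/* hypothetical IDs */
+	{ 0, }					/* terminating entry */
+};
+
+static int example_probe(struct rio_dev *rdev, const struct rio_device_id *id)
+{
+	/* claim device resources, set up mailboxes/doorbells, etc. */
+	return 0;
+}
+
+static void example_remove(struct rio_dev *rdev)
+{
+	/* release resources acquired in example_probe() */
+}
+
+static struct rio_driver example_driver = {
+	.name     = "example_rio",
+	.id_table = example_id_table,
+	.probe    = example_probe,
+	.remove   = example_remove,
+};
+
+/* typically called from the driver's module init routine */
+static int __init example_init(void)
+{
+	return rio_register_driver(&amp;example_driver);
+}
+	</programlisting>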
+ <sect1>
+ <title>Functions</title>
+!Iinclude/linux/rio_drv.h
+!Edrivers/rapidio/rio-driver.c
+!Edrivers/rapidio/rio.c
+ </sect1>
+ </chapter>
+
+ <chapter id="internals">
+ <title>Internals</title>
+
+ <para>
+ This chapter contains the autogenerated documentation of the RapidIO
+ subsystem.
+ </para>
+
+ <sect1><title>Structures</title>
+!Iinclude/linux/rio.h
+ </sect1>
+ <sect1><title>Enumeration and Discovery</title>
+!Idrivers/rapidio/rio-scan.c
+ </sect1>
+ <sect1><title>Driver functionality</title>
+!Idrivers/rapidio/rio.c
+!Idrivers/rapidio/rio-access.c
+ </sect1>
+ <sect1><title>Device model support</title>
+!Idrivers/rapidio/rio-driver.c
+ </sect1>
+ <sect1><title>Sysfs support</title>
+!Idrivers/rapidio/rio-sysfs.c
+ </sect1>
+ <sect1><title>PPC32 support</title>
+!Iarch/ppc/kernel/rio.c
+!Earch/ppc/syslib/ppc85xx_rio.c
+!Iarch/ppc/syslib/ppc85xx_rio.c
+ </sect1>
+ </chapter>
+
+ <chapter id="credits">
+ <title>Credits</title>
+ <para>
+ The following people have contributed to the RapidIO
+ subsystem directly or indirectly:
+ <orderedlist>
+ <listitem><para>Matt Porter<email>mporter@kernel.crashing.org</email></para></listitem>
+ <listitem><para>Randy Vinson<email>rvinson@mvista.com</email></para></listitem>
+ <listitem><para>Dan Malek<email>dan@embeddedalley.com</email></para></listitem>
+ </orderedlist>
+ </para>
+ <para>
+ The following people have contributed to this document:
+ <orderedlist>
+ <listitem><para>Matt Porter<email>mporter@kernel.crashing.org</email></para></listitem>
+ </orderedlist>
+ </para>
+ </chapter>
+</book>
This guide describes the basics of Message Signaled Interrupts (MSI),
the advantages of using MSI over traditional interrupt mechanisms,
and how to enable your driver to use MSI or MSI-X. Also included is
-a Frequently Asked Questions.
+a Frequently Asked Questions (FAQ) section.
+
+1.1 Terminology
+
+PCI devices can be single-function or multi-function. In either case,
+when this text talks about enabling or disabling MSI on a "device
+function," it is referring to one specific PCI device and function and
+not to all functions on a PCI device (unless the PCI device has only
+one function).
2. Copyright 2003 Intel Corporation
3. What is MSI/MSI-X?
Message Signaled Interrupt (MSI), as described in the PCI Local Bus
-Specification Revision 2.3 or latest, is an optional feature, and a
+Specification Revision 2.3 or later, is an optional feature, and a
required feature for PCI Express devices. MSI enables a device function
to request service by sending an Inbound Memory Write on its PCI bus to
the FSB as a Message Signal Interrupt transaction. Because MSI is
A PCI device that supports MSI must also support pin IRQ assertion
interrupt mechanism to provide backward compatibility for systems that
-do not support MSI. In Systems, which support MSI, the bus driver is
+do not support MSI. In systems which support MSI, the bus driver is
responsible for initializing the message address and message data of
the device function's MSI/MSI-X capability structure during device
initial configuration.
- MSI and MSI-X both support per-vector masking. Per-vector
masking is an optional extension of MSI but a required
- feature for MSI-X. Per-vector masking provides the kernel
- the ability to mask/unmask MSI when servicing its software
- interrupt service routing handler. If per-vector masking is
+ feature for MSI-X. Per-vector masking provides the kernel the
+ ability to mask/unmask a single MSI while running its
+ interrupt service routine. If per-vector masking is
not supported, then the device driver should provide the
hardware/software synchronization to ensure that the device
generates MSI when the driver wants it to do so.
4. Why use MSI?
-As a benefit the simplification of board design, MSI allows board
-designers to remove out of band interrupt routing. MSI is another
+As a benefit to the simplification of board design, MSI allows board
+designers to remove out-of-band interrupt routing. MSI is another
step towards a legacy-free environment.
Due to increasing pressure on chipset and processor packages to
support for better interrupt performance.
Using MSI enables the device functions to support two or more
-vectors, which can be configured to target different CPU's to
+vectors, which can be configured to target different CPUs to
increase scalability.
5. Configuring a driver to use MSI/MSI-X
int pci_enable_msi(struct pci_dev *dev)
-With this new API, any existing device driver, which like to have
-MSI enabled on its device function, must call this API to enable MSI
+A device driver that wants MSI enabled on its device function
+must call this API to enable MSI.
A successful call will initialize the MSI capability structure
with ONE vector, regardless of whether a device function is
capable of supporting multiple messages. This vector replaces the
-pre-assigned dev->irq with a new MSI vector. To avoid the conflict
-of new assigned vector with existing pre-assigned vector requires
-a device driver to call this API before calling request_irq().
+pre-assigned dev->irq with a new MSI vector. To avoid a conflict
+between the newly assigned vector and the existing pre-assigned
+vector, a device driver must call this API before calling request_irq().
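+
+A hypothetical probe routine illustrating this ordering might look
+like the following (foo_probe() and foo_handler() are placeholder
+names, not part of any real driver):
+
+static int foo_probe(struct pci_dev *dev, const struct pci_device_id *id)
+{
+	int msi_enabled, err;
+
+	/* Enable MSI before request_irq() so that dev->irq already
+	   refers to the new MSI vector.  If this fails, the device
+	   simply stays in INTx/pin-irq assertion mode. */
+	msi_enabled = (pci_enable_msi(dev) == 0);
+
+	err = request_irq(dev->irq, foo_handler, SA_SHIRQ, "foo", dev);
+	if (err) {
+		if (msi_enabled)
+			pci_disable_msi(dev);
+		return err;
+	}
+	return 0;
+}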
5.2.2 API pci_disable_msi
the pre-assigned IOAPIC vector and switches a device's interrupt
mode to PCI pin-irq assertion/INTx emulation mode.
-Note that a device driver should always call free_irq() on MSI vector
-it has done request_irq() on before calling this API. Failure to do
-so results a BUG_ON() and a device will be left with MSI enabled and
+Note that a device driver should always call free_irq() on the MSI vector
+that it has done request_irq() on before calling this API. Failure to do
+so results in a BUG_ON() and a device will be left with MSI enabled and
leaks its vector.
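+
+Continuing the hypothetical sketch above, the corresponding removal
+path releases the vector before disabling MSI:
+
+static void foo_remove(struct pci_dev *dev)
+{
+	free_irq(dev->irq, dev);	/* must precede pci_disable_msi() */
+	pci_disable_msi(dev);		/* dev->irq reverts to the
+					   pre-assigned IOAPIC vector */
+}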
5.2.3 MSI mode vs. legacy mode diagram
-The below diagram shows the events, which switches the interrupt
+The diagram below shows the events that switch the interrupt
mode on the MSI-capable device function between MSI mode and
PIN-IRQ assertion mode.
------------ pci_disable_msi ------------------------
-Figure 1.0 MSI Mode vs. Legacy Mode
+Figure 1. MSI Mode vs. Legacy Mode
-In Figure 1.0, a device operates by default in legacy mode. Legacy
+In Figure 1, a device operates by default in legacy mode. Legacy
in this context means PCI pin-irq assertion or PCI-Express INTx
emulation. A successful MSI request (using pci_enable_msi()) switches
a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector
To return back to its default mode, a device driver should always call
pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a
-device driver should always call free_irq() on MSI vector it has done
-request_irq() on before calling pci_disable_msi(). Failure to do so
-results a BUG_ON() and a device will be left with MSI enabled and
+device driver should always call free_irq() on the MSI vector it has
+done request_irq() on before calling pci_disable_msi(). Failure to do
+so results in a BUG_ON() and a device will be left with MSI enabled and
leaks its vector. Otherwise, the PCI subsystem restores a device's
-dev->irq with a pre-assigned IOAPIC vector and marks released
+dev->irq with a pre-assigned IOAPIC vector and marks the released
MSI vector as unused.
Once being marked as unused, there is no guarantee that the PCI
the availability of current PCI vector resources and the number of
MSI/MSI-X requests from other drivers, this MSI may be re-assigned.
-For the case where the PCI subsystem re-assigned this MSI vector
-another driver, a request to switching back to MSI mode may result
+For the case where the PCI subsystem re-assigns this MSI vector to
+another driver, a request to switch back to MSI mode may result
in being assigned a different MSI vector or a failure if no more
vectors are available.
does not replace the pre-assigned IOAPIC dev->irq with a new MSI
vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
into the field vector of each element contained in a second argument.
-Note that the pre-assigned IO-APIC dev->irq is valid only if the device
-operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt of
-using dev->irq by the device driver to request for interrupt service
-may result unpredictabe behavior.
+Note that the pre-assigned IOAPIC dev->irq is valid only if the device
+operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt at
+using dev->irq by the device driver to request interrupt service
+may result in unpredictable behavior.
-For each MSI-X vector granted, a device driver is responsible to call
+For each MSI-X vector granted, a device driver is responsible for calling
other functions like request_irq(), enable_irq(), etc. to enable
this vector with its corresponding interrupt service handler. It is
a device driver's choice to assign all vectors with the same
The PCI 3.0 specification has implementation notes that MMIO address
space for a device's MSI-X structure should be isolated so that the
-software system can set different page for controlling accesses to
-the MSI-X structure. The implementation of MSI patch requires the PCI
+software system can set different pages for controlling accesses to the
+MSI-X structure. The implementation of MSI support requires the PCI
subsystem, not a device driver, to maintain full control of the MSI-X
-table/MSI-X PBA and MMIO address space of the MSI-X table/MSI-X PBA.
-A device driver is prohibited from requesting the MMIO address space
-of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem will fail
-enabling MSI-X on its hardware device when it calls the function
+table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X
+table/MSI-X PBA. A device driver is prohibited from requesting the MMIO
+address space of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem
+will fail enabling MSI-X on its hardware device when it calls the function
pci_enable_msix().
5.3.2 Handling MSI-X allocation
than requested, the function pci_enable_msix() will return the
maximum number of MSI-X vectors available to the caller. A device
driver may re-send its request with fewer or equal vectors indicated
-in a return. For example, if a device driver requests 5 vectors, but
-the number of available vectors is 3 vectors, a value of 3 will be a
-return as a result of pci_enable_msix() call. A function could be
+in the return. For example, if a device driver requests 5 vectors, but
+only 3 vectors are available, a value of 3 will be returned as a
+result of the pci_enable_msix() call. A function could be
designed for its driver to use only 3 MSI-X table entries as
different combinations as ABC--, A-B-C, A--CB, etc. Note that this
-patch does not support multiple entries with the same vector. Such
+implementation does not support multiple entries with the same vector. Such
pci_enable_msix(). Below are the reasons why supporting multiple
entries with the same vector is an undesirable solution.
- - The PCI subsystem can not determine which entry, which
- generated the message, to mask/unmask MSI while handling
+ - The PCI subsystem cannot determine the entry that
+ generated the message to mask/unmask MSI while handling
software driver ISR. Attempting to walk through all MSI-X
table entries (2048 max) to mask/unmask any match vector
is an undesirable solution.
- - Walk through all MSI-X table entries (2048 max) to handle
+ - Walking through all MSI-X table entries (2048 max) to handle
SMP affinity of any match vector is an undesirable solution.
5.3.4 API pci_enable_msix
-int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec)
+int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
This API enables a device driver to request the PCI subsystem
-for enabling MSI-X messages on its hardware device. Depending on
+to enable MSI-X messages on its hardware device. Depending on
the availability of PCI vectors resources, the PCI subsystem enables
-either all or nothing.
+either all or none of the requested vectors.
-Argument dev points to the device (pci_dev) structure.
+Argument 'dev' points to the device (pci_dev) structure.
-Argument entries is a pointer of unsigned integer type. The number of
-elements is indicated in argument nvec. The content of each element
-will be mapped to the following struct defined in /driver/pci/msi.h.
+Argument 'entries' is a pointer to an array of msix_entry structs.
+The number of entries is indicated in argument 'nvec'.
+struct msix_entry is defined in drivers/pci/msi.h:
struct msix_entry {
u16 vector; /* kernel uses to write alloc vector */
u16 entry; /* driver uses to specify entry */
};
-A device driver is responsible for initializing the field entry of
-each element with unique entry supported by MSI-X table. Otherwise,
+A device driver is responsible for initializing the field 'entry' of
+each element with a unique entry supported by the MSI-X table. Otherwise,
-EINVAL will be returned as a result. A successful return of zero
-indicates the PCI subsystem completes initializing each of requested
+indicates the PCI subsystem completed initializing each of the requested
entries of the MSI-X table with message address and message data.
Last but not least, the PCI subsystem will write the 1:1
-vector-to-entry mapping into the field vector of each element. A
-device driver is responsible of keeping track of allocated MSI-X
+vector-to-entry mapping into the field 'vector' of each element. A
+device driver is responsible for keeping track of allocated MSI-X
vectors in its internal data structure.
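+
+As an illustration only (FOO_NR_VECTORS, foo_msix_entries and
+foo_setup_msix() are hypothetical names), a driver might fill in the
+entries and request its vectors as follows:
+
+#define FOO_NR_VECTORS	4		/* hypothetical */
+
+static struct msix_entry foo_msix_entries[FOO_NR_VECTORS];
+
+static int foo_setup_msix(struct pci_dev *dev)
+{
+	int i, err;
+
+	for (i = 0; i < FOO_NR_VECTORS; i++)
+		foo_msix_entries[i].entry = i;	/* unique table entries */
+
+	err = pci_enable_msix(dev, foo_msix_entries, FOO_NR_VECTORS);
+	if (err < 0)
+		return err;	/* failure, e.g. duplicate entries */
+	if (err > 0)
+		return -ENOSPC;	/* only 'err' vectors are available; the
+				   driver could instead retry with fewer */
+
+	/* On success, foo_msix_entries[i].vector holds the vector to
+	   pass to request_irq() for MSI-X table entry i. */
+	return 0;
+}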
-Argument nvec is an integer indicating the number of messages
-requested.
-
-A return of zero indicates that the number of MSI-X vectors is
+A return of zero indicates that the number of MSI-X vectors was
successfully allocated. A return of greater than zero indicates
MSI-X vector shortage. Or a return of less than zero indicates
a failure. This failure may be a result of duplicate entries
This API should always be used to undo the effect of pci_enable_msix()
when a device driver is unloading. Note that a device driver should
always call free_irq() on all MSI-X vectors it has done request_irq()
-on before calling this API. Failure to do so results a BUG_ON() and
+on before calling this API. Failure to do so results in a BUG_ON() and
a device will be left with MSI-X enabled and leaks its vectors.
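+
+Continuing the hypothetical MSI-X sketch above, the teardown order is:
+
+static void foo_teardown_msix(struct pci_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < FOO_NR_VECTORS; i++)
+		free_irq(foo_msix_entries[i].vector, dev);
+	pci_disable_msix(dev);
+}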
5.3.6 MSI-X mode vs. legacy mode diagram
-The below diagram shows the events, which switches the interrupt
+The diagram below shows the events that switch the interrupt
mode on the MSI-X capable device function between MSI-X mode and
PIN-IRQ assertion mode (legacy).
| | ===============> | |
------------ pci_disable_msix ------------------------
-Figure 2.0 MSI-X Mode vs. Legacy Mode
+Figure 2. MSI-X Mode vs. Legacy Mode
-In Figure 2.0, a device operates by default in legacy mode. A
+In Figure 2, a device operates by default in legacy mode. A
successful MSI-X request (using pci_enable_msix()) switches a
device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector
stored in dev->irq will be saved by the PCI subsystem; however,
unlike MSI mode, the PCI subsystem will not replace dev->irq with
assigned MSI-X vector because the PCI subsystem already writes the 1:1
-vector-to-entry mapping into the field vector of each element
+vector-to-entry mapping into the field 'vector' of each element
specified in second argument.
To return back to its default mode, a device driver should always call
pci_disable_msix() to undo the effect of pci_enable_msix(). Note that
a device driver should always call free_irq() on all MSI-X vectors it
has done request_irq() on before calling pci_disable_msix(). Failure
-to do so results a BUG_ON() and a device will be left with MSI-X
+to do so results in a BUG_ON() and a device will be left with MSI-X
enabled and leaks its vectors. Otherwise, the PCI subsystem switches a
device function's interrupt mode from MSI-X mode to legacy mode and
marks all allocated MSI-X vectors as unused.
re-assigned.
For the case where the PCI subsystem re-assigned these MSI-X vectors
-to other driver, a request to switching back to MSI-X mode may result
+to other drivers, a request to switch back to MSI-X mode may result in
being assigned with another set of MSI-X vectors or a failure if no
more vectors are available.
-5.4 Handling function implementng both MSI and MSI-X capabilities
+5.4 Handling function implementing both MSI and MSI-X capabilities
For the case where a function implements both MSI and MSI-X
capabilities, the PCI subsystem enables a device to run either in MSI
mode or MSI-X mode but not both. A device driver determines whether it
wants MSI or MSI-X enabled on its hardware device. Once a device
-driver requests for MSI, for example, it is prohibited to request for
+driver requests for MSI, for example, it is prohibited from requesting
MSI-X; in other words, a device driver is not permitted to ping-pong
-between MSI mod MSI-X mode during a run-time.
+between MSI and MSI-X modes at run-time.
5.5 Hardware requirements for MSI/MSI-X support
+
MSI/MSI-X support requires support from both system hardware and
individual hardware device functions.
5.5.1 System hardware support
+
Since the target of MSI address is the local APIC CPU, enabling
-MSI/MSI-X support in Linux kernel is dependent on whether existing
-system hardware supports local APIC. Users should verify their
-system whether it runs when CONFIG_X86_LOCAL_APIC=y.
+MSI/MSI-X support in the Linux kernel is dependent on whether existing
+system hardware supports local APIC. Users should verify that their
+system supports local APIC operation by testing that it runs when
+CONFIG_X86_LOCAL_APIC=y.
In SMP environment, CONFIG_X86_LOCAL_APIC is automatically set;
however, in UP environment, users must manually set
CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
-CONFIG_PCI_MSI enables the VECTOR based scheme and
-the option for MSI-capable device drivers to selectively enable
-MSI/MSI-X.
+CONFIG_PCI_MSI enables the VECTOR based scheme and the option for
+MSI-capable device drivers to selectively enable MSI/MSI-X.
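+
+A kernel configured for MSI/MSI-X support would therefore contain,
+for example:
+
+	CONFIG_X86_LOCAL_APIC=y
+	CONFIG_PCI_MSI=y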
Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X
vector is allocated new during runtime and MSI/MSI-X support does not
depend on BIOS support. This key independency enables MSI/MSI-X
-support on future IOxAPIC free platform.
+support on future IOxAPIC free platforms.
5.5.2 Device hardware support
+
The hardware device function supports MSI by indicating the
MSI/MSI-X capability structure on its PCI capability list. By
default, this capability structure will not be initialized by
the kernel to enable MSI during the system boot. In other words,
the device function is running on its default pin assertion mode.
-Note that in many cases the hardware supporting MSI have bugs,
-which may result in system hang. The software driver of specific
-MSI-capable hardware is responsible for whether calling
+Note that in many cases the hardware supporting MSI has bugs,
+which may result in system hangs. The software driver of specific
+MSI-capable hardware is responsible for deciding whether to call
pci_enable_msi or not. A return of zero indicates the kernel
-successfully initializes the MSI/MSI-X capability structure of the
+successfully initialized the MSI/MSI-X capability structure of the
device function. The device function is now running on MSI/MSI-X mode.
5.6 How to tell whether MSI/MSI-X is enabled on device function
its device function is initialized successfully and ready to run in
MSI/MSI-X mode.
-At the user level, users can use command 'cat /proc/interrupts'
-to display the vector allocated for a device and its interrupt
-MSI/MSI-X mode ("PCI MSI"/"PCI MSIX"). Below shows below MSI mode is
-enabled on a SCSI Adaptec 39320D Ultra320.
+At the user level, users can use the command 'cat /proc/interrupts'
+to display the vectors allocated for devices and their interrupt
+MSI/MSI-X modes ("PCI-MSI"/"PCI-MSI-X"). The example below shows MSI
+mode enabled on a SCSI Adaptec 39320D Ultra320 controller.
CPU0 CPU1
0: 324639 0 IO-APIC-edge timer
15: 1 0 IO-APIC-edge ide1
169: 0 0 IO-APIC-level uhci-hcd
185: 0 0 IO-APIC-level uhci-hcd
-193: 138 10 PCI MSI aic79xx
-201: 30 0 PCI MSI aic79xx
+193: 138 10 PCI-MSI aic79xx
+201: 30 0 PCI-MSI aic79xx
225: 30 0 IO-APIC-level aic7xxx
233: 30 0 IO-APIC-level aic7xxx
NMI: 0 0
-specification 2.3 or latest, then it should work.
+specification 2.3 or later, then it should work.
Q4. From the driver point of view, if the MSI is lost because
-of the errors occur during inbound memory write, then it may
-wait for ever. Is there a mechanism for it to recover?
+of errors occurring during inbound memory write, then it may
+wait forever. Is there a mechanism for it to recover?
A4. Since the target of the transaction is an inbound memory
write, all transaction termination conditions (Retry,
--- /dev/null
+RCU-based dcache locking model
+==============================
+
+On many workloads, the most common operation on dcache is to look up a
+dentry, given a parent dentry and the name of the child. Typically,
+for every open(), stat() etc., the dentry corresponding to the
+pathname will be looked up by walking the tree starting with the first
+component of the pathname and using that dentry along with the next
+component to look up the next level and so on. Since it is a frequent
+operation for workloads like multiuser environments and web servers,
+it is important to optimize this path.
+
+Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in
+every component during path look-up. Since 2.5.10 onwards, the
+fast-walk algorithm changed this by holding the dcache_lock at the
+beginning and walking as many cached path component dentries as
+possible. This significantly decreases the number of acquisitions of
+dcache_lock. However, it also increases the lock hold time
+significantly and affects performance on large SMP machines. Since
+the 2.5.62 kernel, dcache has been using a new locking model that
+uses RCU to make dcache look-up lock-free.
+
+The current dcache locking model is not very different from the
+earlier one. Prior to the 2.5.62 kernel, dcache_lock protected the
+hash chain, the d_child, d_alias and d_lru lists, as well as d_inode
+and several other things like mount look-up. The RCU-based changes
+affect only the way the hash chain is protected. For everything else
+the dcache_lock must be taken for both traversing and updating. The
+hash chain updates also take the dcache_lock. The significant change
+is the way d_lookup traverses the hash chain: it doesn't acquire the
+dcache_lock for this and relies on RCU to ensure that the dentry has
+not been *freed*.
+
+
+Dcache locking details
+======================
+
+For many multi-user workloads, open() and stat() on files are very
+frequently occurring operations. Both involve walking of path names
+to find the dentry corresponding to the concerned file. In the 2.4
+kernel, dcache_lock was held during look-up of each path component.
+Contention and cache-line bouncing of this global lock caused
+significant scalability problems. With the introduction of RCU into
+the Linux kernel, this was worked around by making the look-up of
+path components during path walking lock-free.
+
+
+Safe lock-free look-up of dcache hash table
+===========================================
+
+Dcache is a complex data structure with the hash table entries also
+linked together in other lists. In the 2.4 kernel, dcache_lock
+protected all the lists. RCU is applied only to the hash chain
+walking; the rest of the lists are still protected by dcache_lock.
+Some of the important changes are:
+
+1. The deletion from the hash chain is done using the hlist_del_rcu()
+   macro, which doesn't initialize the next pointer of the deleted
+   dentry; this allows us to walk the chain safely, lock-free, while
+   a deletion is happening.
+
+2. Insertion of a dentry into the hash table is done using
+   hlist_add_head_rcu(), which takes care of ordering the writes -
+   the writes to the dentry must be visible before the dentry is
+   inserted. This works in conjunction with hlist_for_each_rcu()
+   while walking the hash chain. The only requirement is that all
+   initialization of the dentry must be done before
+   hlist_add_head_rcu() since we don't have dcache_lock protection
+   while traversing the hash chain. This isn't different from the
+   existing code.
+
+3. A dentry looked up without holding dcache_lock cannot be returned
+   for walking if it is unhashed. It then may have a NULL d_inode or
+   other bogosity since RCU doesn't protect the other fields in the
+   dentry. We therefore use the flag DCACHE_UNHASHED to indicate
+   unhashed dentries and use this in conjunction with a per-dentry
+   lock (d_lock). Once looked up without the dcache_lock, we acquire
+   the per-dentry lock (d_lock) and check if the dentry is unhashed.
+   If so, the look-up fails. If not, the reference count of the
+   dentry is increased and the dentry is returned (see the sketch
+   after this list).
+
+4. Once a dentry is looked up, it must be ensured that it doesn't go
+   away during the path walk for that component. In pre-2.5.10 code,
+   this was done by holding a reference to the dentry. dcache_rcu
+   does the same. In some sense, dcache_rcu path walking looks like
+   the pre-2.5.10 version.
+
+5. All dentry hash chain updates must take the dcache_lock as well as
+   the per-dentry lock in that order. dput() does this to ensure that
+   a dentry that has just been looked up on another CPU doesn't get
+   deleted before dget() can be done on it.
+
+6. There are several ways to do reference counting of RCU protected
+   objects. One such example is in the ipv4 route cache where
+   deferred freeing (using call_rcu()) is done as soon as the
+   reference count goes to zero. This cannot be done in the case of
+   dentries because tearing down of dentries requires blocking
+   (dentry_iput()), which isn't supported from RCU callbacks.
+   Instead, tearing down of dentries happens synchronously in dput(),
+   but actual freeing happens later when the RCU grace period is
+   over. This allows safe lock-free walking of the hash chains, but a
+   matched dentry may have been partially torn down. The checking of
+   the DCACHE_UNHASHED flag with d_lock held detects such dentries
+   and prevents them from being returned from look-up.
+
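+As a rough sketch (not the exact kernel code) of how points 3 and 6
+above fit together during a lock-free hash chain walk, where hash,
+parent, name and head are the usual look-up inputs and
+foo_name_matches() stands in for the length/memcmp checks done under
+d_lock:
+
+	struct hlist_node *node;
+	struct dentry *found = NULL;
+
+	rcu_read_lock();
+	hlist_for_each_rcu(node, head) {
+		struct dentry *dentry =
+			hlist_entry(node, struct dentry, d_hash);
+
+		if (dentry->d_name.hash != hash)
+			continue;
+		if (dentry->d_parent != parent)
+			continue;
+
+		spin_lock(&dentry->d_lock);
+		/* Recheck under d_lock: an unhashed (possibly partially
+		   torn down or moved) dentry must not be returned. */
+		if (!d_unhashed(dentry) && foo_name_matches(dentry, name)) {
+			atomic_inc(&dentry->d_count);
+			found = dentry;
+		}
+		spin_unlock(&dentry->d_lock);
+		if (found)
+			break;
+	}
+	rcu_read_unlock();
+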
+
+Maintaining POSIX rename semantics
+==================================
+
+Since look-up of dentries is lock-free, it can race against a
+concurrent rename operation. For example, during rename of file A to
+B, look-up of either A or B must succeed. So, if look-up of B happens
+after A has been removed from the hash chain but not added to the new
+hash chain, it may fail. Also, a comparison while the name is being
+written concurrently by a rename may result in false positive matches
+violating rename semantics. Issues related to races with rename are
+handled as described below:
+
+1. Look-up can be done in two ways - d_lookup() which is safe from
+ simultaneous renames and __d_lookup() which is not. If
+ __d_lookup() fails, it must be followed up by a d_lookup() to
+ correctly determine whether a dentry is in the hash table or
+ not. d_lookup() protects look-ups using a sequence lock
+ (rename_lock).
+
+2. The name associated with a dentry (d_name) may be changed if a
+   rename is allowed to happen simultaneously. To avoid the memcmp()
+   in __d_lookup() going out of bounds due to a rename and a false
+   positive comparison, the name comparison is done while holding the
+   per-dentry lock. This prevents concurrent renames during this
+   operation.
+
+3. The hash table walk during look-up may move to a different bucket
+   when the current dentry is moved to a different bucket due to a
+   rename. But we use hlists in the dcache hash table, and they are
+   null-terminated. So, even if a dentry moves to a different bucket,
+   the hash chain walk will terminate. [With a list_head list, it may
+   not, since termination is when the list_head in the original bucket
+   is reached.]