5 This write-up is based on three articles published at lwn.net:
7 - <https://lwn.net/Articles/649115/> Pathname lookup in Linux
8 - <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
9 - <https://lwn.net/Articles/650786/> A walk among the symlinks
15 - per-directory parallel name lookup.
16 - ``openat2()`` resolution restriction flags.
27 the early parts of the analysis we will divide off symlinks - leaving
30 will allow us to review "REF-walk" and "RCU-walk" separately. But we
35 --------------------------
37 .. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
43 non-"``/``" characters. These form two kinds of paths. Those that
46 from some other location specified by a file descriptor given to
49 .. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
51 It is tempting to describe the second kind as starting with a
52 component, but that isn't always accurate: a pathname can lack both
62 it must identify a directory that already exists, otherwise an error
68 pathname that is just slashes have a final component. If it does
74 If a pathname ends with a slash, such as "``/tmp/foo/``", it might be
77 particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
81 A pathname that contains at least one non-<slash> character and
84 the trailing <slash> characters names an existing directory or a
85 directory entry that is to be created for a directory immediately
95 While one process is looking up a pathname, another might be making
97 "a/b" were renamed to "a/c/b" while another process were looking up
98 "a/b/..", that process might successfully resolve on "a/c".
99 Most races are much more subtle, and a big part of the task of
105 More than just a cache
106 ----------------------
109 make them quickly available for lookup. Each entry (known as a
110 "dentry") contains three significant fields: a component name, a
111 pointer to a parent dentry, and a pointer to the "inode" which
115 dentry of a directory to the dentries of the children, that linkage is
118 The dcache has a number of uses apart from accelerating lookup. One
129 the VFS to determine if a particular file does or doesn't exist
135 These are typically filesystems that are shared across a network,
142 REF-walk: simple concurrency management with refcounts and spinlocks
143 --------------------------------------------------------------------
146 looking at the actual process of walking along a path. In particular
147 we will start with the handling of the "everything else" part of a
148 pathname, and focus on the "REF-walk" approach to concurrency
151 (indicating the use of RCU-walk) is set.
155 REF-walk is fairly heavy-handed with locks and reference counts. Not
156 as heavy-handed as in the old "big kernel lock" days, but certainly not
157 afraid of taking a lock when one is needed. It uses a variety of
158 different concurrency controls. A background understanding of the
162 The locking mechanisms used by REF-walk include:
164 dentry->d_lockref
167 This uses the lockref primitive to provide both a spinlock and a
168 reference count. The special-sauce of this primitive is that the
170 with a single atomic memory operation.
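The idea of keeping a lock and a count in one word can be sketched in portable C11. This is only an illustration: the bit layout, the ``sketch_`` names, and the use of ``<stdatomic.h>`` are assumptions made for the example, not the kernel's actual lockref code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 0 is the "spinlock held" flag and the
 * remaining bits hold the reference count. */
#define SKETCH_LOCKED   ((uint64_t)1)
#define SKETCH_COUNT_1  ((uint64_t)2)

struct sketch_lockref {
    _Atomic uint64_t lock_count;
};

/* Fast path: bump the count with a single compare-and-swap, but only
 * while the lock bit is clear.  Returns false if the lock was seen
 * held; a real implementation would then take the spinlock instead. */
static bool sketch_lockref_get_fast(struct sketch_lockref *ref)
{
    uint64_t old;

    assert(ref);
    old = atomic_load(&ref->lock_count);
    while (!(old & SKETCH_LOCKED)) {
        /* On failure 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&ref->lock_count, &old,
                                         old + SKETCH_COUNT_1))
            return true;
    }
    return false;
}

/* Read the reference count, ignoring the lock bit. */
static uint64_t sketch_lockref_count(struct sketch_lockref *ref)
{
    return atomic_load(&ref->lock_count) >> 1;
}
```

Because the lock bit and the count share one word, the compare-and-swap fails whenever a lock holder appears between the load and the update, so the fast path never changes the count behind the back of a lock holder.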
172 Holding a reference on a dentry ensures that the dentry won't suddenly
174 will behave as expected. It also protects the ``->d_inode`` reference
177 The association between a dentry and its inode is fairly permanent.
178 For example, when a file is renamed, the dentry and inode move
179 together to the new location. When a file is created the dentry will
183 When a file is deleted, this can be reflected in the cache either by
190 ``d_inode`` be set to ``NULL``. Doing it this way is more efficient for a
193 So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
196 dentry->d_lock
199 ``d_lock`` is a synonym for the spinlock that is part of ``d_lockref`` above.
205 When looking for a name in a directory, REF-walk takes ``d_lock`` on
210 When looking for the parent for a given name (to handle "``..``"),
211 REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
212 but it first tries a more lightweight approach. As seen in
213 ``dget_parent()``, if a reference can be claimed on the parent, and if
220 Looking up a given name in a given directory involves computing a hash
222 accessing that slot in a hash table, and searching the linked list
225 When a dentry is renamed, the name and the parent dentry can both
227 dentry to a different chain in the hash table. If a filename search
228 happened to be looking at a dentry that was moved in this way,
232 The name-lookup process (``d_lookup()``) does *not* try to prevent this
234 ``rename_lock`` is a seqlock that is updated whenever any dentry is
235 renamed. If ``d_lookup`` finds that a rename happened while it
236 unsuccessfully scanned a chain in the hash table, it simply tries
243 a "..", a potential attack occurred and ``handle_dots()`` will bail out with
244 ``-EAGAIN``.
246 inode->i_rwsem
249 ``i_rwsem`` is a read/write semaphore that serializes all changes to a particular
250 directory. This ensures that, for example, an ``unlink()`` and a ``rename()``
252 stable while the filesystem is asked to look up a name that is not
253 currently in the dcache or, optionally, when the list of entries in a
256 This has a complementary role to that of ``d_lock``: ``i_rwsem`` on a
258 on a name protects just one name in a directory. Most changes to the
265 prevents changes during lookup of a name in a directory. ``walk_component()`` uses
268 falls back to ``lookup_slow()`` which takes a shared lock on ``i_rwsem``, checks again that
269 the name isn't in the cache, and then calls in to the filesystem to get a
270 definitive answer. A new dentry will be added to the cache regardless of
277 issues addressed in a subsequent section.
279 If two threads attempt to look up the same name at the same time - a
280 name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
283 based around a secondary hash table (``in_lookup_hashtable``) and a
284 per-dentry flag bit (``DCACHE_PAR_LOOKUP``).
286 To add a new dentry to the cache while only holding a shared lock on
287 ``i_rwsem``, a thread must call ``d_alloc_parallel()``. This allocates a
289 is already a matching dentry in the primary or secondary hash
293 If a matching dentry was found in the primary hash table then that is
294 returned and the caller can know that it lost a race with some other
302 dentry from the secondary hash table - it will normally have been
303 added to the primary hash table already. Note that a ``struct
308 If a matching dentry is found in the secondary hash table,
309 ``d_alloc_parallel()`` has a little more work to do. It first waits for
310 ``DCACHE_PAR_LOOKUP`` to be cleared, using a wait_queue that was passed
321 mnt->mnt_count
324 ``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
325 Per-CPU here means that incrementing the count is cheap as it only
326 uses CPU-local memory, but checking if the count is zero is expensive as
327 it needs to check with every CPU. Taking a ``mnt_count`` reference
329 unmount operations, but does not prevent a "lazy" unmount. So holding
331 in particular, doesn't stabilize the link to the mounted-on dentry. It
333 and it provides a reference to the root dentry of the mounted
334 filesystem. So a reference through ``->mnt_count`` provides a stable
335 reference to the mounted dentry, but not the mounted-on dentry.
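The cost asymmetry of a per-CPU counter can be shown with a small single-threaded model. The fixed CPU count, the explicit ``cpu`` argument, and all ``sketch_`` names are assumptions for illustration; the kernel uses real per-CPU storage and additional synchronization.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4   /* hypothetical CPU count */

struct sketch_percpu_count {
    long delta[SKETCH_NR_CPUS];   /* one slot per CPU; a slot may go negative */
};

/* Cheap: touches only the slot that belongs to the calling CPU. */
static void sketch_count_add(struct sketch_percpu_count *c, int cpu, long n)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    c->delta[cpu] += n;
}

/* Expensive: must visit every CPU's slot to learn the true total. */
static long sketch_count_sum(const struct sketch_percpu_count *c)
{
    long sum = 0;

    for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        sum += c->delta[cpu];
    return sum;
}
```

A reference taken on one CPU and dropped on another leaves a positive and a negative slot, which is why no single slot can answer the question "is the count zero?".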
340 ``mount_lock`` is a global seqlock, a bit like ``rename_lock``. It can be used to
344 crossing a mount point to check that the crossing was safe. That is,
349 was a change, the ``mnt_count`` is decremented and the whole process is
352 When walking up the tree (towards the root) by following a ".." link,
353 a little more care is needed. In this case the seqlock (which
354 contains both a counter and a spinlock) is fully locked to prevent
356 needed to stabilize the link to the mounted-on dentry, which the
363 a "..", a potential attack occurred and ``handle_dots()`` will bail out with
364 ``-EAGAIN``.
377 ----------------------------------------------
379 .. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
381 Throughout the process of walking a path, the current status is stored
382 in a ``struct nameidata``, "namei" being the traditional name - dating
383 all the way back to `First Edition Unix`_ - of the function that
384 converts a "name" to an "inode". ``struct nameidata`` contains (among
390 A ``path`` contains a ``struct vfsmount`` (which is
391 embedded in a ``struct mount``) and a ``struct dentry``. Together these
394 directory identified by a file descriptor), and are updated on each
395 step. A reference through ``d_lockref`` and ``mnt_count`` is always
401 This is a string together with a length (i.e. *not* ``nul`` terminated)
413 This is used to hold a reference to the effective root of the
415 only assigned the first time it is used, or when a non-standard root
416 is requested. Keeping a reference in the ``nameidata`` ensures that
418 with a ``chroot()`` system call.
425 pathname or a symbolic link starts with a "``/``", or (2) a "``..``"
430 ``mount_subtree()``. In each case a pathname is being looked up in a very
432 escape that subtree. It works a bit like a local ``chroot()``.
438 Given a path (``name``) and a nameidata structure (``nd``), check that the
446 described. If it finds a ``LAST_NORM`` component it first calls
449 If that doesn't get a good result, it calls "``lookup_slow()``" which
451 to find a definitive answer.
455 handle_mounts(), to check and handle mount points, in which a new
456 ``struct path`` is created containing a counted reference to the new dentry and
457 a reference to the new ``vfsmount`` which is only counted if it is
459 a symbolic link, step_into() calls pick_link() to deal with it,
463 This "hand-over-hand" sequencing of getting a reference to the new
466 analogue in the "RCU-walk" version.
469 ----------------------------
471 ``link_path_walk()`` only walks as far as setting ``nd->last`` and
472 ``nd->last_type`` to refer to the final component of the path. It does
479 ``path_parentat()`` is clearly the simplest - it just wraps a little bit
482 aiming to create a name (via ``filename_create()``) or remove or rename
483 a name (in which case ``user_path_parent()`` is used). They will use
487 ``path_lookupat()`` is nearly as simple - it is used when an existing
489 calls ``walk_component()`` on the final component through a call to
494 This is important when unmounting a filesystem that is inaccessible, such as
495 one provided by a dead NFS server.
512 the final component, it must be a trailing slash.
515 ---------------------------
517 Apart from symbolic links, there are only two parts of the "REF-walk"
522 ``->d_revalidate()`` dentry method to ensure that the cached information
523 is current. This will often confirm validity or update a few details
524 from a server. In some cases it may find that there has been a change
532 lookup a name can trigger changes to how that lookup should be
533 handled, in particular by mounting a filesystem there. These are
535 tree, but a few notes specifically related to path lookup are in order
538 The Linux VFS has a concept of "managed" dentries. There are three
540 to three different flags that might be set in ``dentry->d_flags``:
552 trigger a new automount.
554 It can selectively allow only some processes to transit through a
555 mount point. When a server process is managing automounts, it may
556 need to access a directory without triggering normal automount
558 filesystem, which will then give it a special pass through
559 ``d_manage()`` by returning ``-EISDIR``.
567 other. So this flag is seen as a hint, not a promise.
569 If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
571 ``mount_lock`` described earlier) and possibly return a new ``vfsmount``
572 and a new ``dentry`` (both with counted references).
578 find a mount point, then this flag causes the ``d_automount()`` dentry
592 This will become more important next time when we examine RCU-walk
595 RCU-walk - faster pathname lookup in Linux
598 RCU-walk is another algorithm for performing pathname lookup in Linux.
599 It is in many ways similar to REF-walk and the two share quite a bit
600 of code. The significant difference in RCU-walk is how it allows for
603 We noted that REF-walk is complex because there are numerous details
604 and special cases. RCU-walk reduces this complexity by simply
605 refusing to handle a number of cases -- it instead falls back to
606 REF-walk. The difficulty with RCU-walk comes from a different
608 quite different from traditional locking, so we will spend a little extra
612 --------------------------
615 thread from changing the data structures that a given thread is
621 goal when reading a shared data structure that no other process is
625 The REF-walk mechanism already described certainly doesn't follow this
627 be other threads modifying the data. RCU-walk, in contrast, is
631 other parts it is important that RCU-walk can quickly fall back to
632 using REF-walk.
634 Pathname lookup always starts in RCU-walk mode but only remains there
640 REF-walk.
642 This stopping requires getting a counted reference on the current
643 ``vfsmount`` and ``dentry``, and ensuring that these are still valid -
644 that a path walk with REF-walk would have found the same entries.
645 This is an invariant that RCU-walk must guarantee. It can only make
647 REF-walk could also have made if it were walking down the tree at the
649 processed with the reliable, if slightly sluggish, REF-walk. If
650 RCU-walk finds it cannot stop gracefully, it simply gives up and
651 restarts from the top with REF-walk.
653 This pattern of "try RCU-walk, if that fails try REF-walk" can be
659 called using different mode flags until a mode is found which works.
660 They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
662 special flag to request "REF-walk". If either of those reports the
663 error ``ESTALE``, a final attempt is made with ``LOOKUP_REVAL`` set (and no
665 revalidated - normally entries are only revalidated if the filesystem
669 REF-walk, but will never then try to switch back to RCU-walk. Places
670 that trip up RCU-walk are much more likely to be near the leaves and
675 --------------------------------
677 RCU is, unsurprisingly, critical to RCU-walk mode. The
678 ``rcu_read_lock()`` is held for the entire time that RCU-walk is walking
679 down a path. The particular guarantee it provides is that the key
680 data structures - dentries, inodes, super_blocks, and mounts - will
687 As we saw above, REF-walk holds a counted reference to the current
691 taken to prevent certain changes from happening. RCU-walk must not
693 Instead, it checks to see if a change has been made, and aborts or
696 To preserve the invariant mentioned above (that RCU-walk may only make
697 decisions that REF-walk could have made), it must make the checks at
698 or near the same places that REF-walk holds the references. So, when
699 REF-walk increments a reference count or takes a spinlock, RCU-walk
700 samples the status of a seqlock using ``read_seqcount_begin()`` or a
701 similar function. When REF-walk decrements the count or drops the
702 lock, RCU-walk checks if the sampled status is still valid using
705 However, there is a little bit more to seqlocks than that. If
706 RCU-walk accesses two different fields in a seqlock-protected
707 structure, or accesses the same field twice, there is no a priori
709 is needed - which it usually is - RCU-walk must take a copy and then
713 imposes a memory barrier so that no memory-read instruction from
715 CPU or by the compiler. A simple example of this can be seen in
717 byte-wise name equality, calls into the filesystem to compare a name
718 against a dentry. The length and name pointer are copied into local
720 are consistent, and only then is ``->d_compare()`` called. When
723 instead has a large comment explaining why the consistency guarantee
724 isn't necessary. A subsequent ``read_seqcount_retry()`` will be
728 the bigger picture of how RCU-walk uses seqlocks.
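The sample, copy, validate sequence can be sketched with C11 atomics. The ``sketch_`` names and the structure layout are hypothetical, modelled loosely on ``read_seqcount_begin()`` and ``read_seqcount_retry()``; the plain (non-atomic) field accesses are a simplification that is only safe in this single-threaded demonstration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

struct sketch_seqcount { atomic_uint sequence; };

/* Hypothetical seqcount-protected record: a name and its length. */
struct sketch_name {
    struct sketch_seqcount seq;
    unsigned int len;
    const char *str;
};

/* Sample the sequence; an odd value means a writer is mid-update. */
static unsigned sketch_read_begin(struct sketch_seqcount *s)
{
    unsigned v;

    while ((v = atomic_load_explicit(&s->sequence, memory_order_acquire)) & 1)
        ;
    return v;
}

/* Non-zero if anything changed since 'start' was sampled. */
static int sketch_read_retry(struct sketch_seqcount *s, unsigned start)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&s->sequence, memory_order_relaxed) != start;
}

/* Writer: the sequence is odd for the duration of the update. */
static void sketch_write(struct sketch_name *n, const char *str, unsigned len)
{
    atomic_fetch_add(&n->seq.sequence, 1);
    n->str = str;
    n->len = len;
    atomic_fetch_add(&n->seq.sequence, 1);
}

/* RCU-walk style reader: copy both fields into locals, validate once,
 * and give up with -ECHILD instead of looping if validation fails. */
static int sketch_snapshot(struct sketch_name *n, unsigned seq,
                           const char **str, unsigned *len)
{
    const char *s = n->str;
    unsigned l = n->len;

    if (sketch_read_retry(&n->seq, seq))
        return -ECHILD;
    *str = s;
    *len = l;
    return 0;
}
```

Only the local copies are handed back to the caller, and only after validation, so the length and pointer are known to belong to the same version of the record.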
730 ``mount_lock`` and ``nd->m_seq``
733 We already met the ``mount_lock`` seqlock when REF-walk used it to
734 ensure that crossing a mount point is performed safely. RCU-walk uses
735 it for that too, but for quite a bit more.
737 Instead of taking a counted reference to each ``vfsmount`` as it
738 descends the tree, RCU-walk samples the state of ``mount_lock`` at the
743 relatively rare, it is reasonable to fall back on REF-walk any time
746 ``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk
747 sequence, whether switching to REF-walk for the rest of the path or
749 down over a mount point (in ``__follow_mount_rcu()``) or up (in
751 whole RCU-walk sequence is aborted and the path is processed again by
752 REF-walk.
754 If RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
755 that, had REF-walk taken counted references on each vfsmount, the
759 ``dentry->d_seq`` and ``nd->seq``
762 In place of taking a count or lock on ``d_lockref``, RCU-walk samples
763 the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
764 ``seq`` field of the nameidata structure, so ``nd->seq`` should always be
765 the current sequence number of ``nd->dentry``. This number needs to be
773 When not at a mount point, ``d_parent`` is followed and its ``d_seq`` is
774 collected. When we are at a mount point, we instead follow the
775 ``mnt->mnt_mountpoint`` link to get a new dentry and collect its
776 ``d_seq``. Then, after finally finding a ``d_parent`` to follow, we must
777 check if we have landed on a mount point and, if so, must find that
778 mount point and follow the ``mnt->mnt_root`` link. This would imply a
783 The inode pointer, stored in ``->d_inode``, is a little more
786 permissions. Symlink handling requires a validated inode pointer too.
787 Rather than revalidating on each access, a copy is made on the first
791 ``lookup_fast()`` is the only lookup routine that is used in RCU-mode,
797 ``__d_lookup_rcu()`` which, on success, returns a new ``dentry`` and a
803 getting a counted reference to the new dentry before dropping that for
804 the old dentry which we saw in REF-walk.
806 No ``inode->i_rwsem`` or even ``rename_lock``
809 A semaphore is a fairly heavyweight lock that can only be taken when it is
811 ``inode->i_rwsem`` plays no role in RCU-walk. If some other thread does
812 take ``i_rwsem`` and modifies the directory in a way that RCU-walk needs
813 to notice, the result will be either that RCU-walk fails to find the
814 dentry that it is looking for, or it will find a dentry which
816 REF-walk mode which can take whatever locks are needed.
818 Though ``rename_lock`` could be used by RCU-walk as it doesn't require
819 any sleeping, RCU-walk doesn't bother. REF-walk uses ``rename_lock`` to
822 something that actually is there. When RCU-walk fails to find
824 already drops down to REF-walk and tries again with appropriate
829 -----------------------------------------
831 That "dropping down to REF-walk" typically involves a call to
832 ``unlazy_walk()``, so named because "RCU-walk" is also sometimes
839 automount point is found, or in a couple of cases involving symlinks.
844 Other reasons for dropping out of RCU-walk that do not trigger a call
847 seqlocks reporting a change. In these cases the relevant function
848 will return ``-ECHILD`` which will percolate up until it triggers a new
849 attempt from the top using REF-walk.
852 takes a reference on each of the pointers that it holds (vfsmount,
855 it, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
856 has been a success and the lookup process continues.
858 Taking a reference on those pointers is not quite as simple as just
859 incrementing a counter. That works to take a second reference if you
861 isn't sufficient if you don't actually have a counted reference at
862 all. For ``dentry->d_lockref``, it is safe to increment the reference
863 counter to get a reference unless it has been explicitly marked as
864 "dead" which involves setting the counter to ``-128``.
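Taking a reference "unless dead" can be sketched as a compare-and-swap loop on a plain atomic counter. The ``sketch_`` names are hypothetical, and the sketch assumes the counter's memory stays valid for the duration of the call, which in RCU-walk is what the RCU read lock provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_DEAD (-128)   /* marker value taken from the text above */

/* Increment the count unless it has been marked dead (negative).
 * Returns true if a reference was obtained. */
static bool sketch_get_not_dead(atomic_int *count)
{
    int old;

    assert(count);
    old = atomic_load(count);
    while (old >= 0) {
        /* On failure 'old' is refreshed and the dead check repeats. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;
}

/* Mark the object dead so that no new references can be taken. */
static void sketch_mark_dead(atomic_int *count)
{
    atomic_store(count, SKETCH_DEAD);
}
```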
867 For ``mnt->mnt_count`` it is safe to take a reference as long as
870 the standard way of calling ``mnt_put()`` - an unmount may have
873 ``MNT_SYNC_UMOUNT`` flag to determine if a simple ``mnt_put()`` is
878 --------------------------
880 RCU-walk depends almost entirely on cached information and often will
882 besides the already-mentioned component-name comparison, where the
883 file system might be included in RCU-walk, and it must know to be
886 If the filesystem has non-standard permission-checking requirements -
887 such as a networked filesystem which may need to check with the server
888 - the ``i_op->permission`` interface might be called during RCU-walk.
890 knows not to sleep, but to return ``-ECHILD`` if it cannot complete
891 promptly. ``i_op->permission`` is given the inode pointer, not the
901 ``d_op->d_revalidate`` may be called in RCU-walk too. This interface
909 A pair of patterns
910 ------------------
912 In various places in the details of REF-walk and RCU-walk, and also in
913 the big picture, there are a couple of related patterns that are worth
917 can see that in the high-level approach of first trying RCU-walk and
918 then trying REF-walk, and in places where ``unlazy_walk()`` is used to
919 switch to REF-walk for the rest of the path. We also saw it earlier
920 in ``dget_parent()`` when following a "``..``" link. It tries a quick way
921 to get a reference, then falls back to taking locks if needed.
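At the top level, "try quickly, then try slowly" amounts to a short chain of attempts. In this minimal sketch the flag values, the callback type, and the demonstration walker are all hypothetical stand-ins; only the order of the attempts and the ``-ECHILD`` / ``-ESTALE`` triggers follow the description given earlier.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_LOOKUP_RCU    0x1   /* hypothetical flag values */
#define SKETCH_LOOKUP_REVAL  0x2

typedef int (*sketch_walk_fn)(const char *path, unsigned flags, void *ctx);

/* Try RCU-walk, then REF-walk, then REF-walk with forced revalidation. */
static int sketch_lookup(const char *path, sketch_walk_fn walk, void *ctx)
{
    int err;

    assert(path && walk);
    err = walk(path, SKETCH_LOOKUP_RCU, ctx);
    if (err == -ECHILD)
        err = walk(path, 0, ctx);
    if (err == -ESTALE)
        err = walk(path, SKETCH_LOOKUP_REVAL, ctx);
    return err;
}

/* Demonstration walker: records each attempt, always refuses RCU-walk,
 * and optionally reports stale data until revalidation is forced. */
struct sketch_trace {
    unsigned flags[3];
    int calls;
    int stale_once;
};

static int sketch_demo_walk(const char *path, unsigned flags, void *ctx)
{
    struct sketch_trace *t = ctx;

    (void)path;
    t->flags[t->calls++] = flags;
    if (flags & SKETCH_LOOKUP_RCU)
        return -ECHILD;
    if (t->stale_once && !(flags & SKETCH_LOOKUP_REVAL))
        return -ESTALE;
    return 0;
}
```

Each step runs only if the previous one asked for it, so a lookup that succeeds in RCU-walk never pays for the slower modes.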
924 again - repeatedly". This is seen with the use of ``rename_lock`` and
925 ``mount_lock`` in REF-walk. RCU-walk doesn't make use of this pattern -
926 if anything goes wrong it is much safer to just abort and try a more
931 needed is a reminder that the system is dynamic and only a limited
937 A walk among the symlinks
944 Then a consideration of access-time updates and summary of the various
948 -----------------
951 appear in a path prior to the final component: directories and symlinks.
954 component on the path. Handling symbolic links requires a bit more
958 a component name refers to a symbolic link, then that component is
959 replaced by the body of the link and, if that body starts with a '/',
961 "``readlink -f``" command does, though it also edits out "``.``" and
965 up a path, and discarding early components is pointless as they aren't
969 which in turn can refer to a third, we may need to keep the remaining
971 ones are completed. These path remnants are kept on a stack of
975 occur in a single path lookup. The most obvious is to avoid loops.
976 If a symlink referred to itself either directly or through
978 successfully - the error ``ELOOP`` must be returned. Loops can be
986 Because it's a latency and DoS issue too. We need to react well to
987 true loops, but also to "very deep" non-loops. It's not about memory
990 Linux imposes a limit on the length of any pathname: ``PATH_MAX``, which
991 is 4096. There are a number of reasons for this limit; not letting the
994 sort of limit is needed for the same reason. Linux imposes a limit of
996 a further limit of eight on the maximum depth of recursion, but that was
997 raised to 40 when a separate stack was implemented, so there is now
1000 The ``nameidata`` structure that we met in an earlier article contains a
1002 symlinks. In many cases this will be sufficient. If it isn't, a
1008 this stack, but we need a bit more. To see that, we need to move on to
1012 ---------------------------------------
1016 to external storage. It is particularly important for RCU-walk to be
1018 it doesn't need to drop down into REF-walk.
1020 .. _object-oriented design pattern: https://lwn.net/Articles/446317/
1024 stored directly in the inode. When a filesystem allocates a ``struct
1025 inode`` it typically allocates extra space to store private data (a
1026 common `object-oriented design pattern`_ in the kernel). This will
1027 sometimes include space for a symlink. The other common location is
1029 pathname in a symlink can be seen as the content of that symlink and
1037 the inode which, itself, is protected by RCU or by a counted reference
1045 situation is not so straightforward. A reference on a dentry or even
1048 a page will not disappear. So for these symlinks the pathname lookup
1049 code needs to ask the filesystem to provide a stable reference and,
1053 Taking a reference to a cache page is often possible even in RCU-walk
1055 but that isn't necessarily a big cost and it is better than dropping
1056 out of RCU-walk mode completely. Even filesystems that allocate
1058 allocate memory without the need to drop out of RCU-walk. If a
1059 filesystem cannot successfully get a reference in RCU-walk mode, it
1060 must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
1061 REF-walk mode in which the filesystem is allowed to sleep.
1063 The place for all this to happen is the ``i_op->get_link()`` inode
1064 method. This is called both in RCU-walk and REF-walk. In RCU-walk the
1065 ``dentry*`` argument is NULL, and ``->get_link()`` can return ``-ECHILD`` to drop out of
1066 RCU-walk. Much like the ``i_op->permission()`` method we
1067 looked at previously, ``->get_link()`` would need to be careful that
1069 holding no counted reference, only the RCU lock. A callback
1070 ``struct delayed_call`` will be passed to ``->get_link()``:
1076 whether in RCU-walk or REF-walk, the symlink stack needs to contain,
1079 - the ``struct path`` to provide a reference to the previous path
1080 - the ``const char *`` to provide a reference to the previous name
1081 - the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
1082 - the ``struct delayed_call`` for later invocation.
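The four items above can be pictured as one stack entry. The following is a sketch with hypothetical ``sketch_`` types and field names standing in for the kernel's structures; the exact size of an entry depends on the ABI, since padding may push it slightly beyond the raw sum of its members.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types named in the list. */
struct sketch_path { void *mnt; void *dentry; };
struct sketch_delayed_call { void (*fn)(void *); void *arg; };

/* One entry of the symlink stack. */
struct sketch_saved {
    struct sketch_path link;           /* reference to the previous path */
    struct sketch_delayed_call done;   /* cleanup to invoke later */
    const char *name;                  /* remnant of the previous name */
    unsigned seq;                      /* for the RCU-walk to REF-walk switch */
};

/* Five pointers and an integer must fit in each entry. */
static_assert(sizeof(struct sketch_saved) >=
              4 * sizeof(void *) + sizeof(void (*)(void *)) + sizeof(unsigned),
              "entry holds five pointers and an integer");

#define SKETCH_MAXSYMLINKS 40   /* limit on symlinks per lookup */

/* Total bytes needed for a full-depth stack of entries. */
static size_t sketch_stack_bytes(void)
{
    return SKETCH_MAXSYMLINKS * sizeof(struct sketch_saved);
}
```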
1086 remnant). On a 64-bit system, this is about 40 bytes per entry;
1088 half a page. So it might seem like a lot, but is by no means
1091 Note that, in a given stack frame, the path remnant (``name``) is not
1096 ---------------------
1099 components in the path and all of the non-final symlinks. As symlinks
1100 are processed, the ``name`` pointer is adjusted to point to a new
1104 a little more complex.
1106 When a symlink is found, walk_component() calls pick_link() via step_into()
1109 stack, and the new value is used as the ``name`` for a while. When the end of
1115 the last component of a symlink itself points to a symlink, we
1116 want to pop the symlink-just-completed off the stack before pushing
1117 the symlink-just-found to avoid leaving empty path remnants that would
1127 forbids it from following a symlink if it finds one, ``WALK_MORE``
1131 decide whether to follow it when it is a symlink and call ``may_follow_link()`` to
1137 A pair of special-case symlinks deserve a little further explanation.
1138 Both result in a new ``struct path`` (with mount and dentry) being set
1141 The more obvious case is a symlink to "``/``". All symlinks starting
1149 aren't really (and are therefore commonly referred to as "magic-links")::
1151 $ ls -l /proc/self/fd/1
1152 lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
1155 something that looks like a symlink. It is really a reference to the
1157 objects you get a name that might refer to the same file - unless it
1159 one of these, the ``->get_link()`` method in "procfs" doesn't return
1160 a string name, but instead calls nd_jump_link() which updates the
1161 ``nameidata`` in place to point to that target. ``->get_link()`` then
1166 --------------------------------------------
1173 callers will want to follow a symlink if one is found, and possibly
1181 path_lookupat(), path_openat() using a loop that calls link_path_walk(),
1183 lookup_last(). If it is a symlink that needs to be followed,
1191 with do_open() for opening a file. Part of open_last_lookups() runs
1192 with ``i_rwsem`` held and this part is in a separate function: lookup_open().
1195 of this article, but a few highlights should help those interested in exploring
1203 will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
1208 2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information
1209 wasn't quite current enough. If it's in RCU-walk, ``-ECHILD`` will be returned;
1210 otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller may
1213 3. An open with O_CREAT **does** follow a symlink in the final component,
1216 ln -s bar /tmp/foo
1219 will create a file called ``/tmp/bar``. This is not permitted if
1221 like for a non-creating open: lookup_last() or open_last_lookups()
1222 returns a non-``NULL`` value, and link_path_walk() gets called and the
1226 ------------------------
1228 We previously said of RCU-walk that it would "take no locks, increment
1230 "footprints" can be needed when handling symlinks as a counted
1231 reference (or even a memory allocation) may be needed. But these
1232 footprints are best kept to a minimum.
1234 One other place where walking down a symlink can involve leaving
1235 footprints in a way that doesn't affect directories is in updating access times.
1236 In Unix (and Linux) every filesystem object has a "last accessed
1237 time", or "``atime``". Passing through a directory to access a file
1239 ``atime``; only listing the contents of a directory can update its ``atime``.
1240 Symlinks are different, it seems. Both reading a symlink (with ``readlink()``)
1241 and looking up a symlink on the way to some other destination can
1247 subject. The `clearest statement`_ is that, if a particular implementation
1248 updates a timestamp in a place not specified by POSIX, this must be
1251 care about access-time updates during pathname lookup.
1256 filesystem, at least, didn't update atime when following a link.
1260 quite complex. Trying to stay in RCU-walk while doing it is best
1266 ``relatime``, many filesystems record ``atime`` with a one-second
1269 It is easy to test if an ``atime`` update is needed while in RCU-walk
1270 mode and, if it isn't, the update can be skipped and RCU-walk mode
1272 path walk drop down to REF-walk. All of this is handled in the
1275 A few flags
1276 -----------
1278 A suitable way to wrap up this tour of pathname walking is to list
1294 to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
1298 context of a particular access being audited.
1306 following "``..``", following a symlink to ``/``, crossing a mount point
1307 or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
1310 to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
1311 ``ND_JUMPED`` is set when the lookup completes - which may be at the
1314 Resolution-restriction flags
1318 and attack scenarios involving changing path components, a series of flags are
1322 ``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
1326 ``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
1328 ``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
1331 bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
1332 lookup is determined by the first mountpoint the path lookup reaches --
1334 with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
1341 resolution of "..". Magic-links are also blocked.
1345 the starting point, and ".." at the starting point will act as a no-op. As with
1347 attacks against ".." resolution. Magic-links are also blocked.
1349 Final-component flags
1359 needs to trigger the mount but otherwise behaves a lot like ``stat()``, so
1361 "``mount --bind``".
1363 ``LOOKUP_FOLLOW`` has a similar function to ``LOOKUP_AUTOMOUNT`` but for
1367 ``WALK_GET`` that we already met, but it is used in a different way.
1369 ``LOOKUP_DIRECTORY`` insists that the final component is a directory.
1371 is found to be followed by a slash.
1375 available to the filesystem and particularly the ``->d_revalidate()``
1376 method. A filesystem can choose not to bother revalidating too hard
1378 These flags were previously useful for ``->lookup()`` too but with the
1379 introduction of ``->atomic_open()`` they are less relevant there.
1382 ---------------
1385 in good shape - various parts are certainly easier to understand now
1386 than even a couple of releases ago. But that doesn't mean it is
1387 "finished". As already mentioned, RCU-walk currently only follows