.. SPDX-License-Identifier: GPL-2.0

======================
Memory Protection Keys
======================

Memory Protection Keys provide a mechanism for enforcing page-based
protections, but without requiring modification of the page tables when an
application changes protection domains.

Protection Keys for Userspace (PKU) is a feature which can be found on:
        * Intel server CPUs, Skylake and later
        * Intel client CPUs, Tiger Lake (11th Gen Core) and later
        * Future AMD CPUs
        * arm64 CPUs implementing the Permission Overlay Extension (FEAT_S1POE)

x86_64
======

Pkeys work by dedicating 4 previously reserved bits in each page table entry to
a "protection key", giving 16 possible keys.

Protections for each key are defined with a per-CPU user-accessible register
(PKRU).  PKRU is a 32-bit register storing two bits (Access Disable and Write
Disable) for each of the 16 keys.
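
The two-bits-per-key layout can be illustrated with a minimal C sketch.  It
assumes the bit ordering documented for PKRU in the Intel SDM (bit 2*n is
Access Disable and bit 2*n+1 is Write Disable for key n).  The macro and helper
names are hypothetical, and the helpers only manipulate a plain integer; they
do not touch the real register::

	#include <stdint.h>

	#define PKRU_AD_BIT		0x1u	/* Access Disable */
	#define PKRU_WD_BIT		0x2u	/* Write Disable */
	#define PKRU_BITS_PER_PKEY	2

	/* Extract the two permission bits for 'pkey' from a PKRU value. */
	static inline uint32_t pkru_get_bits(uint32_t pkru, int pkey)
	{
		int shift = pkey * PKRU_BITS_PER_PKEY;

		return (pkru >> shift) & (PKRU_AD_BIT | PKRU_WD_BIT);
	}

	/* Return a copy of 'pkru' with the bits for 'pkey' replaced by 'bits'. */
	static inline uint32_t pkru_set_bits(uint32_t pkru, int pkey, uint32_t bits)
	{
		int shift = pkey * PKRU_BITS_PER_PKEY;
		uint32_t mask = (uint32_t)(PKRU_AD_BIT | PKRU_WD_BIT) << shift;

		return (pkru & ~mask) | ((bits << shift) & mask);
	}

For example, pkru_set_bits(0, 1, PKRU_WD_BIT) yields 0x8, a value that would
make key 1 read-only while leaving every other key unrestricted.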

Being a CPU register, PKRU is inherently thread-local, potentially giving each
thread a different set of protections from every other thread.

There are two instructions (RDPKRU/WRPKRU) for reading and writing the
register.  The feature is only available in 64-bit mode, even though there is
theoretically space in the PAE PTEs.  These permissions are enforced on data
accesses only and have no effect on instruction fetches.

arm64
=====

Pkeys use 3 bits in each page table entry to encode a "protection key index",
giving 8 possible keys.

Protections for each key are defined with a per-CPU user-writable system
register (POR_EL0).  This is a 64-bit register encoding read, write and execute
overlay permissions for each protection key index.
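
A comparable sketch for the POR_EL0 encoding follows.  The 4-bit field width
and the R/X/W bit values are assumptions modelled on the definitions used by
the kernel's arm64 pkey selftests and should be checked against the
architecture reference.  The helper names are hypothetical, and the code only
operates on a plain 64-bit integer::

	#include <stdint.h>

	#define POR_BITS_PER_PKEY	4	/* assumed field width per key index */
	#define POE_R			0x1ULL	/* assumed: read permitted */
	#define POE_X			0x2ULL	/* assumed: execute permitted */
	#define POE_W			0x4ULL	/* assumed: write permitted */
	#define POE_MASK		0xfULL

	/* Extract the overlay permission field for 'pkey' from a POR value. */
	static inline uint64_t por_get_perm(uint64_t por, int pkey)
	{
		return (por >> (pkey * POR_BITS_PER_PKEY)) & POE_MASK;
	}

	/* Return a copy of 'por' with the field for 'pkey' replaced by 'perm'. */
	static inline uint64_t por_set_perm(uint64_t por, int pkey, uint64_t perm)
	{
		int shift = pkey * POR_BITS_PER_PKEY;

		return (por & ~(POE_MASK << shift)) | ((perm & POE_MASK) << shift);
	}

Under these assumptions, por_set_perm(0, 2, POE_R | POE_W) yields 0x500.  Note
that the fields grant permissions here, whereas the PKRU bits on x86_64 disable
them.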

Being a CPU register, POR_EL0 is inherently thread-local, potentially giving
each thread a different set of protections from every other thread.

Unlike x86_64, the protection key permissions also apply to instruction
fetches.

Syscalls
========

There are 3 system calls which directly interact with pkeys::

	int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
	int pkey_free(int pkey);
	int pkey_mprotect(unsigned long start, size_t len,
			  unsigned long prot, int pkey);

Before a pkey can be used, it must first be allocated with pkey_alloc().  An
application writes to the architecture-specific CPU register directly in order
to change access permissions to memory covered by a key.  In this example
this is wrapped by a C function called pkey_set()::

	int real_prot = PROT_READ|PROT_WRITE;
	pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
	/* ... application runs here ... */

Now, if the application needs to update the data at 'ptr', it can
gain access, do the update, then remove its write access::

	pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE
	*ptr = foo; // assign something
	pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again

When the application frees the memory, it should also free the pkey, since
the key is no longer in use::

	munmap(ptr, PAGE_SIZE);
	pkey_free(pkey);

.. note:: pkey_set() is a wrapper around writing to the CPU register.
          Example implementations can be found in
          tools/testing/selftests/mm/pkey-{arm64,powerpc,x86}.h
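
The fragments above can be combined into one function with error handling.
This is only a sketch: it assumes the userspace wrappers that glibc 2.27 and
later declare in <sys/mman.h> (pkey_alloc(), pkey_mprotect(), pkey_set() and
pkey_free()) rather than raw syscalls, and pkey_demo() is a hypothetical name.
It treats a pkey_alloc() failure as "pkeys unavailable" instead of an error::

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Returns 0 on success, 1 if no pkey is available, -1 on other errors. */
	static int pkey_demo(void)
	{
		size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
		int ret = -1;
		int pkey;
		int *ptr;

		pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
		if (pkey < 0) {
			/* no kernel or hardware support, or no free key left */
			perror("pkey_alloc");
			return 1;
		}

		ptr = mmap(NULL, page_size, PROT_NONE,
			   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		if (ptr == MAP_FAILED)
			goto out_free;

		if (pkey_mprotect(ptr, page_size, PROT_READ | PROT_WRITE, pkey))
			goto out_unmap;

		if (pkey_set(pkey, 0))			/* allow writes */
			goto out_unmap;
		*ptr = 42;
		if (pkey_set(pkey, PKEY_DISABLE_WRITE))	/* read-only again */
			goto out_unmap;

		ret = (*ptr == 42) ? 0 : -1;		/* reads still work */

	out_unmap:
		munmap(ptr, page_size);
	out_free:
		pkey_free(pkey);
		return ret;
	}

The memory is unmapped before the key is freed, so no mapping still refers to
the key when it is returned to the kernel.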

Behavior
========

The kernel attempts to make protection keys consistent with the
behavior of a plain mprotect().  For instance, if you do this::

	mprotect(ptr, size, PROT_NONE);
	something(ptr);

you can expect the same effects with protection keys when doing this::

	pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
	pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
	something(ptr);

That should be true whether something() is a direct access to 'ptr'
like::

	*ptr = foo;

or when the kernel does the access on the application's behalf like
with a read()::

	read(fd, ptr, 1);

The kernel will send a SIGSEGV in both cases, but si_code will be set
to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
the plain mprotect() permissions are violated.
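
A handler can tell the two cases apart by inspecting si_code.  The following
is a minimal sketch with hypothetical function names.  It assumes a handler
installed with SA_SIGINFO, and it defines SEGV_PKERR with the Linux value 4
only if the C library headers do not already provide it::

	#define _GNU_SOURCE
	#include <signal.h>
	#include <string.h>
	#include <unistd.h>

	#ifndef SEGV_PKERR
	#define SEGV_PKERR 4	/* Linux value, for older libc headers */
	#endif

	/* Map a SIGSEGV si_code to a short description. */
	static const char *segv_reason(int si_code)
	{
		switch (si_code) {
		case SEGV_MAPERR:
			return "address not mapped\n";
		case SEGV_ACCERR:
			return "mapping permissions violated\n";
		case SEGV_PKERR:
			return "protection key violated\n";
		default:
			return "other SIGSEGV cause\n";
		}
	}

	static void segv_handler(int sig, siginfo_t *info, void *ctx)
	{
		const char *msg = segv_reason(info->si_code);

		(void)sig;
		(void)ctx;
		/* write() and _exit() are async-signal-safe, unlike printf() */
		if (write(STDERR_FILENO, msg, strlen(msg)) < 0)
			_exit(2);
		_exit(1);
	}

	/* Returns 0 on success, -1 on failure (as sigaction() does). */
	static int install_segv_handler(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = segv_handler;
		sa.sa_flags = SA_SIGINFO;
		sigemptyset(&sa.sa_mask);
		return sigaction(SIGSEGV, &sa, NULL);
	}

The handler exits rather than returning, because returning without changing
the permissions would simply re-execute the faulting access.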

Note that kernel accesses from a kthread (such as io_uring) will use a default
value for the protection key register and so will not be consistent with
userspace's value of the register or mprotect().