The address space operations are used to map parts of files into pages in Linux's page cache. This page cache represents data on some physical device that has been mapped into memory. That device is usually a disk, but need not be.
This is the structure of a page (note the mapping field, which points to an address_space):
typedef struct page {
    struct list_head list;          /* ->mapping has some page lists. */
    struct address_space *mapping;  /* The inode (or ...) we belong to. */
    unsigned long index;            /* Our offset within mapping. */
    struct page *next_hash;         /* Next page sharing our hash bucket in
                                       the pagecache hash table. */
    atomic_t count;                 /* Usage count, see below. */
    unsigned long flags;            /* atomic flags, some possibly
                                       updated asynchronously */
    struct list_head lru;           /* Pageout list, eg. active_list;
                                       protected by pagemap_lru_lock !! */
    struct page **pprev_hash;       /* Complement to *next_hash. */
    struct buffer_head *buffers;    /* Buffer maps us to a disk block. */

    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
    void *virtual;                  /* Kernel virtual address (NULL if
                                       not kmapped, ie. highmem) */
#endif /* CONFIG_HIGHMEM || WANT_PAGE_VIRTUAL */
} mem_map_t;
This is the structure of an address space:
struct address_space {
    struct list_head clean_pages;           /* list of clean pages */
    struct list_head dirty_pages;           /* list of dirty pages */
    struct list_head locked_pages;          /* list of locked pages */
    unsigned long nrpages;                  /* number of total pages */
    struct address_space_operations *a_ops; /* methods */
    struct inode *host;                     /* owner: inode, block_device */
    struct vm_area_struct *i_mmap;          /* list of private mappings */
    struct vm_area_struct *i_mmap_shared;   /* list of shared mappings */
    spinlock_t i_shared_lock;               /* and spinlock protecting it */
    int gfp_mask;                           /* how to allocate the pages */
};
The list heads maintain doubly linked lists of clean, dirty, and locked
pages in this address_space (typically the pages of a single inode).
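To illustrate how these intrusive lists are walked, here is a minimal sketch using the standard 2.4 list macros (assuming kernel context, with mapping pointing at a populated address_space); the list field of struct page, shown above, is the node that threads a page onto these lists:
struct list_head *cur;
struct page *page;

list_for_each(cur, &mapping->clean_pages) {
    /* list_entry() recovers the struct page that contains this node */
    page = list_entry(cur, struct page, list);
    /* ... examine page->index, page->flags, etc. ... */
}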
The a_ops field contains the functions for this specific
address space (those of your file system), and the host field is either a
pointer to an inode or is NULL (as in the case of the swapper
address_space). The host field will
typically be used by the a_ops functions, so it needn't
be set if it is not to be used.
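To make the wiring concrete, here is a minimal sketch, loosely following ramfs_get_inode() in the 2.4 sources, of how a file system attaches its operations to a newly created inode (sb and mode are assumed to be supplied by the caller):
struct inode *inode = new_inode(sb);

if (inode) {
    inode->i_mode = mode;
    /* hook our methods into the inode's page cache mapping; the core
       inode code has already pointed i_mapping->host back at the inode */
    inode->i_mapping->a_ops = &ramfs_aops;
}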
Address space operations:
struct address_space_operations {
    int (*writepage)(struct page *);
    int (*readpage)(struct file *, struct page *);
    int (*sync_page)(struct page *);
    /*
     * ext3 requires that a successful prepare_write() call be followed
     * by a commit_write() call - they must be balanced
     */
    int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
    int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    int (*bmap)(struct address_space *, long);
    int (*flushpage)(struct page *, unsigned long);
    int (*releasepage)(struct page *, int);
#define KERNEL_HAS_O_DIRECT /* this is for modules out of the kernel */
    int (*direct_IO)(int, struct inode *, struct kiobuf *, unsigned long, int);
};
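Before turning to the RamFS implementations, here is a hedged sketch of how a caller goes through this table, modeled loosely on what the 2.4 generic file code does (error handling elided):
struct page *page = grab_cache_page(mapping, index);  /* find or allocate; returned locked */

if (page) {
    if (Page_Uptodate(page))
        UnlockPage(page);                      /* already filled in */
    else
        mapping->a_ops->readpage(file, page);  /* fill the page; unlocks it */
    wait_on_page(page);                        /* readpage may complete asynchronously */
    /* ... use the page contents ... */
    page_cache_release(page);                  /* drop the reference grab_cache_page took */
}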
RamFS defines the minimal necessary set of address space operations:
static struct address_space_operations ramfs_aops = {
    readpage:      ramfs_readpage,
    writepage:     fail_writepage,
    prepare_write: ramfs_prepare_write,
    commit_write:  ramfs_commit_write
};
The VM system calls readpage when it needs to get the
contents of a page initialized. It normally expects that this
initialization will happen by consulting the physical device
associated with the address_space host
field, and copying what is found there into the page that has been
allocated to hold that data. The RamFS version of this is somewhat
different.
static int ramfs_readpage(struct file *file, struct page *page)
{
    if (!Page_Uptodate(page)) {
        memset(kmap(page), 0, PAGE_CACHE_SIZE);
        kunmap(page);
        flush_dcache_page(page);
        SetPageUptodate(page);
    }
    UnlockPage(page);
    return 0;
}
Page_Uptodate is a macro that checks the PG_uptodate
bit in the page's flags:
#define Page_Uptodate(page) test_bit(PG_uptodate, &(page)->flags)
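The corresponding setter used in ramfs_readpage works the same way in reverse; in its simplest generic form it amounts to the following, though some 2.4 architectures wrap it in an extra cache-management hook:
#define SetPageUptodate(page) set_bit(PG_uptodate, &(page)->flags)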
The kmap function makes sure a page is mapped into the kernel's
address space and returns its virtual address; high memory pages may
need to be mapped in before they can be operated upon. The memset
call sets this (previously unreferenced -- we know, because it is only
now being read for the first time) page to 0. The kunmap call lets
the kernel know that it may unmap the page again if it lives in high
memory. The call to flush_dcache_page makes sure that the data cache
lines currently associated with this page's virtual address are flushed.
SetPageUptodate sets the PG_uptodate bit we checked earlier.
Finally, the page must be marked dirty (RamFS does this with
SetPageDirty in prepare_write, below) so that
it does not get destroyed when the OS gets short of memory.
If the OS sees that this page is clean, it may reclaim it, thinking
that it has been written to disk.
The VM system calls writepage when it needs to
flush a page from its cache. It does this only if the subject
page was dirty. It expects that the page will be written to
disk (or some other physical device) and the storage associated with
the page will be reclaimed.
The writepage function for RamFS must fail without clearing
the dirty bit, for the same reason that the dirty bit must
(seemingly incorrectly) be set in the first place: a clean page
may be reclaimed, and with no backing device its contents would be lost.
Linux provides a default, fail_writepage, that does the
right thing for in-memory file systems.
/*
 * In-memory filesystems have to fail their
 * writepage function - and this has to be
 * worked around in the VM layer..
 *
 * We
 *  - mark the page dirty again (but do NOT
 *    add it back to the inode dirty list, as
 *    that would livelock in fdatasync)
 *  - activate the page so that the page stealer
 *    doesn't try to write it out over and over
 *    again.
 */
int fail_writepage(struct page *page)
{
    /* Only activate on memory-pressure, not fsync.. */
    if (PageLaunder(page)) {
        activate_page(page);
        SetPageReferenced(page);
    }

    /* Set the page dirty again, unlock */
    SetPageDirty(page);
    UnlockPage(page);
    return 0;
}
The PG_launder bit (tested by the PageLaunder macro) is set on a page
when the virtual memory system is writing it out to reclaim memory,
not when the write-out comes from an fsync. The activate_page function
makes sure this page is placed on the active page list (rather than
the inactive list).
Normally, when the user writes to a file, something like the following code gets executed:
page = __grab_cache_page(mapping, index, &cached_page);
mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
copy_from_user(kaddr+offset, buf, bytes);
mapping->a_ops->commit_write(file, page, offset, offset+bytes);
The __grab_cache_page call tries to find the page or
get a new one.
The prepare_write call must make sure the page is
mapped and marked dirty. Thus we can be sure that the page won't
disappear from the page cache before we actually write to it.
The copy_from_user call copies user space data into the
page.
Finally, the commit_write call is responsible for doing
whatever else is necessary. Normally this entails either setting up
page->buffers in such a way that
the appropriate buffers will be written out when memory gets
low, or setting the page's dirty bit and relying on writepage
to do its job.
The second method may fail if the filesystem's block granularity is
smaller than the kernel's PAGE_SIZE.
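For contrast with the RamFS versions that follow, here is a sketch of the first method as a block-backed 2.4 filesystem such as ext2 typically implements it, delegating to the generic buffer-layer helpers (myfs_get_block is a hypothetical stand-in for the filesystem's block-mapping routine):
static int myfs_prepare_write(struct file *file, struct page *page,
                              unsigned from, unsigned to)
{
    /* attach buffer_heads to the page and read in any partly
       overwritten blocks */
    return block_prepare_write(page, from, to, myfs_get_block);
}

static int myfs_commit_write(struct file *file, struct page *page,
                             unsigned from, unsigned to)
{
    /* mark the affected buffers dirty and update i_size */
    return generic_commit_write(file, page, from, to);
}
RamFS, having no backing device, takes the second route: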
static int ramfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
    void *addr = kmap(page);

    if (!Page_Uptodate(page)) {
        memset(addr, 0, PAGE_CACHE_SIZE);
        flush_dcache_page(page);
        SetPageUptodate(page);
    }
    SetPageDirty(page);
    return 0;
}
If the PG_uptodate flag has not been set, then this
page was not already in memory, yet we are asking to write to it.
This can happen when a write covers only part of a page that has never
been read in: the rest of the page must be zeroed before the partial
write lands on top of it. In any event, we need to note that the page is dirty.
ramfs_commit_write updates the size of the file, recorded in the inode, when the write extends past the current end of file:
static int ramfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
    struct inode *inode = page->mapping->host;
    loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;

    kunmap(page);
    if (pos > inode->i_size)
        inode->i_size = pos;
    return 0;
}
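To make the arithmetic concrete: on an architecture with 4 KB pages (a PAGE_CACHE_SHIFT of 12), committing a write that ends at byte to = 100 of page index 3 gives pos = (3 << 12) + 100 = 12388, so a file that was previously shorter now has an i_size of 12388.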
In looking at the kernel code, I ran across numerous occurrences of the lock identifier BKL, whose meaning was never explained there: it is the Big Kernel Lock, the single global lock, acquired with lock_kernel() and released with unlock_kernel(), that older kernel code uses to serialize itself.
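For reference, code holding the BKL is simply bracketed by those two calls; a trivial sketch:
lock_kernel();
/* ... critical region, serialized against every other BKL holder ... */
unlock_kernel();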