Lecture 21

Address Space Operations

The address space operations are used to map parts of files into pages in Linux's page cache. This page cache represents data on some physical device (like a disk) that has been mapped into memory. The physical device usually corresponds to a disk, but need not necessarily do so.

This is the structure of a page (note the existence of a mapping field, which is a pointer to an address_space):

typedef struct page {
        struct list_head list;          /* ->mapping has some page lists. */
        struct address_space *mapping;  /* The inode (or ...) we belong to. */
        unsigned long index;            /* Our offset within mapping. */
        struct page *next_hash;         /* Next page sharing our hash bucket in
                                           the pagecache hash table. */
        atomic_t count;                 /* Usage count, see below. */
        unsigned long flags;            /* atomic flags, some possibly
                                           updated asynchronously */
        struct list_head lru;           /* Pageout list, eg. active_list;
                                           protected by pagemap_lru_lock !! */
        struct page **pprev_hash;       /* Complement to *next_hash. */
        struct buffer_head * buffers;   /* Buffer maps us to a disk block. */

        /*
         * On machines where all RAM is mapped into kernel address space,
         * we can simply calculate the virtual address. On machines with
         * highmem some memory is mapped into kernel virtual memory
         * dynamically, so we need a place to store that address.
         * Note that this field could be 16 bits on x86 ... ;)
         *
         * Architectures with slow multiplication can define
         * WANT_PAGE_VIRTUAL in asm/page.h
         */
#if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
        void *virtual;                  /* Kernel virtual address (NULL if
                                           not kmapped, ie. highmem) */
#endif /* CONFIG_HIGMEM || WANT_PAGE_VIRTUAL */
} mem_map_t;
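
To make the index field concrete, here is a small user-space sketch of how a byte position in a file splits into a page index and an offset within that page. The 4096-byte page size (a shift of 12) and the names MODEL_PAGE_SHIFT, MODEL_PAGE_SIZE, pos_to_index, and pos_to_offset are assumptions made for illustration; the kernel uses PAGE_CACHE_SHIFT and PAGE_CACHE_SIZE, whose values depend on the architecture.

```c
#include <stdint.h>

/* Assumed page geometry for this sketch: 4096-byte pages. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Which page of the mapping holds byte position pos (hypothetical helper). */
static unsigned long pos_to_index(uint64_t pos)
{
        return (unsigned long)(pos >> MODEL_PAGE_SHIFT);
}

/* Offset of byte position pos inside its page (hypothetical helper). */
static unsigned long pos_to_offset(uint64_t pos)
{
        return (unsigned long)(pos & (MODEL_PAGE_SIZE - 1));
}
```

Under these assumptions, byte 10000 of a file lives at offset 1808 of the page whose index is 2.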

This is the structure of an address space:

struct address_space {
        struct list_head        clean_pages;    /* list of clean pages */
        struct list_head        dirty_pages;    /* list of dirty pages */
        struct list_head        locked_pages;   /* list of locked pages */
        unsigned long           nrpages;        /* number of total pages */
        struct address_space_operations *a_ops; /* methods */
        struct inode            *host;          /* owner: inode, block_device */
        struct vm_area_struct   *i_mmap;        /* list of private mappings */
        struct vm_area_struct   *i_mmap_shared; /* list of shared mappings */
        spinlock_t              i_shared_lock;  /* and spinlock protecting it */
        int                     gfp_mask;       /* how to allocate the pages */
};

The list heads maintain doubly linked lists of clean, dirty, and locked pages in this address_space (typically the pages of a single inode). The a_ops field contains the functions for this specific address space (for your file system), and the host field is either a pointer to an inode or is null (as in the case of the swapper address_space). The host field will typically be used by the a_ops functions, so it needn't be set if it is not to be used.

Address space operations:

struct address_space_operations {
        int (*writepage)(struct page *);
        int (*readpage)(struct file *, struct page *);
        int (*sync_page)(struct page *);
        /*
         * ext3 requires that a successful prepare_write() call be followed
         * by a commit_write() call - they must be balanced
         */
        int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
        int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        int (*bmap)(struct address_space *, long);
        int (*flushpage) (struct page *, unsigned long);
        int (*releasepage) (struct page *, int);
#define KERNEL_HAS_O_DIRECT /* this is for modules out of the kernel */
        int (*direct_IO)(int, struct inode *, struct kiobuf *, unsigned long, int);
};

RamFS Address Space Operations

The simplest possible file system we can imagine is one that maintains all the pages just in RAM. With Linux's page cache, this becomes extremely simple. All we have to do is zero a page when it is first referenced (since no files exist in the RamFS before they have been created at run time). After that we just need to make sure the files stay in memory.

RamFS defines the minimal necessary set of address space operations:

static struct address_space_operations ramfs_aops = {
        readpage:       ramfs_readpage,
        writepage:      fail_writepage,
        prepare_write:  ramfs_prepare_write,
        commit_write:   ramfs_commit_write
};
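
The a_ops table is an ordinary C table of function pointers: generic code calls through the table, and each file system fills in its own functions. The sketch below is a user-space model of that pattern, not kernel code; toy_page, toy_aops_t, toy_readpage, toy_writepage, toy_aops, and call_readpage are all hypothetical names. It uses the C99 designated-initializer syntax (.readpage = ...), which is the standard spelling of the older GCC "readpage:" form used in the listing above.

```c
#include <stddef.h>

/* Toy stand-in for struct page: just counts how often it was read. */
struct toy_page {
        int reads;
};

/* Toy stand-in for struct address_space_operations. */
struct toy_aops_t {
        int (*readpage)(struct toy_page *);
        int (*writepage)(struct toy_page *);
};

static int toy_readpage(struct toy_page *page)
{
        page->reads++;
        return 0;
}

static int toy_writepage(struct toy_page *page)
{
        (void)page;
        return -1;      /* this toy file system refuses write-out */
}

/* One table per file system; any member not named here would be NULL. */
static const struct toy_aops_t toy_aops = {
        .readpage  = toy_readpage,
        .writepage = toy_writepage,
};

/* Generic caller: dispatch through the table if the method is present. */
static int call_readpage(const struct toy_aops_t *ops, struct toy_page *page)
{
        if (ops == NULL || ops->readpage == NULL)
                return -1;
        return ops->readpage(page);
}
```

The caller never names toy_readpage directly, which is what lets the same generic code serve every file system that supplies its own table.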

readpage

The VM system calls readpage when it needs to get the contents of a page initialized. It normally expects that this initialization will happen by consulting the physical device associated with the address_space host field, and copying what is found there into the page that has been allocated to hold that data. The RamFS version of this is somewhat different.

static int ramfs_readpage(struct file *file, struct page * page)
{
        if (!Page_Uptodate(page)) {
                memset(kmap(page), 0, PAGE_CACHE_SIZE);
                kunmap(page);
                flush_dcache_page(page);
                SetPageUptodate(page);
        }
        UnlockPage(page);
        return 0;
}

Page_Uptodate is a macro that checks the PG_uptodate bit in the page's flags:

#define Page_Uptodate(page)     test_bit(PG_uptodate, &(page)->flags)

The kmap function makes sure a page is mapped into kernel virtual memory and returns its address. High memory pages are not permanently mapped, so they must be mapped before the kernel can operate on them. The memset call sets this (previously unreferenced -- we know because it is being read) page to 0. The kunmap call tells the kernel that the mapping is no longer needed, so that a high memory page can be unmapped. The call to flush_dcache_page makes sure that the data cache lines currently associated with this page's virtual address are flushed. SetPageUptodate sets the PG_uptodate bit we checked earlier. Finally, the page must be marked dirty (SetPageDirty) so that it does not get destroyed when the OS gets short of memory; in the listings shown here the dirty bit is set by ramfs_prepare_write and set again by fail_writepage rather than in ramfs_readpage itself. If the OS sees that this page is clean, it may reclaim it, thinking that it has been written to disk.
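
The zero-on-first-read behaviour can be tried out in user space. The following is a minimal model, not the kernel code: struct cached_page, the CP_ constants, and readpage_model are hypothetical names, a 4096-byte page is assumed, and kmap, kunmap, and flush_dcache_page have no counterpart because an ordinary array stands in for the page.

```c
#include <string.h>

#define CP_PAGE_SIZE 4096               /* assumed page size for the sketch */
#define CP_UPTODATE  (1UL << 0)         /* stands in for PG_uptodate */
#define CP_LOCKED    (1UL << 1)         /* stands in for PG_locked */

struct cached_page {
        unsigned long flags;
        unsigned char data[CP_PAGE_SIZE];
};

/*
 * Model of ramfs_readpage: zero the page only the first time it is seen,
 * mark it up to date, then unlock it.  No device is consulted.
 */
static int readpage_model(struct cached_page *page)
{
        if (!(page->flags & CP_UPTODATE)) {
                memset(page->data, 0, CP_PAGE_SIZE);
                page->flags |= CP_UPTODATE;
        }
        page->flags &= ~CP_LOCKED;
        return 0;
}
```

Calling readpage_model a second time on the same page leaves its contents alone, because the up-to-date flag is already set.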

writepage

The VM system calls writepage when it needs to flush a page from its cache. It does this only if the subject page was dirty. It expects that the page will be written to disk (or some other physical device) and the storage associated with the page will be reclaimed.

The writepage function for RamFS must leave the page dirty (that is, it must fail to clean it) for the reason given above: there is no disk copy, so a page that (seemingly correctly) became clean could be reclaimed and its contents lost.

Linux provides a default fail_writepage that does the right thing for in-memory file systems.

/*
 * In-memory filesystems have to fail their
 * writepage function - and this has to be
 * worked around in the VM layer..
 *
 * We
 *  - mark the page dirty again (but do NOT
 *    add it back to the inode dirty list, as
 *    that would livelock in fdatasync)
 *  - activate the page so that the page stealer
 *    doesn't try to write it out over and over
 *    again.
 */
int fail_writepage(struct page *page)
{
        /* Only activate on memory-pressure, not fsync.. */
        if (PageLaunder(page)) {
                activate_page(page);
                SetPageReferenced(page);
        }

        /* Set the page dirty again, unlock */
        SetPageDirty(page);
        UnlockPage(page);
        return 0;
}

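The PG_launder bit (tested by PageLaunder) is set when the virtual memory system needs pages, that is, when writepage is being called because of memory pressure rather than fsync. The activate_page function makes sure this page is placed on the active page list (rather than the inactive list).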
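
Why re-dirtying the page keeps it alive can be seen in a deliberately simplified user-space model of a reclaim pass. This is only a sketch under stated assumptions: struct lru_page, disk_writepage, fail_writepage_model, and reclaim_pass are hypothetical names, the model assumes the dirty bit is cleared before writepage is called, and the real VM is considerably more involved.

```c
#include <stdbool.h>

struct lru_page {
        bool dirty;     /* stands in for PG_dirty */
        bool active;    /* true once moved to the active list */
        bool freed;     /* true once the model has reclaimed the page */
};

typedef int (*writepage_fn)(struct lru_page *);

/* A disk-backed writepage: pretend the data reached the device. */
static int disk_writepage(struct lru_page *page)
{
        (void)page;     /* page stays clean; its data is now "on disk" */
        return 0;
}

/* Model of fail_writepage under memory pressure: activate and re-dirty. */
static int fail_writepage_model(struct lru_page *page)
{
        page->active = true;
        page->dirty = true;
        return 0;
}

/*
 * One simplified reclaim pass over a single page.  Returns true if the
 * page was reclaimed.  Dirty pages are handed to writepage first; clean
 * pages are assumed to have a copy elsewhere and are freed.
 */
static bool reclaim_pass(struct lru_page *page, writepage_fn writepage)
{
        if (page->freed)
                return false;
        if (page->dirty) {
                page->dirty = false;    /* assume write-out will succeed */
                writepage(page);
                return false;
        }
        page->freed = true;
        return true;
}
```

In this model a page handled by disk_writepage is freed on the second pass, while a page handled by fail_writepage_model is dirty again after every pass and is never freed.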

prepare_write

Normally, when the user writes to a file, something like the following code gets executed:

page = __grab_cache_page(mapping, index, &cached_page);
mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
copy_from_user(kaddr+offset, buf, bytes);
mapping->a_ops->commit_write(file, page, offset, offset+bytes);

The __grab_cache_page call tries to find the page or get a new one.

The prepare_write call must make sure the page is mapped and marked dirty. Thus we can be sure that the page won't disappear from the page cache before we actually write to it.

The copy_from_user call copies user space data into the page.

Finally the commit_write call is responsible for doing whatever else is necessary. Normally this entails:

Arranging for the data to be written out, either by manipulating page->buffers in such a way that the appropriate page buffers will be written when memory gets low, or by setting the page dirty bit and relying on writepage to do its job. The second method may fail if the filesystem granularity is smaller than the kernel PAGE_SIZE.
Unmapping the page.
Updating the inode structure correctly.

static int ramfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
        void *addr = kmap(page);
        if (!Page_Uptodate(page)) {
                memset(addr, 0, PAGE_CACHE_SIZE);
                flush_dcache_page(page);
                SetPageUptodate(page);
        }
        SetPageDirty(page);
        return 0;
}

If the PG_uptodate flag has not been set, then this page was not already in memory, yet we are asking to write it. This can occur when __grab_cache_page could not find the page and had to get a new one, for example on the first write to a part of the file that has never been read. In any event, we need to note that the page is dirty.

commit_write

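In ramfs_commit_write the position of the end of the write is computed from the page index and the to argument, and the inode's size is increased if the write went past the current end of the file: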

static int ramfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
        struct inode *inode = page->mapping->host;
        loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;

        kunmap(page);
        if (pos > inode->i_size)
                inode->i_size = pos;
        return 0;
}
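
The whole prepare / copy / commit sequence, including the i_size update, can be modelled in user space. This is a sketch under stated assumptions rather than the kernel's code: struct wr_inode, struct wr_page, the WR_ constants, and the model_ functions are hypothetical names, pages are assumed to be 4096 bytes, memcpy stands in for copy_from_user, and the write is assumed to fit within a single page.

```c
#include <assert.h>
#include <string.h>

#define WR_PAGE_SHIFT 12                        /* assumed: 4096-byte pages */
#define WR_PAGE_SIZE  (1UL << WR_PAGE_SHIFT)

struct wr_inode {
        unsigned long long i_size;              /* file length in bytes */
};

struct wr_page {
        unsigned long index;                    /* page number within the file */
        int uptodate;
        int dirty;
        unsigned char data[WR_PAGE_SIZE];
};

/* Model of ramfs_prepare_write: zero a fresh page, then mark it dirty. */
static void model_prepare_write(struct wr_page *page)
{
        if (!page->uptodate) {
                memset(page->data, 0, WR_PAGE_SIZE);
                page->uptodate = 1;
        }
        page->dirty = 1;
}

/* Model of ramfs_commit_write: grow i_size if the write ended past it. */
static void model_commit_write(struct wr_inode *inode, struct wr_page *page,
                               unsigned to)
{
        unsigned long long pos =
                ((unsigned long long)page->index << WR_PAGE_SHIFT) + to;

        if (pos > inode->i_size)
                inode->i_size = pos;
}

/* The prepare / copy / commit sequence for a write that fits in one page. */
static void model_write(struct wr_inode *inode, struct wr_page *page,
                        unsigned offset, const void *buf, unsigned bytes)
{
        assert(offset <= WR_PAGE_SIZE && bytes <= WR_PAGE_SIZE - offset);
        model_prepare_write(page);
        memcpy(page->data + offset, buf, bytes); /* stands in for copy_from_user */
        model_commit_write(inode, page, offset + bytes);
}
```

With 4096-byte pages, writing 5 bytes at offset 100 of page 2 into an empty file leaves i_size at 2 * 4096 + 105 = 8297; a later write that ends before that position leaves i_size unchanged.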

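In looking at the kernel code, I ran across numerous occurrences of the lock identifier BKL, which is not explained where it appears. It stands for the Big Kernel Lock, a single global lock acquired with lock_kernel() and released with unlock_kernel().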