DAS Stack: Let’s continue the journey – About the “Disk”

From a typical internal-construction perspective, disks can be broadly classified into three types: the physical hard disk, the flash-based Solid State Disk (SSD) and the virtual disk presented by storage arrays (network disks).

The physical hard disks are usually made up of a hard metal disk (usually aluminum) coated with a magnetic material. The magnetic material records the data and the aluminum disk provides the rigid support.

The diagram below shows a quick view of the insides of a typical hard disk drive.

More detail at http://news.bbc.co.uk/2/hi/technology/6677545.stm

Let’s quickly touch a few areas in a word or two. For further information, http://en.wikipedia.org/wiki/Hard_disk_drive is an excellent source of detail.

The disk, or platters, here are held and rotated by the spindle. The Read/Write head moves over the surface of the disk and, of course, reads or writes the disk as it rotates. The head arm holds the Read/Write head(s) and is positioned by the voice coil actuator. The air filter keeps dust and other particles from entering the hard drive compartment.

Just for notes, the head “flies” above the disk with a very thin gap between the head and the disk. If it ever touches the disk while spinning, the head “crashes” and the disk is essentially useless. There are some data recovery technologies which can try to recover the data that is not on the crash path. However, as far as I know, it is not practical to extract data from the tracks that are actually in the crash zone.

Disk Block Layout

This diagram shows a rough view of the disk layout. The disk itself is split into tracks and sectors. There are two possibilities with this kind of arrangement, as you might guess or know. The first is variable track length (or circumference, if you prefer). This is a typical concentric-circle division. The inner circles are smaller and hence have a smaller circumference, while the outer circles are larger, with a greater circumference.

The other is fixed track length. This can normally be achieved with spirals instead of concentric circles, so that each track is of a fixed length. Near the center of the disk, a revolution can hold fewer of these fixed-length tracks than near the edge of the disk. This is the more standard industry practice. The diagram below shows a typical track and sector division. Each sector is further divided into blocks, and we have that division on display here as well.

The block is the smallest individually addressable entity on the disk. It is usually 512 or 520 bytes, with 512 being the most common. There is an Advanced Format proposal which raises the block size to 4096 bytes (4K). The host system might need some support for this format, though. More on this later.
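To put some numbers on this, here is a tiny sketch (with a nominal 1 TB capacity picked purely for illustration) of how the block size determines the number of addressable blocks, and why unaligned 512-byte accesses on a 4K Advanced Format drive force a read-modify-write of the containing physical block:

```python
# Illustrative only: how block size changes the number of addressable
# blocks a drive exposes, and a trivial alignment check for 4K drives.

CAPACITY = 1_000_000_000_000   # a nominal 1 TB drive (example figure)

for block_size in (512, 4096):
    print(block_size, CAPACITY // block_size)

# A 512-byte access to a 4K-native drive does not line up with a
# physical block, so the drive must read-modify-write the whole 4K block:
def is_aligned(offset_bytes, length_bytes, physical_block=4096):
    return offset_bytes % physical_block == 0 and length_bytes % physical_block == 0

print(is_aligned(4096, 4096))  # aligned: maps cleanly to one physical block
print(is_aligned(512, 512))    # unaligned: triggers read-modify-write
```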

Shown at http://computershopper.com/feature/how-it-works-platter-based-hard-drive

The drive has the disk or platter(s) (1), the spindle (2) to hold and spin them, the head arm (3) to hold the Read/Write heads, the voice coil (4) to move the head around, the heads (5), and the head landing zone (9) where the head can rest without crashing onto the disk during power down. The tracks (6), sectors (8) and blocks (7) are the locations on the disk where data is stored and retrieved.

Typically, the disk is accessed as a set of serially numbered blocks – the Logical Block Addresses (LBA). The on-disk electronics worry about translating an LBA into a location on the disk and storing or retrieving the information there. This also makes the host system’s life easier, as it only needs to track a logical block number – no need to remember on which platter, on which side, on which track and in which sector the block of interest lives.
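The translation the electronics hide is essentially the classic CHS-to-LBA arithmetic. A minimal sketch, with a made-up geometry (real drives no longer expose their true geometry at all):

```python
# Classic CHS <-> LBA translation, the kind old BIOSes and early drive
# electronics performed. Geometry values below are purely illustrative.

HEADS = 16       # heads (platter surfaces) per cylinder
SECTORS = 63     # sectors per track (sector numbering starts at 1)

def chs_to_lba(cylinder, head, sector):
    """Map a (cylinder, head, sector) triple to a linear block address."""
    return (cylinder * HEADS + head) * SECTORS + (sector - 1)

def lba_to_chs(lba):
    """Invert the mapping: find which surface/track/sector holds a block."""
    cylinder, rest = divmod(lba, HEADS * SECTORS)
    head, sector0 = divmod(rest, SECTORS)
    return cylinder, head, sector0 + 1

print(chs_to_lba(0, 0, 1))                # the very first block -> 0
print(lba_to_chs(chs_to_lba(5, 3, 42)))   # round-trips to (5, 3, 42)
```

The host only ever sees the left-hand side of this mapping; everything on the right stays inside the drive.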

Logical block addressing also helps the on-disk controller map out bad blocks without the host system worrying about them. This is unwanted behavior in some cases, though.
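A toy sketch of that remapping idea (real controllers keep the map in reserved media and do much more, but the principle is the same): the host keeps using the same logical block number, and the controller silently redirects retired blocks to a spare area.

```python
# Toy model of bad-block remapping behind the LBA interface.
# Blocks 0..999 are the normal area; physical blocks from 1000 up
# are spares (numbers chosen purely for illustration).

class BlockRemapper:
    def __init__(self, spare_start):
        self.spare_next = spare_start   # next unused spare physical block
        self.remap = {}                 # logical LBA -> spare physical block

    def mark_bad(self, lba):
        """Retire a failing block and redirect its LBA to a spare."""
        self.remap[lba] = self.spare_next
        self.spare_next += 1

    def physical(self, lba):
        """Translate a logical block address to the physical block used."""
        return self.remap.get(lba, lba)

disk = BlockRemapper(spare_start=1000)
disk.mark_bad(42)
print(disk.physical(42))   # redirected into the spare area
print(disk.physical(43))   # untouched, maps to itself
```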

For Solid State Disks, or flash drives, there are no spindles, as you might guess. There are just a few chips under the hood storing and retrieving blocks of information, with a bit of glue logic to attach those storage chips to the bus. The blocks can again be accessed by LBA, and the controller translates that into a chip, sector/group and block address depending on the flash chip organization. Let me try to cover SSDs in more detail in a separate post. A typical small-scale example of an SSD is our USB pen drive or SD card.
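Just to show the flavor of that translation, here is a sketch over a hypothetical flash layout (the geometry constants are invented; real flash translation layers add wear leveling, garbage collection and mapping tables on top):

```python
# Decompose a flat LBA into (chip, erase block, page) indices for a
# hypothetical flash organization. Constants are illustrative only.

PAGES_PER_BLOCK = 64
BLOCKS_PER_CHIP = 1024

def lba_to_flash(lba):
    chip, rest = divmod(lba, BLOCKS_PER_CHIP * PAGES_PER_BLOCK)
    block, page = divmod(rest, PAGES_PER_BLOCK)
    return chip, block, page

print(lba_to_flash(0))       # first page of the first block on chip 0
print(lba_to_flash(65536))   # first page on chip 1
```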


Storage: The DAS stack

Let’s talk about a typical Direct Attached Storage stack from a server system’s perspective.

Let’s cover the storage stack in both software and hardware layers, which can later help us bring in the networked storage concepts with ease.

Let’s take a quick look at the stack for ease of understanding.

The Storage Subsystem

The storage subsystem stack diagram has software components in blue and hardware components in red. As common sense suggests, the two are in just the reverse order of each other. In a way, this stack might look just like a networking stack. Let’s walk through the maze.

When an application requests some operation on a file, the request is passed on to the filesystem layers. In Linux, for example, there is a Virtual File System (VFS) layer which then switches to the actual filesystem driver to do the job. The filesystem then starts working its magic on its internal data structures to figure out where and how to go about serving the request. One good example is looking up an inode and its indirect blocks. Let’s save some of this for a detailed discussion later.
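The inode lookup just mentioned can be sketched roughly like this, following the classic ext2-style layout (12 direct pointers, then a single-indirect block; real filesystems add double and triple indirection, which I skip here):

```python
# Hedged sketch of mapping a byte offset in a file to "which inode
# pointer holds the block", ext2-style. Constants follow the classic
# layout; double/triple indirection is deliberately left out.

BLOCK_SIZE = 4096
DIRECT_POINTERS = 12
PTRS_PER_BLOCK = BLOCK_SIZE // 4   # 4-byte block pointers per block

def locate_block(file_offset):
    """Return ('direct', i) or ('single-indirect', i) for the offset."""
    logical_block = file_offset // BLOCK_SIZE
    if logical_block < DIRECT_POINTERS:
        return ('direct', logical_block)
    logical_block -= DIRECT_POINTERS
    if logical_block < PTRS_PER_BLOCK:
        return ('single-indirect', logical_block)
    raise NotImplementedError("double/triple indirect not sketched here")

print(locate_block(0))           # start of the file: first direct pointer
print(locate_block(12 * 4096))   # just past the direct area
```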

The filesystem then asks the disk accessing components, e.g. SATA or SCSI, for the blocks it’s interested in. Once the request arrives here, we are only talking in terms of block numbers, be it LBA or some other mechanism. We don’t know anything about the filesystem at all – tell the block number and get it, read or write. The disk accessing subsystem then does what it can to serve the request well, often rearranging the commands into a more sequential order to improve performance, and finally places the request packet with the HCI.
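That rearranging is worth a tiny sketch. A one-pass “elevator”/SCAN flavor (a deliberate simplification of what real I/O schedulers do) sorts the pending block numbers in the direction of head travel, turning scattered seeks into a mostly sequential sweep:

```python
# Minimal elevator-style reordering of pending block requests.
# Serve blocks at or beyond the head position in ascending order,
# then sweep back through the remaining lower blocks.

def elevator_order(pending, head_position):
    ahead = sorted(b for b in pending if b >= head_position)
    behind = sorted((b for b in pending if b < head_position), reverse=True)
    return ahead + behind

print(elevator_order([95, 180, 34, 119, 11, 123, 62, 64], head_position=50))
```

Served in that order, the head makes one outward pass and one return pass instead of bouncing across the platter for every request.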

The host controller interface (HCI), or host bus adapter (HBA) in some cases such as SCSI, takes in the packet; it takes the series of commands for the disks, the appropriate buffers and the disk ID as input, and queues up the commands to be placed with the disk along with the disk ID. The appropriate queue is then passed on as a bunch of bytes/words to the bus interface driver. In some cases the HCI is a really complex piece of software – for SATA, for instance, it contains many layers, such as a transport layer and a link layer, within itself.

Finally, the data to be sent on the bus arrives at the bus drivers. The bus drivers are often simple ones that fiddle with a few flags, place the data on the bus – often via DMA, the direct memory access subsystem – and then let things go. When the DMA is done, an IRQ comes up and informs the driver that the operation is complete. The driver then returns control to the HCI driver, and the HCI driver might wait for its own IRQ telling it the command queue was sent successfully to the disk, or just return control to the disk driver. The disk driver then has to wait for the disk to finish the job, with an IRQ telling it that it’s complete. Only then can control reach back to the filesystem, reporting that the job was well done and the day is going good.

We skipped the RAID component; let’s talk about it in a while.

Talking about the hardware, let’s start with the disks. The disks are just stores for the data, with a bit of electronics glued on to transfer data in and out. Often, disks take a single Logical Block Address and return the appropriate block by maintaining an internal LBA-to-track/sector mapping. In the initial days, disks had only minimal electronics, but later the Integrated Drive Electronics (IDE) interface came into the picture. In the same way, the Small Computer System Interface (SCSI) came into existence. Though this is a dinosaur-age story, it might just amuse you a bit: although SCSI is a general-purpose interface for the system, it’s mostly used for mass storage, and most storage systems emulate SCSI – USB pen drives, for example. Then came along the AT Attachment or ATA (AT coming from PC-AT), and then came the Serial ATA or SATA.

SATA, however, has its own protocol stack, though it supports a legacy mode in which the system can access a SATA subsystem as if it were PATA. As the SATA HCI can be operated in both a backward-compatibility mode with PATA and an advanced HCI mode, the controller includes both standard IDE electronics and advanced HCI components.

Finally, the bus interfaces. Most of today’s systems run on top of PCIe, or Peripheral Component Interconnect Express. In a few words, it’s an advanced serial I/O bus system that interconnects the CPU with the peripherals at very high speed. Please visit http://en.wikipedia.org/wiki/PCI_Express for a quick look. Some posts on it a bit later.

In the next post, let’s talk about blocks, RAID and stripes.

Storage: The first words – DAS, NAS & SAN

Let’s discuss storage for a while. No, I am not talking about a food store or cold storage. Let’s talk about electronic data storage.

As most of us know, data is stored in the computer in at least two places. One is a temporary store and the other is permanent. The temporary store holds the data and the program like a book opened for reading/writing. This resource is often limited in size, volatile, and pretty costly. On the other hand, the permanent store is relatively cheap – roughly, the cost per MB of the temporary store is comparable to the cost per GB of the permanent store. The permanent (or secondary) store is also abundant and non-volatile, but significantly slower.

Let’s call the temporary store ‘the RAM’, which collectively refers to SDRAM, SRAM, DRAM, the L1/L2/L3 caches and similar stores. And let’s call the permanent store ‘the disk’, which includes hard disks, optical disks and similar devices.

The way information is stored in the RAM and on the disk is also significantly different. The RAM is often divided into pages and segments and addressed as a linear address space (by linear, I mean all of the logical, virtual and physical address spaces).

The way information is stored on the disk is a bit different and non-linear. The disk itself is organized into tracks and sectors to store the information physically. Logically, information or data is organized in files and directories. Though there are other tiny stores, they mostly hold metadata, like boot records or partition information.

The way of mapping the tracks and sectors into files and directories is called a filesystem. There is more than one way to deal with such mapping, hence the different filesystems – ext3, FAT, ReiserFS, NTFS, or lesser-known, proprietary filesystems such as WAFL.

The way to attach the tracks, sectors and disks to the computing system is also diverse: Direct Attached Storage, Network Attached Storage and Storage Area Networks. The best way to study the storage stack is to start off with Direct Attached Storage. Networked storage works by inserting the network either between the application and the filesystem (NAS) or between the filesystem and the disks (SAN).


Let’s talk about DAS first, then. IMO, the easiest diagram is the one we have above. Though the diagram is just a few blocks, I’ll talk through it here, and I intend to reuse it again and again, at least for a while.

The application never needs to know about the underlying storage architecture. The application needs the secondary store for at least two purposes: first, to get the program instructions, i.e. the application executable itself or shared libraries; second, to store and retrieve user data, such as databases, documents or spreadsheets.

The moment it starts execution, it might need program instructions, such as a shared library or some part of the executable file itself. For this part, the application does not need to deal with the filesystem at all. Here the application simply calls the required routine, and the operating system worries about talking to the filesystem and finding its way through the maze of tracks and sectors.

For the user data, however, the application needs to worry about the filesystem, though to a minimal extent. The application asks the operating system to open a given file, read from it or write to it, create or delete it, and close it when the job’s done. The operating system and the underlying filesystem machinery do the dirty job, just like a clerk at the filing cabinets. You ask for the required file and the clerk is happy to get it for you; if it isn’t available you may ask him to add a new one; he opens the file, reads or modifies it as you ask, and returns it to the cabinet when done.

In a similar way, the application gets the file (a file handle, more precisely), reads or modifies it and closes it when done. All the application worries about is whether the file at a given path exists, whether it can be accessed, and if so, opening, reading or modifying it, rinse and repeat until the job is done, then closing the file. It never needs to worry about what happens under the hood; it need not even worry about where the file is actually stored. At times the application does not use data files directly but uses a database instead; then the application simply talks to the database program, which worries about storing and retrieving data from the filesystems.
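The clerk analogy above is the whole application-level view of storage, and it fits in a few lines (the path and contents here are invented for the example):

```python
# The application's entire view of storage: ask the OS for a handle,
# read/write through it, close it. Filesystem, block layer, HBA and
# bus are all invisible from up here.

import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # a made-up file

with open(path, "w") as f:      # "clerk, open a file for me"
    f.write("hello, disk\n")    # modify it

with open(path) as f:           # ask for it again later
    data = f.read()             # the clerk fetches it; no tracks or sectors in sight
print(data)

os.remove(path)                 # done with it; take it off the shelf
```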

Databases, often the small-scale ones, store their data, records and tables in files on the disks. Large-scale databases, however, may toss off the filesystem, take the tracks and sectors of the disk directly and manage them using their own filesystem. In either case, there is some sort of filesystem under the hood managing the tracks and sectors.

The OS, with the help of filesystem code, often in the form of filesystem drivers, translates the application’s requests for files into tracks and sectors, retrieves or writes the required ones and thus manages them. How does the filesystem track and manage the tracks and sectors? How does the mapping happen between logical entities such as files and directories and physical tracks and sectors? Let’s worry about that in a while. For now, let’s continue to the next layer.

The RAID/HBA layer deals with translating the operating system’s request into a language that’s understood by the disks and disk interfaces. Often, these sit on top of stacked bus interfaces, such as SATA on a PCI-X bus. The HBA then talks to the disks in their own language, such as LBA, or Logical Block Addressing.
