Let’s discuss storage for a while. No, I am not talking about food stores or cold storage. Let’s talk about electronic data storage.
As most of us know, a computer stores data in at least two places. One is the temporary store and the other is the permanent store. The temporary store holds the data and the program, like a book opened for reading and writing. This resource is often limited in size, volatile and quite expensive. The permanent store, on the other hand, is relatively cheap: roughly speaking, a megabyte of temporary store costs about as much as a gigabyte of permanent store. The permanent store, or secondary store, is also abundant and non-volatile, but significantly slower.
Let’s call the temporary store ‘the RAM’, which collectively refers to SDRAM, SRAM, DRAM, the L1/L2/L3 caches and similar stores. And let’s call the permanent store ‘the disk’, which includes hard disks, optical disks and similar devices.
The way information is stored in the RAM and on the disk is also significantly different. The RAM is often divided into pages and segments and addressed as a linear address space (by linear, I mean all of the logical, virtual and physical address spaces).
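To make the paging idea concrete, here is a minimal sketch of how a linear address splits into a page number and an offset within that page. The 4 KiB page size and the function name `split_address` are assumptions made for illustration only; real systems use various page sizes.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (4 KiB); not taken from the text


def split_address(addr, page_size=PAGE_SIZE):
    """Split a linear address into (page number, offset within the page)."""
    # divmod returns the quotient (page number) and remainder (offset).
    return divmod(addr, page_size)


page, offset = split_address(0x12345)
print(page, offset)  # 18 837, i.e. page 0x12, offset 0x345
```

Because the address space is linear, the split is plain integer division; nothing like tracks or sectors is involved.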
The way information is stored on the disk is a bit different and non-linear. The disk itself is organized into tracks and sectors to store the information physically. Logically, the information or data is organized into files and directories. There are a few other tiny stores too, but they hold mostly metadata, like boot records or partition information.
The way tracks and sectors are mapped into files and directories is usually called a filesystem. There is more than one way to do such a mapping, hence the different filesystems: ext3, FAT, ReiserFS, NTFS, or lesser-known proprietary filesystems such as WAFL.
The ways to attach the tracks, sectors and disks to the computing system are also diverse: Direct Attached Storage (DAS), Network Attached Storage (NAS) and Storage Area Networks (SAN). The best way to study the storage stack is to start with Direct Attached Storage. Networked storage works by inserting the network either between the application and the filesystem (NAS) or between the filesystem and the disks (SAN).
Let’s talk about DAS first, then. IMO, the easiest diagram is the one we have above. Though it is just a few blocks, it is what I’m going to talk through here, and I intend to reuse it again and again, at least for a while.
The application never needs to know about the underlying storage architecture. It needs the secondary store for at least two purposes. The first is to load program instructions, such as the application itself or its shared libraries. The other is to store and retrieve user data, such as databases, documents or spreadsheets.
The moment the application starts executing, it might need program instructions from a shared library or from some part of the executable file itself. For this part, the application does not need to deal with the filesystem at all. It simply calls the required routine, and the operating system worries about talking to the filesystem and finding its way through the maze of tracks and sectors.
For the user data part, however, the application does need to worry about the filesystem, though only to a minimal extent. The application needs to ask the operating system to open a given file, read from it or write to it, delete it or create it, and close it when the job’s done. The operating system and the underlying filesystem code do the dirty work, just like a clerk at the filing cabinets. You ask for the file you need and the clerk is happy to get it for you; if it is not available, you may ask for a new one to be added. He opens the file, reads or modifies it as you ask, and returns it to the cabinet when done. In a similar way, the application gets the file (a file handle, more precisely), reads or modifies it, and closes it when done. All the application worries about is whether the file at the given path exists and, if it exists, whether it can be accessed. If it can, the application opens it, reads or modifies it, rinses and repeats till the job is done, and then closes the file. It never needs to worry about what happens under the hood; it may not even care where the file is actually stored. At times the application does not use data files at all but a database instead. In that case the application simply talks to the database program, which worries about storing and retrieving data from the filesystem.
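The conversation with the clerk can be sketched in a few lines of Python. This is only an illustration of the check, open, write, read, close cycle described above; the helper name `append_and_read` and the file name `note.txt` are made up for the example.

```python
import os
import tempfile


def append_and_read(path, text):
    """Append text to the file at path (creating it if needed), return all of it."""
    # Ask the OS for a file handle; mode "a" creates the file when it is missing.
    handle = open(path, "a", encoding="utf-8")
    try:
        handle.write(text)
    finally:
        handle.close()  # hand the file back to the "clerk"
    # Ask for the same file again, this time only to read it.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "note.txt")
    existed_before = os.path.exists(path)   # does the file exist yet?
    first = append_and_read(path, "hello\n")
    second = append_and_read(path, "world\n")
    os.remove(path)                          # delete it when the job is done
    existed_after = os.path.exists(path)

print(existed_before, repr(first), repr(second), existed_after)
```

Note that the code only ever names a path. Which tracks and sectors end up holding `note.txt` is entirely the operating system's business.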
Databases, usually the small-scale ones, store their data, records and tables in files on the disk. Large-scale databases, however, may bypass the filesystem, take tracks and sectors off the disk directly and manage them with a filesystem of their own. Either way, there has to be some sort of filesystem under the hood to manage the tracks and sectors.
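SQLite is a handy example of the small-scale case: the whole database lives in one ordinary file. The sketch below uses Python's standard `sqlite3` module; the table name `people`, the file name `people.db` and the helper `store_and_count` are assumptions made up for the illustration.

```python
import os
import sqlite3
import tempfile


def store_and_count(db_path, names):
    """Insert the given names into a file-backed database, return the row count."""
    conn = sqlite3.connect(db_path)  # creates the file if it does not exist
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT)")
        conn.executemany(
            "INSERT INTO people (name) VALUES (?)", [(n,) for n in names]
        )
        conn.commit()
        (count,) = conn.execute("SELECT COUNT(*) FROM people").fetchone()
        return count
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as workdir:
    db_path = os.path.join(workdir, "people.db")
    row_count = store_and_count(db_path, ["alice", "bob"])
    db_size = os.path.getsize(db_path)  # the database is just a file on disk

print(row_count, db_size > 0)
```

The application talks SQL to the database library, and the library in turn talks to the filesystem like any other application would.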
The OS, with the help of filesystem code, often in the form of filesystem drivers, translates the application’s requests for files into tracks and sectors, retrieves or writes the ones required, and thus manages them. How does the filesystem keep track of and manage the tracks and sectors? How does the mapping happen from logical entities such as files and directories to physical tracks and sectors? Let’s worry about those in a while. For now, let’s move on to the next layer.
The RAID controller or HBA (Host Bus Adapter) translates the operating system’s requests into a language that is understood by the disks and disk interfaces. Often it sits on top of stacked bus interfaces, such as SATA on a PCI-X bus. The HBA then talks to the disks in the language of the disks, such as LBA, or Logical Block Addressing.
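To connect LBA back to the tracks and sectors mentioned earlier, here is a small sketch of the classic conversion between a cylinder/head/sector (CHS) address and a logical block address. The geometry values (16 heads per cylinder, 63 sectors per track) are assumed for illustration, and the function names are made up; modern disks report only a notional geometry, if any.

```python
HEADS_PER_CYLINDER = 16  # assumed geometry, for illustration only
SECTORS_PER_TRACK = 63   # assumed geometry, for illustration only


def chs_to_lba(c, h, s, hpc=HEADS_PER_CYLINDER, spt=SECTORS_PER_TRACK):
    """Convert cylinder/head/sector to a logical block address.

    Sectors are numbered from 1, cylinders and heads from 0.
    """
    return (c * hpc + h) * spt + (s - 1)


def lba_to_chs(lba, hpc=HEADS_PER_CYLINDER, spt=SECTORS_PER_TRACK):
    """Convert a logical block address back to (cylinder, head, sector)."""
    c, rest = divmod(lba, hpc * spt)
    h, s0 = divmod(rest, spt)
    return c, h, s0 + 1


print(chs_to_lba(0, 0, 1))  # 0: the very first sector on the disk
print(chs_to_lba(2, 3, 4))  # 2208
print(lba_to_chs(2208))     # (2, 3, 4)
```

With LBA the host simply asks for block number N, and the question of where that block physically sits is left to the disk.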