
HOWTO: Multi Disk System Tuning
Stein Gjoen, sgjoen@nyx.net
v0.33a, 20 May 2002

This document describes how best to use multiple disks and partitions for a Linux system. Although some of this text is Linux specific the general approach outlined here can be applied to many other multi tasking operating systems.


Table of Contents

  1. Introduction
    1.1 Copyright
    1.2 Disclaimer
    1.3 News
    1.4 Credits
    1.5 Translations
  2. Structure
    2.1 Logical structure
    2.2 Document structure
    2.3 Reading plan
  3. Drive Technologies
    3.1 Drives
    3.2 Geometry
    3.3 Media
      3.3.1 Magnetic Drives
      3.3.2 Optical Drives
      3.3.3 Solid State Drives
    3.4 Interfaces
      3.4.1 MFM and RLL
      3.4.2 ESDI
      3.4.3 IDE and ATA
      3.4.4 EIDE, Fast-ATA and ATA-2
      3.4.5 Ultra-ATA
      3.4.6 Serial-ATA
      3.4.7 ATAPI
      3.4.8 SCSI
    3.5 Cabling
    3.6 Host Adapters
    3.7 Multi Channel Systems
    3.8 Multi Board Systems
    3.9 Speed Comparison
      3.9.1 Controllers
      3.9.2 Bus Types
    3.10 Benchmarking
    3.11 Comparisons
    3.12 Future Development
    3.13 Recommendations
  4. File System Structure
    4.1 File System Features
      4.1.1 Swap
      4.1.2 Temporary Storage (/tmp and /var/tmp)
      4.1.3 Spool Areas (/var/spool/news and /var/spool/mail)
      4.1.4 Home Directories (/home)
      4.1.5 Main Binaries (/usr/bin and /usr/local/bin)
      4.1.6 Libraries (/usr/lib and /usr/local/lib)
      4.1.7 Boot
      4.1.8 Root
      4.1.9 DOS etc.
    4.2 Explanation of Terms
      4.2.1 Speed
      4.2.2 Reliability
      4.2.3 Files
  5. File Systems
    5.1 General Purpose File Systems
      5.1.1 minix
      5.1.2 xiafs and extfs
      5.1.3 ext2fs
      5.1.4 ext3fs
      5.1.5 ufs
      5.1.6 efs
      5.1.7 XFS
      5.1.8 reiserfs
      5.1.9 enh-fs
      5.1.10 Tux2 fs
    5.2 Microsoft File Systems
      5.2.1 fat
      5.2.2 fat32
      5.2.3 vfat
      5.2.4 ntfs
    5.3 Logging and Journaling File Systems
    5.4 Read-only File Systems
      5.4.1 High Sierra
      5.4.2 iso9660
      5.4.3 Rock Ridge
      5.4.4 Joliet
      5.4.5 Trivia
      5.4.6 UDF
    5.5 Networking File Systems
      5.5.1 NFS
      5.5.2 AFS
      5.5.3 Coda
      5.5.4 nbd
      5.5.5 enbd
      5.5.6 GFS
    5.6 Special File Systems
      5.6.1 tmpfs and swapfs
      5.6.2 userfs
      5.6.3 devfs
      5.6.4 smugfs
    5.7 File System Recommendations
  6. Technologies
    6.1 RAID
      6.1.1 SCSI-to-SCSI
      6.1.2 PCI-to-SCSI
      6.1.3 Software RAID
      6.1.4 RAID Levels
    6.2 Volume Management
    6.3 Linux md Kernel Patch
    6.4 Compression
    6.5 ACL
    6.6 cachefs
    6.7 Translucent or Inheriting File Systems
    6.8 Physical Track Positioning
      6.8.1 Disk Speed Values
    6.9 Yoke
    6.10 Stacking
    6.11 Recommendations
  7. Other Operating Systems
    7.1 DOS
    7.2 Windows
    7.3 OS/2
    7.4 NT
    7.5 Windows 2000
    7.6 Sun OS
      7.6.1 Sun OS 4
      7.6.2 Sun OS 5 (aka Solaris)
    7.7 BeOS
  8. Clusters
  9. Mount Points
  10. Considerations and Dimensioning
    10.1 Home Systems
    10.2 Servers
      10.2.1 Home Directories
      10.2.2 Anonymous FTP
      10.2.3 WWW
      10.2.4 Mail
      10.2.5 News
      10.2.6 Others
      10.2.7 Server Recommendations
    10.3 Pitfalls
  11. Disk Layout
    11.1 Selection for Partitioning
    11.2 Mapping Partitions to Drives
    11.3 Sorting Partitions on Drives
    11.4 Optimizing
      11.4.1 Optimizing by Characteristics
      11.4.2 Optimizing by Drive Parallelising
    11.5 Compromises
  12. Implementation
    12.1 Checklist
    12.2 Drives and Partitions
    12.3 Partitioning
    12.4 Repartitioning
    12.5 Microsoft Partition Bug
    12.6 Multiple Devices (md)
    12.7 Formatting
    12.8 Mounting
    12.9 fstab
    12.10 Mount options
    12.11 Recommendations
  13. Maintenance
    13.1 Backup
    13.2 Defragmentation
    13.3 Deletions
    13.4 Upgrades
    13.5 Recovery
    13.6 Rescue Disk
  14. Advanced Issues
    14.1 Hard Disk Tuning
    14.2 File System Tuning
    14.3 Spindle Synchronizing
  15. Troubleshooting
    15.1 During Installation
      15.1.1 Locating Disks
      15.1.2 Formatting
    15.2 During Booting
      15.2.1 Booting fails
      15.2.2 Getting into Single User Mode
    15.3 During Running
      15.3.1 Swap
      15.3.2 Partitions
  16. Further Information
    16.1 News groups
    16.2 Mailing Lists
    16.3 HOWTO
    16.4 Mini-HOWTO
    16.5 Local Resources
    16.6 Web Pages
    16.7 Search Engines
  17. Getting Help
  18. Concluding Remarks
    18.1 Coming Soon
    18.2 Request for Information
    18.3 Suggested Project Work
  19. Questions and Answers
  20. Bits and Pieces
    20.1 Swap Partition: to Use or Not to Use
    20.2 Mount Point and /mnt
    20.3 Power and Heating
    20.4 Deja
    20.5 Crash Recovery

  21. Appendix A: Partitioning Layout Table: Mounting and Linking
  22. Appendix B: Partitioning Layout Table: Numbering and Sizing
  23. Appendix C: Partitioning Layout Table: Partition Placement
  24. Appendix D: Example: Multipurpose Server
  25. Appendix E: Example: Mounting and Linking
  26. Appendix F: Example: Numbering and Sizing
  27. Appendix G: Example: Partition Placement
  28. Appendix H: Example II
  29. Appendix I: Example III: SPARC Solaris
  30. Appendix J: Example IV: Server with 4 Drives
  31. Appendix K: Example V: Dual Drive System
  32. Appendix L: Example VI: Single Drive System
  33. Appendix M: Disk System Documenter

  1. Introduction

For unclear reasons this brand new release is codenamed the Taylor3 release.

New code names will appear as per industry standard guidelines to emphasize the state-of-the-art-ness of this document.

This document was written for two reasons, mainly because I got hold of 3 old SCSI disks to set up my Linux system on and I was pondering how best to utilise the inherent possibilities of parallelizing in a SCSI system. Secondly I hear there is a prize for people who write documents...

This is intended to be read in conjunction with the Linux Filesystem Structure Standard (FSSTND). It does not in any way replace it but tries to suggest where physically to place directories detailed in the FSSTND, in terms of drives, partitions, types, RAID, file system (fs), physical sizes and other parameters that should be considered and tuned in a Linux system, ranging from single home systems to large servers on the Internet.

The followup to FSSTND is called the Filesystem Hierarchy Standard (FHS) and covers more than Linux alone. FHS versions 2.0, 2.1 and 2.2 have been released but there are still a few issues to be dealt with. Many recent distributions are now aiming for FHS compliance.

It is also a good idea to read the Linux Installation guides thoroughly and if you are using a PC system, which I guess the majority still does, you can find much relevant and useful information in the FAQs for the newsgroup comp.sys.ibm.pc.hardware especially for storage media.

This is also a learning experience for myself and I hope I can start the ball rolling with this HOWTO and that it perhaps can evolve into a larger more detailed and hopefully even more correct HOWTO.

First of all we need a bit of legalese. Recent development shows it is quite important.

1.1. Copyright

This document is Copyright 1996 Stein Gjoen. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

If you have any questions, please contact <linuxhowto@metalab.unc.edu>

1.2. Disclaimer

Use the information in this document at your own risk. I disavow any potential liability for the contents of this document. Use of the concepts, examples, and/or other content of this document is entirely at your own risk.

All copyrights are owned by their owners, unless specifically noted otherwise. Use of a term in this document should not be regarded as affecting the validity of any trademark or service mark.

Naming of particular products or brands should not be seen as endorsements.

You are strongly recommended to take a backup of your system before major installation and backups at regular intervals.

1.3. News

This is a major upgrade featuring a new copyright statement that is intended to be Debian compliant and allow for inclusion in their distribution. A number of mistakes are corrected and new features added such as descriptions of recent ATA features and more.

On the development front people are concentrating their energy towards completing Linux 2.4 and until that is released there is not going to be much news on disk technology for Linux.

Also now the document is available in postscript both for US letter as well as European A4 formats.

The latest version number of this document can be gleaned from my plan entry if you finger
<http://www.mit.edu:8001/finger?sgjoen@nox.nyx.net> my Nyx account.

Also, the latest version will be available on my web space on Nyx in a number of formats:

A European mirror of the Multi Disk HOWTO <http://home.online.no/~ggjoeen/stein/disk.html> just went on line.

1.4. Credits

In this version I have the pleasure of acknowledging even more people who have contributed in one way or another:

ronnej (at) ucs.orst.edu
cm (at) kukuruz.ping.at
armbru (at) pond.sub.org
R.P.Blake (at) open.ac.uk
neuffer (at) goofy.zdv.Uni-Mainz.de
sjmudd (at) redestb.es
nat (at) nataa.fr.eu.org
sundbyk (at) oslo.geco-prakla.slb.com
ggjoeen (at) online.no
mike (at) i-Connect.Net
roth (at) uiuc.edu
phall (at) ilap.com
szaka (at) mirror.cc.u-szeged.hu
CMckeon (at) swcp.com
kris (at) koentopp.de
edick (at) idcomm.com
pot (at) fly.cnuce.cnr.it
earl (at) sbox.tu-graz.ac.at
ebacon (at) oanet.com
vax (at) linkdead.paranoia.com
tschenk (at) theoffice.net
pjfarley (at) dorsai.org
jean (at) stat.ubc.ca
johnf (at) whitsunday.net.au
clasen (at) unidui.uni-duisburg.de
eeslgw (at) ee.surrey.asc.uk
adam (at) onshore.com
anikolae (at) wega-fddi2.rz.uni-ulm.de
cjaeger (at) dwave.net
eperezte (at) c2i.net
yesteven (at) ms2.hinet.net
cj (at) samurajdata.se
tbotond (at) netx.hu
russel (at) coker.com.au
lars (at) iar.se
GALLAGS3 (at) labs.wyeth.com
morimoto (at) xantia.citroen.org
shulegaa (at) gatekeeper.txl.com
roman.legat (at) stud.uni-hannover.de
ahamish (at) hicks.alien.usr.com
hduff2 (at) worldnet.att.net
mbaehr (at) email.archlab.tuwien.ac.at
adc (at) postoffice.utas.edu.au
pjm (at) bofh.asn.au
jochen.berg (at) ac.com
jpotts (at) us.ibm.com
jarry (at) gmx.net
LeBlanc (at) mcc.ac.uk
masy (at) webmasters.gr.jp
karlheg (at) hegbloom.net
goeran (at) uddeborg.pp.se
wgm (at) telus.net

1.5. Translations

Special thanks go to nakano (at) apm.seikei.ac.jp for doing the Japanese translation <http://www.linux.or.jp/JF/JFdocs/Multi-DiskHOWTO.html>, general contributions as well as contributing an example of a computer in an academic setting, which is included at the end of this document.

There are now many new translations available and special thanks go to the translators for the job and the input they have given:

ICP Vortex is gratefully acknowledged for sending in-depth information on their range of RAID controllers.

Also DPT is acknowledged for sending me documentation on their controllers as well as permission to quote from the material. These quotes have been approved before appearing here and will be clearly labelled. No quotes as of yet but that is coming.

Not many still, so please read through this document, make a contribution and join the elite. If I have forgotten anyone, please let me know.

New in this version is an appendix with a few tables you can fill in for your system in order to simplify the design process.

Any comments or suggestions can be mailed to my mail address on Nyx: sgjoen@nyx.net.

So let's cut to the chase where swap and /tmp are racing along hard drive...

2. Structure

As this type of document is supposed to be as much for learning as a technical reference document I have rearranged the structure to this end. For the designer of a system it is more useful to have the information presented in terms of the goals of this exercise than from the point of view of the logical layer structure of the devices themselves. Nevertheless this document would not be complete without the kind of layer structure the computer field is so full of, so I will include one here as an introduction to how it all works.

It is a long time since the mini in mini-HOWTO could be defended as proper but I am convinced that this document is as long as it needs to be in order to make the right design decisions, and not longer.

2.1. Logical structure

This is based on how the layers access each other, traditionally with the application on top and the physical layer on the bottom. It is quite useful for showing the interrelationship between the layers used in controlling drives.


               |__     File structure          ( /usr /tmp etc)        __|
               |__     File system             (ext2fs, vfat etc)      __|
               |__     Volume management       (AFS)                   __|
               |__     RAID, concatenation     (md)                    __|
               |__     Device driver           (SCSI, IDE etc)         __|
               |__     Controller              (chip, card)            __|
               |__     Connection              (cable, network)        __|
               |__     Drive                   (magnetic, optical etc) __|
               -----------------------------------------------------------

In the above diagram both volume management and RAID and concatenation are optional layers. The 3 lower layers are in hardware. All parts are discussed at length later on in this document.

2.2. Document structure

Most users start out with a given set of hardware and some plans on what they wish to achieve and how big the system should be. This is the point of view I will adopt in this document in presenting the material, starting out with hardware, continuing with design constraints before detailing the design strategy that I have found to work well. I have used this both for my own personal computer at home and for a multi purpose server at work, and found it worked quite well. In addition my Japanese co-worker in this project has applied the same strategy on a server in an academic setting with similar success.

Finally at the end I have detailed some configuration tables for use in your own design. If you have any comments regarding this or notes from your own design work I would like to hear from you so this document can be upgraded.

2.3. Reading plan

Although this is not the biggest HOWTO it is nevertheless rather big already, and I have been requested to make a reading plan to make it possible to cut down on the volume:

     Expert
        (aka the elite). If you are familiar with Linux as well as disk
        drive technologies you will find most of what you need in the
        appendices. Additionally you are recommended to read the FAQ and
        the ``Bits'n'pieces'' chapter.

     Experienced
        (aka Competent). If you are familiar with computers in general
        you can go straight to the chapters on ``technologies'' and
        continue from there on.

     Newbie
        (mostly harmless). You just have to read the whole thing.
        Sorry. In addition you are also recommended to read all the
        other disk related HOWTOs.

3. Drive Technologies

A far more complete discussion on drive technologies for IBM PCs can be found at the home page of The Enhanced IDE/Fast-ATA FAQ <http://thef-nym.sci.kun.nl/~pieterh/storage.html> which is also regularly posted on Usenet News. There is also a site dedicated to ATA and ATAPI Information and Software <http://ata-atapi.com>.

Here I will just present what is needed to get an understanding of the technology and get you started on your setup.

3.1. Drives

This is the physical device where your data lives and although the operating system makes the various types seem rather similar they can in actual fact be very different. An understanding of how it works can be very useful in your design work. Floppy drives fall outside the scope of this document, though should there be a big demand I could perhaps be persuaded to add a little here.

3.2. Geometry

Physically, disk drives consist of one or more platters containing data that is read and written using sensors mounted on movable heads that are fixed with respect to each other. Data transfers therefore happen across all surfaces simultaneously, which defines a cylinder of tracks. The drive is also divided into sectors containing a number of data fields.

Drives are therefore often specified in terms of their geometry: the number of Cylinders, Heads and Sectors (CHS).

For various reasons there are now a number of translations between

  • the physical CHS of the drive itself
  • the logical CHS the drive reports to the BIOS or OS
  • the logical CHS used by the OS

Basically it is a mess and a source of much confusion. For more information you are strongly recommended to read the Large Disk mini-HOWTO.
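To make the idea of geometry translation a little more concrete, here is a minimal sketch of how a CHS geometry maps to a capacity, and how two different geometries can describe the same drive. The 512 byte sector size and both example geometries are assumptions chosen for illustration, not figures taken from any particular drive.

```python
SECTOR_BYTES = 512  # assumed sector size, not stated in the text above


def chs_capacity_bytes(cylinders, heads, sectors, sector_bytes=SECTOR_BYTES):
    """Capacity addressed by a cylinders/heads/sectors geometry."""
    return cylinders * heads * sectors * sector_bytes


# Two hypothetical geometries for the same drive: the one the drive
# reports, and a translated one with fewer cylinders and more heads.
reported = chs_capacity_bytes(cylinders=4092, heads=16, sectors=63)
translated = chs_capacity_bytes(cylinders=1023, heads=64, sectors=63)

print(reported == translated)  # both geometries address the same capacity
print(round(reported / 2**20), "MiB")
```

The point of the sketch is only that the product of the three numbers matters for capacity, which is why a translation layer is free to trade cylinders for heads.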

3.3. Media

The media technology determines important parameters such as read/write rates, seek times, storage size as well as if it is read/write or read only.

3.3.1. Magnetic Drives

This is the typical read-write mass storage medium, and as everything else in the computer world, comes in many flavours with different properties. Usually this is the fastest technology and offers read/write capability. The platter rotates with a constant angular velocity (CAV) with a variable physical sector density for more efficient magnetic media area utilisation. In other words, the number of bits per unit length is kept roughly constant by increasing the number of logical sectors for the outer tracks.
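The effect of a roughly constant bit density at constant angular velocity can be sketched as follows. All numbers here (track radii, linear bit density, rotational speed, 512 byte sectors) are hypothetical and only serve to show the trend: outer tracks hold more sectors and therefore pass more data under the head per revolution.

```python
import math

SECTOR_BYTES = 512  # assumed sector size


def sectors_per_track(radius_mm, bits_per_mm):
    """Whole sectors that fit on one track at a given linear bit density."""
    circumference_mm = 2 * math.pi * radius_mm
    return int(circumference_mm * bits_per_mm // (SECTOR_BYTES * 8))


def track_rate_mb_s(radius_mm, bits_per_mm, rpm):
    """Raw data rate off one track when the platter spins at constant RPM."""
    bytes_per_rev = sectors_per_track(radius_mm, bits_per_mm) * SECTOR_BYTES
    return bytes_per_rev * (rpm / 60) / 1e6


# Hypothetical drive: tracks from 20 mm to 45 mm radius, 4000 bits/mm, 5400 RPM.
inner = track_rate_mb_s(radius_mm=20, bits_per_mm=4000, rpm=5400)
outer = track_rate_mb_s(radius_mm=45, bits_per_mm=4000, rpm=5400)
print(f"inner track: {inner:.1f} MB/s, outer track: {outer:.1f} MB/s")
```

Under these assumptions the ratio between outer and inner rates is close to the ratio of the radii, which is why the position of a partition on the disk can affect its speed.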

Typical values for rotational speeds are 4500 and 5400 RPM, though 7200 is also used. Very recently also 10000 RPM has entered the mass market. Seek times are around 10 ms, transfer rates quite variable from one type to another but typically 4-40 MB/s. With the extreme high performance drives you should remember that performance costs more electric power which is dissipated as heat, see the point on ``Power and Heating''.

Note that there are several kinds of transfers going on here, and that these are quoted in different units. First of all there is the platter-to-drive cache transfer, usually quoted in Mbits/s. Typical values here are about 50-250 Mbits/s. The second stage is from the built in drive cache to the adapter, and this is typically quoted in MB/s; typical quoted values here are 3-40 MB/s. Note, however, that this figure assumes the data is already in the cache. For sustained readout from the platters the effective transfer rate will be dramatically lower.
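Since the two stages are quoted in different units, it helps to put them on the same scale before comparing. The sketch below converts Mbits/s to MB/s by dividing by 8 (ignoring any encoding overhead, which is a simplification) and treats the sustained rate as the slower of the two stages. The 120 Mbits/s and 33 MB/s figures in the last line are hypothetical.

```python
def mbit_to_mbyte(mbit_per_s):
    """Convert Mbits/s to MB/s, assuming 8 bits per byte and no overhead."""
    return mbit_per_s / 8


def sustained_rate_mb_s(platter_mbit_per_s, interface_mb_per_s):
    """Sustained readout is limited by the slower of the two stages."""
    return min(mbit_to_mbyte(platter_mbit_per_s), interface_mb_per_s)


# The platter-to-cache range quoted above, expressed in MB/s.
for platter in (50, 250):
    print(platter, "Mbits/s =", mbit_to_mbyte(platter), "MB/s")

# Hypothetical drive: 120 Mbits/s off the platter behind a 33 MB/s interface.
print(sustained_rate_mb_s(platter_mbit_per_s=120, interface_mb_per_s=33), "MB/s")
```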

3.3.2. Optical Drives

Optical read/write drives exist but are slow and not so common. They were used in the NeXT machine but the low speed was a source of many complaints. The low speed is mainly due to the thermal nature of the phase change that represents the data storage. Even when using relatively powerful lasers to induce the phase changes the effects are still slower than the magnetic effect used in magnetic drives.

Today many people use CD-ROM drives which, as the name suggests, are read-only. Storage is about 650 MB, transfer speeds are variable, depending on the drive, but can exceed 1.5 MB/s. Data is stored on a single spiraling track so it is not useful to talk about geometry for this. Data density is constant so the drive uses constant linear velocity (CLV). Seek is also slower, about 100 ms, partially due to the spiraling track. Recent high speed drives use a mix of CLV and CAV in order to maximize performance. This also reduces the access time caused by the need to reach the correct rotational speed for readout.
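To see why a CLV drive has to change its rotational speed during a seek, consider the sketch below. The linear velocity of 1.3 m/s and the track radii of 25 mm and 58 mm are assumptions based on commonly quoted figures for single speed CD playback; they are not taken from this document.

```python
import math


def clv_rpm(linear_velocity_m_s, radius_mm):
    """Rotational speed needed to keep a constant linear velocity at a radius."""
    circumference_m = 2 * math.pi * radius_mm / 1000
    return linear_velocity_m_s / circumference_m * 60


# Assumed single speed figures: 1.3 m/s, innermost track at 25 mm,
# outermost track at 58 mm.
inner_rpm = clv_rpm(1.3, radius_mm=25)
outer_rpm = clv_rpm(1.3, radius_mm=58)
print(f"inner: {inner_rpm:.0f} RPM, outer: {outer_rpm:.0f} RPM")
```

With these assumptions the disc has to slow down by more than half between the innermost and outermost track, and the time spent changing speed adds to the seek time.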

A new type (DVD) is on the horizon, offering up to about 18 GB on a single disk.

3.3.3. Solid State Drives

This is a relatively recent addition to the available technology and has been made popular especially in portable computers as well as in embedded systems. Containing no movable parts they are very fast both in terms of access and transfer rates. The most popular type is flash RAM, but other types of RAM are also used. A few years ago many had great hopes for magnetic bubble memories but they turned out to be relatively expensive and are not that common.

In general the use of RAM disks is regarded as a bad idea as it is normally more sensible to add more RAM to the motherboard and let the operating system divide the memory pool into buffers, cache, program and data areas. Only in very special cases, such as real time systems with short time margins, can RAM disks be a sensible solution.

Flash RAM is today available in several tens of megabytes in storage and one might be tempted to use it for fast, temporary storage in a computer. There is however a huge snag with this: flash RAM has a finite life time in terms of the number of times you can rewrite data, so putting swap, /tmp or /var/tmp on such a device will certainly shorten its lifetime dramatically. Instead, using flash RAM for directories that are read often but rarely written to will be a big performance win.

In order to get the optimum life time out of flash RAM you will need to use special drivers that will use the RAM evenly and minimize the number of block erases.
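What "using the RAM evenly" means can be illustrated with a toy model. This is only a sketch of the general idea of wear levelling, not how any real flash driver works: the `WearLeveller` class and its method names are hypothetical, and real drivers also have to track which blocks hold live data.

```python
class WearLeveller:
    """Toy model: every rewrite goes to the least-erased block."""

    def __init__(self, n_blocks):
        self.erase_counts = [0] * n_blocks

    def pick_block(self):
        # Least-erased block; ties go to the lowest block number.
        return min(range(len(self.erase_counts)),
                   key=self.erase_counts.__getitem__)

    def rewrite(self):
        block = self.pick_block()
        self.erase_counts[block] += 1
        return block


leveller = WearLeveller(n_blocks=8)
for _ in range(100):
    leveller.rewrite()

# With levelling the wear is spread out; without it, a file that always
# lives in the same block would have put all 100 erases on that block.
print(leveller.erase_counts)
print("spread:", max(leveller.erase_counts) - min(leveller.erase_counts))
```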

This example illustrates the advantages of splitting up your directory structure over several devices.

Solid state drives have no real cylinder/head/sector addressing but for compatibility reasons this is simulated by the driver to give a uniform interface to the operating system.

3.4. Interfaces

There is a plethora of interfaces to choose from, ranging widely in price and performance. Most motherboards today include an IDE interface, which is part of modern chipsets.

Many motherboards also include a SCSI interface chip made by Symbios (formerly NCR) and that is connected directly to the PCI bus. Check what you have and what BIOS support you have with it.

3.4.1. MFM and RLL

Once upon a time this was the established technology, a time when 20 MB was awesome, which compared to today's sizes makes you think that dinosaurs roamed the Earth with these drives. Like the dinosaurs these are outdated and are slow and unreliable compared to what we have today. Linux does support this but you are well advised to think twice about what you would put on this. One might argue that an emergency partition with a suitable vintage of DOS might be fitting.

3.4.2. ESDI

Actually, ESDI was an adaptation of the very widely used SMD interface used on "big" computers to the cable set used with the ST506 interface, which was more convenient to package than the 60-pin + 26-pin connector pair used with SMD. The ST506 was a "dumb" interface which relied entirely on the controller and host computer to do everything, from computing head/cylinder/sector locations to keeping track of the head location. ST506 required the controller to extract the clock from the recovered data, and to control the physical location of detailed track features on the medium, bit by bit. It had about a 10-year life if you include the use of MFM, RLL, and ERLL/ARLL modulation schemes.

ESDI, on the other hand, had intelligence, often using three or four separate microprocessors on a single drive, and high-level commands to format a track, transfer data, perform seeks, and so on. Clock recovery from the data stream was accomplished at the drive, which drove the clock line and presented its data in NRZ, though error correction was still the task of the controller. ESDI allowed the use of variable bit density recording, or, for that matter, any other modulation technique, since it was locally generated and resolved at the drive. Though many of the techniques used in ESDI were later incorporated in IDE, it was the increased popularity of SCSI which led to the demise of ESDI in computers. ESDI had a life of about 10 years, though mostly in servers and otherwise "big" systems rather than PCs.

3.4.3. IDE and ATA

Progress made the drive electronics migrate from the ISA slot card over to the drive itself and Integrated Drive Electronics was born. It was simple, cheap and reasonably fast so the BIOS designers provided the kind of snag that the computer industry is so full of. A combination of an IDE limitation of 16 heads together with the BIOS limitation of 1024 cylinders gave us the infamous 504 MB limit. Following the computer industry traditions again, the snag was patched with a kludge and we got all sorts of translation schemes and BIOS bodges. This means that you need to read the installation documentation very carefully and check up on what BIOS you have and what date it has, as the BIOS has to tell Linux what size drive you have. Fortunately with Linux you can also tell the kernel directly what size drive you have with the drive parameters; check the documentation for LILO and Loadlin thoroughly. Note also that IDE is equivalent to ATA, AT Attachment. IDE uses CPU-intensive Programmed Input/Output (PIO) to transfer data to and from the drives and has no capability for the more efficient Direct Memory Access (DMA) technology. Highest transfer rate is 8.3 MB/s.
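The arithmetic behind the 504 MB limit can be sketched as follows. Only the 16 heads (IDE) and 1024 cylinders (BIOS) come from the text above; the remaining per-field limits and the 512 byte sector size are assumptions based on commonly quoted figures for the IDE and BIOS interfaces. The usable geometry is the smaller of the two limits in each field.

```python
SECTOR_BYTES = 512  # assumed sector size

# (cylinders, heads, sectors per track) each layer can address.
# Only the 16 heads and the 1024 cylinders come from the text above;
# the other four numbers are assumptions.
IDE_LIMITS = (65536, 16, 255)
BIOS_LIMITS = (1024, 256, 63)


def combined_limit_bytes(a, b, sector_bytes=SECTOR_BYTES):
    """Largest capacity addressable when both sets of limits apply."""
    cylinders, heads, sectors = (min(x, y) for x, y in zip(a, b))
    return cylinders * heads * sectors * sector_bytes


limit = combined_limit_bytes(IDE_LIMITS, BIOS_LIMITS)
print(limit, "bytes =", limit // 2**20, "MiB")  # 528482304 bytes = 504 MiB
```

Neither interface is that restrictive on its own; it is the combination of the two sets of limits that gives such a small number.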

3.4.4. EIDE, Fast-ATA and ATA-2

These 3 terms are roughly equivalent, fast-ATA is ATA-2 but EIDE additionally includes ATAPI. ATA-2 is what most use these days which is faster and with DMA. Highest transfer rate is increased to 16.6 MB/s.

3.4.5. Ultra-ATA

A new, faster DMA mode that is approximately twice the speed of EIDE PIO-Mode 4 (33 MB/s). Disks with and without Ultra-ATA can be mixed on the same cable without speed penalty for the faster adapters. The Ultra-ATA interface is electrically identical with the normal Fast-ATA interface, including the maximum cable length.

ATA/66 was superseded by ATA/100 and very recently we have also gotten ATA/133. While the interface speed has improved dramatically the disks are often limited by platter-to-cache limits which today stand at about 40 MB/s.

For more information read up on these overviews and whitepapers from Maxtor: Fast Drives Technology
<http://www.maxtor.com/products/FastDrive/default.htm> on the ATA/133 interface and Big Drives Technology
<http://www.maxtor.com/products/BigDrive/default.htm> on breaking the 137 GB limit.
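The 137 GB figure follows directly from the width of the sector address. The sketch below assumes 28 bit sector addresses for the older ATA interface, 48 bit addresses for the extended scheme, and 512 byte sectors; these widths are general background on ATA addressing, not something stated in this document.

```python
SECTOR_BYTES = 512  # assumed sector size


def addressable_bytes(address_bits, sector_bytes=SECTOR_BYTES):
    """Capacity reachable with a sector address of the given width."""
    return 2**address_bits * sector_bytes


old_limit = addressable_bytes(28)  # assumed 28 bit addressing
new_limit = addressable_bytes(48)  # assumed 48 bit addressing

print(f"28 bit: {old_limit / 1e9:.1f} GB")
print(f"48 bit: {new_limit / 1e15:.1f} PB")
```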

3.4.6. Serial-ATA

A new standard has been agreed upon, the Serial-ATA interface, backed by The Serial ATA <http://www.serial-ata.org/> group who made the announcement in August 2001.

Advantages are numerous: simple, thin connectors rather than old cumbersome cable mats that also obstructed air flow, higher speeds (about 150 MB/s) and backward compatibility.

3.4.7. ATAPI

The ATA Packet Interface was designed to support CD-ROM drives using the IDE port and like IDE it is cheap and simple.

3.4.8. SCSI

The Small Computer System Interface is a multi purpose interface that can be used to connect drives, disk arrays, printers, scanners and more. The name is a bit of a misnomer as it has traditionally been used by the higher end of the market as well as in work stations since it is well suited for multi tasking environments.

The standard interface is 8 bits wide and can address 8 devices. There is a wide version with 16 bit that is twice as fast on the same clock and can address 16 devices. The host adapter always counts as a device and is usually number 7. It is also possible to have 32 bit wide busses but this usually requires a double set of cables to carry all the lines.

The old standard was 5 MB/s and the newer fast-SCSI increased this to 10 MB/s. Recently ultra-SCSI, also known as Fast-20, arrived with 20 MB/s transfer rates for an 8 bit wide bus. New low voltage differential (LVD) signalling allows these high speeds as well as much longer cabling than before.

Even more recently an even faster standard has been introduced: SCSI 160 (originally named SCSI 160/m) which is capable of a monstrous 160 MB/s over a 16 bit wide bus. Support is still scarce, apart from a few 10000 RPM drives that can transfer 40 MB/s sustained. Putting 6 such drives on a RAID will keep such a bus saturated and also saturate most PCI busses. Obviously this is only for the very highest end servers at present. More information on this standard is available at The Ultra 160 SCSI home page <http://www.ultra160-scsi.com/>.

Adaptec just announced a Linux driver for their SCSI 160 host adapter. Details will be added here as they become available.
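The bus speeds quoted in this section follow from the transfer rate multiplied by the bus width, and from that it is easy to estimate how many drives it takes to fill a bus. In the sketch below the 80 million transfers per second for SCSI 160 is an assumption; the 40 MB/s per drive and the 132 MB/s PCI figure are the values used elsewhere in this document.

```python
import math


def bus_rate_mb_s(mega_transfers_per_s, width_bits):
    """Peak bus rate: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * width_bits / 8


def drives_to_saturate(bus_mb_s, drive_mb_s):
    """Smallest number of drives whose combined rate fills the bus."""
    return math.ceil(bus_mb_s / drive_mb_s)


print(bus_rate_mb_s(5, 8))    # original SCSI, 8 bit
print(bus_rate_mb_s(10, 8))   # fast-SCSI, 8 bit
print(bus_rate_mb_s(20, 8))   # ultra-SCSI (Fast-20), 8 bit
print(bus_rate_mb_s(20, 16))  # the wide variant doubles the rate

scsi160 = bus_rate_mb_s(80, 16)  # assumed 80 MT/s on a 16 bit bus
print(drives_to_saturate(scsi160, drive_mb_s=40))  # SCSI 160 bus
print(drives_to_saturate(132, drive_mb_s=40))      # 32 bit PCI bus
```

By this estimate four such drives already fill either bus, so six drives in a RAID leave a comfortable margin for keeping the bus saturated.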

Now also SCSI/320 is available.

The higher performance comes at a cost that is usually higher than for (E)IDE. The importance of correct termination and good quality cables cannot be overemphasized. SCSI drives also often tend to be of a higher quality than IDE drives. Also, adding SCSI devices tends to be easier than adding more IDE drives: often it is only a matter of plugging or unplugging the device; some people do this without powering down the system. This feature is most convenient when you have multiple systems and you can just take the devices from one system to the other should one of them fail for some reason.

There is a number of useful documents you should read if you use SCSI, the SCSI HOWTO as well as the SCSI FAQ posted on Usenet News.

SCSI also has the advantage that you can connect it easily to tape drives for backing up your data, as well as to some printers and scanners. It is even possible to use it as a very fast network between computers while simultaneously sharing SCSI devices on the same bus. Work is under way but due to problems with ensuring cache coherency between the different computers connected, this is a non trivial task. SCSI numbers are also used for arbitration. If several devices request service at the same time, the device with the highest priority number wins; on an 8 bit bus that is number 7, which is why the host adapter is usually given that number.

Note that newer SCSI cards will simultaneously support an array of different types of SCSI devices all at individually optimized speeds.

3.5. Cabling

I do not intend to make too many comments on hardware but I feel I should make a little note on cabling. This might seem like a remarkably low-tech piece of equipment, yet sadly it is the source of many frustrating problems. At today's high speeds one should think of the cable more as an RF device with its inherent demands on impedance matching. If you do not take precautions you will get much reduced reliability or total failure. Some SCSI host adapters are more sensitive to this than others.

Shielded cables are of course better than unshielded but the price is much higher. With a little care you can get good performance from a cheap unshielded cable.

  • For Fast-ATA and Ultra-ATA, the maximum cable length is specified as 45cm (18"). The data lines of both IDE channels are connected on many boards, though, so they count as one cable. In any case EIDE cables should be as short as possible. If there are mysterious crashes or spontaneous changes of data, it is well worth investigating your cabling. Try a lower PIO mode or disconnect the second channel and see if the problem still occurs.
  • For Cable Select (ATA drives) you set the drive jumpers to cable select and use the cable to determine master and slave. This is not much used.
  • Do not have a slave on an ATA controller (primary or secondary) without a master on the same controller, behaviour in these cases is undetermined.
  • Use as short a cable as possible, but do not forget the 30 cm minimum separation for ultra SCSI and 60 cm separation for differential SCSI.
  • Avoid long stubs between the cable and the drive, connect the plug on the cable directly to the drive without an extension.
  • SCSI Cabling limitations:

       Bus Speed (MHz)         | Max Length (m)
       ------------------------+-----------------------------------------
        5                      |  6
       10  (fast)              |  3
       20  (fast-20 / ultra)   |  3 (max 4 devices), 1.5 (max 8 devices)
       xx  (differential)      | 25 (max 16 devices)
       ------------------------------------------------------------------
  • Use correct termination for SCSI devices and at the correct positions: both ends of the SCSI chain. Remember the host adapter itself may have on board termination.
  • Do not mix shielded and unshielded cabling, do not wrap cables around metal, and try to avoid proximity to metal parts along parts of the cabling. Any such discontinuities can cause impedance mismatching which in turn can cause reflection of signals which increases noise on the cable. This problem gets even more severe in the case of multi channel controllers. Recently someone suggested wrapping bubble plastic around the cables in order to avoid too close proximity to metal, a real problem inside crowded cabinets.
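The SCSI cabling limits in the table above can be turned into a small lookup for checking a planned setup. This is only a sketch that encodes the table as given here; the function name is hypothetical, and anything outside the table is rejected rather than guessed at.

```python
def max_scsi_cable_m(bus_mhz, devices, differential=False):
    """Maximum total cable length in metres, per the table above."""
    if differential:
        if devices > 16:
            raise ValueError("table covers at most 16 differential devices")
        return 25.0
    if devices > 8:
        raise ValueError("table covers at most 8 single-ended devices")
    if bus_mhz <= 5:
        return 6.0
    if bus_mhz <= 10:
        return 3.0
    if bus_mhz <= 20:
        # Ultra: the more devices, the shorter the permitted cable.
        return 3.0 if devices <= 4 else 1.5
    raise ValueError("bus speed not covered by the table")


# Example: an ultra SCSI chain with 6 devices and 2 m of cable is too long.
planned_length_m = 2.0
limit_m = max_scsi_cable_m(bus_mhz=20, devices=6)
print("within limit" if planned_length_m <= limit_m else "cable too long")
```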

More information on SCSI cabling and termination can be found at various web pages around the net.
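
As a rough illustration, the single-ended limits from the cabling table above can be written as a small lookup. This is only a sketch: the function name is made up for this example, and the differential row is left out because the table gives no bus speed for it.

```python
def max_scsi_cable_m(bus_mhz, devices):
    """Return the maximum single-ended SCSI cable length in metres.

    The values come from the cabling table above. Combinations the
    table does not list raise ValueError instead of guessing.
    """
    if bus_mhz == 5:
        return 6.0
    if bus_mhz == 10:
        return 3.0
    if bus_mhz == 20:
        if devices <= 4:
            return 3.0
        if devices <= 8:
            return 1.5
        raise ValueError("the table lists at most 8 devices at 20 MHz")
    raise ValueError("bus speed not listed in the table")


for mhz, devices in [(5, 7), (10, 7), (20, 4), (20, 8)]:
    print(mhz, "MHz,", devices, "devices:", max_scsi_cable_m(mhz, devices), "m")
```

Raising an error for unlisted combinations is a deliberate choice here: a cable that is "probably fine" is exactly the kind that causes mysterious data corruption.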

3.6. Host Adapters

This is the other end of the interface from the drive, the part that is connected to a computer bus. The speed of the computer bus and that of the drives should be roughly similar, otherwise you have a bottleneck in your system. Connecting a RAID 0 disk-farm to an ISA card is pointless. These days most computers come with a 32 bit PCI bus capable of 132 MB/s transfers which should not represent a bottleneck for most people in the near future.

As the drive electronics migrated to the drives the remaining part that became the (E)IDE interface is so small it can easily fit into the PCI chip set. The SCSI host adapter is more complex and often includes a small CPU of its own and is therefore more expensive and not integrated into the PCI chip sets available today. Technological evolution might change this.

Some host adapters come with separate caching and intelligence but as this is basically second guessing the operating system the gains are heavily dependent on which operating system is used. Some of the more primitive ones, that shall remain nameless, experience great gains. Linux, on the other hand, has so much smarts of its own that the gains are much smaller.

Mike Neuffer, who did the drivers for the DPT controllers, states that the DPT controllers are intelligent enough that, given enough cache memory, they will give you a big push in performance. He suggests that people who have experienced little gain with smart controllers just have not used a sufficiently intelligent caching controller.

3.7. Multi Channel Systems

In order to increase throughput it is necessary to identify the most significant bottlenecks and then eliminate them. In some systems, in particular where there are a great number of drives connected, it is advantageous to use several controllers working in parallel, both for SCSI host adapters as well as IDE controllers which usually have 2 channels built in. Linux supports this.

Some RAID controllers feature 2 or 3 channels and it pays to spread the disk load across all channels. In other words, if you have two SCSI drives you want to RAID and a two channel controller, you should put each drive on a separate channel.

3.8. Multi Board Systems

In addition to having both a SCSI and an IDE in the same machine it is also possible to have more than one SCSI controller. Check the SCSIHOWTO on what controllers you can combine. Also you will most likely have to tell the kernel it should probe for more than just a single SCSI or a single IDE controller. This is done using kernel parameters when booting, for instance using LILO. Check the HOWTOs for SCSI and LILO for how to do this.

Multi board systems can offer significant speed gains if you configure your disks right, especially for RAID0. Make sure you interleave the controllers as well as the drives, so that you add drives to the md RAID device in the right order. If controller 1 is connected to drives sda and sdb while controller 2 is connected to drives sdc and sdd you will gain more parallelism by adding in the order of sda - sdc - sdb - sdd rather than sda - sdb - sdc - sdd because a read or write over more than one cluster will be more likely to span two controllers.

The same methods can also be applied to IDE. Most motherboards typically support 4 IDE devices:

  • hda primary master
  • hdb primary slave
  • hdc secondary master
  • hdd secondary slave

    where the two primaries share one flat cable and the secondaries share another cable. Modern chipsets keep these independent. Therefore it is best to RAID in the order hda - hdc - hdb - hdd as this will most likely parallelise both channels.
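
The ordering rule above amounts to taking drives round-robin from each channel. A minimal sketch of that idea follows; the function name and the dictionary layout are assumptions made for this example only, not part of any RAID tool.

```python
from itertools import zip_longest


def interleave_by_channel(channels):
    """Order drives round-robin across channels.

    `channels` maps a channel name to its drives in cable order, so
    that neighbouring members of the array sit on different channels
    wherever possible.
    """
    order = []
    for group in zip_longest(*channels.values()):
        # Skip the padding zip_longest adds for channels with fewer drives.
        order.extend(drive for drive in group if drive is not None)
    return order


ide = {"primary": ["hda", "hdb"], "secondary": ["hdc", "hdd"]}
print(" - ".join(interleave_by_channel(ide)))  # hda - hdc - hdb - hdd
```

The same function works for several SCSI host adapters if you list the drives attached to each one.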

3.9. Speed Comparison

The following tables are given just to indicate what speeds are possible but remember that these are the theoretical maximum speeds. All transfer rates are in MB per second and bus widths are measured in bits.

3.9.1. Controllers

       IDE             :        8.3 - 16.7
       Ultra-ATA       :       33 - 66

       SCSI            :
                               Bus width (bits)

       Bus Speed (MHz)          |  8      16      32
       --------------------------------------------------
        5                       |  5      10      20
       10  (fast)               | 10      20      40
       20  (fast-20 / ultra)    | 20      40      80
       40  (fast-40 / ultra-2)  | 40      80      --
       --------------------------------------------------

3.9.2. Bus Types

       ISA             :        8-12
       EISA            :       33
       VESA            :       40    (Sometimes tuned to 50)

       PCI
                               Bus width (bits)

       Bus Speed (MHz)          |  32     64
       --------------------------------------------------
       33                       | 132    264
       66                       | 264    528
       --------------------------------------------------
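
The SCSI and PCI figures above all follow from the same arithmetic: one transfer of the full bus width per clock cycle. The sketch below reproduces a few of them; the function name is invented here, and the nominal clock values (33 rather than 33.33 MHz) are used as in the tables. The ISA, EISA and VESA figures are not covered by this simple formula.

```python
def peak_mb_per_s(bus_mhz, width_bits):
    """Theoretical peak rate in MB/s: width in bytes times clock in MHz."""
    return bus_mhz * width_bits // 8


print(peak_mb_per_s(33, 32))  # 132, 32 bit PCI at 33 MHz
print(peak_mb_per_s(66, 64))  # 528, 64 bit PCI at 66 MHz
print(peak_mb_per_s(20, 16))  # 40, wide ultra SCSI
```

Remember that these remain theoretical maxima; sustained rates in a real system will be lower.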

3.10. Benchmarking

This is a very, very difficult topic and I will only make a few cautious comments about this minefield. First of all, it is very difficult to make comparable benchmarks that have any actual meaning. This, however, does not stop people from trying...

Instead one can use benchmarking to diagnose one's own system: to check that it is going as fast as it should and has not slowed down over time. Also you would expect a significant increase when switching from a simple file system to RAID, so a lack of performance gain will tell you something is wrong.

When you try to benchmark you should not hack up your own; instead look up iozone and bonnie and read the documentation very carefully. In particular make sure the size of your test file is bigger than your RAM size, otherwise you test your RAM (the buffer cache) rather than your disks, which will give you unrealistically high performance.

A very simple benchmark can be obtained using hdparm -tT which can be used both on IDE and SCSI drives.

For more information on benchmarking and software for a number of platforms, check out the ACNC <http://www.acnc.com/benchmarks.html> benchmark page as well as this one <http://spin.ch/~tpo/bench/> and also The Benchmarking-HOWTO <http://www.linuxdoc.org/HOWTO/Benchmarking-HOWTO.html>.

There are also official home pages for bonnie <http://www.textuality.com/bonnie/>, bonnie++ <http://www.coker.com.au/bonnie++/> and iozone <http://www.iozone.org>.

Trivia: Bonnie is intended to locate bottlenecks, the name is a tribute to Bonnie Raitt, "who knows how to use one" as the author puts it.

3.11. Comparisons

SCSI offers more performance than EIDE but at a price. Termination is more complex but expansion is not too difficult. Having more than 4 (or in some cases 2) IDE drives can be complicated, whereas with wide SCSI you can have up to 15 per adapter. Some SCSI host adapters have several channels thereby multiplying the number of possible drives even further.

For SCSI you have to dedicate one IRQ per host adapter which can control up to 15 drives. With EIDE you need one IRQ for each channel (which can connect up to 2 disks, master and slave) which can cause conflict.
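
To make the IRQ comparison concrete, here is a small worked calculation using the per-IRQ drive counts given above (15 for a wide SCSI host adapter, 2 for an EIDE channel). The function name and the example of six drives are assumptions for illustration only.

```python
import math


def irqs_needed(drives, drives_per_irq):
    """Number of IRQs required when each IRQ serves drives_per_irq drives."""
    return math.ceil(drives / drives_per_irq)


drives = 6
print("wide SCSI:", irqs_needed(drives, 15))  # 1
print("EIDE:     ", irqs_needed(drives, 2))   # 3
```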

RLL and MFM are in general too old, slow and unreliable to be of much use.

3.12. Future Development

SCSI-3 is under way and will hopefully be released soon. Faster devices are already being announced: first an 80 MB/s and then a 160 MB/s monster specification were proposed, and very recently such devices became commercially available. These are based around the Ultra-2 standard (which uses a 40 MHz clock) combined with a 16 bit cable.

Some manufacturers already announce SCSI-3 devices but this is currently rather premature as the standard is not yet firm. As the transfer speeds increase the saturation point of the PCI bus is getting closer. Currently the 64 bit version has a limit of 264 MB/s. The PCI transfer rate will in the future be increased from the current 33 MHz to 66 MHz, thereby increasing the limit to 528 MB/s.

The ATA development is continuing and is increasing the performance with the new ATA/100 standard. Since most ATA drives are slower in sustained transfer from platter than this the performance increase will for most people be small.

More interesting is the Serial ATA development, where the flat cable will be replaced with a high speed serial link. This makes cabling far simpler than today and also it solves the problem of cabling obstructing airflow over the drives.

Another trend is for larger and larger drives. I hear it is possible to get 75 GB on a single drive though this is rather expensive. Currently the optimum storage for your money is about 30 GB but this too is continuously increasing. The introduction of DVD will in the near future have a big impact; with nearly 20 GB on a single disk you can have a complete copy of even major FTP sites from around the world. The only thing we can be reasonably sure of about the future is that even if it won't get any better, it will definitely be bigger.

Addendum: soon after I first wrote this I read that the maximum useful speed for a CD-ROM was 20x as mechanical stability would be too great a problem at these speeds. About one month after that again the first commercial 24x CD-ROMs were available... Currently you can get 40x and no doubt higher speeds are in the pipeline.

A project to encapsulate SCSI over TCP/IP, called iSCSI <http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-06.txt> has started, and one Linux iSCSI implementation <http://www.cs.uml.edu/~mbrown/iSCSI> has appeared.

3.13. Recommendations

My personal view is that EIDE or Ultra ATA is the best way to start out on your system, especially if you intend to use DOS as well on your machine. If you plan to expand your system over many years or use it as a server I would strongly recommend you get SCSI drives. Currently wide SCSI is a little more expensive. You are generally more likely to get more for your money with standard width SCSI. There are also differential versions of the SCSI bus which increase the maximum length of the cable. The price increase is even more substantial and these cannot therefore be recommended for normal users.

In addition to disk drives you can also connect some types of scanners and printers and even networks to a SCSI bus.

Also keep in mind that as you expand your system you will draw ever more power, so make sure your power supply is rated for the job and that you have sufficient cooling. Many SCSI drives offer the option of sequential spin-up which is a good idea for large systems. See also ``Power and Heating''.

4. File System Structure

Linux has been multi tasking from the very beginning, with a number of programs interacting and running continuously. It is therefore important to keep a file structure that everyone can agree on so that the system finds data where it expects to. Historically there have been so many different standards that it was confusing, and compatibility was maintained using symbolic links which confused the issue even further until the structure ended up looking like a maze.

In the case of Linux a standard was fortunately agreed on early on called the File Systems Standard (FSSTND) which today is used by all main Linux distributions.

Later it was decided to make a successor that should also support operating systems other than just Linux, called the Filesystem Hierarchy Standard (FHS), currently at version 2.2. This standard is under continuous development and will soon be adopted by Linux distributions.

I recommend not trying to roll your own structure as a lot of thought has gone into the standards and many software packages comply with the standards. Instead you can read more about this at the FHS home page <http://www.pathname.com/fhs/>.

This HOWTO endeavours to comply with FSSTND and will follow FHS when distributions become available.

4.1. File System Features

The various parts of FSSTND have different requirements regarding speed, reliability and size. For instance, losing root is a pain but can easily be recovered from, whereas losing /var/spool/mail is a rather different issue. Here is a quick summary of some essential parts and their properties and requirements. Note that this is just a guide; there can be binaries in etc and lib directories, libraries in bin directories and so on.

4.1.1. Swap

     Speed
        Maximum! Though if you rely too much on swap you should consider
        buying some more RAM. Note, however, that on many old Pentium PC
        motherboards the cache will not work on RAM above 128 MB.

     Size
        Similar to that of RAM. Quick and dirty algorithm: just as for
        tea: 16 MB for the machine and 2 MB for each user. The smallest
        kernels run in 1 MB but that is tight; use 4 MB for general work
        and light applications, 8 MB for X11 or GCC or 16 MB to be
        comfortable.  (The author is known to brew a rather powerful
        cuppa tea...)

        Some suggest that swap space should be 1-2 times the size of the
        RAM, pointing out that the locality of the programs determines
        how effective your added swap space is. Note that using the same
        algorithm as for 4BSD is slightly incorrect as Linux does not
        allocate space for pages in core.

        A more thorough approach is to consider swap space plus RAM as
        your total working set, so if you know how much space you will
        need at most, you subtract the physical RAM you have and that is
        the swap space you will need.

        There is also another reason to be generous when dimensioning
        your swap space: memory leaks. Ill behaving programs that do not
        free the memory they allocate for themselves are said to have a
        memory leak.  The allocation remains even after the program has
        stopped using the memory, so this is a source of memory
        consumption.  Only after the program dies is the memory
        returned.  Once all physical RAM and swap space are exhausted
        the only solution is to kill the offending processes if
        possible, or failing that, reboot and start over.  Thankfully
        such programs are not too common but should you come across one
        you will find that extra swap space will buy you extra time
        between reboots.

        Also remember to take into account the type of programs you use.
        Some programs that have large working sets, such as image
        processing software have huge data structures loaded in RAM
        rather than working explicitly on disk files. Data and computing
        intensive programs like this will cause excessive swapping if
        you have less RAM than the requirements.

        Other types of programs can lock their pages into RAM. This can
        be for security reasons, preventing copies of data reaching a
        swap device or for performance reasons such as in a real time
        module. Either way, locking pages reduces the remaining amount
        of swappable memory and can cause the system to swap earlier
        than otherwise expected.

        In man 8 mkswap it is explained that each swap partition can be
        a maximum of just under 128 MB in size for 32-bit machines and
        just under 256 MB for 64-bit machines.

        This however changed with kernel 2.2.0 after which the limit is
        2 GB.  The man page has been updated to reflect this change.

     Reliability
        Medium. When it fails you know it pretty quickly and failure
        will cost you some lost work. You save often, don't you?

     Note 1
        Linux offers the possibility of interleaved swapping across
        multiple devices, a feature that can gain you much. Check out
        "man 8 swapon" for more details. However, software raiding swap
        across multiple devices adds more overheads than you gain.

        Thus the /etc/fstab file might look like this:

          /dev/sda1       swap            swap    pri=1           0       0
          /dev/sdc1       swap            swap    pri=1           0       0

     Remember that the fstab file is very sensitive to the formatting
     used, read the man page carefully and do not just cut and paste the
     lines above.

     Note 2
        Some people use a RAM disk for swapping or some other file
        systems. However, unless you have some very unusual requirements
        or setups you are unlikely to gain much from this as this cuts
        into the memory available for caching and buffering.

     Note 2b
        There is one exception: on a number of badly designed
        motherboards the on board cache memory is not able to cache all
        the RAM that can be addressed. Many older motherboards could
        accept 128 MB RAM but only cache the lower 64 MB. In such cases
        it would improve the performance if you used the upper
        (uncached) 64 MB RAM for RAMdisk based swap or other temporary
        storage.
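
The working set approach to sizing swap described under Size above, together with the per-partition limits quoted from the mkswap man page, can be turned into a short calculation. This is a sketch only: the function names and the example figures are made up, and 127 MB is used as an assumed stand-in for "just under 128 MB".

```python
import math


def swap_needed_mb(working_set_mb, ram_mb):
    """Swap is the total working set minus physical RAM, never negative."""
    return max(0, working_set_mb - ram_mb)


def swap_partitions(swap_mb, limit_mb):
    """Number of swap partitions needed when each is capped at limit_mb."""
    return math.ceil(swap_mb / limit_mb)


# Hypothetical machine: 64 MB of RAM, 320 MB needed at most.
swap = swap_needed_mb(working_set_mb=320, ram_mb=64)
print(swap)                         # 256
print(swap_partitions(swap, 127))   # 3, with the old limit on 32-bit machines
print(swap_partitions(swap, 2048))  # 1, with the 2 GB limit of kernel 2.2.0 and later
```

If more than one partition is needed, placing them on different drives with equal priority, as in Note 1, lets the kernel interleave between them.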

4.1.2. Temporary Storage ( /tmp and /var/tmp )

     Speed
        Very high. On a separate disk/partition this will reduce
        fragmentation generally, though ext2fs handles fragmentation
        rather well.

     Size
        Hard to tell, small systems are easy to run with just a few MB
        but these are notorious hiding places for stashing files away
        from prying eyes and quota enforcement and can grow without
        control on larger machines. Suggested: small home machine: 8 MB,
        large home machine: 32 MB, small server: 128 MB, and large
        machines up to 500 MB (The machine used by the author at work
        has 1100 users and a 300 MB /tmp directory). Keep an eye on
        these directories, not only for hidden files but also for old
        files. Also be prepared that these partitions might be the first
        reason you might have to resize your partitions.

     Reliability
        Low. Often programs will warn or fail gracefully when these
        areas fail or are filled up. Random file errors will of course
        be more serious, no matter what file area this is.

     Files
        Mostly short files but there can be a huge number of them.
        Normally programs delete their old tmp files but if somehow an
        interruption occurs they could survive. Many distributions have
        a policy regarding cleaning out tmp files at boot time, you
        might want to check out what your setup is.

     Note 1
        In FSSTND there is a note about putting /tmp on RAM disk. This,
        however, is not recommended for the same reasons as stated for
        swap. Also, as noted earlier, do not use flash RAM drives for
        these directories. One should also keep in mind that some
        systems are set to automatically clean tmp areas on rebooting.

     Note 2
        Older systems had a /usr/tmp but this is no longer recommended
        and for historical reasons a symbolic link now makes it point to
        one of the other tmp areas.

(* That was 50 lines, I am home and dry! *)

4.1.3. Spool Areas ( /var/spool/news and /var/spool/mail )

     Speed
        High, especially on large news servers. News transfer and
        expiring are disk intensive and will benefit from fast drives.
        Print spools: low. Consider RAID0 for news.

     Size
        For news/mail servers: whatever you can afford. For single user
        systems a few MB will be sufficient if you read continuously.
        Joining a list server and taking a holiday is, on the other
        hand, not a good idea.  (Again the machine I use at work has 100
        MB reserved for the entire /var/spool)

     Reliability
        Mail: very high, news: medium, print spool: low. If your mail is
        very important (isn't it always?) consider RAID for reliability.

     Files
        Usually a huge number of files that are around a few KB in size.
        Files in the print spool can on the other hand be few but quite
        sizable.

     Note
        Some of the news documentation suggests putting all the
        .overview files on a drive separate from the news files, check
        out all news FAQs for more information.  Typical size is about
        3-10 percent of total news spool size.

4.1.4. Home Directories ( /home )

     Speed
        Medium. Although many programs use /tmp for temporary storage,
        others such as some news readers frequently update files in the
        home directory which can be noticeable on large multiuser
        systems. For small systems this is not a critical issue.

     Size
        Tricky! On some systems people pay for storage so this is
        usually then a question of finance. Large systems such as
        Nyx.net <http://www.nyx.net/> (which is a free Internet service
        with mail, news and WWW services) run successfully with a
        suggested limit of 100 KB per user and 300 KB as enforced
        maximum. Commercial ISPs offer typically about 5 MB in their
        standard subscription packages.

        If however you are writing books or are doing design work the
        requirements balloon quickly.

     Reliability
        Variable. Losing /home on a single user machine is annoying but
        when 2000 users call you to tell you their home directories are
        gone it is more than just annoying. For some their livelihood
        relies on what is here. You do regular backups of course?

     Files
        Equally tricky. The minimum setup for a single user tends to be
        a dozen files, 0.5 - 5 KB in size. Project related files can be
        huge though.

     Note 1
        You might consider RAID for either speed or reliability. If you
        want extremely high speed and reliability you might be looking
        at other operating system and hardware platforms anyway.  (Fault
        tolerance etc.)

     Note 2
        Web browsers often use a local cache to speed up browsing and
        this cache can take up a substantial amount of space and cause
        much disk activity. There are many ways of avoiding this kind of
        performance hits, for more information see the sections on
        ``Home Directories'' and ``WWW''.

     Note 3
        Users often tend to use up all available space on the /home
        partition. The Linux Quota subsystem is capable of limiting the
        number of blocks and the number of inodes a single user ID can
        allocate on a per-filesystem basis. See the Linux Quota
        mini-HOWTO <http://www.linuxdoc.org/HOWTO/mini/Quota.html> by
        Albert M.C. Tam bertie (at) scn.org for details on setup.

4.1.5. Main Binaries ( /usr/bin and /usr/local/bin )

     Speed
        Low. Often data is bigger than the programs which are demand
        loaded anyway so this is not speed critical. Witness the
        successes of live file systems on CD ROM.

     Size
        The sky is the limit but 200 MB should give you most of what you
        want for a comprehensive system. A big system, for software
        development or a multi purpose server should perhaps reserve 500
        MB both for installation and for growth.

     Reliability
        Low. This is usually mounted under root where all the essentials
        are collected. Nevertheless losing all the binaries is a pain...

     Files
        Variable but usually of the order of 10 - 100 KB.

4.1.6. Libraries ( /usr/lib and /usr/local/lib )

     Speed
        Medium. These are large chunks of data loaded often, ranging
        from object files to fonts, all susceptible to bloating. Often
        these are also loaded in their entirety and speed is of some use
        here.

     Size
        Variable. This is for instance where word processors store their
        immense font files. The few that have given me feedback on this
        report about 70 MB in their various lib directories.  A rather
        complete Debian 1.2 installation can take as much as 250 MB
        which can be taken as a realistic upper limit.  The following
        ones are some of the largest disk space consumers: GCC, Emacs,
        TeX/LaTeX, X11 and perl.

     Reliability
        Low. See point ``Main binaries''.

     Files
        Usually large with many of the order of 1 MB in size.

     Note
        For historical reasons some programs keep executables in the lib
        areas. One example is GCC which has some huge binaries in the
        /usr/lib/gcc/lib hierarchy.

4.1.7. Boot

     Speed
        Quite low: after all booting doesn't happen that often and
        loading the kernel is just a tiny fraction of the time it takes
        to get the system up and running.

     Size
        Quite small, a complete image with some extras fits on a single
        floppy so 5 MB should be plenty.

     Reliability
        High. See section below on Root.

     Note 1
        The most important part about the Boot partition is that on many
        systems it must reside below cylinder 1023. This is a BIOS
        limitation that Linux cannot get around.

     Note 1a
        The above is not necessarily true for recent IDE systems and not
        for any SCSI disks. For more information check the latest Large
        Disk HOWTO.

     Note 2
        Recently a new boot loader has been written that overcomes the
        1023 cylinder limit. For more information check out this article
        <http://www.linuxforum.com/plug/articles/nuni.html> on nuni.
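
To see what the cylinder limit in the notes above means in bytes, multiply out the drive geometry. The sketch below assumes 512 byte sectors and 1024 addressable cylinders (numbered 0 to 1023), and uses two commonly seen geometries as examples: 16 heads by 63 sectors and 255 heads by 63 sectors. These geometries are illustrative assumptions, so check what your own BIOS reports.

```python
SECTOR_BYTES = 512  # assumed sector size


def chs_capacity_bytes(cylinders, heads, sectors_per_track):
    """Bytes addressable with the given cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES


print(chs_capacity_bytes(1024, 16, 63))   # 528482304, roughly 528 MB
print(chs_capacity_bytes(1024, 255, 63))  # 8422686720, roughly 8.4 GB
```

On a drive larger than the figure for your geometry, a small boot partition at the very start of the disk keeps the kernel image within reach of the BIOS.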

4.1.8. Root

     Speed
        Quite low: only the bare minimum is here, much of which is only
        run at startup time.

     Size
        Relatively small. However it is a good idea to keep some
        essential rescue files and utilities on the root partition and
        some keep several kernel versions. Feedback suggests about 20 MB
        would be sufficient.

     Reliability
        High. A failure here will possibly cause a fair bit of grief and
        you might end up spending some time rescuing your boot
        partition. With some practice you can of course do this in an
        hour or so, but I would think if you have some practice doing
        this you are also doing something wrong.

        Naturally you do have a rescue disk? Of course this is updated
        since you did your initial installation? There are many ready
        made rescue disks as well as rescue disk creation tools you
        might find valuable.  Presumably investing some time in this
        saves you from becoming a root rescue expert.

     Note 1
        If you have plenty of drives you might consider putting a spare
        emergency boot partition on a separate physical drive. It will
        cost you a little bit of space but if your setup is huge the
        time saved, should something fail, will be well worth the extra
        space.

     Note 2
        For simplicity and also in case of emergencies it is not
        advisable to put the root partition on a RAID level 0 system.
        Also if you use RAID for your boot partition you have to
        remember to have the md option turned on for your emergency
        kernel.

     Note 3
        For simplicity it is quite common to keep Boot and Root on the
        same partition. If you do that, then in order to boot from LILO
        it is important that the essential boot files reside wholly
        below cylinder 1023. This includes the kernel as well as files
        found in /boot.

4.1.9. DOS etc.

At the risk of sounding heretical I have included this little section about something many reading this document have strong feelings about. Unfortunately many hardware items come with setup and maintenance tools based around those systems, so here goes.

     Speed
        Very low. The systems in question are not famed for speed so
        there is little point in using prime quality drives.
        Multitasking or multi-threading are not available so the command
        queueing facility found in SCSI drives will not be taken
        advantage of. If you have an old IDE drive it should be good
        enough. The exception is to some degree Win95 and more notably
        NT which have multi-threading support which should theoretically
        be able to take advantage of the more advanced features offered
        by SCSI devices.

     Size
        The company behind these operating systems is not famed for
        writing tight code so you have to be prepared to spend a few
        tens of MB depending on what version you install of the OS or
        Windows. With an old version of DOS or Windows you might fit it
        all in on 50 MB.

     Reliability
        Ha-ha. As the chain is no stronger than the weakest link you can
        use any old drive. Since the OS is more likely to scramble
        itself than the drive is likely to self destruct you will soon
        learn the importance of keeping backups here.

        Put another way: "Your mission, should you choose to accept it,
        is to keep this partition working. The warranty will self
        destruct in 10 seconds..."

        Recently I was asked to justify my claims here. First of all I
        am not calling DOS and Windows sorry excuses for operating
        systems. Secondly there are various legal issues to be taken
        into account. Saying there is a connection between the last two
        sentences is merely the ravings of the paranoid. Surely.
        Instead I shall offer the esteemed reader a few key words: DOS
        4.0, DOS 6.x and various drive compression tools that shall
        remain nameless.

4.2. Explanation of Terms

Naturally the faster the better but often the happy installer of Linux has several disks of varying speed and reliability so even though this document describes performance as 'fast' and 'slow' it is just a rough guide since no finer granularity is feasible. Even so there are a few details that should be kept in mind:

4.2.1. Speed

This is really a rather woolly mix of several terms: CPU load, transfer setup overhead, disk seek time and transfer rate. It is in the very nature of tuning that there is no fixed optimum, and in most cases price is the dictating factor. CPU load is only significant for IDE systems where the CPU does the transfer itself but is generally low for SCSI, see SCSI documentation for actual numbers. Disk seek time is also small, usually in the millisecond range. This however is not a problem if you use command queueing on SCSI where you then overlap commands keeping the bus busy all the time. News spools are a special case consisting of a huge number of normally small files so in this case seek time can become more significant.

There are two main parameters that are of interest here:

     Seek
        is usually specified as the average time taken for the
        read/write head to seek from one track to another. This
        parameter is important when dealing with a large number of small
        files such as found in spool files.  There is also the extra
        rotational delay before the desired sector rotates into position
        under the head. This delay is dependent on the angular velocity
        of the drive which is why this parameter quite often is quoted
        for a drive. Common values are 4500, 5400 and 7200 RPM
        (rotations per minute). Higher RPM reduces this delay but at a
        substantial cost.  Also drives working at 7200 RPM have been
        known to be noisy and to generate a lot of heat, a factor that
        should be kept in mind if you are building a large array or
        "disk farm". Very recently drives working at 10000 RPM have
        entered the market and here the cooling requirements are even
        stricter and minimum figures for air flow are given.

     Transfer
        is usually specified in megabytes per second.  This parameter is
        important when handling large files that have to be transferred.
        Library files, dictionaries and image files are examples of
        this. Drives featuring a high rotation speed also normally have
        fast transfers as transfer speed is proportional to angular
        velocity for the same sector density.

It is therefore important to read the specifications for the drives very carefully, and note that the maximum transfer speed quite often is quoted for transfers out of the on board cache (burst speed) and not directly from the platter (sustained speed). See also section on ``Power and Heating''.
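
The rotational delay mentioned under Seek can be estimated from the RPM figure alone: on average the desired sector is half a revolution away. The sketch below works this out for the rotation speeds listed above; the function name is invented for this example.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for a sector: half a revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return 0.5 * ms_per_revolution


for rpm in (4500, 5400, 7200, 10000):
    print(rpm, "RPM:", round(avg_rotational_latency_ms(rpm), 2), "ms")
```

Going from 5400 to 7200 RPM saves a little under 1.4 ms per access on average, which matters mostly when there are many small accesses.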

4.2.2. Reliability

Naturally no-one would want low reliability disks but one might be better off regarding old disks as unreliable. Also for RAID purposes (See the relevant information) it is suggested to use a mixed set of disks so that simultaneous disk crashes become less likely.

So far I have had only one report of total file system failure but here unstable hardware seemed to be the cause of the problems.

Disks are cheap these days yet people still underestimate the value of the contents of the drives. If you need higher reliability make sure you replace old drives and keep spares. It is not unusual for drives to work more or less continuously for years and years but what often kills a drive in the end is power cycling.

4.2.3. Files

The average file size is important in order to decide the most suitable drive parameters. A large number of small files makes the average seek time important whereas for big files the transfer speed is more important. The command queueing in SCSI devices is very handy for handling large numbers of small files, but for transfer EIDE is not too far behind SCSI and normally much cheaper than SCSI.
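To see why file size decides which parameter matters, here is a small sketch of a very simple access time model: one seek, half a revolution of waiting, then the transfer itself. The drive figures used (10 ms seek, 5400 RPM, 10 MB/s sustained) are illustrative assumptions only and not measurements, and the function names are made up for this example.

```python
def access_time_ms(file_bytes, seek_ms, rpm, transfer_mb_per_s):
    """Return (positioning_ms, transfer_ms) for reading one contiguous file."""
    rotational_ms = 0.5 * 60_000.0 / rpm          # half a revolution on average
    positioning_ms = seek_ms + rotational_ms
    transfer_ms = file_bytes / (transfer_mb_per_s * 1_000_000.0) * 1000.0
    return positioning_ms, transfer_ms


def positioning_share(file_bytes, seek_ms=10.0, rpm=5400, transfer_mb_per_s=10.0):
    """Fraction of the total access time spent moving the head and waiting."""
    positioning_ms, transfer_ms = access_time_ms(
        file_bytes, seek_ms, rpm, transfer_mb_per_s
    )
    return positioning_ms / (positioning_ms + transfer_ms)


# A small news article versus a large image file.
for size in (4_096, 10_000_000):
    print(f"{size:>10} bytes: {positioning_share(size):.1%} of the time is positioning")
```

With these assumed figures positioning dominates almost completely for the small file and nearly vanishes for the large one, which is the reason spool areas want fast seeks and libraries want fast transfers.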

5. File Systems

Over time the requirements for file systems, the systems that access and organise the data on mass storage, have increased, and the demands for large structures, large files, long file names and more have prompted ever more advanced file systems. Today there is a large number of file systems to choose from and this section will describe these in detail.

The emphasis is on Linux but with more input I will be happy to add information for a wider audience.

5.1. General Purpose File Systems

Most operating systems usually have a general purpose file system for every day use for most kinds of files, reflecting available features in the OS such as permission flags, protection and recovery.

5.1.1. minix

This was the original fs for Linux, back in the days when Linux was hosted on minix machines. It is simple but limited in features and hardly ever used these days other than in some rescue disks as it is rather compact.

5.1.2. xiafs and extfs

These are also old, have fallen into disuse and are no longer recommended.

5.1.3. ext2fs

This is the established standard for general purpose in the Linux world. It is fast, efficient and mature and is under continuous development and features such as ACL and transparent compression are on the horizon.

For more information check the ext2fs <http://web.mit.edu/tytso/www/linux/ext2.html> home page.

5.1.4. ext3fs

This is the name for the upcoming successor to ext2fs due to enter the stable kernel in the near future. Many features are added to ext2fs but to avoid confusion over the name after such a radical upgrade the name will be changed too. You may have heard of it already but source code is now in beta release.

Patches are available at Linux.org <ftp://ftp.linux.org.uk/pub/linux/sct/fs/jfs>.

5.1.5. ufs

This is the fs used by BSD and variants thereof. It is mature but also developed for older types of disk drives where geometries were known. The fs uses a number of tricks to optimise performance but as disk geometries are translated in a number of ways the net effect is no longer so optimal.

5.1.6. efs

The Extent File System (efs) is Silicon Graphics' early file system widely used on IRIX before version 6.0 after which xfs has taken over. While migration to xfs is encouraged efs is still supported and much used on CDs.

There is a Linux driver available in early beta stage, available at Linux extent file system <http://aeschi.ch.eu.org/efs/> home page.

5.1.7. XFS

Silicon Graphics Inc (sgi) <http://www.sgi.com/> has started porting its mainframe grade file system to Linux. Source is not yet available as they are busily cleaning out legal encumbrance but once that is done they will provide the source code under GPL.

More information is already available on the XFS project page <http://oss.sgi.com/projects/xfs/> at SGI.

5.1.8. reiserfs

As of July 23rd, 1997 Hans Reiser reiser (at) RICOCHET.NET has put up the source to his tree based reiserfs <http://www.namesys.com> on the web. His filesystem has some very interesting features, is much faster than ext2fs and is in use by a number of people. Hopefully it will be ready for kernel 2.4.0 which might be ready at the end of the year.

5.1.9. enh-fs

The Enhanced File System project is now dead.

5.1.10. Tux2 fs

This is a variation on the ext2fs that adds robustness in case of unexpected interruptions such as power failure. After such an event Tux2 fs will restart with the file system in a consistent, recently recorded state without fsck or other recovery operations. To achieve this Tux2 fs uses a newly designed algorithm called Phase Tree.

More information can be found at the project home page <http://tux2.sourceforge.net>.

5.2. Microsoft File Systems

This company is responsible for a lot, including a number of filesystems that have at the very least caused confusion.

5.2.1. fat

Actually there are 2 fats out there, fat12 and fat16 depending on the partition size used but fortunately the difference is so minor that the whole issue is transparent.

On the plus side these are fast and simple and most OSes understand them and can both read and write them. And that is about it.

The minus side is limited safety, severely limited permission flags and atrocious scalability. For instance with fat you cannot have partitions larger than 2 GB.
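The 2 GB figure can be worked out with a little arithmetic. The sketch below assumes a 16 bit FAT with at most 2^16 cluster entries (slightly fewer in practice, since some values are reserved) and the 32 KB cluster limit commonly applied by DOS; the function names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def max_partition_bytes(fat_bits: int, cluster_bytes: int) -> int:
    """Upper bound: number of addressable clusters times the cluster size."""
    return (2 ** fat_bits) * cluster_bytes


def min_cluster_bytes(partition_bytes: int, fat_bits: int = 16,
                      sector_bytes: int = 512) -> int:
    """Smallest power-of-two cluster size that can cover the partition."""
    cluster = sector_bytes
    while max_partition_bytes(fat_bits, cluster) < partition_bytes:
        cluster *= 2
    return cluster


print("fat16 limit:", max_partition_bytes(16, 32 * KIB) // GIB, "GiB")
for size in (256 * MIB, 1 * GIB, 2 * GIB):
    print(size // MIB, "MiB partition needs",
          min_cluster_bytes(size) // KIB, "KiB clusters")
```

Since even a tiny file occupies a whole cluster, the growing cluster size is one reason large fat partitions waste so much space on small files.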

5.2.2. fat32

After about 10 years Microsoft realised fat was about, well, 10 years behind the times and created this fs which scales reasonably well.

Permission flags are still limited. NT 4.0 cannot read this file system but Linux can.

5.2.3. vfat

At the same time as Microsoft launched fat32 they also added support for long file names, known as vfat.

Linux reads vfat and fat32 partitions by mounting with type vfat.

5.2.4. ntfs

This is the native fs of Win-NT but as complete information is not available there is limited support for other OSes.

5.3. Logging and Journaling File Systems

These take a radically different approach to file updates by logging modifications for files in a log and later at some time checkpointing the logs.

Reading is roughly as fast as traditional file systems that always update the files directly. Writing is much faster as only updates are appended to a log. All this is transparent to the user. It is in reliability and particularly in checking file system integrity that these file systems really shine. Since the data before last checkpointing is known to be good only the log has to be checked, and this is much faster than for traditional file systems.
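The idea can be sketched with a toy model in Python. This is only an illustration of appending to a log and checkpointing later; it is not how any of the file systems listed here are implemented, and the class and method names are invented for the example.

```python
class LoggedStore:
    """Toy model: updates are appended to a log and applied at checkpoints."""

    def __init__(self):
        self.checkpointed = {}   # state known to be good
        self.log = []            # updates appended since the last checkpoint

    def write(self, name, data):
        # Writing is cheap: just append the update to the log.
        self.log.append((name, data))

    def read(self, name):
        # The newest logged update wins, otherwise use the checkpointed state.
        for logged_name, data in reversed(self.log):
            if logged_name == name:
                return data
        return self.checkpointed.get(name)

    def checkpoint(self):
        # Apply the log in order, then start a fresh log.
        for name, data in self.log:
            self.checkpointed[name] = data
        self.log.clear()

    def entries_to_check_after_crash(self):
        # Only the log needs checking, not the whole store.
        return len(self.log)


store = LoggedStore()
store.write("/etc/motd", "hello")
store.checkpoint()
store.write("/etc/motd", "hello again")
print(store.read("/etc/motd"), store.entries_to_check_after_crash())
```

However large the checkpointed state grows, the amount of work after a crash depends only on the length of the log.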

Note that while logging filesystems keep track of changes made to both data and inodes, journaling filesystems keep track only of inode changes.

Linux has quite a choice in such file systems but none are yet of production quality. Some are also on hold.

  • Adam Richter from Yggdrasil posted some time ago that they have been working on a compressed log file based system but that this project is currently on hold. Nevertheless a non-working version is available on their FTP server. Check out the Yggdrasil ftp server <ftp://ftp.yggdrasil.com/private/adam> where special patched versions of the kernel can be found.
  • Another project is the Linux log-structured Filesystem Project <http://outflux.net/projects/lfs/> which sadly also is on hold. Nevertheless this page contains much information on the topic.
  • Then there is the LinLogFS -- A Log-Structured Filesystem For Linux <http://www.complang.tuwien.ac.at/czezatke/lfs.html> (formerly known as dtfs) which seems to be going strong. It is still in alpha but sufficiently complete to make programs run off this file system.
  • Finally there is the Journaling Flash File System <http://developer.axis.com/software/jffs/> designed for their embedded diskless systems such as their Linux based web camera.

Note that ext3fs, XFS and reiserfs also have features for logging or journaling.

5.4. Read-only File Systems

Read-only media has not escaped the ever increasing complexities seen in more general file systems so again there is a large choice to choose from with corresponding opportunities for exciting mistakes.

Note that ext2fs works quite well on a CD-ROM and seems to save space while offering the normal file system features such as long file names and permissions that can be retained when copying files across to read-write media. Also having /dev on a CD-ROM is possible.

Most of these are used with the CD-ROM media but also the new DVD can be used and you can even use it through the loopback device on a hard disk file for verifying an image before burning a ROM.

There is a read-only romfs for Linux but as that is not disk related nothing more will be said about it here.

5.4.1. High Sierra

This was one of the earliest standards for CD-ROM formats, supposedly named after the hotel where the final agreement took place.

High Sierra was so limited in features that new extensions simply had to appear and while there has been no end to new formats the original High Sierra remains the common precursor and is therefore still widely supported.

5.4.2. iso9660

The International Standards Organisation made their extensions and formalised the standard into what we know as the iso9660 standard.

The Linux iso9660 file system supports both High Sierra as well as Rock Ridge extensions.

5.4.3. Rock Ridge

Not everyone accepts limits like short filenames and lack of permissions so very soon the Rock Ridge extensions appeared to rectify these shortcomings.

5.4.4. Joliet

Microsoft, not to be outdone in the standards extension game, decided it should extend CD-ROM formats with some internationalisation features and called it Joliet.

Linux supports this standard in kernels 2.0.34 or newer. You need to enable NLS in order to use it.

5.4.5. Trivia

Joliet is a city outside Chicago; best known for being the site of the prison where Jake was locked up in the movie "Blues Brothers." Rock Ridge (the UNIX extensions to ISO 9660) is named after the (fictional) town in the movie "Blazing Saddles."

5.4.6. UDF

With the arrival of DVD with up to about 17 GB of storage capacity the world seemingly needed another format, this time ambitiously named Universal Disk Format (UDF). This is intended to replace iso9660 and will be required for DVD.

Currently this is not in the standard Linux kernel but a project is underway to make a UDF driver <http://trylinux.com/projects/udf/index.html> for Linux. Patches and documentation are available.

More information is also available at the Linux and DVDs <http://atv.ne.mediaone.net/linux-dvd/> page.

5.5. Networking File Systems

There is a large number of networking technologies available that let you distribute disks throughout a local or even global network. This is somewhat peripheral to the topic of this HOWTO but as it can be used with local disks I will cover this briefly. It would be best if someone (else) took this into a separate HOWTO...

5.5.1. NFS

This is one of the earliest systems that allows mounting a file space on one machine onto another. There are a number of problems with NFS ranging from performance to security but it has nevertheless become established.

5.5.2. AFS

This is a system that allows efficient sharing of files across large networks. Starting out as an academic project it is now sold by Transarc <http://www.transarc.com> whose home page gives you more details.

Derek Atkins, of MIT, ported AFS to Linux and has also set up the Linux AFS mailing list (linux-afs@mit.edu) for this which is open to the public. Requests to join the list should go to linux-afs-request@mit.edu and finally bug reports should be directed to linux-afs-bugs@mit.edu.

Important: as AFS uses encryption it is restricted software and cannot easily be exported from the US.

IBM, which owns Transarc, has announced the availability of the latest version of client as well as server for Linux.

Arla is a free AFS implementation, check the Arla homepage <http://www.stacken.kth.se/projekt/arla/> for more information as well as documentation.

5.5.3. Coda

A networking filesystem similar to AFS is underway and is called Coda <http://coda.cs.cmu.edu/>. This is designed to be more robust and fault tolerant than AFS, and supports mobile, disconnected operations. Currently it does not scale very well, and does not really have proper administrative tools, as AFS does and ARLA is beginning to.

5.5.4. nbd

The Network Block Device <http://atrey.karlin.mff.cuni.cz/~pavel/> (nbd) is available in Linux kernel 2.2 and later and offers reportedly excellent performance. The interesting thing here is that it can be combined with RAID (see later).

5.5.5. enbd

The Enhanced Network Block Device <http://www.it.uc3m.es/~ptb/nbd> (enbd) is a project to enhance the nbd with features such as block journaled multi channel communications, internal failover and automatic balancing between channels and more.

The intended use is for RAID over the net.

5.5.6. GFS

The Global File System <http://gfs.lcse.umn.edu/> is a new file system designed for storage across a wide area network. It is currently in the early stages and more information will come later.

5.6. Special File Systems

In addition to the general file systems there is also a number of more specific ones, usually to provide higher performance or other features, usually with a tradeoff in other respects.

5.6.1. tmpfs and swapfs

For short term fast file storage SunOS offers tmpfs which is about the same as the swapfs on NeXT. This overcomes the inherent slowness in ufs by caching file data and keeping control information in memory. This means that data on such a file system will be lost when rebooting and is therefore mainly suitable for the /tmp area but not /var/tmp, which is where temporary data that must survive a reboot is placed.

SunOS offers very limited tuning for tmpfs and the number of files is even limited by total physical memory of the machine.

Linux has featured tmpfs since kernel version 2.4; it is enabled by turning on virtual memory file system support (formerly shm fs). Under certain circumstances tmpfs can lock up the system in early kernel versions, so make sure you use version 2.4.6 or later.

5.6.2. userfs

The user file system (userfs) allows a number of extensions to traditional file system use such as FTP based file system, compression (arcfs) and fast prototyping and many other features. The docfs is based on this filesystem. Check the userfs homepage <http://www.goop.org/~jeremy/userfs/> for more information.

5.6.3. devfs

When disks are added, removed or just fail it is likely that disk device names of the remaining disks will change. For instance if sdb fails then the old sdc becomes sdb, the old sdd becomes sdc and so on. Note that in this case hda, hdb etc will remain unchanged. Likewise if a new drive is added the reverse may happen.

There is no guarantee that SCSI ID 0 becomes sda and that adding disks in increasing ID order will just add a new device name without renaming previous entries, as some SCSI drivers assign from ID 0 and up while others reverse the scanning order. Likewise adding a SCSI host adapter can also cause renaming.

Generally device names are assigned in the order they are found.
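The renaming can be illustrated with a small simulation. This is only a sketch of the "in the order found" rule and not what any driver actually does; the disk labels are invented and only the first 26 names (sda to sdz) are modelled.

```python
import string


def assign_scsi_names(disks_in_scan_order):
    """Hand out sda, sdb, ... in the order the disks are found."""
    if len(disks_in_scan_order) > len(string.ascii_lowercase):
        raise ValueError("this sketch only models sda to sdz")
    return {
        "sd" + string.ascii_lowercase[index]: disk
        for index, disk in enumerate(disks_in_scan_order)
    }


disks = ["root disk", "home disk", "news disk", "spool disk"]
print(assign_scsi_names(disks))

# The second disk fails and is no longer found at the next boot:
# every disk after it moves up one name.
disks.remove("home disk")
print(assign_scsi_names(disks))
```

Anything that refers to a disk by name, such as /etc/fstab, will now point at the wrong drive for all the disks that moved.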

The source of the problem lies in the limited number of bits available for major and minor numbering in the device files used to describe the device itself. You can see these in the /dev directory; info on the numbering and allocation can be found in man MAKEDEV. Currently there are 2 solutions to this problem in various stages of development:

     scsidev
        works by creating a database of drives and where they belong,
        check  man scsifs and the scsidev home page for more information

     devfs
        is a more long term project aimed at getting around the whole
        business of device numbering by making the /dev directory a
        kernel file system in the same way as /proc is.  More
        information will appear as it becomes available.

5.6.4. smugfs

For a number of reasons it is currently difficult to have files bigger than 2 GB. One file system that tries to overcome this limit is smugfs which is very fast but also simple. For instance there are no directories and the block allocation is simple.

It is available as compressed tarred source code <ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/> and while it worked with kernel version 2.1.85 it is quite possible some work is required to make it fit into newer kernels. Also the low version number (0.0) suggests extra care is required.

5.7. File System Recommendations

There is a jungle of choices but generally it is recommended to use the general file system that comes with your distribution. If you use ufs and have some kind of tmpfs available you should first start off with the general file system to get an idea of the space requirements and if necessary buy more RAM to support the size of tmpfs you need. Otherwise you will end up with mysterious crashes and lost time.

If you use dual boot and need to transfer data between the two OSes one of the simplest ways is to use an appropriately sized partition formatted with fat as most systems can reliably read and write this. Remember the limit of 2 GB for fat partitions.

For more information on file system interconnectivity you can check out the file system <http://students.ceid.upatras.gr/~gef/fs/oldindex.html> page which has been superseded by file system <http://www.penguin.cz/~mhi/fs/> and the article Kragen's Amazing List of Filesystems <http://linuxtoday.com/stories/5556.html>.

That guide is being superseded by a HOWTO which is underway and a link will be added when it is ready.

To avoid total havoc with device renaming if a drive fails check out the scanning order of your system and try to keep your root system on hda or sda and removable media such as ZIP drives at the end of the scanning order.

6. Technologies

In order to decide how to get the most out of your devices you need to know what technologies are available and their implications. As always there can be some tradeoffs with respect to speed, reliability, power, flexibility, ease of use and complexity.

Many of the techniques described below can be stacked in a number of ways to maximise performance and reliability, though at the cost of added complexity.

6.1. RAID

This is a method of increasing reliability, speed or both by using multiple disks in parallel thereby decreasing access time and increasing transfer speed. A checksum or mirroring system can be used to increase reliability. Large servers can take advantage of such a setup but it might be overkill for a single user system unless you already have a large number of disks available. See other documents and FAQs for more information.

For Linux one can set up a RAID system using either software (the md module in the kernel), a Linux compatible controller card (PCI-to-SCSI) or a SCSI-to-SCSI controller. Check the documentation for what controllers can be used. A hardware solution is usually faster, and perhaps also safer, but comes at a significant cost.

A summary of available hardware RAID solutions for Linux is available at Linux Consulting <http://www.LinuxConsulting.com/Raid/Docs/raid_hw.txt>.

6.1.1. SCSI-to-SCSI

SCSI-to-SCSI controllers are usually implemented as complete cabinets with drives and a controller that connects to the computer with a second SCSI bus. This makes the entire cabinet of drives look like a single large, fast SCSI drive and requires no special RAID driver. The disadvantage is that the SCSI bus connecting the cabinet to the computer becomes a bottleneck.

A significant problem for people with large disk farms is that there is a limit to how many SCSI entries there can be in the /dev directory. In these cases using SCSI-to-SCSI will conserve entries, as each cabinet appears as a single drive.

Usually they are configured via the front panel or with a terminal connected to their on-board serial interface.

Some manufacturers of such systems are CMD <http://www.cmd.com> and Syred <http://www.syred.com> whose web pages describe several systems.

6.1.2. PCI-to-SCSI

PCI-to-SCSI controllers are, as the name suggests, connected to the high speed PCI bus and therefore do not suffer from the same bottleneck as the SCSI-to-SCSI controllers. These controllers require special drivers but you also get the means of controlling the RAID configuration over the network which simplifies management.

Currently only a few families of PCI-to-SCSI host adapters are supported under Linux.

     DPT
        The oldest and most mature is a range of controllers from DPT
        <http://www.dpt.com> including SmartCache I/III/IV and SmartRAID
        I/III/IV controller families.  These controllers are supported
        by the EATA-DMA driver in the standard kernel. This company also
        has an informative home page <http://www.dpt.com> which also
        describes various general aspects of RAID and SCSI in addition
        to the product related information.

        More information from  the author of the DPT controller drivers
        (EATA* drivers) can be found at his pages on SCSI
        <http://www.uni-mainz.de/~neuffer/scsi/> and DPT
        <http://www.uni-mainz.de/~neuffer/scsi/dpt/>.

        These are not the fastest but have a good track record of proven
        reliability.

        Note that the maintenance tools for DPT controllers currently
        run under DOS/Win only so you will need a small DOS/Win
        partition for some of the software. This also means you have to
        boot the system into Windows in order to maintain your RAID
        system.

     ICP-Vortex
        A very recent addition is a range of controllers from ICP-Vortex
        <http://www.icp-vortex.com> featuring up to 5 independent
        channels and very fast hardware based on the i960 chip. The
        Linux driver was written by the company itself which shows they
        support Linux.

        As ICP-Vortex supplies the maintenance software for Linux it is
        not necessary to reboot into other operating systems for the
        setup and maintenance of your RAID system. This also saves you
        extra downtime.

     Mylex DAC-960
        This is one of the latest entries which is out in early beta.
        More information as well as drivers are available at Dandelion
        Digital's Linux DAC960 Page
        <http://www.dandelion.com/Linux/DAC960.html>.

     Compaq Smart-2 PCI Disk Array Controllers
        Another very recent entry and currently in beta release is the
        Smart-2 <http://www.insync.net/~frantzc/cpqarray.html> driver.

     IBM ServeRAID
        IBM has released their driver
        <http://www.developer.ibm.com/welcome/netfinity/serveraid_beta.html>
        as GPL.

6.1.3. Software RAID

A number of operating systems offer software RAID using ordinary disks and controllers. Cost is low and performance for raw disk IO can be very high. As this can be very CPU intensive it increases the load noticeably, so if the machine is CPU bound in performance rather than IO bound you might be better off with a hardware PCI-to-SCSI RAID controller.

Real cost, performance and especially reliability of software vs. hardware RAID is a very controversial topic. Reliability on Linux systems has been very good so far.

The current software RAID project on Linux is the md system (multiple devices) which offers much more than RAID so it is described in more details later.

6.1.4. RAID Levels

RAID comes in many levels and flavours, of which I will give a brief overview here. Much has been written about it and the interested reader is recommended to read more about this in the Software RAID HOWTO <http://ostenfeld.dk/~jakob/Software-RAID.HOWTO/>.

  • RAID 0 is not redundant at all but offers the best throughput of all levels here. Data is striped across a number of drives so read and write operations take place in parallel across all drives. On the other hand if a single drive fails then everything is lost. Did I mention backups?
  • RAID 1 is the most primitive method of obtaining redundancy by duplicating data across all drives. Naturally this is massively wasteful but you get one substantial advantage which is fast access. The drive that accesses the data first wins. Transfers are not any faster than for a single drive, even though you might get some faster read transfers by using one track reading per drive.

    Also if you have only 2 drives this is the only method of achieving redundancy.

  • RAID 2 and 4 are not so common and are not covered here.
  • RAID 3 uses a number of disks (at least 2) to store data in a striped RAID 0 fashion. It also uses an additional redundancy disk to store the XOR sum of the data from the data disks. Should the redundancy disk fail, the system can continue to operate as if nothing happened. Should any single data disk fail the system can compute the data on this disk from the information on the redundancy disk and all remaining disks. Any double fault will bring the whole RAID set off-line.

    RAID 3 makes sense only with at least 2 data disks (3 disks including the redundancy disk). Theoretically there is no limit for the number of disks in the set, but the probability of a fault increases with the number of disks in the RAID set. Usually the upper limit is 5 to 7 disks in a single RAID set.

    Since RAID 3 stores all redundancy information on a dedicated disk and since this information has to be updated whenever a write to any data disk occurs, the overall write speed of a RAID 3 set is limited by the write speed of the redundancy disk. This, too, is a limit for the number of disks in a RAID set. The overall read speed of a RAID 3 set with all data disks up and running is that of a RAID 0 set with that number of data disks. If the set has to reconstruct data stored on a failed disk from redundant information, the performance will be severely limited: All disks in the set have to be read and XOR-ed to compute the missing information.

  • RAID 5 is just like RAID 3, but the redundancy information is spread on all disks of the RAID set. This improves write performance, because load is distributed more evenly between all available disks.
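The XOR redundancy used by RAID 3 and RAID 5 can be demonstrated in a few lines of Python. This is a sketch of the principle only, working on short byte strings instead of disk blocks; the helper name xor_blocks is made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized blocks together, byte by byte."""
    return bytes(
        reduce(lambda a, b: a ^ b, column)
        for column in zip(*blocks)
    )


# Three data disks, each holding one (tiny) block of the stripe.
data_blocks = [b"news", b"mail", b"home"]

# The redundancy disk stores the XOR sum of the data blocks.
parity = xor_blocks(data_blocks)

# The second data disk fails: rebuild its block from the survivors and parity.
survivors = [data_blocks[0], data_blocks[2]]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt)
```

Note that the rebuild has to read every remaining disk, which is why performance drops so much while a disk is missing, and that a second failure leaves too little information to recover anything.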

There are also hybrids available based on RAID 0 or 1 and one other level. Many combinations are possible but I have only seen a few referred to. These are more complex than the above mentioned RAID levels.

RAID 0/1 combines striping with duplication which gives very high transfers combined with fast seeks as well as redundancy. The disadvantage is high disk consumption as well as the above mentioned complexity.

RAID 1/5 combines the speed and redundancy benefits of RAID5 with the fast seek of RAID1. Redundancy is improved compared to RAID 0/1 but disk consumption is still substantial. Implementing such a system would involve typically more than 6 drives, perhaps even several controllers or SCSI channels.
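The disk consumption of the simpler levels can be compared with a small calculation. The sketch below assumes equally sized drives, RAID 1 duplicating across all drives as described above, and RAID 0/1 built from two way mirrors; the function is an illustration and does not come from any RAID tool.

```python
def usable_capacity(level: str, drives: int, drive_size_gb: float) -> float:
    """Usable space in GB for a set of equally sized drives."""
    if level == "0":
        return drives * drive_size_gb            # no redundancy at all
    if level == "1":
        if drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return drive_size_gb                     # every drive holds the same data
    if level in ("3", "5"):
        if drives < 3:
            raise ValueError("RAID 3/5 needs at least 3 drives")
        return (drives - 1) * drive_size_gb      # one drive's worth of redundancy
    if level == "0/1":
        if drives < 4 or drives % 2:
            raise ValueError("RAID 0/1 needs an even number of drives, at least 4")
        return drives // 2 * drive_size_gb       # half the drives are mirrors
    raise ValueError(f"level {level!r} is not covered by this sketch")


for level in ("0", "1", "3", "5", "0/1"):
    print(f"RAID {level:>3} with 4 x 9 GB drives: {usable_capacity(level, 4, 9.0):.0f} GB")
```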

6.2. Volume Management

Volume management is a way of overcoming the constraints of fixed sized partitions and disks while still having control over where various parts of the file space reside. With such a system you can add new disks to your system and add space from this drive to parts of the file space where needed, as well as migrate data out from a disk developing faults to other drives before catastrophic failure occurs.

The system developed by Veritas <http://www.veritas.com> has become the de facto standard for logical volume management.
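A toy model may make the idea more concrete. In the sketch below a volume is simply a list of fixed size extents, each living on some disk; growing a volume takes free extents from wherever they are available, and evacuating a disk moves its extents elsewhere. All names are invented for the example and no real volume manager works exactly like this.

```python
class VolumeGroup:
    """Toy volume manager: volumes are lists of (disk, extent number) pairs."""

    def __init__(self):
        self.free = []      # unallocated extents
        self.volumes = {}   # volume name -> list of extents

    def add_disk(self, disk, extents):
        self.free.extend((disk, number) for number in range(extents))

    def grow(self, volume, extents):
        if extents > len(self.free):
            raise ValueError("not enough free extents")
        taken, self.free = self.free[:extents], self.free[extents:]
        self.volumes.setdefault(volume, []).extend(taken)

    def evacuate(self, failing_disk):
        """Move every extent off a failing disk onto free space elsewhere."""
        self.free = [e for e in self.free if e[0] != failing_disk]
        for extents in self.volumes.values():
            for position, (disk, _) in enumerate(extents):
                if disk == failing_disk:
                    if not self.free:
                        raise ValueError("no free extents left to migrate to")
                    extents[position] = self.free.pop(0)


vg = VolumeGroup()
vg.add_disk("disk1", 2)
vg.grow("home", 2)          # /home fills the first disk
vg.add_disk("disk2", 3)
vg.grow("home", 1)          # more space for /home from the new disk
vg.evacuate("disk1")        # disk1 is developing faults
print(vg.volumes["home"])
```

A real system also has to copy the data in each extent while the file system stays in use, which is where most of the complexity lies.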

Volume management is for the time being an area where Linux is lacking.

One project is the virtual partition system project VPS <http://www-wsg.cso.uiuc.edu/~roth/> that will reimplement many of the volume management functions found in IBM's AIX system. Unfortunately this project is currently on hold.

Another project is the Logical Volume Manager <http://www.sistina.com/lvm/> project that is similar to a project by HP.

6.3. Linux md Kernel Patch

The Linux Multi Disk (md) provides a number of block level features in various stages of development.

RAID 0 (striping) and concatenation are very solid and of production quality, and RAID 4 and 5 are also quite mature.

It is also possible to stack some levels, for instance mirroring (RAID 1) two pairs of drives, each pair set up as striped disks (RAID 0), which offers the speed of RAID 0 combined with the reliability of RAID 1.
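How striping and such a stacked setup distribute data can be sketched as a simple address calculation. This is the textbook round robin layout and is only assumed here for illustration; it is not taken from the md source, and the function names are my own.

```python
def stripe_location(logical_chunk: int, drives: int):
    """RAID 0: map a logical chunk to (drive number, chunk offset on that drive)."""
    return logical_chunk % drives, logical_chunk // drives


def mirrored_stripe_locations(logical_chunk: int, drives_per_stripe: int):
    """Mirror of two striped sets: the same chunk lives once in each set."""
    drive, offset = stripe_location(logical_chunk, drives_per_stripe)
    return [(stripe_set, drive, offset) for stripe_set in (0, 1)]


# Two pairs of drives: each pair is striped, the pairs mirror each other.
for chunk in range(4):
    print(chunk, mirrored_stripe_locations(chunk, drives_per_stripe=2))
```

Consecutive chunks land on alternating drives, so a long transfer keeps both drives of a pair busy, while the second pair holds an identical copy.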

In addition to RAID this system offers (in alpha stage) block level volume management and soon also translucent file space. Since this is done on the block level it can be used in combination with any file system, even for fat using Wine.

Think very carefully what drives you combine so you can operate all drives in parallel, which gives you better performance and less wear. Read more about this in the documentation that comes with md.

Unfortunately the Linux software RAID has split into two trees, the old stable versions 0.35 and 0.42 which are documented in the official Software-RAID HOWTO <http://linas.org/linux/Software-RAID/Software-RAID.html> and the newer less stable 0.90 series which is documented in the unofficial Software RAID HOWTO <http://ostenfeld.dk/~jakob/Software-RAID.HOWTO/> which is a work in progress.

A patch for online growth of ext2fs <http://www-mddsp.enel.ucalgary.ca/People/adilger/online-ext2/> is available in early stages and related work is taking place at the ext2fs resize project <http://ext2resize.sourceforge.net/> at Sourceforge.

Hint: if you cannot get it to work properly you have forgotten to set the persistent-block flag. Your best documentation is currently the source code.

6.4. Compression

Disk compression versus file compression is a hotly debated topic especially regarding the added danger of file corruption. Nevertheless there are several options available for the adventurous administrators. These take on many forms, from kernel modules and patches to extra libraries but note that most suffer various forms of limitations such as being read-only. As development takes place at breakneck speed the specs have undoubtedly changed by the time you read this. As always: check the latest updates yourself. Here only a few references are given.

  • DouBle features file compression with some limitations.
  • Zlibc adds transparent on-the-fly decompression of files as they load.
  • There are many modules available for reading compressed files or partitions that are native to various other operating systems though currently most of these are read-only.
  • dmsdos <http://bf9nt.uni-duisburg.de/mitarbeiter/gockel/software/dmsdos/> (currently in version 0.9.2.0) offers many of the compression options available for DOS and Windows. It is not yet complete but work is ongoing and new features are added regularly.
  • e2compr is a package that extends ext2fs with compression capabilities. It is still under testing and will therefore mainly be of interest for kernel hackers but should soon gain stability for wider use. Check the e2compr homepage <http://e2compr.memalpha.cx/e2compr/> for more information. I have reports of speed and good stability which is why it is mentioned here.

6.5. ACL

Access Control List (ACL) offers finer control over file access on a user by user basis, rather than the traditional owner, group and others, as seen in directory listings (drwxr-xr-x). This is currently not available in Linux but is expected in kernel 2.3 as hooks are already in place in ext2fs.
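The difference between the two models can be shown with a small sketch. The first function checks the traditional owner, group and others bits; the second looks a user up in a per file list. Both are simplified illustrations with invented names, and the ACL part in particular is not the algorithm of any real ACL implementation.

```python
PERMISSION_BITS = {"r": 4, "w": 2, "x": 1}


def traditional_allows(mode, owner, group, user, user_groups, want):
    """Classic check: pick the owner, group or others bits, then test one bit."""
    if user == owner:
        shift = 6
    elif group in user_groups:
        shift = 3
    else:
        shift = 0
    return bool((mode >> shift) & PERMISSION_BITS[want])


def acl_allows(acl, user, want):
    """ACL check: each user can have an individual entry such as 'rw'."""
    return want in acl.get(user, "")


# drwxr-x---: owner alice, group staff, nothing for others.
mode = 0o750
print(traditional_allows(mode, "alice", "staff", "bob", {"staff"}, "w"))

# With an ACL bob alone can be given write access without changing the group.
acl = {"alice": "rwx", "bob": "rw"}
print(acl_allows(acl, "bob", "w"))
```

In the traditional model giving one extra user write access means widening the rights for a whole group, or creating a new group for the purpose.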

6.6. cachefs

This uses part of a hard disk to cache slower media such as CD-ROM. It is available under SunOS but not yet for Linux.

6.7. Translucent or Inheriting File Systems

This is a copy-on-write system where writes go to a different system than the original source while making it look like an ordinary file space. Thus the file space inherits the original data and the translucent write back buffer can be private to each user.
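The principle can be sketched with two dictionaries standing in for the original file space and a private write buffer. This is an illustration of the copy-on-write idea only, with invented names; a real implementation works on file systems or blocks and also has to handle deletions.

```python
class TranslucentSpace:
    """Reads fall through to the shared original; writes stay private."""

    def __init__(self, original):
        self.original = original   # shared and never modified
        self.private = {}          # this user's write-back buffer

    def read(self, path):
        if path in self.private:
            return self.private[path]
        return self.original[path]

    def write(self, path, data):
        self.private[path] = data


cdrom = {"/etc/motd": "Welcome", "/etc/hosts": "127.0.0.1 localhost"}
alice = TranslucentSpace(cdrom)
bob = TranslucentSpace(cdrom)

alice.write("/etc/motd", "Welcome, Alice")
print(alice.read("/etc/motd"))   # Alice sees her own change
print(bob.read("/etc/motd"))     # Bob still sees the original
```

Only the changed files take up extra space, and the original stays untouched.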

There are a number of applications:

  • updating a live file system on CD-ROM, making it flexible, fast while also conserving space,
  • original skeleton files for each new user, saving space since the original data is kept in a single space and shared out,
  • parallel project development prototyping where every user can seemingly modify the system globally while not affecting other users.

SunOS offers this feature and this is under development for Linux. There was an old project called the Inheriting File Systems (ifs) but this project has stopped. One current project is part of the md system and offers block level translucence so it can be applied to any file system.

Sun has an informative page <http://www.sun.ca/white-papers/tfs.html> on translucent file systems.

It should be noted that Clearcase (now owned by Rational) <http://www.rational.com> pioneered and popularized translucent filesystems for software configuration management by writing their own UNIX filesystem.

6.8. Physical Track Positioning

This trick used to be very important when drives were slow and small, and some file systems used to take the varying characteristics into account when placing files. Since then higher overall speed, on-board drive and controller caches, and drive intelligence have reduced the effect of this.

Nevertheless there is still a little to be gained even today. As we know, "world dominance" is soon within reach but to achieve this "fast" we need to employ all the tricks we can use.

To understand the strategy we need to recall this near ancient piece of knowledge and the properties of the various track locations. This is based on the fact that transfer speeds generally increase for tracks further away from the spindle, as well as the fact that it is faster to seek to or from the central tracks than to or from the inner or outer tracks.

Most drives use disks running at constant angular velocity but use (fairly) constant data density across all tracks. This means that you will get much higher transfer rates on the outer tracks than on the inner tracks, a characteristic which fits the requirements for large libraries well.
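The relationship can be sketched in a few lines of Python. With a fixed rotational speed and a fixed number of bits per millimetre of track, the data passing under the head per revolution grows with the circumference and therefore with the radius. All the numbers below (radii, density, RPM) are assumptions picked for illustration and do not describe any particular drive.

```python
import math


def transfer_rate(radius_mm, rpm, bits_per_mm):
    """Media transfer rate in bytes per second for a track at the given radius."""
    bits_per_track = 2 * math.pi * radius_mm * bits_per_mm
    revs_per_second = rpm / 60.0
    return bits_per_track * revs_per_second / 8


# Hypothetical drive parameters, for illustration only.
RPM = 7200
BITS_PER_MM = 10_000
INNER_MM, OUTER_MM = 20.0, 45.0

inner = transfer_rate(INNER_MM, RPM, BITS_PER_MM)
outer = transfer_rate(OUTER_MM, RPM, BITS_PER_MM)
print(f"inner track: {inner / 1e6:.1f} MB/s")
print(f"outer track: {outer / 1e6:.1f} MB/s")
print(f"outer/inner ratio: {outer / inner:.2f}")
```

In this simple model the ratio of the transfer rates is just the ratio of the radii, whatever density and speed are chosen.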

Newer disks present a logical geometry that differs from the actual physical layout, and the drive itself translates between the two transparently. This makes the estimation of the "middle" tracks a little harder.

In most cases track 0 is at the outermost track and this is the general assumption most people use. Still, it should be kept in mind that there are no guarantees this is so.

     Inner
        tracks are usually slow in transfer, and lying at one end of the
        seeking range they are also slow to seek to.

        This is more suitable to the low end directories such as DOS,
        root and print spools.

     Middle
        tracks are on average faster with respect to transfers than
        inner tracks and being in the middle also on average faster to
        seek to.

        These characteristics are ideal for the most demanding parts such
        as swap, /tmp and /var/tmp.

     Outer
        tracks have on average even faster transfer characteristics but
        like the inner tracks are at one end of the seek range, so
        statistically they are as slow to seek to as the inner tracks.

        Large files such as libraries would benefit from a place here.

Hence seek time reduction can be achieved by positioning frequently accessed tracks in the middle so that the average seek distance and therefore the seek time is short. This can be done either by using fdisk or cfdisk to make a partition on the middle tracks or by first making a file (using dd) equal to half the size of the entire disk before creating the files that are frequently accessed, after which the dummy file can be deleted. Both cases assume starting from an empty disk.
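A rough feel for the gain can be had from a small Python calculation. It assumes, as a simplification, that the head is equally likely to be at any track when a request for the frequently accessed data arrives, and it counts distance in tracks rather than time; real seek time does not grow linearly with distance. The track count is hypothetical.

```python
def mean_seek_distance(target, tracks):
    """Mean distance, in tracks, from a uniformly random head position to target."""
    return sum(abs(start - target) for start in range(tracks)) / tracks


TRACKS = 1000                       # hypothetical drive size
edge = mean_seek_distance(0, TRACKS)
middle = mean_seek_distance(TRACKS // 2, TRACKS)
print(f"hot data on track 0:      {edge:.1f} tracks on average")
print(f"hot data on middle track: {middle:.1f} tracks on average")
print(f"ratio: {middle / edge:.2f}")
```

Under these assumptions, placing the busy data in the middle roughly halves the average seek distance compared with placing it at either end.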

The latter trick is suitable for news spools where the empty directory structure can be placed in the middle before putting in the data files. This also helps reduce fragmentation a little.

This little trick can be used both on ordinary drives and on RAID systems. In the latter case the calculation for centring the tracks will be different, if it is possible at all. Consult the latest RAID manual.

The speed difference this makes depends on the drives, but a 50 percent improvement is a typical value.

6.8.1. Disk Speed Values

The same mechanical head disk assembly (HDA) is often available with a number of interfaces (IDE, SCSI etc) and the mechanical parameters are therefore often comparable. The mechanics are today often the limiting factor but development is improving things steadily. There are two main parameters, usually quoted in milliseconds (ms):

  • Head movement - the speed at which the read-write head is able to move from one track to another, called access time. If you do the mathematics and doubly integrate the seek distance, first across all possible starting tracks and then across all possible target tracks, you will find that the average seek is equivalent to a stroke across a third of all tracks.
  • Rotational speed - which determines the time taken to get to the right sector, called latency.
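The one-third figure for the average seek can be checked numerically. The Python sketch below replaces the double integral with a double sum over a finite number of tracks, assuming every start and target track is equally likely. The function name is made up for the illustration.

```python
from fractions import Fraction


def mean_seek_fraction(tracks):
    """Mean |start - target| over all track pairs, as a fraction of the track count."""
    total = sum(abs(a - b) for a in range(tracks) for b in range(tracks))
    mean_distance = Fraction(total, tracks * tracks)
    return mean_distance / tracks


for n in (10, 100, 500):
    print(f"{n:>4} tracks: average seek = {float(mean_seek_fraction(n)):.4f} of a full stroke")
```

The exact value for n tracks is (n*n - 1) / (3*n*n), which approaches one third as the number of tracks grows.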

After voice coils replaced stepper motors for the head movement the improvements seem to have levelled off and more energy is now spent (literally) at improving rotational speed. This has the secondary benefit of also improving transfer rates.

Some typical values:

                                  Drive type
       Access time (ms)        | Fast   Typical   Old
       ----------------------------------------------
       Track-to-track             <1       2       8
       Average seek               10      15      30
       End-to-end                 10      30      70

This shows that the very high end drives offer only marginally better access times than the average drives but that the old drives based on stepper motors are significantly worse.

       Rotational speed (RPM)  |  3600 |  4500 |  4800 |  5400 |  7200 | 10000
       -------------------------------------------------------------------------
       Latency (ms)            |    17 |    13 |  12.5 |  11.1 |   8.3 |   6.0

The figures above are the time for one complete revolution, which is the longest you can wait for a given sector; on average the wait is half of this. The formula is quite simply

latency (ms) = 60000 / speed (RPM)
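The table values can be reproduced from the formula above with a few lines of Python. Note that the quotient is the time for one complete revolution. On average the wanted sector is half a revolution away, so the sketch prints both figures. The function name is made up for the illustration.

```python
def revolution_time_ms(rpm):
    """Time for one full revolution in milliseconds: 60000 / RPM."""
    return 60000.0 / rpm


for rpm in (3600, 4500, 4800, 5400, 7200, 10000):
    full = revolution_time_ms(rpm)
    print(f"{rpm:>6} RPM   full turn {full:5.1f} ms   half turn {full / 2:5.1f} ms")
```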

Clearly this too is an example of diminishing returns for the efforts put into development. However, what really takes off here is the power consumption, heat and noise.

6.9. Yoke

There is also a Linux Yoke Driver <http://www.it.uc3m.es/cgi-bin/ptb/cvs-yoke.cgi> available in beta which is intended to do hot-swappable transparent binding of one Linux block device to another. This means that if you bind two block devices together, say /dev/hda and /dev/loop0, writing to one device will also write to the other and reading from either will yield the same result.

6.10. Stacking

One of the advantages of a layered design of an operating system is that you have the flexibility to put the pieces together in a number of ways. For instance you can use cachefs to cache a CD-ROM on a volume that is striped over 2 drives. This in turn can be set up translucently with a volume that is NFS mounted from another machine. RAID can be stacked in several layers to offer very fast seek and transfer in such a way that it will work even if 3 drives fail. The choices are many, limited only by imagination and, probably more importantly, money.

6.11. Recommendations

There is a near infinite number of combinations available but my recommendation is to start off with a simple setup without any fancy add-ons. Get a feel for what is needed, where the maximum performance is required, whether it is access time or transfer speed that is the bottleneck, and so on. Then phase in each component in turn. As you can stack quite freely you should be able to retrofit most components as time goes by with relatively few difficulties.

RAID is usually a good idea but make sure you have a thorough grasp of the technology and a solid backup system.

7. Other Operating Systems

Many Linux users have several operating systems installed, often necessitated by hardware setup systems that run under other operating systems, typically DOS or some flavour of Windows. A small section on how best to deal with this is therefore included here.

7.1. DOS

Leaving aside the debate on whether or not DOS qualifies as an operating system one can in general say that it has little sophistication with respect to disk operations. The more important result of this is that there can be severe difficulties in running various versions of DOS on large drives, and you are therefore strongly recommended to read the Large Drives mini-HOWTO. One effect is that you are often better off placing DOS on low track numbers.

Having been designed for small drives it has a rather unsophisticated file system (fat) which when used on large drives will allocate enormous block sizes. It is also prone to block fragmentation which will after a while cause excessive seeks and slow effective transfers.
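To get a feel for the block sizes involved, here is a small Python sketch. It assumes the 16 bit variant of fat, where cluster numbers are 16 bits wide so that a partition can hold at most about 65536 clusters (the real limit is slightly lower), and it assumes cluster sizes are powers of two starting at one 512 byte sector. The function names are made up for the illustration, and the result is the smallest cluster size that fits, not necessarily what a given format program picks.

```python
MAX_CLUSTERS = 2 ** 16   # assumed upper bound; the real limit is slightly lower
SECTOR = 512             # bytes, assumed smallest cluster size
MB = 1024 * 1024


def cluster_size(partition_bytes):
    """Smallest power-of-two cluster size that keeps the cluster count in range."""
    size = SECTOR
    while partition_bytes > size * MAX_CLUSTERS:
        size *= 2
    return size


def wasted_bytes(file_sizes, cluster):
    """Slack left in the last, partly filled cluster of each file."""
    return sum(-size % cluster for size in file_sizes)


for mb in (100, 400, 1000, 2000):
    print(f"{mb:>5} MB partition -> {cluster_size(mb * MB):>6} byte clusters")

# Hypothetical example: a thousand small files of 1000 bytes each.
small_files = [1000] * 1000
print(wasted_bytes(small_files, cluster_size(2000 * MB)), "bytes of slack")
```

Every file occupies a whole number of clusters, so many small files on a large partition waste a considerable amount of space.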

One solution to this is to use a defragmentation program regularly but it is strongly recommended to back up data and verify the disk before defragmenting. All versions of DOS have chkdsk that can do some disk checking; newer versions also have scandisk which is somewhat better. There are many defragmentation programs available, and some versions of DOS have one called defrag. Norton Utilities has a large suite of disk tools and there are many others available too.

As always there are snags, and this particular snake in our drive paradise is called hidden files. Some vendors started to use these for copy protection schemes, and such files would not take kindly to being moved to a different place on the drive, even if they remained in the same place in the directory structure. The result of this was that newer defragmentation programs will not touch any hidden file, which in turn reduces the effect of defragmentation.

Being a single tasking, single threading and single most other things operating system there is very little to gain from using multiple drives unless you use a drive controller with built in RAID support of some kind.

There are a few utilities called join and subst which can do some multiple drive configuration but there is very little gain for a lot of work. Some of these commands have been removed in newer versions.

In the end there is very little you can do, but not all hope is lost. Many programs need fast, temporary storage, and the better behaved ones will look for environment variables called TMPDIR or TEMPDIR which you can set to point to another drive. This is often best done in autoexec.bat.


SET TMPDIR=E:/TMP
SET TEMPDIR=E:/TEMP

Not only will this possibly gain you some speed but it can also reduce fragmentation.
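How a well behaved program might honour these settings can be sketched in Python. This is an illustration only: the function name, the lookup order and the fallback directory are assumptions, and real programs differ in which variables they check.

```python
import os


def pick_temp_dir(environ, names=("TMPDIR", "TEMPDIR"), fallback="C:/"):
    """Return the directory named by the first variable that is set, else fallback."""
    for name in names:
        value = environ.get(name)
        if value:                 # skip unset or empty variables
            return value
    return fallback


print(pick_temp_dir(os.environ))               # whatever this machine has set
print(pick_temp_dir({"TEMPDIR": "E:/TEMP"}))   # only TEMPDIR set
```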

There have been reports about difficulties in removing multiple primary partitions using the fdisk program that comes with DOS. Should this happen you can instead use a Linux rescue disk with Linux fdisk to repair the system.

Don't forget there are alternatives to DOS, the most well known being DR-DOS <http://www.caldera.com/dos/> from Caldera <http://www.caldera.com/>. This is a direct descendant of DR-DOS from Digital Research. It offers many features not found in the more common DOS, such as multi tasking and long filenames.

Another alternative which also is free is Free DOS <http://www.freedos.org/> which is a project under development. A number of free utilities are also available.

7.2. Windows

Most of the above points are valid for Windows too, with the exception of Windows 95, which apparently has better disk handling and so gets better performance out of SCSI drives.

A useful thing is the introduction of long filenames. To read these from Linux you will need the vfat file system for mounting these partitions.

Disk fragmentation is still a problem. Some of this can be avoided by doing a defragmentation immediately before and immediately after installing large programs or systems. I use this scheme at work and have found it to work quite well. Purging unused files and emptying the waste basket first can improve defragmentation further.

Windows also uses swap space, and redirecting this to another drive can give you some performance gains. There are several mini-HOWTOs telling you how best to share swap space between various operating systems.

The trick of setting TEMPDIR can still be used but not all programs will honour this setting. Some do, though. To get a good overview of the settings in the control files you can run sysedit which will open a number of files for editing, one of which is the autoexec file where you can add the TEMPDIR settings.

Many of the temporary files are located in the /windows/temp directory and changing this is more tricky. To achieve this you can use regedit which is rather powerful and quite capable of rendering your system in a state you will not enjoy, or more precisely, in a state much less enjoyable than windows in general. Registry database error is a message that means seriously bad news. Also you will see that many programs have their own private temporary directories scattered around the system.

Setting the swap file to a separate partition is a better idea and much less risky. Keep in mind that this partition cannot be used for anything else, even if there should appear to be space left there.

It is now possible to read ext2fs partitions from Windows, either by mounting the partition using FSDEXT2 <http://www.yipton.demon.co.uk/> or by using a file explorer like tool called Explore2fs <http://uranus.it.swin.edu.au/~jn/linux/explore2fs.htm>.

7.3. OS/2

The only special note here is that you can get a file system driver for OS/2 that can read an ext2fs partition. Matthieu Willm's ext2fs Installable File System for OS/2 can be found at ftp-os2.nmsu.edu <ftp://ftp-os2.nmsu.edu/pub/os2/system/drivers/filesys/ext2_240.zip>, Sunsite
<ftp://sunsite.unc.edu/p