[Linux] Ram Disks

linux@flux.org linux@flux.org
Tue, 15 Jan 2008 19:15:50 -0500


On Jan 15, 2008 1:03 PM, Robert Citek - robert.citek@gmail.com

Some very instructive things about ramdisks.

I note that Robert did this all in one script - ensuring that it was
run in one operation.

The ramdisk must be re-formatted every time you boot - its contents
are volatile. This would mean that once you set your ramdisk up, your
application would have to:

1.  Copy its files to the ramdisk.
2.  Do its processing on the files in the ramdisk.
3.  Copy its files back to the hard disk from the ramdisk.

In step three you probably want to copy the file to a temp name; then,
once the file is copied and sync()ed, rename it to an incoming name,
rename the old file to .old, and finally rename the incoming file to
the main name - so that if the process is interrupted, a restart can
see how far it got. If the file still has the temp name, there is no
assurance that it was completely copied, and the batch has to be rerun
against the same starting file; once the later renames have happened,
the copy can be trusted.
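
A minimal sketch of that rename sequence in Python, for illustration
only. The function name and the .tmp/.incoming/.old suffixes are my own
assumptions, not anything the application already uses, and it assumes
the ramdisk copy and the final file are on different file systems while
all the renames happen on the hard disk.

```python
import os
import shutil


def publish_result(ramdisk_path, final_path):
    """Copy ramdisk_path to final_path without ever exposing a half-written file."""
    tmp_path = final_path + ".tmp"            # hypothetical naming scheme
    incoming_path = final_path + ".incoming"
    old_path = final_path + ".old"

    # 1. Copy to the temp name and force the data out to the disk.
    with open(ramdisk_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())

    # 2. The copy is complete, so mark it with the incoming name.
    os.replace(tmp_path, incoming_path)

    # 3. Keep the previous version as .old, if there is one.
    if os.path.exists(final_path):
        os.replace(final_path, old_path)

    # 4. Promote the new file to the main name.
    os.replace(incoming_path, final_path)
```

On restart, a leftover .tmp file means the copy may be incomplete and
the batch should be rerun; a leftover .incoming file means the copy
finished and only the last renames remain.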

And every time you reboot you have to do the mke2fs and the mount - in
that order, which means that the ramdisk can't be automounted.

But I think that you came to us for a ramdisk and there was a second
question inherent in the first, and that the second question is more
important than the first one, and that question is, "How do I make
this application run faster?"  A ramdisk is a possibility, but my guess
is that it won't help as much as you think it will - ramdisks help for
applications that write to files more than those that just read files
because what you are saying when you make a ramdisk is that the writes
don't really need to be flushed to disk. The reason is that the system
already has sort of a built in ramdisk and all of the files in the
system participate in the ramdisk.

The pages in the system are used for memory for applications, of
course, but they are also used for I/O. When pages are read from disk,
they occupy memory space, and when they are written back (if they are
changed) they are eligible for reuse. But if the system needs the data
from that disk file again before it is reused it simply maps the page
and uses it. And it usually limits the pages used for I/O to a
fraction of those used for any other purpose - because it is trying to
preserve the pages of your programs, which is usually the right thing
to do.

Now, Red Hat 9 shipped with a 2.4 kernel, and 2.4 is where something
called tmpfs was added.  How to use it is explained in this article:
http://www.ibm.com/developerworks/library/l-fs3.html

It is a better mousetrap - a ramdisk you can make any size your paging
system can handle without any preparation. It can even be put into
/etc/fstab so that it is automatically created and mounted when you
boot.

Here is a practical example from an older system of mine:

 mount -t tmpfs -o size=1G,nr_inodes=4k tmpds /tmp/shared_volatile

makes me a 1 gig file system, mounted on /tmp/shared_volatile, with no
other preparation or work.  No formatting, etc. All I had to do was to
make a directory to mount it on.  (man mount for other options.) Will
the use of this compromise my virtual memory system?

dd if=/dev/zero of=/tmp/shared_volatile/bar count=800000 bs=1k
800000+0 records in
800000+0 records out
[root@quickdraw tmp]#

It is pretty clearly overcommitted - and has moved this file to disk.
I can tell because when I run something like:

[root@quickdraw tmp]# time cat /tmp/shared_volatile/bar > /dev/null

real    0m42.276s
user    0m0.190s
sys     0m1.660s
[root@quickdraw tmp]#

Also because I selected that file size to be close to the real memory
size on this system.

A lot of that time was spent waiting, probably for pages.  The top
command showed that a large hunk of my paging space was now committed
and in use, and when I umounted the new file system, it was all freed.
I have very little committed paging space - as I suggest below, it is
mostly unused.

Finally, I compared the time for the dd and cat commands to real disk
and to tmpfs.  The commands were MUCH FASTER running to my real disk
than to tmpfs.  This is for files that were much bigger than any
available memory, implying that on my system the scheduling of writes
and reads to the real disk is better than the paging that tmpfs falls
back on.

I did some experiments using smaller files that easily fit into
memory.  They were faster to write - because there is no need to
commit to disk - but NO FASTER TO READ.  Of course these files were
already in the memory cache.

Trying to flush the memory cache by writing a new very large file
resulted in very slightly slower performance on the first read from
the real disk.

The points that make this salient to this discussion are that there is
no need to run mke2fs - the tmpfs file system knows what it is and
automatically shows a file system, and it is also expandable to a
maximum size big enough to hold your file for testing, without
rebooting. With fewer levels of abstraction, you *should* get better
performance.  This means that you have to have a virtual memory system
which is big enough, and ideally, that your real memory is big enough
so that the paging files are never used. But there are problems and
you can't tell for sure what you will see until you try.

From what you have described, a ramdisk will have no advantage over a
system that has a large enough buffer space such that the file is
completely held in memory, and may well be slower. But it is simple to
test with tmpfs, simpler than with an old style ramdisk, and tmpfs
will likely be the better of the two. For reads, though, there will
probably be no advantage over simply having enough memory to hold the
file in "buffers".

A newer kernel will likely let you adjust the external tuning for this
application better.

But all of these file systems are "volatile", meaning that if you
reboot, you get no contents.

It really sounds like what this application needs is to be rewritten
using data base tech, and the database needs to be tuned to the
application.

The linux HARDWARE file system would naturally cache, (as I
mentioned), in memory, the disk blocks that were read, were there room
for them in memory.  That is, the most recently used disk blocks,
whatever they are, should be in memory. When they are dirty (changed)
they will be written to disk.  But they won't be wiped until they are
needed.  If the e2fs file system calls for the same blocks and they
have not been reused for some other purpose, then they will be served
from memory.  Sorta like a ramdisk.

This is tunable.  /proc/sys/vm/pagecache on recent kernels can be
set with sysctl. I am not sure it can be set on old kernels. I believe
that the top command will tell you how much memory is used for
"buffers" which is this sort of memory. "sysctl -a | less" might be
instructive too.
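
If you would rather see the numbers directly than read them off top,
here is a small sketch in Python that pulls the "Buffers" and "Cached"
figures out of /proc/meminfo. The function names are mine; the only
assumption about the system is that /proc/meminfo exists and has the
usual "Name:  value kB" lines.

```python
def parse_meminfo(text):
    """Return {field: kilobytes} parsed from the text of /proc/meminfo."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields):
    """Return (kB held as buffers + page cache, kB of total memory)."""
    cached = fields.get("Buffers", 0) + fields.get("Cached", 0)
    return cached, fields.get("MemTotal", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
    except OSError:
        print("no /proc/meminfo here (not a Linux system?)")
    else:
        cached, total = cache_summary(info)
        print(f"{cached} kB of {total} kB is buffers + page cache")
```

Run it before and after the application makes a pass over its file; if
the cached figure grows by roughly the size of the file, the file is
already sitting in memory.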

Of course, this describes a memory disk as well.  These pages are not,
as I recall, actually kept in memory to the exclusivity of other
applications - they compete on the same scheme as other pages, and are
pushed out when the machine is more busy. Buffer memory is generally
considered less likely to be needed again, so it will be reclaimed
more easily.

So, what is my point?  Suppose this application had a one gig file,
and suppose it was the only thing running on a machine that had six
gigs of memory, so that the page files might be completely unused and
there is plenty of room for everything.

The first time through the disk file, all the pages get read in - and
they are then memory resident.  When the application wants to read
the disk file again, and again, and again, it does not really go to
disk.  It finds the page in memory.

How then does a ram disk help?

It does not help.  It is just another way of keeping the pages in
memory, another level of abstraction.  The real benefit that you get
is that the pages may have slightly more priority to memory residence
than clean disk file pages - and they will be paged out to page space,
which, if you have a highly tuned system, will be on its own spindle,
perhaps on its own controller. Of course, more likely it is not; it is
on the same disk as your data but in a separate partition, like most
people's page space - so that it is physically separated from your
data and accessing it causes a big seek.  But if you have enough
memory, it does not matter - you never touch your page space.  And
that is what good paging performance is these days:  enough memory to
never page. ****

Where is your overhead?  The real overhead here is NOT reading the
pages from disk, most likely, it is in finding them in memory.  And
there is the overhead of making the read() system call, which changes
the state from user to kernel. And back again.

And digging through the file system, and figuring out which disk
blocks this read() needs and then finding them in memory.

Of course, if you are short of memory and the app is thrashing your
system's disk block cache by reading through the data over and over,
then you might be hitting the real disk.

Have to know more about the app to make a better guess. And more about
your system, how much stuff you run on it, how into paging you are.

Look at the CPU usage when this application runs - and the paging rate
and the rate at which pages are invalidated (that is, the clean pages
are reused for some other purpose, which means that if the data is
needed again it has to be read from disk).  Simply looking at top
might be instructive.  When this app runs by itself is the system at
100% CPU or at 5% CPU?  If at 100%, what is the system/user split?
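
top shows this split interactively. If you want it in a log, a rough
sketch in Python that samples the aggregate "cpu" line of /proc/stat
twice and reports the split might look like this. The function names
are mine, it only looks at the first five counters (user, nice, system,
idle, iowait), and it treats iowait as zero on kernels that do not
report it.

```python
import time

FIELDS = ("user", "nice", "system", "idle", "iowait")


def parse_cpu_line(stat_text):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            values += [0] * (len(FIELDS) - len(values))  # older kernels lack iowait
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_split(before, after):
    """Percentage of the sampled interval spent in each of FIELDS."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values()) or 1   # avoid dividing by zero
    return {k: 100.0 * v / total for k, v in deltas.items()}


def read_stat():
    with open("/proc/stat") as f:
        return parse_cpu_line(f.read())


if __name__ == "__main__":
    try:
        first = read_stat()
        time.sleep(1)
        second = read_stat()
    except OSError:
        print("no /proc/stat here (not a Linux system?)")
    else:
        for name, pct in cpu_split(first, second).items():
            print(f"{name:7s} {pct:5.1f}%")
```

High idle or iowait while the app runs says it is waiting for I/O; high
user or high system time leads to the other two cases.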

If the CPU is low, the app is waiting for I/O.  Try simply adding more
memory to the system.

If the CPU is high, but mostly user, you are in a hole that only an
app fix can dig you out of.

If the CPU is high, like 100% but it is mostly system, then your
problem is not in the app and memory won't help, the problem is that
it is digging back to the system for the blocks of the file it needs
too often.  Sometimes just upping the buffer size after fopen() (with
setvbuf() in C, or whatever the call is that sets the buffering in your
language) will help.  But there are a couple techniques that will
really help.
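
As an illustration of the buffering idea only, here it is in Python,
where the knob is the buffering argument to open(). The function name,
the fixed-size record layout and the one megabyte buffer are all
assumptions of mine; I do not know how your application reads its file.

```python
def count_records(path, record_size, buffer_size=1 << 20):
    """Count whole fixed-size records, reading through a large user-space buffer."""
    count = 0
    # With a big buffer, most of the small read() calls below are served
    # from memory in the process instead of each one entering the kernel.
    with open(path, "rb", buffering=buffer_size) as f:
        while True:
            record = f.read(record_size)
            if len(record) < record_size:
                break               # end of file (a partial record is ignored)
            count += 1
    return count
```

The application logic does not change at all; only the number of trips
into the system does.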

Copying the file in to a ramdisk puts another layer of abstraction in
- which in some cases may make your application run slower.  Unless it
is waiting for I/O a fair amount. In which case all you are doing is
changing the priority of which pages are selected for paging out.
Maybe, on my system it acted like tmpfs was part of "buffer".  Or
maybe it is easier to find the pages.

So then how do you fix this?

Well, what does it need from the file?  If it is summing a field,
putting the file into a database and indexing that field can be
enormously helpful. Any decent web page on database tuning will tell
you why.

Is it reading through the entire file to find matches?  Again,
indexing the fields that it is matching may be enormously helpful.

Suppose it really needs to look at all of these 50 million records?
Suppose it needs to look at those fields as quickly as possible?

Simple answer:  Memory map the file, access it as a large array. If
the program is in C++, maybe hide the access as a class.  Suddenly,
there are no crosses over to the system abstraction layer if the data
is still in memory. First time the memory maps in, it goes through the
e2fs layer and brings the pages in to memory. If they are dirtied they
get written out.  And if they are dismissed they are read back from
the file.  And they don't get read in using e2fs, they are, AFAIK
(based on what I've read about other systems and some
non-authoritative stuff about Linux) mapped by the paging layer to the
actual allocated blocks in the E2FS system so it is just a page fault,
a much shorter code trip through the system.  But they probably have
the same "keep them in memory" as other program pages, so reaccessing
them will be much more dependent on the total application memory load.
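
A rough sketch of the idea, in Python rather than C++ to keep it short.
The record layout (a 4-byte integer followed by a 12-byte name) and the
function name are purely hypothetical; substitute whatever the real
records look like.

```python
import mmap
import struct

# Hypothetical record layout: little-endian 4-byte amount, 12-byte name.
RECORD = struct.Struct("<I12s")


def sum_amounts(path):
    """Sum the amount field of every record by indexing into the mapped file."""
    total = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are faulted in on first touch.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for offset in range(0, len(m) - RECORD.size + 1, RECORD.size):
                amount, _name = RECORD.unpack_from(m, offset)
                total += amount
    return total
```

There is no read() call per record here; the file is treated as one
large array. Note that Python refuses to map an empty file (it raises
ValueError), so a real program would check for that first.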

Now, memory mapping can speed up an application.  Sometimes what
speeds the app up is simply the discipline of not copying the data
from the system buffer to the user buffer, or reading the file in
whole page sized chunks.  But if someone told me that they did not
want to change an application's main logic (and the app was written
like it was accessing card files, read it over from the beginning to
find records) then I would recommend memory mapping.

Of course, they could just read the whole file into a one gigabyte
array, process it in that array, and then write it out at the end and
do the same atomic renames.  Because you are proposing doing just that
with your ramdisk, only you will have way more overhead.

And maybe they should read a book on how an IBM 077 collator worked,
so that they could learn some advanced match-merge algorithms.  Like
sorting the input before processing and then merging through a sorted
file so that they did not have to reread the whole file for every
record.
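
For anyone who has not seen a match-merge, here is a minimal sketch in
Python. It assumes both inputs are already sorted by key and that keys
are unique within each input; the names master and updates are just
labels I picked.

```python
def merge_join(master, updates):
    """Yield (key, master_value, update_value) for keys present in both inputs.

    Both inputs are iterables of (key, value) pairs sorted by key.
    Each input is read exactly once, front to back.
    """
    master_iter, update_iter = iter(master), iter(updates)
    m = next(master_iter, None)
    u = next(update_iter, None)
    while m is not None and u is not None:
        if m[0] < u[0]:
            m = next(master_iter, None)      # master is behind, advance it
        elif m[0] > u[0]:
            u = next(update_iter, None)      # updates is behind, advance it
        else:
            yield m[0], m[1], u[1]           # keys match
            m = next(master_iter, None)
            u = next(update_iter, None)
```

One sort of each file plus one pass through both replaces a full reread
of the big file for every record in the small one.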

No, today's programmers simply never learned basic data processing.
And that is OK, because when they have a need to do that they can use
SQL, which does all that a heck of a lot better than they otherwise
could.  But sometimes you need to not use database tech, and that is
where understanding how it used to work helps.

Many years ago, when I was a systems programmer on an IBM mainframe
(MVS, before I started using Unix), we had a string of disks (big ones
for the time, 30 MB each, 8 to a string) that were performing very
badly, and it was our main database string for IMS, so the whole shop
appeared to be suffering. A string was under a single controller, on a
single I/O channel (aggregate max speed 1.5 MBytes/sec) and only one
drive could be transferring or searching at a time, although you could
release the channel while the disk was spinning around to the correct
sector.

This was a problem and no one else in my group could figure it out
(even back then they didn't like me, which meant I was only called
when it was really interesting and no one else could or wanted to do
it, so I never tried to change the situation, and later the habit
would hurt my career a little) so the manager finally called me. I
cranked up a wonderful program called Omegamon (which I ordered for
the installation with a bunch of hardware to support it and they
(other guys in my group, managers were neutral) let me because they
thought that it would be such a big mistake that they could fire me
for it, and it saved us like 10 times my annual salary in lost time)
and it showed that a very important application was running very
poorly. Anyone could have done this, but it is not clear that they
would have understood it.

The application was searching the disk using the channel and spinning
the disk many times in a single search - and this tied up the channel
while using no CPU time - but no other disks on that channel could be
accessed.  There were 60 surfaces on the disk and they had written the
program to search all 60 surfaces - their index just told them what
cylinder a record might be on, not what track.  So a read would spin
the disk up to sixty times, and an average of 30 times, at 3600 RPM.
8/1000ths of a second for a read, 16/1000ths of a second for a record
insert, an incredibly long time to hang up a channel in the mainframe
days, in a single I/O operation.

(You had a lot more control over the file system in those days, the
hard disk areas allocated to an application were custom formatted for
the application the first time they were written, just as a matter of
course, and the files were not byte streams, they were record streams,
and records were sort of in the hardware, as were arbitrary keys.
Turned out to be a good idea at the time (to move some of the
processing out of the main computer and into the channel - making an
IBM mainframe a 17 way multiprocessor - but 16 of the processors were
very, very stupid, could only schedule and perform I/O**, and were not
full turing machines) and a bad idea overall, as disks got cheaper,
processors got faster, and then, suddenly, microcomputer makers
discovered multiple bus masters to offload I/O from the main processor
and it was a good idea again - well, not hardware arbitrary keys on
disk, or application reformatting, but special processors that could
access memory and I/O devices so that the main processor did not have
to spin while reading a record.  They were searching for a key on a
record using the channel - but they were searching a whole cylinder.)

So I called the application programmers. Brought them down to the
glass house (where they were not usually allowed, big deal to get
permission, except that I was their escort), showed them the display,
explained what was happening and what they were doing to the system.
They looked at what they were doing to the system, and said, "We can
fix this," and made the application changes that night. Once I
finished my explanation their elapsed time to solution was under 30
seconds. I think they increased the size of their index and reduced
their hardware search span. Simple parameter changes, not even sure
they had to recompile. Their app ran faster too, like 1/5th of the
original clock time. So when you do tuning like this, once you have
come to a conclusion, involve the application people. They might just
take the problem off of your hands.

This app really sounds like it needs an index - or just an array. It
needs that way more than a system fix.  And it might be improved by
way more memory without any other changes, else a ramdisk will probably
not help as much as you might think. "Application changes are
impossible" translates to (managerese->English) "application changes
are expensive". It is always a matter of money - if it is important
enough you can rewrite or reverse engineer even if you do not have the
source. They (management) think you can quick fix something by
throwing hardware at it, but if the application is written badly
enough there may not be enough hardware extant in the world to fix it.
Or there might be. If this is important enough, a dedicated server
with plenty of memory and its own disks could be cheap enough. If it
looks like that would help.

Good luck. I've been in your position.

-----------

****In the old days, we would run a system with 8 megs of memory, and
we would overcommit by 8 to 1.  The point was that with channels of
1.5 megabytes per second and a system of 8 megs, you could fill all of
your memory in a few seconds - so you were never that far from
pagefaulting (or swapping, in segment machines) an entire application
in.

Now I/O speeds are approaching 20 megabytes per second per device for
the cheapest class of device - I copied some big files from my laptop
to a USB drive and they copied at 20 megabytes a second (aggregate
data rate 40 MB/sec, 20 in, 20 out) - meaning that the one gig of
memory in my system could be filled from the internal hard disk in a
theoretical time of 50 seconds.  You've noticed this - gone to an app
you have not used for a long time and it takes a very long time to get
response - because it is being read in to memory from the page file.
This is way slower than it used to be, in proportion, even though
everything is hugely faster and it explains why we can no longer
afford to page.  I used to give users a megabyte and a half of
unshared space to run their applications in, and they would usually
use 300k. Even if they were all paged out they could get all paged in
in a third of a second.

In today's world, this would mean that you would want to give an
application a maximum of 7 meg so that it could be paged in - in that
same third of a second.  This is, of course, impossible, applications
are many times larger than that - which is why, when we use an app we
have not used in a while, that the system can take 20-30 plus seconds
to stabilize with its new working set, which might be more time
consuming than simply restarting the application. Windows really sucks
at this whole process. Vista is worse than XP was.
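
The arithmetic behind those figures, as a small Python sketch. The
rates and sizes are the ones quoted above (1.5 MB/sec and 8 megs then,
20 MB/sec and one gig now, with a gig taken as 1000 MB); the function
names are mine.

```python
def seconds_to_fill(memory_mb, rate_mb_per_s):
    """Time to page in memory_mb megabytes at a sustained transfer rate."""
    return memory_mb / rate_mb_per_s


def mb_pageable_in(seconds, rate_mb_per_s):
    """How many megabytes can be paged in within the given time."""
    return seconds * rate_mb_per_s


if __name__ == "__main__":
    print(seconds_to_fill(8, 1.5))       # old mainframe: 8 megs over one channel
    print(seconds_to_fill(1000, 20))     # today: one gig from a 20 MB/sec disk
    print(mb_pageable_in(1 / 3, 20))     # what fits in a third of a second today
```

That works out to a bit over 5 seconds then against 50 seconds now, and
a little under 7 megs for the third-of-a-second budget.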

Virtual memory is a bad thing in a lot of ways given today's ratios,
but there is a simple solution, you don't use it.  You get enough
memory so that you can simply run your application mix and never put
anything important into the page space. Today's virtual memory systems
know this so they abandon disk pages before application pages because
it is too expensive to swap those application pages out and in if they
have been used at all recently.  It might be that this application
simply needs to run alone on a big enough box.

So look to see if you are keeping your CPU saturated while running
this application, and then look to see where the saturation is, in
system calls or in user time.  We can talk more about tuning the
application if you like.

**think about the processor in something like a Sony PS3.  A bunch of
processors, and it can do a teraflop, but most of the processors are
very very stupid and all they can do is floating point in a very
simple way - not clear that they are full turing machines. There are
no new ideas under the sun.
-- 
A man can't live in the everglades
Where a man can hide and never be found and have no fear of the bayin' hound
But he better keep movin' and don't stand still
If the skeeters don't get him then the gators will