MemoryFS example throughput #31

Closed · moreaki opened this issue Dec 5, 2013 · 6 comments

@moreaki commented Dec 5, 2013

G'day and thank you for this interesting project. I have just checked it out and fired up

$ ./examples/memoryfs.sh ~/memoryfs 1>/dev/null 2>&1

Having written file systems in C with FUSE for Mac, Linux and Windows before, I immediately hit the following combination:

$ gdd if=/dev/zero of=~/50M bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 0.420642 s, 125 MB/s
$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 32.0672 s, 1.6 MB/s
$ uname -a
Darwin surimacpro 13.0.0 Darwin Kernel Version 13.0.0: Thu Sep 19 22:22:27 PDT 2013; root:xnu-2422.1.72~6/RELEASE_X86_64 x86_64

Much to my surprise, the throughput of this simple (and non-conclusive) write test was significantly lower than I expected. I haven't looked at the code, nor have I read much about other people's experiences, but I would have expected roughly the same cached throughput in both tests.

Care to share your ideas as to why this test ran with such a surprisingly low throughput?

@EtiennePerot (Owner)

fuse-jna was never written with performance as a goal, otherwise it would not use JNA to do the binding to fuse. The throughput is indeed pretty terrible right off the bat. I've tended to mentally attribute that to JNA, though I've never actually checked this assumption. That doesn't mean things can't improve, but I doubt that the bottleneck rests on fuse-jna's side: most of the library consists of simple conversions and wrapping to and from Fuse <-> Java types, and forwarding calls to the user's filesystem class. There is no complex design pattern or other complicated flow involved, so there is little to optimize.

This being said, it would be nice if it did improve, because currently fuse-jna is much slower than fuse.py, which is written in pure Python and uses ctypes for its binding. Theoretically, fuse-jna has no reason to be slower than fuse.py, but it is.
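
For context, here is a heavily simplified, hypothetical sketch of what "binding to fuse through JNA" involves. This is not fuse-jna's actual code and the names and signatures are invented for illustration; the real library maps the whole fuse_operations struct and does considerably more type conversion. The point is that every native callback is routed into the JVM by JNA through reflection (visible as CallbackReference/Method.invoke frames in thread dumps), which adds per-call overhead on top of the FUSE round trip:

    import com.sun.jna.Callback;
    import com.sun.jna.Pointer;
    import java.nio.ByteBuffer;

    // Hypothetical illustration only -- not fuse-jna's actual code.
    public class JnaWriteCallbackSketch
    {
        // Stand-in for the user's filesystem class (FuseFilesystem in fuse-jna).
        public interface SimpleFilesystem
        {
            int write(String path, ByteBuffer buffer, long size, long offset);
        }

        // A JNA callback interface: native FUSE code calls invoke(), and JNA routes
        // the call into the JVM via reflection, which costs time on every write().
        public interface WriteCallback extends Callback
        {
            int invoke(String path, Pointer buf, long size, long offset);
        }

        // Wrap a Java filesystem as a native callback: copy the native buffer into a
        // Java ByteBuffer and forward the call to the user's write() implementation.
        public static WriteCallback forward(final SimpleFilesystem fs)
        {
            return new WriteCallback()
            {
                @Override
                public int invoke(final String path, final Pointer buf, final long size, final long offset)
                {
                    final byte[] data = new byte[(int) size];
                    buf.read(0, data, 0, (int) size); // copy out of native memory
                    return fs.write(path, ByteBuffer.wrap(data), size, offset);
                }
            };
        }
    }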

I'm closing this as this isn't a directly actionable bug. That doesn't mean that the discussion is closed (it's a good discussion to have) so feel free to keep commenting here. If you find any specific performance issue that is fixable by anything other than not using JNA, please open a new bug about it.

@moreaki (Author) commented Dec 6, 2013

G'day and thanks for your interesting reply.

The thing is that I eventually wanted to move some file system code over to Java to get better access to other technologies already implemented there. This definitely scares me away, since my current code base reaches near-native speed. A simple thread dump led to the following observation:

"Thread-1592" prio=5 tid=0x00007fd583891000 nid=0x4203 runnable [0x000000010984e000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.System.arraycopy(Native Method)
    at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:201)
    at net.fusejna.examples.MemoryFS$MemoryFile.write(MemoryFS.java:152)
    at net.fusejna.examples.MemoryFS$MemoryFile.access$1200(MemoryFS.java:90)
    at net.fusejna.examples.MemoryFS.write(MemoryFS.java:404)
    at net.fusejna.LoggedFuseFilesystem$41.invoke(LoggedFuseFilesystem.java:627)
    at net.fusejna.LoggedFuseFilesystem$41.invoke(LoggedFuseFilesystem.java:623)
    at net.fusejna.LoggedFuseFilesystem.log(LoggedFuseFilesystem.java:355)
    at net.fusejna.LoggedFuseFilesystem.write(LoggedFuseFilesystem.java:622)
    at net.fusejna.FuseFilesystem._write(FuseFilesystem.java:348)
    at net.fusejna.StructFuseOperations$24.callback(StructFuseOperations.java:268)
    at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:455)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:485)

and the other frequent thread stack ...

"Thread-1736" prio=5 tid=0x00007fd5820c5800 nid=0x4803 runnable [0x000000010fb66000]
   java.lang.Thread.State: RUNNABLE
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at net.fusejna.examples.MemoryFS$MemoryFile.write(MemoryFS.java:151)
    at net.fusejna.examples.MemoryFS$MemoryFile.access$1200(MemoryFS.java:90)
    at net.fusejna.examples.MemoryFS.write(MemoryFS.java:404)
    at net.fusejna.LoggedFuseFilesystem$41.invoke(LoggedFuseFilesystem.java:627)
    at net.fusejna.LoggedFuseFilesystem$41.invoke(LoggedFuseFilesystem.java:623)
    at net.fusejna.LoggedFuseFilesystem.log(LoggedFuseFilesystem.java:355)
    at net.fusejna.LoggedFuseFilesystem.write(LoggedFuseFilesystem.java:622)
    at net.fusejna.FuseFilesystem._write(FuseFilesystem.java:348)
    at net.fusejna.StructFuseOperations$24.callback(StructFuseOperations.java:268)
    at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:455)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:485)

The code seems to spend most of its time in these two call-stack patterns, which to me look a bit too deep for a simple write operation. I don't think that logging is the performance hog. The other threads are constantly in …

When I find time, I'll gather some more information, since jps and jstack are not the right tools to pinpoint the real issue here.

@moreaki (Author) commented Dec 6, 2013

Some more information:

[profiler screenshot: memoryfs-profiling-1]

Is it possible that block writes are not enabled, and that there are too many threads at work, each trying to write down a single byte?

@moreaki (Author) commented Dec 6, 2013

After tuning the GC and PermGen settings, I can safely say that there is a very bad design bug in either the JNA binding or the NIO code path. The following patch lets me fiddle with some JVM settings:

diff --git a/examples/run.sh b/examples/run.sh
index 3e2cb48..0e06510 100755
--- a/examples/run.sh
+++ b/examples/run.sh
@@ -1,5 +1,7 @@
 #!/usr/bin/env bash

+: ${JAVA_OPTS:=""}
+
 set -e

 if [ $# -lt 2 ]; then
@@ -22,4 +24,5 @@ set -- "${@:1:$(expr "$#" - 1)}" "$absoluteMountpoint"
 cd "$(dirname "$BASH_SOURCE")/.."

 GRADLE_USER_HOME=gradle gradle uberJar
-java -cp build/libs/\* "$@"
+echo "Invoking: java -cp build/libs/\* ${JAVA_OPTS} \"$@\""
+java -cp build/libs/\* ${JAVA_OPTS} "$@"

So, invoking fuse-jna's MemoryFS example with some different JVM settings:

$ JAVA_OPTS="-XX:+UseParallelGC -XX:+CMSClassUnloadingEnabled -XX:PermSize=256M -XX:MaxPermSize=512M" ./examples/memoryfs.sh ~/memoryfs 2>/dev/null
:compileJava UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:uberJar UP-TO-DATE

BUILD SUCCESSFUL

Total time: 7.106 secs
Invoking: java -cp build/libs/\* -XX:+UseParallelGC -XX:+CMSClassUnloadingEnabled -XX:PermSize=256M -XX:MaxPermSize=512M "net.fusejna.examples.MemoryFS /Users/moreaki/memoryfs"

yields a higher, though still extremely low, throughput for this simple test:

@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 21.4373 s, 2.4 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.027084 s, 38.7 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.746732 s, 14.0 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=12
12+0 records in
12+0 records out
12582912 bytes (13 MB) copied, 1.145 s, 11.0 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=13
13+0 records in
13+0 records out
13631488 bytes (14 MB) copied, 1.15865 s, 11.8 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=14
14+0 records in
14+0 records out
14680064 bytes (15 MB) copied, 1.31263 s, 11.2 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=15
15+0 records in
15+0 records out
15728640 bytes (16 MB) copied, 1.4781 s, 10.6 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=16
16+0 records in
16+0 records out
16777216 bytes (17 MB) copied, 1.70416 s, 9.8 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=17
17+0 records in
17+0 records out
17825792 bytes (18 MB) copied, 1.9069 s, 9.3 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=18
18+0 records in
18+0 records out
18874368 bytes (19 MB) copied, 2.12547 s, 8.9 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=19
19+0 records in
19+0 records out
19922944 bytes (20 MB) copied, 2.36854 s, 8.4 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=20
20+0 records in
20+0 records out
20971520 bytes (21 MB) copied, 2.63279 s, 8.0 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=21
21+0 records in
21+0 records out
22020096 bytes (22 MB) copied, 3.03518 s, 7.3 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=22
22+0 records in
22+0 records out
23068672 bytes (23 MB) copied, 3.1992 s, 7.2 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=23
23+0 records in
23+0 records out
24117248 bytes (24 MB) copied, 3.49927 s, 6.9 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=24
24+0 records in
24+0 records out
25165824 bytes (25 MB) copied, 3.85673 s, 6.5 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=25
25+0 records in
25+0 records out
26214400 bytes (26 MB) copied, 4.21176 s, 6.2 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=26
26+0 records in
26+0 records out
27262976 bytes (27 MB) copied, 4.60754 s, 5.9 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=27
27+0 records in
27+0 records out
28311552 bytes (28 MB) copied, 5.14044 s, 5.5 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=28
28+0 records in
28+0 records out
29360128 bytes (29 MB) copied, 5.7442 s, 5.1 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=29
29+0 records in
29+0 records out
30408704 bytes (30 MB) copied, 6.34359 s, 4.8 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=30
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 6.97741 s, 4.5 MB/s

An interestingly weird throughput regression, especially since I would have assumed this to be cached by the VFS layer anyway.

@moreaki (Author) commented Dec 6, 2013

Found the culprit:

int maxWriteIndex = (int) (writeOffset + bufSize);
[...]
final ByteBuffer newContents = ByteBuffer.allocate(maxWriteIndex);

It seems that calling allocate() like this on every write is extremely slow! Adding some slack to the buffer allocation (yes, I'm aware of the drawbacks):

        private int write(final ByteBuffer buffer, final long bufSize, final long writeOffset)
        {
            int maxWriteIndex = (int) (writeOffset + bufSize);
            if (maxWriteIndex > contents.capacity()) {
                // Need to create a new, larger buffer
                maxWriteIndex += 128*1024;
                System.out.println("NEW BUFFER: capacity=" + contents.capacity() 
                                + " bufSize=" + bufSize 
                                + " writeOffset=" + writeOffset
                                + " allocation=" + maxWriteIndex);
                final ByteBuffer newContents = ByteBuffer.allocate(maxWriteIndex);
                newContents.put(contents);
                contents = newContents;
            }
            final byte[] bytesToWrite = new byte[(int) bufSize];
            buffer.get(bytesToWrite, 0, (int) bufSize);
            contents.position((int) writeOffset);
            contents.put(bytesToWrite);
            contents.position(0); // Rewind
            return (int) bufSize;
        }

addresses the issue:

@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.026611 s, 39.4 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.436102 s, 24.0 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=30
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 2.82139 s, 11.1 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 6.9449 s, 7.5 MB/s

Now, I don't know how bufSize is chosen, but maybe it should follow an adaptive algorithm, like the timeouts of the TCP stack in the Linux kernel, for example. Is bufSize set by FUSE?

@EtiennePerot (Owner)

Thanks for taking the time to look into this. Java's ByteBuffers do take some time to allocate, and doing it on every write (allocating a brand-new one each time, no less) indeed adds up. Over-allocating is essentially a knob that trades how tight you want to be on memory against the ratio of (number of new buffer allocations) / (number of writes). Unfortunately, that solution still means you need to allocate new buffers at regular intervals, and these buffers keep getting bigger over time since they all store the same, growing file.

There are other ways to go about this problem, such as allocating exponentially-growing memory chunks as the requirement grows (i.e. allocate a kilobyte, then if it overflows allocate 2 kilobytes, then 4, etc.), much like ArrayList does (though ArrayList grows by a factor of 1.5, not 2, but the idea is the same); see the sketch below. You could also store identical-size chunks in a linked list, which would make writes linear in time, but random reads would take a lot longer; then you could keep an index to make random reads faster again, etc. This rabbit hole probably goes quite deep.

MemoryFS isn't meant to be a real, fully-fledged, high-performance filesystem; it's just an example of a working filesystem that you can do writes on and that serves as a starting point, so I wouldn't want to implement any of the above in it. Let's keep it very simple, and very slow if it needs to be.
However, it would be nice to have a more complex in-memory filesystem implementation and/or a faster one next to it (AdvancedMemoryFS or whatever). Feel free to send a pull request to add such a thing.
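
For illustration, here is a rough sketch of that exponential-growth idea (not code from the repository), using the same write signature as the MemoryFS example above; the minimum capacity and growth factor are arbitrary:

    // Sketch only: grow the backing buffer geometrically so the number of
    // reallocations is logarithmic in the final file size, instead of allocating
    // a new full-size buffer on every write.
    private int write(final ByteBuffer buffer, final long bufSize, final long writeOffset)
    {
        final int maxWriteIndex = (int) (writeOffset + bufSize);
        if (maxWriteIndex > contents.capacity()) {
            // Double the capacity (from a small minimum) until the write fits.
            int newCapacity = Math.max(contents.capacity(), 4096);
            while (newCapacity < maxWriteIndex) {
                newCapacity *= 2;
            }
            final ByteBuffer newContents = ByteBuffer.allocate(newCapacity);
            newContents.put(contents); // copy the old contents over
            contents = newContents;
            // Note: as with the over-allocation patch above, the logical file size
            // would now need to be tracked separately from contents.capacity().
        }
        final byte[] bytesToWrite = new byte[(int) bufSize];
        buffer.get(bytesToWrite, 0, (int) bufSize);
        contents.position((int) writeOffset);
        contents.put(bytesToWrite);
        contents.position(0); // Rewind
        return (int) bufSize;
    }

With doubling, the total amount of copying stays proportional to the final file size rather than growing quadratically with the number of writes, which is the same amortized-growth trick ArrayList relies on.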

Now to answer your question: bufSize is indeed set by fuse. fuse ignores the size of what the writing program (in this case dd) is trying to write and chunks it up into same-sized blocks (4096 bytes for me) before passing it down to userspace, probably to make life "easier" for filesystem code.
Fortunately, there is a patch from last year (already pulled into the Linux kernel, so you don't need to apply it if you have a recent kernel) that adds new mount options to control this behavior. You need to enable big_writes, and then you can set max_write=9001 or whatever, which will become the bufSize on writes. I wanted to test that theory. Here's what I get without the options:

$ dd if=/dev/zero of=tmpmount/temp.dd bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 144.721 s, 362 kB/s

With the log showing INFO: [/temp.dd] Method succeeded. Result: 4096.

And then I added this to MemoryFS:

    @Override
    protected String[] getOptions()
    {
        return new String[] { "-o", "big_writes", "-o", "max_write=4194304" };
    }

And then I got this:

$ dd if=/dev/zero of=tmpmount/temp.dd bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 7.43098 s, 7.1 MB/s

And the log should show INFO: [/temp.dd] Method succeeded. Result: 131072.

You're on OS X, so I don't know if that patch has made it through there yet; please give it a shot.

Again, this stuff would require more code in MemoryFS (not just adding a getOptions method; it would also need to detect whether the fuse version it's running on supports these options, so there's some platform/capability detection code involved), so it doesn't belong there, as I'd like to keep MemoryFS as simple as possible. However, it would fit well in said AdvancedMemoryFS.
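
As a rough, hypothetical sketch of what such gating could look like in an AdvancedMemoryFS-style class (the os.name check is a crude stand-in for real fuse version/capability detection, and it assumes the base class treats an empty array as "no extra options"):

    // Hypothetical sketch (not repository code): only pass big_writes/max_write
    // where they are likely to be supported.
    @Override
    protected String[] getOptions()
    {
        final String os = System.getProperty("os.name", "").toLowerCase();
        if (os.contains("linux")) {
            return new String[] { "-o", "big_writes", "-o", "max_write=4194304" };
        }
        // Elsewhere (e.g. OSXFUSE), fall back to the defaults; this assumes the
        // base class treats an empty array the same as returning no options.
        return new String[0];
    }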

Thanks again for taking the time to dig into this. Feel free to send pull requests for such an alternate MemoryFS implementation (or for that run.sh change as well, it seems useful) and I will pull them in.
