MemoryFS example throughput #31
fuse-jna was never written with performance as a goal; otherwise it would not use JNA to do the binding to FUSE. The throughput is indeed pretty terrible right off the bat. I've tended to mentally attribute that to JNA, though I've never actually checked this assumption. That doesn't mean things can't improve, but I doubt that the bottleneck is on fuse-jna's side. Most of the library consists of simple conversions and wrapping between FUSE and Java types, plus forwarding calls to the user's filesystem class. There is no complex design pattern or other complicated flow involved, so there is little to optimize.
This being said, it would be nice if it did improve, because currently fuse-jna is much slower than fuse.py, which is written in pure Python and uses ctypes for its binding. Theoretically, fuse-jna has no reason to be slower than fuse.py, but it is.
I'm closing this, as it isn't a directly actionable bug. That doesn't mean the discussion is closed (it's a good discussion to have), so feel free to keep commenting here. If you find any specific performance issue that is fixable by anything other than not using JNA, please open a new bug about it.
G'day, and thanks for your interesting reply. The thing is that I eventually wanted to move some file system code over to Java to get better access to other technologies already implemented there. This definitely scares me away, since my current code base reaches near-native speed.
A simple thread dump led to the following observation:
"Thread-1592" prio=5 tid=0x00007fd583891000 nid=0x4203 runnable [0x000000010984e000]
java.lang.Thread.State: RUNNABLE
at java.lang.System.arraycopy(Native Method)
at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:201)
at net.fusejna.examples.MemoryFS$MemoryFile.write(MemoryFS.java:152)
at net.fusejna.examples.MemoryFS$MemoryFile.access$1200(MemoryFS.java:90)
at net.fusejna.examples.MemoryFS.write(MemoryFS.java:404)
at net.fusejna.LoggedFuseFilesystem$41.invoke(LoggedFuseFilesystem.java:627)
at net.fusejna.LoggedFuseFilesystem$41.invoke(LoggedFuseFilesystem.java:623)
at net.fusejna.LoggedFuseFilesystem.log(LoggedFuseFilesystem.java:355)
at net.fusejna.LoggedFuseFilesystem.write(LoggedFuseFilesystem.java:622)
at net.fusejna.FuseFilesystem._write(FuseFilesystem.java:348)
at net.fusejna.StructFuseOperations$24.callback(StructFuseOperations.java:268)
at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:455)
at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:485)
And the other frequent thread stack:
"Thread-1736" prio=5 tid=0x00007fd5820c5800 nid=0x4803 runnable [0x000000010fb66000]
java.lang.Thread.State: RUNNABLE
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at net.fusejna.examples.MemoryFS$MemoryFile.write(MemoryFS.java:151)
at net.fusejna.examples.MemoryFS$MemoryFile.access$1200(MemoryFS.java:90)
at net.fusejna.examples.MemoryFS.write(MemoryFS.java:404)
at net.fusejna.LoggedFuseFilesystem$41.invoke(LoggedFuseFilesystem.java:627)
at net.fusejna.LoggedFuseFilesystem$41.invoke(LoggedFuseFilesystem.java:623)
at net.fusejna.LoggedFuseFilesystem.log(LoggedFuseFilesystem.java:355)
at net.fusejna.LoggedFuseFilesystem.write(LoggedFuseFilesystem.java:622)
at net.fusejna.FuseFilesystem._write(FuseFilesystem.java:348)
at net.fusejna.StructFuseOperations$24.callback(StructFuseOperations.java:268)
at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:455)
at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:485)
The code seems to spend most of its time in these two call-stack patterns, which to me looks like a rather deep stack for a simple write operation. I don't think that logging is the performance hog, and the other threads are constantly in similar stacks. When I find time, I'll gather some more information.
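For reference, a minimal in-process way to take such samples could look like the sketch below; this StackSampler class is hypothetical and not part of fuse-jna (jstack or kill -3 against the JVM produces the same kind of dump):

import java.util.Map;

// Hypothetical helper, not part of fuse-jna: started as a daemon thread inside the
// filesystem process, it periodically prints the stacks of all RUNNABLE threads,
// which is how hot paths such as MemoryFS.write() show up.
public final class StackSampler
{
    public static void start(final long intervalMillis)
    {
        final Thread sampler = new Thread(new Runnable()
        {
            @Override
            public void run()
            {
                while (!Thread.currentThread().isInterrupted()) {
                    final Map<Thread, StackTraceElement[]> dumps = Thread.getAllStackTraces();
                    for (final Map.Entry<Thread, StackTraceElement[]> entry : dumps.entrySet()) {
                        if (entry.getKey() == Thread.currentThread()) {
                            continue; // Skip the sampler thread itself
                        }
                        if (entry.getKey().getState() != Thread.State.RUNNABLE) {
                            continue; // Only show threads doing actual work
                        }
                        System.err.println("\"" + entry.getKey().getName() + "\" " + entry.getKey().getState());
                        for (final StackTraceElement frame : entry.getValue()) {
                            System.err.println("\tat " + frame);
                        }
                    }
                    try {
                        Thread.sleep(intervalMillis);
                    } catch (final InterruptedException e) {
                        return;
                    }
                }
            }
        }, "stack-sampler");
        sampler.setDaemon(true);
        sampler.start();
    }
}

Calling StackSampler.start(100) right after mounting the filesystem would print such dumps roughly ten times per second.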
After tuning the GC and PermGen settings, I can safely say that there is a very bad design bug in either the JNA or the NIO code path. The following patch allows me to fiddle with some JVM settings:
diff --git a/examples/run.sh b/examples/run.sh
index 3e2cb48..0e06510 100755
--- a/examples/run.sh
+++ b/examples/run.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash
+: ${JAVA_OPTS:=""}
+
set -e
if [ $# -lt 2 ]; then
@@ -22,4 +24,5 @@ set -- "${@:1:$(expr "$#" - 1)}" "$absoluteMountpoint"
cd "$(dirname "$BASH_SOURCE")/.."
GRADLE_USER_HOME=gradle gradle uberJar
-java -cp build/libs/\* "$@"
+echo "Invoking: java -cp build/libs/\* ${JAVA_OPTS} \"$@\""
+java -cp build/libs/\* ${JAVA_OPTS} "$@"
So, invoking fuse-jna with the MemoryFS example under some different JVM settings:
$ JAVA_OPTS="-XX:+UseParallelGC -XX:+CMSClassUnloadingEnabled -XX:PermSize=256M -XX:MaxPermSize=512M" ./examples/memoryfs.sh ~/memoryfs 2>/dev/null
:compileJava UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:uberJar UP-TO-DATE
BUILD SUCCESSFUL
Total time: 7.106 secs
Invoking: java -cp build/libs/\* -XX:+UseParallelGC -XX:+CMSClassUnloadingEnabled -XX:PermSize=256M -XX:MaxPermSize=512M "net.fusejna.examples.MemoryFS /Users/moreaki/memoryfs"
This yields a higher, though still extremely low, throughput for this simple test:
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 21.4373 s, 2.4 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.027084 s, 38.7 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.746732 s, 14.0 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=12
12+0 records in
12+0 records out
12582912 bytes (13 MB) copied, 1.145 s, 11.0 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=13
13+0 records in
13+0 records out
13631488 bytes (14 MB) copied, 1.15865 s, 11.8 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=14
14+0 records in
14+0 records out
14680064 bytes (15 MB) copied, 1.31263 s, 11.2 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=15
15+0 records in
15+0 records out
15728640 bytes (16 MB) copied, 1.4781 s, 10.6 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=16
16+0 records in
16+0 records out
16777216 bytes (17 MB) copied, 1.70416 s, 9.8 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=17
17+0 records in
17+0 records out
17825792 bytes (18 MB) copied, 1.9069 s, 9.3 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=18
18+0 records in
18+0 records out
18874368 bytes (19 MB) copied, 2.12547 s, 8.9 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=19
19+0 records in
19+0 records out
19922944 bytes (20 MB) copied, 2.36854 s, 8.4 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=20
20+0 records in
20+0 records out
20971520 bytes (21 MB) copied, 2.63279 s, 8.0 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=21
21+0 records in
21+0 records out
22020096 bytes (22 MB) copied, 3.03518 s, 7.3 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=22
22+0 records in
22+0 records out
23068672 bytes (23 MB) copied, 3.1992 s, 7.2 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=23
23+0 records in
23+0 records out
24117248 bytes (24 MB) copied, 3.49927 s, 6.9 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=24
24+0 records in
24+0 records out
25165824 bytes (25 MB) copied, 3.85673 s, 6.5 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=25
25+0 records in
25+0 records out
26214400 bytes (26 MB) copied, 4.21176 s, 6.2 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=26
26+0 records in
26+0 records out
27262976 bytes (27 MB) copied, 4.60754 s, 5.9 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=27
27+0 records in
27+0 records out
28311552 bytes (28 MB) copied, 5.14044 s, 5.5 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=28
28+0 records in
28+0 records out
29360128 bytes (29 MB) copied, 5.7442 s, 5.1 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=29
29+0 records in
29+0 records out
30408704 bytes (30 MB) copied, 6.34359 s, 4.8 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=30
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 6.97741 s, 4.5 MB/s
That is an interestingly weird throughput regression, especially since I would assume these writes to be cached by the VFS layer anyway.
Found the culprit:
int maxWriteIndex = (int) (writeOffset + bufSize);
[...]
final ByteBuffer newContents = ByteBuffer.allocate(maxWriteIndex);
It seems that the allocate() method is extremely slow! Adding some larger buffer allocation (yes, I'm aware of the drawbacks):
private int write(final ByteBuffer buffer, final long bufSize, final long writeOffset)
{
int maxWriteIndex = (int) (writeOffset + bufSize);
if (maxWriteIndex > contents.capacity()) {
// Need to create a new, larger buffer
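// Over-allocating an extra 128 KiB means the next several writes fit into the
// existing buffer, so the expensive allocate-and-copy happens far less often.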
maxWriteIndex += 128*1024;
System.out.println("NEW BUFFER: capacity=" + contents.capacity()
+ " bufSize=" + bufSize
+ " writeOffset=" + writeOffset
+ " allocation=" + maxWriteIndex);
final ByteBuffer newContents = ByteBuffer.allocate(maxWriteIndex);
newContents.put(contents);
contents = newContents;
}
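// Copy the incoming bytes from the FUSE-supplied buffer into the backing buffer at writeOffset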
final byte[] bytesToWrite = new byte[(int) bufSize];
buffer.get(bytesToWrite, 0, (int) bufSize);
contents.position((int) writeOffset);
contents.put(bytesToWrite);
contents.position(0); // Rewind
return (int) bufSize;
}
addresses the issue:
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.026611 s, 39.4 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.436102 s, 24.0 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=30
30+0 records in
30+0 records out
31457280 bytes (31 MB) copied, 2.82139 s, 11.1 MB/s
@surimacpro:~$ gdd if=/dev/zero of=~/memoryfs/50M bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 6.9449 s, 7.5 MB/s
Now, I don't know how bufSize is chosen, but it should maybe follow an adaptive algorithm, like the timeouts of the TCP stack in the Linux kernel, for example. Is the bufSize determined by FUSE?
Thanks for taking the time to look into this. Java's ByteBuffers do take some time to allocate, and doing it on every write (allocating a new one each time, no less) indeed takes a while. Allocating more memory than necessary essentially turns a knob between how tight you want to be on memory and the ratio of (number of new buffer allocations) / (number of writes). Unfortunately this solution still means that you need to allocate new buffers at regular intervals, and these buffers keep getting bigger over time since they all store the same file. There are other ways to go about this problem, such as allocating exponentially-growing memory chunks as the requirement grows (i.e. allocate a kilobyte, then if it goes over allocate 2 kilobytes, then 4, and so on), much like growable array implementations do.
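To illustrate, here is a rough sketch of what that doubling strategy could look like for the MemoryFS example's write() method (hypothetical code, not what ships in the example; the 4 KiB starting capacity is an arbitrary choice):

private int write(final ByteBuffer buffer, final long bufSize, final long writeOffset)
{
    final int maxWriteIndex = (int) (writeOffset + bufSize);
    if (maxWriteIndex > contents.capacity()) {
        // Double the capacity until the write fits. The total number of bytes copied
        // across all reallocations then forms a geometric series, so it stays linear
        // in the final file size instead of quadratic (one full copy per write).
        int newCapacity = Math.max(contents.capacity(), 4096);
        while (newCapacity < maxWriteIndex) {
            newCapacity *= 2;
        }
        final ByteBuffer newContents = ByteBuffer.allocate(newCapacity);
        contents.rewind();
        newContents.put(contents);
        contents = newContents;
    }
    final byte[] bytesToWrite = new byte[(int) bufSize];
    buffer.get(bytesToWrite, 0, (int) bufSize);
    contents.position((int) writeOffset);
    contents.put(bytesToWrite);
    contents.position(0); // Rewind so later reads start at the beginning
    return (int) bufSize;
}

The trade-off is the same knob as before: at most roughly half of the final buffer is wasted capacity, in exchange for a number of reallocations that only grows logarithmically with the file size.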
Now, to answer your question: yes, the size of each write request is decided by FUSE, not by fuse-jna. With the default settings the log showed fairly small writes coming in, so I added the following to the MemoryFS example:
@Override
protected String[] getOptions()
{
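// big_writes lets FUSE send write requests larger than the default 4 KiB
// page-sized chunks; max_write caps each request at 4 MiB.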
return new String[] { "-o", "big_writes", "-o", "max_write=4194304" };
} And then I got this:
And the log should show You're on OS X so I don't know if that patch has made it through yet, please give it a shot. Again, this stuff would require more code in Thanks again for taking the time to dig into this. Feel free to send pull requests for such an alternate |
G'day, and thank you for this interesting project. I have just checked it out and fired up the MemoryFS example.
Having written file systems before using C and FUSE for Mac, Linux and Windows, I immediately hit it with a simple combo of write tests.
Much to my surprise, the result of this simple (and non-conclusive) write test was significantly lower than my expectations. I haven't looked at the code, nor did I read much about other people's experiences, though I would have expected to see the same cached throughput for both tests.
Care to share your ideas as to why this test ran with such a surprisingly low throughput?