Low Read performance #154

Closed
Licenser opened this issue Mar 12, 2014 · 21 comments
Comments

@Licenser
Contributor

I'm experiencing read performance that is an order of magnitude lower than the write performance. This seems quite odd; I would expect read to outperform write, if they differ at all.

Test

Expected behavior

Read is as fast as, if not faster than, write.

Observed behavior

Read is an order of magnitude slower than write (80 MBit/s vs 1 GBit/s).

Setup

  • OS: SmartOS joyent_20140124T065835Z Base64 13.3.0.1
  • Erlang: Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:32:32] [async-threads:10] [kernel-poll:false] [dtrace]
  • CPU Dual Xeon 3.4G
  • Disks: Striped ZRaid3 + SSD Cache for ZIL and L2ARC
  • GW: 16G RAM, Storage: 64G Memory
  • LeoFS 1.0.0pre3
  • Network: 10GBit
  • Client: gof3r (custom build MD5 multipart checks disabled)
  • Read concurrency: 5
  • Chunk Size: 5242880 (aligned with the size of multipart chunks as configured in the GW)
  • File Size: 4094722048 bytes

Layout

(image: cluster layout diagram)

GW Config

#======================================================================
# LeoFS - Gateway Configuration
#
# See: http://www.leofs.org/docs/configuration.html#leofs-gateway
#======================================================================
## --------------------------------------------------------------------
## SASL
## --------------------------------------------------------------------
## See: http://www.erlang.org/doc/man/sasl_app.html
##
## The following configuration parameters are defined for
## the SASL application. See app(4) for more information
## about configuration parameters

## SASL error log path
## sasl.sasl_error_log = ./log/sasl/sasl-error.log

## Restricts the error logging performed by the specified sasl_error_logger
## to error reports, progress reports, or both.
## errlog_type = [error | progress | all]
## sasl.errlog_type = error

## Specifies in which directory the files are stored.
## If this parameter is undefined or false, the error_logger_mf_h is not installed.
# sasl.error_logger_mf_dir = ./log/sasl

## Specifies how large each individual file can be.
## If this parameter is undefined, the error_logger_mf_h is not installed.
## sasl.error_logger_mf_maxbytes = 10485760

## Specifies how many files are used.
## If this parameter is undefined, the error_logger_mf_h is not installed.
## sasl.error_logger_mf_maxfiles = 5

## --------------------------------------------------------------------
## Manager's Node(s)
## --------------------------------------------------------------------
## Name of Manager node(s)
managers = [manager_0@leomgr0, manager_1@leomgr1]

## --------------------------------------------------------------------
## GATEWAY
## --------------------------------------------------------------------
## Gateway’s HTTP API to use: [s3 | rest | embed]
http.handler = s3

## Port number the Gateway uses for HTTP connections
http.port = 80

## Numbers of processes listening for connections
http.num_of_acceptors = 128

## Maximum number of requests allowed in a single keep-alive session
http.max_keepalive = 4096

## Total number of virtual directories
## http.layer_of_dirs = 12

## Port number the Gateway uses for HTTPS connections
http.ssl_port     = 443

## SSL Certificate file
## http.ssl_certfile = ./etc/server_cert.pem

## SSL key
## http.ssl_keyfile  = ./etc/server_key.pem

## Synchronized time of a bucket property (second)
bucket_prop_sync_interval = 300

## --------------------------------------------------------------------
## GATEWAY - Large Object
## --------------------------------------------------------------------
## Total number of chunked objects
large_object.max_chunked_objs = 10000000

## Length of a chunked object
large_object.chunked_obj_len = 5242880

## Threshold of length of a chunked object
large_object.threshold_of_chunk_len = 5767168

## Reading length of a chuncked object
##   * If happening timeout when copying a large object,
##     you will solve to set this value as less than 5MB.
##   * default: "large_object.chunked_obj_len" (5242880 - 5MB)
large_object.reading_chunked_obj_len = 5242880


## --------------------------------------------------------------------
## GATEWAY - Cache
## --------------------------------------------------------------------
## If this parameter is 'true', Gateway turns on HTTP-based cache server, like Varnish OR Squid.
## If this parameter is 'false', Stores objects into the Gateway’s memory.
## When operating READ, the Etag of the cache is compared with a backend storage’s Etag.
## cache.http_cache = false

## A number of cache workers
## cache.cache_workers = 16

## Memory cache capacity in bytes
cache.cache_ram_capacity  = 1073741824

## Disk cache capacity in bytes
cache.cache_disc_capacity = 524288000

## When the length of the object exceeds this value, store the object on disk
cache.cache_disc_threshold_len = 1048576

## Directory for the disk cache data
cache.cache_disc_dir_data    = ./cache/data

## Directory for the disk cache journal
cache.cache_disc_dir_journal = ./cache/journal

## Cache Expire in seconds
cache.cache_expire = 300

## Cache Max Content Length in bytes
cache.cache_max_content_len = 1048576

## Cache Content Type(s)
## In case of "empty", all objects are cached.
## cache.cachable_content_type =

## Cache Path Pattern(s) (regular expression)
## In case of "empty", all objects are cached.
## cache.cachable_path_pattern =


## --------------------------------------------------------------------
## GATEWAY - Timeout
## --------------------------------------------------------------------
## Timeout value when requesting to a storage
## 0 to 65,535 bytes
## timeout.level_1 =  5000

## 65,535 to 131,071 bytes
## timeout.level_2 =  7000

## 131,072 to 524,287 bytes
## timeout.level_3 = 10000

## 524,288 to 1,048,576 bytes
## timeout.level_4 = 20000

## 1,048,576 bytes and over
## timeout.level_5 = 30000


## --------------------------------------------------------------------
## GATEWAY - Log
## --------------------------------------------------------------------
##
## Log level: [0:debug, 1:info, 2:warn, 3:error]
## log.log_level = 1

## Is enable access-log [true, false]
## log.is_enable_access_log = false

## Output log file(s) - Erlang's log
## log.erlang = ./log/erlang

## Output log file(s) - app
## log.app = ./log/app

## Output log file(s) - members of storage-cluster
## log.member_dir = ./log/ring

## Output log file(s) - ring
## log.ring_dir = ./log/ring


## --------------------------------------------------------------------
## GATEWAY - Other Directories
## --------------------------------------------------------------------
## Directory of queue for monitoring "RING"
## queue_dir  = ./work/queue

## Directory of SNMP agent configuration
## snmp_agent = ./snmp/snmpa_gateway_0/LEO-GATEWAY


## --------------------------------------------------------------------
## Other Libs
## --------------------------------------------------------------------
## Enable profiler - leo_backend_db
## leo_backend_db.profile = false

## Enable profiler - leo_logger
## leo_logger.profile = false

## Enable profiler - leo_mq
## leo_mq.profile = false

## Enable profiler - leo_redundant_manager
## leo_redundant_manager.profile = false

## Enable profiler - leo_statistics
## leo_statistics.profile = false


#======================================================================
# For vm.args
#======================================================================
## Name of the leofs-gateway node
nodename = gateway_bench@loegwbench

## Cookie for distributed node communication.  All nodes in the same cluster
## should use the same cookie or they will not be able to communicate.
distributed_cookie = ...

## Enable kernel poll
erlang.kernel_poll = true

## Number of async threads
erlang.asyc_threads = 32

## Increase number of concurrent ports/sockets
erlang.max_ports = 64000

## Set the location of crash dumps
erlang.crash_dump = ./log/erl_crash.dump

## Raise the ETS table limit
erlang.max_ets_tables = 256000

## Raise the default erlang process limit
process_limit = 1048576

## Path of SNMP-agent configuration
##snmp_conf = ./snmp/snmpa_gateway_0/leo_gateway_snmp

Results

Write

real    0m36.481s
user    0m19.098s
sys     0m4.876s

Read

real    0m36.171s
user    0m18.174s
sys     0m6.516s

Profiling

SVG flame charts for the GW and Storage process during read and write

https://www.dropbox.com/sh/3psqxvkv966y9t4/K4NsHvpLAh

Observations

  • No disk activity during reads -> all data was served from the ZFS cache (i.e., memory); there was no bottleneck on the disks during writes
  • Network is not saturated and is only used in bursts:
[root@... ~]# nicstat -i z4_net0 1
    Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util   Sat
15:05:36      z4_net0   11.88   11.45   11.42   11.21  1065.5  1045.3   0.00  0.00
15:05:37      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:05:38      z4_net0   19.97   28.63   137.7   201.5   148.5   145.5   0.00  0.00
15:05:39      z4_net0 16662.4 16944.2 15877.2 18563.5  1074.6   934.7   1.39  0.00
15:05:40      z4_net0 71110.7 71470.9 60946.8 67907.4  1194.8  1077.7   5.85  0.00
15:05:41      z4_net0    0.20    0.21    3.00    3.00   70.00   71.00   0.00  0.00
15:05:42      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:05:43      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:05:44      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:05:45      z4_net0 49526.6 47336.3 43998.2 46076.2  1152.7  1052.0   4.06  0.00
15:05:46      z4_net0 16243.0 16387.4 12860.7 14781.0  1293.3  1135.3   1.34  0.00
15:05:47      z4_net0    0.07    0.07    1.00    1.00   70.00   71.00   0.00  0.00
15:05:48      z4_net0    0.07    0.07    1.00    1.00   70.00   71.00   0.00  0.00
15:05:49      z4_net0    0.07    0.07    1.00    1.00   74.00   70.00   0.00  0.00
15:05:50      z4_net0 10964.0  8754.0  9472.6  9041.2  1185.2   991.5   0.90  0.00
15:05:51      z4_net0    0.20    1.54    3.00    2.00   66.67   790.5   0.00  0.00
15:05:52      z4_net0    0.19    0.21    2.99    2.99   66.67   71.00   0.00  0.00
15:05:53      z4_net0    0.27    0.35    3.99    4.99   70.00   72.20   0.00  0.00
15:05:54      z4_net0    0.20    0.14    2.99    2.00   70.00   71.00   0.00  0.00
15:05:55      z4_net0  5474.1  5563.1  4647.3  5677.4  1206.2  1003.4   0.46  0.00
    Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util   Sat
15:05:56      z4_net0   355.7   498.2  2549.3  3554.8   142.9   143.5   0.04  0.00
15:05:57      z4_net0 73939.1 69382.7 62695.3 67147.0  1207.6  1058.1   6.06  0.00
15:05:58      z4_net0  8106.2 10953.4  6904.3  9664.2  1202.3  1160.6   0.90  0.00
15:05:59      z4_net0    2.00    3.33   21.95   25.94   93.27   131.3   0.00  0.00
15:06:00      z4_net0    0.20    0.28    2.99    3.99   70.00   71.00   0.00  0.00
15:06:01      z4_net0    0.20    0.14    2.99    2.00   70.00   71.00   0.00  0.00
15:06:02      z4_net0 21927.2 22186.6 18831.0 21453.1  1192.4  1059.0   1.82  0.00
15:06:03      z4_net0 49431.7 44612.4 43082.8 46464.3  1174.9   983.2   4.05  0.00
15:06:04      z4_net0 21970.8 25662.8 19454.2 25377.3  1156.5  1035.5   2.10  0.00
15:06:05      z4_net0    0.21    0.21    2.99    2.99   71.33   70.67   0.00  0.00
15:06:06      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:06:07      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:06:08      z4_net0  5502.0  5577.0  4982.4  5774.9  1130.8   988.9   0.46  0.00
15:06:09      z4_net0 60411.1 55997.9 52527.7 58648.5  1177.7   977.7   4.95  0.00
15:06:10      z4_net0 10996.2 11811.3  9660.0 12128.4  1165.6   997.2   0.97  0.00
15:06:11      z4_net0    0.27    0.28    3.99    3.99   70.00   71.00   0.00  0.00
15:06:12      z4_net0    0.34    0.35    4.99    4.99   70.00   71.00   0.00  0.00
15:06:13      z4_net0    0.27    0.28    3.99    3.99   70.00   71.00   0.00  0.00
15:06:14      z4_net0 16487.4 16667.5 14443.2 16092.4  1168.9  1060.6   1.37  0.00
15:06:15      z4_net0 41246.8 39837.0 38800.2 46243.0  1088.6   882.1   3.38  0.00
    Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util   Sat
15:06:16      z4_net0 70994.0 68445.4 61901.7 75036.3  1174.4   934.1   5.82  0.00
15:06:17      z4_net0  8879.1 12187.1  7010.2 11071.7  1297.0  1127.2   1.00  0.00
15:06:18      z4_net0    0.07    0.07    1.00    1.00   70.00   71.00   0.00  0.00
15:06:19      z4_net0    1.94    3.26   20.96   24.95   94.57   133.7   0.00  0.00
15:06:20      z4_net0    0.14    0.14    2.00    2.00   72.00   70.50   0.00  0.00
15:06:21      z4_net0 11008.3 11280.0  9745.9 13027.2  1156.6   886.7   0.92  0.00
15:06:22      z4_net0 27620.2 18812.8 25158.2 22902.7  1124.2   841.1   2.26  0.00
15:06:23      z4_net0 77506.4 88261.8 72265.3 95849.2  1098.3   942.9   7.23  0.00
15:06:24      z4_net0 82762.1 81927.8 74751.3 91327.3  1133.7   918.6   6.78  0.00
15:06:25      z4_net0 16403.4 16749.7 14013.7 18423.5  1198.6   931.0   1.37  0.00
15:06:26      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:06:27      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:06:28      z4_net0    0.20    0.21    2.99    2.99   70.00   71.00   0.00  0.00
15:06:29      z4_net0  5506.3  3090.9  4825.2  3620.4  1168.5   874.2   0.45  0.00
15:06:30      z4_net0    0.20    0.21    3.00    3.00   66.67   71.00   0.00  0.00
15:06:31      z4_net0    0.07    0.07    1.00    1.00   70.00   71.00   0.00  0.00
15:06:32      z4_net0    0.06    0.00    1.00    0.00   60.00    0.00   0.00  0.00
15:06:33      z4_net0    0.13    0.07    2.00    1.00   65.00   71.00   0.00  0.00
15:06:34      z4_net0  5512.9  5566.9  4892.0  5291.5  1154.0  1077.3   0.46  0.00
15:06:35      z4_net0   381.5   533.1  2736.1  3802.6   142.8   143.6   0.04  0.00
    Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util   Sat
15:06:36      z4_net0 60374.1 60522.6 52244.5 65679.9  1183.3   943.6   4.96  0.00
15:06:37      z4_net0    0.20    0.21    3.00    3.00   70.00   71.00   0.00  0.00
15:06:38      z4_net0    0.54    0.45    6.99    5.99   79.14   77.67   0.00  0.00
15:06:39      z4_net0    2.00    3.33   21.97   25.96   93.27   131.3   0.00  0.00
15:06:40      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:06:41      z4_net0  5520.5  5664.7  4983.1  6688.4  1134.4   867.3   0.46  0.00
15:06:42      z4_net0 41879.9 37751.7 39215.1 46301.3  1093.6   834.9   3.43  0.00
15:06:43      z4_net0 41183.2 44826.0 37797.8 48067.2  1115.7   955.0   3.67  0.00
15:06:44      z4_net0    0.27    0.14    3.99    2.00   70.00   71.00   0.00  0.00
15:06:45      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:06:46      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:06:47      z4_net0  5539.6  5672.2  5222.5  6741.0  1086.2   861.6   0.46  0.00
15:06:48      z4_net0 44648.2 42346.9 43820.1 48630.4  1043.4   891.7   3.66  0.00
15:06:49      z4_net0 16377.7 16467.8 14081.1 15001.8  1191.0  1124.1   1.35  0.00
15:06:50      z4_net0 16677.4 17021.4 16310.8 19637.4  1047.0   887.6   1.39  0.00
15:06:51      z4_net0    0.19    0.14    3.00    2.00   66.67   71.00   0.00  0.00
15:06:52      z4_net0    0.20    0.21    3.00    3.00   70.00   71.00   0.00  0.00
15:06:53      z4_net0  5690.0  5778.7  6332.3  6767.6   920.1   874.4   0.47  0.00
15:06:54      z4_net0 55566.8 55546.4 52885.4 62570.3  1075.9   909.0   4.55  0.00
15:06:55      z4_net0 16545.8 16658.0 15401.1 16952.7  1100.1  1006.2   1.36  0.00
    Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util   Sat
15:06:56      z4_net0    0.13    0.07    2.00    1.00   65.00   71.00   0.00  0.00
15:06:57      z4_net0    0.14    0.14    2.00    2.00   72.00   70.50   0.00  0.00
15:06:58      z4_net0    0.00    0.00    0.00    0.00    0.00    0.00   0.00  0.00
15:06:59      z4_net0 11070.8 10959.8 10126.5 12139.6  1119.5   924.5   0.91  0.00
15:07:00      z4_net0    0.20    1.54    3.00    2.00   66.67   790.5   0.00  0.00
15:07:01      z4_net0    0.20    0.28    3.00    3.99   66.67   71.00   0.00  0.00
15:07:02      z4_net0    0.20    0.14    3.00    2.00   70.00   71.00   0.00  0.00
15:07:03      z4_net0    0.20    0.14    3.00    2.00   66.67   71.00   0.00  0.00
15:07:04      z4_net0   867.3   209.6  1445.1  1488.0   614.6   144.2   0.07  0.00
15:07:05      z4_net0  5076.7  5947.3  6598.2  7991.3   787.9   762.1   0.49  0.00
15:07:06      z4_net0 66272.6 63281.7 60493.2 71337.3  1121.8   908.4   5.43  0.00
15:07:07      z4_net0   32.03  4118.2   467.4  2784.2   70.18  1514.7   0.34  0.00
15:07:08      z4_net0    0.53    0.38    6.99    4.99   77.71   78.40   0.00  0.00
15:07:09      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:07:10      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:07:11      z4_net0 33807.8 34870.0 34930.0 45862.7   991.1   778.6   2.86  0.00
15:07:12      z4_net0 60697.6 61768.4 55109.9 66763.5  1127.8   947.4   5.06  0.00
15:07:13      z4_net0    0.06    0.00    1.00    0.00   60.00    0.00   0.00  0.00
15:07:14      z4_net0    0.00    0.00    0.00    0.00    0.00    0.00   0.00  0.00
15:07:15      z4_net0    0.00    0.00    0.00    0.00    0.00    0.00   0.00  0.00
    Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util   Sat
15:07:16      z4_net0    0.00    0.00    0.00    0.00    0.00    0.00   0.00  0.00
15:07:17      z4_net0 11101.2 11406.0 10441.4 13921.5  1088.7   839.0   0.93  0.00
15:07:18      z4_net0    0.00    0.00    0.00    0.00    0.00    0.00   0.00  0.00
15:07:19      z4_net0    2.07    3.26   22.97   24.97   92.17   133.7   0.00  0.00
15:07:20      z4_net0    0.00    0.07    0.00    1.00    0.00   70.00   0.00  0.00
15:07:21      z4_net0    0.06    0.00    1.00    0.00   60.00    0.00   0.00  0.00
15:07:22      z4_net0   110.9   160.6   747.6  1116.9   151.9   147.2   0.01  0.00
15:07:23      z4_net0  5427.8  5051.2  4304.9  5499.0  1291.1   940.6   0.44  0.00
15:07:24      z4_net0    0.20    0.21    3.00    3.00   70.00   71.00   0.00  0.00
15:07:25      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:07:26      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:07:27      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:07:28      z4_net0 33831.9 32027.7 34692.5 43095.8   998.6   761.0   2.77  0.00
15:07:29      z4_net0 44052.6 41278.4 39408.8 43849.8  1144.7   964.0   3.61  0.00
15:07:30      z4_net0    0.20    0.21    2.99    2.99   70.00   71.00   0.00  0.00
15:07:31      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:07:32      z4_net0    0.20    0.21    2.99    2.99   70.00   71.00   0.00  0.00
15:07:33      z4_net0   155.3   225.0  1045.4  1563.0   152.2   147.4   0.02  0.00
15:07:34      z4_net0 49931.1 39191.8 46361.7 50325.2  1102.8   797.5   4.09  0.00
15:07:35      z4_net0 27668.3 35347.9 25612.4 37189.4  1106.2   973.3   2.90  0.00
    Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util   Sat
15:07:36      z4_net0    0.34    0.40    4.99    5.99   70.00   69.00   0.00  0.00
15:07:37      z4_net0    0.14    0.21    2.00    3.00   70.00   71.00   0.00  0.00
15:07:38      z4_net0    0.75    0.65    9.99    8.99   76.40   74.11   0.00  0.00
15:07:39      z4_net0 10207.1  5962.8  9397.0  8842.8  1112.3   690.5   0.84  0.00
15:07:40      z4_net0 11992.0 11688.3 11072.4 13277.1  1109.0   901.5   0.98  0.00
15:07:41      z4_net0    0.66    2.03    9.99    8.99   68.00   230.9   0.00  0.00
15:07:42      z4_net0    0.41    0.47    5.99    6.99   70.00   69.29   0.00  0.00
15:07:43      z4_net0    0.41    0.48    5.99    6.99   70.00   71.00   0.00  0.00
15:07:44      z4_net0    0.48    0.42    6.99    5.99   70.00   71.00   0.00  0.00
15:07:45      z4_net0 39127.4 37837.6 38145.1 46408.0  1050.4   834.9   3.21  0.00
15:07:46      z4_net0 55385.0 54145.5 51033.7 60978.6  1111.3   909.3   4.54  0.00
15:07:47      z4_net0    0.27    1.68    3.99    3.99   70.00   430.8   0.00  0.00
15:07:48      z4_net0    0.14    0.14    2.00    2.00   70.00   71.00   0.00  0.00
15:07:49      z4_net0    0.35    0.40    4.99    5.99   71.60   68.67   0.00  0.00
15:07:50      z4_net0   93.75   135.4   631.1   940.7   152.1   147.4   0.01  0.00
15:07:51      z4_net0 13022.7 12345.4 15595.6 20908.0   855.1   604.6   1.07  0.00
15:07:52      z4_net0 57105.6 56992.8 54811.2 67079.8  1066.9   870.0   4.68  0.00
15:07:53      z4_net0 57905.2 56980.6 52235.8 66020.3  1135.1   883.8   4.74  0.00
15:07:54      z4_net0 16844.6 17298.4 17006.8 21562.4  1014.2   821.5   1.42  0.00
15:07:55      z4_net0  5454.9  5575.8  4499.6  6065.4  1241.4   941.4   0.46  0.00
    Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util   Sat
15:07:56      z4_net0    0.20    0.21    3.00    3.00   70.00   71.00   0.00  0.00
15:07:57      z4_net0    0.28    0.28    3.99    3.99   71.00   70.75   0.00  0.00
15:07:58      z4_net0 11149.7 11475.1 10749.4 14383.1  1062.1   817.0   0.94  0.00
15:07:59      z4_net0   557.3   786.0  3957.2  5576.8   144.2   144.3   0.06  0.00
15:08:00      z4_net0 55425.5 53552.2 51438.3 63786.6  1103.4   859.7   4.54  0.00
15:08:01      z4_net0 62146.8 64947.6 65729.1 81799.6   968.2   813.0   5.32  0.00
@Licenser
Contributor Author

The storage config is:

[root@ny4-leo-st-01 ~]# cat /opt/local/leo_storage/etc/leo_storage.conf
#======================================================================
# LeoFS - Storage Configuration
#
# See: http://www.leofs.org/docs/configuration.html#leofs-storage
#======================================================================
## --------------------------------------------------------------------
## SASL
## --------------------------------------------------------------------
## See: http://www.erlang.org/doc/man/sasl_app.html
##
## The following configuration parameters are defined for
## the SASL application. See app(4) for more information
## about configuration parameters

## SASL error log path
## sasl.sasl_error_log = /var/log/leo_storage/sasl/sasl-error.log

## Restricts the error logging performed by the specified sasl_error_logger
## to error reports, progress reports, or both.
## errlog_type = [error | progress | all]
## sasl.errlog_type = error

## Specifies in which directory the files are stored.
## If this parameter is undefined or false, the error_logger_mf_h is not installed.
# sasl.error_logger_mf_dir = /var/log/leo_storage/sasl

## Specifies how large each individual file can be.
## If this parameter is undefined, the error_logger_mf_h is not installed.
## sasl.error_logger_mf_maxbytes = 10485760

## Specifies how many files are used.
## If this parameter is undefined, the error_logger_mf_h is not installed.
## sasl.error_logger_mf_maxfiles = 5

## --------------------------------------------------------------------
## Manager's Node(s)
## --------------------------------------------------------------------
## Name of Manager node(s)
managers = [manager_0@leomgr0, manager_1@leomgr1]

## --------------------------------------------------------------------
## STORAGE
## --------------------------------------------------------------------
## Object container
obj_containers.path = [/var/db/leo_storage/avs]
obj_containers.num_of_containers = [128]

## e.g. Case of plural pathes
## obj_containers.path = [/var/leofs/avs/1, /var/leofs/avs/2]
## obj_containers.num_of_containers = [32, 64]


## A number of virtual-nodes for the redundant-manager
## num_of_vnodes = 168


## --------------------------------------------------------------------
## STORAGE - MQ
## --------------------------------------------------------------------
## A number of mq-server's processes
mq.num_of_mq_procs = 8

## MQ recover per_object
mq.recover_per_object.num_of_batch_process = 1
mq.recover_per_object.interval_min = 1
mq.recover_per_object.interval_max = 1

## MQ synchronize objects by vnode-id
mq.sync_by_vnode_id.num_of_batch_process = 1
mq.sync_by_vnode_id.interval_min = 8
mq.sync_by_vnode_id.interval_max = 32

## MQ rebalance objects
mq.rebalance.num_of_batch_process = 1
mq.rebalance.interval_min = 1
mq.rebalance.interval_max = 1

## MQ delete objects
mq.delete_object.num_of_batch_process = 1
mq.delete_object.interval_min = 1
mq.delete_object.interval_max = 1

## MQ recover a node's object
mq.recovery_node.num_of_batch_process = 1
mq.recovery_node.interval_min = 8
mq.recovery_node.interval_max = 32


## --------------------------------------------------------------------
## STORAGE - Replication/Recovery object(s)
## --------------------------------------------------------------------
## Rack-id for the rack-awareness replica placement
## replication.rack_awareness.rack_id =

## Size of stacked objects (bytes)
## replication.recovery.size_of_stacked_objs = 67108864

## Stacking timeout (msec)
## replication.recovery.stacking_timeout = 5000


## --------------------------------------------------------------------
## STORAGE - Log
## --------------------------------------------------------------------
## Log level: [0:debug, 1:info, 2:warn, 3:error]
## log.log_level = 1

## Output log file(s) - Erlang's log
log.erlang = /var/log/leo_storage/erlang

## Output log file(s) - app
log.app = /var/log/leo_storage/app

## Output log file(s) - members of storage-cluster
log.member_dir = /var/log/leo_storage/ring

## Output log file(s) - ring
log.ring_dir = /var/log/leo_storage/ring


## --------------------------------------------------------------------
## STORAGE - Other Directories
## --------------------------------------------------------------------
## Directory of queue for monitoring "RING"
queue_dir  = /var/db/leo_storage/queue

## Directory of SNMP agent configuration
## snmp_agent = ./snmp/snmpa_storage_0/LEO-STORAGE


## --------------------------------------------------------------------
## Other Libs
## --------------------------------------------------------------------
## Enable profiler - leo_backend_db
## leo_backend_db.profile = false

## Enable profiler - leo_logger
## leo_logger.profile = false

## Enable profiler - leo_mq
## leo_mq.profile = false

## MQ backend storage: [bitcask, leveldb]
leo_mq.backend_db = bitcask

## Enable profiler - leo_object_storage
## leo_object_storage.profile = false

## Enable strict check between checksum of a metadata and checksum of an object
## leo_object_storage.is_strict_check = false

## Metadata Storage: [bitcask, leveldb]
## leo_object_storage.metadata_storage = bitcask

## Enable profiler - leo_ordning_reda
## leo_ordning_reda.profile = false

## Send after interval
## leo_ordning_reda.send_after_interval = 100

## Enable profiler - leo_redundant_manager
## leo_redundant_manager.profile = false

## Enable profiler - leo_statistics
## leo_statistics.profile = false


#======================================================================
# For vm.args
#======================================================================
## Name of the leofs-storage node
nodename = storage0@loestorage0

## Cookie for distributed node communication.  All nodes in the same cluster
## should use the same cookie or they will not be able to communicate.
distributed_cookie = ...

## Enable kernel poll
erlang.kernel_poll = true

## Number of async threads
erlang.asyc_threads = 32

## Increase number of concurrent ports/sockets
erlang.max_ports = 64000

## Set the location of crash dumps
erlang.crash_dump = /var/log/leo_storage/erl_crash.dump

## Raise the ETS table limit
erlang.max_ets_tables = 256000

## Raise the default erlang process limit
process_limit = 1048576

## Path of SNMP-agent configuration
##snmp_conf = ./snmp/snmpa_storage_0/leo_storage_snmp

@Licenser
Contributor Author

Testing serial downloads with s3cli (which is not multithreaded) and gof3r (which is multithreaded). Since s3cli is limited by memory when downloading, we can only download a 100MB file 40 times to approximate 4GB of data. The same 40x100MB test is also run with gof3r for comparison; then gof3r is run with the 4g test file.

Serial s3cli

[root@go-test ~/win]# time for i in {1..40}; do echo $i; s3cli -b bench pull 100m 100m.cli; done
...
real    1m16.174s
user    0m23.725s
sys     0m39.379s

~420 MBit/s

Parallel (Multiple small files)

[root@go-test ~/win]# time for i in {1..40}; do echo $i; gof3r get --endpoint=leo -b bench -c 5 -s 5242880 -k 100m -p 100m.down; done
...
real    4m17.014s
user    0m11.794s
sys     0m11.246s

~125 MBit/s

Parallel (Big file)

[root@go-test ~/win]# gof3r get --endpoint=leo -b bench -c 5 -s 5242880 -k 4g -p 4g.down
...
2014/03/12 17:39:54 Duration: 3m55.173165578s

~140 MBit/s

Parallel (Big file, default config)

[root@go-test ~/win]# gof3r get --endpoint=leo -b bench -k 4g -p 4g.down
...
2014/03/12 17:42:36 Duration: 1m38.567685062s

~330 MBit/s

Theory

My theory is that with smaller chunks (5M), even though they should be aligned, the object chunks end up not being aligned, so reading chunks 1-5 requires the connections to read:

connection  chunks
1           1, 2
2           2, 3
3           3, 4
4           4, 5
5           5, 6

Meaning every single chunk is downloaded twice; bigger chunks mitigate this somewhat by only fetching every 4th chunk twice.

connection  chunks
1           1, 2, 3, 4
2           4, 5, 6, 7
3           7, 8, 9, 10
4           10, 11, 12, 13
5           13, 14, 15, 16

Would it be possible to ensure each chunk is only downloaded once, something like this:

(image: proposed chunk/connection assignment)
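
For illustration, here is a minimal Erlang sketch of that arithmetic (not LeoFS or gof3r code; the module and function names are made up). With the range size equal to the 5 MB chunk length but each connection's range shifted one byte past the chunk boundary, every connection spans two storage chunks, reproducing the first table above:

%% Minimal sketch: which storage chunks does a client byte range
%% [Start, End] touch, given a fixed storage chunk length?
%% Chunk indices are 1-based purely for illustration.
-module(chunk_overlap).
-export([chunks_for_range/3, demo/0]).

chunks_for_range(Start, End, ChunkLen) ->
    First = Start div ChunkLen + 1,
    Last  = End div ChunkLen + 1,
    lists:seq(First, Last).

demo() ->
    ChunkLen = 5242880,   %% LeoFS chunked_obj_len (5 MB), as configured above
    RangeLen = 5242880,   %% client-side range size
    %% Ranges that start one byte after a chunk boundary: each of the five
    %% connections touches two chunks -> connection 1 reads chunks [1,2],
    %% connection 2 reads [2,3], and so on, so every chunk is fetched twice.
    [{Conn, chunks_for_range((Conn - 1) * RangeLen + 1,
                             (Conn - 1) * RangeLen + RangeLen,
                             ChunkLen)}
     || Conn <- lists:seq(1, 5)].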

@Licenser
Contributor Author

Timeouts

Setup

2014/03/12 18:48:41 GET:  &{4g.down {4g bench false 20 20971520 leo false} }
2014/03/12 18:48:41 chunk total:  196
2014/03/12 18:48:41 content length :  4094722048
2014/03/12 18:48:41 concurrency:  20

Results

timeout (s)   # timeouts   time used (s)   throughput (MBit/s)
1             failed       failed          failed
2             150          62              528
3             151          76              528
5*            110          78              420
15            100          229             143
30            100          409             80
60            129          858             38

Conclusion

There seem to be two types of timeouts:

  1. Timeouts that clear after ~5s; these seem to be genuine requests that take some time to complete (see the drop from 150 to 110 between 3s and 5s).

  2. Timeouts that do not clear even after 1m; my best guess is that these are locking issues where two systems wait on each other, as it seems unreasonable that requests would take over a minute to finish.

@Licenser
Contributor Author

Addition:
CPU usage on the GW peaks at around 4% and on the Storage at around 2%; memory barely reaches 1G on the GW and 200M on the Storage, plus 2G of mmapped files.

No error or info logs indicating any problem are written on the GW.

@yosukehara
Member

Thank you for your report. We'll check this issue in our environment and then share the result.

@Licenser
Contributor Author

Thanks for taking the time to look at this. I have some additional info:

The original test was performed with N/R/W of 3/1/2 and three nodes. I repeated the test with N/R/W of 1/1/1 with one, two, and three nodes. I expected the timeouts to improve since data could be read from multiple nodes in parallel; to my confusion the opposite happened: the number and severity of timeouts increased dramatically, in many cases so badly that the GET did not manage to finish because the maximum number of timeouts was reached:

One Storage node

Write

[root@go-test ~/win]# gof3r put --endpoint=leo -b bench -k 4g -p 4g
2014/03/13 01:17:51 &{4g map[] {4g bench false 20 20971520 leo false}}
2014/03/13 01:18:15 makes: 22
2014/03/13 01:18:15 md5:  d535ebcf56e726abffd6d0fb28ef4473
2014/03/13 01:18:15 md5Path:  .md5/4g.md5
2014/03/13 01:18:15 Duration: 24.235580607s

~ 1.35 GBit/s

Read

[root@go-test ~/win]# gof3r get --endpoint=leo -b bench -k 4g -p 4g.down
2014/03/13 01:19:20 GET:  &{4g.down {4g bench false 20 20971520 leo false} }
2014/03/13 01:19:20 chunk total:  196
2014/03/13 01:19:20 content length :  4094722048
2014/03/13 01:19:20 concurrency:  20
... 226 timeouts ...
2014/03/13 01:21:31 makes: 43
2014/03/13 01:21:31 md5:  d535ebcf56e726abffd6d0fb28ef4473
2014/03/13 01:21:31 md5Path:  .md5/4g.md5
2014/03/13 01:21:31 Headers:  map[Server:[LeoFS] Content-Type:[application/octet-stream] Etag:["d41d8cd98f00b204e9800998ecf8427e"] Content-Length:[4094722048] Last-Modified:[Thu, 13 Mar 2014 01:18:15 GMT] Date:[Thu, 13 Mar 2014 01:19:20 GMT]]
2014/03/13 01:21:31 Duration: 2m11.476655442s

~ 250 MBit/s

Two storage nodes

Write

[root@go-test ~/win]# gof3r put --endpoint=leo -b bench -k 4g.1 -p 4g
2014/03/13 01:26:04 &{4g map[] {4g.1 bench false 20 20971520 leo false}}
2014/03/13 01:26:25 makes: 22
2014/03/13 01:26:25 md5:  d535ebcf56e726abffd6d0fb28ef4473
2014/03/13 01:26:25 md5Path:  .md5/4g.1.md5
2014/03/13 01:26:25 Duration: 20.686512039s

~ 1.58 GBit/s

Read

[root@go-test ~/win]# gof3r get --endpoint=leo -b bench -k 4g -p 4g.down
2014/03/13 01:47:09 GET:  &{4g.down {4g bench false 20 20971520 leo false} }
2014/03/13 01:47:09 chunk total:  196
2014/03/13 01:47:09 content length :  4094722048
2014/03/13 01:47:09 concurrency:  20
... 223 timeouts ...
2014/03/13 01:49:03 makes: 43
2014/03/13 01:49:03 md5:  d535ebcf56e726abffd6d0fb28ef4473
2014/03/13 01:49:03 md5Path:  .md5/4g.md5
2014/03/13 01:49:03 Headers:  map[Content-Length:[4094722048] Last-Modified:[Thu, 13 Mar 2014 01:28:46 GMT] Date:[Thu, 13 Mar 2014 01:47:09 GMT] Server:[LeoFS] Content-Type:[application/octet-stream] Etag:["d41d8cd98f00b204e9800998ecf8427e"]]
2014/03/13 01:49:03 Duration: 1m53.241235465s

~ 290 MBit/s

Three Storage Nodes

Write

[root@go-test ~/win]# gof3r put --endpoint=leo -b bench -k 4g.2 -p 4g
2014/03/13 01:59:09 &{4g map[] {4g.2 bench false 20 20971520 leo false}}
2014/03/13 01:59:29 makes: 22
2014/03/13 01:59:29 md5:  d535ebcf56e726abffd6d0fb28ef4473
2014/03/13 01:59:29 md5Path:  .md5/4g.2.md5
2014/03/13 01:59:29 Duration:   20.261161216s

~ 1.62 GBit/s

Read

[root@go-test ~/win]# gof3r get --endpoint=leo -b bench -k 4g -p 4g.down
2014/03/13 02:11:59 GET:  &{4g.down {4g bench false 20 20971520 leo false} }
2014/03/13 02:11:59 chunk total:  196
2014/03/13 02:11:59 content length :  4094722048
2014/03/13 02:11:59 concurrency:  20
...  256 timeouts ...
2014/03/13 02:14:33 makes: 43
2014/03/13 02:14:33 md5:  d535ebcf56e726abffd6d0fb28ef4473
2014/03/13 02:14:33 md5Path:  .md5/4g.md5
2014/03/13 02:14:33 Headers:  map[Content-Type:[application/octet-stream] Etag:["d41d8cd98f00b204e9800998ecf8427e"] Content-Length:[4094722048] Last-Modified:[Thu, 13 Mar 2014 02:01:02 GMT] Date:[Thu, 13 Mar 2014 02:11:59 GMT] Server:[LeoFS]]
2014/03/13 02:14:33 Duration: 2m34.55669319s

~ 211 MBit/s

@yosukehara
Member

Thank you for sharing the additional info; it is very helpful.

@Licenser
Contributor Author

To add more information, I've run the basho_bench benchmarks against the storage system, trying to rule out timeouts due to IO saturation. The tests only run on one node, but I consider that a sufficient representation, since the three zones running storage are on physically distinct systems, so interference between them can be ruled out.

All tests are run with the R8W2 configuration as in the shipped config file; additional tests are run with element sizes of 1MB and 5MB to more accurately simulate writing large files. Since the timeouts were seen from the beginning, the tests were stopped after a moderate amount of time, as long-term behavior was not the concern.

128k

(image: basho_bench summary graph)

1M

(image: basho_bench summary graph)

5M

(image: basho_bench summary graph)

@yosukehara
Member

Thank you so much :)
This info is a great help.

@mocchira
Member

The root cause is that an HTTP request for a very large file with a Range header that specifies the latter part of the file causes lots of HEAD requests between a gateway node and the storage nodes.

For example, say a 5GB file is stored in LeoFS as multiple chunks (with the defaults, there will be 1000 chunks). To read the chunks at the end, the current implementation confirms the metadata of every chunk with a HEAD request to the storage nodes, so in this case 1000 HEAD requests happen ;)

We don't need to confirm the metadata of every chunk if the chunks themselves have no child chunks (when using multipart upload with a size different from LeoFS's chunk size, the file stored in LeoFS consists of child chunks and grandchild chunks). To give priority to performance, we have decided that LeoFS users MUST set the multipart upload chunk size to the same value as LeoFS's chunk size. In that case, we can calculate the exact position a Range header refers to without issuing lots of HEAD requests, so the read performance you expect can be achieved.
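
For illustration, a rough Erlang sketch of that calculation (not the actual leo_gateway implementation; the module and function names are made up): when the client's chunk size equals chunked_obj_len, the index of the first chunk a Range request needs follows directly from the start offset, with no per-chunk HEAD requests.

%% Rough sketch, assuming the client chunk size equals LeoFS's
%% large_object.chunked_obj_len. Indices are 0-based here.
-module(range_to_chunk).
-export([start_chunk/2]).

%% StartPos: first byte of the Range header; ChunkSize: chunked_obj_len.
%% Returns {ChunkIndex, OffsetWithinChunk}.
start_chunk(StartPos, ChunkSize) when StartPos >= 0, ChunkSize > 0 ->
    {StartPos div ChunkSize, StartPos rem ChunkSize}.

%% e.g. start_chunk(466616320, 5242880) -> {89, 0}
%% (466616320 div 5242880 = 89, i.e. a Range starting at that offset
%% maps directly to chunk 89 without any preceding HEAD requests).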

With the latest dev branch in our test environment, the result is as follows:

### write 
[leofs@test01 dev]$ gof3r put --endpoint=gw01:8080 -b bbb -c 10 -s 5242880 -k 5G.dat -p 5G.dat
2014/03/14 15:34:53 &{5G.dat map[] {5G.dat bbb false 10 5242880 gw01:8080 false}}
2014/03/14 15:35:39 makes: 12
2014/03/14 15:35:40 remote_md5:  "7ccf12e98cac27e60a07c4e7a11403c4-1000"
2014/03/14 15:35:40 md5:  779bb0ec61923044ce4a760731c5377b
2014/03/14 15:35:40 md5Path:  .md5/5G.dat.md5
2014/03/14 15:35:40 Duration: 46.776661808s

### read 
[leofs@test01 dev]$ gof3r get --endpoint=gw01:8080 -b bbb -c 8 -s 5242880 -k 5G.dat -p 5G.dat.down
2014/03/14 15:44:58 GET:  &{5G.dat.down {5G.dat bbb false 8 5242880 gw01:8080 false} }
2014/03/14 15:45:00 chunk total:  1000
2014/03/14 15:45:00 content length :  5242880000
2014/03/14 15:45:00 concurrency:  8
2014/03/14 15:45:26 makes: 17
2014/03/14 15:45:26 md5:  779bb0ec61923044ce4a760731c5377b
2014/03/14 15:45:26 md5Path:  .md5/5G.dat.md5
2014/03/14 15:45:26 Headers:  map[Date:[Fri, 14 Mar 2014 06:45:03 GMT] Server:[LeoFS] Content-Type:[application/octet-stream] Etag:["7ccf12e98cac27e60a07c4e7a11403c4"] Content-Length:[5242880000] Last-Modified:[Fri, 14 Mar 2014 06:35:44 GMT]]
2014/03/14 15:45:42 Duration: 44.665105388s

The read throughput has reached 1 Gbit/s too :)

We will share the detailed test cases on the leo-project/notes repository later.

@yosukehara
Member

I'd like to share the results on leo-project/notes.
It would be great if you could take a look.

@Licenser
Contributor Author

Licenser commented Apr 2, 2014

Hi,
I've retried this with the 1.0.0 release. The GW config is unmodified; the chunk section is:

## --------------------------------------------------------------------
## GATEWAY - Large Object
## --------------------------------------------------------------------
## Total number of chunked objects
large_object.max_chunked_objs = 100000

## Length of a chunked object
large_object.chunked_obj_len = 5242880

## Threshold of length of a chunked object
large_object.threshold_of_chunk_len = 5767168

## Reading length of a chuncked object
##   * If happening timeout when copying a large object,
##     you will solve to set this value as less than 5MB.
##   * default: "large_object.chunked_obj_len" (5242880 - 5MB)
large_object.reading_chunked_obj_len = 5242880

Here the log: https://gist.github.com/Licenser/9936529

Given that you saw such good results and I keep failing to reproduce them, is there perhaps an option I forgot to change?

@mocchira
Member

mocchira commented Apr 3, 2014

Hi @Licenser

I tried 1.0.0 with default settings (all in one) but couldn't reproduce it.
My result is below.

[leofs@intel11 basho_bench]$ gof3r put --endpoint=localhost:8080 -b bbb -c 10 -s 5242880 -k 5G.dat -p ~/rpm/SPECS/5G.dat

2014/04/03 12:06:46 &{/home/leofs/rpm/SPECS/5G.dat map[] {5G.dat bbb false 10 5242880 localhost:8080 false}}
2014/04/03 12:07:40 makes: 5
2014/04/03 12:07:41 remote_md5:  "adfec641b6155ce947f5ef8b633eec6e-1000"
2014/04/03 12:07:41 md5:  f0c4910bd1b40aecaad309d2a8999e66
2014/04/03 12:07:41 md5Path:  .md5/5G.dat.md5
2014/04/03 12:07:41 Duration: 55.373718685s

[leofs@intel11 basho_bench]$ gof3r get --endpoint=localhost:8080 -b bbb -c 8 -s 5242880 
-k 5G.dat -p 5G.dat.down

2014/04/03 12:12:07 GET:  &{5G.dat.down {5G.dat bbb false 8 5242880 localhost:8080 false} }
2014/04/03 12:12:07 chunk total:  1000
2014/04/03 12:12:07 content length :  5242880000
2014/04/03 12:12:07 concurrency:  8
2014/04/03 12:12:35 makes: 13
2014/04/03 12:12:35 md5:  f0c4910bd1b40aecaad309d2a8999e66
2014/04/03 12:12:35 md5Path:  .md5/5G.dat.md5
2014/04/03 12:12:35 Headers:  map[Date:[Thu, 03 Apr 2014 03:12:07 GMT] Server:[LeoFS] Content-Type:[application/octet-stream] Etag:["adfec641b6155ce947f5ef8b633eec6e"] Content-Length:[5242880000] Last-Modified:[Thu, 03 Apr 2014 03:07:41 GMT]]
2014/04/03 12:12:35 Duration: 28.735938117s

As far as I know, your gist seems to be using an older version of leo_gateway to which this commit has not been applied.

Could you check whether the latest leo_gateway is installed, like this?

### on shell
cd ${leo_gateway_install_dir}
./bin/leo_gateway remote_console

### on erlang console
([email protected])3> leo_gateway_http_commons:module_info(attributes).
[{vsn,[305400042918926957424335382242018533625]},
 {author,['Yosuke Hara']},
 {author,['Yosuke Hara']}]

### if your `vsn` is not the same as above, your leo_gateway is probably old for some reason.

@Licenser
Contributor Author

Licenser commented Apr 3, 2014

Thanks for looking into it. I've checked, and sadly it all looks good version-wise. I will try to delete everything and use a fresh clone tomorrow, just in case some old data is hanging around somewhere that messes things up :)

Also, it should be noted that I run the 3 storage nodes on 3 different (physically distinct) machines; I'll try an all-in-one setup first tomorrow.

([email protected])1> leo_gateway_http_commons:module_info(attributes).
[{vsn,[305400042918926957424335382242018533625]},
 {author,['Yosuke Hara']},
 {author,['Yosuke Hara']}]

direct comparison:

305400042918926957424335382242018533625 -- from your console
305400042918926957424335382242018533625 -- from my console

The file from the commit:
         {ok, #?METADATA{del = 0, cnumber = N, dsize = ObjectSize, csize = CS} = _Meta} ->
             {NewStartPos, NewEndPos} = calc_pos(Start, End, ObjectSize),
             {CurPos, Index} = move_curpos2head(NewStartPos, CS, 0, 0),
             get_range_object_large(Req, Bucket, Key, NewStartPos, NewEndPos, N, Index, CurPos);

@mocchira
Member

mocchira commented Apr 3, 2014

Thanks for the quick reply!
Then we will try to test a cluster consisting of 3 storage nodes (physically distinct), as similar to your cluster as possible.

@Licenser
Contributor Author

Licenser commented Apr 3, 2014

Quick update:

I reran the tests with your setup, and I think I found the difference. When running gof3r from localhost I get OK throughput for the first ~1k chunks (higher chunks fail, but that is a different issue I'll look into); only when using a remote host do I see timeouts.

To check the network connectivity, I ran a few iperf tests between the networked hosts, and the network looks good.

[root@go-test ~/win]# iperf -c bench -t 60 -P 8
------------------------------------------------------------
Client connecting to bench, TCP port 5001
TCP window size:  125 KByte (default)
------------------------------------------------------------
[ 10] local 172.21.0.200 port 42265 connected with bench port 5001
[  3] local 172.21.0.200 port 46289 connected with bench port 5001
[  5] local 172.21.0.200 port 34332 connected with bench port 5001
[  8] local 172.21.0.200 port 49375 connected with bench port 5001
[  7] local 172.21.0.200 port 55730 connected with bench port 5001
[  4] local 172.21.0.200 port 56132 connected with bench port 5001
[  6] local 172.21.0.200 port 49868 connected with bench port 5001
[  9] local 172.21.0.200 port 40568 connected with bench port 5001
[ ID] Interval       Transfer     Bandwidth
[ 10]  0.0-60.0 sec  3.29 GBytes   472 Mbits/sec
[  3]  0.0-60.0 sec  3.06 GBytes   438 Mbits/sec
[  5]  0.0-60.0 sec  2.66 GBytes   380 Mbits/sec
[  8]  0.0-60.0 sec  1.68 GBytes   241 Mbits/sec
[  7]  0.0-60.0 sec  3.55 GBytes   508 Mbits/sec
[  4]  0.0-60.0 sec  1.89 GBytes   271 Mbits/sec
[  6]  0.0-60.0 sec  2.50 GBytes   358 Mbits/sec
[  9]  0.0-60.0 sec  2.38 GBytes   341 Mbits/sec
[SUM]  0.0-60.0 sec  21.0 GBytes  3.01 Gbits/sec

That it works locally but not remotely suggests a networking issue; on the other hand, the fact that uploads work just fine indicates there is no network issue, and the iperf results (while not optimal) suggest the same, so I am not sure what to make of it.

@Licenser
Contributor Author

Licenser commented Apr 3, 2014

On local it looks like this:

[root@leo-test1 ~]# ./gof3r get --endpoint=bench:8080 -b bench -c 8 -s 5242880 -k 4g -p 4g.down
2014/04/03 17:06:19 GET:  &{4g.down {4g bench false 8 5242880 bench:8080 false} }
2014/04/03 17:06:19 chunk total:  782
2014/04/03 17:06:19 content length :  4094722048
2014/04/03 17:06:19 concurrency:  8
2014/04/03 17:06:25 Error on attempt 0: retrying chunk: &{89 map[Range:[bytes=466616320-471859199] Date:[Thu, 03 Apr 2014 17:06:20 GMT] User-Agent:[S3Gof3r] Authorization:[AWS e2c11b798d9aa08c84f6:7R6dn5V4G4O6D2rSeHVwLb5s/aQ=]] 466616320 5242880 0xc2080fbf80 0}, Error: read tcp 127.0.0.1:8080: i/o timeout
2014/04/03 17:06:36 makes: 18
2014/04/03 17:06:36 md5:  d535ebcf56e726abffd6d0fb28ef4473
2014/04/03 17:06:36 md5Path:  .md5/4g.md5
2014/04/03 17:06:36 Headers:  map[Etag:["00b890eeb271dba8c42a34a03490eb16"] Content-Length:[4094722048] Last-Modified:[Thu, 03 Apr 2014 17:03:49 GMT] Date:[Thu, 03 Apr 2014 17:06:18 GMT] Server:[LeoFS] Content-Type:[application/octet-stream]]
2014/04/03 17:06:36 Duration: 16.894575956s

After a few more runs to warm the caches up, the local response goes down to:

[root@leo-test1 ~]# ./gof3r get --endpoint=bench.storage.lucerahq.com:8080 -b bench -c 8 -s 5242880 -k 4g -p 4g.down
2014/04/03 17:14:50 GET:  &{4g.down {4g bench false 8 5242880 bench.storage.lucerahq.com:8080 false} }
2014/04/03 17:14:50 chunk total:  782
2014/04/03 17:14:50 content length :  4094722048
2014/04/03 17:14:50 concurrency:  8
2014/04/03 17:15:01 makes: 15
2014/04/03 17:15:01 md5:  d535ebcf56e726abffd6d0fb28ef4473
2014/04/03 17:15:01 md5Path:  .md5/4g.md5
2014/04/03 17:15:01 Headers:  map[Date:[Thu, 03 Apr 2014 17:14:49 GMT] Server:[LeoFS] Content-Type:[application/octet-stream] Etag:["00b890eeb271dba8c42a34a03490eb16"] Content-Length:[4094722048] Last-Modified:[Thu, 03 Apr 2014 17:03:49 GMT]]
2014/04/03 17:15:01 Duration: 10.956782954s

@Licenser
Contributor Author

Licenser commented Apr 3, 2014

And another update: I added a log line to leo_gateway_rpc_handler:key, then ran the local and remote tests again, counting the times leo_gateway_rpc_handler:head was called:

[root@leo-test1 ~]# wc -l error.remote
4641 error.remote
[root@leo-test1 ~]# wc -l error.local
1566 error.local

The remote get had 372 timeouts, the local had 0.

Not sure if this helps any, but since the head request was the original issue, I figure it might shed some more light on it.

@mocchira
Member

mocchira commented Apr 8, 2014

We have finished tests with 3 storage nodes (physically distinct) but couldn't reproduce the issue.
The settings are below.

### remote
status
[System config]
                System version : 1.0.0
                    Cluster Id : leofs_1
                         DC Id : dc_1
                Total replicas : 2
           # of successes of R : 1
           # of successes of W : 2
           # of successes of D : 1
 # of DC-awareness replicas    : 0
                     ring size : 2^128
             Current ring hash : 6f7b97b0
                Prev ring hash : 6f7b97b0
[Multi DC replication settings]
         max # of joinable DCs : 2
            # of replicas a DC : 1

[Node(s) state]
-------+--------------------------------+--------------+----------------+----------------+----------------------------
 type  |              node              |    state     |  current ring  |   prev ring    |          updated at         
-------+--------------------------------+--------------+----------------+----------------+----------------------------
  S    | [email protected]      | running      | 6f7b97b0       | 6f7b97b0       | 2014-04-08 14:14:44 +0900
  S    | [email protected]      | running      | 6f7b97b0       | 6f7b97b0       | 2014-04-08 14:14:44 +0900
  S    | [email protected]      | running      | 6f7b97b0       | 6f7b97b0       | 2014-04-08 14:14:44 +0900
  G    | [email protected]      | running      | 6f7b97b0       | 6f7b97b0       | 2014-04-08 14:14:45 +0900

### local
status
[System config]
                System version : 1.0.0
                    Cluster Id : leofs_1
                         DC Id : dc_1
                Total replicas : 1
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
                     ring size : 2^128
             Current ring hash : 4ad41401
                Prev ring hash : 4ad41401
[Multi DC replication settings]
         max # of joinable DCs : 2
            # of replicas a DC : 1

[Node(s) state]
-------+--------------------------+--------------+----------------+----------------+----------------------------
 type  |           node           |    state     |  current ring  |   prev ring    |          updated at         
-------+--------------------------+--------------+----------------+----------------+----------------------------
  S    | [email protected]      | running      | 4ad41401       | 4ad41401       | 2014-04-08 14:37:12 +0900
  G    | [email protected]      | running      | 4ad41401       | 4ad41401       | 2014-04-08 14:37:12 +0900

The result is below (including the number of times leo_gateway_rpc_handler:head was called).

### remote
[kanno@leofs1301 basho_bench]$ gof3r put --endpoint=gw01:8080 -b bbb -c 4 -s 5242880 -k 5g.dat -p 5g.dat
2014/04/08 14:26:46 &{5g.dat {5g.dat bbb false 4 5242880 gw01:8080 false} map[]}
2014/04/08 14:28:17 makes: 6
2014/04/08 14:28:17 md5:  ee561cc87e157b07d2549d8fa4e93e33
2014/04/08 14:28:17 md5Path:  .md5/5g.dat.md5
2014/04/08 14:28:17 Duration: 1m31.852658297s

[kanno@leofs1302 ~]$ cat package/leo_gateway/log/erlang.log*|grep "head key"|wc -l
2001

[kanno@leofs1301 basho_bench]$ gof3r get --endpoint=gw01:8080 -b bbb -c 4 -s 5242880 -k 5g.dat -p 5g.dat.down
2014/04/08 14:30:07 GET:  &{5g.dat.down {5g.dat bbb false 4 5242880 gw01:8080 false} }
2014/04/08 14:30:07 chunk total:  1000
2014/04/08 14:30:07 content length :  5242880000
2014/04/08 14:30:07 concurrency:  4
2014/04/08 14:30:57 makes: 8
2014/04/08 14:30:57 md5:  ee561cc87e157b07d2549d8fa4e93e33
2014/04/08 14:30:57 md5Path:  .md5/5g.dat.md5
2014/04/08 14:30:57 Headers:  map[Connection:[keep-alive] Date:[Tue, 08 Apr 2014 05:29:59 GMT] Content-Length:[5242880000] Server:[LeoFS] Content-Type:[application/octet-stream] Etag:["d729cb7c6c7cadf5910f05b2fa64987f"] Last-Modified:[Tue, 08 Apr 2014 05:28:09 GMT]]
2014/04/08 14:30:57 Duration: 49.462065512s

[kanno@leofs1302 ~]$ cat package/leo_gateway/log/erlang.log*|grep "head key"|wc -l
4001

### local
[kanno@leofs1301 basho_bench]$ gof3r put --endpoint=localhost:8080 -b bbb -c 4 -s 5242880 -k 5g.dat -p 5g.dat
2014/04/08 14:38:55 &{5g.dat {5g.dat bbb false 4 5242880 localhost:8080 false} map[]}
2014/04/08 14:40:03 makes: 6
2014/04/08 14:40:04 md5:  ee561cc87e157b07d2549d8fa4e93e33
2014/04/08 14:40:04 md5Path:  .md5/5g.dat.md5
2014/04/08 14:40:04 Duration: 1m9.110891284s

[kanno@leofs1301 leofs]$ cat package/leo_gateway/log/erlang.log*|grep "head key"|wc -l
2001

[kanno@leofs1301 basho_bench]$ gof3r get --endpoint=localhost:8080 -b bbb -c 4 -s 5242880 -k 5g.dat -p 5g.dat.down
2014/04/08 14:50:32 GET:  &{5g.dat.down {5g.dat bbb false 4 5242880 localhost:8080 false} }
2014/04/08 14:50:32 chunk total:  1000
2014/04/08 14:50:32 content length :  5242880000
2014/04/08 14:50:32 concurrency:  4
2014/04/08 14:51:32 makes: 8
2014/04/08 14:51:32 md5:  ee561cc87e157b07d2549d8fa4e93e33
2014/04/08 14:51:32 md5Path:  .md5/5g.dat.md5
2014/04/08 14:51:32 Headers:  map[Connection:[keep-alive] Date:[Tue, 08 Apr 2014 05:50:31 GMT] Content-Length:[5242880000] Server:[LeoFS] Content-Type:[application/octet-stream] Etag:["d729cb7c6c7cadf5910f05b2fa64987f"] Last-Modified:[Tue, 08 Apr 2014 05:40:04 GMT]]
2014/04/08 14:51:32 Duration: 59.411457958s

[kanno@leofs1301 leofs]$ cat package/leo_gateway/log/erlang.log*|grep "head key"|wc -l
4001

As stated above, after the put/get requests have finished, the number of times leo_gateway_rpc_handler:head was called should be $num_of_chunk * 4 + 1, regardless of local/remote or the number of replicas, provided there were no timeouts (retries) on the client side.
In this case, since $num_of_chunk is 1000, the result is 4001.

So your result

[root@leo-test1 ~]# wc -l error.remote
4641 error.remote

seems to be a reasonable number ($num_of_chunk * 4 + 1 + timeouts + alpha).
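
As a quick sanity check, a tiny Erlang sketch of that formula (not LeoFS code; the module name is made up):

%% Expected number of leo_gateway_rpc_handler:head calls after one
%% put/get pair, as stated above: $num_of_chunk * 4 + 1.
-module(head_count).
-export([expected/1]).

expected(NumOfChunks) when NumOfChunks > 0 ->
    NumOfChunks * 4 + 1.

%% expected(1000) =:= 4001, matching the grep count after put + get above.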

So we will have to look into the next step:
could you share the consistency settings (the output of the status command on the manager console) and the network bandwidth with us?
When the # of successes of R is set to more than 2 and the network bandwidth is relatively low, we think this issue could happen.

@Licenser
Contributor Author

I think we can rule out LeoFS at this point. I did some additional tests over the last few days, and it comes out as this:

  1. s3cmd runs w/o timeouts (albeit a good bit slower, ~1GB/s)
  2. gof3r runs on Linux (albeit slower, ~1GB/s, limited by the VM's interface)

This indicates that it is probably an issue with gof3r on SmartOS.

A thousand thanks for the help, guys!

@yosukehara
Member

Thank you for your comment.
