
optimize table ontime all returns Too many open files (os error 24) #4253

Closed
BohuTANG opened this issue Feb 26, 2022 · 11 comments · Fixed by #4283
Assignees
Labels
C-bug Category: something isn't working

Comments

@BohuTANG
Member

Summary

Using ontime dataset.

Server log:

OpenTelemetry trace error occurred. Exporter jaeger encountered the following error(s): thrift agent failed with message too long
2022-02-26T07:04:57.174582Z ERROR databend_query::servers::mysql::writers::query_result_writer: OnQuery Error: Code: 1024, displayText = fail to read block _b/62b6f5f2b89f42a99d73cab195388fb3.parquet, Code: 1024, displayText = Io error: unexpected: (op: read, path: _b/62b6f5f2b89f42a99d73cab195388fb3.parquet, source: io error: error trying to connect: tcp open error: Too many open files (os error 24))..

   0: common_exception::exception_code::<impl common_exception::exception::ErrorCode>::ParquetError
             at /home/ubuntu/databend/common/exception/src/exception_code.rs:36:66
   1: databend_query::storages::fuse::operations::read::<impl databend_query::storages::fuse::table::FuseTable>::do_read::{{closure}}::{{closure}}::{{closure}}::{{closure}}
             at /home/ubuntu/databend/query/src/storages/fuse/operations/read.rs:86:25
      core::result::Result<T,E>::map_err
             at /rustc/f624427f8771c00819684c783bb841bf72095704/library/core/src/result.rs:842:27
      databend_query::storages::fuse::operations::read::<impl databend_query::storages::fuse::table::FuseTable>::do_read::{{closure}}::{{closure}}::{{closure}}
             at /home/ubuntu/databend/query/src/storages/fuse/operations/read.rs:85:21
      <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/f624427f8771c00819684c783bb841bf72095704/library/core/src/future/mod.rs:84:19
   2: <futures_util::stream::futures_unordered::FuturesUnordered<Fut> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/stream/futures_unordered/mod.rs:514:17
      futures_util::stream::stream::StreamExt::poll_next_unpin
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/stream/stream/mod.rs:1626:9
   3: <futures_util::stream::stream::buffer_unordered::BufferUnordered<St> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/stream/stream/buffer_unordered.rs:79:15
   4: <tracing_futures::Instrumented<T> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-futures-0.2.5/src/lib.rs:342:9
   5: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.21/src/stream.rs:120:9
      <common_streams::stream_progress::ProgressStream as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/databend/common/streams/src/stream_progress.rs:51:15
   6: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.21/src/stream.rs:120:9
      <futures_util::abortable::Abortable<St> as futures_core::stream::Stream>::poll_next::{{closure}}
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/abortable.rs:170:40
      futures_util::abortable::Abortable<T>::try_poll
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/abortable.rs:133:33
      <futures_util::abortable::Abortable<St> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/abortable.rs:170:9
      <common_streams::stream_abort::AbortStream as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/databend/common/streams/src/stream_abort.rs:54:15
   7: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.21/src/stream.rs:120:9
      futures_util::stream::stream::StreamExt::poll_next_unpin
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/stream/stream/mod.rs:1626:
@BohuTANG BohuTANG added the C-bug Category: something isn't working label Feb 26, 2022
@dantengsky
Member

I have seen this on the lab dev box;

ulimit -S -n xxx

can mitigate this.
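For reference, the soft limit can be inspected and raised per shell session before starting the server (a minimal sketch; the right value depends on table width and thread settings, and the soft limit cannot exceed the hard limit):

```shell
# show the current soft limit on open file descriptors
ulimit -S -n
# raise the soft limit up to the hard limit for this shell session;
# child processes (e.g. databend-query) inherit it
ulimit -S -n "$(ulimit -H -n)"
ulimit -S -n
```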

The read_columns_many_async introduced in PR #4212 reads all the columns simultaneously, so for one block there will be as many fds opened as the table has columns. Table ontime has 100+ columns, so there might be num_cpus * parallel_read_threads * 100+ fds open simultaneously.
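To illustrate the arithmetic above (the concrete numbers below are hypothetical, not measured from Databend):

```rust
fn main() {
    // Hypothetical deployment: 16 logical cores, 1 parallel read
    // thread per core, and the ~100-column ontime table.
    let num_cpus = 16;
    let parallel_read_threads = 1;
    let num_columns = 109;

    // Worst case: every in-flight block read opens one fd per column.
    let worst_case_fds = num_cpus * parallel_read_threads * num_columns;
    println!("worst-case simultaneous fds: {}", worst_case_fds);

    // This already exceeds the common default soft limit of 1024,
    // which surfaces as "Too many open files (os error 24)".
    assert!(worst_case_fds > 1024);
}
```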

@dantengsky
Member

@youngsofun

Is it possible that we could use a home-brewed read_columns_many_async that limits the number of columns read simultaneously?

Or shall we fall back to the stream.buffered(n).try_collect() style, but use _read_single_column_async to read each column (in one go)?

@BohuTANG
Member Author

Thank you.
This will be a problem if a table has thousands of columns; it looks like we should add some limits here.

@BohuTANG
Member Author

BohuTANG commented Feb 26, 2022

Another problem:

After optimize table ontime all, query performance is 3X slower.
After optimize:

mysql> select * from fuse_history('default', 'ontime');
+----------------------------------+----------------------------------+---------------+-------------+-----------+--------------------+------------------+
| snapshot_id                      | prev_snapshot_id                 | segment_count | block_count | row_count | bytes_uncompressed | bytes_compressed |
+----------------------------------+----------------------------------+---------------+-------------+-----------+--------------------+------------------+
| f200b56e7d214695bebbe258a167efb9 | 431635577461447dbf1bd6a77783c18f |             8 |         196 | 195662213 |       143555104196 |      15090106477 |
+----------------------------------+----------------------------------+---------------+-------------+-----------+--------------------+------------------+
1 row in set (0.03 sec)
Read 1 rows, 136 B in 0.031 sec., 32.61 rows/sec., 4.43 KB/sec.

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
+---------+--------------------+
| Carrier | c3                 |
+---------+--------------------+
| NW      |  5953.887737746666 |
| HP      |  8606.658580393145 |
| F9      |  6078.962351003004 |
| RU      |   6137.49191332561 |
| OH      |  9174.990431086686 |
| B6      | 11177.923627230142 |
| AA      |  9107.644700861949 |
| XE      | 11066.582047557225 |
| HA      | -538.8768939038356 |
| TZ      |  5604.692697405455 |
| OO      |  7156.105931260668 |
| CO      |  7723.676547969114 |
| US      |  7003.805255113001 |
| DH      |  9562.764009429655 |
| AQ      | 1569.9276465368148 |
| YV      | 12465.722388227468 |
| 9E      |  7700.845777541507 |
| UA      | 10474.691240783097 |
| WN      |  9800.984032216535 |
| FL      |   10091.3806677191 |
| EV      | 13248.640462884143 |
| AS      |  9446.867494009617 |
| MQ      |   9004.73935505603 |
| TW      |  7715.900299305268 |
| DL      |  7316.332086536578 |
+---------+--------------------+
25 rows in set (5.13 sec)
Read 195662213 rows, 3.13 GB in 5.116 sec., 38.25 million rows/sec., 611.93 MB/sec.

Query log:
Before optimize:

           log_type: 2
       handler_type: MySQL
          tenant_id: test_tenant
         cluster_id: test_cluster
           sql_user: root
     sql_user_quota: UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 }
sql_user_privileges: UserGrantSet { entries: [GrantEntry { user: "root", host_pattern: "127.0.0.1", object: Global, privileges: BitFlags<UserPrivilegeType>(0b1111111101111, Usage | Select | Insert | Update | Delete | Create | Drop | Alter | S
uper | CreateUser | CreateRole | Grant) }], roles: {} }
           query_id: b7966603-cba6-46f1-919c-7e1c0cd717ab
         query_kind: SelectPlan
         query_text: SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier
         event_date: 2022-02-26
         event_time: 1645863357633
   current_database: default
          databases: 
             tables: 
            columns: 
        projections: 
       written_rows: 0
      written_bytes: 0
          scan_rows: 195662213
         scan_bytes: 3130596976
  scan_byte_cost_ms: 0
         scan_seeks: 0
  scan_seek_cost_ms: 0
    scan_partitions: 196
   total_partitions: 196
        result_rows: 25
       result_bytes: 458
          cpu_usage: 8
       memory_usage: 2672
        client_info: 
     client_address: Some(127.0.0.1:45538)
     exception_code: 0
     exception_text: 
        stack_trace: 
     server_version: 
              extra: 

After optimize:

           log_type: 2
       handler_type: MySQL
          tenant_id: test_tenant
         cluster_id: test_cluster
           sql_user: root
     sql_user_quota: UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 }
sql_user_privileges: UserGrantSet { entries: [GrantEntry { user: "root", host_pattern: "127.0.0.1", object: Global, privileges: BitFlags<UserPrivilegeType>(0b1111111101111, Usage | Select | Insert | Update | Delete | Create | Drop | Alter | S
uper | CreateUser | CreateRole | Grant) }], roles: {} }
           query_id: a6b9a092-2e2f-4c0e-9e63-fa2bc7b6c3f4
         query_kind: SelectPlan
         query_text: SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier
         event_date: 2022-02-26
         event_time: 1645863471230
   current_database: default
          databases: 
             tables: 
            columns: 
        projections: 
       written_rows: 0
      written_bytes: 0
          scan_rows: 195662213
         scan_bytes: 3130596976
  scan_byte_cost_ms: 0
         scan_seeks: 0
  scan_seek_cost_ms: 0
    scan_partitions: 196
   total_partitions: 196
        result_rows: 25
       result_bytes: 458
          cpu_usage: 8
       memory_usage: 2672
        client_info: 
     client_address: Some(127.0.0.1:45538)
     exception_code: 0
     exception_text: 
        stack_trace: 
     server_version: 
              extra: 

It looks like there is no difference between them in the query log.

cc @dantengsky

@dantengsky
Member

Got it. BTW, is the table meta cache warmed up after optimize table?

@BohuTANG
Member Author

The meta cache is enabled:

mysql> select * from system.configs where name like '%cache%';
+------------------------------+--------+-------+-------------+
| name                         | value  | group | description |
+------------------------------+--------+-------+-------------+
| table_cache_enabled          | true   | query |             |
| table_cache_snapshot_count   | 256    | query |             |
| table_cache_segment_count    | 10240  | query |             |
| table_cache_block_meta_count | 102400 | query |             |
| table_memory_cache_mb_size   | 1024   | query |             |
| table_disk_cache_root        | _cache | query |             |
| table_disk_cache_mb_size     | 10240  | query |             |
+------------------------------+--------+-------+-------------+

@Xuanwo Xuanwo moved this to 📋 Backlog in Xuanwo's Work Feb 27, 2022
@youngsofun
Member

I'm sorry for letting this linger; beginning work on it now.

@youngsofun
Member

youngsofun commented Mar 1, 2022

@dantengsky

The first question is: do we need this level of parallel IO beyond parallel_read_threads?

I suggest YES:

  • suppose in the pipeline we need 4 IO tasks to keep up with 1 CPU. Reading 2 partitions in parallel with 2 columns each is better than reading 4 partitions with 1 column each: the latter starts slower (the CPU may wait longer at first) and needs more memory buffer.
  • this is not unique to parquet, but applies to any columnar store

todo:

  • if NO:

  • If YES:

    • need another config item, parallel_read_columns ?
    • still need multiple readers/fds
    • I think parquet2 may improve this later.
    • I can try to write a simple workaround by controlling parallelism in the factory: when one reader is dropped, a waiter can get one; in arrow2, each column read buffer is allocated after getting a reader
      • if that does not work for you, I can try to write another version of read_columns_many_async
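The factory idea ("one reader dropped, then a waiter can get one") can be sketched as a counting semaphore whose permit is released on Drop. This is an illustrative std-only sketch; `ReaderFactory` and `Permit` are invented names for this example, not parquet2 or Databend APIs:

```rust
use std::sync::{Arc, Condvar, Mutex};

// A counting semaphore: at most `max` readers may exist at once.
struct ReaderFactory {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

// RAII permit: dropping it returns capacity to the factory.
struct Permit {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl ReaderFactory {
    fn new(max: usize) -> Self {
        ReaderFactory {
            inner: Arc::new((Mutex::new(max), Condvar::new())),
        }
    }

    // Blocks until a permit is free, then hands out a "reader" permit.
    fn acquire(&self) -> Permit {
        let (lock, cvar) = &*self.inner;
        let mut available = lock.lock().unwrap();
        while *available == 0 {
            available = cvar.wait(available).unwrap();
        }
        *available -= 1;
        Permit { inner: Arc::clone(&self.inner) }
    }
}

impl Drop for Permit {
    // One reader dropped => exactly one waiter can get a permit.
    fn drop(&mut self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() += 1;
        cvar.notify_one();
    }
}

fn main() {
    let factory = Arc::new(ReaderFactory::new(2));
    let mut handles = vec![];
    for i in 0..4 {
        let f = Arc::clone(&factory);
        handles.push(std::thread::spawn(move || {
            let _reader = f.acquire(); // at most 2 "column reads" at once
            println!("reading column {}", i);
            // permit released here when `_reader` goes out of scope
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
}
```

The RAII permit mirrors the proposal: the number of live readers never exceeds the configured limit, and the column read buffer would only be allocated after a permit is obtained.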

@dantengsky
Member

Please correct me if I misunderstand the question.

The first question is: do we need this level of parallel IO beyond parallel_read_threads?

Does it mean: should we allow simultaneous reading of several columns (while blocks are already being read in parallel)?

imo, yes.

Before read_columns_many_async was introduced, the columns were also read in parallel, or rather the async tasks of reading columns were buffered to the runtime (by using stream::buffered(n), if I recall correctly).

But there was a limit (a hardcoded constant) on the number of tasks that could be buffered, so the max number of FDs open simultaneously did not depend on the number of columns of the table being scanned.

i.e. instead of

max_threads * parallel_read_columns * number_of_columns, where number_of_columns might be quite large and varies from table to table, the bound is

max_threads * parallel_read_columns * N, where N is a configured constant.

need another config item, parallel_read_columns ?

It used to be a hardcoded constant, 10. Not sure if we should have both parallel_read_threads and parallel_read_columns; maybe a single config item should rule it.

I can try to write a simple workaround by controlling parallelism in the factory: when one reader is dropped, a waiter can get one; in arrow2, each column read buffer is allocated after getting a reader

I'm a little bit lost here: does it mean the factory will not produce another reader till the previous one is dropped? That seems not so good, if that's what it means.

I think parquet2 may improve this later.

yeah, maybe we should open a feature request issue :D

@youngsofun
Member

youngsofun commented Mar 1, 2022

@dantengsky

Before read_columns_many_async was introduced, the columns were also read in parallel, or rather the async tasks of reading columns were buffered to the runtime (by using stream::buffered(n), if I recall correctly).

Maybe I missed something, but as far as I can see from the code before read_columns_many_async was introduced, the read was async but not parallel (the chunk decoding may be in parallel), because we had only one reader, which cannot be read from simultaneously (that is why the parquet API changed), and I didn't find code like stream::buffered(n); maybe it was deleted?

I'm a little bit lost here: does it mean the factory will not produce another reader till the previous one is dropped? That seems not so good, if that's what it means.

It didn't seem like a good idea to me at first, but it works well with the new parquet2 API.
I currently can't think of a concrete problem with it.
The current PR uses this approach.

@dantengsky
Member

Maybe I missed something, but as far as I can see from the code before read_columns_many_async was introduced, the read was async but not parallel (the chunk decoding may be in parallel), because we had only one reader, which cannot be read from simultaneously (that is why the parquet API changed), and I didn't find code like stream::buffered(n); maybe it was deleted?

Yes, it was deleted several commits ago. Here is one of the versions that used buffered(n):

https://github.com/datafuselabs/databend/blob/dc058c9d22baa9e61763661f77cd10ec62c87c48/query/src/storages/fuse/io/block_reader.rs#L117-L119

It didn't seem like a good idea to me at first, but it works well with the new parquet2 API. I currently can't think of a concrete problem with it. The current PR uses this approach.

got it. let's refine it later if necessary.

@Xuanwo Xuanwo moved this from 📋 Backlog to 📦 Done in Xuanwo's Work Mar 2, 2022