
optimize table ontime all returns Too many open files (os error 24) #4253

Closed
BohuTANG opened this issue Feb 26, 2022 · 11 comments · Fixed by #4283
Assignees
Labels
C-bug Category: something isn't working

Comments

@BohuTANG
Member

Summary

Using ontime dataset.

Server log:

OpenTelemetry trace error occurred. Exporter jaeger encountered the following error(s): thrift agent failed with message too long
2022-02-26T07:04:57.174582Z ERROR databend_query::servers::mysql::writers::query_result_writer: OnQuery Error: Code: 1024, displayText = fail to read block _b/62b6f5f2b89f42a99d73cab195388fb3.parquet, Code: 1024, displayText = Io error: unexpected: (op: read, path: _b/62b6f5f2b89f42a99d73cab195388fb3.parquet, source: io error: error trying to connect: tcp open error: Too many open files (os error 24))..

   0: common_exception::exception_code::<impl common_exception::exception::ErrorCode>::ParquetError
             at /home/ubuntu/databend/common/exception/src/exception_code.rs:36:66
   1: databend_query::storages::fuse::operations::read::<impl databend_query::storages::fuse::table::FuseTable>::do_read::{{closure}}::{{closure}}::{{closure}}::{{closure}}
             at /home/ubuntu/databend/query/src/storages/fuse/operations/read.rs:86:25
      core::result::Result<T,E>::map_err
             at /rustc/f624427f8771c00819684c783bb841bf72095704/library/core/src/result.rs:842:27
      databend_query::storages::fuse::operations::read::<impl databend_query::storages::fuse::table::FuseTable>::do_read::{{closure}}::{{closure}}::{{closure}}
             at /home/ubuntu/databend/query/src/storages/fuse/operations/read.rs:85:21
      <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/f624427f8771c00819684c783bb841bf72095704/library/core/src/future/mod.rs:84:19
   2: <futures_util::stream::futures_unordered::FuturesUnordered<Fut> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/stream/futures_unordered/mod.rs:514:17
      futures_util::stream::stream::StreamExt::poll_next_unpin
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/stream/stream/mod.rs:1626:9
   3: <futures_util::stream::stream::buffer_unordered::BufferUnordered<St> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/stream/stream/buffer_unordered.rs:79:15
   4: <tracing_futures::Instrumented<T> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-futures-0.2.5/src/lib.rs:342:9
   5: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.21/src/stream.rs:120:9
      <common_streams::stream_progress::ProgressStream as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/databend/common/streams/src/stream_progress.rs:51:15
   6: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.21/src/stream.rs:120:9
      <futures_util::abortable::Abortable<St> as futures_core::stream::Stream>::poll_next::{{closure}}
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/abortable.rs:170:40
      futures_util::abortable::Abortable<T>::try_poll
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/abortable.rs:133:33
      <futures_util::abortable::Abortable<St> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/abortable.rs:170:9
      <common_streams::stream_abort::AbortStream as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/databend/common/streams/src/stream_abort.rs:54:15
   7: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.21/src/stream.rs:120:9
      futures_util::stream::stream::StreamExt::poll_next_unpin
             at /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.21/src/stream/stream/mod.rs:1626:
@BohuTANG BohuTANG added the C-bug Category: something isn't working label Feb 26, 2022
@dantengsky
Member

I have seen this on the lab dev box;

ulimit -S -n xxx

can mitigate this.
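For reference, the soft limit can be inspected and raised per shell session before starting the server (a minimal sketch; the right value depends on table width and thread settings, and the soft limit cannot exceed the hard limit):

```shell
# show the current soft limit on open file descriptors
ulimit -S -n
# raise the soft limit up to the hard limit for this shell session;
# child processes (e.g. databend-query) inherit it
ulimit -S -n "$(ulimit -H -n)"
ulimit -S -n
```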

The read_columns_many_async introduced in PR #4212 reads all the columns simultaneously, so for one block there will be as many fds opened as the table has columns. Table ontime has 100+ columns, so there might be num_cpus * parallel_read_threads * 100+ fds open simultaneously.
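To illustrate the arithmetic above (the concrete numbers below are hypothetical, not measured from Databend):

```rust
fn main() {
    // Hypothetical deployment: 16 logical cores, 1 parallel read
    // thread per core, and the ~100-column ontime table.
    let num_cpus = 16;
    let parallel_read_threads = 1;
    let num_columns = 109;

    // Worst case: every in-flight block read opens one fd per column.
    let worst_case_fds = num_cpus * parallel_read_threads * num_columns;
    println!("worst-case simultaneous fds: {}", worst_case_fds);

    // This already exceeds the common default soft limit of 1024,
    // which surfaces as "Too many open files (os error 24)".
    assert!(worst_case_fds > 1024);
}
```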

@dantengsky
Member

@youngsofun

Is it possible that we could use a home-brewed read_columns_many_async that limits the number of columns read simultaneously?

Or shall we fall back to the stream.buffered(n).try_collect() style, but use _read_single_column_async to read each column (in one go)?

@BohuTANG
Member Author

Thank you.
This will be a problem if a table has thousands of columns; it looks like we should add some limits here.

@BohuTANG
Member Author

BohuTANG commented Feb 26, 2022

Another problem:

After optimize table ontime all, query performance is 3X slower.
After optimize:

mysql> select * from fuse_history('default', 'ontime');
+----------------------------------+----------------------------------+---------------+-------------+-----------+--------------------+------------------+
| snapshot_id                      | prev_snapshot_id                 | segment_count | block_count | row_count | bytes_uncompressed | bytes_compressed |
+----------------------------------+----------------------------------+---------------+-------------+-----------+--------------------+------------------+
| f200b56e7d214695bebbe258a167efb9 | 431635577461447dbf1bd6a77783c18f |             8 |         196 | 195662213 |       143555104196 |      15090106477 |
+----------------------------------+----------------------------------+---------------+-------------+-----------+--------------------+------------------+
1 row in set (0.03 sec)
Read 1 rows, 136 B in 0.031 sec., 32.61 rows/sec., 4.43 KB/sec.

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
+---------+--------------------+
| Carrier | c3                 |
+---------+--------------------+
| NW      |  5953.887737746666 |
| HP      |  8606.658580393145 |
| F9      |  6078.962351003004 |
| RU      |   6137.49191332561 |
| OH      |  9174.990431086686 |
| B6      | 11177.923627230142 |
| AA      |  9107.644700861949 |
| XE      | 11066.582047557225 |
| HA      | -538.8768939038356 |
| TZ      |  5604.692697405455 |
| OO      |  7156.105931260668 |
| CO      |  7723.676547969114 |
| US      |  7003.805255113001 |
| DH      |  9562.764009429655 |
| AQ      | 1569.9276465368148 |
| YV      | 12465.722388227468 |
| 9E      |  7700.845777541507 |
| UA      | 10474.691240783097 |
| WN      |  9800.984032216535 |
| FL      |   10091.3806677191 |
| EV      | 13248.640462884143 |
| AS      |  9446.867494009617 |
| MQ      |   9004.73935505603 |
| TW      |  7715.900299305268 |
| DL      |  7316.332086536578 |
+---------+--------------------+
25 rows in set (5.13 sec)
Read 195662213 rows, 3.13 GB in 5.116 sec., 38.25 million rows/sec., 611.93 MB/sec.

Query log:
Before optimize:

           log_type: 2
       handler_type: MySQL
          tenant_id: test_tenant
         cluster_id: test_cluster
           sql_user: root
     sql_user_quota: UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 }
sql_user_privileges: UserGrantSet { entries: [GrantEntry { user: "root", host_pattern: "127.0.0.1", object: Global, privileges: BitFlags<UserPrivilegeType>(0b1111111101111, Usage | Select | Insert | Update | Delete | Create | Drop | Alter | S
uper | CreateUser | CreateRole | Grant) }], roles: {} }
           query_id: b7966603-cba6-46f1-919c-7e1c0cd717ab
         query_kind: SelectPlan
         query_text: SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier
         event_date: 2022-02-26
         event_time: 1645863357633
   current_database: default
          databases: 
             tables: 
            columns: 
        projections: 
       written_rows: 0
      written_bytes: 0
          scan_rows: 195662213
         scan_bytes: 3130596976
  scan_byte_cost_ms: 0
         scan_seeks: 0
  scan_seek_cost_ms: 0
    scan_partitions: 196
   total_partitions: 196
        result_rows: 25
       result_bytes: 458
          cpu_usage: 8
       memory_usage: 2672
        client_info: 
     client_address: Some(127.0.0.1:45538)
     exception_code: 0
     exception_text: 
        stack_trace: 
     server_version: 
              extra: 

After optimize:

           log_type: 2
       handler_type: MySQL
          tenant_id: test_tenant
         cluster_id: test_cluster
           sql_user: root
     sql_user_quota: UserQuota { max_cpu: 0, max_memory_in_bytes: 0, max_storage_in_bytes: 0 }
sql_user_privileges: UserGrantSet { entries: [GrantEntry { user: "root", host_pattern: "127.0.0.1", object: Global, privileges: BitFlags<UserPrivilegeType>(0b1111111101111, Usage | Select | Insert | Update | Delete | Create | Drop | Alter | S
uper | CreateUser | CreateRole | Grant) }], roles: {} }
           query_id: a6b9a092-2e2f-4c0e-9e63-fa2bc7b6c3f4
         query_kind: SelectPlan
         query_text: SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier
         event_date: 2022-02-26
         event_time: 1645863471230
   current_database: default
          databases: 
             tables: 
            columns: 
        projections: 
       written_rows: 0
      written_bytes: 0
          scan_rows: 195662213
         scan_bytes: 3130596976
  scan_byte_cost_ms: 0
         scan_seeks: 0
  scan_seek_cost_ms: 0
    scan_partitions: 196
   total_partitions: 196
        result_rows: 25
       result_bytes: 458
          cpu_usage: 8
       memory_usage: 2672
        client_info: 
     client_address: Some(127.0.0.1:45538)
     exception_code: 0
     exception_text: 
        stack_trace: 
     server_version: 
              extra: 

It looks like there is no difference between them in the query log.

cc @dantengsky

@dantengsky
Member

Got it. BTW, is the table meta cache warmed up after optimize table?

@BohuTANG
Member Author

The meta cache is enabled:

mysql> select * from system.configs where name like '%cache%';
+------------------------------+--------+-------+-------------+
| name                         | value  | group | description |
+------------------------------+--------+-------+-------------+
| table_cache_enabled          | true   | query |             |
| table_cache_snapshot_count   | 256    | query |             |
| table_cache_segment_count    | 10240  | query |             |
| table_cache_block_meta_count | 102400 | query |             |
| table_memory_cache_mb_size   | 1024   | query |             |
| table_disk_cache_root        | _cache | query |             |
| table_disk_cache_mb_size     | 10240  | query |             |
+------------------------------+--------+-------+-------------+

@Xuanwo Xuanwo moved this to 📋 Backlog in Xuanwo's Work Feb 27, 2022
@youngsofun
Member

I'm sorry for letting this linger; beginning work on it now.

@youngsofun
Member

youngsofun commented Mar 1, 2022

@dantengsky

The first question is: do we need this level of parallel IO beyond parallel_read_threads?

I suggest YES:

  • suppose in the pipeline we need 4 IO tasks to keep up with 1 CPU. Reading 2 partitions in parallel with 2 columns each is better than reading 4 partitions with 1 column each: the latter starts slower (the CPU may wait longer at first) and needs more memory buffer.
  • this is not unique to parquet, but applies to any columnar store

todo:

  • if NO:

  • If YES:

    • need another config item, parallel_read_columns ?
    • still need multiple readers/fds
    • I think parquet2 may improve this later.
    • I can try to write a simple workaround by controlling parallelism in the factory: when one reader is dropped, a waiter can get one; in arrow2, each column read buffer is allocated after getting a reader
      • if that does not work for you, I can try to write another version of read_columns_many_async
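The factory idea ("one reader dropped, then a waiter can get one") can be sketched as a counting semaphore whose permit is released on Drop. This is an illustrative std-only sketch; `ReaderFactory` and `Permit` are invented names for this example, not parquet2 or Databend APIs:

```rust
use std::sync::{Arc, Condvar, Mutex};

// A counting semaphore: at most `max` readers may exist at once.
struct ReaderFactory {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

// RAII permit: dropping it returns capacity to the factory.
struct Permit {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl ReaderFactory {
    fn new(max: usize) -> Self {
        ReaderFactory {
            inner: Arc::new((Mutex::new(max), Condvar::new())),
        }
    }

    // Blocks until a permit is free, then hands out a "reader" permit.
    fn acquire(&self) -> Permit {
        let (lock, cvar) = &*self.inner;
        let mut available = lock.lock().unwrap();
        while *available == 0 {
            available = cvar.wait(available).unwrap();
        }
        *available -= 1;
        Permit { inner: Arc::clone(&self.inner) }
    }
}

impl Drop for Permit {
    // One reader dropped => exactly one waiter can get a permit.
    fn drop(&mut self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() += 1;
        cvar.notify_one();
    }
}

fn main() {
    let factory = Arc::new(ReaderFactory::new(2));
    let mut handles = vec![];
    for i in 0..4 {
        let f = Arc::clone(&factory);
        handles.push(std::thread::spawn(move || {
            let _reader = f.acquire(); // at most 2 "column reads" at once
            println!("reading column {}", i);
            // permit released here when `_reader` goes out of scope
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
}
```

The RAII permit mirrors the proposal: the number of live readers never exceeds the configured limit, and the column read buffer would only be allocated after a permit is obtained.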

@dantengsky
Member

Please correct me if I misunderstand the question.

The first question is: do we need this level of parallel IO beyond parallel_read_threads?

Does it mean: should we allow simultaneous reading of several columns (while blocks are already being read in parallel)?

imo, yes.

Before read_columns_many_async was introduced, the columns were also read in parallel, or rather the async tasks of reading columns were buffered to the runtime (by using stream::buffered(n), if I recall correctly).

But there was a limit (a hardcoded constant) on the number of tasks that could be buffered, so the max number of FDs open simultaneously did not depend on the number of columns of the table being scanned.

i.e. instead of

max_threads * parallel_read_columns * number_of_columns, where number_of_columns might be quite large and varies from table to table, the bound is

max_threads * parallel_read_columns * N, where N is a configured constant.

need another config item, parallel_read_columns ?

It used to be a hardcoded constant, 10. Not sure if we should have both parallel_read_threads and parallel_read_columns; maybe a single config item should rule it.

I can try to write a simple workaround by controlling parallelism in the factory: when one reader is dropped, a waiter can get one; in arrow2, each column read buffer is allocated after getting a reader

I'm a little bit lost here: does it mean the factory will not produce another reader till the previous one is dropped? That seems not so good, if that's what it means.

I think parquet2 may improve this later.

yeah, maybe we should open a feature request issue :D

@youngsofun
Member

youngsofun commented Mar 1, 2022

@dantengsky

Before read_columns_many_async was introduced, the columns were also read in parallel, or rather the async tasks of reading columns were buffered to the runtime (by using stream::buffered(n), if I recall correctly).

Maybe I missed something, but as far as I can see from the code before read_columns_many_async was introduced, the read was async but not parallel (the chunk decoding may be in parallel), because we had only one reader, which cannot be read from simultaneously (that is why the parquet API changed), and I didn't find code like stream::buffered(n); maybe it was deleted?

I'm a little bit lost here: does it mean the factory will not produce another reader till the previous one is dropped? That seems not so good, if that's what it means.

It didn't seem like a good idea to me at first, but it works well with the new parquet2 API.
I currently can't think of a concrete problem with it.
The current PR uses this approach.

@dantengsky
Member

Maybe I missed something, but as far as I can see from the code before read_columns_many_async was introduced, the read was async but not parallel (the chunk decoding may be in parallel), because we had only one reader, which cannot be read from simultaneously (that is why the parquet API changed), and I didn't find code like stream::buffered(n); maybe it was deleted?

Yes, it was deleted several commits ago. Here is one of the versions that used buffered(n):

https://github.com/datafuselabs/databend/blob/dc058c9d22baa9e61763661f77cd10ec62c87c48/query/src/storages/fuse/io/block_reader.rs#L117-L119

It didn't seem like a good idea to me at first, but it works well with the new parquet2 API. I currently can't think of a concrete problem with it. The current PR uses this approach.

got it. let's refine it later if necessary.

@Xuanwo Xuanwo moved this from 📋 Backlog to 📦 Done in Xuanwo's Work Mar 2, 2022