Update documentation for COPY command #9931

Merged: 4 commits, Apr 7, 2024
Changes from 1 commit
39 changes: 33 additions & 6 deletions docs/source/user-guide/sql/dml.md
@@ -35,8 +35,22 @@ TO '<i><b>file_name</i></b>'
[ OPTIONS( <i><b>option</i></b> [, ... ] ) ]
</pre>

`STORED AS` specifies the file format the `COPY` command will write. If this
clause is not specified, it will be inferred from the file extension if possible.

Review comment (Contributor Author): Ported / reworded this content from the write options page.

`PARTITIONED BY` specifies the columns to use for partitioning the output files into
separate hive-style directories.

The output format is determined by the first match of the following rules:

1. Value of `STORED AS`
2. Value of the `OPTIONS (FORMAT ..)`
3. Filename extension (e.g. `foo.parquet` implies `PARQUET` format)

For a detailed list of valid OPTIONS, see [Write Options](write_options).
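
The first-match rules above can be sketched in a few lines of Python. This is only an illustration of the documented precedence, not the engine's implementation: the function name `resolve_output_format` and the `EXTENSION_FORMATS` map are hypothetical and exist only for this example.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical extension map for illustration; the formats actually
# supported are defined by the engine, not by this sketch.
EXTENSION_FORMATS = {".parquet": "PARQUET", ".csv": "CSV", ".json": "JSON"}


def resolve_output_format(
    file_name: str,
    stored_as: Optional[str] = None,
    format_option: Optional[str] = None,
) -> str:
    """Return the output format using the first matching rule."""
    # Rule 1: an explicit STORED AS clause wins.
    if stored_as:
        return stored_as.upper()
    # Rule 2: otherwise use the FORMAT entry from OPTIONS, if present.
    if format_option:
        return format_option.upper()
    # Rule 3: otherwise fall back to the file name extension.
    suffix = PurePosixPath(file_name).suffix.lower()
    if suffix in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[suffix]
    raise ValueError(f"cannot infer output format for {file_name!r}")


print(resolve_output_format("foo.parquet"))                      # PARQUET
print(resolve_output_format("foo.csv", stored_as="parquet"))     # PARQUET
print(resolve_output_format("dir_name/", format_option="json"))  # JSON
```

In the second call, `STORED AS` overrides the `.csv` extension, which is the case the ordering of the rules is meant to settle.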

### Examples

Copy the contents of `source_table` to `file_name.json` in JSON format:

```sql
@@ -72,6 +86,23 @@ of hive-style partitioned parquet files:
+-------+
```

If the data contains values of `x` and `y` in `column1` and only `a` in
`column2`, output files will appear in the following directory structure:

```
dir_name/
  column1=x/
    column2=a/
      <file>.parquet
      <file>.parquet
      ...
  column1=y/
    column2=a/
      <file>.parquet
      <file>.parquet
      ...
```
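
As a rough illustration of how the directory layout above follows from the data, the Python sketch below builds one `column=value` path segment per partition column. The helper name `partition_directories` is hypothetical, and details such as value escaping or output file naming are not modeled.

```python
from typing import Dict, List, Sequence


def partition_directories(
    base_dir: str,
    partition_columns: Sequence[str],
    rows: Sequence[Dict[str, str]],
) -> List[str]:
    """Return the distinct hive-style directories the rows would map to."""
    directories = set()
    for row in rows:
        # One "column=value" segment per partition column, in declared order.
        segments = [f"{column}={row[column]}" for column in partition_columns]
        directories.add("/".join([base_dir.rstrip("/"), *segments]))
    return sorted(directories)


rows = [
    {"column1": "x", "column2": "a"},
    {"column1": "y", "column2": "a"},
    {"column1": "x", "column2": "a"},
]
for directory in partition_directories("dir_name/", ["column1", "column2"], rows):
    print(directory)
# dir_name/column1=x/column2=a
# dir_name/column1=y/column2=a
```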

Run the query `SELECT * from source ORDER BY time` and write the
results (maintaining the order) to a parquet file named
`output.parquet` with a maximum parquet row group size of 10MB:
@@ -85,14 +116,10 @@ results (maintaining the order) to a parquet file named
+-------+
```

The output format is determined by the first match of the following rules:

1. Value of `STORED AS`
2. Value of the `OPTIONS (FORMAT ..)`
3. Filename extension (e.g. `foo.parquet` implies `PARQUET` format)

## INSERT

### Examples

Insert values into a table.

<pre>
36 changes: 14 additions & 22 deletions docs/source/user-guide/sql/write_options.md
@@ -35,15 +35,15 @@ If inserting to an external table, table specific write options can be specified

```sql
CREATE EXTERNAL TABLE
my_table(a bigint, b bigint)
STORED AS csv
COMPRESSION TYPE gzip
WITH HEADER ROW
DELIMITER ';'
LOCATION '/test/location/my_csv_table/'
OPTIONS(
NULL_VALUE 'NAN'
);
  my_table(a bigint, b bigint)
  STORED AS csv
  COMPRESSION TYPE gzip
  WITH HEADER ROW
  DELIMITER ';'
  LOCATION '/test/location/my_csv_table/'
  OPTIONS(
    NULL_VALUE 'NAN'
  )
```

Review comment (Contributor Author): I found the existing formatting hard to read, so I added some whitespace.

When running `INSERT INTO my_table ...`, the options from the `CREATE TABLE` will be respected (gzip compression, special delimiter, and header row included). There will be a single output file if the output path is not in folder format, i.e. it does not end with a `/`. Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. `NULL_VALUE` is a CSV-specific option that determines how null values are encoded within the CSV file.
@@ -53,26 +53,18 @@ Finally, options can be passed when running a `COPY` command.
```sql
COPY source_table
TO 'test/table_with_options'
(format parquet,
compression snappy,
'compression::col1' 'zstd(5)',
partition_by 'column3, column4'
OPTIONS (
  format parquet,
  compression snappy,
  'compression::col1' 'zstd(5)',
  partition_by 'column3, column4'
)
```

Review comment (Contributor): Is this the correct format? Based on #9927, `partition_by` has moved to the DML and it should be something like: `COPY t1 TO '/tmp/hive_output/' PARTITIONED BY (col1) OPTIONS (format parquet);`

Reply (Contributor Author): That is an excellent point -- I fixed it in af55db8 (I also tested that it works locally):

    ❯ create table source_table as values ('1','2','3','4');
    0 row(s) fetched.
    Elapsed 0.021 seconds.

    ❯ COPY source_table
      TO 'test/table_with_options'
      PARTITIONED BY (column3, column4)
      OPTIONS (
        format parquet,
        compression snappy,
        'compression::column1' 'zstd(5)',
      )
    ;
    +-------+
    | count |
    +-------+
    | 1     |
    +-------+

In this example, we write the entirety of `source_table` out to a folder of parquet files. One parquet file will be written in parallel to the folder for each partition in the query. The `compression` option, set to `snappy`, indicates that all columns should use the snappy compression codec unless otherwise specified. The option `compression::col1` sets an override, so that the column `col1` in the parquet file will use the `ZSTD` compression codec with compression level `5`. In general, parquet options that support column-specific settings can be specified with the syntax `OPTION::COLUMN.NESTED.PATH`.
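
The override behaviour described above can be modeled with a small Python sketch: a column-specific key of the form `OPTION::COLUMN.NESTED.PATH` wins over the plain `OPTION` key for that column. The helper names `split_option_key` and `effective_setting` are hypothetical and only illustrate the lookup order, not how the engine parses options.

```python
from typing import Dict, Optional, Tuple


def split_option_key(key: str) -> Tuple[str, Optional[str]]:
    """Split 'option::column.path' into (option, column path or None)."""
    name, separator, column_path = key.partition("::")
    return name, (column_path if separator else None)


def effective_setting(
    options: Dict[str, str], option_name: str, column_path: str
) -> Optional[str]:
    """Prefer the column-specific value, else fall back to the global one."""
    specific = options.get(f"{option_name}::{column_path}")
    if specific is not None:
        return specific
    return options.get(option_name)


options = {
    "format": "parquet",
    "compression": "snappy",
    "compression::col1": "zstd(5)",
}
print(split_option_key("compression::col1"))               # ('compression', 'col1')
print(effective_setting(options, "compression", "col1"))   # zstd(5)
print(effective_setting(options, "compression", "col2"))   # snappy
```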

## Available Options

### COPY Specific Options

The following special options are specific to the `COPY` command.

Review comment (Contributor Author): These options are now specified directly in the DML syntax itself, so I removed them from here.

| Option | Description | Default Value |
| ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| FORMAT | Specifies the file format COPY query will write out. If there is more than one output file or the format cannot be inferred from the file extension, then FORMAT must be specified. | N/A |
| PARTITION_BY | Specifies the columns that the output files should be partitioned by into separate hive-style directories. Value should be a comma separated string literal, e.g. 'col1,col2' | N/A |

### JSON Format Specific Options

The following options are available when writing JSON files. Note: If any unsupported option is specified, an error will be raised and the query will fail.