Transform During Load
Summary
Support transform when copy from stage into table.
Motivation
Currently, we support stage table function, so we can transform when insert from stage into table.
insert into table1 from (select c1, c2 from @stage1/path/to/dir);
but this still missing one feature of COPY INTO
: incremental load (remember files already loaded).
COPY INTO <table>
only support FROM <STAGE>
now.
Guide-level explanation
support copy into like this:
COPY into table1 from (select c1, c2 from @stage1/path/to/dir where c3 > 1);
limitation to the subquery part:
- contain only one base table.
- aggregation is not allowed, because each record inserted should belong to some file.
- due to the prev 2 limits, it is reasonable to state that nested subquery is not need, we may choose to forbidden it.
auto cast
the output expr of subquery should be cast to schema of table if it satisfies the auto_cast_rule.
some cast can be optimized out. e.g.:
for COPY into table1 from (select c1::string from @stage1/path/to/dir);
where c1 is int,
we can parse c1 as string directly.
if dest column for c1 is int too, '::string' can be dropped.
optimize
projection and filter:
- for file formats like parquet, column pruning and filter pushdown can work well.
- for other formats like CSV/TSV/NDJSON, there are chance to quickly parse though the unused columns, for example for TSV,
we can jump to next
\n
directly after the last field in projection is parsed.
Reference-level explanation
COPY into <table> from (select <expr> from @<stage>/<path>(<options>) <alias> where <expr>);
Rationale and alternatives
some other features can be used for "Transform During Load", but only in limited situations:
auto cast
if the source schema can auto cast to dest schema, no explicit transform in SQL should be required. now, some 'cast' is implemented by the parser, as long as use provide the dest type like in RFC: Stage With Schema e.g. Timestamp Type decoder may accept both number and string.
we should improve this later. before that, user can use the way in this RFC to complete the same function.
FileFormatOptions
for example, some Database has an option TRIM
for CSV file format. but user may need diff variant of it
- trim heading or tailing space? or both?
- inside quote or outside quote?
- apply to some column? whitespaces may be meaningful for some column.