Read comma-separated or delimiter-separated text files
## Overview The CSV data source reads comma-separated, tab-separated, and other delimiter-separated text files. It is the most flexible text-format data source, supporting custom delimiters, quoting rules, character encodings, null value handling, comment lines, row skipping, and whitespace trimming. CSV files are registered as external tables using the `CREATE EXTERNAL TABLE` statement with `USING CSV`. DeltaForge reads multiple CSV files in parallel, distributing work across the configured number of reader threads. Each thread processes one file at a time, parsing rows into batches. For very large single files, splitting source data across multiple files is recommended for maximum parallelism. The `file_metadata` option injects system columns such as `df_file_name`, `df_row_number`, `df_file_modified`, and `df_dataset` to track the origin of each row. ## Usage Create an external table pointing to one or more CSV files. The LOCATION can be a single file path or a glob pattern. All option values are strings, even for numeric or boolean settings. ```sql -- Create a zone and schema for CSV data CREATE ZONE IF NOT EXISTS analytics TYPE EXTERNAL; CREATE SCHEMA IF NOT EXISTS analytics.csv_demos; -- Register CSV files as an external table CREATE EXTERNAL TABLE IF NOT EXISTS analytics.csv_demos.sales USING CSV LOCATION '/data/sales*.csv' OPTIONS ( header = 'true', file_metadata = '{"columns":["df_file_name","df_file_modified","df_dataset","df_row_number"]}' ); -- CREATE EXTERNAL TABLE auto-runs schema discovery, no separate DETECT SCHEMA needed -- Grant access GRANT ADMIN ON TABLE analytics.csv_demos.sales TO USER analyst; -- Query the table SELECT * FROM analytics.csv_demos.sales; ``` Pipe-delimited files with custom null handling and comment lines: ```sql CREATE EXTERNAL TABLE IF NOT EXISTS analytics.csv_demos.legacy_export USING CSV LOCATION '/data/legacy/' OPTIONS ( delimiter = '|', null_value = 'N/A', comment_char = '#', skip_starting_rows = '3', trim_whitespace = 'true', has_header = 'true', max_rows = '5' ); ``` ## Schema Detection CSV is not a self-describing format. DeltaForge infers column types by sampling the first N rows (controlled by `infer_schema_rows`, default 1000). Supported inferred types include integer, float, boolean, date, timestamp, and string. For files with inconsistent types across rows, increase the sample size for more accurate results. CREATE EXTERNAL TABLE auto-runs schema discovery as part of registration, so column metadata appears in the catalog immediately and `information_schema.columns` is queryable right away. Re-run `DETECT SCHEMA FOR TABLE zone.schema.table` only when the underlying files change shape (new columns, type drift, evolved headers). ## Schema Evolution When reading multiple CSV files with different column sets, DeltaForge unifies all column schemas into a single combined schema. Columns that exist in some files but not others appear as NULL for rows from files that lack those columns. This allows incremental addition of new columns over time without breaking existing queries. ## Key Options - **delimiter**: Set the field separator character (default ','). Use '|' for pipe-delimited, or any single character. - **has_header**: Controls whether the first row is treated as column headers (default 'true'). - **null_value**: Define which string values are interpreted as SQL NULL (e.g., 'N/A'). - **comment_char**: Skip lines starting with the specified character (e.g., '#'). - **skip_starting_rows**: Skip a fixed number of rows at the start of each file before reading headers or data. - **trim_whitespace**: Strip leading and trailing whitespace from field values. - **file_metadata**: Inject system columns (df_file_name, df_row_number, df_file_modified, df_dataset) for provenance tracking. - **max_rows**: Limit the total number of rows read, useful for sampling large datasets.