# tidyVCF

`tidyVCF` is a small tool that converts VCF files to tidy tab- or
comma-separated tables, ideal for downstream analysis with R's `tidyverse`
or Julia's `DataFrames` ecosystems. All fields are included by
default, keeping the command line simple. `tidyVCF` is written in pure
Rust, relying on the excellent `noodles-vcf` crate developed by
[@zaeleus](https://github.com/zaeleus) and contributors.

**Note**: *The tool works for me, but isn't ready for production use
yet: it's built on a fairly experimental API, it lacks proper
testing, and it's quite brittle, handling few of the various species
of wild VCF and erroring gracelessly at the most minor of spec
violations*.

## Install 

### Cargo

```
cargo install tidyvcf
```

### Pre-built binaries

TBD.

# Usage

## Basic usage

Use `-c`/`--csv` for CSV output; the default is TSV:

```
tidyvcf -i test.vcf -c -o test.csv
```

BGZF compressed VCFs are detected by file extension and handled
automatically:

```
tidyvcf -i test.vcf.gz -o test.tsv
```

If dealing with compressed data from `stdin`, use the `--bgzip` flag:

```
cat test.vcf.gz | tidyvcf --bgzip -o test.tsv
```
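Extension-based detection cannot work on a stream, which is why the flag is needed there. If you are unsure whether an unlabelled file is compressed before piping it in, a quick check of its first two bytes can help. The following is a minimal sketch that does not involve `tidyvcf` itself; the `is_gzip` helper and the file names are made up for illustration. It relies on the fact that BGZF files are valid gzip files and therefore start with the gzip magic bytes `1f 8b`.

```shell
# Succeeds if the file starts with the gzip magic bytes 1f 8b.
# (Hypothetical helper, not part of tidyvcf.)
is_gzip() {
  magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b" ]
}

# Demo on two throwaway files.
tmp=$(mktemp -d)
printf '\037\213\010\004' > "$tmp/fake.gz"        # gzip-style header bytes only
printf '##fileformat=VCFv4.3\n' > "$tmp/plain.vcf"

is_gzip "$tmp/fake.gz"   && gz_result=compressed  || gz_result=plain
is_gzip "$tmp/plain.vcf" && txt_result=compressed || txt_result=plain
echo "$gz_result $txt_result"
rm -rf "$tmp"
```

Note that this only tells gzip-family data from plain text; it does not distinguish BGZF from ordinary gzip.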

## Multiple samples: stacked or cartesian

It is common to perform variant calling on several related samples
together, which yields VCFs with multiple sets of 'genotype' or
`FORMAT` fields, one for each sample. By default, `tidyvcf` joins
sample names to the names of the format fields with the underscore
('_') character - `S1_GT S1_DP S2_GT S2_DP...`.

The `--format-delim` option allows changing the sample-format field delimiter:

```
tidyvcf -i test.vcf --format-delim '~' -o test.tsv
```

Spreading each sample's `FORMAT` fields across separate columns
violates the [tidy data](https://r4ds.had.co.nz/tidy-data.html)
principle. To avoid this, samples can be stacked into rows, at the cost
of repeating the static and `INFO` columns for each sample.

Stacking samples:

```
tidyvcf -i test.vcf --stack -o test_stacked.tsv
```
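To make the difference between the two layouts concrete, here is a rough illustration using plain `awk` on a hand-written table. It does not call `tidyvcf`; the `sample` column name and the column order of the stacked table are assumptions for illustration, not the tool's documented output.

```shell
# Hand-written wide table (not real tidyvcf output): one row per variant,
# FORMAT fields repeated per sample as <sample>_<field>.
wide=$(printf 'chrom\tpos\tS1_GT\tS1_DP\tS2_GT\tS2_DP\n1\t100\t0/1\t12\t1/1\t30\n')

# Reshape into a stacked layout: one row per variant and sample.
# The "sample" column name is hypothetical.
stacked=$(printf '%s\n' "$wide" | awk -F'\t' -v OFS='\t' '
  NR == 1 { print "chrom", "pos", "sample", "GT", "DP"; next }
  {
    print $1, $2, "S1", $3, $4
    print $1, $2, "S2", $5, $6
  }')

printf '%s\n' "$stacked"
```

The static columns (`chrom`, `pos`) are repeated on every sample row, which is the cost of stacking mentioned above; in exchange, `GT` and `DP` each become a single column that can be filtered or grouped by sample.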

## Info prefix

To avoid clashes in field names between `INFO` and `FORMAT` columns,
`INFO` field names are prefixed with the string "info_" by default -
this behaviour can be adjusted with the `--info-prefix` option:

```
tidyvcf -i test.vcf --info-prefix 'i' -c -o test.csv
```

## VEP `CSQ` INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you
can use the `-v`/`--vep-fields` flag to extract those fields into individual
columns:

```
tidyvcf -i vep.vcf.gz --vep-fields -o vep.tsv
```

By default, the output VEP column names are prefixed with "vep_" to
avoid name collisions (for example `CSQ/VAF` and `FMT/VAF`) - this
string can be customised with the `--vep-prefix` option:

```
tidyvcf -i vep.vcf.gz --vep-fields --vep-prefix '.' -o vep.tsv
```

*Note*: Only the first annotated transcript for a record is split; the
others are bundled unsplit into an additional column named
`CSQ_other_transcripts`.
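As a rough sketch of what this splitting means, the snippet below takes a made-up `CSQ` value and separates it by hand in plain shell. VEP writes one annotation per transcript, separated by commas, with `|`-separated subfields in the order given by the `Format:` string in the VCF header. The three-subfield format, the example values, and the `vep_`-prefixed output names are assumptions for illustration; this is not `tidyvcf`'s implementation.

```shell
# Hypothetical CSQ value with two transcripts, assuming a header that declares
# Format: Allele|Consequence|SYMBOL
csq='T|missense_variant|BRCA2,T|intron_variant|ZAR1L'

first=${csq%%,*}              # annotation for the first transcript
case $csq in
  *,*) others=${csq#*,} ;;    # remaining transcripts, left unsplit
  *)   others='' ;;
esac

# Split the first transcript on "|" into individual values.
IFS='|' read -r allele consequence symbol <<EOF
$first
EOF

printf 'vep_Allele=%s\tvep_Consequence=%s\tvep_SYMBOL=%s\tCSQ_other_transcripts=%s\n' \
  "$allele" "$consequence" "$symbol" "$others"
```

If you need every transcript split into its own row, that reshaping has to happen downstream, starting from the unsplit column.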

# Comparison with other software

| Feature                                | `tidyVCF`  | `rbt vcf-to-txt`                              | `bcftools query -f`    | `gatk VariantsToTable` |
|----------------------------------------|------------|-----------------------------------------------|------------------------|------------------------|
| include all fields                     | by default | individually specified; currently no `FILTER` | individually specified | individually specified |
| include a subset of fields             | ❌         | individually specified; currently no `FILTER` | individually specified | individually specified |
| long format                            | `--stack`  | ❌                                            | ❌                     | ❌                     |
| pipeable                               | ✓          | ✓                                             | ✓                      | ❌                     |
| compressed input without external tool | ✓          | ❌                                            | ✓                      | ?                      |
| bcf input                              | ❌         | ❌                                            | ✓                      | ?                      |
