# tidyVCF

`tidyVCF` is a small tool to convert VCF files to tidy tab- or
comma-separated tables, ideal for downstream analysis with R's `tidyverse`
or Julia's `DataFrames` ecosystems. `tidyVCF` is written in pure Rust,
relying on the `noodles-vcf` crate written by
[@zaeleus](https://github.com/zaeleus) and contributors.

## Install 

### Cargo

```
cargo install tidyvcf
```

### Pre-built binaries

TBD.

# Usage

## Basic usage

Output is TSV by default; pass `-c` for CSV:

```
tidyvcf -i test.vcf -c -o test.csv
```

When `-i` and `-o` are omitted, `tidyvcf` reads from standard input and
writes to standard output, so pipes can deal with compression:

```
zcat test.vcf.gz | tidyvcf | gzip > test.tsv.gz
```

## Multiple samples: stacked or cartesian

It is common to perform variant calling on several related samples
together, which yields VCFs with multiple sets of 'genotype' or
`FORMAT` fields, one for each sample. By default, `tidyvcf` joins
sample names to the names of the format fields with an underscore
(`_`), producing columns such as `S1_GT S1_DP S2_GT S2_DP...`.

The `-j`/`--sample-delim` option changes the delimiter between sample names and format field names:

```
tidyvcf -i test.vcf -j '~' -o test.tsv
```
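The naming scheme for these wide columns can be sketched in a few lines of Python. This is only an illustration of the layout described above, not `tidyvcf`'s actual implementation; the function name `wide_columns` is hypothetical.

```python
def wide_columns(samples, format_fields, delim="_"):
    """Build one column name per (sample, FORMAT field) pair.

    Illustrative only: mirrors the documented naming scheme,
    not tidyvcf's internal code.
    """
    return [
        f"{sample}{delim}{field}"
        for sample in samples
        for field in format_fields
    ]


# Default delimiter, as in the README example.
print(wide_columns(["S1", "S2"], ["GT", "DP"]))
# Custom delimiter, corresponding to `-j '~'`.
print(wide_columns(["S1", "S2"], ["GT", "DP"], delim="~"))
```

Choosing a delimiter that never occurs in sample or field names makes it easy to split the column names apart again downstream.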

This wide layout, with one set of columns per sample, violates the [tidy
data](https://r4ds.had.co.nz/tidy-data.html) principle. To avoid this,
samples can be stacked into rows, at the cost of repeating the static
and `INFO` columns for each sample.

Stacking samples:

```
tidyvcf -i test.vcf --stack -o test_stacked.tsv
```
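To make the trade-off concrete, here is a minimal Python sketch of what stacking does to a single record. It is an illustration under assumptions, not `tidyvcf`'s code: the function `stack_record` and the `sample` column name are hypothetical, and the real output may name or order columns differently.

```python
def stack_record(static, samples):
    """Turn one wide record into one row per sample.

    static  -- site-level and INFO columns shared by every sample
    samples -- mapping of sample name to that sample's FORMAT values
    """
    rows = []
    for name, fmt in samples.items():
        row = dict(static)      # static and INFO columns are repeated per sample
        row["sample"] = name    # hypothetical name for the sample column
        row.update(fmt)         # FORMAT fields become ordinary columns
        rows.append(row)
    return rows


static = {"CHROM": "1", "POS": 100, "info_DP": 30}
samples = {
    "S1": {"GT": "0/1", "DP": 12},
    "S2": {"GT": "1/1", "DP": 18},
}
for row in stack_record(static, samples):
    print(row)
```

Each sample contributes one row, so a record with two samples yields two rows that share the same `CHROM`, `POS` and `info_DP` values.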

## Info prefix

To avoid clashes in field names between `INFO` and `FORMAT` columns,
`INFO` field names are prefixed with the string `info_` by default.
This behaviour can be adjusted with the `-p`/`--info-prefix` option:

```
tidyvcf -i test.vcf -p 'i' -c -o test.csv
```
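The clash the prefix guards against is easy to reproduce: `DP`, for example, is commonly defined both as an `INFO` field and as a `FORMAT` field. The Python sketch below is illustrative only; `prefix_info` is a hypothetical helper, and it assumes the prefix is prepended verbatim to each `INFO` key.

```python
def prefix_info(info, prefix="info_"):
    """Return a copy of the INFO mapping with every key prefixed."""
    return {f"{prefix}{key}": value for key, value in info.items()}


info = {"DP": 30, "AF": 0.5}       # site-level INFO values
fmt = {"GT": "0/1", "DP": 12}      # one sample's FORMAT values

# Without a prefix the FORMAT DP silently overwrites the INFO DP.
clashing = {**info, **fmt}
# With the prefix both values survive in the same row.
safe = {**prefix_info(info), **fmt}

print(clashing)
print(safe)
```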

## VEP `CSQ` INFO field splitting

If your VCF is annotated with Ensembl's Variant Effect Predictor, you
can use the `-v` option to extract those fields into individual
columns:

```
tidyvcf -i vep.vcf.gz -v -o vep.tsv
```

*Note*: Only the first annotated transcript for a record is split; the
others are bundled unsplit into an additional column named
`CSQ_other_transcripts`.
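
VEP encodes `CSQ` as a comma-separated list of per-transcript entries, each a `|`-separated list of fields whose names are given in the `Format:` part of the header's `INFO` description. The Python sketch below illustrates the split described in the note. It is not `tidyvcf`'s implementation: `split_csq` is a hypothetical helper, the naming of the split columns is an assumption, and only `CSQ_other_transcripts` is taken from the note above.

```python
def split_csq(csq_value, csq_fields):
    """Split the first transcript of a CSQ value into named fields.

    csq_value  -- raw CSQ string from the INFO column
    csq_fields -- field names taken from the VEP header's "Format:" string
    """
    # Transcripts are comma-separated; keep the first, bundle the rest.
    first, _, others = csq_value.partition(",")
    # Fields within a transcript are pipe-separated.
    row = dict(zip(csq_fields, first.split("|")))
    row["CSQ_other_transcripts"] = others
    return row


fields = ["Allele", "Consequence", "SYMBOL"]
value = "A|missense_variant|BRCA2,A|intron_variant|BRCA2"
print(split_csq(value, fields))
```

If only the first transcript matters for an analysis, the bundled column can simply be dropped; otherwise it can be split again downstream using the same two delimiters.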

# Comparison with other software

| Feature                                | `tidyVCF`  | `rbt vcf-to-txt`                          | `bcftools query -f` | `gatk VariantsToTable` |
|----------------------------------------|------------|-------------------------------------------|--------------------|------------------------|
| include all fields                     | by default | manually specified; currently no `FILTER` | manually specified | manually specified     |
| long format                            | `--stack`  | ❌                                        | ❌                 | ❌                     |
| pipeable                               | ✓          | ✓                                         | ✓                  | ❌                     |
| compressed input without external tool | ✓          | ❌                                        | ✓                  | ?                      |
| bcf input                              | ❌         | ❌                                        | ✓                  | ?                      |
