Initial commit

Commit 1ef49895 authored 4 weeks ago by Andres Veidenberg
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
+*This file only includes changes that are relevant to people running a pheweb site.*
+
+## 1.3.15
+- Fixes `pheweb cluster`.
+
+## 1.3.14
+- Fixes uppercase `field_aliases` in `config.py`.  Column names are case-insensitive now.
+
+## 1.3.13
+- Speeds up autocomplete
+
+## 1.3.12
+- Adds beta/sebeta columns to the tables on /pheno/ and /variant/
+- Shows AF range or MAF range better on /variant/
+- Shows pvalue=0 as p<1e-320 in most places.
+- Improves error-handling on /pheno-filter/
+- Upgrades to LocusZoom.js 0.13, including new PNG downloads
+- Fixes bugs in OAuth and WSGI
+- Uses relative redirects, so that http vs https and hostname don't matter, except in OAuth code.
+
+## 1.3.9
+- Improves hovering on the filtered manhattan plots
+- Includes code for annotating with VEP
+- Shows category on /top_hits
+
+**Changes needed to data:**
+
+- Run `rm generated-by-pheweb/top_hits.json; pheweb top-hits`
+
+## 1.3.7
+- Uses gencode v37 (released 2021-Feb)
+- Shows GClambda and num_samples/num_cases/num_controls and num_loci<5e-8 on /phenotypes
+- Supports custom_templates/ again
+
+**Changes needed to data:**
+
+- Run `rm generated-by-pheweb/sites/sites.tsv && pheweb process` (because gene names must agree between autocompletion and the pre-processed data)
+
+## 1.3.6
+- Speeds up `pheweb gather-pvalues-for-each-gene` ~2x by avoiding reading any variant twice.  (Thanks to finngen for this suggestion.)
+- Allows live-filtering a manhattan plot by MAF or snp/indel, with instructions in README.
+- Verifies that `num_cases + num_controls == num_samples` in `pheweb phenolist verify` (which is included in `pheweb process`).
+
+## 1.3.5
+- Removes dependence on `pandas` (because it wouldn't install on my laptop)
+
+## 1.3.4
+- Allows setting `loading_nice = True`.
+- Allows setting `field_aliases` again.
+- Reduces memory usage by `pheweb qq` by ~10x by switching to `numpy` and `pandas`.
+- Fixes the bug where `pheweb matrix` breaks when `matrix.tsv.gz` is up-to-date.
+
+## 1.3.0
+- Rewrites configuration management, losing the ability to customize `extra_per_*_fields` and `null_values` and `field_aliases`.
+- Fixes bug where config wasn't passed to child processes when using `PHEWEB_DATADIR` or `pheweb conf key=value <subcommand>`.
+
+Bugs:
+
+- `pheweb matrix` breaks when `matrix.tsv.gz` is already up-to-date.
+
+## 1.2.5
+- Makes sure that `pheno_gz/<phenocode>.gz.tbi` gets created, and re-runs traits that don't have it.
+
+## 1.2.3
+- Uses dbSNP v154 (the latest!) with way more rsids.  To use them, run `rm generated-by-pheweb/sites/sites-rsids.tsv && pheweb process`.
+
+## 1.2.1
+- Allows hg38 via `hg_build_number=38`
+- Downloads resources from <https://resources.pheweb.org> instead of processing raw data from EBI, dbSNP, etc.
+- Replaces marisa-trie with sqlite3 to remove a flaky dependency and improve the order of autocomplete suggestions.
+- Replaces more json files with sqlite3 to handle large datasets better.
+- Compresses all internal files with `gzip -2` to save storage and IO.
+- Gets rid of `generated-by-pheweb/pheno/`, relying on `generated-by-pheweb/pheno_gz/` instead.
+- Allows `chr1`-`chr25` in input files.
+
+**Changes needed to data:**
+
+- Run `pheweb download-genes`
+- Run `pheweb make-gene-aliases-sqlite3`
+- Run `rm generated-by-pheweb/phenotypes.json; pheweb phenotypes`
+- Run `pheweb gather-pvalues-for-each-gene`
+
+## 1.2.0 (broken)
+Bugs:
+
+- `pheweb matrix` fails to match filenames to columns.
+
+## 1.1.28
+- Allows selecting which phenotypes to run in most steps via `pheweb <subcommand> --phenos=5-10`.
+- Adds `pheweb cluster --step=<subcommand>`.
--- a/Dockerfile
+++ b/Dockerfile
+FROM ubuntu:22.04
+
+RUN apt-get update
+RUN apt-get install -y python3
+
+RUN groupadd -g 568 apps \
+  && useradd -m -d /app -s /bin/bash -u 568 -g 568 apps \
+  && apt-get install -y python3-pip python3-dev python3-scipy python3-venv libz-dev libffi-dev
+
+USER apps
+
+ENV PATH=$PATH:/app/.local/bin
+
+RUN python3 -m pip install wheel cython
+RUN python3 -m pip install pheweb
+RUN python3 -m pip install markupsafe==2.0.1
+
+WORKDIR /data
+
+EXPOSE 5000
+
+CMD pheweb serve
\ No newline at end of file
--- a/LICENSE
+++ b/LICENSE
+             Copyright 2023 Regents of the University of Michigan
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
\ No newline at end of file
--- a/MANIFEST.in
+++ b/MANIFEST.in
+recursive-include pheweb/serve/static *
+recursive-include pheweb/serve/templates *
+recursive-include pheweb *.py
+include pheweb/load/cffi/*.cpp
--- a/README.md
+++ b/README.md
+For a list of available instances of PheWeb, navigate [here](http://pheweb.sph.umich.edu).
+For a walk-through demo see [here](etc/demo.md#demo-navigating-pheweb).
+If you have questions or comments, check out our [Google Group](https://groups.google.com/g/pheweb-umich).
+
+![screenshot of PheWAS plot](https://cloud.githubusercontent.com/assets/862089/25474725/3edbe256-2b02-11e7-8abb-0ca26d406b11.png)
+
+# How to Cite PheWeb
+If you use the PheWeb code base for your work, please cite our paper:
+
+Gagliano Taliun, S.A., VandeHaar, P. et al. Exploring and visualizing large-scale genetic associations by using PheWeb. *Nat Genet* 52, 550–552 (2020).
+
+# How to Build a PheWeb for your Data
+
+If this is broken, [open an issue on github](https://github.com/statgen/pheweb/issues/new) and hopefully I can help.
+
+### 1. Install PheWeb
+
+```bash
+pip3 install pheweb
+```
+
+- If that doesn't work, follow [the detailed install instructions](etc/detailed-install-instructions.md#detailed-install-instructions).
+
+### 2. Create a directory and `config.py` for your new dataset
+
+```
+mkdir ~/my-new-pheweb && cd ~/my-new-pheweb
+```
+
+This directory will store all the files pheweb makes for your dataset. All `pheweb ...` commands should be run in this directory.
+
+Make `config.py` in this directory. In it, set either `hg_build_number = 19` or `hg_build_number = 38`.  Other options you can set are listed [here](etc/detailed-loading-instructions.md#configuration-options).
+
+### 3. Check that your GWAS summary statistics files will work
+
+You need one file for each phenotype.  Most common GWAS file formats should work.  Here are the requirements:
+
+- It needs a header row.
+- Columns can be delimited by tabs, spaces, or commas.
+- It needs a column for the reference allele (which must always match the bases on the reference genome that you specified with `hg_build_number`) and a column for the alternate allele.  If you have a `MARKER_ID` column like `1:234_C/G`, that's okay too.  If you have an allele1 and allele2, and sometimes one or the other is the reference, then you'll need to modify your files.
+- It can be gzipped if you want.
+- Variants must be sorted by chromosome and position, with chromosomes in the order [1-22,X,Y,MT].
+
+The file must have columns for:
+
+| column description | name    | other allowed column names | allowed values |
+| ---                | ---     | ---                        | --- |
+| chromosome         | `chrom` | `#chrom`, `chr`            | 1-22, `X`, `Y`, `M`, `MT`, `chr1`, etc |
+| position           | `pos`   | `beg`, `begin`, `bp`       | integer |
+| reference allele   | `ref`   | `reference`                | must match reference genome |
+| alternate allele   | `alt`   | `alternate`                | anything |
+| p-value            | `pval`  | `pvalue`, `p`, `p.value`   | number in [0,1] |
+
+
+You may also have columns for:
+
+| column description                     | name           | other allowed column names | allowed values |
+| ---                                    | ---            | ---                        | --- |
+| minor allele frequency                 | `maf`          |                            | number in (0,0.5] |
+| allele frequency (of alternate allele) | `af`           | `a1freq`, `frq`            | number in (0,1) |
+| AF among cases                         | `case_af`      | `af.cases`                 | number in (0,1) |
+| AF among controls                      | `control_af`   | `af.controls`              | number in (0,1) |
+| allele count                           | `ac`           |                            | integer |
+| effect size (of alternate allele)      | `beta`         |                            | number |
+| standard error of effect size          | `sebeta`       | `se`                       | number |
+| odds ratio (of alternate allele)       | `or`           |                            | number |
+| R2                                     | `r2`           |                            | number |
+| number of samples                      | `num_samples`  | `ns`, `n`                  | integer, must be the same for every variant in its phenotype |
+| number of controls                     | `num_controls` | `ns.ctrl`, `n_controls`    | integer, must be the same for every variant in its phenotype |
+| number of cases                        | `num_cases`    | `ns.case`, `n_cases`       | integer, must be the same for every variant in its phenotype |
+
+
+Column names are case-insensitive.  If your file has a different column name, set `field_aliases = {"column_name": "field_name"}` in `config.py`.  For example, `field_aliases = {'P_BOLT_LMM_INF': 'pval', 'NSAMPLES': 'num_samples'}`.
+
+Any field can be null if it is one of ['', '.', 'NA', 'N/A', 'n/a', 'nan', '-nan', 'NaN', '-NaN', 'null', 'NULL'].  If a required field is null, the variant gets dropped.
+
+If your pval is stored as -log10(p) (like in REGENIE output), then set these variables in `config.py`: `pval_is_neglog10 = True` and `field_aliases = {'LOGP':'pval'}`.
+
+### 4. Make a list of your phenotypes
+
+Inside of your data directory, you need a file named `pheno-list.json` that looks like this:
+
+```json
+[
+ {
+  "assoc_files": ["/home/peter/data/ear-length.gz"],
+  "phenocode": "ear-length"
+ },
+ {
+  "assoc_files": ["/home/peter/data/a1c.X.gz","/home/peter/data/a1c.autosomal.gz"],
+  "phenocode": "A1C"
+ }
+]
+```
+
+Each phenotype needs `assoc_files` (a list of paths to association files) and `phenocode` (a string representing your phenotype that is used in filenames and URLs, consisting only of characters in `[A-Za-z0-9_~-]`).
+
+If you want, you can also include:
+
+- `phenostring` (string): a name for the phenotype. Shown in tables and tooltips and page headers.
+- `category` (string): groups together phenotypes in the PheWAS plot. Shown in tables and tooltips.
+- `num_cases`, `num_controls`, and/or `num_samples` (number): if your input data only has `AC` or `MAC`, this will be used to calculate `AF` or `MAF`.  Shown in tooltips.  If your input data has correctly-named columns for these, the command `pheweb phenolist read-info-from-association-files` will add them into your existing `pheno-list.json`.
+- anything else you want, but you'll have to modify templates to use it.
+
+You can use a csv by running:
+
+```
+pheweb phenolist import-phenolist "/path/to/pheno-list.csv"
+```
+
+or you can make one from scratch by running:
+
+```
+pheweb phenolist glob --star-is-phenocode "/home/peter/data/*.gz"
+```
+
+You can see other methods [here](etc/detailed-loading-instructions.md#making-pheno-listjson).
+
+
+### 5. Load your association files
+
+Run `pheweb process`.
+
+To distribute jobs across a cluster, follow [these instructions](etc/detailed-loading-instructions.md#distributing-jobs-across-a-cluster).
+
+To include VEP annotations, follow [these instructions](etc/detailed-loading-instructions.md#annotating-with-vep).
+
+If something breaks and you can't understand the error message or it's something that PheWeb should support by default, [open an issue on github](https://github.com/statgen/pheweb/issues/new) or email me.
+
+### 6. Serve the website
+
+Run `pheweb serve --open`.
+
+That command should either open a browser to your new PheWeb, or it should give you a URL that you can open in your browser to access your new PheWeb.
+If it doesn't, follow [the directions for hosting a PheWeb and accessing it from your browser](etc/detailed-webserver-instructions.md#hosting-a-pheweb-and-accessing-it-from-your-browser).
+
+### More options:
+
+To run pheweb through systemd, see sample file [here](etc/pheweb.service).
+To use Apache2 or Nginx, see instructions [here](etc/detailed-webserver-instructions.md#using-apache2-or-nginx).
+To require login via OAuth, see instructions [here](etc/detailed-webserver-instructions.md#using-oauth).
+To track page views with Google Analytics, see instructions [here](etc/detailed-webserver-instructions.md#using-google-analytics).
+To reduce storage use, see instructions [here](etc/detailed-webserver-instructions.md#reducing-storage-use).
+To customize page contents, see instructions [here](etc/detailed-webserver-instructions.md#customizing-page-contents).
+
+PheWeb can display genetic correlations generated by [another tool](https://github.com/statgen/pheweb-rg-pipeline).
+To use this feature, set `show_correlations = True`  in `config.py` and place the output of the rg pipeline as `pheno-correlations.txt` in the same folder as `pheno-list.json`.
+
+To hide the button for downloading summary stats, add `download_pheno_sumstats = "secret"` and `SECRET_KEY = "your random string"` in `config.py`.  That will make a secret page (printed to the console when you start the server) to share summary stats.
+To hide the button for downloading top hits and phenotypes, add `download_top_hits = "hide"` and `download_phenotypes = "hide"` respectively.
+
+To allow dynamically filtering the manhattan plot, run `pheweb best-of-pheno` and set `show_manhattan_filter_button=True` in `config.py`.
+
+# Modifying PheWeb
+
+See instructions [here](etc/detailed-development-instructions.md).
+See documentation about the files in `generated-by-pheweb/` [here](etc/detailed-internal-dataflow.md).
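Review note on `README.md`, step 3: the column requirements can be checked before running `pheweb process`. The following is a minimal sketch, not PheWeb's own parser. It assumes a tab-delimited file and covers only the required columns; the alias sets and null values are copied from the README tables, while `map_header`, `usable_rows`, and the choice to raise `ValueError` on an out-of-range p-value are hypothetical.

```python
import csv
import io

# Alias sets copied from the "required columns" table in README.md step 3.
REQUIRED_ALIASES = {
    "chrom": {"chrom", "#chrom", "chr"},
    "pos": {"pos", "beg", "begin", "bp"},
    "ref": {"ref", "reference"},
    "alt": {"alt", "alternate"},
    "pval": {"pval", "pvalue", "p", "p.value"},
}
# Null markers copied from README.md step 3.
NULL_VALUES = {"", ".", "NA", "N/A", "n/a", "nan", "-nan", "NaN", "-NaN", "null", "NULL"}


def map_header(header, field_aliases=None):
    """Map each required field to the input column that supplies it (case-insensitive)."""
    extra = {name.lower(): field for name, field in (field_aliases or {}).items()}
    mapping = {}
    for col in header:
        key = col.lower()
        if key in extra:
            mapping[extra[key]] = col
            continue
        for field, names in REQUIRED_ALIASES.items():
            if key in names:
                mapping[field] = col
    missing = sorted(set(REQUIRED_ALIASES) - set(mapping))
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return mapping


def usable_rows(text, field_aliases=None):
    """Yield required fields for rows that are non-null and have a p-value in [0, 1]."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    mapping = map_header(reader.fieldnames, field_aliases)
    for row in reader:
        values = {field: row[col] for field, col in mapping.items()}
        if any(v in NULL_VALUES for v in values.values()):
            continue  # README: a variant with a null required field gets dropped
        if not 0.0 <= float(values["pval"]) <= 1.0:
            # Hypothetical choice for this sketch: fail instead of guessing.
            raise ValueError(f"p-value out of range: {values}")
        yield values


# Made-up two-variant file; the second variant has a null p-value.
sample = "CHR\tBP\tREF\tALT\tP_BOLT_LMM_INF\n1\t100\tA\tG\t0.5\n1\t200\tC\tT\tNA\n"
rows = list(usable_rows(sample, {"P_BOLT_LMM_INF": "pval"}))
```

The `field_aliases` argument mirrors the `config.py` example in the README (`{'P_BOLT_LMM_INF': 'pval'}`); the sketch does not check sort order or reference alleles.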
--- a/etc/annotate_vep/make_vcf.py
+++ b/etc/annotate_vep/make_vcf.py
+#!/usr/bin/env python3
+
+from pathlib import Path
+import gzip, sys
+
+in_filepath = Path(sys.argv[1])
+out_filepath = Path(sys.argv[2])
+
+with gzip.open(in_filepath, 'rt') as in_f, gzip.open(out_filepath,'wt') as out_f:
+    def write(line:str): out_f.write(line); out_f.write('\n')
+
+    write('##fileformat=VCFv4.1')
+    write('##reference=http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa')
+    write('\t'.join('#CHROM POS ID REF ALT INFO'.split()))
+
+    header = next(in_f).rstrip('\n')
+    assert header.split('\t') == ['chrom', 'pos', 'ref', 'alt', 'rsids', 'nearest_genes']
+
+    for idx,line in enumerate(in_f):
+        chrom,pos,ref,alt,rsids,nearest_genes = line.rstrip('\n').split('\t')
+        variant_id = f'{chrom}:{pos}:{ref}:{alt}'
+        write('\t'.join([chrom, pos, variant_id, ref, alt, f'nearest_genes={nearest_genes}']))
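Review note on `make_vcf.py`: the per-line conversion can be exercised without any files. The sketch below restates the loop body as a function; `sites_line_to_vcf_line` is a hypothetical name and the variant shown is made up. It relies on the same six-column layout that the script asserts in the header.

```python
def sites_line_to_vcf_line(line: str) -> str:
    """Convert one tab-separated sites.tsv data line into one VCF data line."""
    chrom, pos, ref, alt, rsids, nearest_genes = line.rstrip("\n").split("\t")
    # The ID column carries chrom:pos:ref:alt so that merge.py can later
    # recover the variant from VEP's `Uploaded_variation` column.
    variant_id = f"{chrom}:{pos}:{ref}:{alt}"
    return "\t".join([chrom, pos, variant_id, ref, alt, f"nearest_genes={nearest_genes}"])


# Made-up variant, used only to show the shape of the output.
example = sites_line_to_vcf_line("1\t869334\tG\tA\trs123\tSAMD11\n")
vcf_fields = example.split("\t")
```

Note that `rsids` is read but not written, matching the script.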
--- a/etc/annotate_vep/merge.py
+++ b/etc/annotate_vep/merge.py
+#!/usr/bin/env python3
+
+from pathlib import Path
+import gzip, itertools, csv, sys
+
+import pheweb
+from pheweb.file_utils import VariantFileReader, read_maybe_gzip
+
+
+sites_filepath = Path(sys.argv[1])
+vep_filepath = Path(sys.argv[2])
+out_filepath = Path(sys.argv[3])
+
+def sites_reader():
+    with VariantFileReader(sites_filepath) as vfr:
+        variants = iter(vfr)
+        first_variant = next(variants)
+        assert sorted(first_variant.keys()) == sorted(['chrom', 'pos', 'ref', 'alt', 'rsids', 'nearest_genes']), first_variant
+        yield from itertools.chain([first_variant], variants)
+
+def vep_reader():
+    with read_maybe_gzip(vep_filepath) as sites_f:
+        reader = csv.DictReader((line.lstrip('#') for line in sites_f if not line.startswith('##')), delimiter='\t')
+        first_row = next(reader)
+        required_cols = {'Uploaded_variation', 'Consequence'}
+        missing_cols = required_cols - first_row.keys()
+        if missing_cols:
+            raise Exception(f'missing_cols={missing_cols} first_row={first_row}')
+        for row in itertools.chain([first_row], reader):
+            chrom, pos, ref, alt = row['Uploaded_variation'].split(':')
+            pos = int(pos)
+            yield {'chrom':chrom, 'pos':pos, 'ref':ref, 'alt':alt, 'consequence':row['Consequence']}
+
+
+with gzip.open(out_filepath,'wt') as out_f:
+    writer = csv.DictWriter(out_f, 'chrom pos ref alt rsids nearest_genes consequence'.split(), delimiter="\t")
+    writer.writeheader()
+
+    for site_v, vep_v in itertools.zip_longest(sites_reader(), vep_reader(), fillvalue={}):
+        # sites_filepath and vep_filepath must have a perfect one-to-one match!
+        assert all(site_v[k] == vep_v[k] for k in 'chrom pos ref alt'.split()), (site_v, vep_v)
+        writer.writerow({**site_v, **vep_v})
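Review note on `merge.py`: the merge is positional, so both inputs must list the same variants in the same order. Below is a minimal in-memory sketch of that idea. `merge_rows` is a hypothetical helper, and it deliberately differs from the script in two ways: it uses `.get()` so that a length mismatch is reported as a mismatch rather than a `KeyError`, and it raises `ValueError` instead of using `assert`.

```python
import itertools

KEY_FIELDS = ("chrom", "pos", "ref", "alt")


def merge_rows(site_rows, vep_rows):
    """Pair rows positionally; fail on any key mismatch or length difference."""
    for site_v, vep_v in itertools.zip_longest(site_rows, vep_rows, fillvalue={}):
        site_key = tuple(site_v.get(k) for k in KEY_FIELDS)
        vep_key = tuple(vep_v.get(k) for k in KEY_FIELDS)
        if site_key != vep_key:
            raise ValueError(f"rows do not line up: {site_v} vs {vep_v}")
        # Later keys win, so the VEP row contributes `consequence`.
        yield {**site_v, **vep_v}


# Made-up rows in the shapes produced by sites_reader() and vep_reader().
sites = [{"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "rsids": "rs1", "nearest_genes": "GENE1"}]
vep = [{"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "consequence": "missense_variant"}]
merged = list(merge_rows(sites, vep))
```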
--- a/etc/annotate_vep/run.sh
+++ b/etc/annotate_vep/run.sh
+#!/bin/bash
+set -euo pipefail
+readlinkf() { perl -MCwd -le 'print Cwd::abs_path shift' "$1"; }
+SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+
+set -x
+
+## This script should get run from the directory that contains `generated-by-pheweb`.
+## It needs `generated-by-pheweb/sites/sites.tsv`, so it should get run after `pheweb add-genes` and its preceding steps.
+## You can see the list of steps with `pheweb process -h`.
+## Then you should be able to continue with the rest of the steps.  I think `pheweb process` should pick up at the right spot.
+## To use these VEP consequences to filter the filterable manhattan plot, set `show_manhattan_filter_consequence = True` in `config.py`.
+
+## Uncomment your build:
+#build="GRCh38"
+build="GRCh37"
+
+## Setting parallel="yes" splits the input into chunks of 3 million variants and annotates them in parallel.
+## None of this is super robust, and parallel is even less.
+parallel="no"
+
+# This script needs a version of python that has pheweb installed.
+python_exe="/data/pheweb/pheweb-installs/pheweb1.3/venv/bin/python3"
+#python_exe="python3"
+
+
+mkdir -p vep_data/input
+chmod a+rwx vep_data
+if ! [[ -e input.vcf.gz ]]; then
+   "$python_exe" "$SCRIPTDIR/make_vcf.py" generated-by-pheweb/sites/sites.tsv input.vcf.gz
+fi
+
+if ! [[ $parallel = "yes" ]]; then
+   cp input.vcf.gz vep_data/input/
+else
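+    # Strip every header line (including #CHROM) before splitting, so no chunk gets a duplicate header.
+    zcat input.vcf.gz | grep -v '^#' | split --lines=$((3*1000*1000)) - split_
+    for file in split_*; do
+        # Reads the whole file so that zcat is never killed by SIGPIPE, which would abort the script under `pipefail`.
+        zcat input.vcf.gz | grep '^#' > "vep_data/input/$file"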
+        cat "$file" >> "vep_data/input/$file"
+        rm "$file"
+    done
+fi
+
+sudo docker pull ensemblorg/ensembl-vep
+sudo docker run -v "$PWD/vep_data":/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y "$build" -g all  # Do we really need `-g all`?
+
+if ! [[ $parallel = "yes" ]]; then
+    sudo docker run -v "$PWD/vep_data":/opt/vep/.vep ensemblorg/ensembl-vep ./vep --input_file=/opt/vep/.vep/input/input.vcf.gz --output_file=/opt/vep/.vep/output.tsv --force_overwrite --compress_output=gzip --cache --offline --assembly="$build" --regulatory --most_severe --check_existing
+    mv vep_data/output.tsv out-raw-vep.tsv
+
+else
+    for f in vep_data/input/split_*; do
+        name=$(basename "$f")
+        sudo docker run -v "$PWD/vep_data":/opt/vep/.vep ensemblorg/ensembl-vep ./vep --input_file=/opt/vep/.vep/input/$name --output_file=/opt/vep/.vep/output-$name.tsv --force_overwrite --compress_output=gzip --cache --offline --assembly="$build" --regulatory --most_severe --check_existing &
+    done
+    wait  # Wait for child processes to exit (hopefully successfully)
+    zcat vep_data/output-split_aa.tsv | grep '^#' | gzip > out-raw-vep.tsv
+    for f in $(echo vep_data/output-split_a*tsv|tr " " "\n"|sort); do
+        zcat $f | grep -v '^#' | gzip >> out-raw-vep.tsv
+    done
+fi
+
+"$python_exe" "$SCRIPTDIR/merge.py" generated-by-pheweb/sites/sites.tsv out-raw-vep.tsv sites-vep.tsv
+
+
+echo "Now check that sites-vep.tsv looks good."
+echo 'It should have the same variants as `generated-by-pheweb/sites/sites.tsv`.'
+echo "It should have the same columns, plus 'consequence'."
+echo 'Then run `mv sites-vep.tsv generated-by-pheweb/sites/sites.tsv`.'
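Review note on `run.sh`, parallel mode: each chunk handed to VEP needs the VCF header lines followed by a bounded number of records. The sketch below shows that chunking in Python rather than with `split`; `split_vcf_lines` is a hypothetical helper and is not part of the script, which uses chunks of 3 million variants.

```python
import itertools


def split_vcf_lines(lines, chunk_size):
    """Yield lists of lines: the header lines followed by up to chunk_size records."""
    lines = iter(lines)
    header = []
    first_record = None
    for line in lines:
        if line.startswith("#"):
            header.append(line)
        else:
            first_record = line
            break
    if first_record is None:
        return  # header only, nothing to annotate
    records = itertools.chain([first_record], lines)
    while True:
        chunk = list(itertools.islice(records, chunk_size))
        if not chunk:
            return
        yield header + chunk


# Tiny made-up VCF; the real script uses 3 * 1000 * 1000 records per chunk.
vcf = ["##fileformat=VCFv4.1", "#CHROM\tPOS", "1\t1", "1\t2", "1\t3"]
chunks = list(split_vcf_lines(vcf, 2))
```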
--- a/etc/demo.md
+++ b/etc/demo.md
+## Demo Navigating PheWeb
+
+On the homepage use the **search bar** to look up particular (1) genes (e.g. _APOB_, _FTO_, _TCF7L2_), (2) variants (by either rsID or chromosome:position on the appropriate genome build), or (3) phenotypes/traits.
+Note: View a list of traits on the PheWeb on the About page. 
+In any view, clicking on the PheWeb icon on the top left corner will allow you to return to the homepage. 
+
+If you are feeling adventurous, hit the **Random** icon in the top panel to view a randomly selected view from the PheWeb. 
+Selecting **Top Hits** in this panel will present a list of the most significant associations in this PheWeb in table format. 
+To learn more about the data behind the PheWeb select **About**.
+
+PheWeb shows 3 types of views: `Manhattan` + `quantile-quantile (QQ)` plots, `LocusZoom` plots, and `PheWAS` plots.
+
+Below I am looking up _TCF7L2_ in the search bar:
+
+![](/etc/images/screen-homepage-search.png?raw=true)
+
+Searching by gene will show you the most significant associations in that gene (table format) and a `LocusZoom` regional view showing the linkage disequilibrium among the variants in the region around the gene (below). 
+Selecting a different row in the table will change the `LocusZoom` plot accordingly.
+
+In my _TCF7L2_ search, this page appears, in which the `LocusZoom` plot below is displaying the row in the table that is selected (“Type 1 diabetes”):
+
+![](/etc/images/screen-lz.png?raw=true)
+
+All plots are interactive. You can hover your mouse above variants to learn more information about them, for example in the `LocusZoom` plot:
+
+![](/etc/images/screen-lz-tooltip.png?raw=true)
+
+Clicking on a variant in the `LocusZoom` plot will display a `PheWAS` view showing the association p-value for the variant across all the phenotypes in the PheWeb.
+In the `PheWAS` view an upwards facing triangle implies a positive effect of that variant on the phenotype, whereas a downwards facing triangle implies a negative effect. 
+Circles are used for variants in which the estimate of the beta is not precise (e.g. standard error encompassing zero). The variants are colored according to a user-specified biological grouping.
+
+I decided to select a _TCF7L2_ variant from the previous screenshot, and here is the `PheWAS` view followed by a table summary:
+
+![](/etc/images/screen-phewas.png?raw=true)
+
+Selecting a trait in the `PheWAS` plot will navigate you to the Manhattan plot view. Below the `Manhattan` is a table showing the most significant associations, and below that is the `quantile-quantile (QQ)` plot stratified by minor allele frequency bin and the genomic control lambda calculated from various percentiles of variants. 
+
+Below I selected “Stricture of Artery” from the `PheWAS` view, and am hovering my mouse over a variant in the `Manhattan` plot. 
+If I select this variant I will be brought to its `LocusZoom` regional plot.
+
+![](/etc/images/screen-manhattan.png?raw=true)
+
+Scrolling down on the same page I see the `QQ` plot below the table of top associations: 
+
+![](/etc/images/screen-qq.png?raw=true)
+
--- a/etc/detailed-apache2-instructions/README.md
+++ b/etc/detailed-apache2-instructions/README.md
+### Running PheWeb with Apache2
+
+1. Install apache2.
+
+2. Run `tmux` or `screen` to get a shell session that won't exit when you close your terminal.
+
+3. Run `pheweb serve --host 127.0.0.1 --port 9974 --num-workers 4 --no-reloader`.
+
+    - This command is equivalent to `gunicorn -b 127.0.0.1:9974 --access-logfile=- -w4 pheweb.serve.server:app`
+    - Use whatever port you want and whatever number of workers you want.
+
+4. Run `sudo a2enmod proxy proxy_http`.
+
+5. Copy `pheweb.conf` from this directory into `/etc/apache2/sites-available/`.
+
+    - If you need name-based virtual hosts, uncomment `ServerName foo.example.com` and use your domain instead.
+
+6. Run `sudo a2ensite pheweb`, which should make a symlink in `/etc/apache2/sites-enabled/`.
+
+7. Run `sudo service apache2 restart`.
+
+8. Any time the computer crashes, apache2 should start on its own but you'll need to start tmux and pheweb/gunicorn.
--- a/etc/detailed-apache2-instructions/pheweb.conf
+++ b/etc/detailed-apache2-instructions/pheweb.conf
+
+# This will hopefully prevent people from being able to browse the python source code if something goes wrong.
+Options -Indexes
+
+<VirtualHost *:80>
+        # requires `a2enmod proxy proxy_http`
+
+        ## Use this if you want to use name-based virtualhosts for multiple (sub)domains on one IP
+        # ServerName foo.example.com
+
+        ProxyPreserveHost On
+        ProxyPass / http://127.0.0.1:9974/
+        ProxyPassReverse / http://127.0.0.1:9974/
+
+        LogLevel warn
+        ErrorLog ${APACHE_LOG_DIR}/pheweb_error.log
+        CustomLog ${APACHE_LOG_DIR}/pheweb_access.log combined
+</VirtualHost>
--- a/etc/detailed-development-instructions.md
+++ b/etc/detailed-development-instructions.md
+## Detailed development instructions
+
+This document contains information useful for those looking to modify and develop the PheWeb source code. 
+It requires some familiarity with Python and the terminal.
+
+### Installing PheWeb
+In order to reflect code changes as you work, PheWeb should be installed in "editable" mode.
+
+1. Clone the repository to a new folder.
+2. Create and activate a new virtual environment. For example, in the checked-out PheWeb directory: `python3 -m venv .venv && source .venv/bin/activate` (if you prefer to manage your virtualenv some other way, that is ok)
+3. With the virtualenv activated, install the package in "editable" mode: `pip3 install -e .`
+4. When complete, verify that PheWeb is installed and working correctly: `pheweb -h`
+
+### Running static analysis
+
+You can do simple static analysis by running `./etc/pre-commit`.  It requires `pip3 install flake8 mypy`. If it is broken, it might not be a problem, but it can be a good way to catch bugs.
+
+### Running the unit tests
+The tests take a minute or two. PheWeb loads a sample dataset, runs a local server, and then queries some pages on that server.  It doesn't test everything in PheWeb, but it gets most of it.
+
+`pytest`
+
+
+### Running a local server with sample data
+Run `./tests/run-all.sh`, and then open <http://localhost:5000/> to view your site.  
+
+This uses the same data as the unit tests to serve a website you can browse.
+
+The homepage links to some good pages.  Most of the other pages aren't very useful because the data is so sparse.
+
+If you are only modifying the server code, you can quickly re-run just `pheweb serve` without re-running all the loading steps.  Use the line like `+ pheweb conf ... serve` that is printed to your console.
+
--- a/etc/detailed-install-instructions.md
+++ b/etc/detailed-install-instructions.md
+## Detailed install instructions
+
+First, try:
+
+```bash
+python3 -m pip install -U cython wheel pip setuptools
+python3 -m pip install pheweb
+pheweb
+```
+
+*(Note: In most cases this is equivalent to `pip3 install pheweb`, but if you have a bad version of `pip3` on your `$PATH`, using `python3 -m pip` will avoid it.)*
+
+- If you get the error `Segmentation fault (core dumped)`, try running `python3 -m pip install --no-binary=cffi,cryptography,pyopenssl pheweb` instead. ([more info](https://github.com/pypa/pip/issues/5366))
+
+- If you get an error related to pysam, run `python3 -m pip install -U cython; python3 -m pip install https://github.com/pysam-developers/pysam/archive/master.zip` and try again.
+
+- If installation was successful but running `pheweb` results in "command not found", you need to add `pheweb` to your PATH.  You should be able to just add the line `PATH="$HOME/.local/bin:$PATH"` to the end of `~/.bashrc`, start a new terminal, and run `pheweb` again.  If you're on macOS, you might need to add the line `source "$HOME/.bashrc"` to `~/.bash_profile`.
+
+- If that command fails in a different way, then use one of the approaches below.
+
+
+### Installing on Linux with `sudo`:
+
+*(Note: If you're not sure whether you have permissions for `sudo`, just try it.  If you don't have root access, it will say something like `you are not in the sudoers file.`)*
+
+Install prerequisites:
+
+- If you are running Ubuntu (or another `apt-get`-based distribution), run:
+
+   ```bash
+   sudo apt-get update
+   sudo apt-get install python3-pip python3-dev libz-dev libffi-dev
+   ```
+
+- If you are running Fedora, RedHat, or CentOS (or another `yum`-based distribution), run:
+
+   ```bash
+   sudo yum install python3-devel gcc-c++ zlib-devel
+   ```
+
+Then run:
+
+```bash
+sudo python3 -m pip install wheel cython
+sudo python3 -m pip install pheweb
+sudo pheweb
+```
+
+If this doesn't work, try the miniconda3 approach instead.
+
+
+### Installing on Linux or Mac with Miniconda3:
+
+If you are on macOS, install XCode Developer Tools with `xcode-select --install`.
+
+To install miniconda3, follow the instructions [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/).
+
+When you're installing miniconda3, you can close the terms & conditions with "q".
+You should install into the default directory of `~/miniconda3`.
+You should let miniconda modify `$PATH` in your `~/.bash_profile` or `~/.bashrc`, so that you'll be able to run just `pheweb` instead of needing to type `~/miniconda3/bin/pheweb` on the command line.
+
+Next, close and re-open your terminal, to make the new `$PATH` take effect.
+You can check that you have the miniconda3 python set up by running `which python3`, which should reply something like `/home/peter/miniconda3/bin/python3`.
+Then run:
+
+```bash
+python3 -m pip install pheweb
+```
+
+If none of these work, open a Github issue.
--- a/etc/detailed-internal-dataflow.md
+++ b/etc/detailed-internal-dataflow.md
+# Internal Data-Handling
+```
+                 input-association-files
+                      │         │
+                      │     [phenolist]
+                      │         │
+                      │         v
+                      │  pheno-list.json
+                      │   │           │
+                     [parse]          │
+                      │   │           │
+                      v   v           │
+                     parsed/*         │
+                      │   └──────┐    │
+                   [sites]       │    │
+   rsids.tsv.gz--[add-rsids]     │    │
+      genes.bed--[add-genes]     │    │
+                      │          │    │
+                      v          │    │
+                  sites.tsv      │    │
+                  │   │   └──[augment-phenos]
+          [make-...]  │             │
+                  │   │             v
+                  v   │          pheno_gz/*
+ cpras-rsids-sqlite3  └─[matrix]─┘  │  │  └─[best-of-pheno]─> best_of_pheno/*
+                           │        │  └─[qq]-> qq/*  
+                           v        └─[manhattan]-> manhattan/*
+                     matrix.tsv.gz                   │      │
+                           │                  [top-hits]  [phenotypes]
+           [gather-pvalues-for-each-gene]            │      │
+                           │                         v      v
+                           v               top_hits.json  phenotypes.json
+              best-phenos-by-gene.sqlite3
+```
+
+Square brackets show `pheweb <step>` subcommands.
+Filenames are in `generated-by-pheweb/` or its subdirectories (except `pheno-list.json` which is its sibling).
+
+Compare this diagram with the filepaths listed in `file_utils.py` and the steps in `pheweb process -h`.
+You can see all of the per-variant fields, per-association fields, and per-phenotype fields in `parse_utils.py`.
+
+- `parsed/*` files have the per-variant and per-association fields from the input files.
+- `sites.tsv` has every variant in the dataset, with the per-variant fields from the `parsed/*` plus `rsids` and `nearest_genes` and (optionally) `consequence`.
+- `pheno_gz/*` files are like `parsed/*` plus `rsids` and `nearest_genes` and (optionally) `consequence`.
+    - Every line in these files must begin with a line from `sites.tsv` in order for `pheweb matrix` to work; i.e., they must have the same per-variant fields.
+- `matrix.tsv.gz` contains all the per-variant fields (i.e., an exact copy of `sites.tsv` in its leftmost columns), and all per-assoc fields (with header format `<fieldname>@<phenocode>`, e.g., `maf@a1c`).
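+
+As a rough illustration of the `<fieldname>@<phenocode>` header format, here is a minimal Python sketch (not part of PheWeb) that separates per-variant columns from per-assoc columns. The helper name `split_matrix_header` and the example column names (`chrom`, `pos`, `ref`, `alt`, `pval`) are hypothetical; only the `@` convention comes from the description above.
+
+```python
+def split_matrix_header(columns):
+    """Split header columns into per-variant fields and per-assoc fields.
+
+    Columns containing "@" are treated as `<fieldname>@<phenocode>`;
+    everything else is treated as a per-variant field.
+    """
+    per_variant = []
+    per_assoc = {}  # phenocode -> list of field names
+    for col in columns:
+        if "@" in col:
+            field, phenocode = col.split("@", 1)
+            per_assoc.setdefault(phenocode, []).append(field)
+        else:
+            per_variant.append(col)
+    return per_variant, per_assoc
+
+
+# Hypothetical header, for illustration only.
+header = ["chrom", "pos", "ref", "alt", "rsids", "nearest_genes",
+          "pval@a1c", "maf@a1c", "pval@ear-length"]
+per_variant, per_assoc = split_matrix_header(header)
+print(per_variant)
+print(per_assoc)
+```
+
+Grouping by phenocode this way makes it easy to check that every phenotype carries the same set of per-assoc fields.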
--- a/etc/detailed-loading-instructions.md
+++ b/etc/detailed-loading-instructions.md
+## Configuration options
+
+- `assoc_min_maf` (float): an association (between a phenotype and variant) will only be included if its MAF is greater than or equal to this value. (default: `0`)
+
+- `cache` (string): a directory where files shared by all datasets can be cached. If you're loading multiple phewebs, setting `cache = "~/.pheweb/cache/"` will avoid downloading files multiple times. (default: None)
+
+- `num_procs` (int): the number of processes to use for parallel loading steps.  (default: 2/3 of the number of cores on your machine)
+
+- `loading_nice = True`: sets nice=19 (reducing cpu priority) and sets ionice to class "Idle" (reducing IO when anything else is using disk)
+
+- `debugging_limit_num_variants` (int): only parses this many variants from each input association file and from the rsids file.  This is convenient for quickly loading part of a dataset to check that it works as expected.
+
+- `download_pheno_sumstats`: explained in [README](../README.md)
+
+- `show_correlations`: explained in [README](../README.md)
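+
+For orientation, a `config.py` that sets several of the options above might look like the fragment below. The option names come from the list above, but every value is an illustrative assumption rather than a recommendation.
+
+```python
+# config.py (illustrative values only; tune them for your own dataset)
+assoc_min_maf = 0.001                # drop associations with MAF below 0.1%
+cache = "~/.pheweb/cache/"           # share downloaded files between phewebs
+num_procs = 8                        # processes used by parallel loading steps
+loading_nice = True                  # lower CPU and IO priority while loading
+debugging_limit_num_variants = 1000  # quick partial load; remove for a real run
+```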
+
+
+## Making pheno-list.json
+
+
+There are four ways to make a `pheno-list.json`:
+
+1. If you have a csv (or tsv, optionally gzipped) with a header that has exactly the right column names, just import it by running `pheweb phenolist import-phenolist "/path/to/my/pheno-list.csv"`.
+
+   If you have multiple association files for each phenotype, you may put them all into a single column with `|` between them. For example, your file `pheno-list.csv` might look like this:
+
+   ```
+   phenocode,assoc_files
+   a1c,/home/peter/data/a1c.autosomal.gz|/home/peter/data/a1c.X.gz
+   ear-length,/home/peter/data/ear-length.gz
+   ```
+
+2. If you have one association file per phenotype, you can use a shell-glob to get assoc-files. Suppose that your association files are at paths like:
+
+   - `/home/peter/data/a1c.autosomal.gz`
+   - `/home/peter/data/ear-length.gz`
+
+   Then you could run `pheweb phenolist glob "/home/peter/data/*.gz"` to get `assoc-files`.
+
+   To get `phenocodes`, you can use this command which will take the text after the last `/` and before the next `.`:
+
+   ```
+   pheweb phenolist extract-phenocode-from-filepath --simple
+   ```
+   
+   If that doesn't work, see `pheweb phenolist extract-phenocode-from-filepath -h` for how to use a regex capture group.
+
+3. If you have multiple association files for some phenotypes, you can follow the directions in option 2 and then run `pheweb phenolist unique-phenocode`.
+
+   For example, if your association files are at:
+
+   - `/home/peter/data/ear-length.gz`
+   - `/home/peter/data/a1c.autosomal.gz`
+   - `/home/peter/data/a1c.X.gz`
+
+   then you can run:
+
+   ```
+   pheweb phenolist glob "/home/peter/data/*.gz"
+   pheweb phenolist extract-phenocode-from-filepath --simple
+   pheweb phenolist unique-phenocode
+   ```
+
+4. If you want to do more advanced things, like merging in more information from another file, check out the tools in `pheweb phenolist --help`.
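+
+If you would rather build the csv for option 1 yourself, the following Python sketch shows one way to do it. It is not part of PheWeb: the helper names `simple_phenocode` and `build_phenolist_csv` are hypothetical, and the phenocode rule simply mimics the "text after the last `/` and before the next `.`" behavior described in option 2.
+
+```python
+import csv
+import io
+from collections import defaultdict
+from pathlib import PurePosixPath
+
+
+def simple_phenocode(path):
+    """Return the text after the last "/" and before the next "."."""
+    return PurePosixPath(path).name.split(".", 1)[0]
+
+
+def build_phenolist_csv(assoc_paths):
+    """Return csv text with one row per phenocode and "|"-joined assoc files."""
+    files_by_pheno = defaultdict(list)
+    for path in assoc_paths:
+        files_by_pheno[simple_phenocode(path)].append(path)
+
+    out = io.StringIO()
+    writer = csv.writer(out, lineterminator="\n")
+    writer.writerow(["phenocode", "assoc_files"])
+    for phenocode, paths in files_by_pheno.items():
+        writer.writerow([phenocode, "|".join(paths)])
+    return out.getvalue()
+
+
+paths = [
+    "/home/peter/data/a1c.autosomal.gz",
+    "/home/peter/data/a1c.X.gz",
+    "/home/peter/data/ear-length.gz",
+]
+phenolist_csv = build_phenolist_csv(paths)
+print(phenolist_csv)
+```
+
+Writing the result to `pheno-list.csv` gives a file you can pass to `pheweb phenolist import-phenolist`.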
+
+
+
+
+
+## Distributing jobs across a cluster
+
+`pheweb process` runs a bunch of steps, which you can see by running `pheweb process -h`.
+Some of those steps can instead be run distributed across a cluster.
+You can see which steps by running `pheweb cluster -h`.
+
+The schedulers SLURM and SGE are natively supported.
+Use `--engine=slurm` or `--engine=sge` when you run `pheweb cluster`.
+For other schedulers, you'll have to modify the output of `pheweb cluster`.
+
+For example, on SLURM you could run:
+
+```
+pheweb phenolist verify
+pheweb cluster --engine=slurm --step=parse
+pheweb sites && pheweb make-gene-aliases-sqlite3 && pheweb add-rsids && pheweb add-genes && pheweb make-cpras-rsids-sqlite3
+pheweb cluster --engine=slurm --step=augment-phenos
+pheweb cluster --engine=slurm --step=manhattan
+pheweb cluster --engine=slurm --step=qq
+pheweb process  # This won't re-create any files that are already up-to-date.
+```
+
+
+## Annotating with VEP
+
+Run the code in `etc/annotate_vep/run.sh`.  It requires docker (and thus sudo) and only works on hg38.
+Read the comments at the top of that script.
+
+
--- a/etc/detailed-webserver-instructions.md
+++ b/etc/detailed-webserver-instructions.md
+## Hosting a pheweb and accessing it from your browser
+
+Run `pheweb serve --open`.  That command should either open a web browser showing your PheWeb, or it should give you a URL that you can open in your web browser.  If that doesn't work, try these:
+
+- If pheweb's output says that port 5000 is already taken, run `pheweb serve --open --port=5001` instead.  Or try some other port.
+
+- If `pheweb serve` is running fine, but you can't open it in a web browser, you have two options:
+
+  1. Option 1: Serve PheWeb on port 80.
+
+     You need a port that can get through your firewall.  80 or 443 probably work.
+
+     To use port 80 or 443 you'll need root permissions.  Run `sudo $(which python3) $(which pheweb) serve --open --port=80`.
+     Then open the URL that it prints.
+
+  2. Option 2: Run PheWeb with the default settings, then connect an SSH tunnel between your computer and your server.
+
+     Here's how to do that if your laptop runs Mac or Linux:
+
+     Suppose you normally ssh in with `ssh me@example.com`.  Instead, run `ssh -N -L localhost:5000:localhost:5000 me@example.com`.
+     Then open <http://localhost:5000> in your web browser.
+
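+     macOS sometimes uses port 5000 itself, so I usually use port 8000 instead: run `pheweb serve --port=8000`, replace both `5000`s in the `ssh` command with `8000`, and open <http://localhost:8000>.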
+
+
+
+## Using Apache2 or Nginx
+
+At this point your PheWeb should be working how you want it to, except maybe the URL you're using.
+
+`pheweb serve` already uses gunicorn. For maximum speed and safety, you should run gunicorn routed through a reverse proxy like Apache2 or Nginx. If you choose Apache2, I have some documentation [here](detailed-apache2-instructions/README.md).
+
+
+
+## Using OAuth
+
+1. Make your own random `SECRET_KEY` for flask.
+
+   ```bash
+   $ python3 -c 'import os; print(os.urandom(24))'
+   b'(\x1e\xe5IY\xe4\xdc\x00s\xc6z\xf8\x9b\xf3\x99Miw\x9dct\xdf}\xeb'
+   ```
+
+   In `config.py` in your pheweb directory, set
+
+   ```python
+   SECRET_KEY = '(\x1e\xe5IY\xe4\xdc\x00s\xc6z\xf8\x9b\xf3\x99Miw\x9dct\xdf}\xeb'
+   ```
+
+2. Set up OAuth with Google.
+
+   Go [here](https://console.developers.google.com/apis/credentials) to create a project.
+   In the list "Authorized redirect URIs" add your OAuth callback URL, which should look like `http://example.com/callback/google` or `http://example.com:5000/callback/google`.
+
+   In `config.py`, set:
+
+   ```python
+   login = {
+     'GOOGLE_LOGIN_CLIENT_ID': 'something-something.apps.googleusercontent.com',
+     'GOOGLE_LOGIN_CLIENT_SECRET': 'letters-letters',
+     'whitelist': [
+       'user1@example.com',
+       'user2@example.com',
+       'user3@gmail.com',
+       '@umich.edu',  # Allows any email @umich.edu
+     ]
+   }
+   ```
+
+   The correct values of `GOOGLE_LOGIN_CLIENT_ID` and `GOOGLE_LOGIN_CLIENT_SECRET` are at the top of the Google project page.  The whitelist can contain any email addresses connected to Google accounts.
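+
+The key printed in step 1 contains escape sequences that are easy to mangle when copying. As an alternative (my assumption, not something PheWeb requires), you could generate a plain-ASCII key with Python's standard `secrets` module, since Flask accepts a string `SECRET_KEY`:
+
+```python
+import secrets
+
+# 24 random bytes, rendered as 48 hexadecimal characters.
+secret_key = secrets.token_hex(24)
+
+# Print a line that can be pasted into config.py as-is.
+print(f"SECRET_KEY = {secret_key!r}")
+```
+
+Either way, keep the key private and don't commit it to a public repository.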
+
+
+
+## Using Google Analytics
+
+Go [here](https://analytics.google.com/analytics/web) and follow the setup steps to get your own tracking ID (e.g., `AW-XXXXX` or `G-XXXXX`).
+
+Then, in `config.py`, set:
+
+```
+GOOGLE_ANALYTICS_TRACKING_ID = 'G-XXXXX'
+```
+
+and kill and restart `pheweb serve`.
+
+If you visit your site, you should see the activity at [the Google Analytics web console](https://analytics.google.com/analytics/web).
+
+
+## Reducing storage use
+To make PheWeb use less space, you can delete some of the files created during the loading process.
+
+Files in `generated-by-pheweb/parsed/` are only needed for re-building the site with more GWAS.  You can replace those files with symlinks to the files in `pheno_gz/`.
+
+Files in `generated-by-pheweb/tmp/` can also be removed.
+
+This should work:
+
+```bash
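+cd generated-by-pheweb/parsed/
+# Replace each parsed file with a symlink to its counterpart in pheno_gz/.
+for f in *; do
+  ln -sf "../pheno_gz/$f.gz" "$f"
+done
+
+cd ..
+# Remove temporary files left over from loading.
+rm tmp/*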
+```
+
+## Customizing page contents
+To modify the contents of the About page and others, create a directory named `custom_templates` next to `generated-by-pheweb`.
+
+Here are some templates that are intended to be modified:
+
+- `custom_templates/about/content.html`: contents of the about page
+- `custom_templates/index/h1.html`: large title above the search bar on the homepage
+- `custom_templates/index/below-h1.html`: subtext above the search bar on the homepage
+- `custom_templates/index/below-query.html`: beneath the search bar on the homepage
+- `custom_templates/pheno/h1.html`: the large text at the top of the phenotype (Manhattan Plot) page
+- `custom_templates/region/h1.html`: the large text at the top of the region (LocusZoom Region Plot) page
+- `custom_templates/title.html`: the title of the window, usually shown in the tab bar
+
+You can also override any template found in [pheweb/serve/templates](https://github.com/statgen/pheweb/tree/master/pheweb/serve/templates).  This works best if you copy the original version and modify it.  If you update PheWeb after overriding entire pages like this, those pages might break.  The templating language is Jinja2, and you can see which variables are available by looking at `route`s with `render_template` in [pheweb/serve/server.py](https://github.com/statgen/pheweb/tree/master/pheweb/serve/server.py).
--- a/etc/images/screen-homepage-search.png
+++ b/etc/images/screen-homepage-search.png
--- a/etc/images/screen-homepage.png
+++ b/etc/images/screen-homepage.png
--- a/etc/images/screen-lz-tooltip.png
+++ b/etc/images/screen-lz-tooltip.png
--- a/etc/images/screen-lz.png
+++ b/etc/images/screen-lz.png