Commit 1ef49895 authored by Andres Veidenberg

Initial commit

Showing 869 additions and 0 deletions
*This file only includes changes that are relevant to people running a pheweb site.*
## 1.3.15
- Fixes `pheweb cluster`.
## 1.3.14
- Fixes uppercase `field_aliases` in `config.py`. Column names are case-insensitive now.
## 1.3.13
- Speeds up autocomplete
## 1.3.12
- Adds beta/sebeta columns to the tables on /pheno/ and /variant/
- Shows AF range or MAF range better on /variant/
- Shows pvalue=0 as p<1e-320 in most places.
- Improves error-handling on /pheno-filter/
- Upgrades to LocusZoom.js 0.13, including new PNG downloads
- Fixes bugs in OAuth and WSGI
- Uses relative redirects, so that http vs https and hostname don't matter, except in OAuth code.
## 1.3.9
- Improves hovering on the filtered manhattan plots
- Includes code for annotating with VEP
- Shows category on /top_hits
**Changes needed to data:**
- Run `rm generated-by-pheweb/top_hits.json; pheweb top-hits`
## 1.3.7
- Uses gencode v37 (released 2021-Feb)
- Shows GClambda and num_samples/num_cases/num_controls and num_loci<5e-8 on /phenotypes
- Supports custom_templates/ again
**Changes needed to data:**
- Run `rm generated-by-pheweb/sites/sites.tsv && pheweb process` (because gene names must agree between autocompletion and the pre-processed data)
## 1.3.6
- Speeds up `pheweb gather-pvalues-for-each-gene` ~2x by avoiding reading any variant twice. (Thanks to finngen for this suggestion.)
- Allows live-filtering a manhattan plot by MAF or snp/indel, with instructions in README.
- Verifies that `num_cases + num_controls == num_samples` in `pheweb phenolist verify` (which is included in `pheweb process`).
## 1.3.5
- Removes dependence on `pandas` (because it wouldn't install on my laptop)
## 1.3.4
- Allows setting `loading_nice = True`.
- Allows setting `field_aliases` again.
- Reduces memory usage by `pheweb qq` by ~10x by switching to `numpy` and `pandas`.
- Fixes the bug where `pheweb matrix` breaks when `matrix.tsv.gz` is up-to-date.
## 1.3.0
- Rewrites configuration management, losing the ability to customize `extra_per_*_fields` and `null_values` and `field_aliases`.
- Fixes bug where config wasn't passed to child processes when using `PHEWEB_DATADIR` or `pheweb conf key=value <subcommand>`.
Bugs:
- `pheweb matrix` breaks when `matrix.tsv.gz` is already up-to-date.
## 1.2.5
- Makes sure that `pheno_gz/<phenocode>.gz.tbi` gets created, and re-runs traits that don't have it.
## 1.2.3
- Uses dbSNP v154 (the latest!) with way more rsids. To use them, run `rm generated-by-pheweb/sites/sites-rsids.tsv && pheweb process`.
## 1.2.1
- Allows hg38 via `hg_build_number=38`
- Downloads resources from <https://resources.pheweb.org> instead of processing raw data from EBI, dbSNP, etc.
- Replaces marisa-trie with sqlite3 to remove a flaky dependency and improve the order of autocomplete suggestions.
- Replaces more json files with sqlite3 to handle large datasets better.
- Compresses all internal files with `gzip -2` to save storage and IO.
- Gets rid of `generated-by-pheweb/pheno/`, relying on `generated-by-pheweb/pheno_gz/` instead.
- Allows `chr1`-`chr25` in input files.
**Changes needed to data:**
- Run `pheweb download-genes`
- Run `pheweb make-gene-aliases-sqlite3`
- Run `rm generated-by-pheweb/phenotypes.json; pheweb phenotypes`
- Run `pheweb gather-pvalues-for-each-gene`
## 1.2.0 (broken)
Bugs:
- `pheweb matrix` fails to match filenames to columns.
## 1.1.28
- Allows selecting which phenotypes to run in most steps via `pheweb <subcommand> --phenos=5-10`.
- Adds `pheweb cluster --step=<subcommand>`.
FROM ubuntu:22.04
RUN apt-get update
RUN apt-get install -y python3
RUN groupadd -g 568 apps \
&& useradd -m -d /app -s /bin/bash -u 568 -g 568 apps \
&& apt-get install -y python3-pip python3-dev python3-scipy python3-venv libz-dev libffi-dev
USER apps
ENV PATH=$PATH:/app/.local/bin
RUN python3 -m pip install wheel cython
RUN python3 -m pip install pheweb
RUN python3 -m pip install markupsafe==2.0.1
WORKDIR /data
EXPOSE 5000
CMD pheweb serve
Copyright 2023 Regents of the University of Michigan
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
recursive-include pheweb/serve/static *
recursive-include pheweb/serve/templates *
recursive-include pheweb *.py
include pheweb/load/cffi/*.cpp
README.md 0 → 100644
For a list of available instances of PheWeb, navigate [here](http://pheweb.sph.umich.edu).
For a walk-through demo see [here](etc/demo.md#demo-navigating-pheweb).
If you have questions or comments, check out our [Google Group](https://groups.google.com/g/pheweb-umich).
![screenshot of PheWAS plot](https://cloud.githubusercontent.com/assets/862089/25474725/3edbe256-2b02-11e7-8abb-0ca26d406b11.png)
# How to Cite PheWeb
If you use the PheWeb code base for your work, please cite our paper:
Gagliano Taliun, S.A., VandeHaar, P. et al. Exploring and visualizing large-scale genetic associations by using PheWeb. *Nat Genet* 52, 550–552 (2020).
# How to Build a PheWeb for your Data
If this is broken, [open an issue on github](https://github.com/statgen/pheweb/issues/new) and hopefully I can help.
### 1. Install PheWeb
```bash
pip3 install pheweb
```
- If that doesn't work, follow [the detailed install instructions](etc/detailed-install-instructions.md#detailed-install-instructions).
### 2. Create a directory and `config.py` for your new dataset
```
mkdir ~/my-new-pheweb && cd ~/my-new-pheweb
```
This directory will store all the files pheweb makes for your dataset. All `pheweb ...` commands should be run in this directory.
Make `config.py` in this directory. In it, either set `hg_build_number = 19` or `hg_build_number = 38`. Other options you can set are listed [here](etc/detailed-loading-instructions.md#configuration-options).
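As a minimal sketch, a `config.py` that sets only the required option could look like this. The value `38` is just one of the two choices named above; use `19` if your summary statistics are on hg19.

```python
# config.py: minimal sketch. Pick the build your summary statistics use.
hg_build_number = 38  # or 19
```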
### 3. Check that your GWAS summary statistics files will work
You need one file for each phenotype. Most common GWAS file formats should work. Here are the requirements:
- It needs a header row.
- Columns can be delimited by tabs, spaces, or commas.
- It needs a column for the reference allele (which must always match the bases on the reference genome that you specified with `hg_build_number`) and a column for the alternate allele. If you have a `MARKER_ID` column like `1:234_C/G`, that's okay too. If you have an allele1 and allele2, and sometimes one or the other is the reference, then you'll need to modify your files.
- It can be gzipped if you want.
- Variants must be sorted by chromosome and position, with chromosomes in the order [1-22,X,Y,MT].
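The sort requirement above can be checked before loading. The following is a rough sketch, not PheWeb's own validation: the helper names are made up for illustration, and it assumes that a leading `chr` prefix and `M` as a synonym for `MT` should be tolerated.

```python
# Chromosome order described above: 1-22, X, Y, MT.
# Assumption: "M" is treated as "MT" and a leading "chr" is ignored.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "MT": 25, "M": 25})

def sort_key(chrom, pos):
    """Return a tuple that orders variants by chromosome, then position."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return (CHROM_ORDER[name.upper()], int(pos))

def first_unsorted_row(rows):
    """Return the 1-based index of the first (chrom, pos) row that is out of order, or None."""
    previous = None
    for index, (chrom, pos) in enumerate(rows, start=1):
        key = sort_key(chrom, pos)
        if previous is not None and key < previous:
            return index
        previous = key
    return None
```

Feed it the `(chrom, pos)` pairs from your file in file order; a return value of `None` means no row was found out of order.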
The file must have columns for:
| column description | name | other allowed column names | allowed values |
| --- | --- | --- | --- |
| chromosome | `chrom` | `#chrom`, `chr` | 1-22, `X`, `Y`, `M`, `MT`, `chr1`, etc |
| position | `pos` | `beg`, `begin`, `bp` | integer |
| reference allele | `ref` | `reference` | must match reference genome |
| alternate allele | `alt` | `alternate` | anything |
| p-value | `pval` | `pvalue`, `p`, `p.value` | number in [0,1] |
You may also have columns for:
| column description | name | other allowed column names | allowed values |
| --- | --- | --- | --- |
| minor allele frequency | `maf` | | number in (0,0.5] |
| allele frequency (of alternate allele) | `af` | `a1freq`, `frq` | number in (0,1) |
| AF among cases | `case_af` | `af.cases` | number in (0,1) |
| AF among controls | `control_af` | `af.controls` | number in (0,1) |
| allele count | `ac` | | integer |
| effect size (of alternate allele) | `beta` | | number |
| standard error of effect size | `sebeta` | `se` | number |
| odds ratio (of alternate allele) | `or` | | number |
| R2 | `r2` | | number |
| number of samples | `num_samples` | `ns`, `n` | integer, must be the same for every variant in its phenotype |
| number of controls | `num_controls` | `ns.ctrl`, `n_controls` | integer, must be the same for every variant in its phenotype |
| number of cases | `num_cases` | `ns.case`, `n_cases` | integer, must be the same for every variant in its phenotype |
Column names are case-insensitive. If your file has a different column name, set `field_aliases = {"column_name": "field_name"}` in `config.py`. For example, `field_aliases = {'P_BOLT_LMM_INF': 'pval', 'NSAMPLES': 'num_samples'}`.
Any field can be null if it is one of ['', '.', 'NA', 'N/A', 'n/a', 'nan', '-nan', 'NaN', '-NaN', 'null', 'NULL']. If a required field is null, the variant gets dropped.
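To illustrate how the column names, aliases, and null markers above fit together, here is a small sketch that resolves the required columns in a header and flags rows with a null required field. It is an approximation written for this document, not PheWeb's parser; the function names are hypothetical, and only the required-column table is encoded.

```python
# Alias table copied from the required-column table above (lower-cased).
REQUIRED_ALIASES = {
    "chrom": ["chrom", "#chrom", "chr"],
    "pos": ["pos", "beg", "begin", "bp"],
    "ref": ["ref", "reference"],
    "alt": ["alt", "alternate"],
    "pval": ["pval", "pvalue", "p", "p.value"],
}
# Null markers listed above.
NULL_VALUES = {"", ".", "NA", "N/A", "n/a", "nan", "-nan", "NaN", "-NaN", "null", "NULL"}

def resolve_required_columns(header, field_aliases=None):
    """Map each required field to its column index; raise ValueError if one is missing."""
    lookup = {alias: field for field, aliases in REQUIRED_ALIASES.items() for alias in aliases}
    # User-supplied aliases are matched case-insensitively, like the built-in names.
    for column_name, field in (field_aliases or {}).items():
        lookup[column_name.lower()] = field
    found = {}
    for index, column_name in enumerate(header):
        field = lookup.get(column_name.lower())
        if field in REQUIRED_ALIASES and field not in found:
            found[field] = index
    missing = sorted(set(REQUIRED_ALIASES) - set(found))
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return found

def has_null_required_field(row, columns):
    """True if any required field in this row holds one of the null markers."""
    return any(row[index] in NULL_VALUES for index in columns.values())
```

A row for which `has_null_required_field` returns `True` corresponds to a variant that would be dropped, per the rule above.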
If your p-values are stored as -log10(p) (as in REGENIE output), then set these variables in `config.py`: `pval_is_neglog10 = True` and `field_aliases = {'LOGP':'pval'}`.
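The name `pval_is_neglog10` suggests the usual relationship p = 10^(-x) between a stored value x and the p-value. As a sketch of that arithmetic only (the helper name is hypothetical and this is not PheWeb's code):

```python
def pval_from_neglog10(neglog10_p):
    """Convert a -log10(p) value (number or numeric string) back to a p-value."""
    return 10.0 ** (-float(neglog10_p))
```

For example, a `LOGP` value of `2` corresponds to p = 0.01.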
### 4. Make a list of your phenotypes
Inside of your data directory, you need a file named `pheno-list.json` that looks like this:
```json
[
  {
    "assoc_files": ["/home/peter/data/ear-length.gz"],
    "phenocode": "ear-length"
  },
  {
    "assoc_files": ["/home/peter/data/a1c.X.gz","/home/peter/data/a1c.autosomal.gz"],
    "phenocode": "A1C"
  }
]
```
Each phenotype needs `assoc_files` (a list of paths to association files) and `phenocode` (a string representing your phenotype that is used in filenames and URLs, comprised of `[A-Za-z0-9_~-]`).
If you want, you can also include:
- `phenostring` (string): a name for the phenotype. Shown in tables and tooltips and page headers.
- `category` (string): groups together phenotypes in the PheWAS plot. Shown in tables and tooltips.
- `num_cases`, `num_controls`, and/or `num_samples` (number): if your input data only has `AC` or `MAC`, this will be used to calculate `AF` or `MAF`. Shown in tooltips. If your input data has correctly-named columns for these, the command `pheweb phenolist read-info-from-association-files` will add them into your existing `pheno-list.json`.
- anything else you want, but you'll have to modify templates to use it.
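If you prefer to generate `pheno-list.json` from a script, something like the following sketch would produce the structure shown above. The helper name is made up for illustration, and the phenocode check simply applies the character set quoted above.

```python
import json
import re

# Character set for phenocodes quoted in the text above.
PHENOCODE_RE = re.compile(r"[A-Za-z0-9_~-]+")

def make_phenolist(entries):
    """Build a pheno-list structure from (phenocode, assoc_files) pairs, checking phenocodes."""
    phenolist = []
    for phenocode, assoc_files in entries:
        if not PHENOCODE_RE.fullmatch(phenocode):
            raise ValueError(f"invalid phenocode: {phenocode!r}")
        phenolist.append({"assoc_files": list(assoc_files), "phenocode": phenocode})
    return phenolist

phenolist = make_phenolist([
    ("ear-length", ["/home/peter/data/ear-length.gz"]),
    ("A1C", ["/home/peter/data/a1c.X.gz", "/home/peter/data/a1c.autosomal.gz"]),
])
phenolist_json = json.dumps(phenolist, indent=2)
# Write `phenolist_json` to pheno-list.json in your data directory.
```

Optional keys such as `phenostring` or `category` can be added to each dictionary in the same way.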
You can use a csv by running:
```
pheweb phenolist import-phenolist "/path/to/pheno-list.csv"
```
or you can make one from scratch by running:
```
pheweb phenolist glob --star-is-phenocode "/home/peter/data/*.gz"
```
You can see other methods [here](etc/detailed-loading-instructions.md#making-pheno-listjson).
### 5. Load your association files
Run `pheweb process`.
To distribute jobs across a cluster, follow [these instructions](etc/detailed-loading-instructions.md#distributing-jobs-across-a-cluster).
To include VEP annotations, follow [these instructions](etc/detailed-loading-instructions.md#annotating-with-vep).
If something breaks and you can't understand the error message or it's something that PheWeb should support by default, [open an issue on github](https://github.com/statgen/pheweb/issues/new) or email me.
### 6. Serve the website
Run `pheweb serve --open`.
That command should either open a browser to your new PheWeb, or it should give you a URL that you can open in your browser to access your new PheWeb.
If it doesn't, follow [the directions for hosting a PheWeb and accessing it from your browser](etc/detailed-webserver-instructions.md#hosting-a-pheweb-and-accessing-it-from-your-browser).
### More options:
To run pheweb through systemd, see sample file [here](etc/pheweb.service).
To use Apache2 or Nginx, see instructions [here](etc/detailed-webserver-instructions.md#using-apache2-or-nginx).
To require login via OAuth, see instructions [here](etc/detailed-webserver-instructions.md#using-oauth).
To track page views with Google Analytics, see instructions [here](etc/detailed-webserver-instructions.md#using-google-analytics).
To reduce storage use, see instructions [here](etc/detailed-webserver-instructions.md#reducing-storage-use).
To customize page contents, see instructions [here](etc/detailed-webserver-instructions.md#customizing-page-contents).
PheWeb can display genetic correlations generated by [another tool](https://github.com/statgen/pheweb-rg-pipeline).
To use this feature, set `show_correlations = True` in `config.py` and place the output of the rg pipeline as `pheno-correlations.txt` in the same folder as `pheno-list.json`.
To hide the button for downloading summary stats, add `download_pheno_sumstats = "secret"` and `SECRET_KEY = "your random string"` in `config.py`. That will make a secret page (printed to the console when you start the server) to share summary stats.
To hide the button for downloading top hits and phenotypes, add `download_top_hits = "hide"` and `download_phenotypes = "hide"` respectively.
To allow dynamically filtering the manhattan plot, run `pheweb best-of-pheno` and set `show_manhattan_filter_button=True` in `config.py`.
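Put together, the optional settings described in this section would appear in `config.py` roughly as follows. This is a sketch that enables everything at once; in practice you would keep only the lines you need, and `SECRET_KEY` should be your own random string.

```python
# config.py fragment: optional settings described above.
show_correlations = True              # needs pheno-correlations.txt next to pheno-list.json
download_pheno_sumstats = "secret"    # share summary stats only via the secret page
SECRET_KEY = "your random string"     # replace with your own random string
download_top_hits = "hide"            # hide the top-hits download button
download_phenotypes = "hide"          # hide the phenotypes download button
show_manhattan_filter_button = True   # run `pheweb best-of-pheno` first
```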
# Modifying PheWeb
See instructions [here](etc/detailed-development-instructions.md).
See documentation about the files in `generated-by-pheweb/` [here](etc/detailed-internal-dataflow.md).
#!/usr/bin/env python3

from pathlib import Path
import gzip, sys

in_filepath = Path(sys.argv[1])
out_filepath = Path(sys.argv[2])

with gzip.open(in_filepath, 'rt') as in_f, gzip.open(out_filepath,'wt') as out_f:

    def write(line:str): out_f.write(line); out_f.write('\n')

    write('##fileformat=VCFv4.1')
    write('##reference=http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa')
    write('\t'.join('#CHROM POS ID REF ALT INFO'.split()))

    header = next(in_f).rstrip('\n')
    assert header.split('\t') == ['chrom', 'pos', 'ref', 'alt', 'rsids', 'nearest_genes']

    for idx,line in enumerate(in_f):
        chrom,pos,ref,alt,rsids,nearest_genes = line.rstrip('\n').split('\t')
        variant_id = f'{chrom}:{pos}:{ref}:{alt}'
        write('\t'.join([chrom, pos, variant_id, ref, alt, f'nearest_genes={nearest_genes}']))
#!/usr/bin/env python3

from pathlib import Path
import gzip, itertools, csv, sys
import pheweb
from pheweb.file_utils import VariantFileReader, read_maybe_gzip

sites_filepath = Path(sys.argv[1])
vep_filepath = Path(sys.argv[2])
out_filepath = Path(sys.argv[3])

def sites_reader():
    with VariantFileReader(sites_filepath) as vfr:
        variants = iter(vfr)
        first_variant = next(variants)
        assert sorted(first_variant.keys()) == sorted(['chrom', 'pos', 'ref', 'alt', 'rsids', 'nearest_genes']), first_variant
        yield from itertools.chain([first_variant], variants)

def vep_reader():
    with read_maybe_gzip(vep_filepath) as sites_f:
        # Skip `##` metadata lines and strip the leading `#` from the column-header line.
        reader = csv.DictReader((line.lstrip('#') for line in sites_f if not line.startswith('##')), delimiter='\t')
        first_row = next(reader)
        required_cols = {'Uploaded_variation', 'Consequence'}
        missing_cols = required_cols - first_row.keys()
        if missing_cols:
            raise Exception(f'missing_cols={missing_cols} first_row={first_row}')
        for row in itertools.chain([first_row], reader):
            chrom, pos, ref, alt = row['Uploaded_variation'].split(':')
            pos = int(pos)
            yield {'chrom':chrom, 'pos':pos, 'ref':ref, 'alt':alt, 'consequence':row['Consequence']}

with gzip.open(out_filepath,'wt') as out_f:
    writer = csv.DictWriter(out_f, 'chrom pos ref alt rsids nearest_genes consequence'.split(), delimiter="\t")
    writer.writeheader()
    for site_v, vep_v in itertools.zip_longest(sites_reader(), vep_reader(), fillvalue={}):
        # sites_filepath and vep_filepath must have a perfect one-to-one match!
        # `.get` makes a length mismatch fail this assertion (with both rows in the message) instead of raising KeyError.
        assert all(site_v.get(k) == vep_v.get(k) for k in 'chrom pos ref alt'.split()), (site_v, vep_v)
        writer.writerow({**site_v, **vep_v})
#!/bin/bash
set -euo pipefail
readlinkf() { perl -MCwd -le 'print Cwd::abs_path shift' "$1"; }
SCRIPTDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
set -x
## This script should get run from the directory that contains `generated-by-pheweb`.
## It needs `generated-by-pheweb/sites/sites.tsv`, so it should get run after `pheweb add-genes` and its preceding steps.
## You can see the list of steps with `pheweb process -h`.
## Then you should be able to continue with the rest of the steps. I think `pheweb process` should pick up at the right spot.
## To use these VEP consequences to filter the filterable manhattan plot, set `show_manhattan_filter_consequence = True` in `config.py`.
## Uncomment your build:
#build="GRCh38"
build="GRCh37"
## Setting parallel="yes" splits the input into chunks of 3 million variants and annotates them in parallel.
## None of this is super robust, and parallel is even less.
parallel="no"
# This script needs a version of python that has pheweb installed.
python_exe="/data/pheweb/pheweb-installs/pheweb1.3/venv/bin/python3"
#python_exe="python3"
mkdir -p vep_data/input
chmod a+rwx vep_data
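if ! [[ -e input.vcf.gz ]]; then
    "$python_exe" "$SCRIPTDIR/make_vcf.py" generated-by-pheweb/sites/sites.tsv input.vcf.gz
fi
if ! [[ $parallel = "yes" ]]; then
    cp input.vcf.gz vep_data/input/
else
    # Read the VCF header lines once. (Piping zcat into `head` can fail under `set -o pipefail`.)
    vcf_header="$(zcat input.vcf.gz | grep '^#')"
    # Split only the variant lines, so that no chunk ends up with a duplicated `#CHROM` line.
    zcat input.vcf.gz | grep -v '^#' | split --lines=$((3*1000*1000)) - split_
    for file in split_*; do
        printf '%s\n' "$vcf_header" > "vep_data/input/$file"
        cat "$file" >> "vep_data/input/$file"
        rm "$file"
    done
fi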
sudo docker pull ensemblorg/ensembl-vep
sudo docker run -v "$PWD/vep_data":/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y "$build" -g all # Do we really need `-g all`?
if ! [[ $parallel = "yes" ]]; then
    sudo docker run -v "$PWD/vep_data":/opt/vep/.vep ensemblorg/ensembl-vep ./vep --input_file=/opt/vep/.vep/input/input.vcf.gz --output_file=/opt/vep/.vep/output.tsv --force_overwrite --compress_output=gzip --cache --offline --assembly="$build" --regulatory --most_severe --check_existing
    mv vep_data/output.tsv out-raw-vep.tsv
else
    for f in vep_data/input/split_*; do
        name=$(basename "$f")
        sudo docker run -v "$PWD/vep_data":/opt/vep/.vep ensemblorg/ensembl-vep ./vep --input_file="/opt/vep/.vep/input/$name" --output_file="/opt/vep/.vep/output-$name.tsv" --force_overwrite --compress_output=gzip --cache --offline --assembly="$build" --regulatory --most_severe --check_existing &
    done
    wait # Wait for child processes to exit (hopefully successfully)
    zcat vep_data/output-split_aa.tsv | grep '^#' | gzip > out-raw-vep.tsv
    # The glob expands in sorted order and covers every chunk, not only those named split_a*.
    for f in vep_data/output-split_*.tsv; do
        zcat "$f" | grep -v '^#' | gzip >> out-raw-vep.tsv
    done
fi
"$python_exe" "$SCRIPTDIR/merge.py" generated-by-pheweb/sites/sites.tsv out-raw-vep.tsv sites-vep.tsv
echo "Now check that sites-vep.tsv looks good."
echo 'It should have the same variants as `generated-by-pheweb/sites/sites.tsv`.'
echo "It should have the same columns, plus 'consequence'."
echo 'Then run `mv sites-vep.tsv generated-by-pheweb/sites/sites.tsv`.'
## Demo Navigating PheWeb
On the homepage use the **search bar** to look up particular (1) genes (e.g. _APOB_, _FTO_, _TCF7L2_), (2) variants (by either rsID or chromosome:position on the appropriate genome build), or (3) phenotypes/traits.
Note: View a list of traits on the PheWeb on the About page.
In any view, clicking on the PheWeb icon on the top left corner will allow you to return to the homepage.
If you are feeling adventurous, hit the **Random** icon in the top panel to view a randomly selected view from the PheWeb.
Selecting **Top Hits** in this panel will present a list of the most significant associations in this PheWeb in table format.
To learn more about the data behind the PheWeb select **About**.
PheWeb shows 3 types of views: `Manhattan` + `quantile-quantile (QQ)` plots, `LocusZoom` plots, and `PheWAS` plots.
Below I am looking up _TCF7L2_ in the search bar:
![](/etc/images/screen-homepage-search.png?raw=true)
Searching by gene will show you the most significant associations in that gene (table format) and a `LocusZoom` regional view showing the linkage disequilibrium among the variants in the region around the gene (below).
Selecting a different row in the table will change the `LocusZoom` plot accordingly.
In my _TCF7L2_ search, this page appears, in which the `LocusZoom` plot below is displaying the row in the table that is selected (“Type 1 diabetes”):
![](/etc/images/screen-lz.png?raw=true)
All plots are interactive. You can hover your mouse above variants to learn more information about them, for example in the `LocusZoom` plot:
![](/etc/images/screen-lz-tooltip.png?raw=true)
Clicking on a variant in the `LocusZoom` plot will display a `PheWAS` view showing the association p-value for the variant across all the phenotypes in the PheWeb.
In the `PheWAS` view an upwards facing triangle implies a positive effect of that variant on the phenotype, whereas a downwards facing triangle implies a negative effect.
Circles are used for variants in which the estimate of the beta is not precise (e.g. standard error encompassing zero). The variants are colored according to a user-specified biological grouping.
I decided to select a _TCF7L2_ variant from the previous screenshot, and here is the `PheWAS` view followed by a table summary:
![](/etc/images/screen-phewas.png?raw=true)
Selecting a trait in the `PheWAS` plot will navigate you to the Manhattan plot view. Below the `Manhattan` is a table showing the most significant associations, and below that is the `quantile-quantile (QQ)` plot stratified by minor allele frequency bin and the genomic control lambda calculated from various percentiles of variants.
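One common way to compute a genomic control lambda at a given percentile is to convert the observed p-value at that percentile into a 1-degree-of-freedom chi-square statistic and divide it by the statistic expected under the null. The sketch below assumes that definition; it is not taken from PheWeb's source, and the function names are made up for illustration.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()

def chi2_1df_from_pvalue(p):
    """Chi-square (1 df) statistic whose upper-tail probability is p, for 0 < p < 1."""
    z = _STANDARD_NORMAL.inv_cdf(1 - p / 2)
    return z * z

def gc_lambda(pvalues, quantile=0.5):
    """Ratio of observed to expected chi-square at the given quantile of the sorted p-values."""
    ordered = sorted(pvalues)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    observed_p = ordered[index]
    return chi2_1df_from_pvalue(observed_p) / chi2_1df_from_pvalue(quantile)
```

With `quantile=0.5` this is the familiar median-based lambda; values near 1 indicate little inflation, and values well above 1 suggest inflated test statistics. P-values of exactly 0 or 1 are outside the domain of this sketch.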
Below I selected “Stricture of Artery” from the `PheWAS` view, and am hovering my mouse over a variant in the `Manhattan` plot.
If I select this variant I will be brought to its `LocusZoom` regional plot.
![](/etc/images/screen-manhattan.png?raw=true)
Scrolling down on the same page I see the `QQ` plot below the table of top associations:
![](/etc/images/screen-qq.png?raw=true)
### Running PheWeb with Apache2
1. Install apache2.
2. Run `tmux` or `screen` to get a shell session that won't exit when you close your terminal.
3. Run `pheweb serve --host 127.0.0.1 --port 9974 --num-workers 4 --no-reloader`.
- This command is equivalent to `gunicorn -b 127.0.0.1:9974 --access-logfile=- -w4 pheweb.serve.server:app`
- Use whatever port you want and whatever number of workers you want.
4. Run `sudo a2enmod proxy proxy_http`.
5. Copy `pheweb.conf` from this directory into `/etc/apache2/sites-available/`.
- If you need name-based virtual hosts, uncomment `ServerName foo.example.com` and use your domain instead.
6. Run `sudo a2ensite pheweb`, which should make a symlink in `/etc/apache2/sites-enabled/`.
7. Run `sudo service apache2 restart`.
8. Any time the computer crashes, apache2 should start on its own, but you'll need to start tmux and pheweb/gunicorn again yourself.
# This will hopefully prevent people from being able to browse the python source code if something goes wrong.
Options -Indexes
<VirtualHost *:80>
# requires `a2enmod proxy proxy_http`
## Use this if you want to use name-based virtualhosts for multiple (sub)domains on one IP
# ServerName foo.example.com
ProxyPreserveHost On
ProxyPass / http://127.0.0.1:9974/
ProxyPassReverse / http://127.0.0.1:9974/
LogLevel warn
ErrorLog ${APACHE_LOG_DIR}/pheweb_error.log
CustomLog ${APACHE_LOG_DIR}/pheweb_access.log combined
</VirtualHost>
## Detailed development instructions
This document contains information useful for those looking to modify and develop the PheWeb source code.
It requires some familiarity with Python and terminal.
### Installing PheWeb
In order to reflect code changes as you work, PheWeb should be installed in "editable" mode.
1. Clone the repository to a new folder.
2. Create and activate a new virtual environment. For example, in the checked-out PheWeb directory: `python3 -m venv .venv && source .venv/bin/activate` (if you prefer to manage your virtualenv some other way, that is ok)
3. With the virtualenv activated, install the package in "editable" mode: `pip3 install -e .`
4. When complete, verify that PheWeb is installed and working correctly: `pheweb -h`
### Running static analysis
You can do simple static analysis by running `./etc/pre-commit`. It requires `pip3 install flake8 mypy`. If it is broken, it might not be a problem, but it can be a good way to catch bugs.
### Running the unit tests
The tests take a minute or two. PheWeb loads a sample dataset, runs a local server, and then queries some pages on that server. It doesn't test everything in PheWeb, but it gets most of it.
`pytest`
### Running a local server with sample data
Run `./tests/run-all.sh`, and then open <http://localhost:5000/> to view your site.
This uses the same data as the unit tests to serve a website you can browse.
The homepage links to some good pages. Most of the other pages aren't very useful because the data is so sparse.
If you are only modifying the server code, you can quickly re-run just `pheweb serve` without re-running all the loading steps. Use the line like `+ pheweb conf ... serve` that is printed to your console.
## Detailed install instructions
First, try:
```bash
python3 -m pip install -U cython wheel pip setuptools
python3 -m pip install pheweb
pheweb
```
*(Note: In most cases this is equivalent to `pip3 install pheweb`, but if you have a bad version of `pip3` on your `$PATH`, using `python3 -m pip` will avoid it.)*
- If you get the error `Segmentation fault (core dumped)`, try running `python3 -m pip install --no-binary=cffi,cryptography,pyopenssl pheweb` instead. ([more info](https://github.com/pypa/pip/issues/5366))
- If you get an error related to pysam, run `python3 -m pip install -U cython; python3 -m pip install https://github.com/pysam-developers/pysam/archive/master.zip` and try again.
- If installation was successful but running `pheweb` results in "command not found", you need to add `pheweb` to your PATH. You should be able to just add the line `PATH="$HOME/.local/bin:$PATH"` to the end of `~/.bashrc`, start a new terminal, and run `pheweb` again. If you're on macOS, you might need to add the line `source "$HOME/.bashrc"` to `~/.bash_profile`.
- If that command fails in a different way, then use one of the approaches below.
### Installing on Linux with `sudo`:
*(Note: If you're not sure whether you have permissions for `sudo`, just try it. If you don't have root access, it will say something like `you are not in the sudoers file.`)*
Install prerequisites:
- If you are running Ubuntu (or another `apt-get`-based distribution), run:
```bash
sudo apt-get update
sudo apt-get install python3-pip python3-dev libz-dev libffi-dev
```
- If you are running Fedora, RedHat, or CentOS (or another `yum`-based distribution), run:
```bash
sudo yum install python3-devel gcc-c++ zlib-devel
```
Then run:
```bash
sudo python3 -m pip install wheel cython
sudo python3 -m pip install pheweb
sudo pheweb
```
If this doesn't work, try the miniconda3 approach instead.
### Installing on Linux or Mac with Miniconda3:
If you are on macOS, install XCode Developer Tools with `xcode-select --install`.
To install miniconda3, follow the instructions [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/).
When you're installing miniconda3, you can close the terms & conditions with "q".
You should install into the default directory of `~/miniconda3`.
You should let miniconda modify `$PATH` in your `~/.bash_profile` or `~/.bashrc`, so that you'll be able to run just `pheweb` instead of needing to type `~/miniconda3/bin/pheweb` on the command line.
Next, close and re-open your terminal, to make the new `$PATH` take effect.
You can check that you have the miniconda3 python set up by running `which python3`, which should reply something like `/home/peter/miniconda3/bin/python3`.
Then run:
```bash
python3 -m pip install pheweb
```
If none of these work, open a Github issue.
# Internal Data-Handling
```
input-association-files
│ │
│ [phenolist]
│ │
│ v
│ pheno-list.json
│ │ │
[parse] │
│ │ │
v v │
parsed/* │
│ └──────┐ │
[sites] │ │
rsids.tsv.gz--[add-rsids] │ │
genes.bed--[add-genes] │ │
│ │ │
v │ │
sites.tsv │ │
│ │ └──[augment-phenos]
[make-...] │ │
│ │ v
v │ pheno_gz/*
cpras-rsids-sqlite3 └─[matrix]─┘ │ │ └─[best-of-pheno]─> best_of_pheno/*
│ │ └─[qq]-> qq/*
v └─[manhattan]-> manhattan/*
matrix.tsv.gz │ │
│ [top-hits] [phenotypes]
[gather-pvalues-for-each-gene] │ │
│ v v
v top_hits.json phenotypes.json
best-phenos-by-gene.sqlite3
```
Square brackets show `pheweb <step>` subcommands.
Filenames are in `generated-by-pheweb/` or its subdirectories (except `pheno-list.json` which is its sibling).
Reference this diagram against the filepaths listed in `file_utils.py` and the steps in `pheweb process -h`.
You can see all of the per-variant fields, per-association fields, and per-phenotype fields in `parse_utils.py`.
- `parsed/*` files have the per-variant and per-association fields from the input files.
- `sites.tsv` has every variant in the dataset, with the per-variant fields from the `parsed/*` plus `rsids` and `nearest_genes` and (optionally) `consequence`.
- `pheno_gz/*` files are like `parsed/*` plus `rsids` and `nearest_genes` and (optionally) `consequence`.
- Every line in these files must begin with a line from `sites.tsv` in order for `pheweb matrix` to work. ie, they've got to have the same per-variant fields.
- `matrix.tsv.gz` contains all the per-variant fields (ie, an exact copy of `sites.tsv` in its left few columns), and all per-assoc fields (with header format `<fieldname>@<phenocode>`, eg `maf@a1c`).
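As an illustration of the `<fieldname>@<phenocode>` header convention described above, this sketch separates a `matrix.tsv.gz` header into per-variant columns and per-association columns. The function name is hypothetical; it relies only on the header format stated above.

```python
def split_matrix_header(header):
    """Split column names into per-variant fields and (field, phenocode) pairs."""
    per_variant, per_assoc = [], []
    for column in header:
        if "@" in column:
            # Per-association column, e.g. "maf@a1c" -> ("maf", "a1c").
            field, phenocode = column.split("@", 1)
            per_assoc.append((field, phenocode))
        else:
            per_variant.append(column)
    return per_variant, per_assoc
```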
## Configuration options
- `assoc_min_maf` (float): an association (between a phenotype and variant) will only be included if its MAF is greater than or equal to this value. (default: `0`)
- `cache` (string): a directory where files shared by all datasets can be cached. If you're loading multiple phewebs, setting `cache = "~/.pheweb/cache/"` will avoid downloading files multiple times. (default: None)
- `num_procs` (int): the number of processes to use for parallel loading steps. (default: 2/3 of the number of cores on your machine)
- `loading_nice = True`: sets nice=19 (reducing cpu priority) and sets ionice to class "Idle" (reducing IO when anything else is using disk)
- `debugging_limit_num_variants` (int): only parses this many variants from each input association file and from the rsids file. This is convenient for quickly loading part of a dataset to check that it works as expected.
- `download_pheno_sumstats`: explained in [README](../README.md)
- `show_correlations`: explained in [README](../README.md)
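For reference, the loading-related options above could be combined in `config.py` as in this sketch. The numeric values are arbitrary examples chosen for illustration, not recommendations.

```python
# config.py fragment: loading options described above (example values only).
assoc_min_maf = 0.001                  # drop associations with MAF below this
cache = "~/.pheweb/cache/"             # share downloaded files across datasets
num_procs = 4                          # processes for parallel loading steps
loading_nice = True                    # lower CPU and IO priority while loading
debugging_limit_num_variants = 1000    # remove once the test load looks right
```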
## Making pheno-list.json
There are four ways to make a `pheno-list.json`:
1. If you have a csv (or tsv, optionally gzipped) with a header that has exactly the right column names, just import it by running `pheweb phenolist import-phenolist "/path/to/my/pheno-list.csv"`.
If you have multiple association files for each phenotype, you may put them all into a single column with `|` between them. For example, your file `pheno-list.csv` might look like this:
```
phenocode,assoc_files
a1c,/home/peter/data/a1c.autosomal.gz|/home/peter/data/a1c.X.gz
ear-length,/home/peter/data/ear-length.gz
```
2. If you have one association file per phenotype, you can use a shell-glob to get assoc-files. Suppose that your association files are at paths like:
- `/home/peter/data/a1c.autosomal.gz`
- `/home/peter/data/ear-length.gz`
Then you could run `pheweb phenolist glob "/home/peter/data/*.gz"` to get `assoc-files`.
To get `phenocodes`, you can use this command which will take the text after the last `/` and before the next `.`:
```
pheweb phenolist extract-phenocode-from-filepath --simple
```
If that doesn't work, see `pheweb phenolist extract-phenocode-from-filepath -h` for how to use a regex capture group.
3. If you have multiple association files for some phenotypes, you can follow the directions in 2 and then run `pheweb phenolist unique-phenocode`.
For example, if your association files are at:
- `/home/peter/data/ear-length.gz`
- `/home/peter/data/a1c.autosomal.gz`
- `/home/peter/data/a1c.X.gz`
then you can run:
```
pheweb phenolist glob "/home/peter/data/*.gz"
pheweb phenolist extract-phenocode-from-filepath --simple
pheweb phenolist unique-phenocode
```
4. If you want to do more advanced things, like merging in more information from another file, check out the tools in `pheweb phenolist --help`.
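
If you'd rather build the csv from option 1 yourself, here is a minimal sketch in Python. It is not part of PheWeb, and the helper names (`phenocode_from_path`, `build_phenolist_rows`, `write_phenolist_csv`) are made up for this example. It takes the text after the last `/` and before the next `.` as the phenocode, and joins multiple files for one phenocode with `|`.

```python
import csv
import os
from collections import defaultdict


def phenocode_from_path(path):
    """Return the text after the last `/` and before the next `.`."""
    return os.path.basename(path).split(".", 1)[0]


def build_phenolist_rows(paths):
    """Group association files by phenocode and join each group with `|`."""
    grouped = defaultdict(list)
    for path in sorted(paths):
        grouped[phenocode_from_path(path)].append(path)
    return [
        {"phenocode": code, "assoc_files": "|".join(files)}
        for code, files in sorted(grouped.items())
    ]


def write_phenolist_csv(rows, out_path):
    """Write the rows with the header `phenocode,assoc_files`."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["phenocode", "assoc_files"])
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_phenolist_csv(build_phenolist_rows(glob.glob("/home/peter/data/*.gz")), "pheno-list.csv")` (after `import glob`) would produce a file that you could then load with `pheweb phenolist import-phenolist pheno-list.csv`.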
## Distributing jobs across a cluster
`pheweb process` runs a bunch of steps, which you can see by running `pheweb process -h`.
Some of those steps can instead be run distributed across a cluster.
You can see which steps by running `pheweb cluster -h`.
The schedulers SLURM and SGE are natively supported.
Use `--engine=slurm` or `--engine=sge` when you run `pheweb cluster`.
For other schedulers, you'll have to modify the output of `pheweb cluster`.
For example, on SLURM you could run:
```
pheweb phenolist verify
pheweb cluster --engine=slurm --step=parse
pheweb sites && pheweb make-gene-aliases-sqlite3 && pheweb add-rsids && pheweb add-genes && pheweb make-cpras-rsids-sqlite3
pheweb cluster --engine=slurm --step=augment-phenos
pheweb cluster --engine=slurm --step=manhattan
pheweb cluster --engine=slurm --step=qq
pheweb process # This won't re-create any files that are already up-to-date.
```
## Annotating with VEP
Run the code in `etc/annotate_vep/run.sh`. It requires docker (and thus sudo) and only works on hg38.
Read the comments at the top of that script.
<br><br><br><br><br><br><br><br><br><br><br><br>
## Hosting a pheweb and accessing it from your browser
Run `pheweb serve --open`. That command should either open a web browser showing your PheWeb, or it should give you a URL that you can open in your web browser. If that doesn't work, try these:
- If pheweb's output says that port 5000 is already taken, run `pheweb serve --open --port=5001` instead. Or try some other port.
- If `pheweb serve` is running fine, but you can't open it in a web browser, you have two options:
1. Option 1: Serve PheWeb on port 80.
You need a port that can get through your firewall. 80 or 443 probably work.
To use port 80 or 443 you'll need root permissions. Run `sudo $(which python3) $(which pheweb) serve --open --port=80`.
Then open the URLs that it prints.
2. Option 2: Run PheWeb with the default settings, then connect an SSH tunnel between your computer and your server.
Here's how to do that if your laptop runs Mac or Linux:
Suppose you normally ssh in with `ssh me@example.com`. Instead, run `ssh -N -L localhost:5000:localhost:5000 me@example.com`.
Then open <http://localhost:5000> in your web browser.
Sometimes MacOS itself uses port 5000, so I usually use port 8000.
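
If you're not sure which port is free, a small check like the following can help before you pick a `--port` or a local tunnel port. This is only a sketch: the function names are hypothetical, and it only tests whether the port can be bound on `127.0.0.1`, which is an assumption about how you intend to use it.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind host:port, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(candidates=(5000, 5001, 8000)):
    """Return the first candidate port that is free, or None if all are taken."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None
```

You could then pass the result of `first_free_port()` to `pheweb serve --port=...`, or use it as the local side of the `ssh -L` tunnel.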
## Using Apache2 or Nginx
At this point your PheWeb should be working how you want it to, except maybe the URL you're using.
`pheweb serve` already uses gunicorn. For maximum speed and safety, you should run gunicorn routed through a reverse proxy like Apache2 or Nginx. If you choose Apache2, I have some documentation [here](detailed-apache2-instructions/README.md).
## Using OAuth
1. Make your own random `SECRET_KEY` for flask.
```bash
$ python3 -c 'import os; print(os.urandom(24))'
b'(\x1e\xe5IY\xe4\xdc\x00s\xc6z\xf8\x9b\xf3\x99Miw\x9dct\xdf}\xeb'
```
In `config.py` in your pheweb directory, set
```python
SECRET_KEY = '(\x1e\xe5IY\xe4\xdc\x00s\xc6z\xf8\x9b\xf3\x99Miw\x9dct\xdf}\xeb'
```
2. Set up OAuth with Google.
Go [here](https://console.developers.google.com/apis/credentials) to create a project.
In the list "Authorized redirect URIs" add your OAuth callback URL, which should look like `http://example.com/callback/google` or `http://example.com:5000/callback/google`.
In `config.py`, set:
```python
login = {
'GOOGLE_LOGIN_CLIENT_ID': 'something-something.apps.googleusercontent.com',
'GOOGLE_LOGIN_CLIENT_SECRET': 'letters-letters',
'whitelist': [
'user1@example.com',
'user2@example.com',
'user3@gmail.com',
'@umich.edu', # Allows any email @umich.edu
]
}
```
The correct values of `GOOGLE_LOGIN_CLIENT_ID` and `GOOGLE_LOGIN_CLIENT_SECRET` are at the top of the Google project page. The whitelist can contain any email addresses connected to Google accounts.
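
To make the whitelist semantics concrete, here is a sketch of how entries like the ones above could be matched. This is not PheWeb's actual code: the function `is_allowed` is hypothetical, and the case-insensitive comparison is my assumption. It only illustrates that a full address must match exactly, while an entry starting with `@` allows any address at that domain.

```python
def is_allowed(email, whitelist):
    """Return True if `email` matches a whitelist entry.

    Entries starting with "@" match any address at that domain;
    all other entries must match the whole address.
    """
    email = email.strip().lower()
    for entry in whitelist:
        entry = entry.strip().lower()
        if entry.startswith("@"):
            if email.endswith(entry):
                return True
        elif email == entry:
            return True
    return False
```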
## Using Google Analytics
Go [here](https://analytics.google.com/analytics/web) and do whatever you have to to get your own tracking id (e.g. AW-XXXXX or G-XXXXX).
Then, in `config.py`, set:
```
GOOGLE_ANALYTICS_TRACKING_ID = 'G-XXXXX'
```
and kill and restart `pheweb serve`.
If you visit your site, you should see the activity at [the Google Analytics web console](https://analytics.google.com/analytics/web).
## Reducing storage use
To make PheWeb use less space, you can delete some of the files created during the loading process.
Files in `generated-by-pheweb/parsed/` are only needed for re-building the site with more GWAS. You can replace those files with symlinks to the files in `pheno_gz/`.
Files in `generated-by-pheweb/tmp/` can also be removed.
This should work:
```bash
cd generated-by-pheweb/parsed/
for f in *; do
ln -sf "../pheno_gz/$f.gz" "$f"
done
cd ..
rm tmp/*
```
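
If you want to be more careful, the same idea can be sketched in Python with a dry-run mode and a check that the matching file in `pheno_gz/` actually exists before anything is replaced. The function name `replace_parsed_with_symlinks` is made up for this example; the directory layout is the one described above.

```python
from pathlib import Path


def replace_parsed_with_symlinks(base="generated-by-pheweb", dry_run=True):
    """Replace files in parsed/ with symlinks to ../pheno_gz/<name>.gz.

    Files that are already symlinks, or that have no matching file in
    pheno_gz/, are left alone. Returns the names that were (or, in a dry
    run, would be) replaced.
    """
    base = Path(base)
    replaced = []
    for parsed in sorted((base / "parsed").iterdir()):
        target = base / "pheno_gz" / (parsed.name + ".gz")
        if parsed.is_symlink() or not target.is_file():
            continue
        replaced.append(parsed.name)
        if not dry_run:
            parsed.unlink()
            # Relative link, resolved from inside parsed/, like the shell loop.
            parsed.symlink_to(Path("..") / "pheno_gz" / target.name)
    return replaced
```

Run it once with `dry_run=True` to see which files would be replaced, then again with `dry_run=False`.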
## Customizing page contents
To modify the contents of the About page and others, create a directory named `custom_templates` next to `generated-by-pheweb`.
Here are some templates that are intended to be modified:
- `custom_templates/about/content.html`: contents of the about page
- `custom_templates/index/h1.html`: large title above the search bar on the homepage
- `custom_templates/index/below-h1.html`: subtext above the search bar on the homepage
- `custom_templates/index/below-query.html`: beneath the search bar on the homepage
- `custom_templates/pheno/h1.html`: the large text at the top of the phenotype (Manhattan Plot) page
- `custom_templates/region/h1.html`: the large text at the top of the region (LocusZoom Region Plot) page
- `custom_templates/title.html`: the title of the window, usually shown in the tab bar
You can also override any template found in [pheweb/serve/templates](https://github.com/statgen/pheweb/tree/master/pheweb/serve/templates). It'll work best if you copy the original version and modify it. If you update PheWeb after overriding entire pages like this, those pages might break. The templating language is Jinja2, and you can see what variables are available by looking at `route`s with `render_template` in [pheweb/serve/server.py](https://github.com/statgen/pheweb/tree/master/pheweb/serve/server.py).
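
To get started, a short script like this sketch can create the `custom_templates` directory with a couple of placeholder files. The template paths are from the list above; the function name `scaffold_custom_templates` and the placeholder HTML are made up for the example. Existing files are never overwritten.

```python
from pathlib import Path

# Paths are relative to custom_templates/; the contents are placeholders.
TEMPLATES = {
    "about/content.html": "<p>Describe your dataset here.</p>\n",
    "index/h1.html": "My PheWeb\n",
}


def scaffold_custom_templates(site_dir=".", templates=TEMPLATES):
    """Create missing template files under <site_dir>/custom_templates.

    `site_dir` should be the directory that contains generated-by-pheweb.
    Returns the relative paths of the files that were created.
    """
    root = Path(site_dir) / "custom_templates"
    created = []
    for rel, body in templates.items():
        path = root / rel
        if path.exists():
            continue
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body)
        created.append(path.relative_to(root).as_posix())
    return created
```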