Treehouse Public Data

The Treehouse Childhood Cancer Initiative is a research arm of the UCSC Genomics Institute. We enable the sharing of pediatric cancer genomic data using tools developed by our Genomics Institute colleagues. We use shared data to compare a child's tumor against tumors from both pediatric and adult patients using a "pan-cancer", or cross-cancer, gene expression analysis. Our goal is to identify situations where an approved drug, often one developed for adults, is predicted to work for a child with cancer.

As part of our research, we have gathered a compendium of RNA gene expression data which we have made available for download and visualization.

Our samples are derived from partner clinical sites and publicly available repositories, including TARGET and TCGA. Expression data from over 12,000 samples is available, along with clinical data including age, gender, disease type, and the original ID assigned by the provider site.

New: Kallisto TPM Gene Expression

June 2023

Kallisto TPM expression matrices are now available for the Tumor Compendium v11 Public PolyA and the Tumor Compendium v9 Public Ribodeplete. These matrices were generated using Kallisto version 0.43.1 and include 12,537 samples (PolyA) and 295 samples (Ribodeplete).

Newest Tumor Compendium: v11

April 2020

The Tumor Compendium v11 Public PolyA is now available for download and visualization. This compendium includes RNA expression data from over 12,000 samples, including 406 newly added samples from the Therapeutically Applicable Research To Generate Effective Treatments (TARGET) program.

For older dataset releases, please visit the Previous Compendia page.

Visualizations

TumorMap

The UCSC TumorMap interactively displays samples in the Treehouse dataset positioned according to their RNA profiles. Users can color the samples based on dataset features like Disease. This browser shows samples clustered using the OpenOrd algorithm and best separates smaller groups. (See "TumorMap: Exploring the Molecular Similarities of Cancer Samples in an Interactive Portal." Cancer Research November 2017).

Cluster Browser

The UCSC Cell Browser interactively displays samples in the Treehouse dataset positioned according to their RNA profiles. Users can color the samples based on dataset features like Disease. This browser quickly shows samples clustered using multiple algorithms such as UMAP and t-SNE, and best shows relationships among larger groups.

Xena

UCSC Xena allows users to explore the Treehouse dataset, finding correlations and trends within and across genomic and phenotypic variables. Users can interactively add, remove, and rearrange arbitrary slices of data, including genes, transcripts, and other dataset features. This example from our July 2017 dataset shows that neuroblastoma, in comparison to other pediatric cancers, has much higher ALK gene expression and a younger patient population.

Files

Three different file types are available.

Selected De-identified Clinical Data

Age at diagnosis, gender, and disease are provided for RNASeq samples compiled by the UCSC Treehouse Childhood Cancer Initiative. Age is in years. For datasets v8 and onwards, the "pedaya" field provides Pediatric Adolescent and Young Adult status. For datasets v9 and onwards, the "site_donor_id" provides the donor ID assigned by the sample's original repository, available for samples derived from publicly available repositories; and "site_sampleid" provides the original sample ID assigned. Samples are derived from clinical sites, publicly available repositories (see our Dataset Accessions Legend page), TARGET, and TCGA.
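As a rough illustration of working with the clinical fields described above, the sketch below counts samples per disease among patients younger than a chosen age. The column names (sample_id, age_at_dx, gender, disease), the tab-separated layout, and the rows themselves are invented for the example and are not taken from the actual Treehouse files; check the header of the downloaded file and adjust accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt: column names and rows are made up for illustration.
SAMPLE_TSV = (
    "sample_id\tage_at_dx\tgender\tdisease\n"
    "TH_EXAMPLE_1\t2\tfemale\tneuroblastoma\n"
    "TH_EXAMPLE_2\t15\tmale\tosteosarcoma\n"
    "EXAMPLE_3\t\tmale\tneuroblastoma\n"
    "EXAMPLE_4\t61\tfemale\tglioma\n"
)


def load_clinical(handle):
    """Read a tab-separated clinical table into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def disease_counts_under_age(rows, max_age_years, age_column="age_at_dx"):
    """Count samples per disease where age (in years) is below max_age_years.

    Rows with a missing or non-numeric age are skipped rather than guessed.
    """
    counts = Counter()
    for row in rows:
        raw = (row.get(age_column) or "").strip()
        if not raw:
            continue  # age not reported for this sample
        try:
            age = float(raw)
        except ValueError:
            continue  # non-numeric age value
        if age < max_age_years:
            counts[row["disease"]] += 1
    return counts


rows = load_clinical(io.StringIO(SAMPLE_TSV))
counts = disease_counts_under_age(rows, 18)
```

To use this with a real download, replace io.StringIO(SAMPLE_TSV) with an opened file handle and pass the actual age column name as age_column.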

TPM Gene Expression, log2-Normalized

Values in this dataset use HUGO gene names and are TPM (Transcripts Per Million) values transformed as log2(TPM + 1). These values were originally generated with Ensembl gene IDs. Where multiple Ensembl gene IDs map to a single HUGO name, the corresponding values had to be combined into a single data point. For datasets v4 and v5, this was done by taking the mean of the input values. For datasets v8 and onwards, the input values were instead added together before the logarithm was taken, to more accurately reflect the underlying data.
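The combination step described above can be sketched as follows. This is a minimal illustration, not the Treehouse code: the function name, the gene identifiers, and the mapping are hypothetical. The "sum" mode follows the v8-and-onwards description (add TPM values, then take the logarithm). The "mean" mode assumes the mean was taken over TPM values before the logarithm, which the description of v4 and v5 does not state explicitly.

```python
import math
from collections import defaultdict


def collapse_to_hugo(tpm_by_ensembl, ensembl_to_hugo, method="sum"):
    """Combine per-Ensembl-ID TPM values into one log2(TPM + 1) value per HUGO name.

    method="sum":  add the TPM values, then take log2(x + 1).
    method="mean": average the TPM values, then take log2(x + 1).
                   (Assumption: the order of mean and log is not given in the text.)
    """
    grouped = defaultdict(list)
    for ensembl_id, tpm in tpm_by_ensembl.items():
        hugo = ensembl_to_hugo.get(ensembl_id)
        if hugo is None:
            continue  # no HUGO name for this Ensembl ID; skipped in this sketch
        grouped[hugo].append(tpm)

    result = {}
    for hugo, values in grouped.items():
        if method == "sum":
            combined = sum(values)
        elif method == "mean":
            combined = sum(values) / len(values)
        else:
            raise ValueError(f"unknown method: {method!r}")
        result[hugo] = math.log2(combined + 1.0)
    return result


# Hypothetical identifiers and values, for illustration only.
tpm = {"ENSG_A": 3.0, "ENSG_B": 4.0, "ENSG_C": 1.0}
mapping = {"ENSG_A": "GENE1", "ENSG_B": "GENE1", "ENSG_C": "GENE2"}

summed = collapse_to_hugo(tpm, mapping, method="sum")
averaged = collapse_to_hugo(tpm, mapping, method="mean")
```

In this toy input, GENE1 has two contributing IDs with TPM 3 and 4. Summing gives log2(7 + 1) = 3, while averaging gives log2(3.5 + 1), a smaller value; the summed form keeps the total transcript abundance attributed to the gene name.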

Expected Count Gene Expression

Values in this dataset are expected_count and use Ensembl gene IDs.

Download

Download the newest Tumor PolyA, Tumor Ribodeplete, Tumor Hybrid Capture, Patient-Derived Xenograft and Cell Line datasets here, as well as all publication-associated datasets.

Patient-Derived Xenograft Compendium PolyA v22.03 (March 2022)

Visualize

Files

This compendium was released in March 2022. It includes 33 samples from Treehouse clinical sites (identifiers start with 'TH') and publicly available repositories. These data were generated by library preparation methods including polyA selection.

Pipeline

The genomic data provided here was processed with the RNA-Seq pipeline developed by the UC Santa Cruz Computational Genomics Lab. This pipeline processes primary BAM or FASTQ files into gene expression data using the RSEM software package. The pipeline is available for general use; the source code is hosted on GitHub at BD2KGenomics/toil-rnaseq and a Dockerized version is available at UCSC-Treehouse/pipelines.

Contribute

We are committed to data sharing and encourage you to be part of this sharing network. If you use our data and have data of your own, pay it forward by running our pipeline and sharing the results back. We will add these results to our public compendium of expression data, with a credit to your contribution. By doing so, your samples and those of other partner sites will contribute to an ever-improving virtuous cycle of data sharing in which each participant's data benefits future participants.

Support From Our Partners

Thank you to all who are sharing data. A special shout out to the St. Baldrick's Foundation and the California Initiative to Advance Precision Medicine, not only for supporting Treehouse but for their commitment to data sharing and their efforts to advance responsible data sharing.

Data Usage Policy

If you use our data, please acknowledge the Treehouse Childhood Cancer Initiative as the source of the data.
If you use our pipeline to process your data, we would appreciate it if you shared the results with us so that they can be added to the public database. Just send us an email and we'll get in touch to arrange the data transfer. Our goal is to benefit researchers and pediatric patients everywhere through access to data.
