distToNearest - Distance to nearest neighbor
Get non-zero distance of every sequence (as defined by
sequenceColumn) to its
nearest sequence sharing same V gene, J gene, and sequence length.
distToNearest(db, sequenceColumn = "JUNCTION", vCallColumn = "V_CALL", jCallColumn = "J_CALL", model = c("ham", "aa", "hh_s1f", "hh_s5f", "mk_rs1nf", "mk_rs5nf", "m1n_compat", "hs1f_compat"), normalize = c("len", "none"), symmetry = c("avg", "min"), first = TRUE, nproc = 1, fields = NULL, cross = NULL, mst = FALSE, subsample = NULL, progress = FALSE)
- data.frame containing sequence data.
- name of the column containing nucleotide sequences to compare. Also used to determine sequence length for grouping.
- name of the column containing the V-segment allele calls.
- name of the column containing the J-segment allele calls.
- underlying SHM model, which must be one of
c("ham", "aa", "hh_s1f", "hh_s5f", "mk_rs1nf", "hs1f_compat", "m1n_compat"). See Details for further information.
- method of normalization. The default is
"len", which divides the distance by the length of the sequence group. If
"none"then no normalization if performed.
- if model is hs5f, distance between seq1 and seq2 is either the average (avg) of seq1->seq2 and seq2->seq1 or the minimum (min).
TRUEonly the first call of the gene assignments is used. if
FALSEthe union of ambiguous gene assignments is used to group all sequences with any overlapping gene calls.
- number of cores to distribute the function over.
- additional fields to use for grouping.
- character vector of column names to use for grouping to calculate distances across groups. Meaning the columns that define self versus others.
TRUE, return comma-separated branch lengths from minimum spanning tree.
- number of sequences to subsample for speeding up pairwise-distance-matrix calculation.
Subsampling is performed without replacement in each group of sequences with the
jCallColumn, and junction length. If
subsampleis larger than the unique number of sequences in each group, then the subsampling process is ignored for that group. For each sequence in
db, the reported
DIST_NEARESTis the distance to the closest sequence in the subsampled set for the group. If
NULLno subsampling is performed.
TRUEprint a progress bar.
Returns a modified
db data.frame with nearest neighbor distances in the
DIST_NEAREST column if
cross was specified, distances will be added as the
The distance to nearest neighbor can be used to estimate a threshold for assigning Ig sequences to clonal groups. A histogram of the resulting vector is often bimodal, with the ideal threshold being a value that separates the two modes.
The following distance measures are accepted by the
"ham": Single nucleotide Hamming distance matrix from getDNAMatrix with gaps assigned zero distance.
"aa": Single amino acid Hamming distance matrix from getAAMatrix.
"hh_s1f": Human single nucleotide distance matrix derived from HH_S1F with calcTargetingDistance.
"hh_s5f": Human 5-mer nucleotide context distance matix derived from HH_S5F with calcTargetingDistance.
"mk_rs1nf": Mouse single nucleotide distance matrix derived from MK_RS1NF with calcTargetingDistance.
"mk_rs5nf": Mouse 5-mer nucleotide context distance matrix derived from MK_RS1NF with calcTargetingDistance.
"hs1f_compat": Backwards compatible human single nucleotide distance matrix used in SHazaM v0.1.4 and Change-O v0.3.3.
"m1n_compat": Backwards compatibley mouse single nucleotide distance matrix used in SHazaM v0.1.4 and Change-O v0.3.3.
NAs: if, for a given combination of V gene, J gene, and sequence length,
there is only 1 sequence (as defined by
NA is returned
instead of a distance (since it has no neighbor). If for a given combination there are
multiple sequences but only 1 unique sequence, (in which case every sequence in this
group is the de facto nearest neighbor to each other, thus giving rise to distances
NAs are returned instead of zero-distances.
subsample: Subsampling is performed independently in each group of sequences
sharing the same
jCallColumn, and junction length. If
is larger than number of sequences in the group, it is ignored. In other words, subsampling
is performed only on groups of sequences of size equal to or greater than
DIST_NEAREST has values calculated using all sequences in the group for groups of size
subsample and values calculated using a subset of sequences for the larger
groups. To select a value of
subsample, it can be useful to explore the group sizes in
- Smith DS, et al. Di- and trinucleotide target preferences of somatic mutagenesis in normal and autoreactive B cells. J Immunol. 1996 156:2642-52.
- Glanville J, Kuo TC, von Budingen H-C, et al. Naive antibody gene-segment frequencies are heritable and unaltered by chronic lymphocyte ablation. Proc Natl Acad Sci USA. 2011 108(50):20066-71.
- Yaari G, et al. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Front Immunol. 2013 4:358.
# Subset example data to one sample as a demo data(ExampleDb, package="alakazam") db <- subset(ExampleDb, SAMPLE == "-1h") # Use genotyped V assignments, Hamming distance, and normalize by junction length dist <- distToNearest(db, vCallColumn="V_CALL_GENOTYPED", model="ham", first=FALSE, normalize="len") # Plot histogram of non-NA distances p1 <- ggplot(data=subset(dist, !is.na(DIST_NEAREST))) + theme_bw() + ggtitle("Distance to nearest: Hamming") + xlab("distance") + geom_histogram(aes(x=DIST_NEAREST), binwidth=0.025, fill="steelblue", color="white") plot(p1)