observedMutations - Calculate observed numbers of mutations
observedMutations calculates the observed number of mutations for each
sequence in the input db.
observedMutations(
  db,
  sequenceColumn = "sequence_alignment",
  germlineColumn = "germline_alignment_d_mask",
  regionDefinition = NULL,
  mutationDefinition = NULL,
  ambiguousMode = c("eitherOr", "and"),
  frequency = FALSE,
  combine = FALSE,
  nproc = 1,
  cloneColumn = "clone_id",
  juncLengthColumn = "junction_length"
)
db: data.frame containing sequence data.
sequenceColumn: character name of the column containing input sequences. IUPAC ambiguous characters for DNA are supported.
germlineColumn: character name of the column containing the germline or reference sequence. IUPAC ambiguous characters for DNA are supported.
regionDefinition: RegionDefinition object defining the regions
and boundaries of the Ig sequences. If NULL, mutations
are counted for the entire sequence. To use region definitions,
germlineColumn must be aligned, following the IMGT schema.
mutationDefinition: MutationDefinition object defining replacement
and silent mutation criteria. If NULL, replacement and silent
mutations are determined by exact amino acid identity.
ambiguousMode: whether to treat ambiguous characters as "eitherOr" or
"and" when determining and counting the type(s) of mutations. Applicable only if
sequenceColumn and/or germlineColumn contain(s) ambiguous characters. One of
c("eitherOr", "and"). Default is "eitherOr".
frequency: logical indicating whether or not to calculate mutation frequencies. Default is FALSE.
combine: logical indicating whether the mutation counts for the different regions (CDR, FWR) and mutation types should be combined for each sequence, returning one count/frequency value per sequence instead of multiple values. Default is FALSE.
nproc: number of cores to distribute the operation over. If the
cluster has already been set, call the function with
nproc = 0 to avoid resetting or reinitializing it. Default is 1.
cloneColumn: clone id column name in db.
juncLengthColumn: junction length column name in db.
data.frame with observed mutation counts for each
sequence listed. The column names are dynamically created based on the
regions in the
regionDefinition. For example, when using the
IMGT_V definition, which defines positions for CDR and
FWR, the following columns are added:
mu_count_cdr_r: number of replacement mutations in CDR1 and CDR2 of the V-segment.
mu_count_cdr_s: number of silent mutations in CDR1 and CDR2 of the V-segment.
mu_count_fwr_r: number of replacement mutations in FWR1, FWR2 and FWR3 of the V-segment.
mu_count_fwr_s: number of silent mutations in FWR1, FWR2 and FWR3 of the V-segment.
If frequency=TRUE, R and S mutation frequencies are
calculated over the number of non-N positions in the specified regions.
mu_freq_cdr_r: frequency of replacement mutations in CDR1 and CDR2 of the V-segment.
mu_freq_cdr_s: frequency of silent mutations in CDR1 and CDR2 of the V-segment.
mu_freq_fwr_r: frequency of replacement mutations in FWR1, FWR2 and FWR3 of the V-segment.
mu_freq_fwr_s: frequency of silent mutations in FWR1, FWR2 and FWR3 of the V-segment.
If frequency=TRUE and combine=TRUE, the mutations and non-N positions
are aggregated and a single
mu_freq value is returned:
mu_freq: frequency of replacement and silent mutations in the specified region
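To make the "frequency over non-N positions" idea concrete, here is a minimal base R sketch. It is not shazam's implementation: the helper name mutation_frequency is hypothetical, it ignores regions and the R/S distinction, and it assumes that a position is skipped whenever either sequence has N, a gap (-) or an IMGT dot (.) there. The actual inclusion criteria are documented in calcObservedMutations.

```r
# Simplified illustration only; NOT how shazam computes its columns.
# mutation_frequency() is a hypothetical helper for this sketch.
mutation_frequency <- function(observed, germline) {
    obs <- strsplit(toupper(observed), "")[[1]]
    germ <- strsplit(toupper(germline), "")[[1]]
    stopifnot(length(obs) == length(germ))

    # Assumption: skip positions where either sequence has N, "-" or "."
    skip <- c("N", "-", ".")
    usable <- !(obs %in% skip) & !(germ %in% skip)

    n_mut <- sum(obs[usable] != germ[usable])  # mismatches at usable positions
    n_pos <- sum(usable)                       # denominator: usable positions

    c(mu_count = n_mut,
      mu_freq = if (n_pos > 0) n_mut / n_pos else NA_real_)
}

# Two N positions are excluded from both the count and the denominator
res <- mutation_frequency("ACGTNNACGA", "ACGTACACGT")
```

In this toy input, eight positions are usable and one of them differs, so the frequency is computed over eight positions rather than ten.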
Mutation counts are determined by comparing a reference sequence to the input sequences in the
column specified by
sequenceColumn. See calcObservedMutations for more technical details,
including criteria for which sequence differences are included in the mutation
counts and which are not.
The mutations are binned as either replacement (R) or silent (S) across the different
regions of the sequences as defined by
regionDefinition. Typically, this would
be the framework (FWR) and complementarity determining (CDR) regions of IMGT-gapped
nucleotide sequences. Mutation counts are appended to the input db as additional columns.
If db includes lineage information, such as the
parent_sequence column created by
makeGraphDf, germlineColumn can be set to that field so that it is used as the reference sequence.
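The default replacement/silent rule (exact amino acid identity) can be illustrated with a small base R sketch. This is only an approximation of the idea, not shazam's code: the names genetic_code and classify_codon_change are hypothetical, the table is the standard genetic code, and only unambiguous A/C/G/T codons are handled.

```r
# Illustrative sketch only; NOT shazam's implementation.
# Build the standard genetic code (codons ordered with bases T, C, A, G).
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)
amino_acids <- strsplit(
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    "")[[1]]
genetic_code <- setNames(amino_acids, codons)

# Hypothetical helper: "s" if the amino acid is unchanged, "r" otherwise,
# NA when the two codons are identical (no mutation).
classify_codon_change <- function(germline_codon, observed_codon) {
    if (germline_codon == observed_codon) {
        return(NA_character_)
    }
    same_aa <- genetic_code[[germline_codon]] == genetic_code[[observed_codon]]
    if (same_aa) "s" else "r"
}

silent_example <- classify_codon_change("GCT", "GCC")       # Ala -> Ala
replacement_example <- classify_codon_change("GCT", "GAT")  # Ala -> Asp
```

A MutationDefinition such as CHARGE_MUTATIONS changes this rule, so that a change counts as a replacement only when the amino acid class (for example, charge) differs.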
# Subset example data
data(ExampleDb, package="alakazam")
db <- ExampleDb[1:10, ]

# Calculate mutation frequency over the entire sequence
db_obs <- observedMutations(db, sequenceColumn="sequence_alignment",
                            germlineColumn="germline_alignment_d_mask",
                            frequency=TRUE,
                            nproc=1)

# Count of V-region mutations split by FWR and CDR
# With mutations only considered replacement if charge changes
db_obs <- observedMutations(db, sequenceColumn="sequence_alignment",
                            germlineColumn="germline_alignment_d_mask",
                            regionDefinition=IMGT_V,
                            mutationDefinition=CHARGE_MUTATIONS,
                            nproc=1)

# Count of VDJ-region mutations, split by FWR and CDR
db_obs <- observedMutations(db, sequenceColumn="sequence_alignment",
                            germlineColumn="germline_alignment_d_mask",
                            regionDefinition=IMGT_VDJ,
                            nproc=1)

# Extend data with lineage information
data(ExampleTrees, package="alakazam")
graph <- ExampleTrees[]
clone <- alakazam::makeChangeoClone(subset(ExampleDb, clone_id == graph$clone))
gdf <- makeGraphDf(graph, clone)

# Count of mutations between observed sequence and immediate ancestor
db_obs <- observedMutations(gdf, sequenceColumn="sequence",
                            germlineColumn="parent_sequence",
                            regionDefinition=IMGT_VDJ,
                            nproc=1)
calcObservedMutations is called by this function to get the number of mutations
in each sequence grouped by the RegionDefinition.
See IMGT_SCHEMES for a set of predefined RegionDefinition objects.
See expectedMutations for calculating expected mutation frequencies.
See makeGraphDf for creating the parent_sequence field.