collapseClones - Constructs effective clonal sequences for all clones
Description¶
collapseClones
creates effective input and germline sequences for each clonal
group and appends columns containing the consensus sequences to the input
data.frame
.
Usage¶
collapseClones(
db,
cloneColumn = "clone_id",
sequenceColumn = "sequence_alignment",
germlineColumn = "germline_alignment_d_mask",
muFreqColumn = NULL,
regionDefinition = NULL,
method = c("mostCommon", "thresholdedFreq", "catchAll", "mostMutated", "leastMutated"),
minimumFrequency = NULL,
includeAmbiguous = FALSE,
breakTiesStochastic = FALSE,
breakTiesByColumns = NULL,
expandedDb = FALSE,
nproc = 1,
juncLengthColumn = "junction_length",
fields = NULL
)
Arguments¶
- db
data.frame
containing sequence data. Required.- cloneColumn
character
name of the column containing clonal identifiers. Required.- sequenceColumn
character
name of the column containing input sequences. Required. The length of each input sequence should match that of its corresponding germline sequence.- germlineColumn
character
name of the column containing germline sequences. Required. The length of each germline sequence should match that of its corresponding input sequence.- muFreqColumn
character
name of the column containing mutation frequency. Optional. Applicable to the"mostMutated"
and"leastMutated"
methods. If not supplied, mutation frequency is computed by callingobservedMutations
. Default isNULL
. See Cautions for note on usage.- regionDefinition
- RegionDefinition object defining the regions
and boundaries of the Ig sequences. Optional. Default is
NULL
. - method
- method for calculating input consensus sequence. Required.
One of
"thresholdedFreq"
,"mostCommon"
,"catchAll"
,"mostMutated"
, or"leastMutated"
. See “Methods” for details. - minimumFrequency
- frequency threshold for calculating input consensus sequence.
Applicable to and required for the
"thresholdedFreq"
method. A canonical choice is 0.6. Default isNULL
. - includeAmbiguous
- whether to use ambiguous characters to represent positions
at which there are multiple characters with frequencies that
are at least
minimumFrequency
or that are maximal (i.e. ties). Applicable to and required for the"thresholdedFreq"
and"mostCommon"
methods. Default isFALSE
. See “Choosing ambiguous characters” for rules on choosing ambiguous characters. - breakTiesStochastic
- In case of ties, whether to randomly pick a sequence from
sequences that fulfill the criteria as consensus. Applicable
to and required for all methods except for
"catchAll"
. Default isFALSE
. See “Methods” for details. - breakTiesByColumns
- A list of the form
list(c(col_1, col_2, ...), c(fun_1, fun_2, ...))
, wherecol_i
is acharacter
name of a column indb
, andfun_i
is a function to be applied on that column. Currently, onlymax
andmin
are supported. Note that the twoc()
’s inlist()
are essential (i.e. if there is only 1 column, the list should be of the formlist(c(col_1), c(func_1))
. Applicable to and optional for the"mostMutated"
and"leastMutated"
methods. If supplied,fun_i
’s are applied oncol_i
’s to help break ties. Default isNULL
. See “Methods” for details. - expandedDb
logical
indicating whether or not to return the expandeddb
, containing all the sequences (as opposed to returning just one sequence per clone).- nproc
- Number of cores to distribute the operation over. If the
cluster
has already been set earlier, then pass thecluster
. This will ensure that it is not reset. - juncLengthColumn
character
name of the column containing the junction length. Needed whenregionDefinition
includes CDR3 and FWR4.- fields
- additional fields used for grouping. Use sample_id, to avoid combining sequences with the same clone_id that belong to different sample_id.
Value¶
A modified db
with the following additional columns:
clonal_sequence
: effective sequence for the clone.clonal_germline
: germline sequence for the clone.clonal_sequence_mufreq
: mutation frequency ofclonal_sequence
; only added for the"mostMutated"
and"leastMutated"
methods.
clonal_sequence
is generated with the method of choice indicated
by method
, and clonal_germline
is generated with the
"mostCommon"
method, along with, where applicable, user-defined
parameters such as minimumFrequency
, includeAmbiguous
,
breakTiesStochastic
, and breakTiesByColumns
.
Consensus lengths¶
For each clone, clonal_sequence
and
clonal_germline
have the same length.
- For the
"thresholdedFreq"
,"mostCommon"
, and"catchAll"
methods:
The length of the consensus sequences is determined by the longest possible
consensus sequence (baesd on inputSeq
and germlineSeq
) and
regionDefinition@seqLength
(if supplied), whichever is shorter.
Given a set of sequences of potentially varying lengths, the longest possible
length of their consensus sequence is taken to be the longest length along
which there is information contained at every nucleotide position across
majority of the sequences. Majority is defined to be greater than
floor(n/2)
, where n
is the number of sequences. If the longest
possible consensus length is 0, there will be a warning and an empty string
(""
) will be returned.
If a length limit is defined by supplying a regionDefinition
via
regionDefinition@seqLength
, the consensus length will be further
restricted to the shorter of the longest possible length and
regionDefinition@seqLength
.
- For the
"mostMutated"
and"leastMutated"
methods:
The length of the consensus sequences depends on that of the most/least
mutated input sequence, and, if supplied, the length limit defined by
regionDefinition@seqLength
, whichever is shorter. If the germline
consensus computed using the "mostCommon"
method is longer than
the most/least mutated input sequence, the germline consensus is trimmed
to be of the same length as the input consensus.
Methods¶
The descriptions below use “sequences” as a generalization of input sequences and germline sequences.
method="thresholdedFreq"
A threshold must be supplied to the argument minimumFrequency
. At
each position along the length of the consensus sequence, the frequency
of each nucleotide/character across sequences is tabulated. The
nucleotide/character whose frequency is at least (i.e. >=
)
minimumFrequency
becomes the consensus; if there is none, the
consensus nucleotide will be "N"
.
When there are ties (frequencies of multiple nucleotides/characters
are at least minimumFrequency
), this method can be deterministic
or stochastic, depending on additional parameters.
- With
includeAmbiguous=TRUE
, ties are resolved deterministically by representing ties using ambiguous characters. See “Choosing ambiguous characters” for how ambiguous characters are chosen. - With
breakTiesStochastic=TRUE
, ties are resolved stochastically by randomly picking a character amongst the ties. - When both
TRUE
,includeAmbiguous
takes precedence overbreakTiesStochastic
. - When both
FALSE
, the first character from the ties is taken to be the consensus following the order of"A"
,"T"
,"G"
,"C"
,"N"
,"."
, and"-"
.
Below are some examples looking at a single position based on 5
sequences with minimumFrequency=0.6
,
includeAmbiguous=FALSE
, and breakTiesStochastic=FALSE
:
- If the sequences have
"A"
,"A"
,"A"
,"T"
,"C"
, the consensus will be"A"
, because"A"
has frequency 0.6, which is at leastminimumFrequency
. -
If the sequences have
"A"
,"A"
,"T"
,"T"
,"C"
, the consensus will be"N"
, because none of"A"
,"T"
, or"C"
has frequency that is at leastminimumFrequency
. -
method="mostCommon"
The most frequent nucleotide/character across sequences at each position along the length of the consensus sequence makes up the consensus.
When there are ties (multiple nucleotides/characters with equally
maximal frequencies), this method can be deterministic or stochastic,
depending on additional parameters. The same rules for breaking ties
for method="thresholdedFreq"
apply.
Below are some examples looking at a single position based on 5
sequences with includeAmbiguous=FALSE
, and
breakTiesStochastic=FALSE
:
- If the sequences have
"A"
,"A"
,"T"
,"A"
,"C"
, the consensus will be"A"
. -
If the sequences have
"T"
,"T"
,"C"
,"C"
,"G"
, the consensus will be"T"
, because"T"
is before"C"
in the order of"A"
,"T"
,"G"
,"C"
,"N"
,"."
, and"-"
. -
method="catchAll"
This method returns a consensus sequence capturing most of the information contained in the sequences. Ambiguous characters are used where applicable. See “Choosing ambiguous characters” for how ambiguous characters are chosen. This method is deterministic and does not involve breaking ties.
Below are some examples for method="catchAll"
looking at a
single position based on 5 sequences:
- If the sequences have
"N"
,"N"
,"N"
,"N"
,"N"
, the consensus will be"N"
. - If the sequences have
"N"
,"A"
,"A"
,"A"
,"A"
, the consensus will be"A"
. - If the sequences have
"N"
,"A"
,"G"
,"A"
,"A"
, the consensus will be"R"
. - If the sequences have
"-"
,"-"
,"."
,"."
,"."
, the consensus will be"-"
. - If the sequences have
"-"
,"-"
,"-"
,"-"
,"-"
, the consensus will be"-"
. -
If the sequences have
"."
,"."
,"."
,"."
,"."
, the consensus will be"."
. -
method="mostMutated"
andmethod="leastMutated"
These methods return the most/least mutated sequence as the consensus sequence.
When there are ties (multple sequences have the maximal/minimal mutation frequency), this method can be deterministic or stochastic, depending on additional parameters.
- With
breakTiesStochastic=TRUE
, ties are resolved stochastically by randomly picking a sequence out of sequences with the maximal/minimal mutation frequency. - When
breakTiesByColumns
is supplied, ties are resolved deterministically. Column by column, a function is applied on the column and sequences with column value matching the functional value are retained, until ties are resolved or columns run out. In the latter case, the first remaining sequence is taken as the consensus. - When
breakTiesStochastic=TRUE
andbreakTiesByColumns
is also supplied,breakTiesStochastic
takes precedence overbreakTiesByColumns
. - When
breakTiesStochastic=FALSE
andbreakTiesByColumns
is not supplied (i.e.NULL
), the sequence that appears first amongst the ties is taken as the consensus.
Choosing ambiguous characters¶
Ambiguous characters may be present in the returned consensuses when using the
"catchAll"
method and when using the "thresholdedFreq"
or
"mostCommon"
methods with includeAmbiguous=TRUE
.
The rules on choosing ambiguous characters are as follows:
- If a position contains only
"N"
across sequences, the consensus at that position is"N"
. - If a position contains one or more of
"A"
,"T"
,"G"
, or"C"
, the consensus will be an IUPAC character representing all of the characters present, regardless of whether"N"
,"-"
, or"."
is present. - If a position contains only
"-"
and"."
across sequences, the consensus at that position is taken to be"-"
. - If a position contains only one of
"-"
or"."
across sequences, the consensus at that position is taken to be the character present.
Cautions¶
- Note that this function does not perform multiple sequence alignment.
As a prerequisite, it is assumed that the sequences in
sequenceColumn
andgermlineColumn
have been aligned somehow. In the case of immunoglobulin repertoire analysis, this usually means that the sequences are IMGT-gapped. - When using the
"mostMutated"
and"leastMutated"
methods, if you supply bothmuFreqColumn
andregionDefinition
, it is your responsibility to ensure that the mutation frequency inmuFreqColumn
was calculated with sequence lengths restricted to the sameregionDefinition
you are supplying. Otherwise, the “most/least mutated” sequence you obtain might not be the most/least mutated given theregionDefinition
supplied, because your mutation frequency was based on aregionDefinition
different from the one supplied. - If you intend to run
collapseClones
before building a 5-mer targeting model, you must choose parameters such that your collapsed clonal consensuses do not include ambiguous characters. This is because the targeting model functions do NOT support ambiguous characters in their inputs.
Examples¶
# Subset example data
data(ExampleDb, package="alakazam")
db <- subset(ExampleDb, c_call %in% c("IGHA", "IGHG") & sample_id == "+7d" &
clone_id %in% c("3100", "3141", "3184"))
# thresholdedFreq method, resolving ties deterministically without using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="thresholdedFreq", minimumFrequency=0.6,
includeAmbiguous=FALSE, breakTiesStochastic=FALSE)
# mostCommon method, resolving ties deterministically using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostCommon",
includeAmbiguous=TRUE, breakTiesStochastic=FALSE)
# Make a copy of db that has a mutation frequency column
db2 <- observedMutations(db, frequency=TRUE, combine=TRUE)
# mostMutated method, resolving ties stochastically
clones <- collapseClones(db2, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostMutated", muFreqColumn="mu_freq",
breakTiesStochastic=TRUE, breakTiesByColumns=NULL)
# mostMutated method, resolving ties deterministically using additional columns
clones <- collapseClones(db2, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostMutated", muFreqColumn="mu_freq",
breakTiesStochastic=FALSE,
breakTiesByColumns=list(c("duplicate_count"), c(max)))
# Build consensus for V segment only
# Capture all nucleotide variations using ambiguous characters
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="catchAll", regionDefinition=IMGT_V)
# Return the same number of rows as the input
clones <- collapseClones(db, cloneColumn="clone_id", sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
method="mostCommon", expandedDb=TRUE)
See also¶
See IMGT_SCHEMES for a set of predefined RegionDefinition objects.