![]() |
OpenMS
|
An Aho Corasick trie (a set of nodes with suffix links mainly) More...
#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h>
Public Member Functions | |
ACTrie (uint32_t max_aaa=0, uint32_t max_mm=0) | |
Default C'tor which just creates a root node. More... | |
~ACTrie () | |
D'tor. More... | |
void | addNeedle (const std::string &needle) |
void | addNeedles (const std::vector< std::string > &needles) |
void | addNeedlesAndCompress (const std::vector< std::string > &needles) |
size_t | size () const |
void | compressTrie () |
Traverses the trie in BFS order and makes it more compact and efficient to traverse. More... | |
size_t | getNeedleCount () const |
How many needles were added to the trie? More... | |
void | setMaxAAACount (const uint32_t max_aaa) |
uint32_t | getMaxAAACount () const |
Maximum number of ambiguous amino acids allowed during search. More... | |
void | setMaxMMCount (const uint32_t max_mm) |
uint32_t | getMaxMMCount () const |
Maximum number of mismatches allowed during search. More... | |
bool | nextHits (ACTrieState &state) const |
void | getAllHits (ACTrieState &state) const |
Private Member Functions | |
bool | nextHitsNoClear_ (ACTrieState &state) const |
Index | add_ (const Index from, const AA edge) |
bool | addHits_ (Index i, const size_t text_pos, std::vector< Hit > &hits) const |
Add all hits occurring in node i (including all its suffix hits) More... | |
bool | addHitsScout_ (Index i, const ACScout &scout, const size_t text_pos, std::vector< Hit > &hits, const int current_scout_depths) const |
same as addHits_, but only follows the suffix chain until the scout loses its prefix More... | |
Index | follow_ (const Index i, const AA edge) const |
bool | followScout_ (ACScout &scout, const AA edge, ACTrieState &state) const |
Index | stepPrimary_ (const Index i, const AA edge, ACTrieState &state) const |
bool | stepScout_ (ACScout &scout, ACTrieState &state) const |
void | createScouts_ (const Index i, const AA fromAA, const AA toAA, ACTrieState &state, const uint32_t aaa_left, const uint32_t mm_left) const |
void | createSubScouts_ (const ACScout &prototype, const AA fromAA, const AA toAA, ACTrieState &state) const |
Create scouts from a scout with an AAA or MM, using prototype as template, following edges in range fromAA to toAA . More... | |
void | createMMScouts_ (const Index i, const AA except_fromAA, const AA except_toAA, const AA except_edge, ACTrieState &state, const uint32_t aaa_left, const uint32_t mm_left) const |
Same as createScouts_, but instantiate all possible AA's except for those in the range from except_fromAA to except_toAA and the except_edge itself. More... | |
void | createMMSubScouts_ (const ACScout &prototype, const AA except_fromAA, const AA except_toAA, const AA except_edge, ACTrieState &state) const |
Same as createSubScouts_, but instantiate all possible AA's except for those in the range from except_fromAA to except_toAA and the except_edge itself. More... | |
Index | findChildNaive_ (Index parent, AA child_label) |
During needle addition (naive trie), obtain the child with edge child_label from parent ; if it does not exist, an invalid Index is returned. More... | |
Index | findChildBFS_ (const Index parent, const AA child_label) const |
After compression (BFS trie), obtain the child with edge child_label from parent ; if it does not exist, an invalid Index is returned. More... | |
Private Attributes | |
std::vector< ACNode > | trie_ |
the trie, in either naive structure or BFS order (after compressTrie) More... | |
uint32_t | needle_count_ {0} |
total number of needles in the trie More... | |
uint32_t | max_aaa_ {0} |
maximum number of ambAAs allowed More... | |
uint32_t | max_mm_ {0} |
maximum number of mismatches allowed More... | |
std::vector< std::vector< uint32_t > > | vec_index2needles_ = { {} } |
maps a node to which needles end there (valid for both naive and BFS tree) More... | |
std::vector< std::vector< Index > > | vec_index2children_naive_ = { {} } |
maps the child nodes of a node for the naive tree; only needed during naive trie construction; storing children in the BFS trie modeled in the ACNodes directly More... | |
An Aho Corasick trie (a set of nodes with suffix links mainly)
ACTrie | ( | uint32_t | max_aaa = 0 , |
uint32_t | max_mm = 0 |
||
) |
Default C'tor which just creates a root node.
max_aaa | Maximum number of ambiguous amino acids (B,J,Z,X) allowed in a hit |
max_mm | Maximum number of mismatched amino acids allowed in a hit |
~ACTrie | ( | ) |
D'tor.
Insert a new child node into the trie (unless already present) when starting at parent node from
and following the edge labeled edge
. Return the index of the (new) child node. Note: This operates on the naive trie, not the BFS.
Add all hits occurring in node i
(including all its suffix hits)
i | The ACNode where a needle ends (also all its suffices are checked) | |
text_pos | current position in query (i.e. end of matched hit) | |
[out] | hits | Result vector which will be expanded with hits (if any) |
|
private |
same as addHits_, but only follows the suffix chain until the scout loses its prefix
void addNeedle | ( | const std::string & | needle | ) |
Add a needle to build up the trie. Call compressTrie() after the last needle was added before searching
Exception::InvalidValue | if needle contains an invalid amino acid (such as '*') |
void addNeedles | ( | const std::vector< std::string > & | needles | ) |
Convenience function; adds needles to build up the trie. Call compressTrie() after the last needle was added before searching
Exception::InvalidValue | if needles contains an invalid amino acid (such as '*'); you can use getNeedleCount() to trace which needle did cause the exception |
void addNeedlesAndCompress | ( | const std::vector< std::string > & | needles | ) |
Convenience function which adds needles and immediately compresses the trie (i.e. no more needles can be added)
Exception::InvalidValue | if needles contains an invalid amino acid (such as '*'); you can use getNeedleCount() to trace which needle did cause the exception |
void compressTrie | ( | ) |
Traverses the trie in BFS order and makes it more compact and efficient to traverse.
Also creates the suffix links.
Call this function after adding all needles, and before searching any queries.
|
private |
Same as createScouts_, but instantiate all possible AA's except for those in the range from except_fromAA
to except_toAA
and the except_edge
itself.
|
private |
Same as createSubScouts_, but instantiate all possible AA's except for those in the range from except_fromAA
to except_toAA
and the except_edge
itself.
|
private |
Create scouts from an AAA or MM, starting at trie node i
, following edges in range fromAA
to toAA
The number of AAA's/MM's left for the scout must be passed (these numbers already reflect the original edge label)
|
private |
Create scouts from a scout with an AAA or MM, using prototype
as template, following edges in range fromAA
to toAA
.
After compression (BFS trie), obtain the child with edge child_label
from parent
; if it does not exist, an invalid Index is returned.
During needle addition (naive trie), obtain the child with edge child_label
from parent
; if it does not exist, an invalid Index is returned.
Starting at node i
, find the child with label edge
This requires an exact match, e.g. if edge
is an 'X' it will only match to 'X' in the trie (=needles) If no child exists, follow the suffix link and try again (until root is reached) Note: This operates on the BFS trie (after compressTrie()), not the naive one.
|
private |
Advances scout
by consuming edge
; same as follow_(), just for a scout Returns true if scout is still alive, false otherwise
void getAllHits | ( | ACTrieState & | state | ) | const |
Collects all hits from the current position in the query until the end of the query I.e. similar to while(next(state)) merge(hits_all, state.hits);
uint32_t getMaxAAACount | ( | ) | const |
Maximum number of ambiguous amino acids allowed during search.
uint32_t getMaxMMCount | ( | ) | const |
Maximum number of mismatches allowed during search.
size_t getNeedleCount | ( | ) | const |
How many needles were added to the trie?
bool nextHits | ( | ACTrieState & | state | ) | const |
Resume search at the last position in the query and node in the trie. If a node (or any suffices) are a hit, then state.hits
is cleared & filled and true is returned. If the query ends and there is no hit, false is returned.
|
private |
Resume search at the last position in the query and node in the trie. If a node (or any suffices) are a hit, then state.hits
is NOT cleared, but filled and true is returned. If the query ends and all scouts are processed, false is returned (but hits might still have changed)
void setMaxAAACount | ( | const uint32_t | max_aaa | ) |
Set maximum number of ambiguous amino acids allowed during search. This must not be called in the middle of a search. Otherwise search results will be mixed.
void setMaxMMCount | ( | const uint32_t | max_mm | ) |
Set maximum number of mismatches allowed during search. This must not be called in the middle of a search. Otherwise search results will be mixed.
|
inline |
|
private |
Same as follow_, but considers trying mismatches and AAA's if possible (by adding scouts to state
)
edge
|
private |
Same as follow_, but considers trying mismatches and AAA's if possible (by adding scouts to state
)
|
private |
maximum number of ambAAs allowed
|
private |
maximum number of mismatches allowed
|
private |
total number of needles in the trie
|
private |
the trie, in either naive structure or BFS order (after compressTrie)
|
private |
maps the child nodes of a node for the naive tree; only needed during naive trie construction; storing children in the BFS trie modeled in the ACNodes directly
|
private |
maps a node to which needles end there (valid for both naive and BFS tree)