I am working on a tandem repeat project and want to define a complex repeated motif: it contains insertions/deletions (indels) and substitutions, although most bases are conserved. The motif length varies between 22 and 28 bp.
I need to define it as an HMM motif because the tool I am using accepts only exact motifs or HMM motifs. Exact matching does not work for every sample because of the motif's complexity.
I am looking for an appropriate way to model this motif given its variability. Since it includes both indels and substitutions, I want a probabilistic model that captures these variations while still recognizing the overall conserved structure.
I first used GLAM2 from the MEME suite, which gave me a position probability matrix (PPM) for this motif. Can this PPM be used to define the HMM motif, or does it lack key information such as transition probabilities?
glam2 -a 11 -r 50 n motifs_vntr_20p.fasta -o glam2_motif_20p
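To make the PPM question concrete, here is a minimal Python sketch of what I think a PPM can and cannot supply for a profile HMM. The matrix values, the function name `ppm_to_profile`, and the transition values `p_insert` and `p_delete` are all hypothetical placeholders (not GLAM2 output and not any tool's file format). The point is only that the match-state emissions can be read off the PPM, whereas the transition probabilities have to come from somewhere else.

```python
# Hypothetical 3-column PPM (rows = motif positions, columns = A, C, G, T).
# These numbers are made up for illustration; they are not GLAM2 output.
ALPHABET = "ACGT"
ppm = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]


def ppm_to_profile(ppm, p_insert=0.01, p_delete=0.01):
    """Turn a PPM into per-position match emissions plus ASSUMED transitions.

    The PPM only supplies the emission probabilities. p_insert and p_delete
    are placeholder values chosen by hand, because the PPM by itself says
    nothing about how often insertions or deletions occur.
    """
    profile = []
    for row in ppm:
        total = sum(row)
        # Renormalise each row so the emissions sum to 1.
        emissions = {base: p / total for base, p in zip(ALPHABET, row)}
        transitions = {
            "match->match": 1.0 - p_insert - p_delete,
            "match->insert": p_insert,
            "match->delete": p_delete,
        }
        profile.append({"emissions": emissions, "transitions": transitions})
    return profile


profile = ppm_to_profile(ppm)
for position, state in enumerate(profile, start=1):
    best = max(state["emissions"], key=state["emissions"].get)
    print(position, best, state["transitions"])
```

In this sketch every position gets the same hand-picked transitions, which is exactly the information I am worried is missing when starting from a PPM alone.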
Another approach I tried was multiple sequence alignment with MAFFT. I created a FASTA file in which each sequence corresponds to one repeat of the motif (3,713 sequences in total). Then I used hmmbuild from HMMER to build a profile HMM from the MAFFT alignment. However, I am unsure whether this approach is reliable for modeling such a complex motif.
mafft --maxiterate 1000 --globalpair motifs_vntr_20p.fasta > mafft_vntr_20p.fasta
hmmbuild motifs_vntr_20p.hmm mafft_vntr_20p.fasta
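For context, my understanding is that hmmbuild by default treats an alignment column as a match (consensus) column when at least half of the sequences have a residue there (the `--symfrac` option, default 0.5) and treats the remaining columns as insertions. Below is a rough Python sketch of that rule, which could be used to check how many match columns an alignment would yield. The toy alignment and the function name `match_columns` are made up for illustration, and the sketch ignores the sequence weighting that hmmbuild applies, so it is only an approximation.

```python
def match_columns(aligned, symfrac=0.5):
    """Return indices of columns whose residue fraction is >= symfrac.

    Simplified version of the consensus-column rule: every sequence counts
    equally (no sequence weighting), and '-' and '.' are treated as gaps.
    """
    n_seqs = len(aligned)
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    cols = []
    for j in range(length):
        residues = sum(1 for seq in aligned if seq[j] not in "-.")
        if residues / n_seqs >= symfrac:
            cols.append(j)
    return cols


# Toy alignment (hypothetical): column 3 is occupied in only 1 of 4 sequences.
toy = [
    "ACG-TA",
    "ACGGTA",
    "AC--TA",
    "ACG-TT",
]
cols = match_columns(toy)
print(cols)  # -> [0, 1, 2, 4, 5]
```

For a motif of 22 to 28 bp, I would expect the number of match columns to fall roughly in that range; a much larger or smaller number would suggest the alignment is too gappy for the default rule.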
Do you have any suggestions for modeling this motif better? Are there other tools that would be more suitable?
Thanks a lot!