view 0notes.txt @ 18:e4d75f9efb90 draft

planemo upload commit b'4303231da9e48b2719b4429a29b72421d24310f4\n'-dirty
author nick
date Thu, 02 Feb 2017 18:44:31 -0500

============================
Reverse engineering of mafft
============================
For reference, the source code I'm working with is at ~/bx/src/mafft-7.221-without-extensions.

-----------------
mafft bash script
-----------------
Note: At this point, a faster way to reverse engineer this is probably to run it with debug prints added at various points to check variable values.
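
A possible shortcut for the debug-print idea is bash's built-in -x tracing, which prints every command after variable expansion. This is only a sketch: the function names are made up for these notes, and it assumes the trace lines arrive on stderr with the default "+" prefix.

```python
import os
import subprocess


def trace_script(script_path, args):
    """Run a bash script under `bash -x` and return the trace text from stderr."""
    result = subprocess.run(
        ["bash", "-x", script_path, *args],
        stdout=subprocess.DEVNULL,  # the alignment itself is not needed here
        stderr=subprocess.PIPE,
        text=True,
        check=False,
    )
    return result.stderr


def find_traced_commands(trace_text, program):
    """Return traced command lines whose first word is `program` (any directory)."""
    matches = []
    for line in trace_text.splitlines():
        if not line.startswith("+"):
            continue  # ordinary stderr output, not a trace line
        words = line.lstrip("+ ").split()
        if not words:
            continue
        first_word = words[0].strip("'\"")
        if os.path.basename(first_word) == program:
            matches.append(" ".join(words))
    return matches
```

Something like find_traced_commands(trace_script("scripts/mafft", ["--nuc", "--quiet", "input.fa"]), "disttbfast") should then list the sub-command, where input.fa is a placeholder for a real input file.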

align_families.py executes the mafft command "mafft --nuc --quiet $tempfile".
"mafft" is actually a bash script at "scripts/mafft" in the source. But it's so insane it's essentially obfuscated (it doesn't help that the only comments are in Japanese). It's clear, though, that the bash script decides which executable to run, based on your arguments.

Approaching from another direction, I can see that when I execute the mafft command, it always executes the same exact sub-command (even for different input read lengths):
  disttbfast -q 0 -E 2 -V -1.53 -s 0.0 -W 6 -O -C 0 -D -b 62 -f -1.53 -Q 100.0 -h 0 -F -X 0.1
In the bash script, this invocation is on line 2060. The input comes from stdin, which is fed from $TMPFILE/infile, which resolves to /tmp/mafft.*/infile (lines 826 & 829). This seems to simply be the input file I give to the mafft command, lightly processed: \r is converted to \n (line 849), and a newline is added to the end (line 850). HOWEVER, there are many points where the input file may be processed further before disttbfast gets to it.
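
The two preprocessing steps on the input file can be mimicked like this. It's a sketch of the behavior as described here, not a translation of lines 849-850, and preprocess_input is a name made up for these notes.

```python
def preprocess_input(raw: bytes) -> bytes:
    """Mimic the described preprocessing of the mafft input file."""
    # Step 1 (line 849 in the notes): every \r becomes \n.
    converted = raw.replace(b"\r", b"\n")
    # Step 2 (line 850 in the notes): a newline is added to the end.
    return converted + b"\n"
```

If the conversion really is character-for-character, a Windows-style \r\n ending turns into two newlines, so the infile could contain blank lines.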
The output is piped to $TMPFILE/pre. This appears to be the aligned FASTA.
It is potentially altered by several executables before final output:
  line	command		executed?	condition
  2068	splittbfast	false		  [ $distance = "parttree" ]
  2083	setcore		false		  [ $coreout -eq 1 ]
  2086	restoreu	false		! [ $coreout -eq 1 ] && [ $anysymbol -eq 1 ]
  2181	f2cl		false		  [ "$outputfile" = "" ] && ! [ "$outputopt" = "null" ]
  2187	f2cl		false		! [ "$outputfile" = "" ] && ! [ "$outputopt" = "null" ]
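
The conditions in the table can be written out as a small function, to check which post-processing steps would run for a given set of shell variable values. This is a sketch: it treats each condition as an independent test (the script may nest them differently), and postprocessing_steps is a name made up for these notes.

```python
def postprocessing_steps(distance, coreout, anysymbol, outputfile, outputopt):
    """Return (line, command) pairs for the steps whose condition is true."""
    steps = []
    if distance == "parttree":
        steps.append((2068, "splittbfast"))
    if coreout == 1:
        steps.append((2083, "setcore"))
    if coreout != 1 and anysymbol == 1:
        steps.append((2086, "restoreu"))
    if outputfile == "" and outputopt != "null":
        steps.append((2181, "f2cl"))
    if outputfile != "" and outputopt != "null":
        steps.append((2187, "f2cl"))
    return steps
```

Both f2cl conditions being false at once means $outputopt must be "null" for my command, whatever $outputfile is.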

Here's a dissection of the invocation of disttbfast, correlating variables in line 2060 with their values, as seen in the actual executed command. I resolved all of them by looking at the mafft bash script.
value	variable
	"$prefix/disttbfast"
	-q
0	$npickup
	-E
2	$cycledisttbfast
	-V
-1.53	"-"$gopdist
	-s
0.0	$unalignlevel
 	$legacygapopt
 	$mergearg
	-W
6	$tuplesize
 -O	$termgapopt
 	$outnum
	$addarg
	-C
0	$numthreads
	$memopt
	$weightopt
	$treeinopt
	$treeoutopt
	$distoutopt
 -D	$seqtype
-b 62	$model
	-f
-1.53	"-"$gop
	-Q
100.0	$spfactor
	-h
0	$aof
 -F	$param_fft
	$algopt
 -X 0.1	$treealg
	$scoreoutarg
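
As a sanity check on the table, the command can be rebuilt from the variable values and compared with what was actually executed. This is a sketch: it assumes the variables are expanded unquoted (so empty ones vanish and "-b 62" splits into two words), it leaves out $prefix, and the names VARIABLES, TEMPLATE and build_argv are made up for these notes.

```python
import shlex

# Values read off the table; variables that expanded to nothing are omitted.
VARIABLES = {
    "npickup": "0", "cycledisttbfast": "2", "gopdist": "1.53",
    "unalignlevel": "0.0", "tuplesize": "6", "termgapopt": " -O",
    "numthreads": "0", "seqtype": " -D", "model": "-b 62", "gop": "1.53",
    "spfactor": "100.0", "aof": "0", "param_fft": " -F", "treealg": " -X 0.1",
}

# (literal prefix, variable name or None), in the order used on line 2060.
TEMPLATE = [
    ("disttbfast", None),
    ("-q", None), ("", "npickup"),
    ("-E", None), ("", "cycledisttbfast"),
    ("-V", None), ("-", "gopdist"),
    ("-s", None), ("", "unalignlevel"),
    ("", "legacygapopt"), ("", "mergearg"),
    ("-W", None), ("", "tuplesize"),
    ("", "termgapopt"), ("", "outnum"), ("", "addarg"),
    ("-C", None), ("", "numthreads"),
    ("", "memopt"), ("", "weightopt"), ("", "treeinopt"),
    ("", "treeoutopt"), ("", "distoutopt"),
    ("", "seqtype"), ("", "model"),
    ("-f", None), ("-", "gop"),
    ("-Q", None), ("", "spfactor"),
    ("-h", None), ("", "aof"),
    ("", "param_fft"), ("", "algopt"), ("", "treealg"), ("", "scoreoutarg"),
]


def build_argv(template, variables):
    """Expand each template entry and split it into words, like an unquoted shell expansion."""
    argv = []
    for prefix, name in template:
        word = prefix + (variables.get(name, "") if name else "")
        argv.extend(shlex.split(word))  # an empty expansion contributes no words
    return argv
```

Joining build_argv(TEMPLATE, VARIABLES) with spaces should reproduce the disttbfast command observed earlier.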

--------
Makefile
--------
The commands it executes when you run "make disttbfast" (in a directory lacking only disttbfast and disttbfast.o) are:
$ gcc  -Denablemultithread -O3 -c disttbfast.c
$ gcc -o disttbfast mtxutl.o io.o mltaln9.o tddis.o constants.o partSalignmm.o Lalignmm.o rna.o Salignmm.o Falign.o Falign_localhom.o Galign11.o SAalignmm.o MSalignmm.o disttbfast.o defs.o fft.o fftFunctions.o addfunctions.o  -Denablemultithread -O3  -lm  -lpthread
The first command uses the -c option to stop gcc from doing the linking, producing only disttbfast.o. The second command links it with all its dependencies and produces the final binary.
The long list of .o files is apparently stored in $(OBJDISTTBFAST). In order to create disttbfast.so, it seems you'll have to recompile all its dependencies as .so files.

Update 1:
I looked at Makefile.sos, and used their CFLAGS to try compiling disttbfast.so:
$ gcc -Denablemultithread -O0 -fPIC -pedantic -Wall -std=c99 -g -DMALLOC_CHECK_=3 disttbfast.c -o disttbfast.so mtxutl.o io.o mltaln9.o tddis.o constants.o partSalignmm.o Lalignmm.o rna.o Salignmm.o Falign.o Falign_localhom.o Galign11.o SAalignmm.o MSalignmm.o disttbfast.o defs.o fft.o fftFunctions.o addfunctions.o -lm  -lpthread
It looks like it would've worked, except that many functions (including main()) are defined in multiple source files. Part of the clash may come from the command itself: it compiles disttbfast.c and also links disttbfast.o, and without -shared gcc tries to build an executable. I guess I could edit the source files to remove those definitions, but I doubt that's the right approach, and I don't know how to compile it correctly when functions are defined in both disttbfast.c and its dependencies.

Update 2:
Okay, all I need to do is use the libdisttbfast.so target in their Makefile.sos:
[Sat Mar 19] me@yoga: ~/bx/code/mafft/core
$ make -f Makefile.sos libdisttbfast.so
gcc -shared -o libdisttbfast.so mtxutl.o io.o mltaln9.o tddis.o constants.o partSalignmm.o Lalignmm.o rna.o Salignmm.o Falign.o Falign_localhom.o Galign11.o SAalignmm.o MSalignmm.o disttbfast.o defs.o fft.o fftFunctions.o addfunctions.o  -Denablemultithread -fPIC -O0  -fPIC -pedantic -Wall -std=c99 -g -DMALLOC_CHECK_=3   -lm  -lpthread
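
If the point of the .so is to call it from Python (as align_families.py would suggest), a first test is just whether the library loads. This is a sketch using ctypes from the standard library; load_library is a name made up for these notes, and it doesn't assume anything about which functions the library exports.

```python
import ctypes


def load_library(path):
    """Try to load a shared library; return the handle, or None if loading fails."""
    try:
        return ctypes.CDLL(path)
    except OSError as error:
        # Raised when the file is missing or is not a loadable shared object.
        print(f"could not load {path}: {error}")
        return None
```

Use a path containing a slash, like load_library("./libdisttbfast.so"), so the dynamic loader doesn't search the system library directories instead.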


------------
disttbfast.c
------------
There's a lot that'll need modification:
- Assuming the cwd is the temporary directory.
  - Reading from or writing to filenames (search "fopen", FILE variables)
- Calling external commands like line 2477: system( "cp infile.tree GuideTree" );
- Printing to stderr (I'd like to be able to silence that).
- Calling exit();
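
Two of these (the cwd assumption and the stderr output) could perhaps be worked around from the calling side instead of editing the C code. This is a sketch under that assumption: a context manager that switches into a fresh temporary directory and points file descriptor 2 at /dev/null, so output from C code is silenced too. The name isolated_call_environment is made up for these notes.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def isolated_call_environment(silence_stderr=True):
    """Run the body inside a fresh temp directory, optionally with fd 2 silenced."""
    original_cwd = os.getcwd()
    saved_stderr_fd = None
    with tempfile.TemporaryDirectory(prefix="mafft.") as workdir:
        try:
            os.chdir(workdir)
            if silence_stderr:
                saved_stderr_fd = os.dup(2)  # keep a copy of the real stderr
                devnull_fd = os.open(os.devnull, os.O_WRONLY)
                os.dup2(devnull_fd, 2)  # fd 2 now goes to /dev/null
                os.close(devnull_fd)
            yield workdir
        finally:
            if saved_stderr_fd is not None:
                os.dup2(saved_stderr_fd, 2)  # restore the real stderr
                os.close(saved_stderr_fd)
            os.chdir(original_cwd)  # leave before the directory is deleted
```

This does nothing about exit(): a call to exit() inside the library would still end the whole Python process, so that one needs either a source change or running the call in a child process.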

It looks like the writing of results might happen wherever writeData_pointer() is called.
- Defined on line 2425 of io.c.
- Could just replace with a return statement.