I do not want to bog you down or lecture you, as I have done enough of that already, but here are some suggestions for developing assembly code:
Always write your functions in C first.
Vectorize it in C if possible.
Debug the C. Single step it.
Lock the code. There should be little to no (preferably no) changes later.
If the code is not fast enough, then and only then start your assembly.
Single step the assembly.
Compare the output of the C to the output of the assembly. Run them both! Their outputs must match.
Keep the C code in a safe place.
It never fails. You may be in the middle of optimizing some assembly code and management comes by for a demo. If you still have the C code, you can switch over to it; it runs slower but you are still able to run the demo.
If your code deals with arrays of numbers or a series of like numbers, then orient your C code in groups of four for single-precision floating-point numbers, etc. The idea is 128-bit data: four 32-bit floats fill one 128-bit block. Read Vector Game Math Processors for more information on this topic.
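As a rough illustration of orienting C code in groups of four, here is a minimal sketch. The function name vec_add_f32x4 and its parameters are hypothetical, and the sketch assumes the element count is a multiple of four. It is only one way to lay out such a loop, not a prescribed form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: add two float arrays, four elements per pass.
   Each pass covers 4 x 32-bit floats = 128 bits of data.
   Assumption: count is a multiple of 4. */
void vec_add_f32x4(float *dst, const float *a, const float *b, size_t count)
{
    size_t i;

    assert((count & 3) == 0);   /* groups of four only */

    for (i = 0; i < count; i += 4) {
        /* Four independent adds that line up with one 128-bit operation. */
        dst[i + 0] = a[i + 0] + b[i + 0];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

Written this way, each pass of the loop corresponds to one 128-bit load, add, and store, so the later assembly version can follow the C almost line for line.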
Debug your C code! Every loop, every variable. Make sure it works exactly as you think it does. This is important when you go to benchmark its results with the assembly code.
Locking your function means that it is done. There is (hopefully) absolutely no reason ever to make any more changes to it.
Maybe your optimized C code is fast enough for your needs. It takes time to write assembly code and get it working correctly. Also, you will want to phase in your assembly code. From within your C code, call the assembly function (at least initially during development). This allows you to generate two sets of results and compare them. It also allows you to handle the aligned memory algorithm first before moving on to the misaligned version, etc. One by one, you phase them in. Eventually you vector to the assembly code instead of the C code. You keep the C code as a fallback position, and it gives you a starting point if the function is ever changed in a major way.
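The phase-in described above might be sketched as follows. Every name here (scale_fn, scale_c, scale_asm_standin, vec_scale, scale_outputs_match, select_scale) is hypothetical. In particular, scale_asm_standin is written in C only so that the sketch compiles on its own; in a real project it would be an external routine written in assembly. The sketch also assumes a count that is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*scale_fn)(float *dst, const float *src, float k, size_t count);

/* Locked C reference. It stays in the build as the fallback. */
void scale_c(float *dst, const float *src, float k, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        dst[i] = src[i] * k;
}

/* Stand-in for the assembly routine (assumption: the real one would be
   declared extern and implemented in assembly). Processes groups of four. */
void scale_asm_standin(float *dst, const float *src, float k, size_t count)
{
    size_t i;
    assert((count & 3) == 0);
    for (i = 0; i < count; i += 4) {
        dst[i + 0] = src[i + 0] * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
}

/* The vector: callers go through this pointer, which starts on the C code. */
scale_fn vec_scale = scale_c;

/* Run both versions on the same input and compare the two result sets.
   Returns 1 when the outputs are byte-for-byte identical, 0 otherwise. */
int scale_outputs_match(const float *src, float k, size_t count,
                        float *out_ref, float *out_opt)
{
    scale_c(out_ref, src, k, count);
    scale_asm_standin(out_opt, src, k, count);
    return memcmp(out_ref, out_opt, count * sizeof(float)) == 0;
}

/* Phase in the optimized version, or fall back to C (for that demo). */
void select_scale(int use_asm)
{
    vec_scale = use_asm ? scale_asm_standin : scale_c;
}
```

Once scale_outputs_match reports a match on your test data, select_scale(1) points the vector at the optimized routine, and select_scale(0) puts you back on the C code at any time.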