FFmpeg is a free, open-source command-line tool for processing video and audio. It can trim, crop, concatenate, mux, and transcode almost any media file you throw at it.
It's also a robust foundation for video automation; we use it extensively in our own video editing API. This tutorial uses FFmpeg 5.1.2, but any recent version should work.
Use the following command to join two audio files into one:
$ ffmpeg -i audio1.mp3 -i audio2.mp3 -filter_complex "[0:a][1:a]concat=n=2:v=0:a=1" output.mp3
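The concat filter joins its inputs end to end. Its n option sets the number of segments to join, while v and a set how many video and audio streams each segment contributes (which is also how many the output contains). The defaults are v=1 and a=0, so for audio-only files we specify v=0:a=1: each input supplies no video stream and one audio stream, and the result is a single audio track.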
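To make the input ordering concrete, here is a small bash sketch that prints a concat filter for n segments, with v and a limited to 0 or 1. The helper name concat_filter is hypothetical, not part of FFmpeg; the rule it encodes comes from the concat filter documentation, which lists inputs segment by segment, each in the same order as the outputs (video first, then audio).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a concat filter for n segments, each carrying
# v video streams and a audio streams (this sketch only handles 0 or 1).
concat_filter() {
  local n=$1 v=$2 a=$3 pads="" i
  for ((i = 0; i < n; i++)); do
    # concat consumes all of segment 0's pads (video, then audio),
    # then all of segment 1's pads, and so on.
    if ((v)); then pads+="[${i}:v]"; fi
    if ((a)); then pads+="[${i}:a]"; fi
  done
  printf '%s\n' "${pads}concat=n=${n}:v=${v}:a=${a}"
}

concat_filter 2 0 1   # prints: [0:a][1:a]concat=n=2:v=0:a=1
concat_filter 2 1 1   # prints: [0:v][0:a][1:v][1:a]concat=n=2:v=1:a=1
```

The first string is exactly the audio-only filter used above. The second is the shape you would need for video clips that carry sound; keep in mind that concat expects matching video parameters such as resolution across segments.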
To concatenate more than two audio files, add each extra file as an input, list its audio pad in the filter, and raise n to match. For example, with a third file:
$ ffmpeg -i audio1.mp3 -i audio2.mp3 -i audio3.mp3 -filter_complex "[0:a][1:a][2:a]concat=n=3:v=0:a=1" output.mp3
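Writing the pad list by hand gets tedious as the file count grows. Below is a minimal bash sketch that builds the same command for any number of inputs; the function name concat_audio and the DRY_RUN variable are hypothetical conveniences, while the ffmpeg options are the ones used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: concatenate any number of audio files with the concat filter.
# Usage: concat_audio OUTPUT INPUT1 INPUT2 [INPUT...]
# With DRY_RUN set, print the ffmpeg arguments (one per line) instead of running ffmpeg.
concat_audio() {
  if (($# < 3)); then
    echo "usage: concat_audio OUTPUT INPUT1 INPUT2 [INPUT...]" >&2
    return 1
  fi
  local out=$1
  shift
  local -a args=()
  local filter="" i=0 f
  for f in "$@"; do
    args+=(-i "$f")      # one -i option per input file
    filter+="[${i}:a]"   # audio pad of input number i
    i=$((i + 1))
  done
  filter+="concat=n=${i}:v=0:a=1"
  args+=(-filter_complex "$filter" "$out")
  if [[ -n ${DRY_RUN:-} ]]; then
    printf '%s\n' ffmpeg "${args[@]}"
  else
    ffmpeg "${args[@]}"
  fi
}

# Preview the three-file command without running ffmpeg.
DRY_RUN=1 concat_audio output.mp3 audio1.mp3 audio2.mp3 audio3.mp3
```

Collecting the arguments in a bash array, rather than pasting them into one string, keeps file names that contain spaces intact as single arguments.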
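Crossfading fades out the end of one clip while fading in the start of the next, so the two overlap instead of cutting abruptly. The acrossfade filter does this; here d=5 sets a 5-second crossfade, and c1 and c2 set the fade-out and fade-in curves (tri is a linear slope and the default):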
$ ffmpeg -i audio1.mp3 -i audio2.mp3 -filter_complex "acrossfade=d=5:c1=tri:c2=tri" output.mp3
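acrossfade takes exactly two inputs, so crossfading three or more files means chaining it: the result of each crossfade becomes the first input of the next. The bash sketch below prints such a chain; the function name crossfade_filter and the intermediate labels [x1], [x2], ... are hypothetical choices, not anything FFmpeg requires.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a filtergraph that crossfades n audio inputs in order
# by chaining two-input acrossfade filters. d is the crossfade length in seconds.
crossfade_filter() {
  local n=$1 d=$2 graph="" prev="[0:a]" i
  if ((n < 2)); then
    echo "crossfade_filter: need at least 2 inputs" >&2
    return 1
  fi
  for ((i = 1; i < n; i++)); do
    # Crossfade the running result (input 0 on the first pass) with input i.
    graph+="${prev}[${i}:a]acrossfade=d=${d}:c1=tri:c2=tri"
    if ((i < n - 1)); then
      # Label the intermediate result so the next acrossfade can consume it.
      graph+="[x${i}];"
      prev="[x${i}]"
    fi
  done
  printf '%s\n' "$graph"
}

crossfade_filter 3 5
# prints: [0:a][1:a]acrossfade=d=5:c1=tri:c2=tri[x1];[x1][2:a]acrossfade=d=5:c1=tri:c2=tri
```

You could then pass the result straight to FFmpeg:
$ ffmpeg -i audio1.mp3 -i audio2.mp3 -i audio3.mp3 -filter_complex "$(crossfade_filter 3 5)" output.mp3
Because each crossfade overlaps the clips by d seconds, the output is shorter than the inputs combined: three 30-second clips with d=5 come out at roughly 90 - 2 × 5 = 80 seconds.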