Family photo and video memories are pretty priceless, and the only effort I had put into saving copies was copying them from SD cards to an old portable hard drive. While that was a good start, these hard drives can fail, and they're pretty susceptible to damage from kiddos and house fires. So, to the cloud with them!
My wife and I had looked into storing our memories on some popular options (Amazon Photos, Google Photos, Dropbox), but I didn't want to pay a premium for storing videos and photos that, optimistically, we'll never need to access. As a programmer, I figured I could come up with a solution using a lower-level storage service that would be cheaper because it skips the bells and whistles. Also, there's no reason our backup copies need to be full resolution, so my script or program should lower the resolution of the videos and photos when possible.
I decided to write a program in Rust that leverages ffmpeg and some other libraries to compress videos and photos, and I chose S3 as the storage solution because it has an ULTRA cheap storage class called 'S3 Glacier Deep Archive'. As of this writing, storing data in S3 Glacier Deep Archive costs $0.00099 per GB per month. That's right: storing 1 TB of data for a year will only cost you ~$12.
Deepfrei
Well, every program needs a name, right? I decided to combine "Deep" from Glacier Deep Archive with my last name, Frei. I apologize for how terrible that name is, but that's what I'm going with.
Here's a high level look at the flow of this application:
- Take a source directory and a target S3 location as input.
- Evaluate all the files in the source directory, detecting which ones are already uploaded.
- Split the work into two tasks that run on separate parallel threads:
  a. Video & photo resizing
  b. Video & photo uploading
- Collect all events from the threads and print out some very basic metrics.
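To make that concrete, here's a rough sketch of how I picture the orchestration hanging together. The run wrapper and the run_processing, run_uploading, print_metrics, and take_events names are my own stand-ins for the steps above (each sketched further below), not Deepfrei's actual function names; FileToUpload and Logger are introduced in the coming sections.

use std::sync::{Arc, Mutex};
use std::thread;

// Orchestration sketch: both worker threads share the to-do list and the
// logger; the main thread waits for them and then prints the metrics.
fn run(files_to_upload: Vec<FileToUpload>, logger: Logger) {
    let files = Arc::new(Mutex::new(files_to_upload));

    let processing = {
        let (files, logger) = (Arc::clone(&files), logger.clone());
        thread::spawn(move || run_processing(files, logger))
    };
    let uploading = {
        let (files, logger) = (Arc::clone(&files), logger.clone());
        thread::spawn(move || run_uploading(files, logger))
    };

    // Step 4: wait for both threads, then report on what happened.
    processing.join().unwrap().expect("processing failed");
    uploading.join().unwrap();
    print_metrics(&logger.take_events());
}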
Now, a bit more in depth (skipping step one above, because it's self-explanatory).
Step 2: Scoping out the work
Deepfrei needs to know what work it has to do. It digs through all the files, compares them with what's already in S3, and builds a "to-do list", more or less. The "to-do list" is actually a Vec<FileToUpload>, where FileToUpload looks like:
struct FileToUpload {
    input_path: PathBuf,
    target_path: PathBuf,
    uploaded: bool,
    process_state: Option<processing::ProcessState>,
}
The first three fields seem obvious to me, but the fourth one is more interesting. We'll get to ProcessState later, but from its name you might guess it holds information about processing the file. Notice that it's wrapped in an Option above, which means a file might or might not need processing. If I had a PDF in my source directory, for instance, it wouldn't need processing.
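To give a feel for that scoping pass, here's a minimal sketch of how the to-do list could be built. It isn't Deepfrei's actual code: the walk only covers a single directory level, already_uploaded stands in for a set of keys listed from the S3 bucket beforehand, and working out the ProcessState is left for later.

use std::collections::HashSet;
use std::fs;
use std::path::{Path, PathBuf};

// Build the to-do list for a single directory level. `already_uploaded` is
// assumed to hold the keys previously listed from the S3 bucket.
fn scan_source(
    source_dir: &Path,
    target_prefix: &Path,
    already_uploaded: &HashSet<PathBuf>,
) -> std::io::Result<Vec<FileToUpload>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(source_dir)? {
        let input_path = entry?.path();
        if !input_path.is_file() {
            continue; // the real version would recurse into subdirectories
        }
        let target_path = target_prefix.join(input_path.file_name().unwrap());
        let uploaded = already_uploaded.contains(&target_path);
        files.push(FileToUpload {
            input_path,
            target_path,
            uploaded,
            // Determined later from the file extension (see step 3a); stays
            // None for files like PDFs that get uploaded as-is.
            process_state: None,
        });
    }
    Ok(files)
}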
Step 3a: Video & Photo Resizing
This is one of the two threads that run for the majority of Deepfrei's lifetime. Its job is to look at the files_to_upload to-do list and pick the next entry that satisfies the condition: it hasn't been uploaded, it can be processed, and it hasn't been processed yet. The last of those conditions is recorded in ProcessState:
pub struct ProcessState {
    pub is_processing_complete: bool,
    pub process_file_path: PathBuf,
    process_type: ProcessFileType,
    in_progress_file_path: PathBuf,
}
It's got a couple of public fields and a couple of private fields. The public fields are exposed to the other thread because the uploading logic needs them; the private fields concern only the processing logic. ProcessFileType is an enum with variants Mp4Video and Image, and its value is determined from the file extension. Using this enum, the main processing logic decides which function to call:
match process_state.process_type {
    ProcessFileType::Mp4Video => {
        video::process(&f.input_path, &process_state.in_progress_file_path)?
    }
    ProcessFileType::Image => {
        image::process(&f.input_path, &process_state.in_progress_file_path)?
    }
}
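As for the extension check itself, it boils down to something like this (a simplified sketch with a hypothetical helper name and a shortened extension list, not the exact code):

use std::path::Path;

// ProcessFileType is chosen purely from the extension; anything unrecognised
// (PDFs, etc.) gets no ProcessState and is uploaded untouched.
fn process_type_for(path: &Path) -> Option<ProcessFileType> {
    match path.extension()?.to_str()?.to_lowercase().as_str() {
        "mp4" => Some(ProcessFileType::Mp4Video),
        "jpg" | "jpeg" | "png" => Some(ProcessFileType::Image),
        _ => None,
    }
}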
Processed files and in-progress intermediate files are stored in a temporary directory and deleted once the upload is complete. If Deepfrei gets interrupted while processing, any remaining temp files are reused on the next run so it doesn't have to start over completely.
This thread ends when every file that needs processing has been processed, and it always finishes before the uploading thread.
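Putting those pieces together, the processing thread's loop looks roughly like this simplified sketch (error handling is reduced to a boxed error, it assumes ProcessFileType can be cloned and that the process functions' errors are Send + Sync, and the real code differs in the details):

use std::sync::{Arc, Mutex};

// Sketch of the processing thread's loop, reusing the types shown above.
fn run_processing(
    files: Arc<Mutex<Vec<FileToUpload>>>,
    mut logger: Logger,
) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    loop {
        // Hold the lock only long enough to find the next candidate and copy
        // out what we need, so the uploading thread isn't blocked while
        // ffmpeg is running.
        let next = {
            let list = files.lock().unwrap();
            list.iter()
                .enumerate()
                .find(|(_, f)| {
                    !f.uploaded
                        && f.process_state
                            .as_ref()
                            .map_or(false, |s| !s.is_processing_complete)
                })
                .map(|(i, f)| {
                    let state = f.process_state.as_ref().unwrap();
                    (
                        i,
                        f.input_path.clone(),
                        state.process_type.clone(),
                        state.in_progress_file_path.clone(),
                    )
                })
        };

        // Nothing left that needs processing: this thread is done.
        let Some((i, input_path, process_type, temp_path)) = next else {
            return Ok(());
        };

        let name = input_path.to_str().unwrap().to_owned();
        logger.add_event(EventType::StartProcessing(name.clone(), process_type.clone()));

        match process_type {
            ProcessFileType::Mp4Video => video::process(&input_path, &temp_path)?,
            ProcessFileType::Image => image::process(&input_path, &temp_path)?,
        }

        // Mark it complete so the uploading thread can pick it up, and log it
        // for the metrics at the end.
        files.lock().unwrap()[i]
            .process_state
            .as_mut()
            .unwrap()
            .is_processing_complete = true;
        logger.add_event(EventType::EndProcessing(name));
    }
}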
Step 3b: Video & Photo Uploading
The other main thread spins until there's a to-do item that satisfies this condition (pseudocode): it hasn't been uploaded and (it doesn't need processing or it has been processed). If no files are ready to upload, it sleeps for a few seconds and checks again, since the other thread may have finished processing one in the meantime. If the internet is acting a bit wonky, there's logic to retry the upload a few times before moving on to the next file. This thread ends when all files are uploaded.
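In code, that loop looks roughly like the following sketch. upload_file is a hypothetical stand-in for the actual S3 call, and the retry logic is reduced to its bare bones:

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Sketch of the uploading thread's loop, reusing the types shown above.
fn run_uploading(files: Arc<Mutex<Vec<FileToUpload>>>, mut logger: Logger) {
    loop {
        // Ready to upload: not yet uploaded, and either it needs no
        // processing or its processing has finished.
        let next = {
            let list = files.lock().unwrap();
            if list.iter().all(|f| f.uploaded) {
                return; // everything is uploaded; this thread is done
            }
            list.iter().position(|f| {
                !f.uploaded
                    && f.process_state
                        .as_ref()
                        .map_or(true, |s| s.is_processing_complete)
            })
        };

        let Some(i) = next else {
            // Nothing is ready yet; the processing thread may finish one soon.
            thread::sleep(Duration::from_secs(5));
            continue;
        };

        // Upload the processed copy if one exists, otherwise the original.
        let (source, target) = {
            let list = files.lock().unwrap();
            let f = &list[i];
            let source = f
                .process_state
                .as_ref()
                .map(|s| s.process_file_path.clone())
                .unwrap_or_else(|| f.input_path.clone());
            (source, f.target_path.clone())
        };

        let name = source.to_str().unwrap().to_owned();
        let size = std::fs::metadata(&source).map(|m| m.len()).unwrap_or(0);
        logger.add_event(EventType::StartUploading(name.clone(), size));

        // Retry a few times in case the internet is acting up, then move on.
        for _attempt in 0..3 {
            if upload_file(&source, &target).is_ok() {
                files.lock().unwrap()[i].uploaded = true;
                logger.add_event(EventType::EndUploading(name.clone()));
                break;
            }
        }
    }
}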
Step 4: Wait for Threads then Print Metrics
During steps 3a and 3b, the threads are constantly adding log events to a shared list. The events are strictly typed:
pub enum EventType {
    StartUploading(String, u64), // file name & size in bytes
    EndUploading(String), // file name
    StartProcessing(String, crate::processing::ProcessFileType), // file name & type
    EndProcessing(String), // file name
}
The time is also logged, but the logger itself attaches it alongside the EventType; it makes sense for the threads not to care about the current time. A full event looks like this:
pub struct Event {
    pub time: DateTime<Local>,
    pub event_type: EventType,
}
I built the logger to be thread-safe, which is a requirement here. It looks like this:

pub struct Logger(Arc<Mutex<Vec<Event>>>);

It has a method add_event which locks the mutex and pushes an item onto the inner Vec:
pub fn add_event(&mut self, event_type: EventType) {
    if let Ok(mut lock) = self.0.lock() {
        let event = Event {
            time: Local::now(),
            event_type,
        };
        lock.push(event);
    }
}
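The rest of the Logger is tiny: a constructor, a cheap Clone so each thread can hold its own handle to the same Vec, and a way to hand the events back at the end. Roughly (a sketch, not necessarily line-for-line what Deepfrei does, and take_events is my own name):

use std::sync::{Arc, Mutex};

impl Logger {
    pub fn new() -> Self {
        Logger(Arc::new(Mutex::new(Vec::new())))
    }

    // Hand the collected events over to the caller (used once, at the end,
    // when it's time to compute the metrics).
    pub fn take_events(&self) -> Vec<Event> {
        std::mem::take(&mut *self.0.lock().unwrap())
    }
}

// Cloning a Logger just bumps the Arc's reference count, so each thread can
// cheaply hold its own handle to the same underlying Vec<Event>.
impl Clone for Logger {
    fn clone(&self) -> Self {
        Logger(Arc::clone(&self.0))
    }
}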
I thought this would make it very easy to sprinkle log events throughout the logic within the main threads. For example, here is the line that adds an event when a file has completed processing:
logger.add_event(EventType::EndProcessing(
    f.input_path.to_str().unwrap().to_owned(),
));
Using these log events, the main program is able to output some helpful information like total_bytes_uploaded and total_uploaded_time, which can be used with a little division to get your internet's upload speed.
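Here's a sketch of that summary step, using the Event and EventType definitions above (the exact numbers Deepfrei prints differ, and the speed math here is just illustrative):

use std::collections::HashMap;
use chrono::{DateTime, Local};

// Summarize the collected events: total bytes uploaded, total time spent
// uploading, and the resulting average upload speed.
fn print_metrics(events: &[Event]) {
    let mut total_bytes_uploaded: u64 = 0;
    let mut total_upload_secs: i64 = 0;
    let mut starts: HashMap<&str, DateTime<Local>> = HashMap::new();

    for event in events {
        match &event.event_type {
            EventType::StartUploading(name, size) => {
                total_bytes_uploaded += *size;
                starts.insert(name.as_str(), event.time);
            }
            EventType::EndUploading(name) => {
                if let Some(start) = starts.get(name.as_str()) {
                    total_upload_secs += (event.time - *start).num_seconds();
                }
            }
            _ => {}
        }
    }

    let secs = total_upload_secs.max(1) as f64;
    let mbits_per_sec = (total_bytes_uploaded as f64 * 8.0) / 1_000_000.0 / secs;
    println!(
        "uploaded {total_bytes_uploaded} bytes in ~{total_upload_secs}s (~{mbits_per_sec:.1} Mbit/s)"
    );
}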
A note on parallel design
When designing a parallel solution, there are two main ways to share data between threads: message passing and shared memory. In my implementation, I chose shared memory. I wrapped every item that needs to be shared in an Arc<Mutex<???>>. The Arc lets me share a reference to some data on the heap across multiple threads, and the Mutex gives those multiple owners write access.
If I had to do it over, I'd probably try message passing. While shared memory worked well, the way the work got divided would fit message passing nicely. Imagine the processing thread starts with a queue of items it needs to process, and the uploading thread starts with a queue of items it needs to upload. As the processing thread finishes each file, it sends a message to the uploading thread telling it to add that item to its queue. Logging, however, works nicely as shared memory, I think.
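For the curious, here's a minimal sketch of what that hand-off could look like with std::sync::mpsc channels. It's hypothetical, not how Deepfrei currently works, and the file names are just placeholders:

use std::sync::mpsc;
use std::thread;

// Messages the processing thread would send to the uploading thread.
enum Msg {
    ReadyToUpload(std::path::PathBuf), // this file is resized and ready to upload
    Done,                              // no more processed files are coming
}

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    // Processing thread: resize each file, then hand it off for upload.
    let processor = thread::spawn(move || {
        for path in ["a.mp4", "b.jpg"] {
            // ... resize `path` into a temp file here ...
            tx.send(Msg::ReadyToUpload(path.into())).unwrap();
        }
        tx.send(Msg::Done).unwrap();
    });

    // Uploading thread: starts with its own queue of files that need no
    // processing, then picks up processed files as they arrive.
    let uploader = thread::spawn(move || {
        while let Ok(msg) = rx.recv() {
            match msg {
                Msg::ReadyToUpload(path) => {
                    // ... upload `path` to S3 here ...
                    println!("would upload {path:?}");
                }
                Msg::Done => break,
            }
        }
    });

    processor.join().unwrap();
    uploader.join().unwrap();
}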