
I would like to launch multiple Amazon EC2 spot instances (fleet?) using a custom AMI (docker?) for performing a deep-learning training task. I would like all the instances to share a common set of files for the purposes of training the model.

The idea here is not to lose training history and keep a backup in EBS (network drive?) when the spot instance is terminated by AWS due to pricing-limit/demand. The task state can be updated in a file and then resumed when instances are available.

Is it possible to launch all instances and let them work cooperatively to complete the training task? What kind of a setup could accomplish this?

1 Answer

Firstly, you might be interested in the Deep Learning AMI from the AWS Marketplace, which comes fully-configured with popular Deep Learning tools.

If the software you are using needs to save its data to a local file system (as opposed to Amazon S3), then you could use Amazon EFS to share a file system amongst multiple Amazon EC2 instances (including Spot instances). Amazon EFS is similar to a NAS and can be mounted simultaneously by multiple instances.
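As a rough illustration of how training state could survive a Spot termination, here is a minimal sketch that writes a small state file to the shared file system after every epoch and resumes from it on startup. The function names, the JSON layout and the `epoch` field are all hypothetical, and the training step itself is left as a placeholder; a real job would also save model weights through its framework's own checkpoint mechanism.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(state: dict, path: Path) -> None:
    """Write the state to a temp file, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename swaps in the complete file in one step, so a reader
    # should not see a half-written checkpoint.
    os.replace(tmp_name, path)


def load_state(path: Path) -> dict:
    """Return the last saved state, or a fresh one if none exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0}


def train(path: Path, total_epochs: int) -> dict:
    """Resume from the saved epoch and checkpoint after each epoch."""
    state = load_state(path)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real training would go here (placeholder) ...
        state["epoch"] = epoch + 1
        save_state(state, path)
    return state
```

If the shared file system were mounted at, say, `/mnt/efs` (an assumed path), a replacement instance could call `train(Path("/mnt/efs/job/state.json"), 50)` and continue from the last completed epoch. This sketch assumes a single writer per state file; several instances updating the same file would need some form of coordination.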

The EFS volume could be mounted via a User Data script, together with a setup script to load and run your desired application (which can be easier than making a new AMI).
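To make the User Data idea concrete, here is a small sketch that assembles such a script as a string. The file system ID, region, mount point and start command are placeholders you would replace with your own values, and `build_user_data` is a hypothetical helper, not part of any AWS SDK. The mount line uses the plain NFS client with the options AWS documents for EFS, and it assumes the AMI already has an NFS client installed and that the instance's security group allows NFS traffic (port 2049) to the file system.

```python
def build_user_data(
    file_system_id: str,
    region: str,
    mount_point: str = "/mnt/efs",  # assumed mount location
    start_command: str = "echo 'start training here'",  # placeholder
) -> str:
    """Return a bash User Data script that mounts EFS, then runs a command."""
    dns_name = f"{file_system_id}.efs.{region}.amazonaws.com"
    nfs_options = (
        "nfsvers=4.1,rsize=1048576,wsize=1048576,"
        "hard,timeo=600,retrans=2,noresvport"
    )
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        f"mkdir -p {mount_point}",
        f"mount -t nfs4 -o {nfs_options} {dns_name}:/ {mount_point}",
        start_command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file system ID and region, for illustration only.
    print(build_user_data("fs-12345678", "us-east-1"))
```

The resulting text would be supplied as the instance's User Data at launch. Some launch APIs expect it base64-encoded, so check the documentation for whichever one you use.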


8 Comments

Thanks for pointing out the DL AMI. Your input is greatly appreciated. As I see it, the Spot instance fleet is a very valuable and cost-effective tool in AWS. I will experiment and post what I learn on this thread. I am also looking at their API to automate some of the tasks.
Hi @SampathVanimisetti, if this or any answer has solved your question please consider accepting it by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this.
Apologies! New around here as you may have noticed. I tried upvoting, but it seems I need reputation points before I am able to do so. I have accepted the answer.
I am assuming that the options you indicated can be set up using the Amazon EC2 API. If I use the API to launch a Spot instance fleet, some of the instances may be terminated due to pricing/demand. I understand there are different storage options. What is the most cost-efficient option for storing DL training data and models? I can see that the Spot fleet may need to be started multiple times before the training can be completed.
As mentioned in my answer, you might consider using Amazon EFS (a shared disk between instances). Other choices are Amazon S3 (but this would need software to specifically handle it) or a database hosted outside of the Spot instances (eg Amazon RDS or Amazon DynamoDB).