fix: add timeout pog, allocation bonus, test container name #221

Merged 6 commits on Dec 23, 2024
32 changes: 16 additions & 16 deletions README.md
```diff
@@ -210,43 +210,43 @@ The score calculation function now determines a miner's performance primarily ba
 - NVIDIA RTX A5000: 0.36
 - NVIDIA RTX A4500: 0.34
 
-**Scaling Factor**: Determine the highest GPU base score, multiply it by 8 (the maximum number of GPUs), and set this scenario as the 100-point baseline. A scaling factor is derived so that using eight of the top GPU models equals 100 points.
+**Scaling Factor**: Determine the highest GPU base score, multiply it by 8 (the maximum number of GPUs), and set this scenario as the 100-point baseline. A scaling factor is derived so that using eight of the top GPU models equals 50 points.
 
-**GPU Score**: Multiply the chosen GPU’s base score by the number of GPUs (up to 8) and by the scaling factor to find the miner’s GPU score (0–100).
+**GPU Score**: Multiply the chosen GPU’s base score by the number of GPUs (up to 8) and by the scaling factor to find the miner’s GPU score (0–50).
 
-**Allocation Bonus**: If a miner has allocated machine resources, add 100 points to the GPU score, allowing a maximum score of up to 200.
+**Allocation Bonus**: If a miner has allocated machine resources, the GPU score is multiplied by 2, allowing a maximum score of up to 100.
 
 **Total Score**:
 
-- Score (not allocated) = GPU Score (0–100)
-- Score (allocated) = GPU Score + 100 (up to 200)
+- Score (not allocated) = GPU Score (0–50)
+- Score (allocated) = GPU Score * 2 (up to 100)
 
 ### Example 1: Miner A's Total Score
 
 - **GPU**: NVIDIA H200 (Base Score: 3.90)
 - **Number of GPUs**: 8
-- **Allocation**: True
+- **Allocation**: False
 
 Step-by-step calculation:
-1. Highest scenario: 3.90 * 8 = 31.2
-2. Scaling factor: 100 / 31.2 ≈ 3.2051
-3. GPU Score: 3.90 * 8 * 3.2051 ≈ 100
-4. Allocation Bonus: 100 + 100 = 200
+1. Highest scenario: 4 * 8 = 32
+2. Scaling factor: 50 / 32 = 1.5625
+3. GPU Score: 4 * 8 * 1.5625 = 50
+4. Allocation Bonus: 0
 
-Total Score = 200
+Total Score = 50
 
 ### Example 2: Miner B's Total Score
 
 - **GPU**: NVIDIA RTX 4090 (Base Score: 0.69)
 - **Number of GPUs**: 2
-- **Allocation**: False
+- **Allocation**: True
 
 Step-by-step calculation:
-1. Scaling factor (same as above): 3.2051
-2. GPU Score: 0.69 * 2 * 3.2051 ≈ 4.42
-3. No allocation bonus applied.
+1. Scaling factor (same as above): 1.5625
+2. GPU Score: 0.68 * 2 * 1.5625 = 2.125
+3. Allocation Bonus: 2.125 * 2 = 4.25
 
-Total Score = 4.42
+Total Score = 4.25
 
 ## Resource Allocation Mechanism
```
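The revised scoring arithmetic can be reproduced in a few lines of Python. This is a minimal sketch of the formula described above, not the validator's actual implementation (see `calc_score_pog` below); the base-score table is abbreviated to the two GPUs used in the examples, and the function name is illustrative:

```python
# Abbreviated base-score table; values taken from the worked examples above.
GPU_BASE_SCORES = {"NVIDIA H200": 4.0, "NVIDIA RTX 4090": 0.68}

def total_score(gpu_name: str, num_gpus: int, allocated: bool) -> float:
    max_score = max(GPU_BASE_SCORES.values()) * 8  # best GPU x 8 GPUs = baseline
    score_factor = 50 / max_score                  # eight top GPUs map to 50 points
    score = GPU_BASE_SCORES[gpu_name] * min(num_gpus, 8) * score_factor
    return score * 2 if allocated else score       # allocation doubles the score (max 100)

print(total_score("NVIDIA H200", 8, allocated=False))     # 50.0  (Example 1)
print(total_score("NVIDIA RTX 4090", 2, allocated=True))  # 4.25  (Example 2)
```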
2 changes: 1 addition & 1 deletion compute/__init__.py
```diff
@@ -18,7 +18,7 @@
 import string
 
 # Define the version of the template module.
-__version__ = "1.6.0"
+__version__ = "1.6.1"
 __minimal_miner_version__ = "1.6.0"
 __minimal_validator_version__ = "1.6.0"
```
34 changes: 32 additions & 2 deletions neurons/Miner/container.py
```diff
@@ -38,6 +38,7 @@
 
 image_name = "ssh-image"  # Docker image name
 container_name = "ssh-container"  # Docker container name
+container_name_test = "ssh-test-container"
 volume_name = "ssh-volume"  # Docker volumne name
 volume_path = "/tmp"  # Path inside the container where the volume will be mounted
 ssh_port = 4444  # Port to map SSH service on the host
@@ -56,7 +57,7 @@ def kill_container():
         client, containers = get_docker()
         running_container = None
         for container in containers:
-            if container_name in container.name:
+            if container.name == container_name:
                 running_container = container
                 break
         if running_container:
@@ -76,6 +77,31 @@ def kill_container():
         bt.logging.info(f"Error killing container {e}")
         return False
 
+# Kill the currently running test container
+def kill_test_container():
+    try:
+        client, containers = get_docker()
+        running_container = None
+        for container in containers:
+            if container.name == container_name_test:
+                running_container = container
+                break
+        if running_container:
+            # stop and remove the container by using the SIGTERM signal to PID 1 (init) process in the container
+            if running_container.status == "running":
+                running_container.exec_run(cmd="kill -15 1")
+                running_container.wait()
+                # running_container.stop()
+            running_container.remove()
+            # Remove all dangling images
+            client.images.prune(filters={"dangling": True})
+            bt.logging.info("Test container was killed successfully")
+        else:
+            bt.logging.info("No running container.")
+        return True
+    except Exception as e:
+        bt.logging.info(f"Error killing container {e}")
+        return False
+
 # Run a new docker container with the given docker_name, image_name and device information
 def run_container(cpu_usage, ram_usage, hard_disk_usage, gpu_usage, public_key, docker_requirement: dict):
@@ -150,13 +176,17 @@
         # Create the Docker volume with the specified size
         # client.volumes.create(volume_name, driver = 'local', driver_opts={'size': hard_disk_capacity})
 
+        # Determine container name based on ssh key
+        container_to_run = container_name if docker_ssh_key else container_name_test
+
+
         # Step 2: Run the Docker container
         device_requests = [DeviceRequest(count=-1, capabilities=[["gpu"]])]
         # if gpu_usage["capacity"] == 0:
         #     device_requests = []
         container = client.containers.run(
             image=image_name,
-            name=container_name,
+            name=container_to_run,
             detach=True,
             device_requests=device_requests,
             environment=["NVIDIA_VISIBLE_DEVICES=all"],
```
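The switch from substring matching to exact equality in `kill_container` ensures only the intended container is selected: a substring check can match any container whose name merely contains the prefix. A quick illustration (the extra container names here are hypothetical, not from the repo):

```python
names = ["ssh-container", "ssh-container-backup", "ssh-test-container"]

# Old check: substring match can also select similarly named containers.
print([n for n in names if "ssh-container" in n])  # ['ssh-container', 'ssh-container-backup']

# New check: exact equality selects only the intended container.
print([n for n in names if n == "ssh-container"])  # ['ssh-container']
```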
11 changes: 6 additions & 5 deletions neurons/Validator/calculate_pow_score.py
```diff
@@ -39,23 +39,24 @@ def calc_score_pog(gpu_specs, hotkey, allocated_hotkeys, config_data, mock=False
         # Get the GPU with the maximum score
         max_gpu = max(gpu_scores, key=gpu_scores.get)
         max_score = gpu_scores[max_gpu]*8
-        score_factor = 100/max_score
+        score_factor = 50/max_score
 
         gpu_name = gpu_specs.get("gpu_name")
         num_gpus = min(gpu_specs.get("num_gpus"), 8)
 
         # Get GPU score
         score = gpu_scores.get(gpu_name) * num_gpus * score_factor
 
-        # Add allocation score, i.e. max un-allocated score = 100
+        # Add allocation score, multiplier = 2
         if hotkey in allocated_hotkeys:
-            score += 100
+            score = score * 2
 
         # Logging score
-        bt.logging.info(f"Score - {hotkey}: {score:.2f}/200")
+        bt.logging.info(f"Score - {hotkey}: {score:.2f}/100")
 
         # Normalize the score
-        normalized_score = normalize(score, 0, 200)
+        normalized_score = normalize(score, 0, 100)
 
         return normalized_score
     except Exception as e:
         bt.logging.error(f"An error occurred while calculating score for the following hotkey - {hotkey}: {e}")
```
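Note the normalization bounds change from (0, 200) to (0, 100) to match the new maximum. If `normalize` is a standard min-max scaler (an assumption; its definition is not shown in this diff), the returned value lands in [0, 1]:

```python
def normalize(value: float, min_value: float, max_value: float) -> float:
    # Hypothetical min-max scaler; the real helper is defined elsewhere in the repo.
    return (value - min_value) / (max_value - min_value)

print(normalize(4.25, 0, 100))  # 0.0425
```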
2 changes: 2 additions & 0 deletions neurons/miner.py
```diff
@@ -61,6 +61,7 @@
     build_sample_container,
     check_container,
     kill_container,
+    kill_test_container,
     restart_container,
     exchange_key_container,
     pause_container,
@@ -218,6 +219,7 @@ def __check_alloaction_errors(self):
             bt.logging.info(
                 "Container is already running without allocated. Killing the container."
             )
+            kill_test_container()
 
     def init_axon(self):
         # Step 6: Build and link miner functions to the axon.
```
20 changes: 17 additions & 3 deletions neurons/validator.py
```diff
@@ -643,8 +643,13 @@ async def worker():
                 break
             hotkey = axon.hotkey
             try:
-                result = await asyncio.get_event_loop().run_in_executor(
-                    self.executor, self.test_miner_gpu, axon, self.config_data
+                # Set a timeout for the GPU test
+                timeout = 300  # e.g., 5 minutes
+                result = await asyncio.wait_for(
+                    asyncio.get_event_loop().run_in_executor(
+                        self.executor, self.test_miner_gpu, axon, self.config_data
+                    ),
+                    timeout=timeout
                 )
                 if result[1] is not None and result[2] > 0:
                     async with results_lock:
@@ -655,6 +660,16 @@
                     update_pog_stats(self.db, hotkey, result[1], result[2])
                 else:
                     raise RuntimeError("GPU test failed")
+            except asyncio.TimeoutError:
+                bt.logging.warning(f"⏳ Timeout while testing {hotkey}. Retrying...")
+                retry_counts[hotkey] += 1
+                if retry_counts[hotkey] < retry_limit:
+                    bt.logging.info(f"🔄 {hotkey}: Retrying miner -> (Attempt {retry_counts[hotkey]})")
+                    await asyncio.sleep(retry_interval)
+                    await queue.put(axon)
+                else:
+                    bt.logging.info(f"❌ {hotkey}: Miner failed after {retry_limit} attempts (Timeout).")
+                    update_pog_stats(self.db, hotkey, None, None)
             except Exception as e:
                 bt.logging.trace(f"Exception in worker for {hotkey}: {e}")
                 retry_counts[hotkey] += 1
@@ -668,7 +683,6 @@
             finally:
                 queue.task_done()
 
-
         # Number of concurrent workers
         # Determine a safe default number of workers
         cpu_cores = os.cpu_count() or 1
```
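The timeout-plus-retry pattern added to the worker can be exercised in isolation. The following is a minimal self-contained sketch of the same `asyncio.wait_for` structure; all names here are hypothetical, and the retry bookkeeping is simplified from the validator's queue-based version:

```python
import asyncio
from typing import Optional

async def flaky_task(delay: float) -> str:
    # Stand-in for the blocking GPU test dispatched via run_in_executor.
    await asyncio.sleep(delay)
    return "ok"

async def run_with_retries(delay: float, timeout: float = 1.0, retry_limit: int = 3) -> Optional[str]:
    for attempt in range(1, retry_limit + 1):
        try:
            # wait_for cancels the awaited task and raises TimeoutError past the deadline.
            return await asyncio.wait_for(flaky_task(delay), timeout=timeout)
        except asyncio.TimeoutError:
            print(f"Timeout on attempt {attempt}, retrying...")
    return None  # all attempts timed out

print(asyncio.run(run_with_retries(2.0)))  # times out 3 times, then prints None
```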