Chapter 18 Design A Payment System

Design a scalable, reliable payment service that your company can use to process payments: it reads the account info and payment info from the user and makes the payment. It also acts as the source-of-truth ledger system, and it should reduce the total number of transactions sent out (each transaction carries a 1-2% fee).

Pay per transaction: from rider to Uber

  • Paid per-transaction: Amazon (charged at order time)
  • Paid post-transaction: Uber trip (charged after the trip ends)

Scheduled, batched monthly payment: payout to the driver

18.1 Feature Requirements

18.1.1 Post-transaction: Uber trip

  • Account balance (Uber credits), account info: saved payment methods (credit card numbers, bank accounts, etc.)
  • Currency and localization? (USD to start; global deployment means multiple data centers across multiple regions)
  • A user may have multiple accounts

18.1.2 Scheduled/Batch payments

  • Monthly (configurable; hard-coding the schedule is fine)
  • Batch several accounts in one payment request?
  • One account scheduled per month

Additionally, the latency of a payment transaction to the bank can be on the order of ~1 second.

18.1.3 What could go wrong:

  • Lack of payment

Insufficient funds (the bank denied the charge)

  • Double spending/payout

Charging the user twice (not the classic sense of using the same money to pay for two different things)
Correctness: exactly-once effect on top of at-least-once delivery

  • Incorrect currency conversion

n/a

  • Dangling authorization

Out of scope here; this is not about client-side auth. The relevant auth layers:
Auth to the payment system
Auth from the bank
PSP auth (username/password passed through to the bank)

  • Incorrect payment

Discrepancy with amounts reported by downstream services?

  • Incompatible IDs (only temporarily unique)

Use an idempotency key; some ID systems produce IDs that are only temporarily unique

  • 3rd party PSP outage

Any downstream service outages

  • Data discrepancy of charges from Stripe / Braintree

Same as incorrect payment

18.2 10 mins: Constraints/SLA

  • Scalable: 30M transactions per day ≈ 350 QPS on average; peak ~10x ≈ 3.5K QPS (see the quick calculation after this list)
  • Durable: survives hardware/software failures
  • Fault-tolerant: keeps working when any part of the system goes down
  • Consistency: strong consistency
  • Availability: 99.999% (five nines ≈ 5 minutes of downtime per year)
  • Latency: p99 SLA < 200 ms for everything other than the downstream bank/PSP call (which can take ~1 s)
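A quick sanity check of those numbers (only the stated traffic figures are inputs; this is a rough sketch, not a benchmark):

```python
# Back-of-envelope QPS from the stated traffic assumptions.
transactions_per_day = 30_000_000
seconds_per_day = 24 * 60 * 60            # 86,400

average_qps = transactions_per_day / seconds_per_day
peak_qps = 10 * average_qps               # assume peak is ~10x average

print(f"average ~{average_qps:.0f} QPS, peak ~{peak_qps:.0f} QPS")
# -> average ~347 QPS, peak ~3472 QPS
```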

PSP: payment service provider.

18.3 10 mins: High Level Architecture

Client (mobile/web) ends trip -> frontend web service --emits--> trip-end event --async--> payment service -> downstream services (PSP, bank, credit cards, etc.)

  • retry and de-duplication with idempotent API

Request carries an idempotency key (UUID or other deterministic ID); response status: success/failure, with failures classified as retryable or non-retryable.
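As a minimal sketch of the idea (the `PaymentRequest` shape, `make_request`, and `is_retryable` are illustrative names, not part of any real API): a deterministic idempotency key means the same trip always maps to the same request, and HTTP status codes map onto the retryable / non-retryable split.

```python
import uuid
from dataclasses import dataclass

@dataclass
class PaymentRequest:
    idempotency_key: str          # UUID or other deterministic ID
    account_id: str
    amount_cents: int
    currency: str = "USD"

def make_request(trip_id: str, account_id: str, amount_cents: int) -> PaymentRequest:
    # Deterministic key: the same trip always produces the same idempotency key,
    # so retries of the same trip-end event de-duplicate downstream.
    key = str(uuid.uuid5(uuid.NAMESPACE_URL, f"trip:{trip_id}"))
    return PaymentRequest(key, account_id, amount_cents)

def is_retryable(http_status: int) -> bool:
    # 5xx -> retryable failure, 4xx -> non-retryable failure.
    return 500 <= http_status <= 599
```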

Request Life Cycle

  1. Trip-end event arrives on a durable event queue, e.g. AWS SQS (at-least-once delivery); see the consumer sketch after this list
  • Only one worker works on a given event at a time (visibility timeout)
  • If the worker fails or times out, the event goes back to the queue to be picked up by another worker
  • If it fails a configurable number of times, it goes to the dead-letter queue (DLQ)
  • Scale up workers for faster processing
  2. Backend worker pulls the event from the queue (the event ID serves as the idempotency key)
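A minimal consumer-loop sketch assuming boto3 and an SQS queue whose redrive policy moves repeatedly failed messages to a DLQ; the queue URL and `process_event` callback are placeholders:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/trip-end-events"  # placeholder

def consume_forever(process_event):
    while True:
        # Long polling; a received message is hidden from other workers for the
        # visibility timeout, so only one worker handles a given event at a time.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            try:
                process_event(msg["Body"])
            except Exception:
                # Don't delete: the message reappears after the visibility timeout,
                # and after a configurable number of receives SQS moves it to the DLQ.
                continue
            # Delete only after processing fully succeeded (at-least-once semantics).
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```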

DB record: idempotency_key, status (success / retryable_failure / non-retryable_failure). Possibly missing intermediate statuses such as queued / running, e.g. server 1 picks up request A and then fails mid-flight.

Prepare (DB row-level lock on the idempotency key, with a lease expiry): the expiry time is a configurable value greater than the maximum downstream processing time.

Get or create the entry in the DB. If the entry exists with a finished (success) status, return immediately.
If the entry has a retryable status or does not exist, continue processing.
Process: talk to the downstream services (PSP, bank) for the actual money transfer.
Calls are async, with a circuit breaker (fail fast) and exponential-backoff retries. After the response comes back, update the DB record:

First update the status to success or failure (5xx -> retryable failure, 4xx -> non-retryable failure).
Release the lock on the idempotency key.
Delete the event from the queue.

A worker sketch combining these steps follows.
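A sketch of the prepare / process / finalize flow, assuming PostgreSQL with a psycopg2-style connection and a hypothetical `payment_requests` table with `idempotency_key`, `status`, and `lease_expires_at` columns; `psp.charge` stands in for the downstream call (which carries its own circuit breaker and backoff):

```python
import datetime

LEASE = datetime.timedelta(minutes=5)     # configurable; must exceed max downstream processing time

def handle_event(conn, psp, idempotency_key, amount_cents):
    """Prepare -> process -> finalize for one trip-end event; returns the final status."""
    now = datetime.datetime.now(datetime.timezone.utc)

    # --- Prepare: get-or-create the record and claim it with a lease ---
    with conn:                            # transaction; commit releases any row locks
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO payment_requests (idempotency_key, status, lease_expires_at) "
            "VALUES (%s, 'running', %s) ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key, now + LEASE))
        if cur.rowcount == 0:             # row already existed: inspect it under a row-level lock
            cur.execute(
                "SELECT status, lease_expires_at FROM payment_requests "
                "WHERE idempotency_key = %s FOR UPDATE",
                (idempotency_key,))
            status, lease_expires_at = cur.fetchone()
            if status == "success":
                return "success"          # already finished earlier: do nothing
            if status == "running" and lease_expires_at > now:
                return "in_progress"      # another worker holds a live lease
            cur.execute(                  # retryable failure or expired lease: (re)claim it
                "UPDATE payment_requests SET status = 'running', lease_expires_at = %s "
                "WHERE idempotency_key = %s",
                (now + LEASE, idempotency_key))

    # --- Process: the actual money transfer against the PSP / bank ---
    ok, http_status = psp.charge(idempotency_key, amount_cents)

    # --- Finalize: 5xx -> retryable, 4xx -> non-retryable ---
    new_status = ("success" if ok
                  else "retryable_failure" if http_status >= 500
                  else "non_retryable_failure")
    with conn:
        conn.cursor().execute(
            "UPDATE payment_requests SET status = %s WHERE idempotency_key = %s",
            (new_status, idempotency_key))
    return new_status                     # the caller deletes the SQS message only on success
```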

Scenario 1: server 1 locks the row for request A, then server 1 crashes -> no deadlock, because the lock has an expiration; the work just needs another process to pick it up.
Queue: the producer can send and forget; processing is at-least-once.
SQS: the message's lifecycle ends only when processing is fully done (explicit delete).
MQ / Kafka: the message's lifecycle ends when the message is consumed.
A backend process restarts and picks the event up again.

Scenario 2: lease expiry time is 5 minutes.

Server 1 takes the lock on row 1, then hits a stop-the-world GC pause longer than 5 minutes, and only afterwards writes.
Server 2 has meanwhile taken the lock on row 1 (the lease expired) and written its data, so server 1's late write conflicts and must be rejected; this is what fencing tokens (below) solve.

Scenario 3: the worker's payment actually went through, but the worker never went back to SQS.
Failed to write the DB record as success:
Retry -> idempotent (same idempotency key at the PSP) -> no side effect
Failed to delete the event from the queue:
The next worker gets the event, checks the DB, sees it is already finished, does nothing, and returns
Distributed locking with fencing tokens: lock acquisitions get a monotonically increasing ID. Server 1 holds the lock with ID 10; server 2 later acquires it with ID 11. When server 1 comes back online and tries to commit, it can see that its ID is no longer valid and aborts that transaction (see the sketch below).
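A toy in-memory sketch of the fencing-token check (in the real system the token would live in the lock service / DB; `FencedLock` is an illustrative name):

```python
class FencedLock:
    """Hands out monotonically increasing fencing tokens; the store rejects stale ones."""

    def __init__(self):
        self._latest_token = 0

    def acquire(self) -> int:
        # A real implementation would also honor the lease expiry; omitted for brevity.
        self._latest_token += 1
        return self._latest_token

    def commit(self, token: int, write):
        # Writes carrying anything older than the newest token ever issued are rejected.
        if token != self._latest_token:
            raise RuntimeError(f"stale fencing token {token}; abort the transaction")
        write()

lock = FencedLock()
t1 = lock.acquire()                                   # server 1 gets its token
t2 = lock.acquire()                                   # lease expired during server 1's GC pause; server 2 gets a newer token
lock.commit(t2, lambda: print("server 2 writes"))
try:
    lock.commit(t1, lambda: print("server 1 writes"))
except RuntimeError as err:
    print(err)                                        # server 1 sees its token is stale and aborts
```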
After processing finishes, do bookkeeping for audit / data-analysis purposes.
Save transaction history and bookkeeping data (money in/out for each account).
A sweep process cleans up / purges the tables (delete rows that are more than 1 year old); a minimal sketch follows.
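A minimal sweep-job sketch, assuming a PostgreSQL-style table with `id` and `created_at` columns (both names are assumptions); deleting in small batches keeps the purge from holding long locks:

```python
def purge_old_rows(conn, table="transaction_history", batch_size=10_000):
    """Delete rows older than one year, one batch at a time, until none remain."""
    while True:
        with conn:
            cur = conn.cursor()
            cur.execute(
                f"DELETE FROM {table} WHERE id IN ("
                f"  SELECT id FROM {table} "
                f"  WHERE created_at < now() - interval '1 year' LIMIT %s)",
                (batch_size,))
            if cur.rowcount < batch_size:
                break
```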

18.4 10 mins: API Design

18.5 10 mins: Data Model

18.6 10 mins: DB choice

1 row ≈ 10 KB
30M rows per day -> ~300 GB per day; keep 180 days -> ~54 TB (≈50 TB) -> ~100 DB instances (500 GB per instance)
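The same sizing as a quick calculation (row size, retention, and per-instance capacity are the stated assumptions; the totals above round 54 TB / 108 instances down to ~50 TB / ~100):

```python
row_bytes = 10_000                          # ~10 KB per row
rows_per_day = 30_000_000
retention_days = 180                        # ~6 months in the hot store
gb, tb = 10**9, 10**12

daily_bytes = row_bytes * rows_per_day      # ~300 GB per day
total_bytes = daily_bytes * retention_days  # ~54 TB
instances = total_bytes / (500 * gb)        # ~108 instances at 500 GB each

print(f"{daily_bytes / gb:.0f} GB/day, {total_bytes / tb:.0f} TB total, ~{instances:.0f} instances")
```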
Sharding based on the transaction key (trip ID); beware cross-shard transactions.
SQL (ACID, transactions): account and request tables (purge every 6 months)
NoSQL: transaction_history and bookkeeping tables
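One possible shape for these tables; only the idempotency key, the status values, and money in/out per account come from this chapter, and every other column or field is an illustrative assumption:

```python
# SQL side (ACID): account and payment-request tables; requests are purged after ~6 months.
ACCOUNTS_DDL = """
CREATE TABLE accounts (
    account_id      TEXT PRIMARY KEY,
    balance_cents   BIGINT NOT NULL DEFAULT 0,      -- e.g. Uber credits
    payment_methods JSONB                           -- saved cards / bank accounts (tokenized)
);
"""

PAYMENT_REQUESTS_DDL = """
CREATE TABLE payment_requests (
    idempotency_key  TEXT PRIMARY KEY,
    account_id       TEXT NOT NULL REFERENCES accounts (account_id),
    amount_cents     BIGINT NOT NULL,
    currency         TEXT NOT NULL DEFAULT 'USD',
    status           TEXT NOT NULL,                 -- success / retryable_failure / non_retryable_failure
    lease_expires_at TIMESTAMPTZ,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

# NoSQL side: append-only documents written after each finished payment.
transaction_history_doc = {
    "idempotency_key": "example-key",
    "account_id": "example-account",
    "amount_cents": 1999,
    "status": "success",
    "created_at": "2024-01-01T00:00:00Z",
}
bookkeeping_doc = {
    "account_id": "example-account",
    "money_in_cents": 1999,
    "money_out_cents": 0,
    "period": "2024-01",
}
```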