The zEC12 Racehorse - a Linux Performance Update
Linux on System z / Dept. 3235 System & Performance Evaluation

Mario Held, System Performance Analyst,
IBM Research and Development, Germany
September 2013
© 2013 IBM Corporation

Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Java is a registered trademark of Oracle and/or its affiliates.
Other product and service names might be trademarks of IBM or other companies.

Disclaimer - Lab Measurements versus Production Environment
 The Boeblingen Linux on System z performance team
– continuously checks the quality / performance of development items in the Linux on System z area
– evaluates new or updated SUSE and Red Hat distributions
– evaluates new mainframe models and new hardware (network, disk, cryptography ...)
– is directly involved in customer cases and proofs of concept on demand
 Measurements using benchmarks are
– often stress tests to evaluate the quality and stability of developed items and hardware
– quality assessments of distributions and their updates or service packs
– required to get numbers as input for future hardware sizings (knowing the limits ...)
 Benchmarks are special and often of limited use when it comes to real-life scenarios
– In real environments much workload runs in parallel
• Usually resources and applications are not driven to their limits
• The quality of service is what matters
• Many resources are shared intentionally
• Real environments are often cost-constrained
– In performance measurements the workload runs exclusively
• Focus is on correct and reproducible results
• LPAR instead of z/VM in many scenarios
• Shared resources occur rarely

Agenda
 zEnterprise EC12 design
 Linux performance comparison zEC12 and z196

zEC12 Continues the Mainframe Heritage
 Every two to three years a new mainframe generation appears
 New generations run at a higher clock rate
– In bigger and smaller steps
– 5.5 GHz with zEC12, compared to the predecessor z196 running at 5.2 GHz

[Chart: processor clock rate by generation]
1997 G4: 300 MHz
1998 G5: 420 MHz
1999 G6: 550 MHz
2000 z900: 770 MHz
2003 z990: 1.2 GHz
2005 z9 EC: 1.7 GHz
2008 z10 EC: 4.4 GHz
2010 z196: 5.2 GHz
2012 zEC12: 5.5 GHz

The Evolution of Mainframe Generations
 PCI (uniprocessor performance) increased from 1202 on z196 to more than 1500 on zEC12
 Up to 101 processors supported on zEC12, up from 80 on z196

Processor Design Basics
 CPU (core)
– Cycle time
– Pipeline, execution order
– Branch prediction
– Hardware versus millicode
 Memory subsystem
– High speed buffers (caches)
– On chip, on book
– Private, shared
– Coherency required
 Buses
– Bandwidth
 Design limited by distance, speed of light, and available space

[Figure: generic hierarchy example – memory feeding a shared cache, which serves several CPUs, each with its own private cache]
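
The effect of such a hierarchy is easy to make visible from user space. The following is a minimal sketch (my illustration, not part of the presentation): it chases pointers through a randomly linked working set of growing size, so the average load latency steps up each time the set outgrows a cache level.

    /* Sketch: measure average load latency versus working-set size.
     * Random linking defeats the hardware prefetcher, so each load
     * pays the full latency of whichever cache level holds the set. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void * volatile sink;   /* keeps the chase loop from being optimized away */

    static double chase(size_t n)  /* n = number of pointer-sized slots */
    {
        void **slots = malloc(n * sizeof *slots);
        size_t *idx = malloc(n * sizeof *idx);
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++)              /* one random cycle over all slots */
            slots[idx[i]] = &slots[idx[(i + 1) % n]];
        free(idx);

        enum { LOOPS = 10000000 };
        void **p = &slots[0];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < LOOPS; i++)
            p = (void **)*p;                        /* serialized dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = p;
        free(slots);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / LOOPS;
    }

    int main(void)
    {
        for (size_t kib = 32; kib <= 512 * 1024; kib *= 2)   /* 32 KiB ... 512 MiB */
            printf("%7zu KiB: %6.2f ns/load\n",
                   kib, chase(kib * 1024 / sizeof(void *)));
        return 0;
    }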

zEC12 versus z196 – Memory and Caches
 z196
– Caches
● L1 private 64k instr, 128k data
● L2 private 1.5 MiB
● L3 shared 24 MiB per chip
● L4 shared 192 MiB per book
 zEC12
– Caches
● L1 private 64k instr, 96k data
● L1+ 1 MiB (acts as second level data cache)
● L2 private 1 MiB (acts as second level instruction cache)
● L3 shared 48 MiB per chip
● L4 shared 2 x 192 MiB => 384 MiB per book

[Figures: per-book hierarchy for both machines – memory on top of the L4 cache, per-chip L3 caches below, and per-CPU private caches: L2/L1 on z196, L2/L1+/L1 on zEC12]

zEC12 PU Core
 Each core is a super-scalar processor with these characteristics:
– Six execution units
● 2 fixed point (integer), 2 load/store, 1 binary floating point, 1 decimal floating point
– Up to three instructions decoded / completed per cycle
– Up to seven instructions issued per cycle
– New instructions
– Better branch prediction
● Virtual branch unit and queue
– Enhanced out-of-(program)-order (OOO+) capabilities
● More out-of-order groups

zEC12 Out of Order – Significant Performance Benefit
 Re-ordering instruction execution
– Instructions stall in a pipeline because they are waiting for results from a previous instruction, or because the execution resource they require is busy
– In an in-order core, a stalled instruction stalls all later instructions in the code stream
– In an out-of-order core, later instructions are allowed to execute ahead of the stalled instruction
 Re-ordering storage accesses
– Instructions which access storage can stall because they are waiting on results needed to compute the storage address
– In an in-order core, later instructions are stalled
– In an out-of-order core, later storage-accessing instructions which can compute their storage address are allowed to execute
 Hiding storage access latency
– Many instructions access data from storage
– Storage accesses can miss the L1 and require 10 to 500 additional cycles to retrieve the data
– In an in-order core, later instructions in the code stream are stalled
– In an out-of-order core, later instructions which are not dependent on this storage data are allowed to execute
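
A small sketch of why this matters (my illustration, not from the deck): both loops below perform the same number of floating point additions, but the first forms one long dependency chain, while the second splits the work into four independent chains that an out-of-order, super-scalar core can execute in parallel.

    /* Sketch: dependent versus independent work. Compile without
     * -ffast-math so the compiler may not reassociate the chains. */
    #include <stdio.h>
    #include <time.h>

    #define N 400000000L

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        const double step = 1.000000001;

        double t0 = now();
        double a = 0.0;
        for (long i = 0; i < N; i++)
            a += step;                    /* each add depends on the previous one */
        double t1 = now();

        double b0 = 0, b1 = 0, b2 = 0, b3 = 0;
        for (long i = 0; i < N; i += 4) { /* four independent dependency chains */
            b0 += step; b1 += step; b2 += step; b3 += step;
        }
        double t2 = now();

        printf("one chain:   %.2f s (sum %.0f)\n", t1 - t0, a);
        printf("four chains: %.2f s (sum %.0f)\n", t2 - t1, b0 + b1 + b2 + b3);
        return 0;
    }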

Out of Order Execution – z196 versus zEC12
 Out of Order (OOO) execution allows the earlier and parallel execution of logically independent instructions
– Improved algorithms save even more time on zEC12

[Figure: the same seven-instruction sequence containing an L1 miss, shown as in-order execution, z196 out-of-order execution, and zEC12 out-of-order execution – the zEC12 core overlaps dependencies, execution, and storage accesses even better, shortening the total time]

Set the Expectations
 Performance relevant changes in these areas
– Increased clock speed (5.2 GHz to 5.5 GHz)
– Changes in cache
● Level 1 data cache smaller (128k → 96k)
● Level 1+ and Level 2 cache bigger (1.5 MiB → 2 MiB)
● Level 3 and Level 4 now double the size
– Modern processor with
● Improved pipelining and decoding
● Better branch prediction
● Out of Order of the second generation
 The expectation was to see a 25 percent performance improvement overall
– Note that the clock alone contributes only about 6 percent (5.5 / 5.2 ≈ 1.058); the rest had to come from the cache and core improvements

Agenda
 zEnterprise EC12 design
 Linux performance comparison zEC12 and z196

Our zEC12 versus z196 Measurement Environment
 Hardware
– zEC12
● 2827-789 H89
● OSA-Express4S 10 Gb and 1 Gb
● Connected to a DS8870 via FICON Express8S
– z196
● 2817-766 M66
● OSA-Express4S 10 Gb and 1 Gb
● Connected to a DS8800 via FICON Express8S
 Recent Linux distribution with kernel
– SLES11 SP2: 3.0.13-0.27-default (if not stated otherwise)
– Shared processors (if not stated otherwise)
– Linux in LPAR, measured exclusively
● All other LPARs deactivated

File Server Benchmark Description
 Dbench 3
– Emulation of the Netbench benchmark
– Generates file system load on the Linux VFS
– Issues the same I/O calls as the smbd server in Samba (without the networking calls)
– Mixed file operations workload for each process: create, write, read, append, delete
– Measures throughput of transferred data
 Configuration
– 2 GiB memory, mainly memory operations
– Scaling processors 1, 2, 4, 8, 16
– For each processor configuration, scaling processes 1, 4, 8, 12, 16, 20, 26, 32, 40
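
For illustration, here is a stripped-down sketch of the mixed file-operation pattern described above. This is not dbench itself; file name, chunk size, and round count are arbitrary choices of mine.

    /* Sketch: one process's create/write/read/append/delete loop,
     * reporting the achieved data rate. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <time.h>

    int main(void)
    {
        enum { ROUNDS = 20000, CHUNK = 64 * 1024 };
        static char buf[CHUNK];
        memset(buf, 'x', sizeof buf);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        long long bytes = 0;
        for (int i = 0; i < ROUNDS; i++) {
            int fd = open("dbench.tmp", O_CREAT | O_TRUNC | O_RDWR, 0600); /* create */
            bytes += write(fd, buf, sizeof buf);                           /* write  */
            lseek(fd, 0, SEEK_SET);
            bytes += read(fd, buf, sizeof buf);                            /* read   */
            lseek(fd, 0, SEEK_END);
            bytes += write(fd, buf, sizeof buf);                           /* append */
            close(fd);
            unlink("dbench.tmp");                                          /* delete */
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f MB/s over %d rounds\n", bytes / s / 1e6, ROUNDS);
        return 0;
    }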

Dbench3
 Throughput improves by 38 to 52 percent in this scaling experiment, comparing zEC12 to z196
– Ramp-up phase until #processes = #processors
– Highest improvements during the ramp-up phase

[Chart: Dbench throughput over 1 to 40 processes, for 1, 2, 4, 8, and 16 CPUs, each measured on z196 and on zEC12]

Kernel Benchmark Description
 Lmbench 3
– Suite of operating system micro-benchmarks
– Focuses on interactions between the operating system and the hardware architecture
– Latency measurements for process handling and communication
– Latency measurements for basic system calls
– Bandwidth measurements for memory and file access, operations and movement
 Configuration
– 2 GiB memory
– 4 processors
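
As an illustration of what a "simple syscall" latency measurement looks like, here is a minimal sketch in the spirit of Lmbench (my own code, not Lmbench's):

    /* Sketch: average latency of a cheap system call. getppid() is a
     * good candidate because glibc does not cache its result. */
    #include <stdio.h>
    #include <unistd.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 5 * 1000 * 1000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            (void)getppid();        /* round trip into the kernel and back */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("simple syscall: %.1f ns\n", ns / N);
        return 0;
    }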

Lmbench3
 Benefits seen in close to all measured latency and bandwidth operations

Measured operation | Deviation zEC12 to z196 in %
simple syscall | 52
simple read/write | 46 / 43
select of file descriptors | 32
signal handler | 55
process fork | 25
libc bcopy aligned L1 / L2 / L3 / L4 cache / main memory | 0 / 12 / 25 / 10 / n/a
libc bcopy unaligned L1 / L2 / L3 / L4 cache / main memory | 0 / 26 / 25 / 35 / n/a
memory bzero L1 / L2 / L3 / L4 cache / main memory | 40 / 13 / 20 / 45 / n/a
memory partial read L1 / L2 / L3 / L4 cache / main memory | -10 / 25 / 45 / 105 / n/a
memory partial read/write L1 / L2 / L3 / L4 cache / main memory | 75 / 75 / 90 / 180 / n/a
memory partial write L1 / L2 / L3 / L4 cache / main memory | 45 / 50 / 62 / 165 / n/a
memory read L1 / L2 / L3 / L4 cache / main memory | 5 / 10 / 45 / 120 / n/a
memory write L1 / L2 / L3 / L4 cache / main memory | 80 / 92 / 120 / 250 / n/a
mmap read L1 / L2 / L3 / L4 cache / main memory | 0 / 13 / 35 / 110 / n/a
mmap read open2close L1 / L2 / L3 / L4 cache / main memory | 23 / 18 / 19 / 55 / n/a
read L1 / L2 / L3 / L4 cache / main memory | 60 / 30 / 35 / 50 / n/a
read open2close L1 / L2 / L3 / L4 cache / main memory | 27 / 30 / 35 / 60 / n/a
unrolled bcopy unaligned L1 / L2 / L3 / L4 cache / main memory | 35 / 28 / 60 / 35 / n/a
unrolled partial bcopy unaligned L1 / L2 / L3 / L4 cache / main memory | 35 / 13 / 45 / 20 / n/a
mappings | 34 to 41

Java Benchmark Description
 Industry standard benchmark
– Evaluates the performance of server side Java
– Exercises
● Java Virtual Machine (JVM)
● Just-In-Time compiler (JIT)
● Garbage collection
● Multiple threads
● Simulated real-world applications including XML processing or floating point operations
– Can be used to measure performance of processors, memory hierarchy and scalability
 Configuration
– 16 processors, 4 GiB memory, 1 JVM, 2 GiB max heap size
– IBM J9 JRE 1.7.0 SR4 64-bit

Java Benchmark
 Business operation throughput improved by approximately 30% to 35%

[Charts: throughput over 1 to 32 warehouses for z196 and zEC12 (both SLES11 SP2 GMC, 16 CPUs, 1 JVM, JRE 1.7 SR4), and relative performance in % with z196 as base]

Compute-Intense Benchmark Description
 Industry standard benchmark
– Stresses a system's processor, memory subsystem and compiler
– Workloads developed from real user applications
– Exercises integer and floating point in C, C++, and Fortran programs
– Can be used to evaluate compile options
– Can be used to optimize the compiler's code generation for a given target system
 Configuration
– 1 processor, 2 GiB memory, executing one test case at a time

Single-threaded, Compute-intense Workload
 SLES11 SP2 GA, gcc-4.3-62.198, glibc-2.11.3-17.31.1, using the default machine optimization options of gcc-4.3 on s390x
 Integer suite: 28% more throughput (geometric mean)
 Floating point suite: 31% more throughput (geometric mean), not shown in a chart

[Chart: integer suite improvements of zEC12 versus z196 (march=z9-109, mtune=z10) in percent, for testcases 1 through 12]
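
For reference, a suite-level "(geometric mean)" figure follows the usual convention for benchmarks of this kind: with $n$ testcases and per-testcase throughput ratios $r_i$ (zEC12 over z196), the suite result is

    \bar{r} = \left( \prod_{i=1}^{n} r_i \right)^{1/n}

so the reported 28% gain corresponds to $\bar{r} \approx 1.28$. The geometric mean keeps a single outlier testcase from dominating the suite result.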

Description - Network Benchmark uperf
 Open Source (GPL) network performance tool that supports modeling
 Transactional workloads
– rr1c 1 x 1: simulating low latency keep-alives
– rr1c 200 x 1000: simulating online transactions
– rr1c 200 x 30k: simulating database queries
 Streaming workloads
– str read: simulating incoming file transfers
– str write: simulating outgoing file transfers
 All tests are done with 1, 10, 50, and 250 simultaneous connections
 All of that across multiple connection types
– LPAR-LPAR / LPAR-z/VM / z/VM-z/VM
– Physical and virtual connections
– Different OSA cards and MTU sizes
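
As a hedged illustration (not uperf itself) of the "rr1c 200 x 1000" transaction pattern: one connection, a 200-byte request, a 1000-byte response, repeated, with the transaction rate reported at the end. Host and port are placeholders, and a peer must answer in exactly this shape.

    /* Sketch: request-response ping-pong client. */
    #include <stdio.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <time.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in peer = { .sin_family = AF_INET,
                                    .sin_port = htons(5001) };  /* assumed port */
        inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);        /* assumed host */
        if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
            perror("connect");
            return 1;
        }

        char req[200] = {0}, resp[1000];
        enum { N = 100000 };
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            write(fd, req, sizeof req);        /* 200-byte request */
            size_t got = 0;                    /* read the full 1000-byte reply */
            while (got < sizeof resp) {
                ssize_t n = read(fd, resp + got, sizeof resp - got);
                if (n <= 0) return 1;
                got += n;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f transactions/s\n", N / s);
        close(fd);
        return 0;
    }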

uperf – OSA-Express4S 10Gb MTU 1492
 Throughput increased
– Especially with streaming and many parallel connections
 Measured with a SLES11 SP2 maintenance kernel (lanidle settings fixed)

[Chart: deviation of application throughput in %, zEC12 versus z196 base (both OSA-Express4S 10Gb, cable, MTU 1492), for all rr1c and str workloads at 1, 10, 50, and 250 connections]

uperf – OSA-Express4S 10Gb MTU 1492 (cont.)
 Tremendous processor consumption savings
– Benefit for all kinds of connection characteristics

[Chart: processor consumption deviation in % (lower is better), zEC12 versus z196 base (both OSA-Express4S 10Gb, cable, MTU 1492), for all rr1c and str workloads at 1, 10, 50, and 250 connections]

uperf – HiperSockets LPAR-LPAR MTU 32k
 Throughput increased
– Improvements with all connection types and sizes measured
 Processor time savings similar to those seen with the OSA card
 Dedicated processors used in this measurement
 Measured with a SLES11 SP2 maintenance kernel

[Chart: deviation of application throughput in %, zEC12 versus z196 base (HiperSockets MTU 32k, virtual, LPAR-LPAR), for all rr1c and str workloads at 1, 10, 50, and 250 connections]

Description – Scalability Benchmark
 Re-Aim 7
– Workload patterns describe system call ratios (patterns can be more inter-process-communication, disk, or calculation intensive)
– The benchmark run
● Starts with one job and continuously increases that number
● Overall throughput usually increases until #threads ≈ #processors
● Then threads are further increased until a drop in throughput occurs
● Scales up to thousands of concurrent threads stressing the same components
– Often a good check for non-scaling interfaces
● Some interfaces don't scale at all (1 job throughput ≈ multiple jobs throughput, despite >1 processors)
● Some interfaces only scale in certain ranges
– Measures the number of jobs per minute that a single thread and all threads together can achieve (a minimal sketch of the methodology follows below)
 Configuration
– 4, 8, 16 processors, 4 GiB memory
– Using a journaled file system on an xpram device
– Using fserver, new-db and compute workload patterns
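
The sketch below (my illustration, not Re-Aim) shows the scaling methodology: run the same fixed unit of work in an increasing number of forked jobs and report jobs per minute. Throughput should rise until the job count reaches the processor count, then flatten or drop.

    /* Sketch: jobs-per-minute scaling with forked workers. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <time.h>

    static void one_job(void)             /* stand-in for a workload pattern */
    {
        volatile double x = 0;
        for (long i = 1; i < 20000000L; i++)
            x += 1.0 / i;
    }

    int main(void)
    {
        for (int jobs = 1; jobs <= 64; jobs *= 2) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int j = 0; j < jobs; j++) {
                if (fork() == 0) { one_job(); _exit(0); }
            }
            while (wait(NULL) > 0)        /* reap all workers */
                ;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%2d jobs: %.0f jobs/min\n", jobs, jobs / s * 60.0);
        }
        return 0;
    }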

Re-Aim 'fserver' Workload Pattern
 More jobs per minute
– 25% to 30% more jobs after ramp-up when using 4 and 8 processors
– Up to 80% more jobs after ramp-up when using 16 processors
 Approximately 22% to 50% lower processor consumption, highest values with 16 processors
 More valid parallel processes run on the zEC12 before the benchmark stops
– Example: 4352 versus 2523 parallel processes with 16 processors

[Chart: relative performance deviation in percent over the number of processes (0 to 3000), for 4, 8, and 16 processors]

Re-Aim 'new-db' Workload Pattern
 33% to 43% more jobs per minute with 4, 8, and 16 processors after ramp-up
 Approximately 25% to 30% lower processor consumption
 More parallel processes run on the zEC12 before the benchmark stops
– Example: 3584 versus 2897 parallel processes with 16 processors

[Chart: relative performance deviation in percent over the number of processes (0 to 3500), for 4, 8, and 16 processors]

Re-Aim 'compute' Workload Pattern
 Jobs per minute with 4, 8, and 16 processors on average 31% higher after ramp-up
 Approximately 20% to 25% lower processor consumption
 More parallel processes run on the zEC12 before the benchmark stops
– Example: 1792 versus 1221 parallel processes with 16 processors

[Chart: relative performance deviation in percent over the number of processes (0 to 1400), for 4, 8, and 16 processors]

Benchmark Description – DB2 Database BI Workload
 DB2 Query collection (IBM internal)
– Ten complex database warehouse queries
● Measured in high-hit / warm state
– Provided by IBM DB2 development (IBM Toronto lab)
 Configuration
– 128 GiB of main memory
– 16 processors
– No I/O constraint
– DB2 10.1 for Linux on System z

DB2 Database Complex BI Queries
 30 percent more throughput comparing zEC12 to z196
 Close to a factor of two more throughput comparing zEC12 to z10

[Chart: overall run time in seconds per testcase (lower is better), for z10, z196, and zEC12]

Benchmark Description – SysBench
 Scalability benchmark SysBench
– SysBench is a multi-threaded benchmark tool for, among other things, OLTP database loads
– Can be run read-only and read-write
– Clients can connect locally or via network to the database
– Database level and tuning is important
● We use Postgres 9.2.2 with a configuration tuned for this workload in our test
– High/low hit cases resemble different real-world setups with high or low cache hit ratios
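
A hedged sketch (assumption: not the actual harness used here) of what an OLTP-style read-only client against a tuned PostgreSQL instance looks like in C with libpq. Connection string and table name are placeholders; compile with -lpq.

    /* Sketch: point-selects by random primary key, as in a read-only
     * OLTP path. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    int main(void)
    {
        PGconn *conn = PQconnectdb("host=localhost dbname=sbtest"); /* placeholder */
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
            return 1;
        }

        for (int i = 0; i < 10000; i++) {
            char query[128];
            snprintf(query, sizeof query,
                     "SELECT c FROM sbtest1 WHERE id = %d",   /* assumed table */
                     rand() % 1000000 + 1);
            PGresult *res = PQexec(conn, query);
            if (PQresultStatus(res) != PGRES_TUPLES_OK) {
                fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
                PQclear(res);
                break;
            }
            PQclear(res);
        }

        PQfinish(conn);
        return 0;
    }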

Benchmark Description – SysBench (cont.)
 Configuration
– Scaling – read-only load with 2, 8, 16 processors, 8 GiB memory, 4 GiB DB (high-hit)
– Scaling Net – read-only load with 2, 8, 16 processors, 8 GiB memory, 4 GiB DB (high-hit)
– Scaling FICON / FCP high hit ratio – read-write load with 8 processors, 8 GiB memory, 4 GiB DB
● Read-write loads still need to maintain the transaction log, so I/O is still important despite DB < memory
– Scaling FICON / FCP low hit ratio – read-write load with 8 processors, 4 GiB memory, 64 GiB DB
● This is also I/O bound to get the data into cache
– All setups use
● HyperPAV (FICON) / multipathing (FCP)
● Disks spread over the storage server as recommended, plus storage pool striping
● An extra set of disks for the WAL (transaction log)

SysBench – OLTP Read-Write Low-Hit SCSI Test
 Overall throughput increased by 35% after ramp-up
 About 35% less processor consumption
 Much lower latencies

[Charts: relative throughput and processor consumption savings per transaction in %, z196 versus zEC12 (OLTPReadWriteLowHit_scsi), over 8 to 256 threads]

Benchmark Description – Disk I/O
 Flexible I/O Testing Tool (FIO)
– Benchmarking, hardware verification, and stressing of I/O devices
– Open Source (GPLv2)
– Easy to customize to individual needs
● Provides information regarding throughput, latencies, system utilization and much more
 Configuration
– 8 processors
– 512 MB main memory
– z196 connected to a DS8800 / zEC12 connected to a DS8870
– FICON Express8S
– 64 single disks, each in FICON and SCSI
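
As a rough sketch (a simplified, synchronous stand-in for the fio jobs, which used asynchronous I/O): direct random 4 KiB reads with O_DIRECT, which bypasses the page cache and therefore requires block-aligned buffers and offsets. The device path is a placeholder.

    /* Sketch: direct random reads, reporting IOPS and bandwidth. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <time.h>

    int main(void)
    {
        enum { BS = 4096, N = 100000 };
        const off_t span = 1024L * 1024 * 1024;     /* test area: 1 GiB */

        int fd = open("/dev/EXAMPLE", O_RDONLY | O_DIRECT);  /* placeholder */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, BS, BS)) return 1; /* O_DIRECT needs alignment */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            off_t off = ((off_t)rand() % (span / BS)) * BS;  /* random aligned block */
            if (pread(fd, buf, BS, off) != BS) { perror("pread"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f IOPS, %.1f MiB/s\n", N / s, N * (double)BS / s / (1 << 20));
        free(buf);
        close(fd);
        return 0;
    }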

FIO – Throughput Random Read DIO ASYNC
 The z196 was connected to a DS8800 and the zEC12 to a DS8870, so we expect increased throughput for the newer hardware
– Bigger caches
– Faster disks
 The throughput charts are intended to show that the same amount of data, or even more, is transferred per period of time on the zEC12 / DS8870 combination

[Charts: random read throughput over 1 to 64 jobs – FICON (single disks, HyperPAV, DIO, ASYNCIO 8) and SCSI (multipath multibus, LVM, DIO, ASYNCIO 8), z196 versus zEC12]

FIO – Processor Consumption Random Read DIO ASYNC
 zEC12 needs much less processor time to read the same amount of data using a random access pattern
– Savings in the FICON measurement: 24% to 28%
– Savings in the SCSI measurement: 29% to 34%

[Charts: total processor time per MB (lower is better) over 1 to 64 jobs – FICON and SCSI setups as above, z196 versus zEC12]

FIO – Processor Consumption Random Write DIO ASYNC
 zEC12 needs much less processor time to write the same amount of data using a random access pattern
– Savings in the FICON measurement: 24% to 29%
– Savings in the SCSI measurement: 28% to 31%

[Charts: total processor time per MB (lower is better) over 1 to 64 jobs – FICON and SCSI setups as above, z196 versus zEC12]

Benchmark Description – Virtualization
 Industry standard virtualization benchmark
– Uses a mix of real world applications in a virtualized server environment
● Java application server with DB server connection
● Web server with infrastructure server connection
● Mail server
● Uses many tiles consisting of six servers each
– Measured components in a customer-like scenario:
● Virtualization: z/VM
● Virtualized Linux distribution
● Several application / middleware software packages
● Hardware
– Allows comparison of virtualized environments
● Usage patterns of single servers are pre-defined
● Servers are not to be run at utilization limits
● QoS is guaranteed for valid runs
– Biggest and most challenging environment our performance team ever used
 Configuration
– z/VM 6.2 uses 16 physical processors and 80 GiB of memory (one book)
– Only four tiles (lightweight setup), using 4 x 6 = 24 virtual processors and 4 x 15.25 GiB = 61 GiB
– z196 connected to a DS8800 / zEC12 connected to a DS8870
– z/VM VSwitch, OSA-Express4S, FICON Express8S

Virtualization Industry Standard Benchmark
 Reported benchmark result is essentially the same on zEC12 and z196
– The overall result is calculated from the results of the several server workloads, as an average over four tiles
 Savings in processor usage when running on zEC12
– Savings in virtualization
● Real processor usage (used by z/VM)
● Virtual processor usage (accounted for by the several virtual Linux guests)
● Total processor usage (real + virtual)

[Charts: processor time, real and virtual part, z196 versus zEC12; and total processor time savings per server type (Java-App, DB2, Poll, File-NFS, Mail, Web SUT servers)]

Summary of Measurements zEC12 versus z196
 Tremendous performance gains
– Performance improvements seen in almost all areas compared
– Often combined with reduced processor consumption

Area | Benchmark | Throughput | CPU consumption savings
Kernel functions | lmbench | ~45% (up to 250%), most benefit at L3 and L4 | n/a
Scaling CPU and processes | dbench | 40% to 70% | n/a
CPU intense | Industry standard benchmark | 30% | n/a
CPU and memory intense | reaim (CPU) | 31% | 20% to 25%
Java | Industry standard benchmark | 35% | n/a
File server | reaim (fserver) | 25% to 30% | 22% to 50%
Database | reaim (new-DB) | 33% to 43% | 25% to 30%
Database OLTP | sysbench | 35% | 30%
Large DB2 | DB2 BI | 30% | n/a
Networking OSA Express4s | uperf | 5% to 50% | 25% to 50%
Networking HiperSockets | uperf | 30% to 70% | 25% to 50%
Disk I/O FICON Express8s | FIO | flat | 25% to 35%
Virtualization | Industry standard benchmark | flat | 33% to 68%

Questions?
 Further information is located at
– Linux on System z – Tuning hints and tips:
http://www.ibm.com/developerworks/linux/linux390/perf/index.html
– Live Virtual Classes for z/VM and Linux:
http://www.vm.ibm.com/education/lvc/

Mario Held
Linux on System z System Software Performance Engineer
IBM Deutschland Research & Development
Schoenaicher Strasse 220
71032 Boeblingen, Germany
Phone: +49 (0)7031-16-4257
Email: [email protected]

© 2013 IBM Corporation