Re: Hyperthreading vs. SMP

From: Anne & Lynn Wheeler (lynn_at_garlic.com)
Date: 12/13/03


Date: Sat, 13 Dec 2003 02:15:55 GMT

wb <dead_email@nospam.com> writes:
> How is memory contention (cache coherence) maintained
> with these hyperthreading machines ? Does it require an
> external memory agent ? In a SMP or NUMA, a memory
> controller (MIP's called it an agent ) ensured that memory
> integrity was kept. Can each virtual process instance be
> making memory updates ?

hyperthreading just uses more than one instruction stream, typically
in an already superscalar processor ... sharing the same cache.

the superscalar processor has multiple instructions in flight already
... one of the purposes of superscalar is to compensate for cache
misses ... other instructions can proceed in parallel when one
instruction is stalled because of a cache miss. The superscalar
processor may also have speculative execution when conditional
branches are encountered ... i.e. assume that the branch goes one
way ... and if it turns out not to ... back out all the instructions
executed on the wrong path.

one of the first such efforts was a dual i-stream design for the
370/195 (some 30 years ago). the 195 had a 64-instruction pipeline ...
but w/o support for speculative execution ... so branches in the
instruction stream drained the pipeline. except for some specialized
codes, the 195s tended to run at half (or less) of theoretical
throughput because of the large number of conditional branches
commonly found in standard codes. the dual i-stream project defined
two instruction streams, a duplicate set of registers and a red/black
bit flag tagging each operation in the pipeline (indicating which
instruction stream the operation was associated with).
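
A rough sketch of the tagging idea, as a toy model only: it is not the
actual 195 design, and the names (run_interleaved, the "red"/"black"
keys, the (register, delta) operation format) are made up for
illustration. Operations from two streams share one pipeline, each
carrying a tag, and the tag selects which of the duplicate register
sets the operation touches, so the streams never disturb each other's
state.

```python
def run_interleaved(red_ops, black_ops):
    """Interleave two instruction streams through one shared 'pipeline'.

    Each op is a hypothetical (register_name, delta) pair meaning
    'add delta to that register'. The tag picks the register set.
    """
    regs = {"red": {}, "black": {}}  # duplicate register sets, one per stream

    # Build the shared pipeline: alternate streams, tagging every operation.
    pipeline = []
    for i in range(max(len(red_ops), len(black_ops))):
        if i < len(red_ops):
            pipeline.append(("red", red_ops[i]))
        if i < len(black_ops):
            pipeline.append(("black", black_ops[i]))

    # Execute: the tag alone decides whose registers are updated.
    for tag, (reg, delta) in pipeline:
        bank = regs[tag]
        bank[reg] = bank.get(reg, 0) + delta
    return regs


if __name__ == "__main__":
    # Both streams use a register called "r1", but each has its own copy.
    print(run_interleaved([("r1", 1), ("r1", 2)], [("r1", 10)]))
```

Both streams can name the same register without conflict because the
tag, not the register name, determines which copy is used.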

hyperthreading, in principle, supports more than one instruction
stream concurrently within an already complex superscalar processor
... using a common processor cache.

in such configurations ... two or more physical/logical processors
(instruction streams) sharing the same cache won't have a cache
consistency problem (although they may have some serialization
issues). it is when there are multiple caches that the issue of
memory/cache consistency arises.

A possible configuration is four physical processors with two physical
caches (where each physical cache supports two physical processors).
There are cache consistency issues involved in coordination between
the two caches (it is not between the four processors but between the
caches). If you add hyperthreading to each of the four physical
processors (say it now appears as eight logical instruction streams),
that change is possibly totally transparent to the cache operation and
the coherency operation between the two caches.
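
The configuration above can be sketched as a toy write-invalidate
model. Everything here is an assumption for illustration (the Cache
class, the write-through policy, the address 0x100); real coherence
protocols are far more involved. The point it tries to show is that
coherence traffic flows between the two caches, so doubling the number
of instruction streams behind each cache does not change it.

```python
class Cache:
    """Toy write-through cache that invalidates copies held by peer caches."""

    def __init__(self, name):
        self.name = name
        self.lines = {}              # address -> value for lines held here
        self.peers = []              # other caches to stay coherent with
        self.invalidations_sent = 0

    def read(self, memory, addr):
        if addr not in self.lines:   # miss: fill the line from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, memory, addr, value):
        for peer in self.peers:      # coherence is cache-to-cache
            if addr in peer.lines:
                del peer.lines[addr]
                self.invalidations_sent += 1
        self.lines[addr] = value
        memory[addr] = value         # write-through keeps memory current


def demo(streams_per_cache):
    """Two caches, each shared by `streams_per_cache` instruction streams."""
    memory = {0x100: 0}
    a, b = Cache("A"), Cache("B")
    a.peers, b.peers = [b], [a]
    # Map each instruction stream to the cache it sits behind.
    stream_cache = {}
    for cache in (a, b):
        for i in range(streams_per_cache):
            stream_cache[f"{cache.name}{i}"] = cache

    for cache in stream_cache.values():      # every stream reads the word
        cache.read(memory, 0x100)
    stream_cache["A0"].write(memory, 0x100, 42)   # one stream updates it
    values = [c.read(memory, 0x100) for c in stream_cache.values()]
    return values, a.invalidations_sent + b.invalidations_sent


if __name__ == "__main__":
    print(demo(2))   # four streams: two per cache
    print(demo(4))   # eight streams: hyperthreading added to each processor
```

In this model every stream sees the updated value in both cases, and
the number of invalidations is the same for four streams as for eight,
since streams sharing a cache need no coherence action among
themselves.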

serialization of processors is typically done with atomic
operations like compare&swap ... but that is somewhat orthogonal to
the coherency implementation between caches (which can be totally
independent of the number/kind of instruction streams supported).
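
A minimal sketch of the usual compare&swap retry loop. Python has no
hardware compare&swap, so the Word class below emulates the atomicity
with a lock; that lock stands in for what the hardware instruction
guarantees, and the names (Word, atomic_add, run) are hypothetical.
What matters is the pattern: load the old value, compute the new one,
and retry if some other instruction stream changed the word in
between.

```python
import threading


class Word:
    """One memory word with an emulated atomic compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # stand-in for hardware atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: store `new` only if the word still holds `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_add(word, delta):
    # Retry loop: if another stream updated the word first, start over.
    while True:
        old = word.load()
        if word.compare_and_swap(old, old + delta):
            return old + delta


def run(n_threads=4, n_iters=1000):
    counter = Word(0)

    def worker():
        for _ in range(n_iters):
            atomic_add(counter, 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.load()


if __name__ == "__main__":
    print(run())   # no increments are lost despite concurrent updates
```

The loop is the same whether the competing instruction streams share
one cache or sit behind different caches; keeping the word consistent
across caches is the job of the coherency mechanism underneath.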

recent posting from somewhat related thread in comp.arch
http://www.garlic.com/~lynn/2003p.html#1 An entirely new proprietrary hardware strategy

misc. past threads mentioning 370/195 dual i-stream work from 30 years
ago:
http://www.garlic.com/~lynn/94.html#38 IBM 370/195
http://www.garlic.com/~lynn/99.html#73 The Chronology
http://www.garlic.com/~lynn/99.html#97 Power4 = 2 cpu's on die?
http://www.garlic.com/~lynn/2000g.html#15 360/370 instruction cycle time
http://www.garlic.com/~lynn/2001j.html#27 Pentium 4 SMT "Hyperthreading"
http://www.garlic.com/~lynn/2001n.html#63 Hyper-Threading Technology - Intel information.
http://www.garlic.com/~lynn/2002g.html#70 Pipelining in the past
http://www.garlic.com/~lynn/2002g.html#76 Pipelining in the past
http://www.garlic.com/~lynn/2003l.html#48 IBM Manuals from the 1940's and 1950's
http://www.garlic.com/~lynn/2003m.html#60 S/360 undocumented instructions?

lots of past mentions of compare and swap (again from 30 some years ago):
http://www.garlic.com/~lynn/93.html#0 360/67, was Re: IBM's Project F/S ?
http://www.garlic.com/~lynn/93.html#14 S/360 addressing
http://www.garlic.com/~lynn/94.html#28 370 ECPS VM microcode assist
http://www.garlic.com/~lynn/2000g.html#16 360/370 instruction cycle time
http://www.garlic.com/~lynn/2001d.html#42 IBM was/is: Imitation...
http://www.garlic.com/~lynn/2001e.html#73 CS instruction, when introducted ?
http://www.garlic.com/~lynn/2001f.html#41 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001f.html#61 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001f.html#69 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001f.html#70 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001f.html#73 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001f.html#74 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001f.html#75 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001f.html#76 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001g.html#4 Extended memory error recovery
http://www.garlic.com/~lynn/2001g.html#8 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001g.html#9 Test and Set (TS) vs Compare and Swap (CS)
http://www.garlic.com/~lynn/2001i.html#2 Most complex instructions (was Re: IBM 9020 FAA/ATC Systems from 1960's)
http://www.garlic.com/~lynn/2001i.html#34 IBM OS Timeline?
http://www.garlic.com/~lynn/2001k.html#8 Minimalist design (was Re: Parity - why even or odd)
http://www.garlic.com/~lynn/2001k.html#65 SMP idea for the future
http://www.garlic.com/~lynn/2001k.html#67 SMP idea for the future
http://www.garlic.com/~lynn/2001n.html#42 Cache coherence [was Re: IBM POWER4 ...]
http://www.garlic.com/~lynn/2001n.html#43 IBM 1800
http://www.garlic.com/~lynn/2002.html#52 Microcode?
http://www.garlic.com/~lynn/2002c.html#9 IBM Doesn't Make Small MP's Anymore
http://www.garlic.com/~lynn/2002f.html#13 Hardware glitches, designed in and otherwise
http://www.garlic.com/~lynn/2002h.html#45 Future architecture [was Re: Future micro-architecture: ]
http://www.garlic.com/~lynn/2002l.html#58 Spin Loop?
http://www.garlic.com/~lynn/2002l.html#59 Spin Loop?
http://www.garlic.com/~lynn/2002l.html#69 The problem with installable operating systems
http://www.garlic.com/~lynn/2003.html#12 cost of crossing kernel/user boundary
http://www.garlic.com/~lynn/2003.html#18 cost of crossing kernel/user boundary
http://www.garlic.com/~lynn/2003b.html#20 Card Columns
http://www.garlic.com/~lynn/2003c.html#75 The relational model and relational algebra - why did SQL become the industry standard?
http://www.garlic.com/~lynn/2003c.html#78 The relational model and relational algebra - why did SQL become the industry standard?
http://www.garlic.com/~lynn/2003d.html#17 CA-RAMIS
http://www.garlic.com/~lynn/2003e.html#67 The Pentium 4 - RIP?
http://www.garlic.com/~lynn/2003g.html#12 Page Table - per OS/Process
http://www.garlic.com/~lynn/2003g.html#15 Disk capacity and backup solutions
http://www.garlic.com/~lynn/2003g.html#30 One Processor is bad?
http://www.garlic.com/~lynn/2003h.html#5 IBM says AMD dead in 5yrs ... -- Microsoft Monopoly vs. IBM
http://www.garlic.com/~lynn/2003h.html#19 Why did TCP become popular ?
http://www.garlic.com/~lynn/2003h.html#20 UT200 (CDC RJE) Software for TOPS-10?
http://www.garlic.com/~lynn/2003j.html#58 atomic memory-operation question
http://www.garlic.com/~lynn/2003m.html#29 SR 15,15
http://www.garlic.com/~lynn/2003o.html#32 who invented the "popup" ?

-- 
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/ 
Internet trivia 20th anv http://www.garlic.com/~lynn/rfcietff.htm
